# Tutorial 9: Regression Continued

### Lecture and Tutorial Learning Goals:

After completing this week's lecture and assignment work, you will be able to:

* Recognize situations where a simple regression analysis would be appropriate for making predictions.
* Explain the $k$-nearest neighbour regression algorithm and describe how it differs from $k$-nn classification.
* Interpret the output of a $k$-nn regression.
* In a dataset with two variables, perform $k$-nearest neighbour regression in Python using `scikit-learn` to predict the values for a test dataset.
* Execute cross-validation in Python to choose the number of neighbours.
* Using Python, evaluate $k$-nn regression prediction accuracy using a test data set and an appropriate metric (*e.g.*, root mean square prediction error).
* In a dataset with > 2 variables, perform $k$-nn regression in Python using `scikit-learn` to predict the values for a test dataset.
* In the context of $k$-nn regression, compare and contrast goodness of fit and prediction properties (namely RMSE vs RMSPE).
* Describe advantages and disadvantages of the $k$-nearest neighbour regression approach.
* Perform ordinary least squares regression in Python using `scikit-learn` to predict the values for a test dataset.
* Compare and contrast predictions obtained from $k$-nearest neighbour regression to those obtained using simple ordinary least squares regression from the same dataset.

This tutorial covers parts of [Chapter 8](https://python.datasciencebook.ca/regression2) of the online textbook. You should read this chapter before attempting this assignment. Any place you see `___`, you must fill in the function, variable, or data to complete the code. Replace the `raise NotImplementedError` with your completed code and answers, then run the cell.

In [None]:
### Run this cell before continuing.
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Simplify working with large datasets in Altair
alt.data_transformers.enable('vegafusion')

# Output dataframes instead of arrays
set_config(transform_output="pandas")

## 1. Predicting credit card balance

<img src='https://media.giphy.com/media/LCdPNT81vlv3y/giphy-downsized-large.gif' align="left" width='400'>

Source: https://media.giphy.com/media/LCdPNT81vlv3y/giphy-downsized-large.gif

In this assignment, we will work with a simulated data set containing information that we can use to build a model to predict customer credit card balance. A bank might use such information to predict which customers would be the most profitable to lend to (for example, customers who carry a balance but do not default). Specifically, we wish to build a model to predict credit card balance (`Balance` column) based on income (`Income` column) and credit rating (`Rating` column).

We access this data set by reading it from the `data` folder. The unnamed column in the data is the `ID` of each of the customers, so we rename it as such.

In [None]:
credit_original = (
    pd.read_csv("data/Credit.csv")
    .rename(columns={"Unnamed: 0": "ID"})
)
credit_original.head()

**Question 1.1**
<br> {points: 1}

Select only the columns of data we are interested in using for our prediction (both the predictors and the response variable). 

*Note: We could alternatively just leave these variables in and specify them as predictors and response in data splitting steps. But for this worksheet, let's select the relevant columns first.*

*Assign the modified data frame to a variable named `credit`.*
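
As a reminder of the general pattern, here is a minimal sketch of selecting a subset of columns from a `pandas` data frame. The data frame `toy` and its column names are hypothetical and are not part of the credit data.

```python
import pandas as pd

# Hypothetical toy data frame; the column names are made up for illustration.
toy = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "height": [1.6, 1.7, 1.8],
        "weight": [55, 70, 82],
        "city": ["A", "B", "C"],
    }
)

# Passing a list of column names inside [] keeps only those columns,
# in the order given in the list.
toy_subset = toy[["height", "weight"]]
print(toy_subset.columns.tolist())
```

The same indexing pattern works for any list of existing column names.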

In [None]:
# your code here
raise NotImplementedError
credit

In [None]:
from hashlib import sha1
assert sha1(str(type(credit)).encode("utf-8")+b"79aa").hexdigest() == "7f14a0661e8152ac645fcf739950a148b1e3e8a5", "type of type(credit) is not correct"

assert sha1(str(type(credit.shape)).encode("utf-8")+b"79ab").hexdigest() == "682686ad11993a0962e231173ace4f5c7eebfe63", "type of credit.shape is not tuple. credit.shape should be a tuple"
assert sha1(str(len(credit.shape)).encode("utf-8")+b"79ab").hexdigest() == "4716fbed56b644f0ee319a6180fae97bdb2eeb92", "length of credit.shape is not correct"
assert sha1(str(sorted(map(str, credit.shape))).encode("utf-8")+b"79ab").hexdigest() == "26274d7832f33fc97b0f5ac5ea1184296ac41d14", "values of credit.shape are not correct"
assert sha1(str(credit.shape).encode("utf-8")+b"79ab").hexdigest() == "bdaf6b488b757ee5f1864a82462e09aa5e2e28e3", "order of elements of credit.shape is not correct"

assert sha1(str(type("Income" in credit.columns)).encode("utf-8")+b"79ac").hexdigest() == "0b6bb82887a64653db56f34b09cb2fbd519e8989", "type of \"Income\" in credit.columns is not bool. \"Income\" in credit.columns should be a bool"
assert sha1(str("Income" in credit.columns).encode("utf-8")+b"79ac").hexdigest() == "0aa73defc5c176620f2934eeca054b54b493f9de", "boolean value of \"Income\" in credit.columns is not correct"

assert sha1(str(type("Balance" in credit.columns)).encode("utf-8")+b"79ad").hexdigest() == "e62a2a57b8f3cf0310f15c492020b2c35b26b296", "type of \"Balance\" in credit.columns is not bool. \"Balance\" in credit.columns should be a bool"
assert sha1(str("Balance" in credit.columns).encode("utf-8")+b"79ad").hexdigest() == "1ab5380f25da06c827873bd595106711b2109b76", "boolean value of \"Balance\" in credit.columns is not correct"

assert sha1(str(type("Rating" in credit.columns)).encode("utf-8")+b"79ae").hexdigest() == "bc931b18e206834bc66e1ba54d5bc3d2cdf3e97a", "type of \"Rating\" in credit.columns is not bool. \"Rating\" in credit.columns should be a bool"
assert sha1(str("Rating" in credit.columns).encode("utf-8")+b"79ae").hexdigest() == "f74105efe6125682f426c2183b5fbff2e3990b72", "boolean value of \"Rating\" in credit.columns is not correct"

assert sha1(str(type("X1" not in credit.columns)).encode("utf-8")+b"79af").hexdigest() == "3ebb9d798a931bff247bf4c7ef1617c1923ebdc3", "type of \"X1\" not in credit.columns is not bool. \"X1\" not in credit.columns should be a bool"
assert sha1(str("X1" not in credit.columns).encode("utf-8")+b"79af").hexdigest() == "cb4efad96e2bbd1a901ecee0e9be1d3a6ab446ca", "boolean value of \"X1\" not in credit.columns is not correct"

print('Success!')

**Question 1.2**
<br> {points: 1}

**Before** we perform exploratory data analysis, we should create our training and testing data sets. First, split the `credit` data set, using 60% of the data for training and the rest for testing. Please set the `random_state` to 2000.

Then separate the predictors from the response: for `credit_training`, store the variable we want to predict as `y_train` and the predictors as `X_train`; for `credit_testing`, store the variable we want to predict as `y_test` and the predictors as `X_test`.

*Assign your answers to variables named `credit_training`, `credit_testing`, `X_train`, `y_train`, `X_test`, and `y_test`.*
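
The splitting pattern can be sketched on a small made-up data frame. Everything below (the data frame `toy`, its columns `x1`, `x2`, `y`, the 80% training fraction, and the `toy_` variable names) is an assumption chosen for illustration, not the values required by this question.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, two predictors (x1, x2) and one response (y).
toy = pd.DataFrame(
    {
        "x1": range(10),
        "x2": range(10, 20),
        "y": range(100, 110),
    }
)

# train_size is the fraction of rows placed in the training set;
# random_state makes the random shuffle reproducible.
toy_training, toy_testing = train_test_split(toy, train_size=0.8, random_state=1)

# Predictors form a data frame (two columns); the response is a single column.
toy_X_train = toy_training[["x1", "x2"]]
toy_y_train = toy_training["y"]
toy_X_test = toy_testing[["x1", "x2"]]
toy_y_test = toy_testing["y"]

print(toy_training.shape, toy_testing.shape)
```

Selecting the response with single brackets gives a `pandas` Series, which is why it has a `.name` attribute, while double brackets give a data frame.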

In [None]:
# _____, _____ = train_test_split(
#    ____, ____, random_state=2000           # do not change the random_state
# )

# X_train = ___
# y_train = ___

# X_test = ___
# y_test = ___

# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(credit_training is None)).encode("utf-8")+b"5b6b6").hexdigest() == "0643e7828f98abdd0904911b71990c43510b20b3", "type of credit_training is None is not bool. credit_training is None should be a bool"
assert sha1(str(credit_training is None).encode("utf-8")+b"5b6b6").hexdigest() == "6c58811182bd3cd2a15a93d721a55836593e2c65", "boolean value of credit_training is None is not correct"

assert sha1(str(type(credit_training.shape)).encode("utf-8")+b"5b6b7").hexdigest() == "906073ae13ed5cdb8e2a80b6be53ba8adecf7492", "type of credit_training.shape is not tuple. credit_training.shape should be a tuple"
assert sha1(str(len(credit_training.shape)).encode("utf-8")+b"5b6b7").hexdigest() == "31d69b19db724af89ae14661ede5a982d4116e49", "length of credit_training.shape is not correct"
assert sha1(str(sorted(map(str, credit_training.shape))).encode("utf-8")+b"5b6b7").hexdigest() == "c8e5bc653aa0be47c05ba7fade2d12f5c388e9c1", "values of credit_training.shape are not correct"
assert sha1(str(credit_training.shape).encode("utf-8")+b"5b6b7").hexdigest() == "683450f01a1502c4e5a80ce49cb5c6c7b204eee3", "order of elements of credit_training.shape is not correct"

assert sha1(str(type(sum(credit_training.Balance))).encode("utf-8")+b"5b6b8").hexdigest() == "95633c0b56cc2dbe0bad4fa8ec62465684ab8f31", "type of sum(credit_training.Balance) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(sum(credit_training.Balance)).encode("utf-8")+b"5b6b8").hexdigest() == "c0ea6a7dfd046d3e68ef426e756f20c264899bed", "value of sum(credit_training.Balance) is not correct"

assert sha1(str(type(sum(credit_training.Income))).encode("utf-8")+b"5b6b9").hexdigest() == "bf48b4c5d6d0692f3e57efdace6e0a38e2f9cd18", "type of sum(credit_training.Income) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(credit_training.Income), 2)).encode("utf-8")+b"5b6b9").hexdigest() == "16daef4e51ec9fc615a9dcbb532486a3095ac981", "value of sum(credit_training.Income) is not correct (rounded to 2 decimal places)"

assert sha1(str(type(sum(credit_training.Rating))).encode("utf-8")+b"5b6ba").hexdigest() == "128f575f6cfa2ad3035587d92a4c291b74de12e1", "type of sum(credit_training.Rating) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(sum(credit_training.Rating)).encode("utf-8")+b"5b6ba").hexdigest() == "830d061e041f5b2b8350834515c123eed78b0e4e", "value of sum(credit_training.Rating) is not correct"

assert sha1(str(type(credit_testing is None)).encode("utf-8")+b"5b6bb").hexdigest() == "03698b90613c3de9e6d5b5a783c3a827fe7c5eb2", "type of credit_testing is None is not bool. credit_testing is None should be a bool"
assert sha1(str(credit_testing is None).encode("utf-8")+b"5b6bb").hexdigest() == "c333b9d73c0bff7e9066f164cfc9c0329050e9eb", "boolean value of credit_testing is None is not correct"

assert sha1(str(type(credit_testing.shape)).encode("utf-8")+b"5b6bc").hexdigest() == "be4374999948f978c6d38168ad1e02033fd7b02b", "type of credit_testing.shape is not tuple. credit_testing.shape should be a tuple"
assert sha1(str(len(credit_testing.shape)).encode("utf-8")+b"5b6bc").hexdigest() == "811fee65d3b546a8381b938365999c8493caa903", "length of credit_testing.shape is not correct"
assert sha1(str(sorted(map(str, credit_testing.shape))).encode("utf-8")+b"5b6bc").hexdigest() == "28186a462c5ea6dac71f24f80e4141a6d71bb705", "values of credit_testing.shape are not correct"
assert sha1(str(credit_testing.shape).encode("utf-8")+b"5b6bc").hexdigest() == "c82fb4bdbfac99e205051b2bffb13c29da2d1a08", "order of elements of credit_testing.shape is not correct"

assert sha1(str(type(sum(credit_testing.Balance))).encode("utf-8")+b"5b6bd").hexdigest() == "bcffdf806aec8c156f06d6b3f5db826937faa462", "type of sum(credit_testing.Balance) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(sum(credit_testing.Balance)).encode("utf-8")+b"5b6bd").hexdigest() == "028ad075f3bd0125cd8fc8ed5ac5156120eec122", "value of sum(credit_testing.Balance) is not correct"

assert sha1(str(type(sum(credit_testing.Income))).encode("utf-8")+b"5b6be").hexdigest() == "c14a7d0ca215c88565b1708081f5531e2740139d", "type of sum(credit_testing.Income) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(credit_testing.Income), 2)).encode("utf-8")+b"5b6be").hexdigest() == "5bd6c6195a0ead22b877be2ad0a4fde3baf34641", "value of sum(credit_testing.Income) is not correct (rounded to 2 decimal places)"

assert sha1(str(type(sum(credit_testing.Rating))).encode("utf-8")+b"5b6bf").hexdigest() == "85bb188fbe826269d3c32d3b378e9ec2e9fd4e72", "type of sum(credit_testing.Rating) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(sum(credit_testing.Rating)).encode("utf-8")+b"5b6bf").hexdigest() == "1dce6998bfd32fba33d9cd4a512f26f17ba81dad", "value of sum(credit_testing.Rating) is not correct"

assert sha1(str(type(X_train.columns.values)).encode("utf-8")+b"5b6c0").hexdigest() == "b149934c168d766f04cc4d425bf6ac2fa2f00c91", "type of X_train.columns.values is not correct"
assert sha1(str(X_train.columns.values).encode("utf-8")+b"5b6c0").hexdigest() == "245f0d4ffc032e6a765e687e316342749ef4f015", "value of X_train.columns.values is not correct"

assert sha1(str(type(X_train.shape)).encode("utf-8")+b"5b6c1").hexdigest() == "563045ad0714b38009b5ec6ea852d6fd49c2b951", "type of X_train.shape is not tuple. X_train.shape should be a tuple"
assert sha1(str(len(X_train.shape)).encode("utf-8")+b"5b6c1").hexdigest() == "41b16d15ed633c26254fca98382205bb8f3dfed8", "length of X_train.shape is not correct"
assert sha1(str(sorted(map(str, X_train.shape))).encode("utf-8")+b"5b6c1").hexdigest() == "8a9bbbd1284f43218d04cc4cedb4652564bb0047", "values of X_train.shape are not correct"
assert sha1(str(X_train.shape).encode("utf-8")+b"5b6c1").hexdigest() == "c2c2761afc1e6d04d7839fb3e129c9ed0d827644", "order of elements of X_train.shape is not correct"

assert sha1(str(type(y_train.name)).encode("utf-8")+b"5b6c2").hexdigest() == "ac801b80c84560d52acfb00645e5ea2e977a6873", "type of y_train.name is not str. y_train.name should be an str"
assert sha1(str(len(y_train.name)).encode("utf-8")+b"5b6c2").hexdigest() == "3a752ec4910d27c15a979f9da8393694241290e6", "length of y_train.name is not correct"
assert sha1(str(y_train.name.lower()).encode("utf-8")+b"5b6c2").hexdigest() == "da28db7742b68afc58a1f064d9e79962eab2e14a", "value of y_train.name is not correct"
assert sha1(str(y_train.name).encode("utf-8")+b"5b6c2").hexdigest() == "38d073666728f4d19e8c9f06fe6cecf78677a9bc", "correct string value of y_train.name but incorrect case of letters"

assert sha1(str(type(y_train.shape)).encode("utf-8")+b"5b6c3").hexdigest() == "832727290287c713a26a814ea8d042afd5ed347e", "type of y_train.shape is not tuple. y_train.shape should be a tuple"
assert sha1(str(len(y_train.shape)).encode("utf-8")+b"5b6c3").hexdigest() == "657261ad0918701cddb18f7af595dea4150597cf", "length of y_train.shape is not correct"
assert sha1(str(sorted(map(str, y_train.shape))).encode("utf-8")+b"5b6c3").hexdigest() == "b28f94cbfc8169d863cb84ab255d38818d45167e", "values of y_train.shape are not correct"
assert sha1(str(y_train.shape).encode("utf-8")+b"5b6c3").hexdigest() == "a7aee2bab4046c92c592f65c0b7ca5c1fb6caa76", "order of elements of y_train.shape is not correct"

assert sha1(str(type(X_test.columns.values)).encode("utf-8")+b"5b6c4").hexdigest() == "207ce30876cd4414bb6b8083450032e49684c771", "type of X_test.columns.values is not correct"
assert sha1(str(X_test.columns.values).encode("utf-8")+b"5b6c4").hexdigest() == "c20cfee816e588134da866e4ab3d1c8018dad053", "value of X_test.columns.values is not correct"

assert sha1(str(type(X_test.shape)).encode("utf-8")+b"5b6c5").hexdigest() == "39e27560d78cc9089778d7b38e06986b5046af11", "type of X_test.shape is not tuple. X_test.shape should be a tuple"
assert sha1(str(len(X_test.shape)).encode("utf-8")+b"5b6c5").hexdigest() == "c328d2056fda57741caf1a750fd9e683eb49bac4", "length of X_test.shape is not correct"
assert sha1(str(sorted(map(str, X_test.shape))).encode("utf-8")+b"5b6c5").hexdigest() == "b2ab8d8c244d68b931e435c55c2932c8309e70aa", "values of X_test.shape are not correct"
assert sha1(str(X_test.shape).encode("utf-8")+b"5b6c5").hexdigest() == "6c6483bd826440262ce80439f0d79bf473aeaf4b", "order of elements of X_test.shape is not correct"

assert sha1(str(type(y_test.name)).encode("utf-8")+b"5b6c6").hexdigest() == "a74eb7360eb0de9b3cbcf2797eee3738fd6cdc38", "type of y_test.name is not str. y_test.name should be an str"
assert sha1(str(len(y_test.name)).encode("utf-8")+b"5b6c6").hexdigest() == "1f97cc105ff094fd4eec0160ebcb9d1c97d4ba01", "length of y_test.name is not correct"
assert sha1(str(y_test.name.lower()).encode("utf-8")+b"5b6c6").hexdigest() == "4cdc0492dcf9f60d98979d3598fd9a4cb816b172", "value of y_test.name is not correct"
assert sha1(str(y_test.name).encode("utf-8")+b"5b6c6").hexdigest() == "123dc585975579792de5da28ac3a0493f8c74df4", "correct string value of y_test.name but incorrect case of letters"

assert sha1(str(type(y_test.shape)).encode("utf-8")+b"5b6c7").hexdigest() == "2ee6925409d3c57110e7d1bd0b1943cdadd6e78a", "type of y_test.shape is not tuple. y_test.shape should be a tuple"
assert sha1(str(len(y_test.shape)).encode("utf-8")+b"5b6c7").hexdigest() == "280792b3ef063a4aa63280024a9eb9ee4c6bf497", "length of y_test.shape is not correct"
assert sha1(str(sorted(map(str, y_test.shape))).encode("utf-8")+b"5b6c7").hexdigest() == "359bcb738ca5db660eb890b2b70f0b39c901692c", "values of y_test.shape are not correct"
assert sha1(str(y_test.shape).encode("utf-8")+b"5b6c7").hexdigest() == "386094d7bec2a94e337637b4b208e9e5f5d33ca7", "order of elements of y_test.shape is not correct"

print('Success!')

**Question 1.3**
<br> {points: 1}

Using only the observations in the training data set, create a pairplot (also called a "scatter plot matrix") of all the columns we are interested in including in our model. Since we have not covered how to create these in the textbook, we have provided you with most of the code below; you just need to provide a list of the names of the columns we are interested in plotting.

The pairplot contains a scatter plot of each pair of columns that you are plotting, so that you can explore the pairwise relationships between the variables. The diagonal is not of value, as it compares each column against itself. Also note that the plots above and below the diagonal show the same pairs with flipped axes, so you only need to study either the upper or the lower set of plots.
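
As an optional numeric companion to a pairplot, the pairwise correlations can be tabulated with `DataFrame.corr()`. This is a sketch on made-up data (the data frame `toy` and its columns are hypothetical). Correlation only summarizes linear association, so it complements the plots rather than replacing them.

```python
import pandas as pd

# Hypothetical toy data with known relationships between the columns.
toy = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0, 5.0],
        "b": [2.0, 4.0, 6.0, 8.0, 10.0],  # b = 2 * a (perfect positive relationship)
        "c": [5.0, 4.0, 3.0, 2.0, 1.0],   # c = 6 - a (perfect negative relationship)
    }
)

# Pearson correlation for every pair of columns; like the pairplot, the table
# is symmetric and its diagonal compares each column with itself.
toy_corr = toy.corr()
print(toy_corr.round(2))
```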

*Name the plot object `credit_pairplot`.*

In [None]:
# columns_to_plot = ___
# 
# credit_pairplot = alt.Chart(credit_training).mark_point().encode(
#     alt.X(alt.repeat("row"), type="quantitative"),
#     alt.Y(alt.repeat("column"), type="quantitative"),
# ).properties(
#     width=200,
#     height=200
# ).repeat(
#     column=columns_to_plot,
#     row=columns_to_plot
# )

# your code here
raise NotImplementedError
credit_pairplot

In [None]:
from hashlib import sha1
assert sha1(str(type(credit_pairplot is None)).encode("utf-8")+b"28657").hexdigest() == "5bd6f78f6d9d91addb73078cb8b53e51eae3139a", "type of credit_pairplot is None is not bool. credit_pairplot is None should be a bool"
assert sha1(str(credit_pairplot is None).encode("utf-8")+b"28657").hexdigest() == "380e8396744559d6884906ec7e5b7eb9484e840d", "boolean value of credit_pairplot is None is not correct"

assert sha1(str(type(credit_pairplot)).encode("utf-8")+b"28658").hexdigest() == "dee60d2af170b4d15438b0ae2f82bcda2e6af195", "type of type(credit_pairplot) is not correct"

print('Success!')

**Question 1.4**
<br> {points: 1} 

Looking at the pairplot above, which of the following statements is **incorrect**?

A. There is a strong positive relationship between the response variable (`Balance`) and the `Rating` predictor

B. There is a strong positive relationship between the two predictors (`Income` and `Rating`)

C. There is a strong positive relationship between the response variable (`Balance`) and the `Income` predictor

D. None of the above statements are incorrect

*Assign your answer to an object called `answer1_4`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
raise NotImplementedError
answer1_4

In [None]:
from hashlib import sha1
assert sha1(str(type(answer1_4)).encode("utf-8")+b"a7864").hexdigest() == "39bfe7803b3f5f8f25c82c2c27666043b6e1687e", "type of answer1_4 is not str. answer1_4 should be an str"
assert sha1(str(len(answer1_4)).encode("utf-8")+b"a7864").hexdigest() == "2858a1f50abfbec50df9738efe6f22dbcd885178", "length of answer1_4 is not correct"
assert sha1(str(answer1_4.lower()).encode("utf-8")+b"a7864").hexdigest() == "c734019739169b4d02d2fc8e33cce31f76051d69", "value of answer1_4 is not correct"
assert sha1(str(answer1_4).encode("utf-8")+b"a7864").hexdigest() == "9d0659ac142a4dd94e15c705d4112d1f3f2061fe", "correct string value of answer1_4 but incorrect case of letters"

print('Success!')

**Question 1.5**
<br> {points: 1}

Now that we have our training data, we will fit a linear regression model. Create a `LinearRegression` object.

*Assign the object to a variable named `lm`.*

In [None]:
# lm = ___()

# your code here
raise NotImplementedError
lm

In [None]:
from hashlib import sha1
assert sha1(str(type(lm is None)).encode("utf-8")+b"80879").hexdigest() == "a4d135c07965eed6831af05e96c8754fb133d589", "type of lm is None is not bool. lm is None should be a bool"
assert sha1(str(lm is None).encode("utf-8")+b"80879").hexdigest() == "27ec70fc9ebbe0252de51e6fc53ad3755d96e2ae", "boolean value of lm is None is not correct"

assert sha1(str(type(type(lm))).encode("utf-8")+b"8087a").hexdigest() == "0f7397288c89c8da318697b23e16c71c20e0b8c4", "type of type(lm) is not correct"
assert sha1(str(type(lm)).encode("utf-8")+b"8087a").hexdigest() == "41915031a8b1d5e2c453389bf76025614e184e89", "value of type(lm) is not correct"

print('Success!')

**Question 1.6**
<br> {points: 1}

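Now that we have our model object, fit the linear regression model to the training data (`X_train` and `y_train`). Because we are using two predictors, this is a multivariable linear regression rather than a simple one.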

*Assign the fit to an object called `credit_fit`.*
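
To see what a fitted model exposes, here is a minimal sketch on noise-free synthetic data generated from $y = 1 + 2x_1 + 3x_2$. The data and all `toy_` names are assumptions for illustration. Because the data has no noise, the fit should recover the generating slopes and intercept.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data generated from y = 1 + 2*x1 + 3*x2.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame(
    {"x1": rng.uniform(0, 10, 50), "x2": rng.uniform(0, 10, 50)}
)
toy_y = 1 + 2 * toy_X["x1"] + 3 * toy_X["x2"]

toy_lm = LinearRegression()
toy_fit = toy_lm.fit(toy_X, toy_y)  # fit() returns the fitted estimator itself

print(toy_fit.coef_)       # one slope per predictor, in the column order of toy_X
print(toy_fit.intercept_)  # the estimated intercept
```

Note that `coef_` follows the column order of the predictor data frame, which matters when matching each slope to its predictor.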

In [None]:
# your code here
raise NotImplementedError
print(credit_fit.coef_)
print(credit_fit.intercept_)

In [None]:
from hashlib import sha1
assert sha1(str(type(credit_fit is None)).encode("utf-8")+b"551b2").hexdigest() == "f95a1e12a8d41945889865cb18ea74e98b30cacf", "type of credit_fit is None is not bool. credit_fit is None should be a bool"
assert sha1(str(credit_fit is None).encode("utf-8")+b"551b2").hexdigest() == "d1e6f0ae616ab70a649a2758a4697763012d1332", "boolean value of credit_fit is None is not correct"

assert sha1(str(type(type(credit_fit))).encode("utf-8")+b"551b3").hexdigest() == "811d8f5c5f0a23291a92b582f542b68192646cd3", "type of type(credit_fit) is not correct"
assert sha1(str(type(credit_fit)).encode("utf-8")+b"551b3").hexdigest() == "086b9221d5902c90b94fe523b9da7c96b1222d47", "value of type(credit_fit) is not correct"

assert sha1(str(type(round(credit_fit.coef_[0], 2))).encode("utf-8")+b"551b4").hexdigest() == "f6e1c542bc2655ef0512328d63eeeb78b52fdba5", "type of round(credit_fit.coef_[0], 2) is not correct"
assert sha1(str(round(credit_fit.coef_[0], 2)).encode("utf-8")+b"551b4").hexdigest() == "a6815745da23638f47856b3837a9838c612279e1", "value of round(credit_fit.coef_[0], 2) is not correct"

assert sha1(str(type(round(credit_fit.coef_[1], 2))).encode("utf-8")+b"551b5").hexdigest() == "215b2365885129d242de9af49ab3ff982a324a1e", "type of round(credit_fit.coef_[1], 2) is not correct"
assert sha1(str(round(credit_fit.coef_[1], 2)).encode("utf-8")+b"551b5").hexdigest() == "a04c99bc6e71e22558bf1f84dc0a088bbe59d6f3", "value of round(credit_fit.coef_[1], 2) is not correct"

assert sha1(str(type(round(credit_fit.fit_intercept, 2))).encode("utf-8")+b"551b6").hexdigest() == "54099bf4f8e1fb2b265fe7000fc0fb307cb04357", "type of round(credit_fit.fit_intercept, 2) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(round(credit_fit.fit_intercept, 2)).encode("utf-8")+b"551b6").hexdigest() == "2b366d8fdc167ec79fd38b7d3b06b90bd755d13f", "value of round(credit_fit.fit_intercept, 2) is not correct"

assert sha1(str(type(credit_fit.n_features_in_)).encode("utf-8")+b"551b7").hexdigest() == "7da1701547a743887bf350ffd692fa9a06cfba0c", "type of credit_fit.n_features_in_ is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(credit_fit.n_features_in_).encode("utf-8")+b"551b7").hexdigest() == "7b6d83d9f7612f6f5abf8f7d54d03d73a5b18926", "value of credit_fit.n_features_in_ is not correct"

print('Success!')

**Question 1.7**
<br> {points: 1}

Examine the `credit_fit` to obtain the slopes/coefficients for each predictor and intercept. Which of the following mathematical equations is correct for your prediction model?

A. $credit\: card \: balance = -530.674 -7.190*income  + 3.914*credit\: card\: rating$

B. $credit\: card \: balance = -530.674 + 3.914*income  -7.190*credit\: card\: rating$

C. $credit\: card \: balance = 530.674 -7.190*income  - 3.914*credit\: card\: rating$

D. $credit\: card \: balance = 530.674 - 3.914*income  + 7.190*credit\: card\: rating$

*Assign your answer to an object called `answer1_7`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*
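
A linear regression with two predictors predicts with an equation of the form $\hat{y} = b_0 + b_1 x_1 + b_2 x_2$, where $b_0$ is the intercept and $b_1$, $b_2$ are the slopes. The sketch below evaluates such an equation by hand; the numbers and names are hypothetical and are not related to the options above.

```python
# Hypothetical fitted values (not the answer to this question).
intercept = 10.0
coefs = {"x1": 2.0, "x2": -0.5}

# One new observation to predict for.
new_obs = {"x1": 3.0, "x2": 4.0}

# prediction = intercept + sum over predictors of (slope * predictor value)
prediction = intercept + sum(coefs[name] * new_obs[name] for name in coefs)
print(prediction)  # 10 + 2*3 - 0.5*4 = 14.0
```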

In [None]:
# your code here
raise NotImplementedError
answer1_7

In [None]:
from hashlib import sha1
assert sha1(str(type(answer1_7)).encode("utf-8")+b"a1ff6").hexdigest() == "dd15c298e5f75c0a5a5fc8d50fcc7f2453235cca", "type of answer1_7 is not str. answer1_7 should be an str"
assert sha1(str(len(answer1_7)).encode("utf-8")+b"a1ff6").hexdigest() == "e61872b236019acb84a411f696aa3b6dd917601b", "length of answer1_7 is not correct"
assert sha1(str(answer1_7.lower()).encode("utf-8")+b"a1ff6").hexdigest() == "e965efe73c947084363d66198310612b55c681a1", "value of answer1_7 is not correct"
assert sha1(str(answer1_7).encode("utf-8")+b"a1ff6").hexdigest() == "ebe19764498b055e96523af351080071b27a8df6", "correct string value of answer1_7 but incorrect case of letters"

print('Success!')

**Question 1.8**
<br> {points: 1}

Evaluate the RMSE to assess goodness of fit on `credit_fit` (remember this is how well it predicts on the **training data** used to fit the model). First, calculate the predicted values for the training data and save them in an object called `credit_predictions`. Next, use the `mean_squared_error` function to calculate the mean squared error between the predicted values and the training response variable. Finally, take the square root to obtain the RMSE.

*Assign your answer to a single numerical value called `lm_rmse`.*
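
The arithmetic behind the RMSE can be sketched with plain NumPy on made-up numbers (the arrays below are hypothetical). The first step, averaging the squared differences, is the mean squared error; the square root then brings the value back to the units of the response.

```python
import numpy as np

# Hypothetical observed and predicted values; every prediction is off by 2.
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_pred = np.array([12.0, 10.0, 16.0, 14.0])

mse = np.mean((y_true - y_pred) ** 2)  # mean of the squared errors
rmse = np.sqrt(mse)                    # square root of the mean squared error
print(mse, rmse)  # 4.0 2.0
```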

In [None]:
# your code here
raise NotImplementedError
lm_rmse

In [None]:
from hashlib import sha1
assert sha1(str(type(round(lm_rmse, 2))).encode("utf-8")+b"1afe").hexdigest() == "abf61a98ebcb8fc705a5e63886e9b053b46aee5c", "type of round(lm_rmse, 2) is not correct"
assert sha1(str(round(lm_rmse, 2)).encode("utf-8")+b"1afe").hexdigest() == "cddf54d8ba66186196ab1e21239ab11e38d9528c", "value of round(lm_rmse, 2) is not correct"

print('Success!')

**Question 1.9**
<br> {points: 1}

Now evaluate the RMSPE on the test data. Take the same steps as before, but compare the model's predictions and actual response variable values on the **test data.**

*Assign your answer to a single numerical value called `lm_rmspe`.*
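
To illustrate why the training RMSE and the test RMSPE can differ, here is a deliberately exaggerated sketch on synthetic data. It assumes a $k$-nn regressor with a single neighbour rather than the linear model from this question, and all data and names are hypothetical. With one neighbour and unique predictor values, each training point is its own nearest neighbour, so the training error is zero even though the error on unseen data is not.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical noisy data with unique x values.
rng = np.random.default_rng(0)
toy_X = np.linspace(0, 10, 100).reshape(-1, 1)
toy_y = np.sin(toy_X).ravel() + rng.normal(0, 0.3, 100)

toy_X_tr, toy_X_te, toy_y_tr, toy_y_te = train_test_split(
    toy_X, toy_y, train_size=0.75, random_state=0
)

toy_model = KNeighborsRegressor(n_neighbors=1).fit(toy_X_tr, toy_y_tr)

# RMSE: error on the data used to fit the model (goodness of fit).
toy_rmse = mean_squared_error(toy_y_tr, toy_model.predict(toy_X_tr)) ** 0.5
# RMSPE: error on data the model has not seen (prediction quality).
toy_rmspe = mean_squared_error(toy_y_te, toy_model.predict(toy_X_te)) ** 0.5

print(toy_rmse, toy_rmspe)
```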

In [None]:
# your code here
raise NotImplementedError
lm_rmspe

In [None]:
from hashlib import sha1
assert sha1(str(type(round(lm_rmse, 2))).encode("utf-8")+b"6ad1a").hexdigest() == "22fc8930eb920d1dd9adb575f3740f856057cb28", "type of round(lm_rmse, 2) is not correct"
assert sha1(str(round(lm_rmse, 2)).encode("utf-8")+b"6ad1a").hexdigest() == "a0ea00b725785a35fb451b826e590cefcaca7264", "value of round(lm_rmse, 2) is not correct"

print('Success!')

**Question 1.9.1**
<br> {points: 3}

Redo this analysis using $k$-nn regression instead of linear regression. Use the same predictors and the same train-test split as you used for linear regression. If you need help, follow the step-by-step instructions below.

1. Create the $k$-nn regression model called `credit_knn`.

2. Create a parameter grid for `n_neighbors`. Assign the parameter grid to an object called `gridvals`. 

3. Tune the number of neighbors using `GridSearchCV` with 5-fold cross-validation and `neg_root_mean_squared_error` as the scoring method. Assign your answer to an object called `credit_tuned`.

4. Fit your tuned model to `X_train` and `y_train`. 

5. From your results, find the lowest value of RMSPE (the mean cross-validated RMSPE of the best estimator) and assign it to a single numerical value called `knn_rmspe`. Remember to add a negative sign `-` to the best score output by the cross-validation, because the `neg_root_mean_squared_error` scoring method reports the negative of the RMSPE!
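
The steps above can be sketched on synthetic data as follows. The data, the grid of 1 to 20 neighbours, and all `toy_` names are assumptions for illustration; choose the grid for the credit data yourself.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical data: a noisy linear trend with a single predictor.
rng = np.random.default_rng(0)
toy_X = rng.uniform(0, 10, size=(200, 1))
toy_y = 3 * toy_X.ravel() + rng.normal(0, 1, 200)

# Step 1: the model; step 2: the candidate values for n_neighbors.
toy_knn = KNeighborsRegressor()
toy_grid = {"n_neighbors": range(1, 21)}

# Step 3: 5-fold cross-validation scored by negative RMSPE.
toy_tuned = GridSearchCV(
    estimator=toy_knn,
    param_grid=toy_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)

# Step 4: run the search on the training data.
toy_tuned.fit(toy_X, toy_y)

# Step 5: best_score_ is a negative RMSPE, so flip the sign.
toy_best_k = toy_tuned.best_params_["n_neighbors"]
toy_cv_rmspe = -toy_tuned.best_score_
print(toy_best_k, toy_cv_rmspe)
```

scikit-learn scorers follow a "higher is better" convention, which is why error metrics are reported with a negative sign.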

In [None]:
np.random.seed(2020) # Do not change!

# your code here
raise NotImplementedError
knn_rmspe

**Question 1.9.2** 
<br> {points: 3}

Discuss which model, linear regression versus $k$-nn regression, gives better predictions and why you think that might be happening.

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

## 2. Ames Housing Prices

<img src="https://media.giphy.com/media/xUPGGuzpmG3jfeYWIg/giphy.gif" width = "600"/>

Source: https://media.giphy.com/media/xUPGGuzpmG3jfeYWIg/giphy.gif

If we take a look at the Business Insider report [What do millennials want in a home?](https://www.businessinsider.com/what-do-millennials-want-in-a-home-2017-2), we can see that millennials like newer houses that have their own defined spaces. Today we are going to be looking at housing data to understand how the sale price of a house is determined. Finding highly detailed housing data with final sale prices is very hard; however, researchers from Truman State University have studied and made available a data set containing multiple variables for the city of Ames, Iowa. The data set describes the sale of individual residential properties in Ames, Iowa from 2006 to 2010. You can read more about the data set [here](http://jse.amstat.org/v19n3/decock.pdf). Today we will be looking at five different variables to predict the sale price of a house. These variables are:

- Lot Area: `lot_area`
- Year Built: `year_built`
- Basement Square Footage: `bsmt_sf`
- First Floor Square Footage: `first_sf`
- Second Floor Square Footage: `second_sf`

First, load the data with the script given below. 

In [None]:
# run this cell

ames_data = (
    pd.read_csv("data/ames.csv")
    [[
        "Lot.Area",
        "Year.Built",
        "Total.Bsmt.SF",
        "X1st.Flr.SF",
        "X2nd.Flr.SF",
        "SalePrice",
    ]]
    .rename(
        columns={
            "Lot.Area": "lot_area",
            "Year.Built": "year_built",
            "Total.Bsmt.SF": "bsmt_sf",
            "X1st.Flr.SF": "first_sf",
            "X2nd.Flr.SF": "second_sf",
            "SalePrice": "sale_price",
        }
    )
    .dropna()
)

ames_data

**Question 2.1**
<br> {points: 3}

Split the data into a train dataset and a test dataset, based on a 70%-30% train-test split. Use `random_state = 2019`. Remember that we want to predict the `sale_price` based on all of the other variables. 

*Assign the objects to `ames_training`, `ames_testing`, `X_train_ames`, `y_train_ames`, `X_test_ames` and `y_test_ames` respectively.*

In [None]:
# ___, ___ = train_test_split(
#     ___, ___, random_state=2019  # do not change the random state
# )

# X_train_ames = ...
# y_train_ames = ...
# X_test_ames = ...
# y_test_ames = ...


# your code here
raise NotImplementedError

In [None]:
# Note: this question has hidden tests! You can't see these while you're working.
# We list the objects you must create here to make sure you actually created them.

X_train_ames
y_train_ames
X_test_ames
y_test_ames

print('Success!')

**Question 2.2**
<br> {points: 3}

Let's start by exploring the training data. Create a scatter plot matrix (or pairplot) of all the columns to explore the relationships between the different variables. You can copy the code from Q1.3 as a starting point and modify the data used, which columns to plot, and the size of the plots as needed.

*Assign your plot object to a variable named `answer2_2`.*

In [None]:
# your code here
raise NotImplementedError
answer2_2

In [None]:
# Note: this question has hidden tests! You can't see these while you're working.
# We list the objects you must create here to make sure you actually created them.

answer2_2

print('Success!')

**Question 2.3**
<br> {points: 1}

Now that we have seen the relationships between the variables, which of the following variables appears to be the strongest predictor of `sale_price`?

A. `bsmt_sf`

B. `year_built`

C. `first_sf`

D. `lot_area`

E. `second_sf`

F. It isn't clear from these plots

*Assign your answer to an object called `answer2_3`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
raise NotImplementedError
answer2_3

In [None]:
# Note: this question has hidden tests! You can't see these while you're working.
# We list the objects you must create here to make sure you actually created them.

answer2_3

print('Success!')

**Question 2.4**
<br> {points: 3}

Fit a `scikit-learn` linear regression model to `X_train_ames` and `y_train_ames`.

- Create a linear regression model object called `lm`.
- Fit the model to `X_train_ames` and `y_train_ames`.

*Save the fit to an object called `ames_fit`.*
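
The create-then-fit pattern can be sketched on hypothetical toy data as follows. The names `toy_lm` and `toy_fit` and the made-up, noise-free response are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical predictors and a noise-free response with known coefficients.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "size": rng.uniform(50, 250, size=200),
    "age": rng.uniform(0, 80, size=200),
})
y_toy = 10_000 + 1_000 * X_toy["size"] - 500 * X_toy["age"]

# Step 1: create the model object.
toy_lm = LinearRegression()

# Step 2: fit it to the training predictors and response.
# .fit() returns the fitted estimator itself, so toy_fit and toy_lm
# refer to the same object.
toy_fit = toy_lm.fit(X_toy, y_toy)

print(toy_fit.coef_, toy_fit.intercept_)
```

Because the toy response contains no noise, the fitted coefficients and intercept should match the values used to generate it, up to floating-point error.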

In [None]:
# your code here
raise NotImplementedError
ames_fit

In [None]:
# Note: this question has hidden tests! You can't see these while you're working.
# We list the objects you must create here to make sure you actually created them.

lm
ames_fit

print('Success!')

**Question 2.5**
<br> {points: 1}

True or false: "Aside from the intercept, all the variables have a positive relationship with `sale_price`. As the values of the variables decrease, the prices of the houses increase."

*Assign your answer to an object called `answer2_5`. Make sure your answer is either `True` or `False`.*
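
When reading a list of coefficients, it can help to pair each one with its predictor name. The sketch below does this for a model fit on hypothetical toy data; the columns and values are made up and do not reflect the Ames model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: price rises with size and falls with age.
rng = np.random.default_rng(1)
X_toy = pd.DataFrame({
    "size": rng.uniform(50, 250, size=200),
    "age": rng.uniform(0, 80, size=200),
})
y_toy = 10_000 + 1_000 * X_toy["size"] - 500 * X_toy["age"]

toy_fit = LinearRegression().fit(X_toy, y_toy)

# coef_ follows the column order of the data frame used for fitting.
coef_table = pd.DataFrame({
    "predictor": X_toy.columns,
    "coefficient": toy_fit.coef_,
})

# A positive coefficient means the prediction increases as the predictor
# increases (holding the other predictors fixed); negative means it decreases.
coef_table["direction"] = np.where(
    coef_table["coefficient"] > 0, "positive", "negative"
)
print(coef_table)
```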

In [None]:
# run this cell
print(f"coefficients: {ames_fit.coef_.tolist()}")
print(f"intercept: {ames_fit.intercept_}")

In [None]:
# your code here
raise NotImplementedError
answer2_5

In [None]:
# Note: this question has hidden tests! You can't see these while you're working.
# We list the objects you must create here to make sure you actually created them.

answer2_5


print('Success!')

**Question 2.6**
<br> {points: 3}

Using the coefficients and intercept printed by the cell above, write down the equation for the linear model.

Make sure to use correct math typesetting syntax (surround your answer with dollar signs, e.g. `$0.5 * a$`).
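
For reference, a linear model with five predictors has the general form sketched below. The symbols $\beta_0, \ldots, \beta_5$ are placeholders for the intercept and coefficients, not values from the fitted model, and the ordering of the terms assumes the coefficients follow the column order of the training predictors.

```latex
$$
\text{sale price} = \beta_0
  + \beta_1 \cdot (\text{lot area})
  + \beta_2 \cdot (\text{year built})
  + \beta_3 \cdot (\text{basement sq. ft.})
  + \beta_4 \cdot (\text{first floor sq. ft.})
  + \beta_5 \cdot (\text{second floor sq. ft.})
$$
```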

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.


**Question 2.7**
<br> {points: 1}

Why can we not easily visualize the model above as a line or a plane in a single plot?

A. This is not true, we can actually easily visualize the model

B. The intercept is much larger (6 digits) than the coefficients (single/double digits)

C. There are more than 2 predictors

D. None of the above

*Assign your answer to an object called `answer2_7`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
raise NotImplementedError
answer2_7

In [None]:
# Note: this question has hidden tests! You can't see these while you're working.
# We list the objects you must create here to make sure you actually created them.

answer2_7

print('Success!')

**Question 2.8**
<br> {points: 3}

We need to evaluate how well our model is doing. For this question, calculate the RMSPE (a single numerical value) of the linear regression model using the test data set.

*Assign your answer to an object named `ames_rmspe`.*
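
As a reminder, RMSPE is the square root of the mean squared difference between the observed and predicted responses on data the model was not trained on. The sketch below computes it two equivalent ways on small hypothetical arrays, where `y_true` stands in for held-out test responses and `y_pred` for the model's predictions on the same rows.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical held-out responses and the corresponding predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

# Option 1: square root of scikit-learn's mean squared error.
rmspe = np.sqrt(mean_squared_error(y_true, y_pred))

# Option 2: the same quantity written out by hand.
rmspe_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(rmspe, rmspe_manual)
```

Here the errors are 1, 0, -1 and -2, so the mean squared error is 1.5 and both calculations give its square root.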

In [None]:
np.random.seed(2020) # DO NOT REMOVE

# ames_predictions = ...
# ames_rmspe = ...

# your code here
raise NotImplementedError
ames_rmspe

In [None]:
# Note: this question has hidden tests! You can't see these while you're working.
# We list the objects you must create here to make sure you actually created them.

ames_rmspe

print('Success!')

**Question 2.9**
<br> {points: 1}

Which of the following statements is **incorrect**?

A. RMSE is a measure of goodness of fit 

B. RMSE measures how well the model predicts on data it was trained with 

C. RMSPE measures how well the model predicts on data it was not trained with 

D. RMSPE measures how well the model predicts on data it was trained with

*Assign your answer to an object called `answer2_9`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
raise NotImplementedError
answer2_9

In [None]:
# Note: this question has hidden tests! You can't see these while you're working.
# We list the objects you must create here to make sure you actually created them.

answer2_9

print('Success!')