# Worksheet 8 - Regression

### Lecture and Tutorial Learning Goals:

After completing this week's lecture and tutorial work, you will be able to:
- Recognize situations where a simple regression analysis would be appropriate for making predictions.
- Explain the k-nearest neighbour ($k$-nn) regression algorithm and describe how it differs from $k$-nn classification.
- Interpret the output of a $k$-nn regression.
- In a dataset with two variables, perform $k$-nearest neighbour regression in Python using `scikit-learn` to predict the values for a test dataset.
- Using Python, execute hyperparameter tuning to choose the number of neighbours.
- Using Python, evaluate $k$-nn regression prediction accuracy using a test data set and an appropriate metric (root mean squared error).
- In the context of $k$-nn regression, compare and contrast goodness of fit and prediction properties (RMSE versus RMSPE).
- Describe advantages and disadvantages of the $k$-nearest neighbour regression approach.

In [None]:
### Run this cell before continuing.
import altair as alt
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

**Question 0.0** 
<br> {points: 1}

To predict a value of $Y$ for a new observation using $k$-nn **regression**, we identify the $k$-nearest neighbours and then:

A. Assign it the median of the $k$-nearest neighbours as the predicted value

B. Assign it the mean of the $k$-nearest neighbours as the predicted value

C. Assign it the mode of the $k$-nearest neighbours as the predicted value

D. Assign it the majority vote of the $k$-nearest neighbours as the predicted value

*Save the letter of the answer you think is correct to a variable named `answer0_0`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer0_0)).encode("utf-8")+b"3005b27b6f8e87a6").hexdigest() == "134ef66ec90e786d45adc1b317c20db9a38dfb8f", "type of answer0_0 is not str. answer0_0 should be an str"
assert sha1(str(len(answer0_0)).encode("utf-8")+b"3005b27b6f8e87a6").hexdigest() == "9ec8da7280439076279edc7995779b3da38efd68", "length of answer0_0 is not correct"
assert sha1(str(answer0_0.lower()).encode("utf-8")+b"3005b27b6f8e87a6").hexdigest() == "d55d9e4fb9a0716de33dbc18445449567ef0d885", "value of answer0_0 is not correct"
assert sha1(str(answer0_0).encode("utf-8")+b"3005b27b6f8e87a6").hexdigest() == "60afc5097cae07670b4f193decda4b18edf388bf", "correct string value of answer0_0 but incorrect case of letters"

print('Success!')

**Question 0.1**
<br> {points: 1}

Of those shown below, which is the correct formula for root mean squared error (RMSE)?


A. $\text{RMSE} = \sqrt{\frac{\frac{1}{n}\sum\limits_{i=1}^{n}(y_i - \hat{y_i})^2}{1 - n}}$

B. $\text{RMSE} = \sqrt{\frac{1}{n - 1}\sum\limits_{i=1}^{n}(y_i - \hat{y_i})^2}$

C. $\text{RMSE} = \sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}(y_i - \hat{y_i})^2}$

D. $\text{RMSE} = \sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}(y_i - \hat{y_i})}$ 

*Save the letter of your answer to a variable named `answer0_1`. Make sure you put quotations around the letter and pay attention to case.*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer0_1)).encode("utf-8")+b"122d02523b576c3b").hexdigest() == "e9e82f17baf27eeda9953a8d55a06d8dd274078d", "type of answer0_1 is not str. answer0_1 should be an str"
assert sha1(str(len(answer0_1)).encode("utf-8")+b"122d02523b576c3b").hexdigest() == "d6eb72cf0fbbda0533c575495b5bcb2e3351d334", "length of answer0_1 is not correct"
assert sha1(str(answer0_1.lower()).encode("utf-8")+b"122d02523b576c3b").hexdigest() == "a102b83ac0ddea3a46e55fb87d75c7801d6b51ce", "value of answer0_1 is not correct"
assert sha1(str(answer0_1).encode("utf-8")+b"122d02523b576c3b").hexdigest() == "48bea0dc4c3278456744ede36f33c3b0894d304e", "correct string value of answer0_1 but incorrect case of letters"

print('Success!')

**Question 0.2**
<br> {points: 1}

The plot below shows a very simple $k$-nn regression example, where the black dots are the observations and the blue line shows the predictions from a $k$-nn regression model fit to this data with $k=2$.

Using the formula for root mean squared error (given in the reading) and the graph below, calculate the RMSE for this model by hand (with pen and paper, or using Python as a calculator). **Use one decimal place of precision when inputting the heights of the black dots and blue line.**

*Save your answer to a variable named `answer0_2`*

<img align="left" src="img/k-nn_RMSE.jpeg" />

In [None]:
# your code here
raise NotImplementedError
answer0_2

In [None]:
from hashlib import sha1
assert sha1(str(type(answer0_2)).encode("utf-8")+b"bbd1ef3efddd094a").hexdigest() == "2a480e7359a929c043c3966ecfb4c93fdee7f717", "type of answer0_2 is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(answer0_2, 2)).encode("utf-8")+b"bbd1ef3efddd094a").hexdigest() == "9c6bbacee23d5086fd10d65d1874645b8e8913c7", "value of answer0_2 is not correct (rounded to 2 decimal places)"

print('Success!')

### RMSPE Definition

**Question 0.3** 
<br> {points: 1}

What does RMSPE stand for?


A. root mean squared prediction error

B. root mean squared percentage error 

C. root mean squared performance error 

D. root mean squared preference error 

*Save the letter of your answer to a variable named `answer0_3`. Make sure you put quotations around the letter and pay attention to case.*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer0_3)).encode("utf-8")+b"27a1112b2c6edc3e").hexdigest() == "32e43be91c086b0452138a4c571a2d07994387e1", "type of answer0_3 is not str. answer0_3 should be an str"
assert sha1(str(len(answer0_3)).encode("utf-8")+b"27a1112b2c6edc3e").hexdigest() == "0621ad7c9e3debdf67b74c93ea2c9af6d972b066", "length of answer0_3 is not correct"
assert sha1(str(answer0_3.lower()).encode("utf-8")+b"27a1112b2c6edc3e").hexdigest() == "7e77e88a4ee771136e05d3b2a7376259c74fef02", "value of answer0_3 is not correct"
assert sha1(str(answer0_3).encode("utf-8")+b"27a1112b2c6edc3e").hexdigest() == "c4c77c5febbb5a8f7b9a3c371798c6e316c038cd", "correct string value of answer0_3 but incorrect case of letters"

print('Success!')

## Marathon Training

<img src='https://media.giphy.com/media/nUN6InE2CodRm/giphy.gif' width='400'>

Source: https://media.giphy.com/media/nUN6InE2CodRm/giphy.gif

What predicts which athletes will perform better than others? Specifically, we are interested in marathon runners, and in how the maximum distance run per week (in miles) during race training predicts the time it takes a runner to finish the race. For this, we will use the `marathon.csv` file in the `data/` folder.

**Question 1.0** 
<br> {points: 1}

Load the data and assign it to an object called `marathon`. 

In [None]:
# your code here
raise NotImplementedError
marathon

In [None]:
from hashlib import sha1
assert sha1(str(type(marathon is None)).encode("utf-8")+b"f6e59b2488f821cc").hexdigest() == "2592efab4a331136de6e1c7a26cb532f68d702e3", "type of marathon is None is not bool. marathon is None should be a bool"
assert sha1(str(marathon is None).encode("utf-8")+b"f6e59b2488f821cc").hexdigest() == "1aac3f886b5d0da653abbdad918e4fe626b4f049", "boolean value of marathon is None is not correct"

assert sha1(str(type(marathon)).encode("utf-8")+b"c8d3a54f058f3c98").hexdigest() == "252a0eebfa5536f4b09bedad5fd8adeedda5c578", "type of type(marathon) is not correct"

assert sha1(str(type(marathon.shape)).encode("utf-8")+b"c350acd7410c54fb").hexdigest() == "e536f521296c33394513f950065a31f50fe038c5", "type of marathon.shape is not tuple. marathon.shape should be a tuple"
assert sha1(str(len(marathon.shape)).encode("utf-8")+b"c350acd7410c54fb").hexdigest() == "9ac3a3ef86138eaba03e5119149597919bd82188", "length of marathon.shape is not correct"
assert sha1(str(sorted(map(str, marathon.shape))).encode("utf-8")+b"c350acd7410c54fb").hexdigest() == "1e2ce2fa23e4203a434f822c13dc3912bc07d5d5", "values of marathon.shape are not correct"
assert sha1(str(marathon.shape).encode("utf-8")+b"c350acd7410c54fb").hexdigest() == "b4c62832f2cb330a0afe41e60d5fe9ff81fdbf28", "order of elements of marathon.shape is not correct"

assert sha1(str(type("time_hrs" in marathon.columns)).encode("utf-8")+b"2a9fcbd171f71499").hexdigest() == "4009690ab30de10f7b22011b34e2d94e5bfa84aa", "type of \"time_hrs\" in marathon.columns is not bool. \"time_hrs\" in marathon.columns should be a bool"
assert sha1(str("time_hrs" in marathon.columns).encode("utf-8")+b"2a9fcbd171f71499").hexdigest() == "209423111ba2479b5c42e6957464cf51febfc448", "boolean value of \"time_hrs\" in marathon.columns is not correct"

assert sha1(str(type("max" in marathon.columns)).encode("utf-8")+b"40b9bdad979ae4ba").hexdigest() == "8046dd9bddde2dffcc63ed76769bb7e6e180e645", "type of \"max\" in marathon.columns is not bool. \"max\" in marathon.columns should be a bool"
assert sha1(str("max" in marathon.columns).encode("utf-8")+b"40b9bdad979ae4ba").hexdigest() == "7dade57baf617379adcab08a4eb22add689e923b", "boolean value of \"max\" in marathon.columns is not correct"

print('Success!')

**Question 2.0**
<br> {points: 1}

We want to predict race time in hours (`time_hrs`) given a particular value of the maximum distance run per week (in miles) during race training (`max`). Let's take a random subset of 50 individuals from our marathon data and assign it to an object called `marathon_50`. With this subset, create a scatterplot to assess the relationship between these two variables. Put `time_hrs` on the y-axis and `max` on the x-axis. Based on the scatterplot you create below, discuss with a classmate the relationship between race time and maximum distance run per week during training.

*Hint: To take a subset of your data you can use the `sample` function*

*Assign your plot to an object called `answer2`.*
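As a small illustration of the `sample` hint, here is a sketch on a made-up data frame (the `toy` data and its column names are assumptions for illustration, not the marathon data). It shows that fixing `random_state` makes the random subset repeatable.

```python
import pandas as pd

# Hypothetical toy data frame, not the marathon data.
toy = pd.DataFrame({"id": range(10), "value": [v * 1.5 for v in range(10)]})

# Draw 4 rows at random without replacement; fixing random_state makes the draw repeatable.
toy_subset = toy.sample(4, random_state=0)
toy_subset_again = toy.sample(4, random_state=0)

print(toy_subset.shape)                     # (4, 2)
print(toy_subset.equals(toy_subset_again))  # True
```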

In [None]:
# ___ = ___.sample(___, random_state=300) # Do not change the random_state


# your code here
raise NotImplementedError
answer2

In [None]:
from hashlib import sha1
assert sha1(str(type(answer2 is None)).encode("utf-8")+b"374c4a8995cb1194").hexdigest() == "41d58d353a38252ed5a4d4d2c6ef1b8c6287b481", "type of answer2 is None is not bool. answer2 is None should be a bool"
assert sha1(str(answer2 is None).encode("utf-8")+b"374c4a8995cb1194").hexdigest() == "1b07dd0e4f7f0780f344626cedb66c3271b9356e", "boolean value of answer2 is None is not correct"

assert sha1(str(type(marathon_50.shape)).encode("utf-8")+b"396bdb3445ffa91b").hexdigest() == "8eaec804a26abf042c5de5fda7f56fbfd3c02151", "type of marathon_50.shape is not tuple. marathon_50.shape should be a tuple"
assert sha1(str(len(marathon_50.shape)).encode("utf-8")+b"396bdb3445ffa91b").hexdigest() == "b8b1fe989da6c3dd87e21d79e20ea00ec9578f26", "length of marathon_50.shape is not correct"
assert sha1(str(sorted(map(str, marathon_50.shape))).encode("utf-8")+b"396bdb3445ffa91b").hexdigest() == "e66ff0916b9cbcf954ad2efaed08b45ce16793c8", "values of marathon_50.shape are not correct"
assert sha1(str(marathon_50.shape).encode("utf-8")+b"396bdb3445ffa91b").hexdigest() == "d726e63cfd24175f3b2d4d57d61ea15f611c438a", "order of elements of marathon_50.shape is not correct"

assert sha1(str(type(answer2.data.equals(marathon_50))).encode("utf-8")+b"bb3e6f0e7e675456").hexdigest() == "c960f59f9e92daa027f329867fb1701bbbd3efd7", "type of answer2.data.equals(marathon_50) is not bool. answer2.data.equals(marathon_50) should be a bool"
assert sha1(str(answer2.data.equals(marathon_50)).encode("utf-8")+b"bb3e6f0e7e675456").hexdigest() == "7a88baecf776675d4f54d1326f56f870e7718cb2", "boolean value of answer2.data.equals(marathon_50) is not correct"

assert sha1(str(type(answer2.encoding.x.field)).encode("utf-8")+b"e1f973a5b71064d0").hexdigest() == "b619e9bcd843a0b504e4aa25a6508a6f05d28fcf", "type of answer2.encoding.x.field is not str. answer2.encoding.x.field should be an str"
assert sha1(str(len(answer2.encoding.x.field)).encode("utf-8")+b"e1f973a5b71064d0").hexdigest() == "e81e90d97f28b3a9ed0e5bdf4b8e0a7e80492a33", "length of answer2.encoding.x.field is not correct"
assert sha1(str(answer2.encoding.x.field.lower()).encode("utf-8")+b"e1f973a5b71064d0").hexdigest() == "ba0112b77d55d65bc73d2ba45dbc7e79a116658b", "value of answer2.encoding.x.field is not correct"
assert sha1(str(answer2.encoding.x.field).encode("utf-8")+b"e1f973a5b71064d0").hexdigest() == "ba0112b77d55d65bc73d2ba45dbc7e79a116658b", "correct string value of answer2.encoding.x.field but incorrect case of letters"

assert sha1(str(type(answer2.encoding.y.field)).encode("utf-8")+b"7e99db2d0be81679").hexdigest() == "ea1bb6a5ced6b770e9acf759306fdc5bd2500829", "type of answer2.encoding.y.field is not str. answer2.encoding.y.field should be an str"
assert sha1(str(len(answer2.encoding.y.field)).encode("utf-8")+b"7e99db2d0be81679").hexdigest() == "677a4269509059fcd8a3e6ed391ce24e1cf61457", "length of answer2.encoding.y.field is not correct"
assert sha1(str(answer2.encoding.y.field.lower()).encode("utf-8")+b"7e99db2d0be81679").hexdigest() == "b9e210c51b493ed7176a5994d960deb7292716b8", "value of answer2.encoding.y.field is not correct"
assert sha1(str(answer2.encoding.y.field).encode("utf-8")+b"7e99db2d0be81679").hexdigest() == "b9e210c51b493ed7176a5994d960deb7292716b8", "correct string value of answer2.encoding.y.field but incorrect case of letters"

assert sha1(str(type(answer2.mark.type)).encode("utf-8")+b"1d273b2839f624c6").hexdigest() == "49e64628e44e1cb43764110a0c419159b994b540", "type of answer2.mark.type is not str. answer2.mark.type should be an str"
assert sha1(str(len(answer2.mark.type)).encode("utf-8")+b"1d273b2839f624c6").hexdigest() == "f6caa7e18430c9200f9e5f6d2e1e387303af0ec3", "length of answer2.mark.type is not correct"
assert sha1(str(answer2.mark.type.lower()).encode("utf-8")+b"1d273b2839f624c6").hexdigest() == "7609b74ce453376b7cfea8b936e26a4959ccbc2c", "value of answer2.mark.type is not correct"
assert sha1(str(answer2.mark.type).encode("utf-8")+b"1d273b2839f624c6").hexdigest() == "7609b74ce453376b7cfea8b936e26a4959ccbc2c", "correct string value of answer2.mark.type but incorrect case of letters"

assert sha1(str(type(answer2.encoding.x.title != answer2.encoding.x.field)).encode("utf-8")+b"7551b2b36a062139").hexdigest() == "3da97c85500c0b35be0a429ce4214abdbae78e93", "type of answer2.encoding.x.title != answer2.encoding.x.field is not bool. answer2.encoding.x.title != answer2.encoding.x.field should be a bool"
assert sha1(str(answer2.encoding.x.title != answer2.encoding.x.field).encode("utf-8")+b"7551b2b36a062139").hexdigest() == "e44d4aea1e5e40a81dfb6f145f2a46386c021c50", "boolean value of answer2.encoding.x.title != answer2.encoding.x.field is not correct"

assert sha1(str(type(answer2.encoding.y.title != answer2.encoding.y.field)).encode("utf-8")+b"ba6b9094c10d0448").hexdigest() == "988d91d423b3fbd473b33fa610dfb6ddaaa87a84", "type of answer2.encoding.y.title != answer2.encoding.y.field is not bool. answer2.encoding.y.title != answer2.encoding.y.field should be a bool"
assert sha1(str(answer2.encoding.y.title != answer2.encoding.y.field).encode("utf-8")+b"ba6b9094c10d0448").hexdigest() == "fef99960144832f5d9126bdb82402c53cd4bbc6a", "boolean value of answer2.encoding.y.title != answer2.encoding.y.field is not correct"

print('Success!')

**Question 3.0**
<br> {points: 1}

Suppose we want to predict the race time for someone who ran a maximum distance of 100 miles per week during training. In the plot below we can see that no one has run a maximum distance of exactly 100 miles per week. So how can we make a prediction with this data? We can use $k$-nn regression! To do this, we find the $k$ observations whose `max` values are nearest to 100, take the average of their $Y$ values (the target/response variable), and use that average as the prediction.

For this question, predict the race time for a maximum of 100 miles per week during training based on the 4 nearest neighbours.

*Fill in the scaffolding below and assign your answer to an object named `answer3`.*
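To make the idea concrete, here is a minimal sketch of a by-hand $k$-nn regression prediction in plain Python. The `observations` list and the `knn_predict` helper are hypothetical and exist only for illustration; they are not the marathon data or part of the scaffolding.

```python
# Hypothetical toy data: (predictor, response) pairs, not the marathon data.
observations = [(1.0, 10.0), (2.0, 12.0), (4.0, 20.0), (7.0, 30.0), (9.0, 41.0)]


def knn_predict(data, new_x, k):
    """Predict the response at new_x as the mean response of the k nearest points."""
    # Order the observations by their absolute distance from new_x.
    by_distance = sorted(data, key=lambda pair: abs(pair[0] - new_x))
    nearest = by_distance[:k]
    # The prediction is the average response of those k neighbours.
    return sum(y for _, y in nearest) / k


# The 3 nearest predictors to 5.0 are 4.0, 7.0 and 2.0, with responses 20, 30 and 12.
prediction = knn_predict(observations, new_x=5.0, k=3)
print(prediction)
```

Changing `k` changes which responses are averaged, so the same new observation can receive quite different predictions.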

In [None]:
# run this cell to see a visualization of the 4 nearest neighbours
circle_plot = (
    alt.Chart(marathon_50)
    .mark_circle(opacity=0.4)
    .encode(
        x=alt.X("max", title="Maximum Distance Ran per Week During Training (miles)"),
        y=alt.Y("time_hrs", title="Race Time (hours)", scale=alt.Scale(zero=False)),
    )
)
overlay = pd.DataFrame({"x": [100]})
rule = alt.Chart(overlay).mark_rule().encode(x="x")
line1_df = pd.DataFrame({"x": [100, 110], "y": [2.63, 2.63]})
line1 = alt.Chart(line1_df).mark_line(color="orange").encode(x="x", y="y")
line2_df = pd.DataFrame({"x": [100, 104], "y": [2.8, 2.8]})
line2 = alt.Chart(line2_df).mark_line(color="orange").encode(x="x", y="y")
line3_df = pd.DataFrame({"x": [100, 90], "y": [3.28, 3.28]})
line3 = alt.Chart(line3_df).mark_line(color="orange").encode(x="x", y="y")
line4_df = pd.DataFrame({"x": [100, 86], "y": [2.43, 2.43]})
line4 = alt.Chart(line4_df).mark_line(color="orange").encode(x="x", y="y")

(circle_plot + rule + line1 + line2 + line3 + line4).configure_axis(
    labelFontSize=20, titleFontSize=20
).properties(width=400, height=300)

In [None]:
# ___ = (
#     ___.assign(diff=abs(100 - ___))
#     .___("diff")
#     .iloc[___:___]
#     .mean()[___]
# )

# your code here
raise NotImplementedError
answer3

In [None]:
from hashlib import sha1
assert sha1(str(type(answer3)).encode("utf-8")+b"d14ae019a339c3d7").hexdigest() == "19dd7c0e861cf5343642e61f969515b13b899de4", "type of answer3 is not correct"
assert sha1(str(answer3).encode("utf-8")+b"d14ae019a339c3d7").hexdigest() == "00d2729cd74dfd109211d15c3325586bab4feff7", "value of answer3 is not correct"

print('Success!')

**Question 4.0**
<br> {points: 1}

For this question, let's instead predict the race time for the same runner (a maximum of 100 miles per week during training) using only the 2 nearest neighbours.

*Assign your answer to an object named `answer4`.*

In [None]:
# your code here
raise NotImplementedError
answer4

In [None]:
from hashlib import sha1
assert sha1(str(type(answer4)).encode("utf-8")+b"2879f178fb780111").hexdigest() == "f2e4752eb53eb2c8e4937640d7a371b7b366723a", "type of answer4 is not correct"
assert sha1(str(answer4).encode("utf-8")+b"2879f178fb780111").hexdigest() == "f764388decb890cbccc2aa1f526369d4350e25b3", "value of answer4 is not correct"

print('Success!')

**Question 5.0**
<br> {points: 1}

So far you have calculated $k$-nearest neighbours predictions manually, using values of $k$ that we gave you. However, last week we learned a better way to choose $k$ for classification.

Based on what you learned last week and what you have learned about $k$-nn regression so far this week, which method would you use to choose the $k$ (in the situation where we don't tell you which $k$ to use)?

- A) Choose the $k$ that excludes most outliers
- B) Choose the $k$ with the lowest training error
- C) Choose the $k$ with the lowest cross-validation error
- D) Choose the $k$ that includes the most data points
- E) Choose the $k$ with the lowest testing error

*Assign your answer to an object called `answer5`.  Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
raise NotImplementedError
answer5

In [None]:
from hashlib import sha1
assert sha1(str(type(answer5)).encode("utf-8")+b"6de12a1adf36c65f").hexdigest() == "adf121e5931614687776fb05e243a2f076e6679c", "type of answer5 is not str. answer5 should be an str"
assert sha1(str(len(answer5)).encode("utf-8")+b"6de12a1adf36c65f").hexdigest() == "e4dbe9863e7fd82d88a6b6e05b60871cc1c05b8f", "length of answer5 is not correct"
assert sha1(str(answer5.lower()).encode("utf-8")+b"6de12a1adf36c65f").hexdigest() == "6746d17826a323da5c8ac44e83faf7e1fbf063ae", "value of answer5 is not correct"
assert sha1(str(answer5).encode("utf-8")+b"6de12a1adf36c65f").hexdigest() == "8924dd3f1fe96f318eba1c9a47074d4d2e1dda58", "correct string value of answer5 but incorrect case of letters"

print('Success!')

**Question 6.0**
<br> {points: 1}

We have just seen how to perform $k$-nn regression manually. Now we will apply it to the whole dataset using the `scikit-learn` package. To do so, we first need to create the training and testing datasets. Split the `marathon` data using *75%* of it as the training set, and store the training set as `marathon_training` and the testing set as `marathon_testing`. Remember that we won't touch the test dataset until the end.

Next, set `time_hrs` as the target (y) and `max` as the feature (X). Store the features as `X_train` and `X_test` and the targets as `y_train` and `y_test`, taken from `marathon_training` and `marathon_testing` respectively.

*Assign your answers to objects named `marathon_training`, `marathon_testing`, `X_train`, `y_train`, `X_test`, and `y_test`.*
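The sketch below shows the general shape of a train/test split followed by separating a feature and a target, using a made-up data frame. The `toy` data, its column names, and the `random_state` value are assumptions for illustration only. It selects the feature with double brackets, which is one alternative to wrapping the column in `pd.DataFrame`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data frame; the column names are made up for illustration.
toy = pd.DataFrame(
    {"hours_studied": range(20), "score": [50 + 2 * h for h in range(20)]}
)

# Hold out 25% of the rows for testing; random_state makes the split reproducible.
toy_training, toy_testing = train_test_split(toy, test_size=0.25, random_state=1)

# Double brackets keep the feature as a 2-D DataFrame of shape (n_samples, n_features);
# single brackets give a 1-D Series for the target.
toy_X_train = toy_training[["hours_studied"]]
toy_y_train = toy_training["score"]

print(toy_X_train.shape, toy_y_train.shape, toy_testing.shape)  # (15, 1) (15,) (5, 2)
```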

In [None]:
# ___, ___ = train_test_split(
#     ___, test_size=___, random_state=2000 # do not change the random_state
# )

# ___ = pd.DataFrame(___["___"]) # the KNeighborsRegressor needs an input of shape (n_samples, n_features)
# ___ = ___["___"]

# ___ = pd.DataFrame(___["___"]) # the KNeighborsRegressor needs an input of shape (n_samples, n_features)
# ___ = ___["___"]

# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(marathon_training is None)).encode("utf-8")+b"2d0215003f51b735").hexdigest() == "efb60e57d507496037da2636266968b2acae0d0c", "type of marathon_training is None is not bool. marathon_training is None should be a bool"
assert sha1(str(marathon_training is None).encode("utf-8")+b"2d0215003f51b735").hexdigest() == "ece567cc5c3d47827d12145c4ca57fc8e5b9a369", "boolean value of marathon_training is None is not correct"

assert sha1(str(type(marathon_training.shape)).encode("utf-8")+b"78bb09bb3fa7ad1f").hexdigest() == "cd136cdc489ea52bdee6f81cac6d647757642685", "type of marathon_training.shape is not tuple. marathon_training.shape should be a tuple"
assert sha1(str(len(marathon_training.shape)).encode("utf-8")+b"78bb09bb3fa7ad1f").hexdigest() == "5c234a6d227ebb811fbdb5fd3b4091039b6934a0", "length of marathon_training.shape is not correct"
assert sha1(str(sorted(map(str, marathon_training.shape))).encode("utf-8")+b"78bb09bb3fa7ad1f").hexdigest() == "acad672dcd0c882754059e010f31c80a4c1c6984", "values of marathon_training.shape are not correct"
assert sha1(str(marathon_training.shape).encode("utf-8")+b"78bb09bb3fa7ad1f").hexdigest() == "0f5842c06b57d5177c25d3f9ab4637e9557fff9f", "order of elements of marathon_training.shape is not correct"

assert sha1(str(type(sum(marathon_training.age))).encode("utf-8")+b"5a4519151a4c57ff").hexdigest() == "da75c27876017e345861f51c80aa1b1179c59c73", "type of sum(marathon_training.age) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(sum(marathon_training.age)).encode("utf-8")+b"5a4519151a4c57ff").hexdigest() == "0c17f53ea476b9606056c5899c4d9ccf9bd91aca", "value of sum(marathon_training.age) is not correct"

assert sha1(str(type(marathon_testing is None)).encode("utf-8")+b"3ef4a4bf2f16a328").hexdigest() == "a58ea52641ebb4fdb65a1e685ef7878a768b3ac9", "type of marathon_testing is None is not bool. marathon_testing is None should be a bool"
assert sha1(str(marathon_testing is None).encode("utf-8")+b"3ef4a4bf2f16a328").hexdigest() == "d570cbd676b81193a85e21c10fdee148014d7a9d", "boolean value of marathon_testing is None is not correct"

assert sha1(str(type(marathon_testing.shape)).encode("utf-8")+b"4d4bd53a0ca69f38").hexdigest() == "8f450802fa490bbbd0dcfa72a3afc277363dfc8a", "type of marathon_testing.shape is not tuple. marathon_testing.shape should be a tuple"
assert sha1(str(len(marathon_testing.shape)).encode("utf-8")+b"4d4bd53a0ca69f38").hexdigest() == "e2d004c938095e0e1195f9df8c7a552d6c6f2006", "length of marathon_testing.shape is not correct"
assert sha1(str(sorted(map(str, marathon_testing.shape))).encode("utf-8")+b"4d4bd53a0ca69f38").hexdigest() == "bead22fda56c5ae39c68c4a906d667e1a813bfd2", "values of marathon_testing.shape are not correct"
assert sha1(str(marathon_testing.shape).encode("utf-8")+b"4d4bd53a0ca69f38").hexdigest() == "5508546a2083a465f22127314cef6bea49db3932", "order of elements of marathon_testing.shape is not correct"

assert sha1(str(type(sum(marathon_testing.age))).encode("utf-8")+b"ab47d19fb8a39c3e").hexdigest() == "c5430e79d2e97d2dfacd745ad10ecd389a728309", "type of sum(marathon_testing.age) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(sum(marathon_testing.age)).encode("utf-8")+b"ab47d19fb8a39c3e").hexdigest() == "bef00fce32a67d7c5f63638a06eedad58b2746f0", "value of sum(marathon_testing.age) is not correct"

assert sha1(str(type(X_train.columns.values)).encode("utf-8")+b"e700ea52d696a95f").hexdigest() == "5419cf47c6d1b42e252ec259957ef26a3f5b6d60", "type of X_train.columns.values is not correct"
assert sha1(str(X_train.columns.values).encode("utf-8")+b"e700ea52d696a95f").hexdigest() == "16682d1bc6b1247365c06fea3adc1d2078bd77b1", "value of X_train.columns.values is not correct"

assert sha1(str(type(X_train.shape)).encode("utf-8")+b"d3473bdce1692e96").hexdigest() == "810c5a9b9fd181fb2dfb9e20452353c9d8c68eaa", "type of X_train.shape is not tuple. X_train.shape should be a tuple"
assert sha1(str(len(X_train.shape)).encode("utf-8")+b"d3473bdce1692e96").hexdigest() == "fbeb7030f6de44856f2321dbdd519a13d39412d0", "length of X_train.shape is not correct"
assert sha1(str(sorted(map(str, X_train.shape))).encode("utf-8")+b"d3473bdce1692e96").hexdigest() == "e40090e7a8e8aacf91907c44dc12af17b7c0306c", "values of X_train.shape are not correct"
assert sha1(str(X_train.shape).encode("utf-8")+b"d3473bdce1692e96").hexdigest() == "4065150c318ea6076e491816c457e9a0ad22db67", "order of elements of X_train.shape is not correct"

assert sha1(str(type(y_train.name)).encode("utf-8")+b"ffc0bb1452d82cff").hexdigest() == "71914a475d2adce801eb63c2f1193e4bc74bf02d", "type of y_train.name is not str. y_train.name should be an str"
assert sha1(str(len(y_train.name)).encode("utf-8")+b"ffc0bb1452d82cff").hexdigest() == "4f05c4a3d46cf47f16f6c02d94616602ee8fd06c", "length of y_train.name is not correct"
assert sha1(str(y_train.name.lower()).encode("utf-8")+b"ffc0bb1452d82cff").hexdigest() == "7862e332be856daaf03673aa6174f96c4cddd81a", "value of y_train.name is not correct"
assert sha1(str(y_train.name).encode("utf-8")+b"ffc0bb1452d82cff").hexdigest() == "7862e332be856daaf03673aa6174f96c4cddd81a", "correct string value of y_train.name but incorrect case of letters"

assert sha1(str(type(y_train.shape)).encode("utf-8")+b"62b178816d00d1ed").hexdigest() == "d08ac0f85f75443cfa506780780ba3093a1a8d26", "type of y_train.shape is not tuple. y_train.shape should be a tuple"
assert sha1(str(len(y_train.shape)).encode("utf-8")+b"62b178816d00d1ed").hexdigest() == "d951355c67ec803cd51b22582993694c8c2739c3", "length of y_train.shape is not correct"
assert sha1(str(sorted(map(str, y_train.shape))).encode("utf-8")+b"62b178816d00d1ed").hexdigest() == "28aaae70c53d85328796da9f9c857a11cc74cb64", "values of y_train.shape are not correct"
assert sha1(str(y_train.shape).encode("utf-8")+b"62b178816d00d1ed").hexdigest() == "149b2181794cbdbd6262152d1e6608404df4baa4", "order of elements of y_train.shape is not correct"

assert sha1(str(type(X_test.columns.values)).encode("utf-8")+b"d62c2a4cd5a199e5").hexdigest() == "f70a55ae371e7a4f3123c2cbb95c124da71889bf", "type of X_test.columns.values is not correct"
assert sha1(str(X_test.columns.values).encode("utf-8")+b"d62c2a4cd5a199e5").hexdigest() == "28d87816312aa71bd767d6e863fdfffd6fa3c917", "value of X_test.columns.values is not correct"

assert sha1(str(type(X_test.shape)).encode("utf-8")+b"c8e7b508af17b17d").hexdigest() == "448ea563a31dd0a5f33a3ec96106529ff4abd78b", "type of X_test.shape is not tuple. X_test.shape should be a tuple"
assert sha1(str(len(X_test.shape)).encode("utf-8")+b"c8e7b508af17b17d").hexdigest() == "cfe349a593a8552ca94521c3050fc26dccf3cbe1", "length of X_test.shape is not correct"
assert sha1(str(sorted(map(str, X_test.shape))).encode("utf-8")+b"c8e7b508af17b17d").hexdigest() == "85d3b03d03b84161960edab12913eae3514fafda", "values of X_test.shape are not correct"
assert sha1(str(X_test.shape).encode("utf-8")+b"c8e7b508af17b17d").hexdigest() == "e51978c50cfb6a9353beb2e7ccfa278b40e6bf8c", "order of elements of X_test.shape is not correct"

assert sha1(str(type(y_test.name)).encode("utf-8")+b"af149d631972d8d4").hexdigest() == "b0f20e02ca2a0ccad5ce4fa82fe30d8d9c63aa80", "type of y_test.name is not str. y_test.name should be an str"
assert sha1(str(len(y_test.name)).encode("utf-8")+b"af149d631972d8d4").hexdigest() == "4ddd04455e1658c81683bbdb1d662d2f7e33bf0e", "length of y_test.name is not correct"
assert sha1(str(y_test.name.lower()).encode("utf-8")+b"af149d631972d8d4").hexdigest() == "a2bf6b6d696ac5583da8d97fb7fa6f70b27feeed", "value of y_test.name is not correct"
assert sha1(str(y_test.name).encode("utf-8")+b"af149d631972d8d4").hexdigest() == "a2bf6b6d696ac5583da8d97fb7fa6f70b27feeed", "correct string value of y_test.name but incorrect case of letters"

assert sha1(str(type(y_test.shape)).encode("utf-8")+b"615227d22cfd7da7").hexdigest() == "811347c6a927b047321d2037bc3afba5c4f39130", "type of y_test.shape is not tuple. y_test.shape should be a tuple"
assert sha1(str(len(y_test.shape)).encode("utf-8")+b"615227d22cfd7da7").hexdigest() == "cc100a980c5ff4275ff9e87bef9dc47a2d376940", "length of y_test.shape is not correct"
assert sha1(str(sorted(map(str, y_test.shape))).encode("utf-8")+b"615227d22cfd7da7").hexdigest() == "c0636594632b4ee4b13f127dd8b7411ba5c748c6", "values of y_test.shape are not correct"
assert sha1(str(y_test.shape).encode("utf-8")+b"615227d22cfd7da7").hexdigest() == "77e10059872588b273e7b2e920813ad88519ae27", "order of elements of y_test.shape is not correct"

print('Success!')

**Question 7.0**
<br> {points: 1}

Next, we’ll use cross-validation on our **training data** to choose $k$. In $k$-nn classification, we used accuracy to see how well our predictions matched the true labels. In the context of $k$-nn *regression*, we will use RMSPE as the scoring metric instead. Interpreting the RMSPE value can be tricky, but generally speaking, if the predicted values are very close to the true values, the RMSPE will be small. Conversely, if the predicted values are *not* very close to the true values, the RMSPE will be large.
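As a rough illustration of that intuition, the sketch below compares two sets of made-up predictions against the same made-up true values (all numbers here are hypothetical). It assumes the RMSPE can be obtained as the square root of `mean_squared_error` from `scikit-learn`, evaluated on observations that were not used to fit the model.

```python
import math

from sklearn.metrics import mean_squared_error

# Hypothetical true values and two sets of hypothetical predictions.
y_true = [3.0, 3.5, 4.0, 2.5]
close_predictions = [3.1, 3.4, 4.1, 2.4]  # each off by 0.1
far_predictions = [4.0, 2.5, 5.0, 1.5]    # each off by 1.0

# Take the square root of the mean squared error to get back to the original units.
rmspe_close = math.sqrt(mean_squared_error(y_true, close_predictions))
rmspe_far = math.sqrt(mean_squared_error(y_true, far_predictions))

print(round(rmspe_close, 3), round(rmspe_far, 3))  # 0.1 1.0
```

Because the error is in the same units as the target, a value of 1.0 here would mean predictions are typically off by about one unit of the response.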

Let's perform a cross-validation and choose the optimal $k$!

First, create a pipeline for $k$-nn regression. We are still using the $k$-nearest neighbours algorithm, and we will also use `StandardScaler` to standardize the numerical values. Store your pipeline in an object called `marathon_pipe`. Then perform a cross-validation with 5 folds using the `cross_validate` function. Remember that `scikit-learn` treats a higher "score" as better, whereas a lower RMSPE is better, so we need to specify the *negative* RMSPE (`"neg_root_mean_squared_error"`) as the scoring.

*Store the output of the cross-validation in an object called `marathon_cv`.*
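The following sketch shows how negative-RMSPE scores from `cross_validate` can be read, using synthetic data rather than the marathon data. The data, the names prefixed with `toy_`, and the choice of `n_neighbors=3` are all arbitrary assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): a noisy linear trend.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({"x": rng.uniform(0, 10, size=100)})
toy_y = 2 * toy_X["x"] + rng.normal(0, 1, size=100)

# Standardize the feature, then fit a k-nn regressor (k chosen arbitrarily here).
toy_pipe = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("knn", KNeighborsRegressor(n_neighbors=3)),
    ]
)

toy_cv = cross_validate(
    toy_pipe,
    toy_X,
    toy_y,
    cv=5,
    scoring="neg_root_mean_squared_error",
    return_train_score=True,
)

# The scores are negative errors, so flip the sign to read them as RMSPE values.
fold_rmspe = -toy_cv["test_score"]
print(fold_rmspe.mean())
```

Averaging the sign-flipped validation scores across the folds gives a single estimate of the error for that value of $k$.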

In [None]:
# ___ = Pipeline(
#     steps=[
#         ("scaler", ___),
#         ("knn", ___),
#     ]
# )
# marathon_cv = cross_validate(
#     ___, ___, ___, scoring=___, return_train_score=True
# )

# your code here
raise NotImplementedError
marathon_cv

In [None]:
from hashlib import sha1
assert sha1(str(type(marathon_pipe is None)).encode("utf-8")+b"004bfe8ac2b68e95").hexdigest() == "148b8cd61158d497f492c85dfbfc27caa0bb65ea", "type of marathon_pipe is None is not bool. marathon_pipe is None should be a bool"
assert sha1(str(marathon_pipe is None).encode("utf-8")+b"004bfe8ac2b68e95").hexdigest() == "adee140afe4d0982d84752a7c860674deb970b40", "boolean value of marathon_pipe is None is not correct"

assert sha1(str(type(marathon_pipe.steps[1][1].n_neighbors)).encode("utf-8")+b"730df2a1ac40b794").hexdigest() == "bc2dfd666cc086a906910d6cc0db52f8f6cb989c", "type of marathon_pipe.steps[1][1].n_neighbors is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(marathon_pipe.steps[1][1].n_neighbors).encode("utf-8")+b"730df2a1ac40b794").hexdigest() == "0d87ada4434abd774bd8cc6c6604fb8d361303d9", "value of marathon_pipe.steps[1][1].n_neighbors is not correct"

assert sha1(str(type(marathon_pipe.steps[1][1].weights)).encode("utf-8")+b"53332f23a4e7ff1f").hexdigest() == "4b924b38e50c97b08dc57f252cd49fb43bf173fc", "type of marathon_pipe.steps[1][1].weights is not str. marathon_pipe.steps[1][1].weights should be an str"
assert sha1(str(len(marathon_pipe.steps[1][1].weights)).encode("utf-8")+b"53332f23a4e7ff1f").hexdigest() == "d3f04d4ea30494e70f5fe7271738c7e7ec449663", "length of marathon_pipe.steps[1][1].weights is not correct"
assert sha1(str(marathon_pipe.steps[1][1].weights.lower()).encode("utf-8")+b"53332f23a4e7ff1f").hexdigest() == "b798d23ea1208deedea01949b850688cf17fbd65", "value of marathon_pipe.steps[1][1].weights is not correct"
assert sha1(str(marathon_pipe.steps[1][1].weights).encode("utf-8")+b"53332f23a4e7ff1f").hexdigest() == "b798d23ea1208deedea01949b850688cf17fbd65", "correct string value of marathon_pipe.steps[1][1].weights but incorrect case of letters"

assert sha1(str(type(marathon_pipe.steps[0][1])).encode("utf-8")+b"34869adf80d555d1").hexdigest() == "f34a15e667daac8a99f2bf9f623a0079bf6bff9e", "type of marathon_pipe.steps[0][1] is not correct"
assert sha1(str(marathon_pipe.steps[0][1]).encode("utf-8")+b"34869adf80d555d1").hexdigest() == "4c130aecabacdf6d9aeff65a219f1ca4353c7c0e", "value of marathon_pipe.steps[0][1] is not correct"

assert sha1(str(type(marathon_cv is None)).encode("utf-8")+b"74ba46a5d6a9e051").hexdigest() == "2d86052fdaf16914bafcc55a4455f6c14e376864", "type of marathon_cv is None is not bool. marathon_cv is None should be a bool"
assert sha1(str(marathon_cv is None).encode("utf-8")+b"74ba46a5d6a9e051").hexdigest() == "03fd20fdd99ad1f7b476da950c87d051621e03eb", "boolean value of marathon_cv is None is not correct"

assert sha1(str(type(len(marathon_cv['train_score']))).encode("utf-8")+b"497979df1228c509").hexdigest() == "d64541d6c2d152a13a56829026f6479096f0b9c8", "type of len(marathon_cv['train_score']) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(len(marathon_cv['train_score'])).encode("utf-8")+b"497979df1228c509").hexdigest() == "20aae20e2a47475637c0bbde60150a446985e3ab", "value of len(marathon_cv['train_score']) is not correct"

assert sha1(str(type(sum(marathon_cv['train_score']))).encode("utf-8")+b"f411b54df97f08b8").hexdigest() == "912f349c82d038743d3f5352944784f53d4bb4dd", "type of sum(marathon_cv['train_score']) is not correct"
assert sha1(str(sum(marathon_cv['train_score'])).encode("utf-8")+b"f411b54df97f08b8").hexdigest() == "b452ac0cf29c0dddc56c99f2c36026e4b3a8efc3", "value of sum(marathon_cv['train_score']) is not correct"

assert sha1(str(type(sum(marathon_cv['test_score']))).encode("utf-8")+b"13662c9f7469be92").hexdigest() == "1a9e42146016b444de4067d66373307cf0d845e5", "type of sum(marathon_cv['test_score']) is not correct"
assert sha1(str(sum(marathon_cv['test_score'])).encode("utf-8")+b"13662c9f7469be92").hexdigest() == "82ce0b02658f4bd1c29a1db4202d97b114d32720", "value of sum(marathon_cv['test_score']) is not correct"

print('Success!')

**Question 8.0**
<br> {points: 1}

The major difference compared to other models from Chapters 6 and 7 is that we are running a *regression* rather than a *classification*. Because we are using `KNeighborsRegressor`, we need a different metric (`neg_root_mean_squared_error` rather than accuracy) for tuning and evaluation.

Now, let's use `neg_root_mean_squared_error` to find the best setting for $k$ for our model. Let's test 200 values of $k$.

First, create a parameter grid called `param_grid`: a dictionary whose key `knn__n_neighbors` holds the values 1 to 200, i.e. `range(1, 201, 1)`.

Next, tune your model so that it tests all the values in `param_grid`, using the `GridSearchCV` function with `cv=5`, `n_jobs=-1`, and `"neg_root_mean_squared_error"` as the scoring, and save the tuned model as `marathon_tuned`. Finally, fit the tuned model to the training data and save its `cv_results_` in a data frame.

*Assign your answer to an object called `marathon_results`.*
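The sketch below shows the general shape of such a grid search on synthetic data. Everything prefixed with `toy_` is a hypothetical name, the data are simulated, and the grid covers only 10 values of $k$ to keep the run short. `n_jobs` is left at its default here; setting `n_jobs=-1` would only parallelize the same computation.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
rng = np.random.default_rng(1)
toy_X = pd.DataFrame({"x": rng.uniform(0, 10, 100)})
toy_y = np.sin(toy_X["x"]) + rng.normal(0, 0.3, 100)

toy_pipe = Pipeline(
    steps=[("scaler", StandardScaler()), ("knn", KNeighborsRegressor())]
)

# The key is "<step name>__<parameter name>", matching the "knn" step above
toy_grid = {"knn__n_neighbors": range(1, 11)}

toy_tuned = GridSearchCV(
    toy_pipe, toy_grid, cv=5, scoring="neg_root_mean_squared_error"
)

# fit() returns the fitted search object, so cv_results_ can be read directly
toy_results = pd.DataFrame(toy_tuned.fit(toy_X, toy_y).cv_results_)
print(toy_results[["param_knn__n_neighbors", "mean_test_score", "std_test_score"]])
```

The resulting data frame has one row per candidate $k$, with the mean and standard deviation of the negative RMSPE across the five folds.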

In [None]:
np.random.seed(2019) # DO NOT CHANGE

# param_grid = _____
# marathon_tuned = GridSearchCV(___, ___, ___, ___, ___)
# marathon_results = pd.DataFrame(____.fit(____, ____).____) 

# your code here
raise NotImplementedError
marathon_results

In [None]:
from hashlib import sha1
assert sha1(str(type(param_grid is None)).encode("utf-8")+b"30d0d3662229d44f").hexdigest() == "da553ee69003b9c9aada1e7a438fa90a7922fdc2", "type of param_grid is None is not bool. param_grid is None should be a bool"
assert sha1(str(param_grid is None).encode("utf-8")+b"30d0d3662229d44f").hexdigest() == "4b0d0e9ab6112746b82babf4b47d3b0a300deee7", "boolean value of param_grid is None is not correct"

assert sha1(str(type(param_grid)).encode("utf-8")+b"893d5d976eee972f").hexdigest() == "1cfc7b36cd2dfee5cdfe7ec48158483e48eaa1e5", "type of type(param_grid) is not correct"

assert sha1(str(type("knn__n_neighbors" in param_grid)).encode("utf-8")+b"5d064e5ac067ef36").hexdigest() == "240cc970cb10d9c7001ef209b5b3f8b24abd0f37", "type of \"knn__n_neighbors\" in param_grid is not bool. \"knn__n_neighbors\" in param_grid should be a bool"
assert sha1(str("knn__n_neighbors" in param_grid).encode("utf-8")+b"5d064e5ac067ef36").hexdigest() == "3328dc836663830e698e2f98ee2d5f8151a3574e", "boolean value of \"knn__n_neighbors\" in param_grid is not correct"

assert sha1(str(type(sum(i for i in param_grid['knn__n_neighbors']))).encode("utf-8")+b"1386e4746ae8c0af").hexdigest() == "eb116977e22049b4037f6e9b3eee91911bb6c0c4", "type of sum(i for i in param_grid['knn__n_neighbors']) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(sum(i for i in param_grid['knn__n_neighbors'])).encode("utf-8")+b"1386e4746ae8c0af").hexdigest() == "799bc8fd4658c132300352880d3228c9bc9f32d1", "value of sum(i for i in param_grid['knn__n_neighbors']) is not correct"

assert sha1(str(type(marathon_tuned is None)).encode("utf-8")+b"86af93feb3d70b60").hexdigest() == "5accafa1127bb098a15d34342130b6ccd31c000d", "type of marathon_tuned is None is not bool. marathon_tuned is None should be a bool"
assert sha1(str(marathon_tuned is None).encode("utf-8")+b"86af93feb3d70b60").hexdigest() == "a40336f78b05274bbba2cfdd93fddddec845a334", "boolean value of marathon_tuned is None is not correct"

assert sha1(str(type(marathon_tuned.n_splits_)).encode("utf-8")+b"74d08a6999774b56").hexdigest() == "201daa1d39593236174285f23b0e6ea0a1cc54c1", "type of marathon_tuned.n_splits_ is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(marathon_tuned.n_splits_).encode("utf-8")+b"74d08a6999774b56").hexdigest() == "d5a69e2068d68f05a1e1b13ea92915eb9984450f", "value of marathon_tuned.n_splits_ is not correct"

assert sha1(str(type(marathon_tuned.estimator[0])).encode("utf-8")+b"530aa19fa4f7a5b1").hexdigest() == "528b142c6afdac39faa8551cd9cf88f33a5fa1f7", "type of marathon_tuned.estimator[0] is not correct"
assert sha1(str(marathon_tuned.estimator[0]).encode("utf-8")+b"530aa19fa4f7a5b1").hexdigest() == "36d2cfe2a9b09d79175b89078a6996f2caa6dc0b", "value of marathon_tuned.estimator[0] is not correct"

assert sha1(str(type(marathon_tuned.estimator[1])).encode("utf-8")+b"c7081be95a4c18d7").hexdigest() == "84e1c624db4321abae6c20709e63e35b66aa8661", "type of marathon_tuned.estimator[1] is not correct"
assert sha1(str(marathon_tuned.estimator[1]).encode("utf-8")+b"c7081be95a4c18d7").hexdigest() == "e5145ace6f0dd10425fe513c9fb485e224ff7cbe", "value of marathon_tuned.estimator[1] is not correct"

assert sha1(str(type(marathon_tuned.param_grid == param_grid)).encode("utf-8")+b"146e88dd624557b9").hexdigest() == "de0c5f75e2d432cfc776d2922e97c9e98915b9c6", "type of marathon_tuned.param_grid == param_grid is not bool. marathon_tuned.param_grid == param_grid should be a bool"
assert sha1(str(marathon_tuned.param_grid == param_grid).encode("utf-8")+b"146e88dd624557b9").hexdigest() == "adb3c253c4a21a2d0cd9b5bc7efdd4b278d4825f", "boolean value of marathon_tuned.param_grid == param_grid is not correct"

assert sha1(str(type(marathon_results is None)).encode("utf-8")+b"83af76c86c139484").hexdigest() == "470227cb3db4566678d6bf37199f11f47b0da93b", "type of marathon_results is None is not bool. marathon_results is None should be a bool"
assert sha1(str(marathon_results is None).encode("utf-8")+b"83af76c86c139484").hexdigest() == "f23c208dcf76f11e7b65bfe8b7ee409583ae9d5e", "boolean value of marathon_results is None is not correct"

assert sha1(str(type(marathon_results)).encode("utf-8")+b"eb0803bf929fb8fc").hexdigest() == "0852072f4b56d08257c8679b7369258c35d6482c", "type of type(marathon_results) is not correct"

assert sha1(str(type(marathon_results.shape)).encode("utf-8")+b"9cb4442962896aa7").hexdigest() == "9aa0580a0826901120b9f9db916204d6db8b7791", "type of marathon_results.shape is not tuple. marathon_results.shape should be a tuple"
assert sha1(str(len(marathon_results.shape)).encode("utf-8")+b"9cb4442962896aa7").hexdigest() == "187617b6641697c42d0db6c2767585b0b8773b83", "length of marathon_results.shape is not correct"
assert sha1(str(sorted(map(str, marathon_results.shape))).encode("utf-8")+b"9cb4442962896aa7").hexdigest() == "6b0c8217e7f8470c23d037c814c6cee4591d712c", "values of marathon_results.shape are not correct"
assert sha1(str(marathon_results.shape).encode("utf-8")+b"9cb4442962896aa7").hexdigest() == "0ebea9a661cfee1c8f4ddc07d45aaba560afe4a2", "order of elements of marathon_results.shape is not correct"

assert sha1(str(type(sum(marathon_results.param_knn__n_neighbors))).encode("utf-8")+b"d51b1bb73f693baf").hexdigest() == "4901aab671f06ea4405c3ffdfc4c700f096da295", "type of sum(marathon_results.param_knn__n_neighbors) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(sum(marathon_results.param_knn__n_neighbors)).encode("utf-8")+b"d51b1bb73f693baf").hexdigest() == "f9ea5b95ba67757a952e8532863e9fe50bc1b39d", "value of sum(marathon_results.param_knn__n_neighbors) is not correct"

assert sha1(str(type(sum(marathon_results.mean_test_score))).encode("utf-8")+b"5d59f13985b1f723").hexdigest() == "0f83e7c185abd4a4b68418a78a638185818ca9cd", "type of sum(marathon_results.mean_test_score) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(marathon_results.mean_test_score), 2)).encode("utf-8")+b"5d59f13985b1f723").hexdigest() == "4cc2eb8a30659364cf51f7731df652c0fdddd95a", "value of sum(marathon_results.mean_test_score) is not correct (rounded to 2 decimal places)"

assert sha1(str(type(sum(marathon_results.std_test_score))).encode("utf-8")+b"299d014150fd3d0b").hexdigest() == "ba58d653b00639c815178cfdf2d5f7673749f3d2", "type of sum(marathon_results.std_test_score) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(marathon_results.std_test_score), 2)).encode("utf-8")+b"299d014150fd3d0b").hexdigest() == "2a64dd4fecace666a851b6dccaddb3b31e129d70", "value of sum(marathon_results.std_test_score) is not correct (rounded to 2 decimal places)"

print('Success!')

**Question 8.1**
<br> {points: 1}

Great! Now find the number of neighbours that will serve as our best $k$ value by accessing the `best_params_` attribute of the model `marathon_tuned`. Your answer should simply be a dictionary with one key-value pair.

Also, find the score of the best model by accessing the `best_score_` attribute of `marathon_tuned`. Make sure to convert the negative RMSPE score into a positive RMSPE by negating it with a `-` sign.

*Assign your best parameters to an object called `marathon_min`, and assign your best RMSPE to an object called `marathon_best_RMSPE`.* 
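To see how these two attributes relate to the cross-validation results, consider the following sketch on synthetic data. All `toy_` names and the simulated data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
rng = np.random.default_rng(2)
toy_X = pd.DataFrame({"x": rng.uniform(0, 10, 100)})
toy_y = 0.5 * toy_X["x"] + rng.normal(0, 0.5, 100)

toy_pipe = Pipeline(
    steps=[("scaler", StandardScaler()), ("knn", KNeighborsRegressor())]
)
toy_tuned = GridSearchCV(
    toy_pipe,
    {"knn__n_neighbors": range(1, 11)},
    cv=5,
    scoring="neg_root_mean_squared_error",
).fit(toy_X, toy_y)

# best_params_ is a dictionary; best_score_ is the highest mean (negative) score
toy_best_params = toy_tuned.best_params_
toy_best_rmspe = -toy_tuned.best_score_

print(toy_best_params)
print(toy_best_rmspe)
```

Because the scores are negative RMSPE values, the highest score corresponds to the lowest RMSPE, which is why negating `best_score_` gives the smallest cross-validation RMSPE in the grid.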


In [None]:
# ___ = ___.best_params_
# ___ = -___.best_score_

# your code here
raise NotImplementedError
print(marathon_min)
print(marathon_best_RMSPE)

In [None]:
from hashlib import sha1
assert sha1(str(type(marathon_min is None)).encode("utf-8")+b"8f0710d86b1357e2").hexdigest() == "ccd1682e325f32b6badf612958aaead8ac3067e7", "type of marathon_min is None is not bool. marathon_min is None should be a bool"
assert sha1(str(marathon_min is None).encode("utf-8")+b"8f0710d86b1357e2").hexdigest() == "7eeb19e9665ef61d235c6c9c0fed0de529565b9f", "boolean value of marathon_min is None is not correct"

assert sha1(str(type(marathon_min)).encode("utf-8")+b"c304831dc39a2e9d").hexdigest() == "f989ed787872884b24f3ef8b2cebd04e0828188c", "type of type(marathon_min) is not correct"

assert sha1(str(type(marathon_min)).encode("utf-8")+b"ae047a4e0ac8d30e").hexdigest() == "ff0bcd8b02aa3d5818e2dec1b3a2a5e9c3268349", "type of marathon_min is not dict. marathon_min should be a dict"
assert sha1(str(len(list(marathon_min.keys()))).encode("utf-8")+b"ae047a4e0ac8d30e").hexdigest() == "e2ad2f7c10c5798a2dd757f6547aed8f58bf956e", "number of keys of marathon_min is not correct"
assert sha1(str(sorted(map(str, marathon_min.keys()))).encode("utf-8")+b"ae047a4e0ac8d30e").hexdigest() == "c180b1e6303d5c23ad87834a9285bb4c56b10f5e", "keys of marathon_min are not correct"
assert sha1(str(sorted(map(str, marathon_min.values()))).encode("utf-8")+b"ae047a4e0ac8d30e").hexdigest() == "717e8d8c68fc1ce757789b507e539fe81caa846b", "correct keys, but values of marathon_min are not correct"
assert sha1(str(marathon_min).encode("utf-8")+b"ae047a4e0ac8d30e").hexdigest() == "ca06db33515706a48690d3af7e26291a66278192", "correct keys and values, but incorrect correspondence in keys and values of marathon_min"

assert sha1(str(type(marathon_best_RMSPE is None)).encode("utf-8")+b"1ada1e7b7f2d83d8").hexdigest() == "80ab6a92d33d756053f690af3bd9e2336cf98aa8", "type of marathon_best_RMSPE is None is not bool. marathon_best_RMSPE is None should be a bool"
assert sha1(str(marathon_best_RMSPE is None).encode("utf-8")+b"1ada1e7b7f2d83d8").hexdigest() == "0fb4ad940b5aae550fbf21d6b38e6ee442748200", "boolean value of marathon_best_RMSPE is None is not correct"

assert sha1(str(type(marathon_best_RMSPE)).encode("utf-8")+b"e1df708d2fb29060").hexdigest() == "fa4eb1268acceef1e01ca82ac8d95398605f1dcf", "type of marathon_best_RMSPE is not correct"
assert sha1(str(marathon_best_RMSPE).encode("utf-8")+b"e1df708d2fb29060").hexdigest() == "c52b637749cf38afbb2e9c9ac0435220efb1e3da", "value of marathon_best_RMSPE is not correct"

print('Success!')

**Question 8.2**
<br> {points: 1}

To assess how well our model might do at predicting on unseen data, we will assess its RMSPE on the test data.

To start, get the `best_estimator_` from `marathon_tuned` and store it in an object called `marathon_best_model`.

Next, we will use the `predict` function to make predictions on the test data and store the predictions in `marathon_prediction`.

Finally, we will compute the RMSPE on the test data using the `mean_squared_error` function. Don't forget to take the square root to obtain the RMSPE!

*Note: `scikit-learn` also has a `score` function for the `KNeighborsRegressor`. The `score` function returns the coefficient of determination (often called $R^2$) of the fit, not the RMSPE.*


*Assign your answer to an object called `marathon_summary`.*
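The difference between the test RMSPE and the value returned by `score` can be seen in a small sketch on synthetic data. The `toy_` names, the simulated data, the split proportion, and `n_neighbors=5` are all assumptions for this example.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, split into training and test sets
rng = np.random.default_rng(3)
toy_X = pd.DataFrame({"x": rng.uniform(0, 10, 200)})
toy_y = 2 * toy_X["x"] + rng.normal(0, 1, 200)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=0
)

toy_model = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("knn", KNeighborsRegressor(n_neighbors=5)),
    ]
).fit(toy_X_train, toy_y_train)

toy_pred = toy_model.predict(toy_X_test)

# RMSPE: the square root of the mean squared error on the test set
toy_test_rmspe = mean_squared_error(toy_y_test, toy_pred) ** (1 / 2)

# score() reports R^2 instead, which is unitless and at most 1
toy_r2 = toy_model.score(toy_X_test, toy_y_test)

print(toy_test_rmspe, toy_r2)
```

The RMSPE is in the units of the response, whereas $R^2$ has no units, so the two numbers are not interchangeable.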


In [None]:
np.random.seed(1234) # DO NOT CHANGE

# marathon_best_model = ___.best_estimator_
# ___ = ___.____(___)
# ___ = mean_squared_error(___, ___) ** (1 / 2)


# your code here
raise NotImplementedError
marathon_summary

In [None]:
from hashlib import sha1
assert sha1(str(type(marathon_best_model is None)).encode("utf-8")+b"198f671d0953237e").hexdigest() == "d0e1e9dde4365bf9808d3365ee75d381e9cf1b7a", "type of marathon_best_model is None is not bool. marathon_best_model is None should be a bool"
assert sha1(str(marathon_best_model is None).encode("utf-8")+b"198f671d0953237e").hexdigest() == "79729a7cdaf177e41cad8931fc1b9c10a9ed5bdf", "boolean value of marathon_best_model is None is not correct"

assert sha1(str(type(marathon_best_model.steps[0][1])).encode("utf-8")+b"983f02a580ce406b").hexdigest() == "b741f745c7f506274b1ac3abb8398a6d0640c4a5", "type of marathon_best_model.steps[0][1] is not correct"
assert sha1(str(marathon_best_model.steps[0][1]).encode("utf-8")+b"983f02a580ce406b").hexdigest() == "121a5e291be6e25f39808fa26b424ff69a5184f2", "value of marathon_best_model.steps[0][1] is not correct"

assert sha1(str(type(marathon_best_model.steps[1][1])).encode("utf-8")+b"b534f7cd371f562d").hexdigest() == "3e3362aea6d663806a23f53ed08235383572bf3f", "type of marathon_best_model.steps[1][1] is not correct"
assert sha1(str(marathon_best_model.steps[1][1]).encode("utf-8")+b"b534f7cd371f562d").hexdigest() == "896e86b87da262b2d19fa170395053706ecb9428", "value of marathon_best_model.steps[1][1] is not correct"

assert sha1(str(type(marathon_best_model.steps[1][1].n_neighbors)).encode("utf-8")+b"71efd85cc1d148e2").hexdigest() == "e127113da834d251fefe174ee0ea4947b022cc2d", "type of marathon_best_model.steps[1][1].n_neighbors is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(marathon_best_model.steps[1][1].n_neighbors).encode("utf-8")+b"71efd85cc1d148e2").hexdigest() == "55dee0bd6eb13a6386f471d5512a303753dfcb17", "value of marathon_best_model.steps[1][1].n_neighbors is not correct"

assert sha1(str(type(marathon_best_model.steps[1][1].weights)).encode("utf-8")+b"a78b50d38a4d4d74").hexdigest() == "863b7a1cb11aa08863fe72e7d3b284a777d24b40", "type of marathon_best_model.steps[1][1].weights is not str. marathon_best_model.steps[1][1].weights should be an str"
assert sha1(str(len(marathon_best_model.steps[1][1].weights)).encode("utf-8")+b"a78b50d38a4d4d74").hexdigest() == "90fe36b762abcf82df8454390e3690bf073d1c91", "length of marathon_best_model.steps[1][1].weights is not correct"
assert sha1(str(marathon_best_model.steps[1][1].weights.lower()).encode("utf-8")+b"a78b50d38a4d4d74").hexdigest() == "90ee8436df9dfa5d764a9efa5ea2e9c210fdeabf", "value of marathon_best_model.steps[1][1].weights is not correct"
assert sha1(str(marathon_best_model.steps[1][1].weights).encode("utf-8")+b"a78b50d38a4d4d74").hexdigest() == "90ee8436df9dfa5d764a9efa5ea2e9c210fdeabf", "correct string value of marathon_best_model.steps[1][1].weights but incorrect case of letters"

assert sha1(str(type(marathon_prediction is None)).encode("utf-8")+b"ee7942bbbae711a5").hexdigest() == "84a5f9ade72e5cb756cefaf7dfe316e4e12c97f9", "type of marathon_prediction is None is not bool. marathon_prediction is None should be a bool"
assert sha1(str(marathon_prediction is None).encode("utf-8")+b"ee7942bbbae711a5").hexdigest() == "6aa4b5b0af09085b9b5dfec3d208ccae9ba036d0", "boolean value of marathon_prediction is None is not correct"

assert sha1(str(type(marathon_prediction)).encode("utf-8")+b"46ddbf02086ef96c").hexdigest() == "2a85478ab7f5b5111890c844c48c637da9502ee1", "type of type(marathon_prediction) is not correct"

assert sha1(str(type(marathon_prediction.sum())).encode("utf-8")+b"43b66e33b17dd080").hexdigest() == "4ea06ca8839f44e6fbbde0c1b9d452e84bea1959", "type of marathon_prediction.sum() is not correct"
assert sha1(str(marathon_prediction.sum()).encode("utf-8")+b"43b66e33b17dd080").hexdigest() == "06b0554bc2c02bbaeb33c8b43022a035364086e1", "value of marathon_prediction.sum() is not correct"

assert sha1(str(type(marathon_summary is None)).encode("utf-8")+b"1814e7bddd01583a").hexdigest() == "d8de3d94e4647b3bcfd7db74a97b49925387e512", "type of marathon_summary is None is not bool. marathon_summary is None should be a bool"
assert sha1(str(marathon_summary is None).encode("utf-8")+b"1814e7bddd01583a").hexdigest() == "64a587a7ac75fb84614901621721f6fe3fd4ba44", "boolean value of marathon_summary is None is not correct"

assert sha1(str(type(marathon_summary)).encode("utf-8")+b"1a1bbbaf223615b4").hexdigest() == "5738d639c804db980d92ec015994e6422f004308", "type of marathon_summary is not correct"
assert sha1(str(marathon_summary).encode("utf-8")+b"1a1bbbaf223615b4").hexdigest() == "daec3f439e65e466e2ad1f2f807cfe15555d2201", "value of marathon_summary is not correct"

print('Success!')

What does this RMSPE mean? RMSPE is measured in the units of the target/response variable, so it can sometimes be a bit hard to interpret. But in this case, we know that a typical marathon race time is somewhere between 3 and 5 hours. So this model allows us to predict a runner's race time to within about +/- 0.6 of an hour, or +/- 36 minutes. This is not *fantastic*, but not *terrible* either. We can certainly use the model to determine roughly whether an athlete will have a bad, good, or excellent race time, but probably cannot reliably distinguish between athletes of a similar caliber.

For now, let’s consider this approach to thinking about RMSPE from our testing data set: as long as it is not significantly worse than the cross-validation RMSPE of our best model (**Question 8.1**), then we can say that we’re not doing too much worse on the test data than we did on the training data. In future courses on statistical/machine learning, you will learn more about how to interpret RMSPE from testing data and other ways to assess models.

**Question 8.3**
<br>{points: 1}

The RMSPE from our testing data set is *much worse* than the cross-validation RMSPE of our best model. 

*Assign your answer to an object named `answer8_3`. Make sure your answer is either `True` or `False`.*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer8_3)).encode("utf-8")+b"93b17f89d4575f4c").hexdigest() == "39bd0bb8b038921a8f0eae3decdbe1baa6eaa865", "type of answer8_3 is not bool. answer8_3 should be a bool"
assert sha1(str(answer8_3).encode("utf-8")+b"93b17f89d4575f4c").hexdigest() == "d3420f2b3e07daca57a1c752f5da37150d145ed2", "boolean value of answer8_3 is not correct"

print('Success!')

**Question 9.0**
<br> {points: 1}

Let's visualize what the relationship between `max` and `time_hrs` looks like with our best $k$ value to ultimately explore how the $k$ value affects $k$-nn regression.

To do so, use the `predict` function on the `marathon_best_model` that utilizes the best $k$ value to create predictions for the `marathon_training` data. Then, add the column of predictions to the `marathon_training` data frame using the `assign` function. Name the resulting data frame `marathon_preds`.

Next, create a scatterplot with the maximum distance run per week against the marathon time from `marathon_preds`. Assign your plot to an object called `marathon_plot`. **Plot the predictions as a blue line over the data points.** Remember the fundamentals of effective visualizations such as having a **title** and **human-readable axis labels**.

*Assign the data frame from the first part to an object called `marathon_preds`, and the plot to an object called `marathon_plot`.*

In [None]:
np.random.seed(2019) # DO NOT CHANGE

# marathon_preds = ____.assign(
#     predictions= _____.predict(____)
# )
# marathon_plot = ...

# your code here
raise NotImplementedError
marathon_plot

In [None]:
from hashlib import sha1
assert sha1(str(type(marathon_preds is None)).encode("utf-8")+b"c7704d7569905c1b").hexdigest() == "2947ee2d2b614561cea315d200e8e6af3d2083b4", "type of marathon_preds is None is not bool. marathon_preds is None should be a bool"
assert sha1(str(marathon_preds is None).encode("utf-8")+b"c7704d7569905c1b").hexdigest() == "8abfff0f31f9dc86974172132d5a24485f28ca5d", "boolean value of marathon_preds is None is not correct"

assert sha1(str(type(marathon_preds)).encode("utf-8")+b"e9bee26273771d62").hexdigest() == "08e14038ae51a6f8894b9ca8c4bdb3b234fdd3a0", "type of type(marathon_preds) is not correct"

assert sha1(str(type(marathon_preds.shape)).encode("utf-8")+b"41fc11bd901353d0").hexdigest() == "1ccb0f81f8710491af937fb6fb12f53eb0dc6c72", "type of marathon_preds.shape is not tuple. marathon_preds.shape should be a tuple"
assert sha1(str(len(marathon_preds.shape)).encode("utf-8")+b"41fc11bd901353d0").hexdigest() == "a3333185a9239c0444b4282f0869a0a337bc1fba", "length of marathon_preds.shape is not correct"
assert sha1(str(sorted(map(str, marathon_preds.shape))).encode("utf-8")+b"41fc11bd901353d0").hexdigest() == "5b69c997c2bb41cbae3e3c95ebaed7896f848852", "values of marathon_preds.shape are not correct"
assert sha1(str(marathon_preds.shape).encode("utf-8")+b"41fc11bd901353d0").hexdigest() == "8e8e0b29ade97e8b5fe6a19ecc2115c21b8fcc7b", "order of elements of marathon_preds.shape is not correct"

assert sha1(str(type("predictions" in marathon_preds.columns)).encode("utf-8")+b"79d592a199de251f").hexdigest() == "3b47e3fa20032bffb773c749276122e5fa7bba9a", "type of \"predictions\" in marathon_preds.columns is not bool. \"predictions\" in marathon_preds.columns should be a bool"
assert sha1(str("predictions" in marathon_preds.columns).encode("utf-8")+b"79d592a199de251f").hexdigest() == "a7e7f9a49a39a206b75d641e2ab1e17b59727f00", "boolean value of \"predictions\" in marathon_preds.columns is not correct"

assert sha1(str(type(sum(marathon_preds.predictions))).encode("utf-8")+b"b802ed5346f026f4").hexdigest() == "050239e0d65d4ef66d119daecea7d51a5a8100d4", "type of sum(marathon_preds.predictions) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(marathon_preds.predictions), 2)).encode("utf-8")+b"b802ed5346f026f4").hexdigest() == "21d146bd1a4e2b2c4dfcf7f880d83366add8e22f", "value of sum(marathon_preds.predictions) is not correct (rounded to 2 decimal places)"

assert sha1(str(type(sum(marathon_preds.time_hrs))).encode("utf-8")+b"a78d9ae02a4afe08").hexdigest() == "e7dc223c5fd034150139a5d9f4289b86f86685e4", "type of sum(marathon_preds.time_hrs) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(marathon_preds.time_hrs), 2)).encode("utf-8")+b"a78d9ae02a4afe08").hexdigest() == "4cb8159cbef674436f0086b5387fe25eeeaf2972", "value of sum(marathon_preds.time_hrs) is not correct (rounded to 2 decimal places)"

assert sha1(str(type(marathon_plot is None)).encode("utf-8")+b"6ab9e7f4177d91eb").hexdigest() == "309ed4f67afda1a09ac9e645764036223dcd7469", "type of marathon_plot is None is not bool. marathon_plot is None should be a bool"
assert sha1(str(marathon_plot is None).encode("utf-8")+b"6ab9e7f4177d91eb").hexdigest() == "2ba7ee570a8f70cb3810870afb3d35b3c3936c3d", "boolean value of marathon_plot is None is not correct"

assert sha1(str(type(len(marathon_plot.layer))).encode("utf-8")+b"3ef644504d9a991d").hexdigest() == "0b6720a99a20341b62be24f3f02d9bf657dc22c4", "type of len(marathon_plot.layer) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(len(marathon_plot.layer)).encode("utf-8")+b"3ef644504d9a991d").hexdigest() == "28e93f262ba55f6c1c268ae8d974eb22d0c7ffd6", "value of len(marathon_plot.layer) is not correct"

print('Success!')