# Tutorial 6: Classification

### Lecture and Tutorial Learning Goals:

After completing this week's lecture and tutorial work, you will be able to:

* Recognize situations where a simple classifier would be appropriate for making predictions.
* Explain the k-nearest neighbour classification algorithm.
* Interpret the output of a classifier.
* Compute, by hand, the distance between points when there are two explanatory variables/predictors.
* Describe what a training data set is and how it is used in classification.
* In a dataset with two explanatory variables/predictors, perform k-nearest neighbour classification in Python using `scikit-learn` to predict the class of a single new observation.

In [None]:
### Run this cell before continuing.
import random

import altair as alt
import pandas as pd
import numpy as np
import sklearn
from sklearn.compose import make_column_transformer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

alt.data_transformers.disable_max_rows()

**Question 0.1** Multiple Choice: 
<br> {points: 1}

Before applying k-nearest neighbour to a classification task, we need to scale the data. What is the purpose of this step?

A. To help speed up the k-nearest neighbour algorithm.

B. To convert all data observations to numeric values. 

C. To ensure all data observations will be on a comparable scale and contribute equal shares to the calculation of the distance between points.

D. None of the above. 

*Assign your answer to an object called `answer0_1`. Make sure your answer is an uppercase letter. Surround your answer with quotation marks (e.g. `"F"`).*

*Note: we typically **standardize** (i.e., scale **and** center) the data before doing classification. For the K-nearest neighbour algorithm specifically, centering has no effect. But it doesn't hurt, and can help with other predictive data analyses, so we will do it below.*
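To see why centering leaves the distances alone while scaling does not, here is a minimal sketch on hypothetical toy data (the values and the helper `pairwise_distances` are made up for illustration and are not part of the fruit dataset or of `scikit-learn`):

```python
import numpy as np

# Hypothetical toy data: two columns measured on very different scales.
X = np.array([[100.0, 1.0], [110.0, 3.0], [130.0, 2.0]])


def pairwise_distances(points):
    """Euclidean distance between every pair of rows."""
    diffs = points[:, None, :] - points[None, :, :]
    return np.sqrt((diffs**2).sum(axis=-1))


raw = pairwise_distances(X)
centered = pairwise_distances(X - X.mean(axis=0))
standardized = pairwise_distances((X - X.mean(axis=0)) / X.std(axis=0))

# Centering shifts every point by the same amount, so distances are unchanged.
centering_changes_distances = not np.allclose(raw, centered)
# Scaling stretches each axis differently, so distances do change.
scaling_changes_distances = not np.allclose(raw, standardized)

print(centering_changes_distances)  # False
print(scaling_changes_distances)  # True
```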

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer0_1)).encode("utf-8")+b"34068262027156ec").hexdigest() == "44943f5e51a014aa6504b3b72ed2c9baccb2fbda", "type of answer0_1 is not str. answer0_1 should be an str"
assert sha1(str(len(answer0_1)).encode("utf-8")+b"34068262027156ec").hexdigest() == "fe3030e20ef0febe6800a68d0c8fa01fafa66380", "length of answer0_1 is not correct"
assert sha1(str(answer0_1.lower()).encode("utf-8")+b"34068262027156ec").hexdigest() == "94487837d1db455c0f47b5a23ee90e5f4d71677f", "value of answer0_1 is not correct"
assert sha1(str(answer0_1).encode("utf-8")+b"34068262027156ec").hexdigest() == "c4a32d1c04dfcef472414e16517587f930548e47", "correct string value of answer0_1 but incorrect case of letters"

print('Success!')

## 1. Fruit Data Example 

In the agricultural industry, cleaning, sorting, grading, and packaging food products are all necessary tasks in the post-harvest process. Products are classified based on appearance, size and shape, attributes that help determine the quality of the food. Sorting can be done by humans, but it is tedious and time-consuming. Automatic sorting could help save time and money. Images of the food products are captured and analysed to determine visual characteristics.

The [dataset](https://www.kaggle.com/mjamilmoughal/k-nearest-neighbor-classifier-to-predict-fruits/notebook) contains observations of fruit described by four features: (1) mass (in g), (2) width (in cm), (3) height (in cm), and (4) color score (on a scale from 0 to 1).

**Question 1.0** 
<br> {points: 1}

Load the file, `fruit_data.csv`, into your notebook. 

*Assign your data to an object called `fruit_data`.*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(fruit_data is None)).encode("utf-8")+b"2a8a9e71ed7ffe8c").hexdigest() == "6e44d758cd3b802c7fe7684d53781f6a6fd5bd68", "type of fruit_data is None is not bool. fruit_data is None should be a bool"
assert sha1(str(fruit_data is None).encode("utf-8")+b"2a8a9e71ed7ffe8c").hexdigest() == "387385673bb688a0b40b5e0e4431066d370e6b14", "boolean value of fruit_data is None is not correct"

assert sha1(str(type(fruit_data.shape)).encode("utf-8")+b"799125153dec7f76").hexdigest() == "cfbfbdaf4a2d54504e061cc89f5f2c0fd38471e8", "type of fruit_data.shape is not tuple. fruit_data.shape should be a tuple"
assert sha1(str(len(fruit_data.shape)).encode("utf-8")+b"799125153dec7f76").hexdigest() == "eb299e3f2c8a5d3668336670118cc8321bd048be", "length of fruit_data.shape is not correct"
assert sha1(str(sorted(map(str, fruit_data.shape))).encode("utf-8")+b"799125153dec7f76").hexdigest() == "d589a5409f463279a08fa18ddc2c51c2b593ac69", "values of fruit_data.shape are not correct"
assert sha1(str(fruit_data.shape).encode("utf-8")+b"799125153dec7f76").hexdigest() == "5b0ca264ec7979083fba99d34a345a116ec02bbe", "order of elements of fruit_data.shape is not correct"

assert sha1(str(type(fruit_data.fruit_name.dtype)).encode("utf-8")+b"d18fb54b5bd018a7").hexdigest() == "31175137b4f0f0df22e3d231024f0d99db5fbbb5", "type of fruit_data.fruit_name.dtype is not correct"
assert sha1(str(fruit_data.fruit_name.dtype).encode("utf-8")+b"d18fb54b5bd018a7").hexdigest() == "e81501b439ed9a3d616906a99c46f7569a0fb25b", "value of fruit_data.fruit_name.dtype is not correct"

assert sha1(str(type(fruit_data.fruit_name.unique())).encode("utf-8")+b"9b3e670f44230890").hexdigest() == "e4f88d1536ca2426a7bd82a1045eb1fef8081f95", "type of fruit_data.fruit_name.unique() is not correct"
assert sha1(str(fruit_data.fruit_name.unique()).encode("utf-8")+b"9b3e670f44230890").hexdigest() == "cc8462cadce32d09c6473927f8f877aed3fcb038", "value of fruit_data.fruit_name.unique() is not correct"

assert sha1(str(type(fruit_data.mass.values)).encode("utf-8")+b"2b7e837d430d7e99").hexdigest() == "e15f0c5569365f3b8bb76c84b1e64be050bfb595", "type of fruit_data.mass.values is not correct"
assert sha1(str(fruit_data.mass.values).encode("utf-8")+b"2b7e837d430d7e99").hexdigest() == "67d5d9ba3f9b03fe9f4424afee5411ba73e7a185", "value of fruit_data.mass.values is not correct"

print('Success!')

Let's take a look at the first few observations in the fruit dataset. Run the cell below.

In [None]:
# Run this cell.
fruit_data.head()

**Question 1.0.1** Multiple Choice:
<br> {points: 1}

**Which of the columns should we treat as categorical variables?**

A. Fruit label, width, fruit subtype

B. Fruit name, color score, height

C. Fruit label, fruit subtype, fruit name

D. Color score, mass, width 

*Assign your answer to an object called `answer1_0_1`. Make sure your answer is an uppercase letter. Remember to surround your answer with quotation marks (e.g. `"E"`).*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer1_0_1)).encode("utf-8")+b"a28bdf508ac8651c").hexdigest() == "14f97293395f0e151882e506148f5be51c24102e", "type of answer1_0_1 is not str. answer1_0_1 should be an str"
assert sha1(str(len(answer1_0_1)).encode("utf-8")+b"a28bdf508ac8651c").hexdigest() == "3f4dbc8cb687d3d90f22c441dccfedc217db1d11", "length of answer1_0_1 is not correct"
assert sha1(str(answer1_0_1.lower()).encode("utf-8")+b"a28bdf508ac8651c").hexdigest() == "77265d09f1ed7f79e009b198918132275630e834", "value of answer1_0_1 is not correct"
assert sha1(str(answer1_0_1).encode("utf-8")+b"a28bdf508ac8651c").hexdigest() == "8074075a5c318965d8595754d656d049f4deb0f9", "correct string value of answer1_0_1 but incorrect case of letters"

print('Success!')

Run the cell below, and find the nearest neighbour based on mass and width to the first observation just by looking at the scatterplot (the first observation has been circled for you).

In [None]:
# Run this cell.
point1 = [192, 8.4]
point2 = [180, 8]
point44 = [194, 7.2]

fruit_chart = (
    alt.Chart(fruit_data)
    .mark_point(size=15)
    .encode(
        x=alt.X("mass", title="Mass (grams)"),
        y=alt.Y("width", title="Width (cm)", scale=alt.Scale(zero=False)),
        color=alt.Color("fruit_name", title="Name of the Fruit"),
    )
)

(
    fruit_chart
    + alt.Chart(pd.DataFrame([[192, 8.4]], columns=["x", "y"]))
    .mark_point(size=150)
    .encode(x="x", y="y", color=alt.value("black"))
    + alt.Chart(pd.DataFrame([[1, 183, 8.5]], columns=["text", "x", "y"]))
    .mark_text(size=15)
    .encode(x="x", y="y", text="text", color=alt.value("black"))
).configure_axis(labelFontSize=20, titleFontSize=20).configure_legend(
    titleFontSize=15, labelFontSize=15
).properties(
    width=400, height=300
)

**Question 1.1** Multiple Choice: 
<br> {points: 1}

Based on the graph generated, what is the `fruit_name` of the closest data point to the one circled?

A. apple

B. lemon

C. mandarin 

D. orange

*Assign your answer to an object called `answer1_1`. Make sure your answer is an uppercase letter. Surround your answer with quotation marks (e.g. `"F"`).*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer1_1)).encode("utf-8")+b"24ff6b9c810cd549").hexdigest() == "4bce7b6479b463e965cd8103e4f24e791c04a745", "type of answer1_1 is not str. answer1_1 should be an str"
assert sha1(str(len(answer1_1)).encode("utf-8")+b"24ff6b9c810cd549").hexdigest() == "14ee7a6198e45ef5c1b5cfeb6edbbc02cefb48fd", "length of answer1_1 is not correct"
assert sha1(str(answer1_1.lower()).encode("utf-8")+b"24ff6b9c810cd549").hexdigest() == "0d959946aa50aa9b7c079125f036fab978834d47", "value of answer1_1 is not correct"
assert sha1(str(answer1_1).encode("utf-8")+b"24ff6b9c810cd549").hexdigest() == "5929df07bb39f2e04521fff6a263db7119a028a7", "correct string value of answer1_1 but incorrect case of letters"

print('Success!')

**Question 1.2**
<br> {points: 1}

Using mass and width, calculate the distance between the first observation and the second observation with the `euclidean_distances` function. 

We provide some scaffolding to get you started.

*Assign your answer to an object called `fruit_dist_2`.*
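If you have not used `euclidean_distances` before, the sketch below shows the shape of what it returns. It uses a hypothetical toy data frame (the name `toy` and the columns `a` and `b` are made up), not the fruit data:

```python
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances

# Hypothetical toy data, not the fruit dataset.
toy = pd.DataFrame({"a": [0.0, 3.0, 6.0], "b": [0.0, 4.0, 8.0]})

# .loc slices by label and includes both endpoints, so 0:1 selects two rows.
dist_matrix = euclidean_distances(toy.loc[0:1, ["a", "b"]])

# Entry [i, j] is the distance between row i and row j of the input,
# so the diagonal is 0 and the matrix is symmetric.
print(dist_matrix.shape)  # (2, 2)
print(dist_matrix[0, 1])  # 5.0
```

With a single input, the function returns the full pairwise matrix rather than one number, which is why the distance you are after sits off the diagonal.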

In [None]:
# ___ = euclidean_distances(
#     fruit_data.loc[0:1, ["mass", ___]]
# )

# your code here
raise NotImplementedError
fruit_dist_2

In [None]:
from hashlib import sha1
assert str(type(fruit_dist_2)) == "<class 'numpy.ndarray'>", "type of fruit_dist_2 is not correct"
assert str(fruit_dist_2) == "[[ 0.         12.00666482]\n [12.00666482  0.        ]]", "value of fruit_dist_2 is not correct"

print('Success!')

**Question 1.3**
<br> {points: 1}

Calculate the distance between the first and the 44th observation in the fruit dataset using the mass and width variables. 

*Hint: remember that indexing in Python starts at 0, so the 44th observation in a pandas DataFrame has index 43.*

*Assign your answer to an object called `fruit_dist_44`.*

In [None]:
# your code here
raise NotImplementedError
fruit_dist_44

In [None]:
from hashlib import sha1
assert sha1(str(type(fruit_dist_44)).encode("utf-8")+b"90abe152c0403a8f").hexdigest() == "10f247cd5b83ef2e4c69a8b84b9e6d6c71e54874", "type of fruit_dist_44 is not correct"
assert sha1(str(fruit_dist_44).encode("utf-8")+b"90abe152c0403a8f").hexdigest() == "bae27216a5cc10019007f8a531b9adef50e33979", "value of fruit_dist_44 is not correct"

print('Success!')

Let's circle these three observations on the plot from earlier.


In [None]:
# Run this cell.
point1 = [192, 8.4]
point2 = [180, 8]
point44 = [194, 7.2]

(
    fruit_chart
    + alt.Chart(
        pd.DataFrame([point1, point2, point44], columns=["x", "y"])
    )
    .mark_point(size=150)
    .encode(x="x", y="y", color=alt.value("black"))
    + alt.Chart(
        pd.DataFrame(
            [[1, 183, 8.5], [2, 169, 8.1], [44, 204, 7.1]], columns=["text", "x", "y"]
        )
    )
    .mark_text(size=15)
    .encode(x="x", y="y", text="text", color=alt.value("black"))
).configure_axis(labelFontSize=20, titleFontSize=20).configure_legend(
    titleFontSize=15, labelFontSize=15
).properties(width=400, height=300)

What do you notice about the answers you just calculated in **Questions 1.2 and 1.3**? Are they what you would expect given the scatter plot above? Why or why not? Discuss with your neighbour. 

*Hint: Look at where the observations are on the scatterplot in the cell above. What might happen if we measured mass in kilograms instead of grams?*
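One way to explore the hint is to recompute the distances after changing the unit of mass. The sketch below reuses the (mass, width) pairs from the plotting cell above; the helper `nearest_to_first` is a made-up name for illustration:

```python
import math

# (mass in grams, width in cm) for observations 1, 2 and 44,
# copied from point1, point2 and point44 in the plotting cell.
obs = {1: (192, 8.4), 2: (180, 8.0), 44: (194, 7.2)}


def nearest_to_first(mass_factor):
    """Return which of observations 2 and 44 is closest to observation 1
    after multiplying every mass by mass_factor."""
    rescaled = {k: (m * mass_factor, w) for k, (m, w) in obs.items()}
    return min((2, 44), key=lambda k: math.dist(rescaled[1], rescaled[k]))


nearest_in_grams = nearest_to_first(1)
nearest_in_kilograms = nearest_to_first(1 / 1000)
print(nearest_in_grams, nearest_in_kilograms)  # 44 2
```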


**Question 1.4** Multiple Choice:
<br> {points: 1}

The distance between the first and second observations is 12.01, and the distance between the first and 44th observations is 2.33. By the formula, observations 1 and 44 are closer. On the scatterplot, however, the first observation appears closer to the second observation than to the 44th.
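For reference, both numbers follow from the Euclidean distance formula applied to the (mass, width) pairs used in the plotting cells, $(192, 8.4)$, $(180, 8)$ and $(194, 7.2)$:

$$d_{1,2} = \sqrt{(192 - 180)^2 + (8.4 - 8)^2} = \sqrt{144 + 0.16} \approx 12.01$$

$$d_{1,44} = \sqrt{(192 - 194)^2 + (8.4 - 7.2)^2} = \sqrt{4 + 1.44} \approx 2.33$$

Compare how much the mass term and the width term each contribute under the square root.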

Which of the following statements is correct?

A. A difference of 12 g in mass between observation 1 and 2 is large compared to a difference of 1.2 cm in width between observation 1 and 44. Consequently, mass will drive the classification results, and width will have less of an effect. 

B. If we measured mass in kilograms, then we’d get different nearest neighbours.

C. We should standardize the data so that all variables will be on a comparable scale. 

D. All of the above. 

*Assign your answer to an object called `answer1_4`. Make sure your answer is an uppercase letter. Surround your answer with quotation marks (e.g. `"F"`).*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer1_4)).encode("utf-8")+b"0fedc74c1685e024").hexdigest() == "83eefe32c50a6ebd1f0fcabc0c0a1831a0d14b53", "type of answer1_4 is not str. answer1_4 should be an str"
assert sha1(str(len(answer1_4)).encode("utf-8")+b"0fedc74c1685e024").hexdigest() == "334cf000a27b13ea522fb2385162330c9982fa47", "length of answer1_4 is not correct"
assert sha1(str(answer1_4.lower()).encode("utf-8")+b"0fedc74c1685e024").hexdigest() == "9af00b21c60156531fa7363917ee153170239747", "value of answer1_4 is not correct"
assert sha1(str(answer1_4).encode("utf-8")+b"0fedc74c1685e024").hexdigest() == "e0cfb5247b3283da1fdd6f8f36079add7fb05a1e", "correct string value of answer1_4 but incorrect case of letters"

print('Success!')

**Question 1.5**
<br> {points: 1}

Let's create a `preprocessor` to *standardize* (i.e., center and scale) the numeric variables in the fruit dataset. Centering will make sure that every standardized variable has an average of 0, and scaling will make sure that it has a standard deviation of 1. We will use the `StandardScaler` in the `preprocessor`, and then call `fit_transform` on the preprocessor so that we can examine the output.

Fit and transform your preprocessor with predictors `mass`, `width`, `height`, and `color_score`. For the other columns, use `passthrough` in the preprocessor.

Name the preprocessor `fruit_data_preprocessor`, and name the preprocessed data `fruit_data_scaled`.

*Note that we save the preprocessed data in a data frame so that we can use it in the upcoming exercises.*
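As a rough illustration of how `passthrough` and `StandardScaler` combine in one column transformer, here is a sketch on hypothetical toy data (the names `toy`, `toy_preprocessor`, `toy_scaled` and the columns are made up, not taken from the fruit dataset):

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data, not the fruit dataset.
toy = pd.DataFrame(
    {
        "label": ["a", "b", "a", "b"],
        "length": [1.0, 2.0, 3.0, 4.0],
        "weight": [10.0, 30.0, 20.0, 40.0],
    }
)

toy_preprocessor = make_column_transformer(
    ("passthrough", ["label"]),
    (StandardScaler(), ["length", "weight"]),
)

# The output columns follow the order of the transformers,
# so the column names have to be listed in that same order.
toy_scaled = pd.DataFrame(
    toy_preprocessor.fit_transform(toy),
    columns=["label", "length", "weight"],
)

# Text and numbers are returned together in one array, so we cast the
# numeric columns back to float before summarising them.
numeric = toy_scaled[["length", "weight"]].astype(float)
print(numeric.mean())
print(numeric.std(ddof=0))  # StandardScaler uses the population standard deviation
```

The means should come out at (approximately) 0 and the standard deviations at 1, which is a quick sanity check that the standardization worked.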

In [None]:
# ___ = ___(
#     (
#         "passthrough",
#         [
#             ___,
#             ___,
#             ___,
#         ],
#     ),
#     (StandardScaler(), [___, ___, ___, ___]),
# )
# ___ = pd.DataFrame(
#     fruit_data_preprocessor.___(___),
#     columns=[
#         "fruit_label",
#         "fruit_name",
#         "fruit_subtype",
#         "mass",
#         "width",
#         "height",
#         "color_score",
#     ],
# )

# your code here
raise NotImplementedError
fruit_data_scaled.head()

In [None]:
from hashlib import sha1
assert sha1(str(type(fruit_data_scaled is None)).encode("utf-8")+b"b4f906bc621dbb9f").hexdigest() == "6803af83a07e676f34ff6b4eb7fa16556fd1d7d5", "type of fruit_data_scaled is None is not bool. fruit_data_scaled is None should be a bool"
assert sha1(str(fruit_data_scaled is None).encode("utf-8")+b"b4f906bc621dbb9f").hexdigest() == "0c32569977cca685e7ee13a50ff489f39b257cc4", "boolean value of fruit_data_scaled is None is not correct"

assert sha1(str(type(fruit_data_scaled.shape)).encode("utf-8")+b"7b96d4f4233f6d92").hexdigest() == "1fa302bd5bf6baa5a313cd106004295b1aa50087", "type of fruit_data_scaled.shape is not tuple. fruit_data_scaled.shape should be a tuple"
assert sha1(str(len(fruit_data_scaled.shape)).encode("utf-8")+b"7b96d4f4233f6d92").hexdigest() == "cae99fa87adafe3b9436b0c9a0b55c4fa4f81261", "length of fruit_data_scaled.shape is not correct"
assert sha1(str(sorted(map(str, fruit_data_scaled.shape))).encode("utf-8")+b"7b96d4f4233f6d92").hexdigest() == "41060d9c8df68be76363b24e0315478229c85eab", "values of fruit_data_scaled.shape are not correct"
assert sha1(str(fruit_data_scaled.shape).encode("utf-8")+b"7b96d4f4233f6d92").hexdigest() == "7d9e88a86100ba3014926f6180c09c36405941ba", "order of elements of fruit_data_scaled.shape is not correct"

assert sha1(str(type(fruit_data_scaled.fruit_name.dtype)).encode("utf-8")+b"565319c677ac7fc9").hexdigest() == "0a527cfccffe72ad5a2bcb1225520f52d606ff81", "type of fruit_data_scaled.fruit_name.dtype is not correct"
assert sha1(str(fruit_data_scaled.fruit_name.dtype).encode("utf-8")+b"565319c677ac7fc9").hexdigest() == "f10186d63e72c6ab8e65d88dd4c02fc3c8227127", "value of fruit_data_scaled.fruit_name.dtype is not correct"

assert sha1(str(type(np.mean(fruit_data_scaled.mass.dropna()))).encode("utf-8")+b"e6fed4e75bf42c3c").hexdigest() == "498be77a221e441915d01c9912d1b9978d21e2e2", "type of np.mean(fruit_data_scaled.mass.dropna()) is not correct"
assert sha1(str(np.mean(fruit_data_scaled.mass.dropna())).encode("utf-8")+b"e6fed4e75bf42c3c").hexdigest() == "0ecc4d2b71e6450dc164914004055e37adf9a960", "value of np.mean(fruit_data_scaled.mass.dropna()) is not correct"

assert sha1(str(type(np.mean(fruit_data_scaled.height.dropna()))).encode("utf-8")+b"237b6e6b28da8983").hexdigest() == "1b48a1f91d88d6f5beabe3d52f94835b351c9304", "type of np.mean(fruit_data_scaled.height.dropna()) is not correct"
assert sha1(str(np.mean(fruit_data_scaled.height.dropna())).encode("utf-8")+b"237b6e6b28da8983").hexdigest() == "b2d38c90fea839dc9e83d7c50bba9771eb1f390d", "value of np.mean(fruit_data_scaled.height.dropna()) is not correct"

assert sha1(str(type(np.mean(fruit_data_scaled.width.dropna()))).encode("utf-8")+b"6be340475ce2c2cd").hexdigest() == "aae1953f8f1b9dea3016a016e128bd84c8513f61", "type of np.mean(fruit_data_scaled.width.dropna()) is not correct"
assert sha1(str(np.mean(fruit_data_scaled.width.dropna())).encode("utf-8")+b"6be340475ce2c2cd").hexdigest() == "753fd42690fb96eff48ef1a054bcc5cce0d79ad4", "value of np.mean(fruit_data_scaled.width.dropna()) is not correct"

assert sha1(str(type(np.mean(fruit_data_scaled.color_score.dropna()))).encode("utf-8")+b"89ca895d4f60ab47").hexdigest() == "b4178eb92261251dfeba26eec1b37e2e76834754", "type of np.mean(fruit_data_scaled.color_score.dropna()) is not correct"
assert sha1(str(np.mean(fruit_data_scaled.color_score.dropna())).encode("utf-8")+b"89ca895d4f60ab47").hexdigest() == "93d2d936e816890d1833509bdf6a6a16c43f0ee2", "value of np.mean(fruit_data_scaled.color_score.dropna()) is not correct"

assert sha1(str(type(np.std(fruit_data_scaled.mass.dropna()))).encode("utf-8")+b"e7636da8d3f380b0").hexdigest() == "296b258d65bdc85d7bff18d367b837a69800637f", "type of np.std(fruit_data_scaled.mass.dropna()) is not correct"
assert sha1(str(np.std(fruit_data_scaled.mass.dropna())).encode("utf-8")+b"e7636da8d3f380b0").hexdigest() == "753b9c8248d551da294b9332a4294d1b8aadd1e7", "value of np.std(fruit_data_scaled.mass.dropna()) is not correct"

assert sha1(str(type(np.std(fruit_data_scaled.height.dropna()))).encode("utf-8")+b"e7a73947724d524b").hexdigest() == "a11ee63be28358b25450bfdc590111a6aa7b4673", "type of np.std(fruit_data_scaled.height.dropna()) is not correct"
assert sha1(str(np.std(fruit_data_scaled.height.dropna())).encode("utf-8")+b"e7a73947724d524b").hexdigest() == "2fb75f2154c0b9bef0578810718b2adce64d67df", "value of np.std(fruit_data_scaled.height.dropna()) is not correct"

assert sha1(str(type(np.std(fruit_data_scaled.width.dropna()))).encode("utf-8")+b"163bbe5491a7e61e").hexdigest() == "3ac481e3eb1386fbca86d13e7c6121c84c8664b4", "type of np.std(fruit_data_scaled.width.dropna()) is not correct"
assert sha1(str(np.std(fruit_data_scaled.width.dropna())).encode("utf-8")+b"163bbe5491a7e61e").hexdigest() == "b9b5fde90b902de31cedd93399ba0d8f4865936d", "value of np.std(fruit_data_scaled.width.dropna()) is not correct"

assert sha1(str(type(np.std(fruit_data_scaled.color_score.dropna()))).encode("utf-8")+b"11174fd860489a00").hexdigest() == "6c72a3481c1b808c50205e8edf2771c5fce8ebaf", "type of np.std(fruit_data_scaled.color_score.dropna()) is not correct"
assert sha1(str(np.std(fruit_data_scaled.color_score.dropna())).encode("utf-8")+b"11174fd860489a00").hexdigest() == "3d6b05569778c8cfc7b5bca2490df63035480ca7", "value of np.std(fruit_data_scaled.color_score.dropna()) is not correct"

assert sha1(str(type(fruit_data_preprocessor is None)).encode("utf-8")+b"4c6f3f4e2d76f884").hexdigest() == "33fd727399078fc87b4ef45f96244b52520539f0", "type of fruit_data_preprocessor is None is not bool. fruit_data_preprocessor is None should be a bool"
assert sha1(str(fruit_data_preprocessor is None).encode("utf-8")+b"4c6f3f4e2d76f884").hexdigest() == "2b242597e2727e87e53993843c6335b1de610a32", "boolean value of fruit_data_preprocessor is None is not correct"

assert sha1(str(type(fruit_data_preprocessor.transformers_[1][2])).encode("utf-8")+b"709307c4cc002cda").hexdigest() == "2acad60ad2136c73ad8a1de426d5251faff14735", "type of fruit_data_preprocessor.transformers_[1][2] is not list. fruit_data_preprocessor.transformers_[1][2] should be a list"
assert sha1(str(len(fruit_data_preprocessor.transformers_[1][2])).encode("utf-8")+b"709307c4cc002cda").hexdigest() == "cdd01c7f117fde05b5134b0b9c370fad009d0bcf", "length of fruit_data_preprocessor.transformers_[1][2] is not correct"
assert sha1(str(sorted(map(str, fruit_data_preprocessor.transformers_[1][2]))).encode("utf-8")+b"709307c4cc002cda").hexdigest() == "e3ea1d89c29b01b807ff726c1b901e5de49e4424", "values of fruit_data_preprocessor.transformers_[1][2] are not correct"
assert sha1(str(fruit_data_preprocessor.transformers_[1][2]).encode("utf-8")+b"709307c4cc002cda").hexdigest() == "e025059d16eaeeb9a445dce7f89e8f272c7a394c", "order of elements of fruit_data_preprocessor.transformers_[1][2] is not correct"

print('Success!')

**Question 1.6**
<br> {points: 1}

Let's repeat **Questions 1.2 and 1.3** with the scaled variables:

- calculate the distance with the scaled mass and width variables between observations 1 and 2
- calculate the distance with the scaled mass and width variables between observations 1 and 44 

After you do this, think about how these distances compare to the distances you computed in **Questions 1.2 and 1.3** for the same points.

*Assign your answers to objects called `distance_2` and `distance_44` respectively.*

In [None]:
# your code here
raise NotImplementedError
print(distance_2)
print(distance_44)

In [None]:
from hashlib import sha1
assert sha1(str(type(distance_2 is None)).encode("utf-8")+b"54a66f9daa24f648").hexdigest() == "8024e941d55d0ccbf33c68995da54e9dfe8ed1f0", "type of distance_2 is None is not bool. distance_2 is None should be a bool"
assert sha1(str(distance_2 is None).encode("utf-8")+b"54a66f9daa24f648").hexdigest() == "782b4b089624c3eee1c331fdaa43e1391762b457", "boolean value of distance_2 is None is not correct"

assert sha1(str(type(distance_44 is None)).encode("utf-8")+b"3f25e30b526c6a9f").hexdigest() == "241a87c26618678f90983a9509abf122244c207b", "type of distance_44 is None is not bool. distance_44 is None should be a bool"
assert sha1(str(distance_44 is None).encode("utf-8")+b"3f25e30b526c6a9f").hexdigest() == "73dae5d709aee7dce5f3116628c1dbda488fd105", "boolean value of distance_44 is None is not correct"

assert sha1(str(type(distance_2)).encode("utf-8")+b"5c7450cfb9dcadb2").hexdigest() == "443f6d6d9dd9739f6f999e8d70aeacc8487e5b26", "type of type(distance_2) is not correct"

assert sha1(str(type(distance_44)).encode("utf-8")+b"35a36c5bf8e65ce2").hexdigest() == "8636737652fc0e13d726707931210970ce4533b5", "type of type(distance_44) is not correct"

assert sha1(str(type(distance_2)).encode("utf-8")+b"59f1f5dc923e051b").hexdigest() == "b0cd85c2fc468a42059ef0c7fd469f38a34549ea", "type of distance_2 is not correct"
assert sha1(str(distance_2).encode("utf-8")+b"59f1f5dc923e051b").hexdigest() == "5bf7b769868c4a3cd21737abec8f607786f91f35", "value of distance_2 is not correct"

assert sha1(str(type(distance_44)).encode("utf-8")+b"9c18c6bb0a648f71").hexdigest() == "e50038efaf359899707e4f659f9a0db2e723336e", "type of distance_44 is not correct"
assert sha1(str(distance_44).encode("utf-8")+b"9c18c6bb0a648f71").hexdigest() == "84a2ed7cbc82f9559467333d317cdb9edeb423ba", "value of distance_44 is not correct"

print('Success!')

**Question 1.7**
<br> {points: 1}

Make a scatterplot of scaled mass on the horizontal axis and scaled color score on the vertical axis. Color the points by fruit name. 

*Assign your plot to an object called `fruit_plot`. Remember what makes a visualization effective, for example giving the axes and the legend human-readable titles.*

In [None]:
# your code here
raise NotImplementedError
fruit_plot

In [None]:
from hashlib import sha1
assert sha1(str(type(fruit_plot is None)).encode("utf-8")+b"1dd98d8debc47460").hexdigest() == "cddd48a98980e79d11c29710c815543cfbf70e7e", "type of fruit_plot is None is not bool. fruit_plot is None should be a bool"
assert sha1(str(fruit_plot is None).encode("utf-8")+b"1dd98d8debc47460").hexdigest() == "1b0a77f3566d35f6c0a92e7eedf1523fa0d4c4c2", "boolean value of fruit_plot is None is not correct"

assert sha1(str(type(fruit_plot.encoding.x.field)).encode("utf-8")+b"86cd8c65b43a8d87").hexdigest() == "4e05fbee0ecdfe45a2f23ed7f3b6fb2febf82f61", "type of fruit_plot.encoding.x.field is not str. fruit_plot.encoding.x.field should be an str"
assert sha1(str(len(fruit_plot.encoding.x.field)).encode("utf-8")+b"86cd8c65b43a8d87").hexdigest() == "5ee49281566e42044a4bb5ed9b9de82ab651f9a8", "length of fruit_plot.encoding.x.field is not correct"
assert sha1(str(fruit_plot.encoding.x.field.lower()).encode("utf-8")+b"86cd8c65b43a8d87").hexdigest() == "3e125a24958496df3f0cb69db1cef0cb32bdb92d", "value of fruit_plot.encoding.x.field is not correct"
assert sha1(str(fruit_plot.encoding.x.field).encode("utf-8")+b"86cd8c65b43a8d87").hexdigest() == "3e125a24958496df3f0cb69db1cef0cb32bdb92d", "correct string value of fruit_plot.encoding.x.field but incorrect case of letters"

assert sha1(str(type(fruit_plot.encoding.y.field)).encode("utf-8")+b"861ef9ffc9c6a3d9").hexdigest() == "5453a9e98cfd2b8d809d4f7fa05b361a64216326", "type of fruit_plot.encoding.y.field is not str. fruit_plot.encoding.y.field should be an str"
assert sha1(str(len(fruit_plot.encoding.y.field)).encode("utf-8")+b"861ef9ffc9c6a3d9").hexdigest() == "c52a758adf02ad2cfa911c62a89bc461cd3dd821", "length of fruit_plot.encoding.y.field is not correct"
assert sha1(str(fruit_plot.encoding.y.field.lower()).encode("utf-8")+b"861ef9ffc9c6a3d9").hexdigest() == "fd13abb1ed04d5e668a89bf6c363c3e96ad53947", "value of fruit_plot.encoding.y.field is not correct"
assert sha1(str(fruit_plot.encoding.y.field).encode("utf-8")+b"861ef9ffc9c6a3d9").hexdigest() == "fd13abb1ed04d5e668a89bf6c363c3e96ad53947", "correct string value of fruit_plot.encoding.y.field but incorrect case of letters"

assert sha1(str(type(fruit_plot.encoding.color.field)).encode("utf-8")+b"24943191f1f25216").hexdigest() == "ed9f0515a45ce77369487b965d183413c99b79db", "type of fruit_plot.encoding.color.field is not str. fruit_plot.encoding.color.field should be an str"
assert sha1(str(len(fruit_plot.encoding.color.field)).encode("utf-8")+b"24943191f1f25216").hexdigest() == "08fb146f2ec9b3bead614a139e17beaf17f4e343", "length of fruit_plot.encoding.color.field is not correct"
assert sha1(str(fruit_plot.encoding.color.field.lower()).encode("utf-8")+b"24943191f1f25216").hexdigest() == "e0d0cdeffe650e447d586692fda09a9b6fa724bf", "value of fruit_plot.encoding.color.field is not correct"
assert sha1(str(fruit_plot.encoding.color.field).encode("utf-8")+b"24943191f1f25216").hexdigest() == "e0d0cdeffe650e447d586692fda09a9b6fa724bf", "correct string value of fruit_plot.encoding.color.field but incorrect case of letters"

assert sha1(str(type(fruit_plot.mark)).encode("utf-8")+b"b81888746bfb5992").hexdigest() == "aaa6945c6698ede85721af609703d6e66d6d0e49", "type of fruit_plot.mark is not str. fruit_plot.mark should be an str"
assert sha1(str(len(fruit_plot.mark)).encode("utf-8")+b"b81888746bfb5992").hexdigest() == "172f658e8481057d1b02e8573adb7d1377836b45", "length of fruit_plot.mark is not correct"
assert sha1(str(fruit_plot.mark.lower()).encode("utf-8")+b"b81888746bfb5992").hexdigest() == "da9639cb0cf66594297e1e21ed37574be61706c9", "value of fruit_plot.mark is not correct"
assert sha1(str(fruit_plot.mark).encode("utf-8")+b"b81888746bfb5992").hexdigest() == "da9639cb0cf66594297e1e21ed37574be61706c9", "correct string value of fruit_plot.mark but incorrect case of letters"

assert sha1(str(type(fruit_plot.encoding.x.title != fruit_plot.encoding.x.field)).encode("utf-8")+b"2d8e1666ef552c14").hexdigest() == "9e2b91128da08666a8a9cb51ebd69eb2d0f68e43", "type of fruit_plot.encoding.x.title != fruit_plot.encoding.x.field is not bool. fruit_plot.encoding.x.title != fruit_plot.encoding.x.field should be a bool"
assert sha1(str(fruit_plot.encoding.x.title != fruit_plot.encoding.x.field).encode("utf-8")+b"2d8e1666ef552c14").hexdigest() == "70de59a807df52753668a95cb9379f00bae57e29", "boolean value of fruit_plot.encoding.x.title != fruit_plot.encoding.x.field is not correct"

assert sha1(str(type(fruit_plot.encoding.y.title != fruit_plot.encoding.y.field)).encode("utf-8")+b"634a7f86e3e0899f").hexdigest() == "30fc44a3c281973254e0352d6d69c37f1e829870", "type of fruit_plot.encoding.y.title != fruit_plot.encoding.y.field is not bool. fruit_plot.encoding.y.title != fruit_plot.encoding.y.field should be a bool"
assert sha1(str(fruit_plot.encoding.y.title != fruit_plot.encoding.y.field).encode("utf-8")+b"634a7f86e3e0899f").hexdigest() == "3223851c1996149e4ba2a0cf6adeaea9a456f449", "boolean value of fruit_plot.encoding.y.title != fruit_plot.encoding.y.field is not correct"

assert sha1(str(type(fruit_plot.encoding.color.title != fruit_plot.encoding.color.field)).encode("utf-8")+b"d06b4a5be3be858e").hexdigest() == "c304b87ad0e66d590fe356fbd68d7ddaea7477ee", "type of fruit_plot.encoding.color.title != fruit_plot.encoding.color.field is not bool. fruit_plot.encoding.color.title != fruit_plot.encoding.color.field should be a bool"
assert sha1(str(fruit_plot.encoding.color.title != fruit_plot.encoding.color.field).encode("utf-8")+b"d06b4a5be3be858e").hexdigest() == "f0ab4bf722550d072301338ecfdceb6c567ea756", "boolean value of fruit_plot.encoding.color.title != fruit_plot.encoding.color.field is not correct"

print('Success!')

**Question 1.8** 
<br> {points: 3}

Suppose we have a new observation in the fruit dataset with scaled mass 0.5 and scaled color score 0.5.

Just by looking at the scatterplot, how would you classify this observation using K-nearest neighbours if you use K = 3? Explain how you arrived at your answer.
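As a reminder of what the algorithm does, here is a minimal from-scratch sketch of K-nearest neighbours with a majority vote. The points, class labels and the function `knn_predict` are hypothetical and unrelated to the fruit data:

```python
import math
from collections import Counter

# Hypothetical labelled points (x, y, class), not the fruit dataset.
training = [
    (0.0, 0.0, "red"),
    (0.2, 0.1, "red"),
    (1.0, 1.0, "blue"),
    (1.1, 0.9, "blue"),
    (0.9, 1.2, "blue"),
]


def knn_predict(new_point, data, k):
    """Classify new_point by majority vote among its k nearest labelled points."""
    # Sort the labelled points from closest to farthest.
    by_distance = sorted(data, key=lambda row: math.dist(new_point, row[:2]))
    # Count the class labels of the k closest points and return the most common.
    votes = Counter(label for _, _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


prediction = knn_predict((0.8, 0.8), training, k=3)
print(prediction)  # blue
```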

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 1.9**
<br> {points: 1}

Now, let's use the `scikit-learn` package to predict `fruit_name` for another new observation. The new observation we are interested in has mass 150g and color score 0.73.

First, create the K-nearest neighbour model specification. Specify we want $K=5$ neighbors and `weights = "distance"`. Name this model specification as `knn_spec`.

Then create a new preprocessor named `fruit_data_preprocessor_2` that centers and scales the predictors, but only uses `mass` and `color_score` as predictors. We can drop all other unused columns. Name the predictor as `X` and the target `y`.

Combine this preprocessor with the `knn_spec` model specification in a pipeline (using `make_pipeline`), and fit it to `X` and `y` from the `fruit_data` dataset.

*Name the fitted model `fruit_fit`.*
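The general pattern of preprocessor, model specification, and pipeline can be sketched on a small made-up data set. Everything named `toy_*`, along with the columns `length_cm`, `weight_g`, and `label`, is hypothetical and only illustrates the workflow; it uses different data and a different K than the question asks for.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; column names are illustrative only.
toy_data = pd.DataFrame(
    {
        "length_cm": [1.0, 1.2, 0.9, 3.0, 3.2, 2.9],
        "weight_g": [10.0, 12.0, 9.0, 40.0, 42.0, 39.0],
        "note": ["a", "b", "c", "d", "e", "f"],  # a column we do not use
        "label": ["small", "small", "small", "large", "large", "large"],
    }
)

# Model specification: K = 3 neighbours, closer neighbours get a larger vote.
toy_spec = KNeighborsClassifier(n_neighbors=3, weights="distance")

# StandardScaler centers and scales the listed columns.
toy_preprocessor = make_column_transformer(
    (StandardScaler(), ["length_cm", "weight_g"]),
)

# Predictors (unused columns dropped) and target.
toy_X = toy_data.drop(columns=["note", "label"])
toy_y = toy_data["label"]

# Chain preprocessing and model, then fit both in one call.
toy_fit = make_pipeline(toy_preprocessor, toy_spec).fit(toy_X, toy_y)
toy_fit
```

Putting the scaler inside the pipeline means any new observation passed to the fitted pipeline is scaled with the training means and standard deviations before distances are computed.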

In [None]:
# ___ = KNeighborsClassifier(n_neighbors=___, weights="distance")

# ____ = make_column_transformer(
#     (___, [___, ___]),
# )

# X = ____.drop(
#         columns=[___, ___, ___, ___, ___]
#     )
# y = ___[___]

# ___ = ___(___, ___).fit(___, ___)

# your code here
raise NotImplementedError
fruit_fit

In [None]:
from hashlib import sha1
assert sha1(str(type(knn_spec is None)).encode("utf-8")+b"6906e5d4e4fae71b").hexdigest() == "1b756ac503f597d75c9219c631d459f60803a90b", "type of knn_spec is None is not bool. knn_spec is None should be a bool"
assert sha1(str(knn_spec is None).encode("utf-8")+b"6906e5d4e4fae71b").hexdigest() == "2694e6f8b0a25845fd253c3f95649684a70a50cf", "boolean value of knn_spec is None is not correct"

assert sha1(str(type(knn_spec.n_neighbors)).encode("utf-8")+b"e52eb27f8e370e06").hexdigest() == "47096dd58509d35bec9216c65f5f810e896706c0", "type of knn_spec.n_neighbors is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(knn_spec.n_neighbors).encode("utf-8")+b"e52eb27f8e370e06").hexdigest() == "5dd4893f63ef817c4a2e84d33e42b6b1fdb18713", "value of knn_spec.n_neighbors is not correct"

assert sha1(str(type(knn_spec.effective_metric_)).encode("utf-8")+b"47fa7fd74e9800b0").hexdigest() == "9983e87d0dd30d7f11907786c535d70bbcb4b44b", "type of knn_spec.effective_metric_ is not str. knn_spec.effective_metric_ should be an str"
assert sha1(str(len(knn_spec.effective_metric_)).encode("utf-8")+b"47fa7fd74e9800b0").hexdigest() == "f1e43e1efb7ea4d33ddedf1522b8823d9beeaa8f", "length of knn_spec.effective_metric_ is not correct"
assert sha1(str(knn_spec.effective_metric_.lower()).encode("utf-8")+b"47fa7fd74e9800b0").hexdigest() == "bb41532dbb3b1a7b91fc8ed2f57bb1bc5c891314", "value of knn_spec.effective_metric_ is not correct"
assert sha1(str(knn_spec.effective_metric_).encode("utf-8")+b"47fa7fd74e9800b0").hexdigest() == "bb41532dbb3b1a7b91fc8ed2f57bb1bc5c891314", "correct string value of knn_spec.effective_metric_ but incorrect case of letters"

assert sha1(str(type(fruit_data_preprocessor_2 is None)).encode("utf-8")+b"0ff00fb22ff6d714").hexdigest() == "ed0957c01eeb3067ca4afbfdc433e5fc62ce5874", "type of fruit_data_preprocessor_2 is None is not bool. fruit_data_preprocessor_2 is None should be a bool"
assert sha1(str(fruit_data_preprocessor_2 is None).encode("utf-8")+b"0ff00fb22ff6d714").hexdigest() == "376ed6ef00fdef04cb3f18e8fef95475b052c555", "boolean value of fruit_data_preprocessor_2 is None is not correct"

assert sha1(str(type(fruit_data_preprocessor_2.transformers_[0][2])).encode("utf-8")+b"57c62087c7342159").hexdigest() == "170aa9186e089ece150d93198e72f95890e6874e", "type of fruit_data_preprocessor_2.transformers_[0][2] is not list. fruit_data_preprocessor_2.transformers_[0][2] should be a list"
assert sha1(str(len(fruit_data_preprocessor_2.transformers_[0][2])).encode("utf-8")+b"57c62087c7342159").hexdigest() == "920c3bbce04ccaf96fee66897f62181d8571979c", "length of fruit_data_preprocessor_2.transformers_[0][2] is not correct"
assert sha1(str(sorted(map(str, fruit_data_preprocessor_2.transformers_[0][2]))).encode("utf-8")+b"57c62087c7342159").hexdigest() == "f64d0fb532dcae4a28cb10de93cee27f2a528635", "values of fruit_data_preprocessor_2.transformers_[0][2] are not correct"
assert sha1(str(fruit_data_preprocessor_2.transformers_[0][2]).encode("utf-8")+b"57c62087c7342159").hexdigest() == "ce5271ba99a558d65e9db1825d207774f3934ec6", "order of elements of fruit_data_preprocessor_2.transformers_[0][2] is not correct"

assert sha1(str(type(fruit_fit is None)).encode("utf-8")+b"064fafa5ea367927").hexdigest() == "d8c1574430a6920c07c86d632fbc333125bb697c", "type of fruit_fit is None is not bool. fruit_fit is None should be a bool"
assert sha1(str(fruit_fit is None).encode("utf-8")+b"064fafa5ea367927").hexdigest() == "68eefca216cdc7ceed94e29240d2b2bd12c81c2a", "boolean value of fruit_fit is None is not correct"

assert sha1(str(type(type(fruit_fit))).encode("utf-8")+b"19f56929484b6406").hexdigest() == "1cdb5e46553df2c88cc7eaa01e5a8f2c5cedc07e", "type of type(fruit_fit) is not correct"
assert sha1(str(type(fruit_fit)).encode("utf-8")+b"19f56929484b6406").hexdigest() == "8d45a70c4abe742d2ffcdf0ed06861b2e7edc264", "value of type(fruit_fit) is not correct"

assert sha1(str(type(fruit_fit.named_steps.kneighborsclassifier.n_neighbors)).encode("utf-8")+b"38090ebe6f9fecee").hexdigest() == "9bd1013fdb7b38e63468c6d77dc89a9027ba48c4", "type of fruit_fit.named_steps.kneighborsclassifier.n_neighbors is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(fruit_fit.named_steps.kneighborsclassifier.n_neighbors).encode("utf-8")+b"38090ebe6f9fecee").hexdigest() == "f2f95ff756afdc154b6063371de907ea43738cbb", "value of fruit_fit.named_steps.kneighborsclassifier.n_neighbors is not correct"

assert sha1(str(type(fruit_fit.named_steps.kneighborsclassifier.effective_metric_)).encode("utf-8")+b"6b923cf85545fc8d").hexdigest() == "b65b9bb9eb78dd899272bdbc1930a28a89122f75", "type of fruit_fit.named_steps.kneighborsclassifier.effective_metric_ is not str. fruit_fit.named_steps.kneighborsclassifier.effective_metric_ should be an str"
assert sha1(str(len(fruit_fit.named_steps.kneighborsclassifier.effective_metric_)).encode("utf-8")+b"6b923cf85545fc8d").hexdigest() == "ad3501dc2902a7c5a4ec2a581eccf56e28ed9414", "length of fruit_fit.named_steps.kneighborsclassifier.effective_metric_ is not correct"
assert sha1(str(fruit_fit.named_steps.kneighborsclassifier.effective_metric_.lower()).encode("utf-8")+b"6b923cf85545fc8d").hexdigest() == "1d0abc45dfa458f313bebe7bf65bc5618c5e8b69", "value of fruit_fit.named_steps.kneighborsclassifier.effective_metric_ is not correct"
assert sha1(str(fruit_fit.named_steps.kneighborsclassifier.effective_metric_).encode("utf-8")+b"6b923cf85545fc8d").hexdigest() == "1d0abc45dfa458f313bebe7bf65bc5618c5e8b69", "correct string value of fruit_fit.named_steps.kneighborsclassifier.effective_metric_ but incorrect case of letters"

print('Success!')

**Question 1.10**
<br> {points: 1}

Create a new data frame with `mass = 150` and `color_score = 0.73` and call it `new_fruit`. Then, call the `predict` method of `fruit_fit` on `new_fruit` to predict the class for the new fruit observation. Save your prediction to an object named `fruit_predicted`.
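For reference, here is a small sketch of predicting the class of a single new observation with a fitted `scikit-learn` classifier. The training data, the column names `a` and `b`, and the objects `toy_model` and `new_obs` are all hypothetical.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data and model, for illustration only.
train_X = pd.DataFrame({"a": [0.0, 0.1, 1.0, 1.1], "b": [0.0, 0.2, 1.0, 0.9]})
train_y = pd.Series(["low", "low", "high", "high"])
toy_model = KNeighborsClassifier(n_neighbors=3).fit(train_X, train_y)

# One new observation: a one-row data frame whose column names
# match the columns the model was trained on.
new_obs = pd.DataFrame({"a": [0.9], "b": [1.0]})

# predict returns a NumPy array with one label per row of new_obs.
toy_prediction = toy_model.predict(new_obs)
print(toy_prediction)
```

Note that `predict` returns an array even when there is only one new observation.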

In [None]:
# your code here
raise NotImplementedError
fruit_predicted

In [None]:
from hashlib import sha1
assert sha1(str(type(new_fruit is None)).encode("utf-8")+b"d52920b205bca88b").hexdigest() == "54f11361e9178230e210c20dc7d524f96035043a", "type of new_fruit is None is not bool. new_fruit is None should be a bool"
assert sha1(str(new_fruit is None).encode("utf-8")+b"d52920b205bca88b").hexdigest() == "2de1b41a1833d1b838b4e42d0124f53bac944e4b", "boolean value of new_fruit is None is not correct"

assert sha1(str(type(new_fruit.shape)).encode("utf-8")+b"fd95e062bed958c1").hexdigest() == "847942909c8ce44eb38a0190de2596eac3789233", "type of new_fruit.shape is not tuple. new_fruit.shape should be a tuple"
assert sha1(str(len(new_fruit.shape)).encode("utf-8")+b"fd95e062bed958c1").hexdigest() == "d129074fb63198b89432e514b837acddd361ca5b", "length of new_fruit.shape is not correct"
assert sha1(str(sorted(map(str, new_fruit.shape))).encode("utf-8")+b"fd95e062bed958c1").hexdigest() == "299007d358e7f45d28eb44a4885c72113e4ca75c", "values of new_fruit.shape are not correct"
assert sha1(str(new_fruit.shape).encode("utf-8")+b"fd95e062bed958c1").hexdigest() == "dc0d5e892b8f9b3d26fa8f67b66797809c8a938c", "order of elements of new_fruit.shape is not correct"

assert sha1(str(type(new_fruit.mass.values)).encode("utf-8")+b"aa93cb7d9eac0d97").hexdigest() == "bec843bc47eb2598f2171f8dfd275b397e3f12a7", "type of new_fruit.mass.values is not correct"
assert sha1(str(new_fruit.mass.values).encode("utf-8")+b"aa93cb7d9eac0d97").hexdigest() == "fe999b45cfeb7ca415adf63460817a1ca598e236", "value of new_fruit.mass.values is not correct"

assert sha1(str(type(new_fruit.color_score.values)).encode("utf-8")+b"0b22f9745f640d63").hexdigest() == "4917c7dbaa2b37d049512f8307cc2a3ffca19e4d", "type of new_fruit.color_score.values is not correct"
assert sha1(str(new_fruit.color_score.values).encode("utf-8")+b"0b22f9745f640d63").hexdigest() == "e922ccbd06bf9ee5464c033d932ad1db8c3192c8", "value of new_fruit.color_score.values is not correct"

assert sha1(str(type(fruit_predicted)).encode("utf-8")+b"f01d2a75bd87f2bf").hexdigest() == "156a47d7df98ddc4f1b3dc943d1f5fe84f3f4018", "type of fruit_predicted is not correct"
assert sha1(str(fruit_predicted).encode("utf-8")+b"f01d2a75bd87f2bf").hexdigest() == "aa892a81199b0b18d5eae116ec6b34785fbaa53f", "value of fruit_predicted is not correct"

print('Success!')

**Question 1.11** 
<br> {points: 3}

Revisiting `fruit_plot` and considering the prediction given by K-nearest neighbours above, do you think the classification model did a "good" job predicting? Could you have done better? Given what we know thus far in the course, what might we want to do to help with tricky prediction cases such as this?

*You can use the code below to visualize the observation whose label we just tried to predict.*

In [None]:
fruit_plot + (
    alt.Chart(pd.DataFrame([[-0.3, -0.4]], columns=["x", "y"]))
    .mark_circle(size=50)
    .encode(x="x", y="y", color=alt.value("black"))
)

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 1.12**
<br> {points: 1}

Now do K-nearest neighbours classification again with the same data set, same K, and same new observation. However, this time, let's use **all the columns in the dataset as predictors (except for the categorical `fruit_label` and `fruit_subtype` variables).** This means you will need to make a new preprocessor.

We have provided the `new_fruit_all` dataframe below, which encodes the predictors for our new observation. Your job is to use K-nearest neighbours to predict the class of this point. You can reuse the model specification you created earlier. 

Store the new predictors in an object named `X_2` and the new target in an object named `y_2`.

*Assign your answer (the output of `predict`) to an object called `fruit_all_predicted`.*
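Before running this, it may help to see how adding predictors changes the distance calculation. The sketch below uses made-up standardized values (not the fruit data) to compare distances computed from two columns with distances computed from four columns.

```python
import numpy as np

# Hypothetical standardized values, for illustration only.
# Columns: two original predictors followed by two extra predictors.
train = np.array(
    [
        [0.1, 0.1, 2.0, 2.0],  # observation 0: close in the first two columns only
        [0.5, 0.5, 0.0, 0.1],  # observation 1: farther in the first two, close in the rest
    ]
)
new = np.array([0.0, 0.0, 0.0, 0.0])

# Euclidean distance using only the first two predictors.
dist_2d = np.sqrt(((train[:, :2] - new[:2]) ** 2).sum(axis=1))

# Euclidean distance using all four predictors.
dist_4d = np.sqrt(((train - new) ** 2).sum(axis=1))

print("closest with 2 predictors:", dist_2d.argmin())
print("closest with 4 predictors:", dist_4d.argmin())
```

Every predictor contributes a squared difference to the distance, so the set of nearest neighbours can differ depending on which columns are included.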

In [None]:
# This is the new observation to predict class label for
new_fruit_all = pd.DataFrame(
    [[150, 6, 10, 0.73]],
    columns=[
        "mass",
        "width",
        "height",
        "color_score",
    ],
)

# no hints this time!

# your code here
raise NotImplementedError
fruit_all_predicted

In [None]:
from hashlib import sha1
assert sha1(str(type(fruit_all_predicted)).encode("utf-8")+b"fef15cab4c5a0857").hexdigest() == "193a9545b9560954b1e9504bdb2d9a72c2e0ceb9", "type of fruit_all_predicted is not correct"
assert sha1(str(fruit_all_predicted).encode("utf-8")+b"fef15cab4c5a0857").hexdigest() == "a83d7bc9d26fca60c8432037707ddf33731d1e9e", "value of fruit_all_predicted is not correct"

print('Success!')

**Question 1.13** 
<br> {points: 3}

Did your second classification on the same data set with the same K change the prediction? If so, why do you think this happened?

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

## 2. Wheat Seed Dataset

X-ray images can be used to analyze and sort seeds. In [this data set](https://archive.ics.uci.edu/ml/datasets/seeds), we have 7 measurements from x-ray images from 3 varieties of wheat seeds (Kama, Rosa and Canadian). 

**Question 2.0**
<br> {points: 3}

Let's use `scikit-learn` to perform K-nearest neighbours to classify the wheat variety of seeds. The data set is available here: https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt. **Download the data set directly from this URL using the `pd.read_csv` function with `delimiter='\t'`**, which tells `pandas` that the columns are separated by tab characters.

The seven measurements listed below were taken for each wheat kernel:
1. area $A$
2. perimeter $P$
3. compactness $C = 4\pi A / P^2$
4. length of kernel
5. width of kernel
6. asymmetry coefficient
7. length of kernel groove
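To get a feel for the compactness formula, here is a short sketch that evaluates it for two simple shapes. The function name `compactness` and the example shapes are illustrative choices, not part of the data set.

```python
import math


def compactness(area, perimeter):
    """Compactness C = 4 * pi * A / P^2."""
    return 4 * math.pi * area / perimeter**2


# A circle of radius 2: A = pi * r^2 and P = 2 * pi * r.
radius = 2.0
circle_c = compactness(math.pi * radius**2, 2 * math.pi * radius)

# A unit square: A = 1 and P = 4.
square_c = compactness(1.0, 4.0)

print(circle_c, square_c)
```

A circle gives a compactness of 1 (up to floating-point rounding), while the square gives a smaller value, so the measure describes how close a shape is to being circular.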

The last column in the data set is the variety label. The mapping for the numbers to varieties is listed below:

- 1 == Kama
- 2 == Rosa
- 3 == Canadian
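If you would like to see variety names rather than numbers, one option is to translate the numeric codes with a dictionary. The series `toy_labels` below is a hypothetical example, not values taken from the data set.

```python
import pandas as pd

# Mapping from the numeric codes to the variety names listed above.
variety_names = {1: "Kama", 2: "Rosa", 3: "Canadian"}

# Hypothetical numeric labels, for illustration only.
toy_labels = pd.Series([1, 3, 2, 1])

# Series.map looks up each code in the dictionary.
toy_varieties = toy_labels.map(variety_names)
print(toy_varieties)
```

This step is optional; the classifier works equally well with the numeric labels.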

Use `scikit-learn` with this data to perform K-nearest neighbours to classify the wheat variety of a new seed, `new_seed`, whose measurements (from an x-ray image) are given in the code cell below. Specify that we want $K = 5$ neighbors to perform the classification.

*Assign your answer to an object called `seed_predict`.*

Hints: 
- `names` can be used to specify the column names of a data frame.
- There are some `NaN` values in the dataset. Use `dropna` to drop the rows that contain them before passing the data to the K-nearest neighbours model.
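The two hints can be tried out on a tiny in-memory example before applying them to the real file. The text in `raw_text` and the column names `first`, `second`, and `label` are made up; this sketch does not claim to reproduce the layout of the seed file.

```python
import io

import pandas as pd

# Hypothetical tab-separated text standing in for a downloaded file.
# The second row has an empty field, which pandas reads as NaN.
raw_text = "1.0\t2.0\t1\n3.0\t\t2\n5.0\t6.0\t3\n"

toy_table = pd.read_csv(
    io.StringIO(raw_text),
    delimiter="\t",
    names=["first", "second", "label"],  # the file has no header row
)

# Remove every row that contains at least one missing value.
toy_clean = toy_table.dropna()

print(toy_table)
print(toy_clean)
```

Passing a URL string to `pd.read_csv` in place of the `io.StringIO` object reads the file directly from the web.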

In [None]:
# This is the new observation to predict
new_seed = pd.DataFrame(
    [[12.1, 14.2, 0.9, 4.9, 2.8, 3.0, 5.1]],
    columns=[
        "area",
        "perimeter",
        "compactness",
        "length",
        "width",
        "asymmetry_coefficient",
        "groove_length",
    ],
)

# your code here
raise NotImplementedError
seed_predict

**Question 2.1** Multiple Choice:
<br> {points: 1}

What is the classification of the `new_seed` observation?

A. Kama

B. Rosa

C. Canadian

*Assign your answer to an object called `answer2_1`. Make sure your answer is in uppercase and is surrounded by quotation marks (e.g. `"F"`).*


In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer2_1)).encode("utf-8")+b"5913c72377d3ea12").hexdigest() == "1d1dc91ff6596e64cf8783f0d1e18bbd63b7e1b7", "type of answer2_1 is not str. answer2_1 should be an str"
assert sha1(str(len(answer2_1)).encode("utf-8")+b"5913c72377d3ea12").hexdigest() == "d0a71a616ae08269486f4109c34cc04cc7aaf59d", "length of answer2_1 is not correct"
assert sha1(str(answer2_1.lower()).encode("utf-8")+b"5913c72377d3ea12").hexdigest() == "bfca02df72ac2cd266c8ffcc00b3620069d51751", "value of answer2_1 is not correct"
assert sha1(str(answer2_1).encode("utf-8")+b"5913c72377d3ea12").hexdigest() == "5e784af2aa13cb05fdeb55491250cdcb1bc73690", "correct string value of answer2_1 but incorrect case of letters"

print('Success!')