## Introduction to Machine Learning  

## Assignment 4: Similarity-based Approaches to Supervised Learning

You can't learn technical subjects without hands-on practice. The assignments are an important part of the course. To submit this assignment you will need to make sure that you save your Jupyter notebook. 

Below are links to two videos that explain:

1. [How to save your Jupyter notebook](https://youtu.be/0aoLgBoAUSA) and,       
2. [How to answer a question in a Jupyter notebook assignment](https://youtu.be/7j0WKhI3W4s).

### Assignment Learning Goals:

By the end of the module, students are expected to:

- Explain the notion of similarity-based algorithms.
- Broadly describe how 𝑘-NNs use distances.
- Describe the effect of using a small/large value of the hyperparameter 𝑘 when using the 𝑘-NN algorithm.
- Explain the problem of the curse of dimensionality.
- Explain the general idea of SVMs with RBF kernel.
- Compare and contrast 𝑘-NNs and SVM RBFs.
- Broadly describe the relation of `gamma` and `C` hyperparameters with the fundamental tradeoff.

This assignment covers [Module 4](https://ml-learn.mds.ubc.ca/en/module4) of the online course. You should complete this module before attempting this assignment.

Any place you see `...`, you must fill in the function, variable, or data to complete the code. Substitute the `None` with your completed code and answers, then proceed to run the cell!

Note that some of the questions in this assignment will have hidden tests. This means that no feedback will be given as to the correctness of your solution. It will be left up to you to decide if your answer is sufficiently correct. These questions are worth 2 points.

In [None]:
# Import libraries needed for this lab
from hashlib import sha1

import altair as alt
# import graphviz
import numpy as np
import pandas as pd
#from altair_saver import save

from IPython.display import HTML
from sklearn import tree
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.neighbors import KNeighborsClassifier
import test_assignment4 as t
#alt.renderers.enable('png')
alt.data_transformers.disable_max_rows()

## 1. Splitting and Exploring Your Data

For the next few questions, we are going to concentrate on wine data obtained from [Kaggle](https://www.kaggle.com/numberswithkartik/red-white-wine-dataset) that records different measurements of wine. We will attempt to predict whether each example is of the red or white variety.  

The features in this dataset include: 

- `fixed_acidity`     
- `volatile_acidity`    
- `citric_acid`    
- `residual_sugar`    
- `chlorides`   
- `free_sulfur_dioxide`   
- `total_sulfur_dioxide`    
- `density`    
- `pH`     
- `sulphates`    
- `alcohol`      
- `quality`: (score between 0 and 10)       
- `style`       
     

In [None]:
wine_df = pd.read_csv('data/wine.csv')
wine_df.head()

In [None]:
wine_df.shape

**Question 1.1** <br> {points: 1}  

How many null values do we have in our dataset? Save your answer in an object named `null_vals` and, if necessary, use `dropna()` and save over the object named `wine_df`. 
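
As a minimal sketch of the idea, the made-up dataframe below (`toy_df` is a hypothetical name, not the wine data) shows one way to count missing values across a whole dataframe and to drop the affected rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration
toy_df = pd.DataFrame({
    "pH": [3.2, np.nan, 3.5, 3.1],
    "alcohol": [9.4, 10.1, np.nan, 11.0],
})

# isnull() flags each missing cell; summing twice totals them over all columns
toy_nulls = toy_df.isnull().sum().sum()

# dropna() removes every row that contains at least one missing value
toy_clean = toy_df.dropna()
```

Here two cells are missing, and dropping the rows that contain them leaves two complete rows.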

In [None]:
null_vals = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_1_1(null_vals)

**Question 1.2** <br> {points: 1}  

Split the data into 80% train and 20% test sets. Name your training split `train_df` and your test split `test_df`. We want to make sure that everyone has the same split so please specify an input argument of `random_state=2020`.
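
The sketch below illustrates an 80/20 split on a made-up ten-row dataframe. The names `toy_df`, `toy_train`, and `toy_test`, and the seed `0`, are assumptions chosen for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with ten rows
toy_df = pd.DataFrame({"x": range(10), "label": ["a", "b"] * 5})

# test_size=0.2 holds out 20% of the rows; random_state makes the
# shuffle reproducible (the seed 0 here is arbitrary)
toy_train, toy_test = train_test_split(toy_df, test_size=0.2, random_state=0)
```

With ten rows, this should leave eight rows for training and two for testing, with no row appearing in both.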

In [None]:
train_df, test_df = None, None 

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_1_2(train_df,test_df)

**Question 1.3** <br> {points: 2}  

How many dimensions does this dataset have? Save your answer in an object named `wine_dim`.

In [None]:
wine_dim = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
# check that the variable exists
assert 'wine_dim' in globals(
), "Please make sure that your solution is named 'wine_dim'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.


**Question 1.4** <br> {points: 1}  

Using the `train_df` data, look at the summary statistics produced by `.describe()` and save the results in an object named `wine_described`.

In [None]:
wine_described = None 
# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_1_4(wine_described)

**Question 1.5** <br> {points: 1}  

What is the average pH of the wine for each variant in `train_df`? Save the average red variant pH in an object named `avg_red_ph` and the average white variant pH in an object named `avg_white_ph`.
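
One possible way to compute a per-group average is a `groupby` followed by `mean()`. The sketch below uses made-up values in a hypothetical `toy_df`; it is one option among several (boolean filtering would also work).

```python
import pandas as pd

# Hypothetical toy data, not taken from the wine dataset
toy_df = pd.DataFrame({
    "style": ["red", "red", "white", "white"],
    "pH": [3.0, 3.4, 3.1, 3.3],
})

# Group rows by category, then average the numeric column within each group
toy_means = toy_df.groupby("style")["pH"].mean()

# The result is a Series indexed by group label
toy_red_mean = toy_means["red"]
toy_white_mean = toy_means["white"]
```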

In [None]:
avg_red_ph = None
avg_white_ph = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_1_5(avg_red_ph,avg_white_ph)

**Question 1.6** <br> {points: 1}  

What is the average alcohol content of each wine style in `train_df`? Save the average alcohol content of the red variant in an object named `avg_red_alc` and the average alcohol content of the white variant in an object named `avg_white_alc`.

In [None]:
avg_red_alc = None
avg_white_alc = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer


In [None]:
t.test_1_6(avg_red_alc,avg_white_alc)

**Question 1.7** <br> {points: 1}  

Plot a bar chart showing the quantity of each wine style in the `train_df` dataframe. Make sure to give it a title and name your plot `style_prop`. 

In [None]:
style_prop = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer


In [None]:
t.test_1_7(style_prop)

## 2. Finding the Nearest Neighbours

Let's explore the training set a bit more and calculate the distance between examples.

**Question 2.1** <br> {points: 1}  

Split up `train_df` and `test_df` into feature and target objects and name them respectively `X_train`, `y_train`, `X_test` and `y_test`. Remember that our target value is `style`.

In [None]:
X_train = None
y_train = None
X_test = None
y_test = None


# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_2_1(X_train,X_test,y_train,y_test)

**Question 2.2** <br> {points: 1}  

What are the distances between all the wines in the training set? Save this in an object named `wine_similarities`.

*Hint: Make sure you are importing the necessary library.*
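
As an illustration of pairwise distances, the sketch below assumes Euclidean distance and uses scikit-learn's `euclidean_distances` on a tiny made-up array (`toy_X` is a hypothetical name).

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# Three hypothetical examples with two features each
toy_X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Entry (i, j) is the Euclidean distance between row i and row j,
# so the result is a square, symmetric matrix with zeros on the diagonal
toy_dists = euclidean_distances(toy_X)
```

The distance between the first two rows is 5, since sqrt(3² + 4²) = 5.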

In [None]:
wine_similarities = None
# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_2_2(wine_similarities)

**Question 2.3** <br> {points: 1}  

Which wine is most similar to the wine at index 12? Save its index in an object named `sim_wine12`.

*Hint: You'll need to make sure you use `fill_diagonal()`.*
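
The reason for the hint can be sketched on a small, made-up distance matrix: every example is at distance 0 from itself, so the diagonal has to be masked before looking for the nearest *other* example. The names below are hypothetical.

```python
import numpy as np

# Hypothetical symmetric distance matrix for three examples
toy_dists = np.array([
    [0.0, 5.0, 2.0],
    [5.0, 0.0, 4.0],
    [2.0, 4.0, 0.0],
])

# Replace the zero self-distances with infinity (this modifies the array in place)
np.fill_diagonal(toy_dists, np.inf)

closest_to_1 = np.argmin(toy_dists[1])  # position of the nearest neighbour of example 1
closest_dist = toy_dists[1].min()       # distance to that neighbour
```

Without the `fill_diagonal` step, `argmin` would simply return the example itself.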

In [None]:
sim_wine12 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_2_3(sim_wine12)

**Question 2.4** <br> {points: 2}  

What is the distance between the wine at `sim_wine12` and index 12? 

Save this in an object named `distance_12`.

In [None]:
distance_12 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
# check that the variable exists
assert 'distance_12' in globals(
), "Please make sure that your solution is named 'distance_12'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question 2.5** <br> {points: 1}  

A new wine was just released into liquor stores with the following feature vector.

```
[ 8.3   ,  0.325 ,  0.36  ,  13.3    ,  0.101 , 23.    , 49.    ,
        0.9966,  3.56  ,  0.42  , 8.2    ,  7.    ]
```

Which wine from the training dataset should wine merchants say this new wine is most similar to? 


Save this in an object named `similar_new_wine`.
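
A minimal sketch of comparing one new example against an existing set, assuming Euclidean distance and made-up two-feature data (`toy_X` and `toy_query` are hypothetical names):

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# Hypothetical existing examples and one new query example
toy_X = np.array([[1.0, 1.0], [4.0, 5.0], [9.0, 9.0]])
toy_query = [[4.0, 4.0]]  # 2D list: one row with the same columns as toy_X

# Result has shape (n_examples, 1): the distance from each row to the query
toy_query_dists = euclidean_distances(toy_X, toy_query)

nearest_row = np.argmin(toy_query_dists)  # position of the closest example
nearest_dist = toy_query_dists.min()      # distance to the closest example
```

Note that `argmin` returns a position in the array, which is not necessarily the same as a dataframe's index label after a shuffled split.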

In [None]:
new_wine = [[8.3, 0.325, 0.36, 13.3, 0.101, 23.0, 49.0, 0.9966, 3.56, 0.42, 8.2, 7.0]]
similar_new_wine = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_2_5(similar_new_wine)

**Question 2.6** <br> {points: 1}  

How far away is the new wine from the most similar wine in the training dataset? 

Save this distance in an object named `new_distance`.

In [None]:
new_distance = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_2_6(new_distance)

## 3. KNN Classifiers with different hyperparameters 

**Question 3.1** <br> {points: 1}  

Build a `DummyClassifier` using `strategy = 'most_frequent'` and name it `dummy_model`.

Train it on `X_train` and `y_train`. Score it on the train **and** test sets.

Save the scores in objects named `dummy_train` and `dummy_test`.
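
To see what a `most_frequent` baseline does, the sketch below fits one on five made-up labels (the `toy_` names are hypothetical). Its accuracy is simply the share of the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

toy_X = np.zeros((5, 1))  # the dummy classifier ignores feature values
toy_y = np.array(["white", "white", "white", "red", "red"])

toy_dummy = DummyClassifier(strategy="most_frequent")
toy_dummy.fit(toy_X, toy_y)

# Every prediction is the majority label seen during fit
toy_preds = toy_dummy.predict(toy_X)

# score() returns accuracy: 3 of the 5 labels are "white"
toy_dummy_score = toy_dummy.score(toy_X, toy_y)
```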


In [None]:
dummy_model = None
dummy_train = None
dummy_test = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer


In [None]:
t.test_3_1(dummy_train,dummy_test,dummy_model)

**Question 3.2** <br> {points: 1} 

Build a `KNeighborsClassifier` named `knn1` with  `n_neighbors=1`. Cross-validate using `cv=10`. 
What is the mean training score and the mean validation score? Save each respectively in objects named `knn1_train_score` and `knn1_valid_score`.
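
The sketch below shows the general shape of a `cross_validate` call that also returns training scores. It assumes synthetic data generated with NumPy and uses 5 folds rather than 10 to keep it small; all `toy_` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: the label depends only on the sign of the first feature
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(60, 2))
toy_y = (toy_X[:, 0] > 0).astype(int)

toy_knn = KNeighborsClassifier(n_neighbors=1)

# cross_validate returns a dict of arrays with one entry per fold;
# wrapping it in a dataframe makes the per-fold scores easy to average
toy_scores = pd.DataFrame(
    cross_validate(toy_knn, toy_X, toy_y, cv=5, return_train_score=True)
)

toy_train_mean = toy_scores["train_score"].mean()
toy_valid_mean = toy_scores["test_score"].mean()
```

With `n_neighbors=1` and no duplicated points, each training example is its own nearest neighbour, so the training accuracy here is 1.0 while the validation accuracy is typically lower.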

In [None]:
knn1 = None
knn1_train_score = None 
knn1_valid_score = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

print(knn1_train_score)
print(knn1_valid_score)

In [None]:
t.test_3_2(knn1_train_score,knn1_valid_score,knn1)

**Question 3.3** <br> {points: 2} 

Which model has the best training accuracy? 

A) `DummyClassifier`. 

B) `KNeighborsClassifier(n_neighbors=1)`. 

C) Both A and B

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer3_3`.*


In [None]:
answer3_3 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer3_3

In [None]:
# check that the variable exists
assert 'answer3_3' in globals(
), "Please make sure that your solution is named 'answer3_3'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.


**Question 3.4** <br> {points: 1} 

Which model has the best cross-validation accuracy?

A) `DummyClassifier`. 

B) `KNeighborsClassifier(n_neighbors=1)`. 

C) Both A and B

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer3_4`.*


In [None]:
answer3_4 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer3_4

In [None]:
t.test_3_4(answer3_4)

**Question 3.5** <br> {points: 1}

Which model is probably overfitting?

A) `DummyClassifier`

B) `KNeighborsClassifier(n_neighbors=1)`

C) Both A and B

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer3_5`.*


In [None]:
answer3_5 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer3_5

In [None]:
t.test_3_5(answer3_5)

**Question 3.6** <br> {points: 1} 

***True or False*** 

For smaller values of $k$ you are expected to get higher training scores. 


*Answer in the cell below by assigning `True` or `False` to an object called `answer3_6`.*

In [None]:
answer3_6 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer3_6

In [None]:
t.test_3_6(answer3_6)

**Question 3.7** <br> {points: 1} 

If we increase the number of features, which of the following is true?

A) The training and validation scores will always increase.

B) The training and validation scores will always decrease.

C) The training and validation scores will decrease when we start adding irrelevant features.

D) The model only picks features that it deems important so nothing changes. 

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer3_7`.*

In [None]:
answer3_7 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer3_7

In [None]:
t.test_3_7(answer3_7)

**Question 3.8** <br> {points: 3} 

Now let's do some hyperparameter tuning and find the optimal value for `n_neighbors` in our `KNeighborsClassifier`. 

We want to find the best hyperparameter value to predict on our test set so let's build a loop as we have in the previous assignments where we record the training and cross-validation scores for each hyperparameter value.

Create a `for` loop that iterates over every second `n_neighbors` value starting at 2. We've started this for you with `range(2, 20, 2)`; note that `range` excludes its stop value, so this yields 2, 4, ..., 18.

Each iteration should:
1. Create a `KNeighborsClassifier` object with `n_neighbors` changing at each iteration.
2. Run 5-fold cross-validation with this value of `n_neighbors` using `cross_validate` to get the mean train and validation accuracies. Make sure to set `return_train_score=True` to get the training score in each fold. 
3. Append the `n_neighbors` value to the list in the key `n_neighbors` of the dictionary named `results_dict`.
4. Append the mean `train_score` of the cross-validation folds to the list in the `mean_train_score` dictionary key. 
5. Append the mean `test_score` of the cross-validation folds to the list in the `mean_cv_score` dictionary key. 

(Note that this may take a few minutes to execute)
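
The record-keeping pattern in the steps above can be sketched with a different model and hyperparameter. The example below assumes a `DecisionTreeClassifier` tuned over `max_depth` on synthetic data; the `toy_` names and the seed are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic data generated for illustration only
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(80, 3))
toy_y = (toy_X[:, 0] + toy_X[:, 1] > 0).astype(int)

toy_results = {"max_depth": [], "mean_train_score": [], "mean_cv_score": []}

for depth in range(1, 6, 2):  # yields 1, 3, 5 (the stop value 6 is excluded)
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_validate(model, toy_X, toy_y, cv=5, return_train_score=True)

    # Record the hyperparameter value and the fold-averaged scores
    toy_results["max_depth"].append(depth)
    toy_results["mean_train_score"].append(scores["train_score"].mean())
    toy_results["mean_cv_score"].append(scores["test_score"].mean())
```

Keeping the three lists the same length is what later allows the dictionary to be converted directly into a dataframe.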

In [None]:
results_dict = {
    "n_neighbors": [],
    "mean_train_score": [],
    "mean_cv_score": []}

results_dict

for k in range(2,20, 2):
    
# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

results_dict

In [None]:
t.test_3_8(results_dict)

**Question 3.9** <br> {points: 1} 

Convert the dictionary `results_dict` into a dataframe and use `pd.melt()` to melt the columns `mean_train_score` and `mean_cv_score` in the `results_df`.  Use `var_name='score_type'` and `value_name='accuracy'` and name the new dataframe `knn_plot_df`. 
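
As a small illustration of what melting does, the sketch below reshapes a made-up results table from wide to long format. The dataframe contents and the `toy_` names are hypothetical.

```python
import pandas as pd

# Hypothetical wide-format results: one column per score type
toy_results_df = pd.DataFrame({
    "depth": [1, 2],
    "mean_train_score": [0.80, 0.90],
    "mean_cv_score": [0.75, 0.78],
})

# id_vars stays as an identifier column; the value_vars columns are stacked
# into one label column (var_name) and one value column (value_name)
toy_long_df = pd.melt(
    toy_results_df,
    id_vars=["depth"],
    value_vars=["mean_train_score", "mean_cv_score"],
    var_name="score_type",
    value_name="accuracy",
)
```

The two rows with two score columns become four rows, one per combination of `depth` and `score_type`, which is the layout plotting libraries such as Altair expect for colouring lines by category.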

In [None]:
knn_plot_df = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_3_9(knn_plot_df)

**Question 3.10** <br> {points: 1} 

Using Altair, make a `mark_line()` plot which displays the `n_neighbors` of the KNN model on the *x*-axis and the accuracy on the train and validation sets on the *y*-axis. Don't forget to add `alt.Color('score_type')` to the `encode()` function after you specify `alt.X()` and `alt.Y()`. 

Make sure it has the dimensions `width=500, height=300`. Don't forget to give it a title and name the plot `knn_plot`.
To make things more legible, use `scale=alt.Scale(domain=[.92, 0.98])` in your `alt.Y()` function which will start the y-axis at 0.92 and end it at 0.98. 

In [None]:
knn_plot = None 

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_3_10(knn_plot)

**Question 3.11** <br> {points: 1} 

From your results, what `n_neighbors` would you pick in your final model? Save your answer in an object named `best_k`.

*Hint: [<code>.idxmax()</code>](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.idxmax.html) may come in handy.*
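
A minimal sketch of how `idxmax()` can be used to look up the hyperparameter value with the highest score, using made-up numbers in a hypothetical `toy_results_df`:

```python
import pandas as pd

# Hypothetical results table
toy_results_df = pd.DataFrame({
    "depth": [1, 3, 5],
    "mean_cv_score": [0.70, 0.85, 0.80],
})

# idxmax() returns the index label of the row with the highest value
best_row = toy_results_df["mean_cv_score"].idxmax()

# Use that label to look up the matching hyperparameter value
best_depth = toy_results_df.loc[best_row, "depth"]
```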

In [None]:
best_k = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
best_k

In [None]:
t.test_3_11(best_k)

**Question 3.12** <br> {points: 1} 

Build a K-Nearest Neighbour classifier named `best_model` with the best `n_neighbors` and fit it with `X_train` and `y_train`. Score your model on the test set and save your results in an object named `test_score`.

Is this doing better than your dummy classifier?

In [None]:
best_model = None
test_score = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_3_12(test_score)

## 4. Support Vector Machine Classifier 

Up until this point, we have been working with the K-Nearest Neighbour Classifier. Let's shake things up a bit and explore the second model that we learned in this module. In the questions so far, we've only explored one hyperparameter at a time. This time let's see how well we can optimize more than one hyperparameter simultaneously.

**Question 4.1** <br> {points: 0} 

Import SVC from the appropriate library. 

In [None]:
# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_4_1()

**Question 4.2** <br> {points: 1} 

Repeat the loop you made in **Question 3.8**, but this time tune the hyperparameter `gamma` and find the optimal value for `gamma` in an `SVC` model.  

We want to find the best gamma value to test our model.
Don't forget to set `random_state=2020` so we can confirm your answer. 

Create a `for` loop that iterates over `gamma` values from 0.1 to 100 (inclusive), increasing by a factor of 10 at each step. (We've started this for you.)
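
For reference, a list of values spaced by powers of 10 does not have to be typed by hand. The sketch below shows one way to generate the same four values with NumPy; `toy_gammas` is a hypothetical name.

```python
import numpy as np

# Exponents -1, 0, 1, 2 give 10**-1 = 0.1 up to 10**2 = 100
toy_gammas = 10.0 ** np.arange(-1, 3)
```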

To recap the instructions from before, each iteration should:
1. Create an `SVC` object with `gamma` changing at each iteration.
2. Run 5-fold cross-validation with this value of `gamma` using `cross_validate` to get the mean train and validation accuracies. Make sure to set `return_train_score=True` to get the training score in each fold. 
3. Append the `gamma` value to the list in the key `gamma`.
4. Append the mean `train_score` of the cross-validation folds to the list in the `mean_train_score` dictionary key. 
5. Append the mean `test_score` of the cross-validation folds to the list in the `mean_cv_score` dictionary key. 

(Note that this may take quite a few minutes to execute)

In [None]:
results_dict = {
    "gamma": [],
    "mean_train_score": [],
    "mean_cv_score": []}

results_dict

for g in [0.1, 1.0, 10.0, 100.0]:
    
# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

results_dict

In [None]:
t.test_4_2(results_dict)

**Question 4.3** <br> {points: 1} 

Which value of `gamma` would you select for your model? Save your result in an object named `best_gamma`.


In [None]:
best_gamma = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
best_gamma

In [None]:
t.test_4_3(best_gamma)

**Question 4.4** <br> {points: 1} 

Now repeat **Question 4.2**, this time iterating over the hyperparameter C. 

In [None]:
results_dict = {
    "C": [],
    "mean_train_score": [],
    "mean_cv_score": []}

results_dict

for c in [0.1, 1.0, 10.0, 100.0]:
    
# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

results_dict

In [None]:
t.test_4_4(results_dict)

**Question 4.5** <br> {points: 1} 

Which value of `C` would you select for your model now? Save your result in an object named `best_c`.

In [None]:
best_c = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
best_c

In [None]:
t.test_4_5(best_c)

**Question 4.6** <br> {points: 2} 

Do you think choosing the value of `gamma` from Question 4.3 and the value of `C` from Question 4.5 will produce the best-scoring model to run our test set on?  

A) No, we should have set `gamma` to `best_gamma` and iterated over the value of `C`. This would have produced the optimal model. 

B) No, we should be searching all values of `gamma` with all values of `C` since it's possible that hyperparameter values may score lower independently but could score higher when cross-validated together. 

C) Yes. Both `gamma` and `C` produced the highest cross-validation scores independently and therefore they must produce the highest cross-validation scores together.   

D) Yes. `gamma` and `C` are not correlated and therefore finding the best values separately will not change how the scores would be if we tested them together. 


*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer4_6`.*


In [None]:
answer4_6 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer4_6

In [None]:
# check that the variable exists
assert 'answer4_6' in globals(
), "Please make sure that your solution is named 'answer4_6'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question 4.7** <br> {points: 2} 

Write nested loops to search over `gamma` and `C` simultaneously. Use 5-fold cross-validation and append the training and validation (`test_score`) scores to the `param_scores` dictionary. 

*Note: This could take quite a few minutes to run.*
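
The idea of searching two hyperparameters together can be sketched on a much smaller problem. The example below assumes synthetic data, a reduced 2×2 grid, and 3-fold cross-validation so that it runs quickly; the `toy_` names are hypothetical and the scores it produces say nothing about the wine data.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Synthetic data generated for illustration only
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(60, 2))
toy_y = (toy_X[:, 0] * toy_X[:, 1] > 0).astype(int)

toy_grid = {"gamma": [0.1, 10.0], "C": [1.0, 100.0]}
toy_scores = {"gamma": [], "C": [], "train_accuracy": [], "valid_accuracy": []}

# The outer loop fixes gamma; the inner loop tries every C with that gamma,
# so every (gamma, C) combination is cross-validated exactly once
for g in toy_grid["gamma"]:
    for c in toy_grid["C"]:
        model = SVC(gamma=g, C=c)
        cv = cross_validate(model, toy_X, toy_y, cv=3, return_train_score=True)
        toy_scores["gamma"].append(g)
        toy_scores["C"].append(c)
        toy_scores["train_accuracy"].append(cv["train_score"].mean())
        toy_scores["valid_accuracy"].append(cv["test_score"].mean())
```

The number of fits grows as the product of the grid sizes, which is why an exhaustive search over many hyperparameters can become slow.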

In [None]:
hyperparameters = {
    "gamma": [0.1, 1.0, 10.0, 100.0],
    "C": [0.1, 1.0, 10.0, 100.0]
}
param_scores = {"gamma": [], "C": [], "train_accuracy": [], "valid_accuracy": []}

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_4_7(param_scores)

**Question 4.8** <br> {points: 1} 

Save `param_scores` as a dataframe named `param_scores_df` and sort by validation score in descending order. 
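
Sorting a dataframe in descending order can be sketched as follows, using made-up scores in a hypothetical `toy_scores_df`:

```python
import pandas as pd

# Hypothetical scores table
toy_scores_df = pd.DataFrame({
    "gamma": [0.1, 1.0, 10.0],
    "valid_accuracy": [0.81, 0.93, 0.88],
})

# ascending=False puts the highest score in the first row
toy_sorted_df = toy_scores_df.sort_values(by="valid_accuracy", ascending=False)

# After sorting, the best-scoring row is at position 0
top_gamma = toy_sorted_df.iloc[0]["gamma"]
```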


In [None]:
param_scores_df = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer


In [None]:
t.test_4_8(param_scores_df)

**Question 4.9** <br> {points: 1} 

Build a new model named `best_svc` using the values for `gamma` and `C` we obtained when tuning the hyperparameters simultaneously in **Question 4.8**. Set `random_state` to 2020 and fit it with `X_train` and `y_train`. Score your model on the test set and save your results in an object named `svc_test_score`. 

In [None]:
svc_test_score = None 

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_4_9(svc_test_score)

**Question 4.10** <br> {points: 1} 

Which model performs better?

A) `KNeighborsClassifier`

B) `SVC`

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer4_10`.*


In [None]:
answer4_10 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer4_10

In [None]:
t.test_4_10(answer4_10)

## Before Submitting 

Before submitting your assignment please do the following:

- Read through your solutions
- **Restart your kernel and clear output and rerun your cells from top to bottom** 
- Make sure that none of your code is broken 
- Verify that the tests from the questions you answered have obtained the output "Success"

This is a simple way to make sure that you are submitting all the variables needed to mark the assignment. This method should help avoid losing marks due to changes in your environment.  

## Attributions
- Wine dataset - [Kaggle](https://www.kaggle.com/numberswithkartik/red-white-wine-dataset)


- MDS DSCI 571 - Supervised Learning I - [MDS's GitHub website](https://github.com/UBC-MDS/DSCI_571_sup-learn-1) 
