## Introduction to Machine Learning  

## Assignment 5: Preprocessing Numerical Features, Pipelines and Hyperparameter Optimization

You can't learn technical subjects without hands-on practice. The assignments are an important part of the course. To submit this assignment you will need to make sure that you save your Jupyter notebook. 

Below are links to two videos that explain:

1. [How to save your Jupyter notebook](https://youtu.be/0aoLgBoAUSA) and,       
2. [How to answer a question in a Jupyter notebook assignment](https://youtu.be/7j0WKhI3W4s).

### Assignment Learning Goals:

By the end of the module, students are expected to:

- Identify when to implement feature transformations such as imputation and scaling.
- Apply `sklearn.pipeline.Pipeline` to build a machine learning pipeline.
- Use `sklearn` for applying numerical feature transformations on the data.
- Discuss the golden rule in the context of feature transformations.
- Carry out hyperparameter optimization using `sklearn`'s `GridSearchCV` and `RandomizedSearchCV`.
- Explain overfitting on the validation set.


This assignment covers [Module 5](https://ml-learn.mds.ubc.ca/en/module5) of the online course. You should complete this module before attempting this assignment.

Any place you see `...`, you must fill in the function, variable, or data to complete the code. Substitute the `None` with your completed code and answers then proceed to run the cell!

Note that some of the questions in this assignment will have hidden tests. This means that no feedback will be given as to the correctness of your solution. It will be left up to you to decide if your answer is sufficiently correct. These questions are worth 2 points.

In [None]:
# Import libraries needed for this lab
from hashlib import sha1

import altair as alt
import graphviz
import numpy as np
import pandas as pd

from sklearn import tree
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import (
    FunctionTransformer,
    Normalizer,
    OneHotEncoder,
    StandardScaler,
    normalize,
    scale)
from sklearn.svm import SVC

import test_assignment5 as t
#alt.renderers.enable('mimetype')
alt.data_transformers.disable_max_rows()

## 1. Introducing and Exploring the dataset <a name="1"></a>
<hr>


In this lab you will be working on a sample of [the adult census dataset](https://www.kaggle.com/uciml/adult-census-income#). We have made some modifications to this data so that it's easier to work with. 

This is a classification dataset and the classification task is to predict whether income exceeds 50K per year or not based on the census data. You can find more information on the dataset and features [here](http://archive.ics.uci.edu/ml/datasets/Adult).


*Note that many popular datasets have sex as a feature where the possible values are male and female. This representation reflects how the data were collected and is not meant to imply that, for example, gender is binary.*

In [None]:
census_df = pd.read_csv("data/income.csv")
census_df.head()

For this assignment, we will look at the numeric columns only. In the following assignment, we will work with both the categorical and numeric columns, after we've learned how to preprocess them in Module 6. 

In [None]:
census_df = census_df.drop(columns=['education', 'occupation', 'relationship', 'race', 'native.country'])
census_df.head()

**Question 1.1** <br> {points: 1}  

In order to avoid violation of the golden rule, the first step before we do anything is splitting the data. 

Split the data into `train_df` (80%) and `test_df` (20%). Keep the target column (`income`) in the splits so that we can use it in EDA. Make sure to set `random_state=123` for grading purposes. 
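
As a minimal sketch of how a reproducible 80/20 split behaves, the snippet below splits a small made-up dataframe. The names `toy_df`, `toy_train` and `toy_test`, the column names, and the seed `0` are hypothetical and are not part of the assignment.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data; the column names are illustrative only.
toy_df = pd.DataFrame({
    "hours": [20, 35, 40, 45, 50, 60, 38, 42, 55, 30],
    "label": ["low", "low", "low", "high", "high",
              "high", "low", "high", "high", "low"],
})

# test_size=0.2 holds out 20% of the rows.
# A fixed random_state makes the split reproducible across runs.
toy_train, toy_test = train_test_split(toy_df, test_size=0.2, random_state=0)

print(toy_train.shape, toy_test.shape)
```

Because the target column stays in both splits, it remains available for plotting during EDA.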


In [None]:
train_df, test_df = None, None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_1_1(train_df,test_df)

**Question 1.2** <br> {points: 1}  

Let's examine the column dtypes of `train_df`. 

Do you notice anything odd? Which column needs further investigation? 

*Answer in the cell below by placing the column label in `""` and assigning it to an object called `answer1_2`.*


In [None]:
answer1_2 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_1_2(answer1_2)

**Question 1.3** <br> {points: 1}  

Take a look at the unique possible values in this column using [`.unique()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.unique.html) or `.value_counts()`. 

Which value is not numerical? Save this value as a string by placing it between `""` and assigning it to an object called `answer1_3`.

In [None]:
answer1_3 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_1_3(answer1_3)

**Question 1.4** <br> {points: 1}  

Judging from the previous question, these values likely correspond to census questions that some people did not answer.

Usually, the `.describe()` or `.info()` methods give you information on missing values. Here, however, they won't flag these values as missing because they are encoded as strings rather than as actual `NaN` values.

Let's replace them with `np.nan` before we carry out EDA. Name the new versions of `train_df` and `test_df` with the replaced values `train_df_nan` and `test_df_nan` respectively. If you skip this step, you'll encounter an error later on when you try to pass this data to a classifier. 
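
To illustrate why a string placeholder goes unnoticed, here is a small sketch on made-up data. The placeholder `"unknown"` and the names `toy` and `toy_nan` are assumptions chosen for illustration; the placeholder in the census data may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column where missing entries were recorded as the string "unknown".
toy = pd.DataFrame({"hours": ["40", "unknown", "35", "unknown", "50"]})

# The placeholder is an ordinary string, so pandas does not count it as missing.
missing_before = int(toy["hours"].isna().sum())

# Replace the placeholder with a real missing value.
toy_nan = toy.replace("unknown", np.nan)
missing_after = int(toy_nan["hours"].isna().sum())

print(missing_before, missing_after)
```

Only after the replacement do methods such as `.isna()` and `.info()` report the gaps.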


In [None]:
# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_1_4(train_df_nan,test_df_nan)

**Question 1.5** <br> {points: 1}  

Now that we've replaced the string values with a numerical value (`NaN` is a float), convert this column to dtype `float64`. Save it under the same column name in the `train_df_nan` and `test_df_nan` dataframes.
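
A minimal sketch of the conversion on made-up data follows. The dataframe `toy` and its column `hours` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: numeric strings plus a real missing value.
toy = pd.DataFrame({"hours": ["40", np.nan, "35"]})

# Once only numeric strings and NaN remain, the column can be cast to float.
toy["hours"] = toy["hours"].astype("float64")

print(toy["hours"].dtype)
```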

In [None]:
# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_1_5(train_df_nan,test_df_nan)

**Question 1.6** <br> {points: 1}  

Use `.describe()` to show summary statistics of each feature in the `train_df_nan` dataframe. Save this in an object named `train_stats`. 

In [None]:
train_stats = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer


In [None]:
t.test_1_6(train_stats)

**Question 1.7** <br> {points: 2}  

What was the highest capital loss that someone reported? Save this in an object named `cap_loss_high`. 

What is the average number of years that people reported spending on their education? Save this in an object named `edu_avg_yrs`. 

In [None]:
cap_loss_high = None
edu_avg_yrs = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
# check that the variable exists
assert 'cap_loss_high' in globals(
), "Please make sure that your solution is named 'cap_loss_high'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

In [None]:
t.test_1_7_2(edu_avg_yrs)

**Question 1.8** <br> {points: 1} 

We have provided you with some visualizations of the feature values plotted as histograms. 

In [None]:
def plot_histogram(df,feature):
    """
    Plots a histogram of a feature, coloured by income.

    Parameters
    ----------
    df: pandas.DataFrame
        the dataframe containing the feature and the `income` column
    feature: str
        the feature name
    Returns
    -------
    altair.Chart
        an Altair histogram 
    """
    histogram = alt.Chart(df).mark_bar(
        opacity=0.7).encode(
        alt.X(str(feature) + str(':O'), bin=alt.Bin(maxbins=50)),
        alt.Y('count()', stack=None),
        alt.Color('income:N')).properties(
        title= str.title(feature))
    return histogram

feature_list = train_stats.columns[:3]
figure_dict = dict()
for feature in feature_list:
    train_df_nan = train_df_nan.sort_values('income')
    figure_dict.update({feature:plot_histogram(train_df_nan,feature)})
figure_panel = alt.vconcat(*figure_dict.values())
figure_panel

Given these features, which seems the most relevant for the given prediction task? 

*Save the column label as a string in an object named `answer1_8`.*
*For example, if you believe the answer is `age`, your answer would look like this:*  

`answer1_8 = 'age'`

In [None]:
answer1_8 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer


In [None]:
t.test_1_8(answer1_8)

**Question 1.9** <br> {points: 1} 

Let's now separate feature vectors from the targets.  Create `X_train`, `y_train`, `X_test`, `y_test` from `train_df_nan` and `test_df_nan`.


In [None]:
X_train = None 
y_train = None 
X_test = None 
y_test = None 

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_1_9(X_train,X_test,y_train,y_test)

**Question 1.10** <br> {points: 1} 

At this point, if you train [`sklearn`'s `SVC`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) model on `X_train` and `y_train` would it work? 

A) Yes, it would train but we may not get meaningful results without scaling. 

B) Yes, it would train and it may give results that are decent enough. 

C) No, it can't train since we have not scaled yet. 

D) No, it can't train since we have not imputed yet.

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer1_10`.*

In [None]:
answer1_10 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer1_10

In [None]:
t.test_1_10(answer1_10)

## 2. Preprocessing - Imputation and Scaling without Pipelines


**Question 2.1** <br> {points: 1}  

Before preprocessing our data, build a dummy classifier using `strategy="prior"`. Carry out 5-fold cross-validation on `X_train` and `y_train` using `cross_validate()`. Don't forget to include the training scores. 

Save the results in a dataframe named `dummy_scores`. 
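
As a rough sketch of what a baseline cross-validation result looks like, the snippet below runs a dummy classifier on made-up data. The names `X_toy`, `y_toy` and `toy_scores` and the class balance are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

# Hypothetical toy data: 20 rows, 12 of class "a" and 8 of class "b".
X_toy = pd.DataFrame({"x": np.arange(20)})
y_toy = pd.Series(["a"] * 12 + ["b"] * 8)

# strategy="prior" ignores the features and always predicts the most frequent class.
dummy = DummyClassifier(strategy="prior")

# cross_validate returns a dict of arrays with one entry per fold;
# wrapping it in a DataFrame gives one row per fold.
toy_scores = pd.DataFrame(
    cross_validate(dummy, X_toy, y_toy, cv=5, return_train_score=True)
)

print(toy_scores[["train_score", "test_score"]].mean())
```

The accuracy of such a baseline mostly reflects the class balance, which is why it is a useful reference point for the models that follow.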

In [None]:
dummy_scores = None 

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_2_1(dummy_scores)

**Question 2.2** <br> {points: 1}  

In this exercise you'll impute missing values **without** using `sklearn.pipeline.Pipeline`.

The goal here is two-fold. First, to understand what happens under the hood when you use `scikit-learn` `Pipelines`, and second, to convince yourself why it's a good idea to use pipelines.  

For numeric features, use [`scikit-learn`'s `SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) to impute `NaN` values with `strategy="median"`. Remember to apply the transformations on both the train and test splits.  

Save your `SimpleImputer()` in an object named `imp`. Next transform `X_train` and `X_test` using `imp` and save the results in objects named `X_imp_train` and `X_imp_test` respectively.
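
The fit-then-transform pattern can be sketched on a tiny made-up array. The names `toy_train`, `toy_test` and `toy_imp` are hypothetical and the values are chosen so that the median is easy to check by hand.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy splits with a single feature column.
toy_train = np.array([[1.0], [3.0], [np.nan], [100.0]])
toy_test = np.array([[np.nan], [5.0]])

toy_imp = SimpleImputer(strategy="median")

# The median is learned from the training split only (median of 1, 3, 100 is 3).
toy_imp.fit(toy_train)

# The same learned median fills the gaps in both splits.
toy_train_imp = toy_imp.transform(toy_train)
toy_test_imp = toy_imp.transform(toy_test)

print(toy_imp.statistics_, toy_test_imp.ravel())
```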


In [None]:
imp = None
X_imp_train = None
X_imp_test = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_2_2(imp,X_imp_train,X_imp_test)

**Question 2.3** <br> {points: 2}  

When using the `SimpleImputer` transformer on the numeric columns, is there any problem with calling `fit_transform` on the test split? Why or why not? 

A) It is not problematic. 

B) It is problematic because it will impute missing values with wrong values.

C) It is problematic because we should never call fit on test data.

D) It is problematic because it will throw an error in the code.

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer2_3`.*


In [None]:
answer2_3 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer2_3

In [None]:
# check that the variable exists
assert 'answer2_3' in globals(
), "Please make sure that your solution is named 'answer2_3'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question 2.4** <br> {points: 1}  

Carry out cross validation using `cross_validate` on the preprocessed `X_imp_train` and `y_train` using the [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html?highlight=kneighborsclassifier#sklearn.neighbors.KNeighborsClassifier) with default hyperparameters.
Save your results in a dataframe named `scores`. 

In [None]:
scores = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_2_4(scores)

**Question 2.5** <br> {points: 1}  

Are we violating the golden rule when we call `cross_validate` in Question 2.4? Why or why not? 

A) Yes, we are violating the golden rule since our test data is influencing our validation data. 

B) Yes, we are violating the golden rule because `cross_validate` is splitting the data after we already transformed it.

C) No, we are not violating the golden rule since our test data and training data are not influencing one another. 

D) No, we are not violating the golden rule since our validation data is not being influenced by the training phase.

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer2_5`.*


In [None]:
answer2_5 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer2_5

In [None]:
t.test_2_5(answer2_5)

## 3. Preprocessing - Imputation and Scaling with Pipelines

**Question 3.1** <br> {points: 1}  

In this question we are going to build a pipeline for multiple classifiers and append the results to a dictionary named `results_dict`.

In this question we've written most of the code for you. You will need to fill in the blanks so that the code executes and produces a dataframe containing the statistics for each model. 

Make sure that the pipeline includes the transformers `SimpleImputer()` and `StandardScaler()`. 
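
As a sketch of how a pipeline interacts with cross-validation, the snippet below runs a single model on randomly generated data. The names `X_toy`, `y_toy`, `toy_pipe` and `toy_scores` are hypothetical, and the data has nothing to do with the census dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features, with the label driven by the first one.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)
X_toy[::10, 1] = np.nan  # inject a few missing values into the second feature

toy_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("classifier", KNeighborsClassifier()),
])

# Within each fold, the imputer and scaler are fit on the training portion only
# and then applied to the validation portion.
toy_scores = cross_validate(toy_pipe, X_toy, y_toy, cv=5, return_train_score=True)

print(toy_scores["test_score"].mean().round(4))
```

Passing the whole pipeline to `cross_validate`, rather than pre-transformed data, is what keeps the validation folds out of the preprocessing step.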


In [None]:
results_dict = {'Dummy': {'mean_train_accuracy': round(dummy_scores["train_score"].mean(),4),
                          'mean_validation_accuracy': round(dummy_scores["test_score"].mean(),4),
                          'mean_fit_time (s)': round(dummy_scores["fit_time"].mean(),4),
                          'mean_score_time (s)': round(dummy_scores["score_time"].mean(),4)}}


models = {
    "Decision tree": DecisionTreeClassifier(),
    "kNN": KNeighborsClassifier(),
    "RBF SVM": SVC(),
}

for model_name, model in models.items():
    print(model_name, ":")
    
    pipe = Pipeline(steps=[("imputer", ...),
                           ("scaler", ...),
                           ("classifier", ...)])
    
    scores = ...(..., ..., ..., cv=5, return_train_score=True)
    
    results_dict[...] ={'mean_train_accuracy': scores["train_score"].mean().round(4),
                        'mean_validation_accuracy': scores["test_score"].mean().round(4),
                        'mean_fit_time (s)': scores["fit_time"].mean().round(4),
                        'mean_score_time (s)': scores["score_time"].mean().round(4)
                              }
results_df = ...(results_dict).T
results_df

In [None]:
# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_3_1(results_df)

**Question 3.2** <br> {points: 1}  

Which model produced the best score without hyperparameter tuning? 
Save your answer in an object named `highest_score`. 


In [None]:
highest_score = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
highest_score

In [None]:
t.test_3_2(highest_score)

**Question 3.3** <br> {points: 1}  

Which model appears to overfit the most? 
Save your answer in an object named `most_overfit`. 


In [None]:
most_overfit = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
most_overfit

In [None]:
t.test_3_3(most_overfit)

**Question 3.4** <br> {points: 2}  

Which model takes the most time to fit? 
Save your answer in an object named `longest_fit`. 


In [None]:
longest_fit = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
longest_fit

In [None]:
# check that the variable exists
assert 'longest_fit' in globals(
), "Please make sure that your solution is named 'longest_fit'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

## 4. Hyperparameter Optimization

Now that we have preprocessed features and explored different models, we are ready to find optimal hyperparameters. 

**Question 4.1** <br> {points: 0}  
Import `GridSearchCV` and `RandomizedSearchCV` from the appropriate library. 

In [None]:
# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_4_1()

**Question 4.2** <br> {points: 1}  

In this question you will tune the `n_neighbors` hyperparameter from the K-NN model. 

1. Create a pipeline with the steps `SimpleImputer(strategy="median")`, `StandardScaler()` and `KNeighborsClassifier()` in an object named `knn_pipe`. Name the classifier step `knn` so that it matches the key in `param_grid`.
1. Sweep over the hyperparameters in `param_grid` in [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)(`param_grid` is given in the code below) and use 5-fold cross-validation. Similar to `cross_validate` you can pass `return_train_score=True` to your `GridSearchCV` object. Save this in an object named `k_search`. 
1. Fit `k_search` on `X_train` and `y_train`.

*Hint: Setting `n_jobs=-1` should speed it up. This will take about 2 minutes to run.*
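
The steps above can be sketched on synthetic data as follows. The names `X_toy`, `y_toy`, `toy_pipe`, `toy_grid` and `toy_search` and the candidate values are hypothetical. The keys of the grid follow scikit-learn's `<step name>__<parameter>` convention, so the step name in the pipeline has to match the prefix in the grid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data, unrelated to the census dataset.
X_toy, y_toy = make_classification(n_samples=100, n_features=4, random_state=0)

# The step is named "knn", so its hyperparameters are addressed as "knn__<name>".
toy_pipe = Pipeline(steps=[("scaler", StandardScaler()),
                           ("knn", KNeighborsClassifier())])
toy_grid = {"knn__n_neighbors": [1, 5, 9]}

# Every candidate value is evaluated with 5-fold cross-validation.
toy_search = GridSearchCV(toy_pipe, toy_grid, cv=5, return_train_score=True)
toy_search.fit(X_toy, y_toy)

print(toy_search.best_params_, round(toy_search.best_score_, 3))
```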


In [None]:
param_grid = {"knn__n_neighbors": np.arange(1, 50, 10)}
knn_pipe = None
k_search = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_4_2(k_search)

**Question 4.3** <br> {points: 4}  

What is the best hyperparameter value for knn? Save it in an object named `best_k`. 

What was the corresponding validation score for it? Save this in an object named `best_k_score`. 

Does this model do better than the K-NN model without tuning of `n_neighbors`?   <br> 
Answer as `True` or `False` in an object named `better_model`.


<br> 

*Hint: `.best_params_`  and `.best_score_` are helpful here.* 

In [None]:
best_k = None 

best_k_score = None 

better_model = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_4_3_1(best_k)

In [None]:
t.test_4_3_2(best_k_score)

In [None]:
# check that the variable exists
assert 'better_model' in globals(
), "Please make sure that your solution is named 'better_model'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question 4.4** <br> {points: 1} 

Ok, let's step it up a notch and tune 2 hyperparameters at once and this time using [`RandomizedSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html?highlight=randomizedsearchcv). 

This time let's find the optimal `gamma` and `C` values for a SVC model. 

1. Create a pipeline with the steps `SimpleImputer(strategy="median")`, `StandardScaler()` and `SVC()` in an object named `svc_pipe`. Name the classifier step `svc` so that it matches the keys in `param_grid`.
1. Sweep over the hyperparameters in `param_grid` in [`RandomizedSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html?highlight=randomizedsearchcv) (again, `param_grid` is given in the code below), use 5-fold cross-validation and set `n_iter=5`. Similar to `cross_validate`, you can pass `return_train_score=True` in `RandomizedSearchCV()`. Make sure to set `random_state=77` in `RandomizedSearchCV` or you will not pass the autograder. Save this in an object named `svc_search`. 
1. Fit `svc_search` on `X_train` and `y_train`.

*Hint: Setting `n_jobs=-1` should speed it up but it may still take around 5 minutes to run. You may want to set `verbose=2` here.*
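
A small sketch of a randomized search on synthetic data is shown below. The names `X_toy`, `y_toy`, `toy_pipe`, `toy_dist` and `toy_search`, the candidate values, and the seed `0` are hypothetical. Unlike a grid search, only `n_iter` of the possible combinations are sampled and evaluated.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical synthetic data, unrelated to the census dataset.
X_toy, y_toy = make_classification(n_samples=100, n_features=4, random_state=0)

toy_pipe = Pipeline(steps=[("scaler", StandardScaler()), ("svc", SVC())])

# 3 x 3 = 9 possible combinations, of which only n_iter=4 are tried.
toy_dist = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]}

toy_search = RandomizedSearchCV(
    toy_pipe, toy_dist, n_iter=4, cv=3, random_state=0, return_train_score=True
)
toy_search.fit(X_toy, y_toy)

print(toy_search.best_params_, round(toy_search.best_score_, 3))
```

Fixing `random_state` makes the sampled combinations, and therefore the reported best parameters, reproducible.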

In [None]:
param_grid = {
    "svc__C": [0.01, 0.1, 1, 10, 100],
    "svc__gamma": [0.01, 0.1, 1, 10, 100]}

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_4_4(svc_search)

**Question 4.5** <br> {points: 2}  

What are the best hyperparameter values for the SVC model? Save them in an object named `best_svc`. 

What was the corresponding validation score for it? Save this in an object named `best_svc_score`. 

*Hint: `.best_params_`  and `.best_score_` are helpful here.* 

In [None]:
best_svc = None 

best_svc_score = None 

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_4_5_1(best_svc)

In [None]:
t.test_4_5_2(svc_search, best_svc_score)

**Question 4.6** <br> {points: 1}  

***True or False***


The `SVC` model with tuned (non-default) hyperparameters scores higher than the `SVC` model with default hyperparameters.


*Answer in the cell below by assigning `True` or `False` to an object called `answer4_6`.*


In [None]:
answer4_6 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_4_6(answer4_6)

## 5. Evaluating on the test set <a name="5"></a>
<hr>

Now that we have a best-performing model, it's time to assess it on the test set that we set aside. In this exercise you'll examine whether the results you obtained using cross-validation on the train set are consistent with the results on the test set. 


**Question 5.1** <br> {points: 1} 

What is the training score of the best scoring model? Save the result in an object named `train_score`. 

In [None]:
# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_5_1(train_score)

**Question 5.2** <br> {points: 1} 


What is the test score of the best model? 

Score the best model on `X_test` and `y_test`. 

Save the result in an object named `test_score`. 
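
As a sketch of comparing a cross-validation score with a held-out score, the snippet below uses synthetic data and a small grid search. All names (`X_tr`, `X_te`, `toy_search`, `cv_score`, `held_out_score`) are hypothetical. With the default `refit=True`, a fitted search object scores new data using the best pipeline refit on the whole training split.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data, unrelated to the census dataset.
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

toy_pipe = Pipeline(steps=[("scaler", StandardScaler()),
                           ("knn", KNeighborsClassifier())])
toy_search = GridSearchCV(toy_pipe, {"knn__n_neighbors": [1, 5, 9]}, cv=5)
toy_search.fit(X_tr, y_tr)

# Mean cross-validation score of the best candidate, computed on the training split.
cv_score = toy_search.best_score_

# Score of the refit best pipeline on data the search never saw.
held_out_score = toy_search.score(X_te, y_te)

print(round(cv_score, 3), round(held_out_score, 3))
```

A held-out score that is noticeably lower than the cross-validation score can be a sign of overfitting on the validation set.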


In [None]:
# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_5_2(test_score)

## Before Submitting 

Before submitting your assignment please do the following:

- Read through your solutions
- **Restart your kernel and clear output and rerun your cells from top to bottom** 
- Make sure that none of your code is broken 
- Verify that the tests from the questions you answered have obtained the output "Success"

This is a simple way to make sure that you are submitting all the variables needed to mark the assignment. This method should help avoid losing marks due to changes in your environment.  

## Attributions
- The adult census dataset - [Kaggle](https://www.kaggle.com/uciml/adult-census-income#)


- MDS DSCI 571 - Supervised Learning I - [MDS's GitHub website](https://github.com/UBC-MDS/DSCI_571_sup-learn-1) 
