# P6 


The project explores using classification models for various tasks. 

In [None]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.datasets import load_iris
from sklearn.datasets import make_classification


from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.model_selection import GridSearchCV
from sklearn import model_selection

from sklearn.feature_selection import SelectPercentile

from sklearn.preprocessing import MinMaxScaler, StandardScaler

from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn import svm

from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline

from sklearn import metrics

np.random.seed(5550)

from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE

import time
st_time = time.time()

In [None]:
try:
  import google.colab
  IN_COLAB = True
except ImportError:
  IN_COLAB = False

In [None]:
if IN_COLAB:
    print("Installing otter:")
    !pip install otter-grader==4.2.0 

Installing otter:

In [None]:
import otter
grader = otter.Notebook()

# Model Evaluation (Review)

*Materials copied or adapted from Applied Machine Learning in Python by Mueller*

Let's review the different model evaluation approaches, starting from the simplest, and understand their limitations. 




## 1 - Hold-out set 

Let's start with a simple approach to model evaluation: split the data into training and testing data.  

<img src="https://pages.mtu.edu/~lebrown/un5550-f21/p6/train_test_split_new.png" width="50%">


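This is a common approach, but it has limitations: the score depends on one particular random split, and if the test set is also used to choose hyper-parameters, the test score becomes an optimistic estimate of generalization performance.  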

*How to solve this problem?* Use an additional hold-out set. 
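
A minimal sketch of the hold-out approach on the iris data is shown here. The `_demo` names and the choice of five seeds are illustrative assumptions, not part of the original material. Repeating the split with different seeds gives a rough feel for how much a single hold-out estimate can depend on the split.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Separate names so the notebook's own X, y are not overwritten
X_demo, y_demo = load_iris(return_X_y=True)

holdout_scores = []
for seed in range(5):
    # a different random 75/25 train/test split for each seed
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.25, random_state=seed)
    knn_demo = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
    # accuracy on the held-out 25%
    holdout_scores.append(knn_demo.score(X_te, y_te))

print(holdout_scores)
```

If the printed scores differ noticeably from seed to seed, a single split is not giving a stable estimate.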

## 2 - Three-fold split or Train/Validation/Test set 

Using three separate sets is probably the most commonly used method for model selection and evaluation:    
* the training set for model building 
* the validation set for model selection 
* the test set for final model evaluation 

It is a **best practice** to follow (along with other techniques described below). 

<img src="https://pages.mtu.edu/~lebrown/un5550-f21/p6/train_test_validation_split.png" width="50%">

With this approach, we use the validation set to select the optimal hyper-parameter and the test set to estimate the performance (accuracy).  Because the test set was not used to select the best hyper-parameter, it provides an unbiased estimate of the generalization performance. 




### Example of three-fold split 

Let's see a simple example of using the three-fold split to select the number of neighbors in KNN on the iris data set.  We first take 25% as the test set, then take 25% of the remainder as the validation set (about 19% of the original data).  

We build a model for each value of `n_neighbors` (range from 1-15 in steps of 2), evaluate it on the validation set and store the result.  We find the value which gives the best performance. 

Then, we often rebuild the model on all the training data (train + validation) with the best-performing hyper-parameter (as determined by the validation set), and evaluate the model on the test set. 

The step of retraining the model using both the training and validation sets is optional; it can be skipped, in particular, if model training is very expensive or if the training set alone is already large enough for our model.  In this example, neither is the case, so we retrain. 

In [None]:

# Load the data 
X, y = load_iris(return_X_y = True)

# Split off the test set 
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.25, random_state=55)

# Split trainval into train + val 
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=5)

# create a list to hold the perf. results on validation set 
val_scores = [] 
# specify hyper-parameter values 
nbrs = np.arange(1,16,2)

for n in nbrs: 
    # build a model 
    knn = KNeighborsClassifier(n_neighbors=n)
    knn.fit(X_train, y_train)
    # calculate performance on validation set 
    val_scores.append(knn.score(X_val, y_val))

# Find the best score and best hyper-parameter 
print("best validation score:  %.3f" % np.max(val_scores))
best_nbrs = nbrs[np.argmax(val_scores)]
print("best n_neighbors: %d" % best_nbrs)

# Retrain model on train + validation set 
knn = KNeighborsClassifier(n_neighbors=best_nbrs)
knn.fit(X_trainval, y_trainval)
print("test-set score: %.3f" % knn.score(X_test, y_test))

best validation score:  1.000
best n_neighbors: 1
test-set score: 0.974


This approach improves on the hold-out method, but it still relies on the particular splits.  If we change the random splits, we might end up with different results.  In fact, if the outcome changes with the splits, it may mean the model is not very robust or there is not enough data.  *How can we make it more robust?*  Cross-validation.

## 3 - K-fold cross-validation 

The basic premise of cross-validation is to replace the split into training and validation data with multiple different splits.  Most commonly, cross validation is applied to the training/validation split, but it can also be applied to splitting off the test data. 

<img src="https://pages.mtu.edu/~lebrown/un5550-f21/p6/cross_validation_new.png" width="50%">

The most common variant of cross-validation is k-fold cross-validation; the image above illustrates 5-fold cross-validation.  

For each fold, a split of the data is made where this fold is the validation data and the rest is the training data.  For 5-fold cross-validation, we split the data into five parts and have five different training/validation splits.  We build a model for each split using the training part and use the validation part to evaluate it.  The outcome is five different performance values.  These can be aggregated (compute a mean or median) or used to estimate the variance over the splits.  

This approach is more robust than using a single split.  Every sample in the initial train/validation data is used for validation exactly once, whereas with a single split only some of the data appears in the validation set.  The main disadvantage of cross-validation is the computational cost.  

Another issue with k-fold cross-validation is that it doesn't produce one model; it produces k models.  If you want to make predictions on new data, which one do you use?  One obvious answer is to retrain on the whole train/validation set. 




We can do cross-validation by hand, i.e., using the `KFold` family of splitter classes (such as `StratifiedKFold`).  Alternatively, we can use the cross-validation functions: `cross_val_score` and `cross_validate`.

In [None]:
# from sklearn.model_selection import StratifiedKFold 

# Split off the test set 
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.25, random_state=12) 
 
kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=124)

scores = []

for tr_indx, val_indx in kf.split(X_trainval, y_trainval):
    X_train, X_val = X_trainval[tr_indx], X_trainval[val_indx]
    y_train, y_val = y_trainval[tr_indx], y_trainval[val_indx]

    knn = KNeighborsClassifier()
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_val)
    scores.append(metrics.accuracy_score(y_val, y_pred))

print("mean validation score:  %.3f" % np.mean(scores))

mean validation score:  0.973
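
The same kind of evaluation can be written with the helper functions instead of an explicit loop. The following is a minimal sketch, assuming the iris data and the default `KNeighborsClassifier`; the `_demo` names are illustrative. `cross_val_score` returns one score per fold, while `cross_validate` returns a dictionary that also holds timings and, if requested, training scores.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score, cross_validate
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# a fixed random_state makes the folds reproducible
cv_demo = StratifiedKFold(n_splits=10, shuffle=True, random_state=124)

# cross_val_score: an array with one accuracy value per fold
fold_scores = cross_val_score(KNeighborsClassifier(), X_demo, y_demo, cv=cv_demo)
print("mean: %.3f, std: %.3f" % (fold_scores.mean(), fold_scores.std()))

# cross_validate: a dict with fit/score times, test scores and training scores
cv_out = cross_validate(KNeighborsClassifier(), X_demo, y_demo,
                        cv=cv_demo, return_train_score=True)
print(sorted(cv_out.keys()))
```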


## 4 - Grid Search with Cross-validation in Validation Split

Let's now think about doing model selection, but using cross-validation rather than a single split.  The overall idea is illustrated below.  We still have the initial split into training and test data.  But rather than a single split into training and validation data, we run cross-validation for each parameter setting.  We record the mean score averaged over the splits in the cross-validation.  After evaluating all candidate parameters, we find the one with the best mean performance.  *Keep in mind this score does not correspond to a single model; there is no best model*.  We select the hyper-parameter that is best on average over the splits.  Then we build a new model, using the hyper-parameters that performed best on average in cross-validation, on the full training dataset (`X_trainval`).  Finally, we evaluate this model on the test data set.   


<img src="https://pages.mtu.edu/~lebrown/un5550-f21/p6/grid_search_cross_validation_new.png" width="60%">

<img src="https://pages.mtu.edu/~lebrown/un5550-f21/p6/cv.png" width="60%">

In [None]:
# Split off the test set 
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=55)

# create a list to hold the perf. results on validation sets 
cross_val_scores = [] 
# specify hyper-parameter values 
nbrs = np.arange(1,16,2)

for n in nbrs: 
    # build the model with hyper-parameters 
    knn = KNeighborsClassifier(n_neighbors=n)
    # Instead of fitting a single model, we perform cross-validation 
    scores = cross_val_score(knn, X_trainval, y_trainval, cv=10)
    # record the average over the 10 folds 
    cross_val_scores.append(np.mean(scores))

print(f"best cross-validation score: {np.max(cross_val_scores):.3f}")
best_nbrs = nbrs[np.argmax(cross_val_scores)]
print(f"best n_neighbors: {best_nbrs}")

knn = KNeighborsClassifier(n_neighbors=best_nbrs)
knn.fit(X_trainval, y_trainval)
print(f"test-set score: {knn.score(X_test, y_test):.3f}")

best cross-validation score: 0.975
best n_neighbors: 15
test-set score: 0.967


The approach in the code above, grid search with cross-validation plus a hold-out test set, is a gold-standard approach for model comparison and parameter tuning.  

**ASIDE: Cross-validation vs. Grid Search** 

Students often conflate the use of cross-validation with the use of grid search.  These are distinct and should not be used interchangeably.  Cross-validation is a technique to robustly evaluate a particular model on a particular data set.  Grid search is a technique to tune the hyper-parameters of a particular model by brute-force search.  Often each candidate is evaluated using cross-validation, but this is not necessary (you could use a single split into training + validation sets).  So while cross-validation is often used within a grid search, you can also do cross-validation outside of a grid search, and you can do a grid search without using cross-validation.

The overall approach is illustrated below.  Start by specifying the hyper-parameters to evaluate (generally this means selecting the models we are using as well).  Split the data into training and test sets.  For each hyper-parameter candidate, run cross-validation on the training set, yielding a score for each split and a mean score over all splits.  The mean validation scores are used to select the best hyper-parameter value, and a model is retrained with it on the whole training data.  Then we evaluate this final model on the test set. 
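
To make the distinction concrete, here is a hedged sketch of a grid search that does *not* use cross-validation. It assumes `ShuffleSplit(n_splits=1, ...)` as the splitter, so every candidate is scored on one fixed train/validation split; the `_demo` names and the seeds are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, ShuffleSplit, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)
X_tv, X_te, y_tv, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=55)

# n_splits=1 yields exactly one train/validation split: no cross-validation
single_split = ShuffleSplit(n_splits=1, test_size=0.25, random_state=5)

grid_demo = GridSearchCV(KNeighborsClassifier(),
                         {'n_neighbors': list(range(1, 16, 2))},
                         cv=single_split)
grid_demo.fit(X_tv, y_tv)

print(grid_demo.best_params_)
print("number of splits used:", grid_demo.n_splits_)
```

This is the three-fold split from section 2 expressed through the grid-search interface; swapping `single_split` for `cv=10` turns it back into grid search with cross-validation.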


<img src="https://pages.mtu.edu/~lebrown/un5550-f21/p6/gridsearch_workflow.png" width="60%">
Image from scikit-learn.

<br>

This pattern of evaluation is so common that `scikit-learn` provides a class, `GridSearchCV`, which does most of this for you. 




## 5 - GridSearchCV 

The `GridSearchCV` class is a meta-estimator: it takes any scikit-learn model and tunes the hyper-parameters for you using cross-validation.  The hyper-parameter grid is specified as a dictionary where the keys are the names of the parameters in the estimator and the values are the candidate values we want to evaluate for each hyper-parameter.  

In [None]:
# from sklearn.model_selection import GridSearchCV

# Split off the test set 
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=55)

# define the parameter grid 
param_grid = {'n_neighbors': np.arange(1, 16, 2)}

# Instantiate GridSearchCV - sets up the parameters on how to run 
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, 
                    return_train_score=True)
# Execute the search (and retrain the final model) 
grid.fit(X_trainval, y_trainval)

print(f"best mean cross-validation score: {grid.best_score_}")
print(f"best parameters: {grid.best_params_}")

# do a final evaluation on the test set 
print(f"test-set score: {grid.score(X_test, y_test):.3f}")

best mean cross-validation score: 0.975
best parameters: {'n_neighbors': 15}
test-set score: 0.967


Because grid search is a meta-estimator, after you instantiate it you can use it like any other scikit-learn model: call `fit`, then `predict` and `score`, which use the model refit with the best hyper-parameter setting. 

The test set is reserved for the final evaluation, so it is a good idea to inspect the search results before touching it.  If `best_score_` is lower than expected or needed for an application, do not use the test set yet.  Also check whether the `best_params_` value lies on the boundary of the search space; if it does, you may want to extend the range.  Finally, the model that was refit on the whole training + validation data (the model used when calling `predict` and `score`) is available as `best_estimator_`.  
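
A short sketch of inspecting a finished search without using the test set. It assumes the same iris setup as the cell above; `candidates`, `results`, and `on_boundary` are illustrative names. The per-candidate results are placed in a `pandas` DataFrame, and a simple check reports whether the best value sits at the edge of the grid.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)
X_tv, X_te, y_tv, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=55)

candidates = list(range(1, 16, 2))
grid_demo = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': candidates}, cv=10)
grid_demo.fit(X_tv, y_tv)

# one row per candidate, best-ranked first
results = pd.DataFrame(grid_demo.cv_results_)
cols = ['param_n_neighbors', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(results[cols].sort_values('rank_test_score'))

# is the selected value at the edge of the search space?
best_k = grid_demo.best_params_['n_neighbors']
on_boundary = best_k in (candidates[0], candidates[-1])
print("best value on grid boundary:", on_boundary)

# the model refit on all of X_tv with the best setting
print(grid_demo.best_estimator_)
```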



## 6 - Nested Cross-Validation 

As mentioned above, it is most common to use cross-validation in the training/validation part of a three-fold split (train/validation/test).  However, it can also be used for splitting off the test data, resulting in **nested cross-validation**.  Nested cross-validation is easy to implement, but it is not commonly used, for three reasons:

* computationally expensive, adds another loop 
* it doesn't result in a single model, so it's hard to productionize 
* it is harder to understand 

Here we can see an example. 

In [None]:
param_grid = {'n_neighbors':  np.arange(1, 15, 2)}
# instantiate grid search
grid = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, cv=10,
                   return_train_score=True)
# perform cross-validation on the grid-search estimator
# where each individual fit will internally perform cross-validation
res = cross_validate(grid, X, y, cv=5, return_train_score=True, return_estimator=True)

In [None]:
pd.DataFrame(res)

Unnamed: 0,fit_time,score_time,estimator,test_score,train_score
0,0.33296,0.001656,"GridSearchCV(cv=10, estimator=KNeighborsClassi...",0.966667,0.975
1,0.421561,0.001927,"GridSearchCV(cv=10, estimator=KNeighborsClassi...",1.0,0.966667
2,0.369522,0.00299,"GridSearchCV(cv=10, estimator=KNeighborsClassi...",0.933333,0.966667
3,0.309979,0.001272,"GridSearchCV(cv=10, estimator=KNeighborsClassi...",0.966667,0.983333
4,0.443975,0.002245,"GridSearchCV(cv=10, estimator=KNeighborsClassi...",1.0,0.966667


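Before, we had 8 hyper-parameter values (odd values from 1 to 15) and 10 cross-validation folds, plus a final refit, so 8 × 10 + 1 = 81 models were trained.  With a 5-fold outer cross-validation loop, the whole search is repeated five times; for the same 8-value grid that is 5 × 81 = 405 models, which adds time.  (The cell above uses `np.arange(1, 15, 2)`, which stops at 13 and so has 7 candidates, giving 5 × 71 = 355 fits.) 

Also, the outcome is five different scores, one per outer split.  These do not correspond to a single model, because the grid search may select different optimal parameters in each split: 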
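
The model counts follow from simple arithmetic, sketched below. `n_fits` is a hypothetical helper written for this illustration; it assumes the grid search refits once on the full data after the inner cross-validation.

```python
def n_fits(n_candidates, inner_folds, outer_folds=1):
    """Count models trained by a grid search with refit,
    optionally repeated inside an outer cross-validation loop."""
    per_search = n_candidates * inner_folds + 1  # +1 for the final refit
    return outer_folds * per_search

print(n_fits(8, 10))                  # 8 * 10 + 1 = 81
print(n_fits(8, 10, outer_folds=5))   # 5 * 81 = 405
```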

In [None]:
[x.best_params_ for x in res['estimator']]

[{'n_neighbors': 9},
 {'n_neighbors': 7},
 {'n_neighbors': 3},
 {'n_neighbors': 7},
 {'n_neighbors': 11}]

In this case, there is not a model that can be immediately used on new data.  

For these reasons, we will instead use the **best practice described in sections 4 and 5** above to select hyper-parameters and evaluate the models. 

# Data Leakage 

*Copied and adapted from MLmastery Data Leakage page.*

A very common error when using cross-validation is data leakage. Data leakage is where information about the holdout dataset, such as a test or validation dataset, is made available to the model in the training dataset.   This is not a direct type of data leakage, where we would train the model on the test dataset. Instead, it is an indirect type of data leakage, where some knowledge about the test dataset, captured in summary statistics is available to the model during training. This can make it a harder type of data leakage to spot, especially for beginners.

For example, consider a problem where we want to normalize the data, that is, scale the data to a range of 0-1.  Normalizing the input variables requires that we first calculate the minimum and maximum values for each variable and then use these values to scale the variables.  If the dataset is split into train and test datasets only afterwards, the examples in the training dataset know something about the data in the test dataset: they have been scaled by the global minimum and maximum values, so they know more about the global distribution of the variable than they should.

This type of data leakage exists with almost any data preparation task, e.g., standardization or even imputation of missing values.  

*How to solve this issue?*  Data preparation must be fit on the training data set only.  More generally, the entire modeling pipeline must be prepared only on the training dataset to avoid data leakage.  This might include data transforms, but also other techniques such as feature selection, dimensionality reduction, feature engineering and more. 
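
To see what "knowing something about the test data" means for min-max scaling, the sketch below compares the per-feature minimum and maximum learned from all rows with those learned from the training rows only. The synthetic data settings and the names `leaky_scaler` and `clean_scaler` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X_demo, y_demo = make_classification(n_samples=1000, n_features=20,
                                     n_informative=13, n_redundant=7,
                                     random_state=20)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=5)

leaky_scaler = MinMaxScaler().fit(X_demo)  # statistics include the test rows
clean_scaler = MinMaxScaler().fit(X_tr)    # statistics from training rows only

# features whose minimum or maximum was set by a test row
changed = ((leaky_scaler.data_min_ != clean_scaler.data_min_) |
           (leaky_scaler.data_max_ != clean_scaler.data_max_))
print("features whose range depends on test rows:", int(changed.sum()))

# with the clean scaler, test values are allowed to fall outside [0, 1]
X_te_sc = clean_scaler.transform(X_te)
print("scaled test range: %.3f to %.3f" % (X_te_sc.min(), X_te_sc.max()))
```

Any feature counted in `changed` is one where the leaky scaler passed information from the test rows into the training features.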

Let's see an example of this issue. 

## Example of Data Leakage on Hold-out set

We will start with some synthetic data for a binary classification problem. 

The naive approach for scaling the data is:    

1. Run scaling on the entire data set 
2. Split the data into train/test 
3. Train the model on train, evaluate on test 

In [None]:
# BAD - EXAMPLE OF DATA LEAKAGE

# from sklearn.datasets import make_classification
# from sklearn.preprocessing import MinMaxScaler
# from sklearn import metrics

# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=13, 
                           n_redundant=7, random_state=20)

# normalize the dataset 
scaler = MinMaxScaler()
Xsc = scaler.fit_transform(X)

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    Xsc, y, test_size=0.25, random_state=5)

# fit the model
model = KNeighborsClassifier()
model.fit(X_train, y_train)

# evaluate the model
yhat = model.predict(X_test)

# evaluate predictions
accuracy = metrics.accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (accuracy*100))


Accuracy: 94.800


Let's look at how we should do the data preparation to avoid data leakage:    

1. Split the data into train/test 
2. Run scaling, use train to set parameters, apply to both train and test 
3. Train the model on train, evaluate on test 

In [None]:
# GOOD - FIXED DATA LEAKAGE

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=5)

# define the scaler
scaler = MinMaxScaler()
# fit on the training dataset
scaler.fit(X_train)
# scale the training dataset
X_train = scaler.transform(X_train)
# scale the test dataset
X_test = scaler.transform(X_test)

# fit the model
model = KNeighborsClassifier()
model.fit(X_train, y_train)

# evaluate the model
yhat = model.predict(X_test)

# evaluate predictions
accuracy = metrics.accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (accuracy*100))

Accuracy: 94.400


The model with data leakage has slightly better performance than the one without. *Note, this may change across random splits*. 

## Example of Data Leakage in Cross Validation 

Naive data preparation with cross-validation involves applying the data transform first, then using the cross-validation procedure. 



In [None]:
# BAD - EXAMPLE OF DATA LEAKAGE

# from sklearn import model_selection

# normalize the dataset
scaler = MinMaxScaler()
Xsc = scaler.fit_transform(X)

# define the model
model = KNeighborsClassifier()

# define the evaluation procedure
cv = model_selection.RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# evaluate the model using cross-validation
scores = cross_val_score(model, Xsc, y, scoring='accuracy', cv=cv, n_jobs=-1)

# report performance
print('Accuracy: %.3f (%.3f)' % (np.mean(scores)*100, np.std(scores)*100))

Accuracy: 94.800 (2.344)


Let's look at the correct way to do data preparation with cross-validation. 
It requires that the data preparation method is prepared on the training set and applied to the train and test sets within the cross-validation procedure.




In [None]:
# GOOD - FIXED DATA LEAKAGE

# Set up how to perform k-fold 
kf = model_selection.RepeatedStratifiedKFold(n_splits = 10, 
                                             n_repeats=3, random_state=1)
scores = [] 

# Loop over splits
for tr_indx, te_indx in kf.split(X, y): 
    X_train, X_test = X[tr_indx], X[te_indx]
    y_train, y_test = y[tr_indx], y[te_indx]
    
    # normalize the dataset
    scaler = MinMaxScaler().fit(X_train)
    X_train_trans = scaler.transform(X_train)
    X_test_trans = scaler.transform(X_test)

    # define the model
    model = KNeighborsClassifier()
    model.fit(X_train_trans, y_train) 
    yhat = model.predict(X_test_trans)

    scores.append(metrics.accuracy_score(y_test, yhat))

print(scores)
print('Accuracy: %.3f (%.3f)' % (np.mean(scores)*100, np.std(scores)*100))

[0.96, 0.88, 0.99, 0.96, 0.94, 0.97, 0.94, 0.91, 0.96, 0.96, 0.93, 0.95, 0.94, 0.95, 0.93, 0.92, 0.98, 0.93, 0.98, 0.96, 0.95, 0.95, 0.94, 0.95, 0.93, 0.97, 0.98, 0.94, 0.91, 0.97]
Accuracy: 94.767 (2.376)


## Example of Data Leakage with GridSearchCV 

Let's look at how we get data leakage when using GridSearchCV as discussed above. 



In [None]:
# BAD - EXAMPLE OF DATA LEAKAGE

# Split off the test set 
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.25, random_state=5)

# Scale the data 
scaler = MinMaxScaler()
X_trainval_sc = scaler.fit_transform(X_trainval)
X_test_sc = scaler.transform(X_test)

# Instantiate the model 
knn = KNeighborsClassifier()

# params for Grid Search 
params = {'n_neighbors': [1, 3, 5, 7, 9, 11, 13, 15]}

# Use GridSearchCV 
grid = GridSearchCV(knn, params, cv=5, return_train_score=True)
grid.fit(X_trainval_sc, y_trainval) 

print(f"best mean cross-validation score: {grid.best_score_}")
print(f"best parameters: {grid.best_params_}")

best mean cross-validation score: 0.9333333333333333
best parameters: {'n_neighbors': 3}


What's the problem?  Scaling the test set with parameters learned from train+validation is fine.  The leakage is inside the cross-validation: the scaler was fit on all of `X_trainval` before `GridSearchCV` saw the data. 

`GridSearchCV` then splits the train+validation dataset into a train set and a validation set for each fold, so every validation fold has already influenced the scaling parameters.  See the image below from scikit-learn to illustrate this idea again. 

<img src="https://pages.mtu.edu/~lebrown/un5550-f21/p6/gridsearchSV.png" width="60%">
Image from scikit-learn. 

<br> 

Within the cross-validation, the validation set should be treated as a temporary unseen test set.  Therefore, the scaler should not be fit using this data. 

How do we solve data leakage in this case? 

Use **pipelines**. 



## Example of a Pipeline 

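[Pipelines](https://scikit-learn.org/stable/data_transforms.html) chain a sequence of dataset transformations (cleaning, preprocessing, dimensionality reduction, or creating feature representations) with a final estimator.  Calling `fit` on a pipeline fits every step on the data passed in, so each transformation is learned only from the training data. 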
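
As a small illustration of what a pipeline does, the sketch below checks that a two-step pipeline gives the same predictions as fitting the scaler and the classifier by hand on the training data. The synthetic data settings and the `_demo`, `pred_manual`, and `pred_pipe` names are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X_demo, y_demo = make_classification(n_samples=1000, n_features=20,
                                     n_informative=13, n_redundant=7,
                                     random_state=20)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=5)

# manual version: fit the scaler on train, transform train and test
scaler_demo = MinMaxScaler().fit(X_tr)
knn_manual = KNeighborsClassifier().fit(scaler_demo.transform(X_tr), y_tr)
pred_manual = knn_manual.predict(scaler_demo.transform(X_te))

# pipeline version: the same steps, handled by fit/predict
pipe_demo = make_pipeline(MinMaxScaler(), KNeighborsClassifier())
pipe_demo.fit(X_tr, y_tr)
pred_pipe = pipe_demo.predict(X_te)

print("identical predictions:", np.array_equal(pred_manual, pred_pipe))
```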



### Pipeline on Hold-out set 

Let's see an example of scaling the data using a pipeline.  From above, we have an example of data leakage. 

```python
# BAD - Example of Data Leakage

# normalize the dataset 
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=5)
# fit the model
model = KNeighborsClassifier().fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
accuracy = metrics.accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (accuracy*100))
```

Here is the corrected code that prevents data leakage, for comparison with the pipeline version. 

```python 
# GOOD - Example without Data Leakage 

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=5)
# define the scaler
scaler = MinMaxScaler()
# fit on the training dataset
scaler.fit(X_train)
# scale the training dataset
X_train = scaler.transform(X_train)
# scale the test dataset
X_test = scaler.transform(X_test)
# fit the model
model = KNeighborsClassifier().fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
accuracy = metrics.accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (accuracy*100))
```

Now below is the code implemented as a pipeline.

In [None]:
# Pipeline to avoid data leakage

# from sklearn.pipeline import make_pipeline

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=5)

# Setup the pipeline 
pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier())

# Execute the pipeline with the data
pipe.fit(X_train, y_train)

# evaluate the model
res = pipe.score(X_test, y_test)

# evaluate predictions
print('Accuracy: %.3f' % (res*100))

Accuracy: 94.400


### Pipeline with Cross-validation 

The examples below show cross-validation with data leakage, written both as an explicit for-loop and in condensed form, and then how to eliminate the leakage, again both with a for-loop and with a pipeline. 

You will want to use the second version ("GOOD") in the future. 

#### Preprocessing before cross-validation *BAD - DO NOT USE*


```python
# BAD!
from sklearn.model_selection import KFold

scaler = MinMaxScaler()
X_sc = scaler.fit_transform(X)

scores = []
for train, test in KFold().split(X_sc, y):
    knn = KNeighborsClassifier().fit(X_sc[train], y[train])
    score = knn.score(X_sc[test], y[test])
    scores.append(score)
```

This is equivalent (apart from `cross_val_score` using stratified folds for classifiers) to the following condensed code: 

```python
scaler = MinMaxScaler()
X_sc = scaler.fit_transform(X)
scores = cross_val_score(KNeighborsClassifier(), X_sc, y)
```

#### Preprocessing within cross validation  *GOOD - USE as EXAMPLE*


```python
# GOOD!
scores = []
scaler = MinMaxScaler()
for train, test in KFold().split(X, y):
    scaler.fit(X[train], y[train])
    X_sc_train = scaler.transform(X[train])
    knn = KNeighborsClassifier().fit(X_sc_train, y[train])
    X_sc_test = scaler.transform(X[test])
    score = knn.score(X_sc_test, y[test])
    scores.append(score)
```

This is equivalent (again apart from the stratified folds used by `cross_val_score`) to: 

```python
pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier())
scores = cross_val_score(pipe, X, y)
```



In [None]:
pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier())
scores = cross_val_score(pipe, X, y)

print("Mean Acc: %.3f" % (np.mean(scores)*100))

Mean Acc: 94.700


### Pipeline with GridSearchCV 

If you recall, `GridSearchCV` is passed an estimator and a dictionary of parameter values for tuning hyper-parameters.  We can pass a `Pipeline` as the estimator, but we need to adjust the process above to ensure the parameter tuning is applied to the correct step of the pipeline.  This is done by specifying the hyper-parameters within a pipeline, by using the name of the step of the pipeline, followed by the double underscore ('dunder'), followed by the name of the hyper-parameter. 

So, when we create a pipeline and we want to tune the `n_neighbors` parameter of KNN, we need to use  `kneighborsclassifier__n_neighbors` as the hyper-parameter name. 
Below is the example code. 

In [None]:
# GOOD - EXAMPLE without Data Leakage using pipeline and GridSearchCV

# Split off the test set 
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.25, random_state=5)

# create the pipeline 
knn_pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier())

# create the parameter grid 
# Pipeline hyper-parameters are specified as <step name>__<hyper-parameter name>
params = {'kneighborsclassifier__n_neighbors': 
          [1, 3, 5, 7, 9, 11, 13, 15]}

# Setup cross-validation for repeatability 
cvStrat = StratifiedKFold(n_splits=10, random_state=5, shuffle=True)

# Instantiate the grid-search
grid = GridSearchCV(knn_pipe, params, cv=cvStrat)
# run the grid search and report results 
grid.fit(X_trainval, y_trainval)

print(grid.best_params_)
print(grid.score(X_test, y_test)*100)

{'kneighborsclassifier__n_neighbors': 5}
94.39999999999999


We can look at all the parameters with:    

In [None]:
knn_pipe.get_params()

{'memory': None,
 'steps': [('minmaxscaler', MinMaxScaler()),
  ('kneighborsclassifier', KNeighborsClassifier())],
 'verbose': False,
 'minmaxscaler': MinMaxScaler(),
 'kneighborsclassifier': KNeighborsClassifier(),
 'minmaxscaler__clip': False,
 'minmaxscaler__copy': True,
 'minmaxscaler__feature_range': (0, 1),
 'kneighborsclassifier__algorithm': 'auto',
 'kneighborsclassifier__leaf_size': 30,
 'kneighborsclassifier__metric': 'minkowski',
 'kneighborsclassifier__metric_params': None,
 'kneighborsclassifier__n_jobs': None,
 'kneighborsclassifier__n_neighbors': 5,
 'kneighborsclassifier__p': 2,
 'kneighborsclassifier__weights': 'uniform'}
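
The step names and the `<step name>__<hyper-parameter name>` convention can also be explored directly on a pipeline. The sketch below is illustrative (`pipe_demo` and `grid_keys` are made-up names): it lists the automatically generated step names, sets a nested parameter with `set_params`, and collects the names that could be used as keys in a parameter grid.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

pipe_demo = make_pipeline(MinMaxScaler(), KNeighborsClassifier())

# make_pipeline names each step after its lowercased class name
print(list(pipe_demo.named_steps))

# set a nested parameter using <step name>__<hyper-parameter name>
pipe_demo.set_params(kneighborsclassifier__n_neighbors=7)
print(pipe_demo.named_steps['kneighborsclassifier'].n_neighbors)

# every name containing '__' is a valid key for a GridSearchCV parameter grid
grid_keys = [name for name in pipe_demo.get_params() if '__' in name]
print(grid_keys)
```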

We could also tune parameters of the pre-processing step.  Here we are adding a feature selection step to choose only a percentage of the top features to be included in the model. 


In [None]:
# from sklearn.feature_selection import SelectPercentile

# create a pipeline
select_pipe = make_pipeline(MinMaxScaler(), SelectPercentile(), 
                            KNeighborsClassifier())

# create the search grid.
# Pipeline hyper-parameters are specified as <step name>__<hyper-parameter name>
param_grid = {'kneighborsclassifier__n_neighbors': [1, 3, 5, 7, 9, 10, 13, 15],
              'selectpercentile__percentile': [1, 2, 5, 10, 50, 100]}

# Instantiate grid-search, here we use 10-fold cross-validation (the default is 5-fold)
grid = GridSearchCV(select_pipe, param_grid, cv=10)

# run the grid-search and report results
grid.fit(X_trainval, y_trainval)

print(grid.best_params_)
print(grid.score(X_test, y_test)*100)

{'kneighborsclassifier__n_neighbors': 3, 'selectpercentile__percentile': 100}
94.0


Note that we can make the parameter names in the grid search a bit simpler by giving the pipeline steps short names. 

In [None]:
# from sklearn.pipeline import Pipeline

# create a pipeline
#  Label each step of the pipeline with a name, e.g., 
#   'sc' - for scaling 
#   'fs' - for Feature Selection
#   'knn' - for KNN classifier 
select_pipe = Pipeline([
                        ('sc', MinMaxScaler()),
                        ('fs', SelectPercentile()),
                        ('knn', KNeighborsClassifier())])

# create the search grid.
# Pipeline hyper-parameters are specified as <step name>__<hyper-parameter name>
param_grid = {'knn__n_neighbors': [1, 3, 5, 7, 9, 10, 13, 15],
              'fs__percentile': [1, 2, 5, 10, 50, 100]}

# Instantiate grid-search
grid = GridSearchCV(select_pipe, param_grid, cv=10)

# run the grid-search and report results
grid.fit(X_trainval, y_trainval)

print(grid.best_params_)
print(grid.score(X_test, y_test)*100)

{'fs__percentile': 100, 'knn__n_neighbors': 3}
94.0


### Different options in a Pipeline. 

We may want the pipeline to select which preprocessing steps to include or which models to apply.  For example, I have been using `MinMaxScaler` in the examples, but what if `StandardScaler` works better for this dataset?  We can let `GridSearchCV` answer this. 


In [None]:

# declare a two step pipeline, explicitly giving names to both steps.
pipe = Pipeline([('scaler', StandardScaler()), 
                 ('knn', KNeighborsClassifier())])

# The name of the first step is 'scaler' and we can assign different
# estimators to this step, such as MinMaxScaler or StandardScaler
# There is a special value 'passthrough' which skips the step
param_grid = {'scaler': [MinMaxScaler(), StandardScaler(), 'passthrough'],
              # we named the second step knn, so we have to use that name here
              'knn__n_neighbors': [1, 3, 5, 7, 9, 11, 13, 15]}

# instantiate and run as before:
grid = GridSearchCV(pipe, param_grid, cv=10)
grid.fit(X_trainval, y_trainval)

print(grid.best_params_)
print(grid.score(X_test, y_test)*100)

{'knn__n_neighbors': 5, 'scaler': StandardScaler()}
95.19999999999999


Remember, we can see the detailed results with the `cv_results_` attribute. 

In [None]:
grid.cv_results_

{'mean_fit_time': array([0.00111718, 0.0012866 , 0.00062964, 0.00135174, 0.00140572,
        0.00160112, 0.00216241, 0.00131483, 0.00064392, 0.00108795,
        0.00122931, 0.0006026 , 0.00128157, 0.00217707, 0.00058303,
        0.00128911, 0.00144634, 0.00071359, 0.00100749, 0.00156398,
        0.00072532, 0.00146277, 0.00129106, 0.00079634]),
 'std_fit_time': array([2.12865945e-04, 1.51089381e-04, 6.08465935e-05, 4.53118047e-04,
        2.53021070e-04, 1.95113184e-03, 2.54153728e-03, 1.09359398e-04,
        2.80309282e-05, 8.94034576e-05, 9.17856692e-05, 4.15333445e-05,
        3.53228161e-04, 2.69975020e-03, 3.04108856e-05, 3.01142759e-04,
        2.26160355e-04, 1.14210961e-04, 2.95633366e-05, 4.28518644e-04,
        9.12705932e-05, 1.03226787e-03, 4.60200294e-05, 3.71173218e-04]),
 'mean_score_time': array([0.00345488, 0.0043654 , 0.00327051, 0.00506496, 0.00569475,
        0.00654283, 0.0058588 , 0.00398679, 0.00392787, 0.0039458 ,
        0.00393741, 0.00411441, 0.00430393, 0.00
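The raw `cv_results_` dictionary is hard to read. One common option is to load it into a DataFrame and keep only a few columns. The sketch below does this on synthetic stand-in data from `make_classification` (an assumption, not the notebook's dataset); the names `demo_pipe`, `demo_grid` and `cv_table` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption; not the data used in this notebook)
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     random_state=5)

demo_pipe = Pipeline([('scaler', StandardScaler()),
                      ('knn', KNeighborsClassifier())])
demo_grid = GridSearchCV(
    demo_pipe,
    {'knn__n_neighbors': [1, 5, 9], 'knn__weights': ['uniform', 'distance']},
    cv=5,
)
demo_grid.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked combination first
cv_table = pd.DataFrame(demo_grid.cv_results_)
cols = ['param_knn__n_neighbors', 'param_knn__weights',
        'mean_test_score', 'std_test_score', 'rank_test_score']
cv_table = cv_table[cols].sort_values('rank_test_score').reset_index(drop=True)
print(cv_table)
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the gap between the top candidates is larger than the fold-to-fold variation.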

We can get even more advanced with our `GridSearchCV` options, because it can search over grids, and also over lists of grids (a list of dictionaries). This is useful when different pre-processing steps or models have different hyper-parameters. For example, say we wanted to tune whether the `MinMaxScaler` should scale between 0 and 1 or between -1 and 1, while also considering the case of using `StandardScaler`. We can't just add `feature_range` to the `param_grid` dictionary because `StandardScaler` doesn't have a `feature_range` parameter. Instead we can create a list of two grids: one grid that always uses `MinMaxScaler` and one that always uses `StandardScaler`. This is a bit of a contrived example, but once we know more models and transformers there will be plenty of cases where this comes in handy.

In [None]:
param_grid = [ # list of two dicts
    # first dict always uses MinMaxScaler
    {'scaler': [MinMaxScaler()],
     # two options for feature_range; the parameter belongs to the
     # 'scaler' step, so it needs the 'scaler__' prefix:
     'scaler__feature_range': [(0, 1), (-1, 1)], 
     'knn__n_neighbors': [1, 3, 5, 7, 9, 11]},
    # second dict always uses StandardScaler
    # there are no scaling options that we're tuning
    {'scaler': [StandardScaler()], 
     'knn__n_neighbors': [1, 3, 5, 7, 9, 11]}   
]

Note, the values for scaler always need to be a list, even if it's a list with a single element. So we can't specify `'scaler': MinMaxScaler()`. 
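To see a list of grids in action, here is a small runnable sketch on synthetic stand-in data (an assumption; `X_demo`, `demo_pipe` and the other names are illustrative). Note that the range option is keyed as `scaler__feature_range`, because it is a parameter of the step named `scaler`. The search evaluates the union of both grids: 1 x 2 x 3 combinations from the first dictionary plus 1 x 3 from the second.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in data (assumption; not the data used in this notebook)
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     random_state=5)

demo_pipe = Pipeline([('scaler', StandardScaler()),
                      ('knn', KNeighborsClassifier())])

demo_param_grid = [
    # grid 1: MinMaxScaler with two candidate output ranges
    {'scaler': [MinMaxScaler()],
     'scaler__feature_range': [(0, 1), (-1, 1)],
     'knn__n_neighbors': [1, 3, 5]},
    # grid 2: StandardScaler, which has no feature_range to tune
    {'scaler': [StandardScaler()],
     'knn__n_neighbors': [1, 3, 5]},
]

demo_grid = GridSearchCV(demo_pipe, demo_param_grid, cv=5)
demo_grid.fit(X_demo, y_demo)

# Number of parameter combinations that were evaluated
n_candidates = len(demo_grid.cv_results_['params'])
print(n_candidates)
print(demo_grid.best_params_)
```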

### Accessing attributes in grid-search pipeline

We may want to access information about the model. 

For example, we can access the model fitted on the whole training+validation data using the `best_estimator_` attribute. 

In [None]:
grid

GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('knn', KNeighborsClassifier())]),
             param_grid={'knn__n_neighbors': [1, 3, 5, 7, 9, 11, 13, 15],
                         'scaler': [MinMaxScaler(), StandardScaler(),
                                    'passthrough']})

In [None]:
grid.best_estimator_

Pipeline(steps=[('scaler', StandardScaler()), ('knn', KNeighborsClassifier())])

You can see that the best estimator is itself a pipeline.  We can also access an individual step. 

In [None]:
grid.best_estimator_['scaler']

StandardScaler()

This is a scaler that was fit on the whole training+validation dataset. 

We can also access the parameters that the fitted scaler learned, for example the minimum values (`data_min_`) of a `MinMaxScaler` or the mean values (`mean_`) of a `StandardScaler`.

In [None]:
grid.best_estimator_['scaler'].mean_

array([ 0.40896306,  0.36480057,  1.29951722, -0.0558038 ,  0.41304036,
        0.58007339, -0.46107616,  0.50317085, -0.11268126, -0.09956297,
        0.87089602, -0.71248985,  0.96428355, -1.07964908,  0.04909933,
        1.03999126, -1.50948237, -0.05674912,  0.61672511,  0.49217227])
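The same idea works for any step and any fitted attribute. The following sketch uses the iris data and a pipeline that always scales with `MinMaxScaler` (so the attribute `data_min_` is guaranteed to exist); the variable names are illustrative. It checks that the scaler inside `best_estimator_` was refit on the full training+validation split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X_iris, y_iris = load_iris(return_X_y=True)
X_tv, X_te, y_tv, y_te = train_test_split(X_iris, y_iris, random_state=5)

demo_pipe = Pipeline([('scaler', MinMaxScaler()),
                      ('knn', KNeighborsClassifier())])
demo_grid = GridSearchCV(demo_pipe, {'knn__n_neighbors': [1, 3, 5]}, cv=5)
demo_grid.fit(X_tv, y_tv)

best = demo_grid.best_estimator_

# Steps can be reached by name through named_steps or by indexing
fitted_scaler = best.named_steps['scaler']
print(fitted_scaler.data_min_)              # per-feature minima seen in fit
print(best.named_steps['knn'].n_neighbors)  # the selected hyper-parameter
```

Because `GridSearchCV` refits the winning pipeline on all of `X_tv` by default (`refit=True`), the learned minima match `X_tv.min(axis=0)` rather than the minima of any single cross-validation fold.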

# Exercises: Classification - Music Hits 

For this problem, you will work to classify a song’s popularity. Specifically, you will develop methods to predict whether a song will make the Top10 of Billboard’s Hot 100 Chart. The data set consists of songs from the Top10 of Billboard’s Hot 100 Chart from 1990-2010 along with a sample of other songs that did not make the list.  

The data source is the MIT 15.071 course. The data set was created by scraping Billboard’s Hot 100, other songs on Billboard, and using the EchoNest API, now a part of Spotify, to get song information.

The variables in the data set include several descriptions of the song and artist (including song title and ID numbers) and the year the song was released. Additionally, several variables describe the song attributes: time signature, loudness, tempo, key, energy, pitch, and timbre (measured over different sections of the song). The last variable is a binary indicator of whether the song was in the Top10 or not.

You will use the variables of the song attributes to predict whether the song will be popular or not.

## Q1  (5 pts) Load and understand the data 

Load in the `music` data. 

You should not use the `year`, `artistname`, `artistID`, `songtitle` or `songID` in the prediction.  
Additionally, remove any variables that are the confidence of another variable, e.g., `timesignature_confidence`, `tempo_confidence`. 


Create an input feature matrix, `Xm`, and a label vector, `ym`, that you will use to create your classifiers. 


In [None]:
music = pd.read_csv('music.csv', encoding = "ISO-8859-1")

Xm = music[['timesignature', 'loudness', 'tempo', 'key', 'energy', 'pitch',
           'timbre_0_min', 'timbre_0_max', 'timbre_1_min', 'timbre_1_max',
           'timbre_2_min', 'timbre_2_max', 'timbre_3_min', 'timbre_3_max',
           'timbre_4_min', 'timbre_4_max', 'timbre_5_min', 'timbre_5_max',
           'timbre_6_min', 'timbre_6_max', 'timbre_7_min', 'timbre_7_max',
           'timbre_10_min', 'timbre_10_max', 'timbre_11_min', 'timbre_11_max']]
ym = music['Top10']

Xm.head()

,timesignature,loudness,tempo,key,energy,pitch,timbre_0_min,timbre_0_max,timbre_1_min,timbre_1_max,...,timbre_5_min,timbre_5_max,timbre_6_min,timbre_6_max,timbre_7_min,timbre_7_max,timbre_10_min,timbre_10_max,timbre_11_min,timbre_11_max
0,3,-4.262,91.525,11,0.966656,0.024,0.002,57.342,-6.496,171.093,...,-104.683,183.089,-88.771,73.549,-71.127,82.475,-126.44,18.658,-44.77,25.989
1,4,-4.051,140.048,10,0.98471,0.025,0.0,57.414,-37.351,171.13,...,-87.267,42.798,-86.895,75.455,-65.807,106.918,-103.808,121.935,-38.892,22.513
2,4,-3.571,160.512,2,0.9899,0.026,0.003,57.422,-17.222,171.06,...,-98.673,141.365,-88.874,66.504,-67.433,80.621,-108.313,33.3,-43.733,25.744
3,4,-3.815,97.525,1,0.939207,0.013,0.0,57.765,-32.083,220.895,...,-77.515,141.178,-70.79,64.54,-63.667,96.675,-102.676,46.422,-59.439,37.082
4,4,-4.707,140.053,6,0.987738,0.063,0.0,56.872,-223.922,171.13,...,-96.147,38.303,-110.757,72.391,-55.935,110.332,-52.796,22.888,-50.414,32.758


In [None]:
grader.check("q1")

## Q2. (40 pts) Classify Top 10 Hits 

We want to report the results of predicting the Top10 hits using KNN, Decision Trees, and SVMs.  

For each model, you will tune the hyper-parameters:    
* KNN, number of neighbors = [3, 7, 11, 15] 
* Decision Trees, maximum depth of the tree = [2, 5, 10, 25], random_state = 5
* SVM, use an RBF kernel with C = [0.001, 0.1, 10] 

In addition, you will want to see which scaling method seems to work best for this dataset and model: `StandardScaler` or `MinMaxScaler`. 

Overall, you will construct three pipelines to perform this analysis, one for each model: KNN, DT, SVM.  You will do an initial split of your data into a training+validation set with 85% of the data and a test set with 15% of the data (random_state=5).  Use 10-fold stratified cross-validation with random_state = 5. 
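Stratification matters here because positives are rare. As a small illustration on toy labels (an assumption: 1000 samples with 15% positives, not the music data), the sketch below counts the positives that land in each validation fold of a `StratifiedKFold` configured as described above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels with 15% positives (assumption; roughly like an imbalanced target)
y_toy = np.array([0] * 850 + [1] * 150)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not affect the split

cv_toy = StratifiedKFold(n_splits=10, shuffle=True, random_state=5)

# Count the positive samples in each validation fold
positives_per_fold = [int(y_toy[val_idx].sum())
                      for _, val_idx in cv_toy.split(X_toy, y_toy)]
print(positives_per_fold)
```

Each validation fold receives the same share of positives as the full label vector (15 of 100 here), so per-fold precision, recall and F1 are computed on comparable class mixes.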

Additionally, when selecting the best hyper-parameters, instead of using accuracy you will use the F1 measure (`scoring='f1'` in `GridSearchCV`).  
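A quick sketch of why accuracy is a poor selection criterion for this problem: the labels below reproduce the class counts of the music data (6455 negatives and 1119 positives, as reported by `ym.value_counts()` in Q4), and the "classifier" simply predicts the majority class for every song. The variable names are illustrative.

```python
import numpy as np
from sklearn import metrics

# Labels with the same class counts as the music data
y_all = np.array([0] * 6455 + [1] * 1119)

# A trivial baseline that never predicts a Top10 hit
always_negative = np.zeros_like(y_all)

acc = metrics.accuracy_score(y_all, always_negative)
# zero_division=0 silences the warning raised when nothing is predicted positive
f1 = metrics.f1_score(y_all, always_negative, zero_division=0)
print(round(acc, 3), f1)
```

The baseline reaches about 85% accuracy without finding a single hit, while its F1 score is 0. Selecting hyper-parameters by F1 therefore rewards models that actually identify the positive class.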
 

One note: we are not using the results here to select a particular model (that would mean using the test set for more than estimating generalization performance); we only report the results. 

In [None]:
# Split off the test set 
X_trainval, X_test, y_trainval, y_test = train_test_split(
    Xm, ym, test_size=0.15, random_state=5)

# ** KNN **
# Create pipeline
knn_pipe = Pipeline([('scaler', StandardScaler()), 
                 ('knn', KNeighborsClassifier())])

# specify pipeline steps hyperparameters
knn_param = {'scaler': [MinMaxScaler(), StandardScaler(), 'passthrough'],
              'knn__n_neighbors': [3, 7, 11, 15]}

# Setup cross-validation for repeatability 
cvStrat = StratifiedKFold(n_splits=10, random_state=5, shuffle=True)

# instantiate and run GridSearchCV on pipeline:
knn_grid = GridSearchCV(knn_pipe, knn_param, cv=cvStrat, scoring='f1')
knn_grid.fit(X_trainval, y_trainval)

# predictions on final test set 
knn_ytest = knn_grid.predict(X_test)

print(knn_grid.best_params_)

{'knn__n_neighbors': 3, 'scaler': StandardScaler()}


In [None]:
np.random.seed(5550)

# ** DT **
dt_pipe = Pipeline([('scaler', StandardScaler()), 
                 ('dt', tree.DecisionTreeClassifier(random_state=5))])

dt_param = {'scaler': [MinMaxScaler(), StandardScaler(), 'passthrough'],
              'dt__max_depth': [2, 5, 10, 25]}

# Setup cross-validation for repeatability 
cvStrat = StratifiedKFold(n_splits=10, random_state=5, shuffle=True)

# instantiate and run GridSearchCV on pipeline:
dt_grid = GridSearchCV(dt_pipe, dt_param, cv=cvStrat, scoring='f1')
dt_grid.fit(X_trainval, y_trainval)

# predictions on final test set 
dt_ytest = dt_grid.predict(X_test)
print(dt_grid.best_params_)

{'dt__max_depth': 25, 'scaler': StandardScaler()}


In [None]:
# ** SVM ** 
svm_pipe = Pipeline([('scaler', StandardScaler()), 
                 ('svm', svm.SVC())])

svm_param = {'scaler': [MinMaxScaler(), StandardScaler(), 'passthrough'],
              'svm__C': [0.001, 0.1, 10]}

# Setup cross-validation for repeatability 
cvStrat = StratifiedKFold(n_splits=10, random_state=5, shuffle=True)

# instantiate and run GridSearchCV on pipeline:
svm_grid = GridSearchCV(svm_pipe, svm_param, cv=cvStrat, scoring='f1')
svm_grid.fit(X_trainval, y_trainval)

# predictions on final test set
svm_ytest = svm_grid.predict(X_test)

print(svm_grid.best_params_)

{'scaler': StandardScaler(), 'svm__C': 10}


In [None]:
grader.check("q2")

## Q3 - (10 pts) Table of Results 

Report in a DataFrame the following information for each model:
* `Model` type (KNN, DT, SVM), 
* best `Hyper-parameters` for the model, e.g., (n_neighbors, 7), (max_depth, 10), ('C', 0.1)
* `Accuracy`, 
* `Precision`,
* `Recall`, 
* `F1-measure` and 
* `Balanced Acc` - balanced accuracy

The last 5 values should all be calculated on the test set. 
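As a reminder of how these metrics relate to each other, the sketch below computes them by hand from a confusion matrix on a tiny made-up example (the labels are an assumption chosen for easy arithmetic, not results from this notebook) and compares them with the `sklearn.metrics` functions.

```python
import numpy as np
from sklearn import metrics

# Made-up labels: 4 positives and 12 negatives
y_true = np.array([1, 1, 1, 1] + [0] * 12)
# Predictions: 2 true positives, 2 false negatives, 1 false positive
y_pred = np.array([1, 1, 0, 0] + [1] + [0] * 11)

# For binary labels the flattened confusion matrix is (tn, fp, fn, tp)
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)      # how many predicted positives are correct
recall = tp / (tp + fn)         # how many actual positives are found
specificity = tn / (tn + fp)    # recall of the negative class
f1 = 2 * precision * recall / (precision + recall)
balanced_acc = (recall + specificity) / 2

print(accuracy, precision, recall, f1, balanced_acc)
```

Balanced accuracy averages the recall of both classes, so unlike plain accuracy it does not improve just because the majority class is easy to predict.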



In [None]:
# Build data frame of requested results
results = pd.DataFrame({
    "Model" : ["KNN", "DT", "SVM"],
    "Hyper-parameters" : [("n_neighbors", knn_grid.best_params_["knn__n_neighbors"]),
                         ("max_depth", dt_grid.best_params_["dt__max_depth"]),
                         ("C", svm_grid.best_params_["svm__C"])],
    "Accuracy" : [metrics.accuracy_score(y_test, knn_ytest),
                  metrics.accuracy_score(y_test, dt_ytest),
                  metrics.accuracy_score(y_test, svm_ytest)],
    "Precision" : [metrics.precision_score(y_test, knn_ytest),
                  metrics.precision_score(y_test, dt_ytest),
                  metrics.precision_score(y_test, svm_ytest)],
    "Recall" : [metrics.recall_score(y_test, knn_ytest),
                  metrics.recall_score(y_test, dt_ytest),
                  metrics.recall_score(y_test, svm_ytest)],
    "F1-measure" : [metrics.f1_score(y_test, knn_ytest),
                  metrics.f1_score(y_test, dt_ytest),
                  metrics.f1_score(y_test, svm_ytest)],
    "Balanced Acc." : [metrics.balanced_accuracy_score(y_test, knn_ytest),
                  metrics.balanced_accuracy_score(y_test, dt_ytest),
                  metrics.balanced_accuracy_score(y_test, svm_ytest)]
})

results

,Model,Hyper-parameters,Accuracy,Precision,Recall,F1-measure,Balanced Acc.
0,KNN,"(n_neighbors, 3)",0.829376,0.4375,0.230769,0.302158,0.587112
1,DT,"(max_depth, 25)",0.781003,0.31694,0.318681,0.317808,0.593896
2,SVM,"(C, 10)",0.840809,0.505618,0.247253,0.332103,0.60059


In [None]:
grader.check("q3")

<!-- BEGIN QUESTION -->

## Q4 (10 pts). Results summary 

Summarize the results.  Write 5-8 sentences about the results observed and the overall performance on the problem. 


In [None]:
ym.value_counts()

0    6455
1    1119
Name: Top10, dtype: int64

**ANSWER**

The dataset is imbalanced: there are nearly 6 times as many negative samples as positive samples (6455 vs. 1119).

All three models have seemingly good accuracy values (0.78 to 0.84). However, each model has poor balanced accuracy (about 0.59 to 0.60) and F1-measure (about 0.30 to 0.33) scores. Moreover, precision and recall values for each model are not very high. 

For example, the KNN model had a precision and a recall of roughly 0.44 and 0.23 on the test set, respectively. In practice, this means that 44% of the instances that KNN classifies as positive are actually positive, and that KNN correctly identifies only 23% of the instances that are actually positive. The fact that the KNN and SVM models both had roughly twice as high a precision as recall indicates that these models are strict in classifying an instance as positive, which makes them miss many actually positive instances.

<!-- END QUESTION -->

## Bonus1 (12 pts).  Improve Performance of Models

The problem we are working with deals with an imbalanced data set.

In [None]:
npos = len(ym[ym==1])
nneg = len(ym[ym==0])
# Percentage of positive samples in data set
npos / (nneg + npos)*100

The imbalanced data is one explanation for the poor performance of our classifiers above (among other reasons).  

Let's try to improve this performance.  Classification with imbalanced data can be improved using a number of different techniques.  Two approaches are: 

* Cost-sensitive or weighted learning approach
* Data or sampling approach 

Here we will examine the class weighting approach. Some of our traditional classification models can be adapted to include a penalty, or cost, for each class.  In our problem, we have a minority class "Top10 Hits" and a majority class "Non-Top10 Hits".  We can use class weighting to penalize the model more for misclassifying the minority class than for misclassifying the majority class.  

We will make use of the `scikit-learn` parameter `class_weight`.  Setting `class_weight='balanced'` weights each class inversely proportional to its frequency in the training data.  

Note, not all classification models have this parameter to set, e.g., KNN. 
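To make the 'balanced' weighting concrete, the sketch below computes the weights for labels with the class counts of the full music data (6455 negatives, 1119 positives). This is only an illustration: inside the pipelines the weights are derived from whichever training fold the model is fit on, so the exact values will differ slightly. The helper `compute_class_weight` is part of `sklearn.utils.class_weight`; the variable names are illustrative.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels with the same class counts as the full music data
y_all = np.array([0] * 6455 + [1] * 1119)
classes = np.array([0, 1])

# 'balanced' uses n_samples / (n_classes * count_of_each_class)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_all)
manual = len(y_all) / (len(classes) * np.bincount(y_all))

print(dict(zip(classes.tolist(), weights.round(3).tolist())))
```

With these counts, an error on a Top10 song costs roughly 5.8 times as much as an error on a non-Top10 song, which is the ratio of the two class frequencies.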

Rerun your DT and SVM pipelines from above (Q2) now with the DT and SVM using the parameter `class_weight ='balanced'`

Add the resulting models `DT bal class weights` and `SVM bal class weights` and their performance to the results table from Q3.  



In [None]:

# ** DT ** 
# include class_weight in DT parameters
dt_pipe2 = Pipeline([('scaler', StandardScaler()), 
           ('dt', tree.DecisionTreeClassifier(random_state=5, class_weight ='balanced'))])

dt_param2 = {'scaler': [MinMaxScaler(), StandardScaler(), 'passthrough'],
              'dt__max_depth': [2, 5, 10, 25]}

# Setup cross-validation for repeatability 
cvStrat = StratifiedKFold(n_splits=10, random_state=5, shuffle=True)

# instantiate and run GridSearchCV on pipeline:
dt_grid2 = GridSearchCV(dt_pipe2, dt_param2, cv=cvStrat, scoring='f1')
dt_grid2.fit(X_trainval, y_trainval)

# predictions on final test set 
dt_ytest2 = dt_grid2.predict(X_test)

print(dt_grid2.best_params_)

{'dt__max_depth': 10, 'scaler': MinMaxScaler()}


In [None]:
# ** SVM ** 
# include class_weight in SVM parameters
svm_pipe2 = Pipeline([('scaler', StandardScaler()), 
            ('svm', svm.SVC(class_weight ='balanced'))])

svm_param2 = {'scaler': [MinMaxScaler(), StandardScaler(), 'passthrough'],
              'svm__C': [0.001, 0.1, 10]}

# Setup cross-validation for repeatability 
cvStrat = StratifiedKFold(n_splits=10, random_state=5, shuffle=True)

# instantiate and run GridSearchCV on pipeline:
svm_grid2 = GridSearchCV(svm_pipe2, svm_param2, cv=cvStrat, scoring='f1')
svm_grid2.fit(X_trainval, y_trainval)

# predictions on final test set
svm_ytest2 = svm_grid2.predict(X_test)

print(svm_grid2.best_params_)

{'scaler': MinMaxScaler(), 'svm__C': 10}


In [None]:
# Add "DT balanced class weights" and "SVM balanced class weights" rows 
#  to the results table. 
results.loc[3] = [
 "DT balanced class weights",
 ("max_depth", dt_grid2.best_params_["dt__max_depth"]),
 metrics.accuracy_score(y_test, dt_ytest2),
 metrics.precision_score(y_test, dt_ytest2),
 metrics.recall_score(y_test, dt_ytest2),
 metrics.f1_score(y_test, dt_ytest2),
 metrics.balanced_accuracy_score(y_test, dt_ytest2)
 ]

results.loc[4] = [
 "SVM balanced class weights",
 ("C", svm_grid2.best_params_["svm__C"]),
 metrics.accuracy_score(y_test, svm_ytest2),
 metrics.precision_score(y_test, svm_ytest2),
 metrics.recall_score(y_test, svm_ytest2),
 metrics.f1_score(y_test, svm_ytest2),
 metrics.balanced_accuracy_score(y_test, svm_ytest2)
 ]

results

,Model,Hyper-parameters,Accuracy,Precision,Recall,F1-measure,Balanced Acc.
0,KNN,"(n_neighbors, 3)",0.829376,0.4375,0.230769,0.302158,0.587112
1,DT,"(max_depth, 25)",0.781003,0.31694,0.318681,0.317808,0.593896
2,SVM,"(C, 10)",0.840809,0.505618,0.247253,0.332103,0.60059
3,DT balanced class weights,"(max_depth, 10)",0.687775,0.275325,0.582418,0.373898,0.645135
4,SVM balanced class weights,"(C, 10)",0.756376,0.355623,0.642857,0.457926,0.710434


In [None]:
grader.check("b1")

## Bonus2 (18 pts). Improve Model Performance (part 2)

For this problem, you can improve the model performance by using a data or sampling approach. 

We will make use of the package `imbalanced-learn` to help with the sampling methods. 
[`imbalanced-learn`](https://imbalanced-learn.org/stable/) extends `scikit-learn` to provide tools for classification with imbalanced data. 

In general, the data or sampling approach tries to correct the imbalance in the data with two main methods: 

* Oversampling - adding examples to the minority class, either by duplicating existing ones or by generating new synthetic ones 
* Undersampling - reducing the number of samples in the majority class to match the minority class 

Both methods have their advantages and disadvantages. 

We will make use of a common technique known as SMOTE - Synthetic Minority Oversampling TEchnique.  
https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html


Refactor your code for Q2 to add a sampling step to your pipeline.  This step will use SMOTE with its default options, plus `random_state=5` for repeatability.  

You will now use the `ImbPipeline` class (the `Pipeline` class from `imbalanced-learn`, imported under a different name so that it does not clash with the `scikit-learn` one).  

Make pipelines for KNN, DT, and SVM with the same hyper-parameters as above.  For this problem, you will not use the `class_weight` option.  

Add your results as new rows to the results table for `KNN SMOTE`, `DT SMOTE`, and `SVM SMOTE`. 


In [None]:
# from imblearn.pipeline import Pipeline as ImbPipeline
# from imblearn.over_sampling import SMOTE

# ** KNN ** 
# Include SMOTE as a step after scaling in the pipeline
# Create pipeline
knn_pipe3 = ImbPipeline([
    ('scaler', StandardScaler()),
    ('knn_smote', SMOTE(random_state=5)),
    ('knn', KNeighborsClassifier())
    ])

# specify pipeline steps hyperparameters
knn_param3 = {'scaler': [MinMaxScaler(), StandardScaler(), 'passthrough'],
              'knn__n_neighbors': [3, 7, 11, 15]}

# Setup cross-validation for repeatability 
cvStrat = StratifiedKFold(n_splits=10, random_state=5, shuffle=True)

# instantiate and run GridSearchCV on pipeline:
knn_grid3 = GridSearchCV(knn_pipe3, knn_param3, cv=cvStrat, scoring='f1')
knn_grid3.fit(X_trainval, y_trainval)

# predictions on final test set 
knn_ytest3 = knn_grid3.predict(X_test)

print(knn_grid3.best_params_)

{'knn__n_neighbors': 3, 'scaler': StandardScaler()}


In [None]:
# from imblearn.pipeline import Pipeline as ImbPipeline
# from imblearn.over_sampling import SMOTE

# ** DT ** 
# Include SMOTE as a step after scaling in the pipeline
#   set random_state=5 for SMOTE
dt_pipe3 = ImbPipeline([
    ('scaler', StandardScaler()),
    ('dt_smote', SMOTE(random_state=5)),
    ('dt', tree.DecisionTreeClassifier(random_state=5))
     ])

dt_param3 = {'scaler': [MinMaxScaler(), StandardScaler(), 'passthrough'],
              'dt__max_depth': [2, 5, 10, 25]} 

# Setup cross-validation for repeatability 
cvStrat = StratifiedKFold(n_splits=10, random_state=5, shuffle=True)

# instantiate and run GridSearchCV on pipeline:
dt_grid3 = GridSearchCV(dt_pipe3, dt_param3, cv=cvStrat, scoring='f1')
dt_grid3.fit(X_trainval, y_trainval)

# predictions on final test set 
dt_ytest3 = dt_grid3.predict(X_test)

print(dt_grid3.best_params_)

{'dt__max_depth': 5, 'scaler': StandardScaler()}


In [None]:
# from imblearn.pipeline import Pipeline as ImbPipeline
# from imblearn.over_sampling import SMOTE

# ** SVM ** 
# include SMOTE as a step after scaling in the pipeline
#   set random_state=5 for SMOTE
svm_pipe3 = ImbPipeline([
    ('scaler', StandardScaler()), 
    ('svm_smote', SMOTE(random_state=5)),
    ('svm', svm.SVC())
    ])

svm_param3 = {'scaler': [MinMaxScaler(), StandardScaler(), 'passthrough'],
              'svm__C': [0.001, 0.1, 10]}

# Setup cross-validation for repeatability 
cvStrat = StratifiedKFold(n_splits=10, random_state=5, shuffle=True)

# instantiate and run GridSearchCV on pipeline:
svm_grid3 = GridSearchCV(svm_pipe3, svm_param3, cv=cvStrat, scoring='f1')
svm_grid3.fit(X_trainval, y_trainval)

# predictions on final test set
svm_ytest3 = svm_grid3.predict(X_test)

print(svm_grid3.best_params_)

{'scaler': StandardScaler(), 'svm__C': 0.1}


In [None]:
# Add "KNN SMOTE", "DT SMOTE" and "SVM SMOTE" rows 
#  to the results table. 

results.loc[5] = [
 "KNN SMOTE",
 ("n_neighbors", knn_grid3.best_params_["knn__n_neighbors"]),
 metrics.accuracy_score(y_test, knn_ytest3),
 metrics.precision_score(y_test, knn_ytest3),
 metrics.recall_score(y_test, knn_ytest3),
 metrics.f1_score(y_test, knn_ytest3),
 metrics.balanced_accuracy_score(y_test, knn_ytest3)
 ]

results.loc[6] = [
 "DT SMOTE",
 ("max_depth", dt_grid3.best_params_["dt__max_depth"]),
 metrics.accuracy_score(y_test, dt_ytest3),
 metrics.precision_score(y_test, dt_ytest3),
 metrics.recall_score(y_test, dt_ytest3),
 metrics.f1_score(y_test, dt_ytest3),
 metrics.balanced_accuracy_score(y_test, dt_ytest3)
 ]

results.loc[7] = [
 "SVM SMOTE",
 ("C", svm_grid3.best_params_["svm__C"]),
 metrics.accuracy_score(y_test, svm_ytest3),
 metrics.precision_score(y_test, svm_ytest3),
 metrics.recall_score(y_test, svm_ytest3),
 metrics.f1_score(y_test, svm_ytest3),
 metrics.balanced_accuracy_score(y_test, svm_ytest3)
 ]

results

,Model,Hyper-parameters,Accuracy,Precision,Recall,F1-measure,Balanced Acc.
0,KNN,"(n_neighbors, 3)",0.829376,0.4375,0.230769,0.302158,0.587112
1,DT,"(max_depth, 25)",0.781003,0.31694,0.318681,0.317808,0.593896
2,SVM,"(C, 10)",0.840809,0.505618,0.247253,0.332103,0.60059
3,DT balanced class weights,"(max_depth, 10)",0.687775,0.275325,0.582418,0.373898,0.645135
4,SVM balanced class weights,"(C, 10)",0.756376,0.355623,0.642857,0.457926,0.710434
5,KNN SMOTE,"(n_neighbors, 3)",0.643799,0.264271,0.686813,0.381679,0.661208
6,DT SMOTE,"(max_depth, 5)",0.647318,0.24356,0.571429,0.341544,0.616604
7,SVM SMOTE,"(C, 0.1)",0.760774,0.360248,0.637363,0.460317,0.710828


In [None]:
grader.check("b2")