# Workflow & Hyperparameter Optimization

Our dataset describes heart diseases in a binary class problem (1: disease, 0: no disease)

Your goal will be to fit KNN to best predict malignant targets by avoiding the maximum false negatives!

👇 Import the data

In [107]:
import pandas as pd
import seaborn as sns
import numpy as np

In [108]:
data.info()

In [109]:
X = data.drop(columns=['target'])
y = data['target']

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


In [110]:
y.value_counts()

1    357
0    212
Name: target, dtype: int64

## 1. Train/Test split

👇 Split the data to create your `X_train` `X_test` and `y_train` `y_test`

Use a test_size=0.3 and a `random_state=0` to compare with your buddy

In [111]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

## 2. Scaling

❓ Scale your training set using the scaler of your choice

In [112]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

## 3. Baseline KNN model

❓ Cross validate a basic KNN classifier on your training set on the using the "ROC area-under-curve" metric

In [113]:
from sklearn.metrics import SCORERS
SCORERS.keys()

dict_keys(['explained_variance', 'r2', 'max_error', 'neg_median_absolute_error', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'neg_root_mean_squared_error', 'neg_mean_poisson_deviance', 'neg_mean_gamma_deviance', 'accuracy', 'roc_auc', 'roc_auc_ovr', 'roc_auc_ovo', 'roc_auc_ovr_weighted', 'roc_auc_ovo_weighted', 'balanced_accuracy', 'average_precision', 'neg_log_loss', 'neg_brier_score', 'adjusted_rand_score', 'homogeneity_score', 'completeness_score', 'v_measure_score', 'mutual_info_score', 'adjusted_mutual_info_score', 'normalized_mutual_info_score', 'fowlkes_mallows_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'jaccard', 'jaccard_macro', 'jaccard_micro', 'jaccard_samples', 'jaccard_weighted'])

In [114]:
KNeighborsClassifier().get_params().keys()

dict_keys(['algorithm', 'leaf_size', 'metric', 'metric_params', 'n_jobs', 'n_neighbors', 'p', 'weights'])

In [115]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_validate

knn_model = KNeighborsClassifier()
knn_results = cross_validate(knn_model, X_train, y_train, scoring='roc_auc')
display(f'ROC score = {round(knn_results["test_score"].mean(),2)}')

'ROC score = 0.96'

## 4. Grid search

Use KNeighborsClassifier

👇 Grid search a KNN's hyperparameter k on the training data.
- Search k = [1,5,10,20,50]
- 5-fold cross validate
- Score with recall

In [116]:
from sklearn.model_selection import GridSearchCV

# Hyperparameter Grid
grid = {'n_neighbors': [1, 5, 10, 20, 25]}

# Instanciate Grid Search
grid_search = GridSearchCV(knn_model, grid, 
                           scoring = 'recall',
                           cv = 5,
                           n_jobs=-1 # paralellize computation
                          ) 

# Fit data to Grid Search
grid_search.fit(X_train_scaled,y_train);

In [117]:
# Best score
display(f'best recall score: {grid_search.best_score_}')

# Best Params
display(f'best params: {grid_search.best_params_}')
# grid_search.best_params_

# Best estimator
display(f'best estimator: {grid_search.best_estimator_}')

'best recall score: 0.992'

"best params: {'n_neighbors': 5}"

'best estimator: KNeighborsClassifier()'

❓ According to the grid search, what is the optimal K value?

it is 5!

❓ What is the best score the optimal K value produced?

It is 99.2% recall rate!

We now have an idea about where the best k lies, but some of the values we did not try could be better!

Re-run grid search with k-values around to your previous best value

❓ What is the best score and best k?

In [118]:
# Hyperparameter Grid
grid = {'n_neighbors': [1, 3, 5, 7, 9, 15, 20]}

# Instanciate Grid Search
grid_search2 = GridSearchCV(knn_model, grid, 
                           scoring = 'recall',
                           cv = 5,
                           n_jobs=-1 # paralellize computation
                          ) 

# Fit data to Grid Search
grid_search2.fit(X_train_scaled,y_train);

# Best score
display(f'best recall score: {grid_search2.best_score_}')

# Best Params
display(f'best params: {grid_search2.best_params_}')
# grid_search.best_params_

# Best estimator
display(f'best estimator: {grid_search2.best_estimator_}')

'best recall score: 0.992'

"best params: {'n_neighbors': 5}"

'best estimator: KNeighborsClassifier()'

## 5. Optimizing multiple hyperparameters

👇 Is the default distance parameter of a KNNClassifier optimal for the task? Run a random search to compute your answer. (look for the parameter 'p' in the [Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html))

First, let's do a grid search for k and p at the same time. Try 10 combinations, e.g. k = [1, 10, 20, 30, 40]; p = [1, 2]

❓ What are the best parameters and the best score?

In [119]:
from sklearn.metrics import recall_score
best_model = KNeighborsClassifier(n_neighbors=14)
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)

recall_score(y_test, y_pred)

0.9809523809523809

## 6. Random Search

Now let's see if a Random Search can find a better combination only 10 tries.

👇 Redo the same for `RandomizedSearchCV` but randomly sample `n_neighbors` from a `randint(1,40)` distribition

To compare apples to apples, run `RandomizedSearchCV` with `n_iter=10` to have the same number of total combinations to try. Also make sure you use the same scoring method for both, for example accuracy.

<details>
    <summary>🤔 Is the best score better with the Randomized Search?</summary>


It is not guaranteed because it is random, but we know there is a chance that RandomizedSearch will sample it.

You can play with np.random.seed() to see that sometimes RandomSearch will outperform GridSearch and sometimes not.

It is important to note that our dataset is extremely small and our hyperparameter optimization is thus extremely dependent (and overfitting) on our train/test split. **Always make sure your dataset is much bigger than the total number of hyperparameter combinations you are trying out!**

Randomized Search will become more useful even, when we want to search over even more than 2 numerical hyperparameters and sample all of them randomly, for example for SVMs!

One thing you can always do is run a coarse grained grid search frst, followed by a more fine grained search around the best parameter that you found. You can also do a randomized search followed by a grid search and vice versa. 
</details>

In [122]:
from sklearn.model_selection import RandomizedSearchCV
# Hyperparameter Grid
grid = {'n_neighbors': [1, 10, 20, 30, 40], 'p':[1,2]}

# Instanciate Grid Search
grid_search = GridSearchCV(knn_model, grid, 
                           scoring = 'recall',
                           cv = 5,
                           n_jobs=-1 # paralellize computation
                          ) 

# Fit data to Grid Search
grid_search.fit(X_train_scaled,y_train);

In [123]:
# Best score
display(f'best recall score: {grid_search.best_score_}')

# Best Params
display(f'best params: {grid_search.best_params_}')
# grid_search.best_params_

# Best estimator
display(f'best estimator: {grid_search.best_estimator_}')

'best recall score: 0.992'

"best params: {'n_neighbors': 20, 'p': 1}"

'best estimator: KNeighborsClassifier(n_neighbors=20, p=1)'

## 7. Generalization

👇 This is your final chance to finetune your model: Try to refine your Grid/RandomsearchCV, instanciate your best model and re-fit it on the entire train set.

Best p is 1 and k is 20, best score 97,7%

👇 Time has come to discover its its performance on the **unseen** test set. 

In [124]:
np.random.seed(8)
from random import randint
# Hyperparameter Grid
grid = {'n_neighbors': [randint(1,40)], 'p':[1,2]}

# Instanciate Grid Search
grid_search2 = RandomizedSearchCV(knn_model, grid, 
                           n_iter=10, 
                           scoring = 'recall',
                           cv = 5,
                           n_jobs=-1 # paralellize computation
                          ) 

# Fit data to Grid Search
grid_search2.fit(X_train_scaled,y_train);



❓ Would you consider the optimized model to generalize well?

In [125]:
# Best score
display(f'best recall score: {grid_search2.best_score_}')

# Best Params
display(f'best params: {grid_search2.best_params_}')
# grid_search.best_params_

# Best estimator
display(f'best estimator: {grid_search2.best_estimator_}')

'best recall score: 0.9960000000000001'

"best params: {'p': 1, 'n_neighbors': 13}"

'best estimator: KNeighborsClassifier(n_neighbors=13, p=1)'

<details><summary>Hints</summary>

Find horrible test performance? You probably foregot to scale your test set too! Re-use your scaler fitted on the train set to transform your test set accordingly!
</details>

🏁 Congratulation. Please push the exercice once completed