# Model Tuning

In [81]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.model_selection import (
    train_test_split, GridSearchCV, cross_val_score)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

## Agenda

## Hyperparameters

Many of the models we have looked at are really *families* of models in the sense that they make use of **hyperparameters**.

Thus for example the $k$-nearest-neighbors algorithm allows us to make:

- a 1-nearest-neighbor model;
- a 2-nearest-neighbors model;
- a 3-nearest-neighbors model;
- etc.

Or, for another example, the decision tree algorithm allows us to make:

- a classifier that branches according to information gain;
- a classifier that branches according to Gini impurity;
- a regressor that branches according to mean squared error;
- etc.

Depending on the sort of problem and data at hand, it is natural to experiment with different values of these hyperparameters to try to improve model performance.
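As a rough sketch of the "family of models" idea, the snippet below builds several members of each family by changing only one hyperparameter. The built-in iris data is an assumption used to keep the example self-contained; it is not the dataset used in this lesson.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumption: iris stands in for any classification dataset
X, y = load_iris(return_X_y=True)

# One algorithm, several models: each n_neighbors value gives a different classifier
knn_family = {
    k: KNeighborsClassifier(n_neighbors=k).fit(X, y)
    for k in (1, 2, 3)
}

# Likewise, each branching criterion gives a different decision tree
tree_family = {
    crit: DecisionTreeClassifier(criterion=crit, random_state=0).fit(X, y)
    for crit in ("gini", "entropy")
}

for k, model in knn_family.items():
    print("knn with n_neighbors =", model.get_params()["n_neighbors"])
for crit, model in tree_family.items():
    print("tree with criterion =", model.get_params()["criterion"])
```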

### Difference from Parametric / Non-Parametric Models

Contrast the notion of hyperparameters with the distinction between parametric and non-parametric models.

A linear regression model is parametric in the sense that we start with a given model *form* and then search for the optimal parameters to fill in that form. But *those* parameters are not the sort we might tweak to improve model performance. On the contrary, for a given training set and loss function there is one best set of parameters, and training the model is a matter of finding those optimal values. Hyperparameters, by contrast, are set before training begins and are not learned from the data.
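To make the contrast concrete, here is a minimal sketch using ridge regression, a regularized linear regression swapped in because it has an obvious hyperparameter (`alpha`). The data are synthetic and made up for illustration. We choose `alpha` before fitting; the slope and intercept are then found by training.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data following y = 3x + 2 exactly (an assumption for illustration)
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3 * X.ravel() + 2

learned = {}
for alpha in (0.1, 100.0):
    # alpha is a hyperparameter: we pick it before training starts
    model = Ridge(alpha=alpha).fit(X, y)
    # coef_ and intercept_ are parameters: training finds them for us
    learned[alpha] = (model.coef_[0], model.intercept_)

for alpha, (slope, intercept) in learned.items():
    print(f"alpha={alpha}: slope={slope:.3f}, intercept={intercept:.3f}")
```

Each value of `alpha` leads to its own single best set of fitted parameters, which is why tuning means comparing whole models rather than adjusting coefficients by hand.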

## Trying Different Values

In [None]:
penguins = sns.load_dataset('penguins')

penguins.info()

In [80]:
penguins.head()

| | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|
| 0 | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 4 | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |
| 5 | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | MALE |


### Data Prep

We'll try to predict species given the other columns' values. Let's dummy-out `island` and `sex`:

In [None]:
penguins.isna().sum().sum()

In [None]:
penguins = penguins.dropna()

In [None]:
y = penguins.pop('species')

In [157]:
X_train, X_test, y_train, y_test = train_test_split(
    penguins, y, test_size=0.5, random_state=42)

In [158]:
X_train_cat = X_train.select_dtypes('object')

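# Note: in scikit-learn >= 1.2 the `sparse` argument is named `sparse_output`,
# and `get_feature_names()` has been replaced by `get_feature_names_out()`.
ohe = OneHotEncoder(
    drop='first',
    sparse=False)

dums = ohe.fit_transform(X_train_cat)
dums_df = pd.DataFrame(dums,
                       columns=ohe.get_feature_names(),
                       index=X_train_cat.index)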

In [159]:
X_train_nums = X_train.select_dtypes('float64')

ss = StandardScaler()

ss.fit(X_train_nums)
nums_df = pd.DataFrame(ss.transform(X_train_nums),
                      index=X_train_nums.index)

In [160]:
X_train_clean = pd.concat([nums_df, dums_df], axis=1)

In [162]:
X_train_clean.head()

| | 0 | 1 | 2 | 3 | x0_Dream | x0_Torgersen | x1_MALE |
|---|---|---|---|---|---|---|---|
| 160 | 0.362748 | 0.903276 | -0.472344 | -0.094599 | 1.0 | 0.0 | 0.0 |
| 237 | 0.973499 | -0.977375 | 1.408317 | 2.512546 | 0.0 | 0.0 | 1.0 |
| 2 | -0.725152 | 0.44582 | -0.472344 | -1.185963 | 0.0 | 1.0 | 0.0 |
| 121 | -1.221387 | 1.360731 | -0.255345 | -0.882806 | 0.0 | 1.0 | 1.0 |
| 179 | 1.030757 | 0.954104 | -0.110678 | -0.519018 | 1.0 | 0.0 | 1.0 |


In [172]:
knn_model = KNeighborsClassifier()

knn_model.fit(X_train_clean, y_train)

KNeighborsClassifier()

In [173]:
scores = cross_val_score(estimator=knn_model, X=X_train_clean,
               y=y_train, cv=10)
scores

array([1.        , 1.        , 1.        , 1.        , 1.        ,
       0.94117647, 1.        , 1.        , 1.        , 1.        ])

In [174]:
np.median(scores)

1.0

### Preparing the Test Set

In [175]:
X_test_cat = X_test.select_dtypes('object')

test_dums = ohe.transform(X_test_cat)
test_dums_df = pd.DataFrame(test_dums,
                       columns=ohe.get_feature_names(),
                      index=X_test_cat.index)

In [176]:
X_test_nums = X_test.select_dtypes('float64')

test_nums = ss.transform(X_test_nums)
test_nums_df = pd.DataFrame(test_nums,
                           index=X_test_nums.index)

In [177]:
X_test_clean = pd.concat([test_nums_df,
                 test_dums_df], axis=1)

### $k$-Nearest Neighbors Model

In [178]:
knn_model.score(X_test_clean, y_test)

0.9880239520958084

#### Decreasing $k$

In [179]:
knn3 = KNeighborsClassifier(n_neighbors=3)

knn3.fit(X_train_clean, y_train)

KNeighborsClassifier(n_neighbors=3)

In [180]:
knn3.score(X_test_clean, y_test)

0.9940119760479041

### Decision Tree

In [None]:
ct = DecisionTreeClassifier(random_state=10)

ct.fit(X_train_clean, y_train)

In [None]:
ct.score(X_test_clean, y_test)

#### Changing the branching criterion

In [None]:
ct = DecisionTreeClassifier(criterion='entropy',
                          random_state=10)

ct.fit(X_train_clean, y_train)

In [None]:
ct.score(X_test_clean, y_test)

## Gridsearching

It's not a bad idea to experiment with the values of your models' hyperparameters a bit as you're getting a feel for your models' performance. But there are more systematic ways of going about the search for optimal hyperparameters.

One method of hyperparameter tuning is **gridsearching**. The idea is to build multiple models with different hyperparameter values and then see which one performs the best. The hyperparameters and the values to try form a sort of *grid* along which we are looking for the best performance. For example:


    n_neighbors | metric      | weights
    ------------|-------------|-----------
    1           | 'minkowski' | 'uniform'
    3           | 'manhattan' | 'distance'
    5           |             |
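Before reaching for a dedicated tool, the idea can be sketched by hand: loop over every combination in the grid, cross-validate a model for each, and keep the best. This is only an illustration of the concept, and the iris data is an assumption standing in for the training set.

```python
from itertools import product

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: iris stands in for the training data
X, y = load_iris(return_X_y=True)

n_neighbors_values = [1, 3, 5]
metric_values = ["minkowski", "manhattan"]
weights_values = ["uniform", "distance"]

results = {}
for k, metric, weights in product(n_neighbors_values, metric_values, weights_values):
    model = KNeighborsClassifier(n_neighbors=k, metric=metric, weights=weights)
    # Mean accuracy over five cross-validation folds for this combination
    results[(k, metric, weights)] = cross_val_score(model, X, y, cv=5).mean()

# The combination with the highest mean cross-validated accuracy
best_combo = max(results, key=results.get)
print(best_combo, results[best_combo])
```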

Scikit-Learn has a `GridSearchCV` class whose `fit()` method runs this procedure. Note that this can be quite computationally expensive since:

- A model is constructed for each combination of hyperparameter values that we input; and
- Each model is cross-validated.
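A quick way to estimate the cost up front is to count the combinations with `ParameterGrid` and multiply by the number of folds. The grid below is a hypothetical one for a decision tree, chosen only for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid for a decision tree (not the grid used in this lesson)
tree_grid = {
    "max_depth": [2, 4, 6, 8],
    "criterion": ["gini", "entropy"],
}
cv_folds = 5

# Every combination of the listed values: 4 * 2
n_combinations = len(ParameterGrid(tree_grid))
# Each combination is fit once per cross-validation fold
n_fits = n_combinations * cv_folds
print(n_combinations, n_fits)
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best combination on the full training set.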

### `GridSearchCV`

In [None]:
# Define the parameter grid

grid = {
    'n_neighbors': [1, 3, 5],
    'metric': ['minkowski', 'manhattan'],
    'weights': ['uniform', 'distance']
}

**Question: How many models will we be constructing with this grid?**

In [None]:
# Initialize the gridsearch object with five-fold cross-validation

gs = GridSearchCV(estimator=knn_model, param_grid=grid, cv=5)

In [None]:
gs.fit(X_train_clean, y_train)

In [None]:
gs.best_params_

In [None]:
gs.best_score_

In [None]:
gs.best_estimator_.score(X_test_clean, y_test)

In [None]:
gs.cv_results_
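The `cv_results_` dictionary is easier to read as a DataFrame. The sketch below assumes a small search on the built-in iris data, not the penguins search above, and keeps only a few of the available columns.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Assumption: iris stands in for the training data
X, y = load_iris(return_X_y=True)

search = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5], "weights": ["uniform", "distance"]},
    cv=5,
)
search.fit(X, y)

# One row per hyperparameter combination
results_df = pd.DataFrame(search.cv_results_)
summary = results_df[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(summary)
```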

### Choice of Grid Values

Which values should you pick for your grid? Intuitively, you should try both "large" and "small" values, but of course what counts as large and small will really depend on the type of hyperparameter.

- For a $k$-nearest neighbors model, 1 or 3 would be a small value for the number of neighbors and 15 or 17 would be a large value.

- For a decision tree model, what counts as a small `max_depth` will really depend on the size of your training data. A `max_depth` of 5 would likely have little effect on a very small dataset but, at the same time, it would probably significantly decrease the variance of a model where the dataset is large.

- For a logistic regression's regularization constant, you may want to try a set of values that are exponentially separated, like \[1, 10, 100, 1000\].

- **If a gridsearch finds optimal values at the ends of your hyperparameter ranges, you might try another gridsearch with more extreme values.**
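A sketch of the last two points: `np.logspace` generates exponentially separated candidates, and a simple check flags a best value that sits at the edge of the range. The iris data and the specific range are assumptions for illustration. Note that scikit-learn's `C` is the inverse of the regularization strength.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Assumption: iris stands in for the training data
X, y = load_iris(return_X_y=True)

# Exponentially separated candidates: 0.01, 0.1, 1, 10, 100
c_values = np.logspace(-2, 2, num=5)

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": c_values},
    cv=5,
)
search.fit(X, y)

best_c = search.best_params_["C"]
# If the winner is the smallest or largest candidate, the range may be too narrow
at_edge = best_c in (c_values[0], c_values[-1])
print(best_c, "-> consider widening the range" if at_edge else "-> inside the range")
```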

## Exercise

Do a GridSearch on a **decision tree model** of penguin species. What are the optimal values for the hyperparameters you've chosen?

## GridSearching a Pipeline

In [187]:
ss_pipe = Pipeline(steps=[
    ('ss', StandardScaler())
])
                
ohe_pipe = Pipeline(steps=[
    ('ohe', OneHotEncoder(drop='first',
                         sparse=False))
])

trans = ColumnTransformer(transformers=[
    ('ss', ss_pipe, X_train_nums.columns),
    ('ohe', ohe_pipe, X_train_cat.columns)
])

In [194]:
trans.fit(X_train)

ColumnTransformer(transformers=[('ss',
                                 Pipeline(steps=[('ss', StandardScaler())]),
                                 Index(['culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm',
       'body_mass_g'],
      dtype='object')),
                                ('ohe',
                                 Pipeline(steps=[('ohe',
                                                  OneHotEncoder(drop='first',
                                                                sparse=False))]),
                                 Index(['island', 'sex'], dtype='object'))])

In [195]:
X_train_tr = trans.transform(X_train)

In [196]:
model_pipe = Pipeline(steps=[
    ('knn', KNeighborsClassifier())
])

In [197]:
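# A classifier needs the target as well. Calling fit(X_train_tr) without
# y_train raised: TypeError: check_array() got an unexpected keyword
# argument 'multi_output'
model_pipe.fit(X_train_tr, y_train)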
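The cells above fit the transformer and the model in separate steps. One way to gridsearch a pipeline is to put the preprocessing and the model into a single `Pipeline` and address hyperparameters as `<step name>__<parameter>`. The following is a minimal sketch under stated assumptions: the DataFrame is synthetic, and its column names, values, and target are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the penguins data (hypothetical columns and values)
rng = np.random.default_rng(42)
n = 120
toy = pd.DataFrame({
    "length_mm": rng.normal(40, 5, n),
    "mass_g": rng.normal(4000, 500, n),
    "island": np.tile(["A", "B", "C"], n // 3),
    "sex": np.tile(["MALE", "FEMALE"], n // 2),
})
# Hypothetical target that depends on one numeric column
target = np.where(toy["length_mm"] > 40, "long", "short")

# Scale the numeric columns and one-hot encode the categorical ones
preprocess = ColumnTransformer(transformers=[
    ("ss", StandardScaler(), ["length_mm", "mass_g"]),
    ("ohe", OneHotEncoder(drop="first"), ["island", "sex"]),
])

full_pipe = Pipeline(steps=[
    ("prep", preprocess),
    ("knn", KNeighborsClassifier()),
])

# Keys follow the pattern <step name>__<parameter name>
pipe_grid = {
    "knn__n_neighbors": [1, 3, 5],
    "knn__weights": ["uniform", "distance"],
}

gs_pipe = GridSearchCV(estimator=full_pipe, param_grid=pipe_grid, cv=5)
gs_pipe.fit(toy, target)
print(gs_pipe.best_params_, gs_pipe.best_score_)
```

Because the preprocessing lives inside the pipeline, it is refit on the training portion of each fold, so the validation portion never influences the scaler or the encoder.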

## Exercise

Use a Random Forest Classifier to predict the category of price range for the phones in this dataset. Try tuning some hyperparameters using GridSearch, and then write up a short paragraph about your findings.

In [None]:
phones_train = pd.read_csv('data/train.csv')

phones_test = pd.read_csv('data/test.csv')