# Machine Learning with Python

In [164]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## 3.2 Hyperparameter search

To improve the performance of a particular model, we need to tune its *hyperparameters*, i.e. the parameters that are not learned from the data but set before training. Cross-validation lets us sweep a range of candidate hyperparameter values and find the combinations that perform well across the training data as a whole, rather than on one particular split.

Once we have decided on the best values for hyperparameters, we can train a final model on the *entire* training dataset and evaluate on the testing data for an independent assessment of performance.
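The workflow described above can be sketched by hand before reaching for any helper class. This is a minimal illustration under stated assumptions: the data are synthetic (generated with `make_classification`, not the dataset used in this section), and the names `X_demo`, `candidates` and `mean_scores` are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption, not the dataset used in this section)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0, stratify=y_demo)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
candidates = [1, 5, 10]  # hypothetical values of k to compare
mean_scores = {}

for k in candidates:
    fold_scores = []
    for fit_idx, val_idx in cv.split(X_tr, y_tr):
        # Fit on four folds, score on the held-out fold
        model = KNeighborsClassifier(n_neighbors=k).fit(X_tr[fit_idx], y_tr[fit_idx])
        fold_scores.append(f1_score(y_tr[val_idx], model.predict(X_tr[val_idx])))
    mean_scores[k] = float(np.mean(fold_scores))

# Pick the best k, refit on the entire training set, then assess once on the test set
best_k = max(mean_scores, key=mean_scores.get)
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
test_f1 = f1_score(y_te, final_model.predict(X_te))
print(mean_scores, best_k, test_f1)
```

The test set is touched only once, after the hyperparameter has been chosen, so `test_f1` remains an independent estimate.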

Let's look at another classification dataset. Here we are attempting to distinguish between nasal and oral vowel sounds, using the amplitudes of the first five harmonics.

In [11]:
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import StandardScaler

phoneme = fetch_openml(name='phoneme', version=1, parser='auto')
X, y = phoneme.data.to_numpy(), phoneme.target.to_numpy()

In [33]:
X.shape

(5404, 5)

We could try a KNN classifier:

In [96]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, random_state=10)

# We will use a smaller training set to make the problem harder
X_train_ = X_train[:100]
y_train_ = y_train[:100]
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train_, y_train_)

Let's assess using the [F1 score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score).

In [97]:
y_pred = knn.predict(X_test)

from sklearn.metrics import f1_score
f1_score(y_test, y_pred, pos_label='1')

0.8310077519379844
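As a reminder of what this number measures, the F1 score is the harmonic mean of precision and recall for the chosen positive class. The sketch below checks that relationship on a handful of made-up labels; the arrays `y_true_demo` and `y_hat_demo` are illustrative and are not taken from the phoneme data.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up string labels, with '1' treated as the positive class
y_true_demo = np.array(['1', '1', '1', '2', '2', '1', '2', '1'])
y_hat_demo = np.array(['1', '2', '1', '2', '1', '1', '2', '1'])

precision = precision_score(y_true_demo, y_hat_demo, pos_label='1')
recall = recall_score(y_true_demo, y_hat_demo, pos_label='1')

# Harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)
f1_library = f1_score(y_true_demo, y_hat_demo, pos_label='1')
print(precision, recall, f1_manual, f1_library)
```

Here four of the five predicted positives are correct and four of the five true positives are found, so precision, recall and F1 all come out at 0.8.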

Looks good - but could we do better with a different value of *k*?

We can do an exhaustive search of the hyperparameter space using [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).

In [105]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import make_scorer

parameters = {'n_neighbors':[1, 2, 5, 10, 20, 50]}
predictor = KNeighborsClassifier()
gs = GridSearchCV(predictor, 
                  parameters, 
                  cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=40),
                  scoring=make_scorer(f1_score, pos_label='1')
                  )
gs.fit(X_train_,y_train_)

This gives us some detailed results for each of the 5 splits:

In [106]:
gs.cv_results_

{'mean_fit_time': array([0.00146031, 0.0005794 , 0.00064697, 0.00054865, 0.00059581,
        0.00043845]),
 'std_fit_time': array([1.07154686e-03, 4.59482522e-05, 9.00839416e-05, 1.57957662e-05,
        7.89004854e-05, 2.69423544e-05]),
 'mean_score_time': array([0.00349474, 0.00276632, 0.0026258 , 0.00241842, 0.00243864,
        0.00235233]),
 'std_score_time': array([0.00088658, 0.00051437, 0.00037997, 0.00014071, 0.00014065,
        0.00058743]),
 'param_n_neighbors': masked_array(data=[1, 2, 5, 10, 20, 50],
              mask=[False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'n_neighbors': 1},
  {'n_neighbors': 2},
  {'n_neighbors': 5},
  {'n_neighbors': 10},
  {'n_neighbors': 20},
  {'n_neighbors': 50}],
 'split0_test_score': array([0.84615385, 0.82758621, 0.85714286, 0.85714286, 0.85714286,
        0.78787879]),
 'split1_test_score': array([0.75      , 0.81481481, 0.84615385, 0.88888889, 0.84615385,
        0.78787879]),
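
The raw dictionary is hard to scan. One common option is to load it into a pandas `DataFrame` and keep only the columns of interest. The sketch below assumes synthetic data and a separate search object named `search`, so that it runs on its own; the column names are the ones scikit-learn generates for a parameter called `n_neighbors`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption, not the phoneme dataset)
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)

search = GridSearchCV(KNeighborsClassifier(),
                      {'n_neighbors': [1, 2, 5, 10, 20, 50]},
                      cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=40),
                      scoring='f1')
search.fit(X_demo, y_demo)

# One row per candidate; keep the summary columns and sort by rank
results = pd.DataFrame(search.cv_results_)
summary = results[['param_n_neighbors', 'mean_test_score',
                   'std_test_score', 'rank_test_score']].sort_values('rank_test_score')
print(summary)
```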


It will also report the best parameter values found:

In [108]:
gs.best_params_

{'n_neighbors': 5}

In [102]:
gs.best_score_

0.8773736263736265

### Multi-parameter searches

Let's try a more complex example on the same dataset. Support Vector Machines have several hyperparameters that could be varied - for example, in addition to the kernel function itself, we have a regularisation parameter `C` to tune (a positive real value).

`GridSearchCV` makes it easy for us to explore the space of possible hyperparameter values and choose the best combination.

In [120]:
from sklearn.svm import SVC

parameters = {'kernel':('linear', 'rbf', 'poly'), 'C':[0.01, 0.1, 1, 10, 100]}
predictor = SVC()
gs = GridSearchCV(predictor, 
                  parameters, 
                  cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=40),
                  scoring=make_scorer(f1_score, pos_label='1')
                  )
gs.fit(X_train_,y_train_)


In [122]:
gs.cv_results_['params']

[{'C': 0.01, 'kernel': 'linear'},
 {'C': 0.01, 'kernel': 'rbf'},
 {'C': 0.01, 'kernel': 'poly'},
 {'C': 0.1, 'kernel': 'linear'},
 {'C': 0.1, 'kernel': 'rbf'},
 {'C': 0.1, 'kernel': 'poly'},
 {'C': 1, 'kernel': 'linear'},
 {'C': 1, 'kernel': 'rbf'},
 {'C': 1, 'kernel': 'poly'},
 {'C': 10, 'kernel': 'linear'},
 {'C': 10, 'kernel': 'rbf'},
 {'C': 10, 'kernel': 'poly'},
 {'C': 100, 'kernel': 'linear'},
 {'C': 100, 'kernel': 'rbf'},
 {'C': 100, 'kernel': 'poly'}]
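These 15 candidates are the Cartesian product of the two value lists. scikit-learn exposes the same enumeration through `ParameterGrid`, which is a convenient way to check how many fits a search will cost before running it. A small sketch (the variable names `combos` and `n_cv_fits` are illustrative):

```python
from sklearn.model_selection import ParameterGrid

grid = ParameterGrid({'kernel': ('linear', 'rbf', 'poly'),
                      'C': [0.01, 0.1, 1, 10, 100]})
combos = list(grid)

# Each candidate is fitted once per cross-validation split
n_splits = 5
n_cv_fits = len(combos) * n_splits
print(len(combos), n_cv_fits)
print(combos[0])
```

With five splits this comes to 75 fits, plus one more if the search refits the best candidate on all the training data. The cost grows multiplicatively with every hyperparameter added to the grid.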

In [125]:
gs.cv_results_['mean_test_score']

array([0.80926916, 0.80926916, 0.80926916, 0.88679487, 0.80926916,
       0.83518975, 0.87817949, 0.86250408, 0.83615908, 0.87817949,
       0.84528897, 0.82254022, 0.87817949, 0.79934644, 0.79475783])

In [126]:
gs.best_params_

{'C': 0.1, 'kernel': 'linear'}

In [127]:
gs.best_score_

0.8867948717948717

To train the final model, we could then refit on the training data using the best parameters:

In [137]:
final = SVC(**gs.best_params_)
final.fit(X_train_,y_train_)


In [139]:
y_pred = final.predict(X_test)
f1_score(y_test, y_pred, pos_label='1')

0.8250265111346766

Slightly disappointing? Note that cross-validation can still overestimate performance on unseen data. This is why it is important to keep a final test dataset aside to obtain a convincing assessment.
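As an aside, with its default setting `refit=True`, `GridSearchCV` already refits the best candidate on all the data passed to `fit` and exposes it as `best_estimator_`, so the manual refit is optional. The sketch below illustrates this on synthetic data (an assumption; the names `search` and `best_model` are made up for the example) and compares the cross-validated score with the test score.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption, not the phoneme dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1, stratify=y_demo)

search = GridSearchCV(SVC(),
                      {'kernel': ['linear', 'rbf'], 'C': [0.1, 1, 10]},
                      cv=5, scoring='f1')
search.fit(X_tr, y_tr)

# The refitted model, trained on all of X_tr with the best parameters
best_model = search.best_estimator_

# search.predict delegates to best_estimator_
cv_f1 = search.best_score_
test_f1 = f1_score(y_te, search.predict(X_te))
print(search.best_params_, cv_f1, test_f1)
```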

### Using GridSearchCV with a pipeline

When we have preprocessing steps to consider, the process becomes a little more complex. Remember that the transformations must be learned from the training portion of *each split* only; otherwise, information from the held-out fold leaks into the model. The pipeline can help here.
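To make the per-split behaviour concrete, the sketch below cross-validates a small pipeline and then reproduces the first fold by hand, fitting the scaler on the training portion only. It assumes synthetic regression data and illustrative names (`pipe_demo`, `manual_r2`); it is not the pipeline built later in this section.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption)
X_demo, y_demo = make_regression(n_samples=120, n_features=6, noise=10.0, random_state=0)

pipe_demo = Pipeline([('scaling', StandardScaler()),
                      ('predict', Lasso(alpha=0.1))])
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# The whole pipeline is cloned and refitted inside every split
scores = cross_val_score(pipe_demo, X_demo, y_demo, cv=cv, scoring='r2')

# The same computation for the first split, written out by hand
fit_idx, val_idx = next(cv.split(X_demo))
scaler = StandardScaler().fit(X_demo[fit_idx])  # learned from the training part only
model = Lasso(alpha=0.1).fit(scaler.transform(X_demo[fit_idx]), y_demo[fit_idx])
manual_r2 = model.score(scaler.transform(X_demo[val_idx]), y_demo[val_idx])
print(scores[0], manual_r2)
```

The two numbers agree, which shows that the held-out fold never contributes to the scaling statistics.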

We will go back to the *autoMpg* regression dataset.

In [239]:
from sklearn.datasets import fetch_openml
mpg = fetch_openml(name='autoMpg', version=1, parser='auto')

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(mpg.data, mpg.target, random_state=0)


This time we will add a LASSO predictor to the workflow.

In [240]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

# Defines preprocessing transformations for specified columns
ct = ColumnTransformer([('encode', OneHotEncoder(), ['origin']),
                        ('impute', IterativeImputer(), ['horsepower'])],
                       remainder='passthrough') 

# Defines individual steps in a workflow
pipe = Pipeline([('preprocessing', ct),
                 ('scaling', StandardScaler()),
                 ('predict', Lasso())])



In [241]:
# Note how we link the hyperparameter to the specific pipeline step
parameters = {'predict__alpha':[0.001,0.01, 0.1, 1, 10, 100]}

gs = GridSearchCV(pipe, 
                  parameters, 
                  cv=5,
                  scoring='r2'
                  )
gs.fit(X_train,y_train)

In [242]:
gs.best_params_

{'predict__alpha': 0.01}

In [243]:
gs.best_score_

0.8102035638516194
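The `step__parameter` naming used in the grid follows the names given to the pipeline steps. If in doubt about which names are valid, `get_params()` lists them all. A short sketch with a reduced two-step pipeline (`pipe_demo` is an illustrative name, not the pipeline defined above):

```python
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe_demo = Pipeline([('scaling', StandardScaler()),
                      ('predict', Lasso())])

# All tunable names belonging to the 'predict' step
lasso_params = sorted(name for name in pipe_demo.get_params()
                      if name.startswith('predict__'))
print(lasso_params)

# The same naming works for setting a value directly
pipe_demo.set_params(predict__alpha=0.01)
print(pipe_demo.named_steps['predict'].alpha)
```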

### Preprocessing steps can also have hyperparameters

The problem is currently fairly easy as there are only seven features to consider:

In [244]:
X_train

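| | cylinders | displacement | horsepower | weight | acceleration | model | origin |
|---|---|---|---|---|---|---|---|
| 254 | 6 | 200.0 | 85.0 | 2965 | 15.8 | 78 | 1 |
| 246 | 4 | 78.0 | 52.0 | 1985 | 19.4 | 78 | 3 |
| 282 | 4 | 140.0 | 88.0 | 2890 | 17.3 | 79 | 1 |
| 81 | 4 | 97.0 | 92.0 | 2288 | 17.0 | 72 | 3 |
| 347 | 4 | 85.0 | 65.0 | 1975 | 19.4 | 81 | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 323 | 4 | 156.0 | 105.0 | 2800 | 14.4 | 80 | 1 |
| 192 | 6 | 250.0 | 105.0 | 3353 | 14.5 | 76 | 1 |
| 117 | 4 | 68.0 | 49.0 | 1867 | 19.5 | 73 | 2 |
| 47 | 6 | 250.0 | 100.0 | 3282 | 15.0 | 71 | 1 |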


Let's add a load of noisy random features to make things more difficult:

In [245]:
n_samples, n_features = X_train.shape
random_state = np.random.RandomState(12)
random_data = random_state.randn(n_samples, 300 * n_features)
X = pd.concat([X_train.reset_index(drop=True), 
               pd.DataFrame(random_data)], 
               axis=1)
X.columns = X.columns.astype(str)
X.head()

| | cylinders | displacement | horsepower | weight | acceleration | model | origin | 0 | 1 | 2 | ... | 2090 | 2091 | 2092 | 2093 | 2094 | 2095 | 2096 | 2097 | 2098 | 2099 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 200.0 | 85.0 | 2965 | 15.8 | 78 | 1 | 0.472986 | -0.681426 | 0.242439 | ... | -1.340093 | -0.587587 | 0.600642 | 0.805197 | -1.085063 | -0.228847 | -0.37855 | -1.178255 | 0.05937 | -0.228445 |
| 1 | 4 | 78.0 | 52.0 | 1985 | 19.4 | 78 | 3 | 1.33388 | -0.980664 | 0.692041 | ... | 0.872736 | -1.300653 | 0.108348 | 1.164079 | -0.677696 | -0.626215 | -0.567621 | -0.113455 | 0.789805 | -0.305536 |
| 2 | 4 | 140.0 | 88.0 | 2890 | 17.3 | 79 | 1 | -1.270399 | -0.615653 | 0.88308 | ... | 1.255203 | -0.550017 | 1.184837 | 0.241276 | -0.190719 | -0.492024 | -0.051984 | -0.579174 | 1.207335 | -1.422566 |
| 3 | 4 | 97.0 | 92.0 | 2288 | 17.0 | 72 | 3 | 1.167386 | 2.120115 | 0.757982 | ... | 1.238595 | 0.132433 | -0.920043 | 0.269289 | -1.622019 | 1.425564 | 0.476196 | -0.082289 | 0.508878 | 0.251893 |
| 4 | 4 | 85.0 | 65.0 | 1975 | 19.4 | 81 | 3 | -2.009391 | -1.025001 | -0.335708 | ... | 1.602066 | 0.01014 | 0.356846 | 1.172071 | 0.573408 | 0.867174 | 0.041964 | -0.167523 | 0.050252 | -0.422824 |


In [246]:
gs.fit(X,y_train)

  model = cd_fast.enet_coordinate_descent(


In [247]:
gs.best_score_

0.7789378581082992

In this situation, a dimensionality reduction step would help to reduce the noise. 
Let's include a PCA step in the pipeline:

In [248]:
from sklearn.decomposition import PCA

pipe2 = Pipeline([('preprocessing', ct),
                  ('scaling', StandardScaler()),
                  ('reduce', PCA()),
                  ('predict', Lasso())])


The number of PCA components is now a hyperparameter, so let's include it:

In [249]:
# Note how we link the hyperparameter to the specific pipeline step
parameters = {'predict__alpha':[0.001,0.01,0.1,1,10,100],
              'reduce__n_components':[2,3,4,5,6,7,8,9]}

gs = GridSearchCV(pipe2, 
                  parameters, 
                  cv=5,
                  scoring='r2'
                  )
# Note: this fits on the original X_train; pass X instead to search over the noisy feature set
gs.fit(X_train,y_train)

In [250]:
gs.cv_results_

{'mean_fit_time': array([0.01019869, 0.00831728, 0.00800862, 0.00775857, 0.007865  ,
        0.00779161, 0.0078568 , 0.00921402, 0.00925512, 0.00798068,
        0.00796661, 0.00830226, 0.04195623, 0.02780981, 0.00971198,
        0.0089776 , 0.00831575, 0.00825872, 0.00931253, 0.00828962,
        0.0111084 , 0.01102295, 0.00862322, 0.00829892, 0.00872869,
        0.01008477, 0.00875278, 0.00879421, 0.00800619, 0.0079823 ,
        0.0078836 , 0.00807805, 0.0088304 , 0.00886965, 0.00857906,
        0.00801988, 0.00794373, 0.00898199, 0.00853543, 0.00884447,
        0.00793543, 0.00784168, 0.00863862, 0.0093658 , 0.00861864,
        0.00837717, 0.00793056, 0.00803571]),
 'std_fit_time': array([0.00217043, 0.00025097, 0.00027796, 0.00014662, 0.0002411 ,
        0.00037924, 0.00033176, 0.00111292, 0.00070012, 0.00037348,
        0.00050547, 0.00060468, 0.06237966, 0.01107044, 0.00253296,
        0.00048078, 0.00034415, 0.00074727, 0.00065211, 0.00057406,
        0.00278478, 0.00221665, 0.000

In [251]:
gs.best_score_

0.8101858488919484

In [252]:
gs.best_params_

{'predict__alpha': 0.01, 'reduce__n_components': 8}
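With two hyperparameters in play, it can help to lay the mean scores out as a table with one hyperparameter per axis, rather than reading the flat arrays in `cv_results_`. The sketch below does this on synthetic regression data with a smaller grid; the data, the grid values and the names `search` and `score_table` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the autoMpg dataset)
X_demo, y_demo = make_regression(n_samples=150, n_features=10, n_informative=4,
                                 noise=5.0, random_state=0)

pipe_demo = Pipeline([('scaling', StandardScaler()),
                      ('reduce', PCA()),
                      ('predict', Lasso())])
grid = {'predict__alpha': [0.01, 0.1, 1],
        'reduce__n_components': [2, 4, 8]}
search = GridSearchCV(pipe_demo, grid, cv=5, scoring='r2').fit(X_demo, y_demo)

# One row per candidate, then pivot to an n_components x alpha table of mean R^2
rows = pd.DataFrame(search.cv_results_['params'])
rows['mean_r2'] = search.cv_results_['mean_test_score']
score_table = rows.pivot(index='reduce__n_components',
                         columns='predict__alpha',
                         values='mean_r2')
print(score_table)
```

The largest entry in the table corresponds to `best_score_`, and its row and column labels to `best_params_`.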