# Hyper-parameter tuning

First, let's fetch the "titanic" dataset directly from OpenML.

In [1]:
import pandas as pd

In this dataset, missing values are encoded with the character `"?"`. We tell pandas about this when reading the CSV file.

In [2]:
df = pd.read_csv(
    "https://www.openml.org/data/get_csv/16826755/phpMYEkMl.csv",
    na_values='?'
)
df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
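
To see the effect of `na_values` in isolation, here is a minimal sketch on a small in-memory CSV. The three-row table and the names `csv_text`, `raw` and `clean` are made up for illustration; only the `"?"` convention comes from the dataset above.

```python
import io

import pandas as pd

# Hypothetical three-row CSV; "?" marks a missing value, as in the OpenML file.
csv_text = "age,embarked\n29,S\n?,C\n2,?\n"

raw = pd.read_csv(io.StringIO(csv_text))
clean = pd.read_csv(io.StringIO(csv_text), na_values="?")

# Without na_values, the "?" entry prevents "age" from being parsed as numbers.
print(raw["age"].dtype)
# With na_values, "age" is numeric and every "?" becomes NaN.
print(clean["age"].dtype)
print(clean.isna().sum())
```

Reading the column as numeric matters later on, since the mean imputation of `age` and `fare` only works on numerical data.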


The classification task is to predict whether or not a person will survive the Titanic disaster.

In [3]:
X_df = df.drop(columns='survived')
y = df['survived']

We will split the data into a training and a testing set.

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_df, y, random_state=42, stratify=y
)
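
The `stratify=y` argument asks for the class proportions of the target to be preserved in both splits. The sketch below illustrates this on a made-up imbalanced target (`X_toy` and `y_toy` are hypothetical and unrelated to the Titanic data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 70 zeros and 30 ones.
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=42, stratify=y_toy
)

# With stratify, both splits should keep roughly the 30% positive rate.
print(f"positive rate (train): {y_tr.mean():.2f}")
print(f"positive rate (test):  {y_te.mean():.2f}")
```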

## The typical machine-learning pipeline

The Titanic dataset contains mixed data types (i.e. numerical and categorical data). Therefore, we need to define a preprocessing pipeline for each data type and use a `ColumnTransformer` to process each type separately.

First, let's define the lists of columns according to their data types.

In [5]:
num_cols = ['age', 'fare']
cat_col = ['sex', 'embarked', 'pclass']

Then, define the two preprocessing pipelines.

In [6]:
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

# Some categories are rare and may be absent from a given subset of the
# data, so we specify the full list of categories in advance.
categories = [X_df[column].unique() for column in X_df[cat_col]]
# Replace NaN entries by the 'missing' label that the imputer below inserts.
for cat in categories:
    for idx, elt in enumerate(cat):
        if not isinstance(elt, str) and np.isnan(elt):
            cat[idx] = 'missing'

# define the pipelines
cat_pipe = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='missing'),
    OrdinalEncoder(categories=categories)
)
# For the numerical columns:
num_pipe = SimpleImputer(strategy='mean')

Combine both preprocessing pipelines using a `ColumnTransformer`.

In [7]:
from sklearn.compose import ColumnTransformer
preprocessing = ColumnTransformer(
    [('cat_preprocessor', cat_pipe, cat_col),
     ('num_preprocessor', num_pipe, num_cols)]
)

In [8]:
preprocessing.fit_transform(X_train)
# The encoded categories come first; the last 2 columns are the numerical
# values, because the numerical preprocessor is listed second.

array([[  0.        ,   3.        ,   2.        ,  29.96344847,
          7.7333    ],
       [  0.        ,   3.        ,   2.        ,  29.96344847,
          7.75      ],
       [  0.        ,   1.        ,   2.        ,  38.        ,
          7.2292    ],
       ...,
       [  1.        ,   0.        ,   1.        ,  34.        ,
         13.        ],
       [  1.        ,   0.        ,   2.        ,  22.        ,
          8.05      ],
       [  0.        ,   0.        ,   0.        ,   2.        ,
        151.55      ]])
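
The column order of the output follows the order of the transformers, not the order of the columns in the input frame. A minimal sketch on a hypothetical two-column frame (`toy` and `toy_preprocessing` are illustrative names) makes this visible:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame: the numerical column comes first in the input.
toy = pd.DataFrame({
    "fare": [10.0, np.nan, 30.0],
    "sex": ["male", "female", "male"],
})

toy_preprocessing = ColumnTransformer([
    ("cat", OrdinalEncoder(), ["sex"]),
    ("num", SimpleImputer(strategy="mean"), ["fare"]),
])
out = toy_preprocessing.fit_transform(toy)

# Column 0 holds the encoded "sex"; column 1 holds "fare" with the
# missing value replaced by the mean of the observed fares.
print(out)
```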

Finally, let's create a pipeline made of the preprocessor and a random forest classifier.

In [9]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

model = Pipeline([
    ('preprocessing', preprocessing),
    ('clf', RandomForestClassifier(n_jobs=-1, random_state=42))
])

In [10]:
_ = model.fit(X_train, y_train)



In [11]:
model.score(X_test, y_test)

0.7835365853658537

## Influence of parameter tuning

Machine-learning algorithms rely on parameters that affect the performance of the final model. Scikit-learn provides default values for these parameters. However, using the defaults does not necessarily lead to the model with the best performance.

Let's set some parameters that may change the performance of the classifier.

In [12]:
model.get_params()

{'memory': None,
 'steps': [('preprocessing',
   ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
                     transformer_weights=None,
                     transformers=[('cat_preprocessor',
                                    Pipeline(memory=None,
                                             steps=[('simpleimputer',
                                                     SimpleImputer(add_indicator=False,
                                                                   copy=True,
                                                                   fill_value='missing',
                                                                   missing_values=nan,
                                                                   strategy='constant',
                                                                   verbose=0)),
                                                    ('ordinalencoder',
                                                     OrdinalEncoder(

In [13]:
# To change a parameter of the classifier (here, the random forest),
# call set_params with a parameter name that starts with "clf__...".
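
The `<step name>__<parameter>` naming convention can be explored with a small sketch. The two-step `toy_model` below is a hypothetical stand-in that reuses the step names `preprocessing` and `clf`; it is not the pipeline built above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Hypothetical two-step pipeline reusing the step names of this notebook.
toy_model = Pipeline([
    ("preprocessing", SimpleImputer(strategy="mean")),
    ("clf", RandomForestClassifier(random_state=42)),
])

# Keep only the parameters that belong to the "clf" step.
clf_params = sorted(
    name for name in toy_model.get_params() if name.startswith("clf__")
)
print(clf_params)

# set_params accepts the same "<step>__<parameter>" names.
toy_model.set_params(clf__n_estimators=2, clf__max_depth=2)
print(toy_model.named_steps["clf"].n_estimators)
```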

In [19]:
# For example, here we change 2 parameters: clf__n_estimators and clf__max_depth
model.set_params(clf__n_estimators=2, clf__max_depth=2)
_ = model.fit(X_train, y_train)
print(f'Accuracy score on the training data: '
      f'{model.score(X_train, y_train):.3f}')
print(f'Accuracy score on the testing data: '
      f'{model.score(X_test, y_test):.3f}')

Accuracy score on the training data: 0.757
Accuracy score on the testing data: 0.762


<div class="alert alert-success">
    <p><b>QUESTIONS</b>:</p>
    <ul>
    <li>By analyzing the training and testing scores, what can you say about the model? Is it under- or over-fitting?</li>
    </ul>
</div>

In [20]:
# In this case, the model under-fits: the training score is low and no better than the testing score.

<div class="alert alert-success">
    <p><b>QUESTIONS</b>:</p>
    <ul>
    <li>What if we don't limit the depth of the trees in the forest?</li>
    </ul>
</div>

In [21]:
# Set clf__max_depth=None so that each tree can grow fully
# -> the model over-fits

<div class="alert alert-success">
    <p><b>QUESTIONS</b>:</p>
    <ul>
    <li>And what if the forest is composed of a large number of trees, each with no depth limit?</li>
    </ul>
</div>

In [22]:
# clf__n_estimators=150, clf__max_depth=None
# The model still over-fits the training data,
# but the testing score remains acceptable.
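
The three situations discussed in the questions above can be compared side by side. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, so the numbers will differ from the Titanic scores); the comparison between training and testing scores is the point.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the Titanic dataset).
X_syn, y_syn = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, random_state=42, stratify=y_syn
)

settings = {
    "few shallow trees": dict(n_estimators=2, max_depth=2),
    "few unlimited trees": dict(n_estimators=2, max_depth=None),
    "many unlimited trees": dict(n_estimators=150, max_depth=None),
}
scores = {}
for label, params in settings.items():
    forest = RandomForestClassifier(random_state=42, **params)
    forest.fit(X_tr, y_tr)
    # Store (training accuracy, testing accuracy) for each setting.
    scores[label] = (forest.score(X_tr, y_tr), forest.score(X_te, y_te))
    print(f"{label}: train={scores[label][0]:.3f}, test={scores[label][1]:.3f}")
```

A large gap between the training and the testing score points to over-fitting, while two similarly low scores point to under-fitting.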

## Use a grid-search instead

Setting parameters by hand as above is tedious, and we cannot be sure to cover all relevant cases. Instead, we can run an automatic search that tries every combination of a set of hyper-parameter values and reports the performance of each one. One tool for such an exhaustive search is `GridSearchCV`.

With grid-search, we need to specify the set of values we wish to test. `GridSearchCV` will create a grid with all the possible combinations.

In [24]:
from sklearn.model_selection import GridSearchCV

# Define all combinations of a list of hyper-parameters
param_grid = {
    'clf__n_estimators': [5, 50, 100],
    'clf__max_depth': [3, 5, 8, None]
}
grid = GridSearchCV(model, param_grid=param_grid, n_jobs=-1, cv=5)
# n_jobs=-1: run the fits in parallel on all available CPU cores
# cv=5: each combination is evaluated with 5-fold cross-validation
# on the data passed to fit, and the 5 scores are averaged
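
To inspect the grid that will be explored, scikit-learn's `ParameterGrid` can enumerate the combinations explicitly. This is only an illustrative sketch; `toy_grid` repeats the values used in this section.

```python
from sklearn.model_selection import ParameterGrid

# Same candidate values as in the grid of this section.
toy_grid = {
    "clf__n_estimators": [5, 50, 100],
    "clf__max_depth": [3, 5, 8, None],
}
combinations = list(ParameterGrid(toy_grid))

# 3 values x 4 values = 12 combinations
print(len(combinations))
for params in combinations[:3]:
    print(params)
```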



The resulting estimator is used like any other estimator, by calling `fit`.

In [25]:
# NEVER pass the TEST set here: we are optimizing the model's hyper-parameters,
# so the search must only see the training set.
grid.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('preprocessing',
                                        ColumnTransformer(n_jobs=None,
                                                          remainder='drop',
                                                          sparse_threshold=0.3,
                                                          transformer_weights=None,
                                                          transformers=[('cat_preprocessor',
                                                                         Pipeline(memory=None,
                                                                                  steps=[('simpleimputer',
                                                                                          SimpleImputer(add_indicator=False,
                                                                                                        copy=True,


We can check the results of all combinations by looking at the `cv_results_` attribute.

In [27]:
df_results = pd.DataFrame(grid.cv_results_)
columns_to_keep = [
    'param_clf__max_depth',
    'param_clf__n_estimators',
    'mean_test_score',
    'std_test_score',
]
df_results = df_results[columns_to_keep]
df_results.sort_values(by='mean_test_score', ascending=False)

Unnamed: 0,param_clf__max_depth,param_clf__n_estimators,mean_test_score,std_test_score
7,8.0,50,0.793068,0.008497
8,8.0,100,0.792049,0.015524
5,5.0,100,0.79001,0.017566
10,,50,0.787971,0.007024
3,5.0,5,0.786952,0.016059
4,5.0,50,0.785933,0.021304
11,,100,0.785933,0.011001
6,8.0,5,0.782875,0.023322
2,3.0,100,0.779817,0.018933
1,3.0,50,0.7737,0.027918


In [41]:
grid.best_params_

{'clf__max_depth': 8, 'clf__n_estimators': 50}

In [44]:
grid.best_score_

0.7930682976554536
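
Note that `best_score_` is a mean cross-validated score computed on the training data only. A sketch of how the tuned model might then be evaluated on held-out data is given below. It assumes the default `refit=True` and uses synthetic data with a bare random forest (`toy_search` is a hypothetical name) rather than the Titanic pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption; not the Titanic dataset).
X_syn, y_syn = make_classification(n_samples=300, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, random_state=42, stratify=y_syn
)

toy_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [5, 20], "max_depth": [3, None]},
    cv=5,
)
toy_search.fit(X_tr, y_tr)  # only the training split is used for tuning

# best_score_ is the mean cross-validated score of the best combination.
print(toy_search.best_params_, f"{toy_search.best_score_:.3f}")
# With refit=True (the default), score() uses the best model,
# refit on the whole training split.
test_accuracy = toy_search.score(X_te, y_te)
print(f"held-out test accuracy: {test_accuracy:.3f}")
```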

<div class="alert alert-success">
    <p><b>QUESTIONS</b>:</p>
    <ul>
    <li>What might be a limitation of using a grid-search with several parameters and several values for each parameter?</li>
    </ul>
</div>

In [45]:
# We could miss the best combination (it depends on the lists of values defined beforehand),
# and the more values we add, the more combinations there are, so the computing time explodes.
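
The growth in cost is simple arithmetic: the number of fits is the product of the number of values per parameter, multiplied by the number of cross-validation folds (plus one final refit, ignored here). The counts below are hypothetical grid sizes chosen for illustration.

```python
from math import prod

# Hypothetical grid sizes: number of candidate values per hyper-parameter.
values_per_param = {"n_estimators": 3, "max_depth": 4}
cv_folds = 5

n_combinations = prod(values_per_param.values())
n_fits = n_combinations * cv_folds
print(n_combinations, n_fits)  # 12 combinations, 60 model fits

# Adding two more parameters with 5 and 6 values multiplies the cost by 30.
values_per_param.update({"max_features": 5, "min_samples_split": 6})
n_fits_larger = prod(values_per_param.values()) * cv_folds
print(n_fits_larger)  # 360 combinations x 5 folds = 1800 model fits
```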

An alternative is `RandomizedSearchCV`. In this case, the parameter values are drawn from predefined distributions. We then make a number of successive draws and check the performance of each one.

In [46]:
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'clf__n_estimators': randint(1, 100),
    'clf__max_depth': randint(2, 15),
    'clf__max_features': [1, 2, 3, 4, 5],
    'clf__min_samples_split': [2, 3, 4, 5, 10, 30],
}
search = RandomizedSearchCV(
    model, param_distributions=param_distributions,
    n_iter=50, n_jobs=-1, cv=5, random_state=42
)
# n_iter=50: draw 50 random parameter combinations
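
To get a feel for what the random search draws, `ParameterSampler` from scikit-learn can generate a few candidate combinations without fitting anything. In this sketch `toy_distributions` is a shortened, illustrative version of the distributions above.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

toy_distributions = {
    # randint(low, high) draws integers from low to high - 1.
    "clf__n_estimators": randint(1, 100),
    # A plain list is sampled uniformly.
    "clf__max_features": [1, 2, 3, 4, 5],
}
samples = list(
    ParameterSampler(toy_distributions, n_iter=5, random_state=42)
)
for sample in samples:
    print(sample)
```

Unlike a grid, the number of candidates is fixed by `n_iter`, whatever the number of parameters or the width of their distributions.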

In [47]:
_ = search.fit(X_train, y_train)

In [48]:
df_results = pd.DataFrame(search.cv_results_)
columns_to_keep = [
    "param_" + param_name for param_name in param_distributions]
columns_to_keep += [
    'mean_test_score',
    'std_test_score',
]
df_results = df_results[columns_to_keep]
df_results = df_results.sort_values(by="mean_test_score", ascending=False)
df_results.head(5)

Unnamed: 0,param_clf__n_estimators,param_clf__max_depth,param_clf__max_features,param_clf__min_samples_split,mean_test_score,std_test_score
23,60,6,2,2,0.802243,0.017158
32,65,9,3,4,0.801223,0.013206
28,33,9,3,4,0.801223,0.012806
4,30,6,2,5,0.798165,0.018712
36,77,11,2,5,0.798165,0.015688


In [49]:
df_results.tail(5)

Unnamed: 0,param_clf__n_estimators,param_clf__max_depth,param_clf__max_features,param_clf__min_samples_split,mean_test_score,std_test_score
10,51,5,1,4,0.778797,0.017823
20,14,3,2,30,0.772681,0.021707
31,44,2,5,4,0.769623,0.020003
34,14,2,5,3,0.768603,0.023637
6,22,2,4,3,0.762487,0.024938


<div class="alert alert-success">
    <p><b>EXERCISE</b>:</p>
    <p>Build a machine-learning pipeline using a <tt>HistGradientBoostingClassifier</tt> and fine tune your model on the Titanic dataset using a <tt>RandomizedSearchCV</tt>.</p>
    <p>You may want to set the parameter distributions in the following manner:</p>
    <ul>
    <li><tt>learning_rate</tt> with values ranging from 0.001 to 0.5 following a reciprocal distribution.</li>
    <li><tt>l2_regularization</tt> with values ranging from 0.0 to 0.5 following a uniform distribution.</li>
    <li><tt>max_leaf_nodes</tt> with integer values ranging from 5 to 30 following a uniform distribution.</li>
    <li><tt>min_samples_leaf</tt> with integer values ranging from 5 to 30 following a uniform distribution.</li>
    </ul>
</div>

In [None]:
# TODO
