
---
# Cross-validation and hyperparameter tuning
---
In the previous notebooks, we saw two approaches to tuning hyperparameters: grid search and randomized search.

In this notebook, we will show how to combine such a hyperparameter search with cross-validation.

## Predictive model

In [1]:
from sklearn import set_config

set_config(display="diagram")

In [2]:
import pandas as pd

adult_census = pd.read_csv("./adult.csv")

We extract the column containing the target.

In [3]:
target_name = "class"
target = adult_census[target_name]
target

0         <=50K
1         <=50K
2          >50K
3          >50K
4         <=50K
          ...  
48837     <=50K
48838      >50K
48839     <=50K
48840     <=50K
48841      >50K
Name: class, Length: 48842, dtype: object

We drop the target and the "education-num" column from our data, since the latter duplicates the information in the "education" column.

In [4]:
data = adult_census.drop(columns=[target_name, "education-num"])
data.head()

| | age | workclass | fnlwgt | education | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States |
| 1 | 38 | Private | 89814 | HS-grad | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States |
| 3 | 44 | Private | 160323 | Some-college | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States |
| 4 | 18 | ? | 103497 | Some-college | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States |
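
The claim that "education-num" duplicates "education" can be checked by verifying that the two columns map onto each other one-to-one. The sketch below uses a small hypothetical frame (`toy`, with made-up rows) rather than the real dataset; the same two `groupby` calls could be run on `adult_census`.

```python
import pandas as pd

# Hypothetical toy frame mimicking the two columns; the rows are illustrative only.
toy = pd.DataFrame({
    "education": ["HS-grad", "Bachelors", "HS-grad", "Masters", "Bachelors"],
    "education-num": [9, 13, 9, 14, 13],
})

# Count how many distinct numeric codes each category has, and the reverse.
cat_to_num = toy.groupby("education")["education-num"].nunique()
num_to_cat = toy.groupby("education-num")["education"].nunique()

# The columns are redundant if the mapping is one-to-one in both directions.
is_redundant = bool((cat_to_num == 1).all() and (num_to_cat == 1).all())
print(f"One-to-one mapping: {is_redundant}")
```

If the check holds, keeping both columns would only feed the same information to the model twice.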


## Predictive pipeline


In [5]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.compose import make_column_selector as selector

num_features = selector(dtype_include='number')(data)
cat_features = selector(dtype_include='object')(data)

num_preprocessor = StandardScaler()
cat_preprocessor = OrdinalEncoder(handle_unknown='use_encoded_value',
                                  unknown_value=-1)

preprocessor = ColumnTransformer(
    [
        ('standard-scaler', num_preprocessor, num_features),
        ('ordinal-cat', cat_preprocessor, cat_features),
    ],
    remainder='passthrough',
    sparse_threshold=0,
)

In [6]:
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import Pipeline

model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', HistGradientBoostingClassifier(
        random_state=42,
        max_leaf_nodes=4,
    )),
])

model

## Include a hyperparameter search within a cross-validation

As mentioned earlier, using a single train-test split during the grid search gives no information about the different sources of variation: neither the variation of the test score nor the variation of the selected hyperparameter values.
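
To make this concrete, here is a minimal sketch that repeats a grid search on several different single train-test splits and records what each one reports. It is an illustration under assumptions: it uses synthetic data from `make_classification` and a small `DecisionTreeClassifier` so that it runs quickly, not the adult census pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; this is not the adult census dataset).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

param_grid = {"max_depth": (2, 4, 8)}
results = []
for seed in range(5):
    # Each seed produces a different single train-test split.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=seed
    )
    search = GridSearchCV(
        DecisionTreeClassifier(random_state=0), param_grid=param_grid, cv=3
    )
    search.fit(X_train, y_train)
    results.append(
        (seed, search.best_params_["max_depth"], search.score(X_test, y_test))
    )

for seed, depth, score in results:
    print(f"split seed={seed}: best max_depth={depth}, test accuracy={score:.3f}")
```

Any single row of this printout is all that one train-test split would reveal; the spread between rows is the information that is otherwise missing.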

To get reliable information, the hyperparameter search needs to be nested within a cross-validation.

In [8]:
from sklearn.model_selection import cross_validate, KFold
from sklearn.model_selection import GridSearchCV

cv = KFold(n_splits=10, shuffle=True, random_state=42)

param_grid = {
    'classifier__learning_rate': (0.05, 0.1),
    'classifier__max_leaf_nodes': (30, 40)
}

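# Inner loop: the grid search selects hyperparameters using only the
# training portion it is given.
model_grid_search = GridSearchCV(model,
                                 param_grid=param_grid,
                                 n_jobs=4,
                                 cv=cv)

# Outer loop: each fold refits the whole grid search on its training part
# and scores the tuned model on the held-out part.
cv_results = cross_validate(model_grid_search,
                            data,
                            target,
                            cv=cv,
                            n_jobs=4,
                            return_estimator=True)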

Running the cross-validation above gives us an estimate of the test score of a model whose hyperparameters are tuned by grid search.
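
What `cross_validate` does when it is handed a grid-search estimator can be written out as an explicit loop. The sketch below is a simplified illustration, not the library's internal code, and it assumes synthetic data and a `DecisionTreeClassifier` in place of the notebook's pipeline; `outer_cv` and `inner_cv` are names introduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; this is not the adult census dataset).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)  # evaluation folds
inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)  # tuning folds
param_grid = {"max_depth": (2, 4, 8)}

outer_scores, chosen_params = [], []
for train_idx, test_idx in outer_cv.split(X, y):
    # The search only ever sees the outer training part, so the outer
    # test part plays no role in choosing the hyperparameters.
    search = GridSearchCV(
        DecisionTreeClassifier(random_state=0),
        param_grid=param_grid,
        cv=inner_cv,
    )
    search.fit(X[train_idx], y[train_idx])
    # Score the refitted best model on data never used for tuning.
    outer_scores.append(search.score(X[test_idx], y[test_idx]))
    chosen_params.append(search.best_params_)

outer_scores = np.array(outer_scores)
print(f"{outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```

Each outer fold yields one test score and one set of selected hyperparameters, which is why both can be inspected afterwards.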

In [9]:
scores = cv_results["test_score"]
print(f"Accuracy score by cross-validation combined with hyperparameters "
      f"search:\n{scores.mean():.3f} +/- {scores.std():.3f}")

Accuracy score by cross-validation combined with hyperparameters search:
0.874 +/- 0.004


The hyperparameters selected on each fold are potentially different since we nested the grid search in the cross-validation. Thus, the variation of the hyperparameters across folds should also be analyzed.

In [10]:
for fold_idx, estimator in enumerate(cv_results["estimator"]):
    print(f"Best parameter found on fold #{fold_idx + 1}")
    print(f"{estimator.best_params_}")

Best parameter found on fold #1
{'classifier__learning_rate': 0.1, 'classifier__max_leaf_nodes': 30}
Best parameter found on fold #2
{'classifier__learning_rate': 0.1, 'classifier__max_leaf_nodes': 30}
Best parameter found on fold #3
{'classifier__learning_rate': 0.1, 'classifier__max_leaf_nodes': 30}
Best parameter found on fold #4
{'classifier__learning_rate': 0.1, 'classifier__max_leaf_nodes': 40}
Best parameter found on fold #5
{'classifier__learning_rate': 0.1, 'classifier__max_leaf_nodes': 40}
Best parameter found on fold #6
{'classifier__learning_rate': 0.1, 'classifier__max_leaf_nodes': 30}
Best parameter found on fold #7
{'classifier__learning_rate': 0.1, 'classifier__max_leaf_nodes': 40}
Best parameter found on fold #8
{'classifier__learning_rate': 0.05, 'classifier__max_leaf_nodes': 40}
Best parameter found on fold #9
{'classifier__learning_rate': 0.1, 'classifier__max_leaf_nodes': 30}
Best parameter found on fold #10
{'classifier__learning_rate': 0.1, 'classifier__max_leaf_

Obtaining models with unstable hyperparameters would be an issue in practice: it would become difficult to decide which values to set for the final model.
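
One simple way to judge this stability is to count how often each hyperparameter combination is selected. The sketch below uses a hypothetical list, `best_params_per_fold`, with illustrative values; with the objects from this notebook it could be built as `[est.best_params_ for est in cv_results["estimator"]]`.

```python
from collections import Counter

# Hypothetical per-fold selections; the values are illustrative only.
best_params_per_fold = [
    {"classifier__learning_rate": 0.1, "classifier__max_leaf_nodes": 30},
    {"classifier__learning_rate": 0.1, "classifier__max_leaf_nodes": 30},
    {"classifier__learning_rate": 0.1, "classifier__max_leaf_nodes": 40},
    {"classifier__learning_rate": 0.05, "classifier__max_leaf_nodes": 40},
    {"classifier__learning_rate": 0.1, "classifier__max_leaf_nodes": 30},
]

# Dicts are not hashable, so turn each one into a sorted tuple of items.
counts = Counter(tuple(sorted(p.items())) for p in best_params_per_fold)

n_folds = len(best_params_per_fold)
for combo, n in counts.most_common():
    print(f"{n}/{n_folds} folds: {dict(combo)}")
```

A combination that wins on most folds is a reasonable candidate for the final model, while an even spread suggests the scores of the candidates are too close to separate.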

In this notebook, we have seen how to combine a hyperparameter search with cross-validation.