
---
# Cross-validation and hyperparameter tuning
---
In the previous notebooks, we saw two approaches to tuning hyperparameters: grid search and randomized search.

In this notebook, we show how to combine such a hyperparameter search with cross-validation.

## Predictive model

In [1]:
from sklearn import set_config

set_config(display="diagram")

In [2]:
import pandas as pd

adult_census = pd.read_csv("./adult.csv")

We extract the column containing the target.

In [3]:
target_name = "class"
target = adult_census[target_name]
target

0         <=50K
1         <=50K
2          >50K
3          >50K
4         <=50K
          ...  
48837     <=50K
48838      >50K
48839     <=50K
48840     <=50K
48841      >50K
Name: class, Length: 48842, dtype: object

We drop the target from our data, along with the "education-num" column, which duplicates the information in the "education" column.

In [4]:
data = adult_census.drop(columns=[target_name, "education-num"])
data.head()

| | age | workclass | fnlwgt | education | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States |
| 1 | 38 | Private | 89814 | HS-grad | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States |
| 3 | 44 | Private | 160323 | Some-college | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States |
| 4 | 18 | ? | 103497 | Some-college | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States |
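
The statement that "education-num" duplicates "education" can be checked by testing whether the two columns map one-to-one. The sketch below shows one possible check on a small toy frame whose rows are invented for illustration; in the notebook, the same two `groupby` calls could be run on `adult_census` instead.

```python
import pandas as pd

# Toy rows invented for illustration; in the notebook this would be
# the `adult_census` dataframe loaded from adult.csv.
toy = pd.DataFrame(
    {
        "education": [
            "11th", "HS-grad", "Assoc-acdm",
            "Some-college", "Some-college", "HS-grad",
        ],
        "education-num": [7, 9, 12, 10, 10, 9],
    }
)

# Number of distinct codes per category, and of categories per code.
codes_per_category = toy.groupby("education")["education-num"].nunique()
categories_per_code = toy.groupby("education-num")["education"].nunique()

# The mapping is one-to-one when both counts are always exactly 1.
is_one_to_one = bool(
    (codes_per_category == 1).all() and (categories_per_code == 1).all()
)
print(is_one_to_one)
```

If the check holds on the real data, keeping both columns would give the model the same information twice, so dropping one of them loses nothing.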


## Predictive pipeline


In [5]:
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Select column names by dtype: numerical vs. categorical (object) columns.
num_features = selector(dtype_include="number")(data)
cat_features = selector(dtype_include="object")(data)

num_preprocessor = StandardScaler()
# Categories unseen during fit are encoded as -1 instead of raising an error.
cat_preprocessor = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)

preprocessor = ColumnTransformer(
    [
        ("standard-scaler", num_preprocessor, num_features),
        ("ordinal-cat", cat_preprocessor, cat_features),
    ],
    remainder="passthrough",
    sparse_threshold=0,
)

In [6]:
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import Pipeline

model = Pipeline(
    [
        ("preprocessor", preprocessor),
        (
            "classifier",
            HistGradientBoostingClassifier(random_state=42, max_leaf_nodes=4),
        ),
    ]
)

model