# 📝 Exercise M3.01

The goal is to write an exhaustive search to find the combination of
parameters that maximizes the model's generalization performance.

Here we use a small subset of the Adult Census dataset to make the code faster
to execute. Once your code works on the small subset, try to change
`train_size` to a larger value (e.g. 0.8 for 80% instead of 20%).

In [34]:
import pandas as pd

from sklearn.model_selection import train_test_split

adult_census = pd.read_csv("../datasets/adult-census.csv")

target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=[target_name, "education-num"])

data_train, data_test, target_train, target_test = train_test_split(
    data, target, train_size=0.2, random_state=42
)

In [35]:
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector as selector
from sklearn.preprocessing import OrdinalEncoder

categorical_preprocessor = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)
preprocessor = make_column_transformer(
    (categorical_preprocessor, selector(dtype_include=object)),
    remainder="passthrough",
)

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import Pipeline

model = Pipeline(
    [
        ("preprocessor", preprocessor),
        ("classifier", HistGradientBoostingClassifier(random_state=42)),
    ]
)

Using the previously defined model (called `model`) and two nested `for`
loops, search for the best combination of the `learning_rate` and
`max_leaf_nodes` parameters. For each combination, set the parameters on the
model, then evaluate it using `cross_val_score` on the training set. Use the
following parameter values:
- `learning_rate` for the values 0.01, 0.1, 1 and 10. This parameter controls
  the ability of a new tree to correct the error of the previous sequence of
  trees.
- `max_leaf_nodes` for the values 3, 10, 30. This parameter controls the
  maximum number of leaves of each tree, and therefore its complexity.
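
The 4 × 3 grid described above can also be enumerated without nesting loops. The following is a minimal sketch that assumes only the Python standard library; the names `param_grid` and `max_leaf_nodes_values` are hypothetical and not part of the exercise. `itertools.product` yields the pairs in the same order as two nested `for` loops with `learning_rate` in the outer loop.

```python
from itertools import product

learning_rates = [1e-2, 1e-1, 1, 10]
max_leaf_nodes_values = [3, 10, 30]

# Same order as nested loops: the last iterable varies fastest.
param_grid = list(product(learning_rates, max_leaf_nodes_values))

# Number each combination so it can be matched with a printed score later.
for position, (lr, mln) in enumerate(param_grid, start=1):
    print(f"{position:2d}: learning_rate={lr}, max_leaf_nodes={mln}")
```

Numbering the combinations this way makes it easier to map the n-th printed score back to the parameters that produced it.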

In [36]:
from sklearn.model_selection import cross_val_score

learning_rates = [1e-2, 1e-1, 1, 10]
max_leaves_nodes = [3, 10, 30]

In [37]:
for lr in learning_rates:
    for mln in max_leaves_nodes:
        # The "classifier__" prefix targets the pipeline step named "classifier".
        model.set_params(
            classifier__learning_rate=lr, classifier__max_leaf_nodes=mln
        )
        # Default 5-fold cross-validation, run on the training set only.
        cv_results = cross_val_score(model, data_train, target_train)
        print(f"The score is {cv_results.mean()} ± {cv_results.std()}")

The score is 0.7895167448342078 ± 0.0036849010032786417
The score is 0.8135749478140604 ± 0.002472448816213207
The score is 0.8415242853945927 ± 0.006772678658768942
The score is 0.848690280968156 ± 0.003584970927016427
The score is 0.8634318983313601 ± 0.004819126026076143
The score is 0.8611794258210212 ± 0.005819082138923353
The score is 0.854115784392801 ± 0.006151889305972811
The score is 0.8372241534819539 ± 0.008324200933981259
The score is 0.8238127207387947 ± 0.00640636369219988
The score is 0.2882867131950897 ± 0.008163650477184288
The score is 0.6461141324713153 ± 0.14549737992994932
The score is 0.5339030156476585 ± 0.1906607424921216


**The scores are printed in loop order, so the 5th one (about 0.863) is the best:\
i.e. *Learning rate = 0.1* / *Max leaf nodes = 10*\
This is the best-scoring combination.**
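
Reading the best score off a list of printed numbers is error-prone. As a minimal sketch, the snippet below pairs the mean scores printed above (rounded to four decimals) with their parameter combinations, in loop order, and selects the highest one. The names `mean_scores`, `results` and `best_params` are hypothetical helpers introduced for illustration.

```python
from itertools import product

learning_rates = [1e-2, 1e-1, 1, 10]
max_leaf_nodes_values = [3, 10, 30]

# Mean cross-validation scores copied from the output above, rounded to
# four decimals, in the order the nested loops produced them.
mean_scores = [
    0.7895, 0.8136, 0.8415,  # learning_rate=0.01
    0.8487, 0.8634, 0.8612,  # learning_rate=0.1
    0.8541, 0.8372, 0.8238,  # learning_rate=1
    0.2883, 0.6461, 0.5339,  # learning_rate=10
]

# Map each (learning_rate, max_leaf_nodes) pair to its mean score.
results = dict(zip(product(learning_rates, max_leaf_nodes_values), mean_scores))

# Pick the key whose score is the highest.
best_params = max(results, key=results.get)
best_lr, best_mln = best_params
print(
    f"Best: learning_rate={best_lr}, max_leaf_nodes={best_mln}, "
    f"mean score={results[best_params]}"
)
```

Inside the search loop itself, the same dictionary could be filled with `cv_results.mean()` rather than hard-coded values.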

Now use the test set to score the model using the best parameters that we
found using cross-validation. You will have to refit the model over the full
training set.

In [40]:
model.set_params(
    classifier__learning_rate=0.1, classifier__max_leaf_nodes=10
)
model.fit(data_train, target_train)
test_score = model.score(data_test, target_test)

print(f"The score is {test_score}")

The score is 0.8695807954138302
