# 📝 Exercise M3.01

The goal is to write an exhaustive search that finds the combination of
hyperparameters maximizing the model's generalization performance.

Here we use a small subset of the Adult Census dataset to make the code faster
to execute. Once your code works on the small subset, try changing
`train_size` to a larger value (e.g. 0.8 for 80% instead of 20%).

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split

# prepare data, target
adult_census = pd.read_csv("../datasets/adult-census.csv")
target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=[target_name, "education-num"])

# prepare train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    data, target, train_size=0.2, random_state=42
)

data.head()

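|   | age | workclass | education | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|-----|-----------|-----------|----------------|------------|--------------|------|-----|--------------|--------------|----------------|----------------|
| 0 | 25 | Private | 11th | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States |
| 1 | 38 | Private | HS-grad | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States |
| 2 | 28 | Local-gov | Assoc-acdm | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States |
| 3 | 44 | Private | Some-college | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States |
| 4 | 18 | ? | Some-college | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States |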


In [3]:
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import Pipeline

# define preprocessors
categorical_preprocessor = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)

preprocessor = ColumnTransformer(
    [
        # ordinal-encode the categorical (object dtype) columns
        (
            "cat_preprocessor",
            categorical_preprocessor,
            selector(dtype_include=object),
        ),
    ],
    # pass the remaining columns through unchanged
    remainder="passthrough",
)

# build model pipeline
model = Pipeline(
    [
        ("preprocessor", preprocessor),
        ("classifier", HistGradientBoostingClassifier(random_state=42))
    ]
)

In [8]:
# get model parameter names
for param in model.get_params():
    if param.startswith("classifier"):
        print(param)


classifier
classifier__categorical_features
classifier__class_weight
classifier__early_stopping
classifier__interaction_cst
classifier__l2_regularization
classifier__learning_rate
classifier__loss
classifier__max_bins
classifier__max_depth
classifier__max_features
classifier__max_iter
classifier__max_leaf_nodes
classifier__min_samples_leaf
classifier__monotonic_cst
classifier__n_iter_no_change
classifier__random_state
classifier__scoring
classifier__tol
classifier__validation_fraction
classifier__verbose
classifier__warm_start


Using the previously defined model (called `model`) and two nested `for`
loops, search for the best combination of the `learning_rate` and
`max_leaf_nodes` parameters. For each combination, set the parameters on the
model, then evaluate it with `cross_val_score` on the training set. Use the
following parameter values:
- `learning_rate` for the values 0.01, 0.1, 1 and 10. This parameter controls
  the ability of a new tree to correct the error of the previous sequence of
  trees.
- `max_leaf_nodes` for the values 3, 10 and 30. This parameter caps the number
  of leaves in each tree and therefore limits its complexity (and, indirectly,
  its depth).

In [None]:
# Write your code here.
from sklearn.model_selection import cross_val_score

# candidate values for each hyperparameter
learning_rate_params = [0.01, 0.1, 1, 10]
max_leaf_nodes_params = [3, 10, 30]

# best combination found so far
best_learning_rate = None
best_max_leaf_nodes_params = None
best_score = 0

# evaluate every combination of the two hyperparameters
for learning_rate in learning_rate_params:
    for max_leaf_nodes in max_leaf_nodes_params:
        model.set_params(
            classifier__learning_rate=learning_rate,
            classifier__max_leaf_nodes=max_leaf_nodes,
        )
        # cross-validate on the training set only, so that the test set
        # stays unseen until the final evaluation
        scores = cross_val_score(model, X_train, y_train, cv=5)
        mean_score = scores.mean()
        if mean_score > best_score:
            best_score = mean_score
            best_learning_rate = learning_rate
            best_max_leaf_nodes_params = max_leaf_nodes
        print(
            f"With learning_rate={learning_rate}, "
            f"max_leaf_nodes={max_leaf_nodes}. "
            f"The mean accuracy was {mean_score:.3f} ± {scores.std():.3f}"
        )

With learning_rate=0.01, max_leaf_nodes=3. The mean accuracy was 0.790 ± 0.004
With learning_rate=0.01, max_leaf_nodes=10. The mean accuracy was 0.814 ± 0.002
With learning_rate=0.01, max_leaf_nodes=30. The mean accuracy was 0.842 ± 0.007
With learning_rate=0.1, max_leaf_nodes=3. The mean accuracy was 0.849 ± 0.004
With learning_rate=0.1, max_leaf_nodes=10. The mean accuracy was 0.863 ± 0.005
With learning_rate=0.1, max_leaf_nodes=30. The mean accuracy was 0.861 ± 0.006
With learning_rate=1, max_leaf_nodes=3. The mean accuracy was 0.854 ± 0.006
With learning_rate=1, max_leaf_nodes=10. The mean accuracy was 0.839 ± 0.008
With learning_rate=1, max_leaf_nodes=30. The mean accuracy was 0.824 ± 0.007
With learning_rate=10, max_leaf_nodes=3. The mean accuracy was 0.288 ± 0.008
With learning_rate=10, max_leaf_nodes=10. The mean accuracy was 0.646 ± 0.145
With learning_rate=10, max_leaf_nodes=30. The mean accuracy was 0.534 ± 0.191
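
The nested loops generalize to any number of parameters by iterating over the Cartesian product of the candidate values. The block below is a minimal sketch of that idea, not part of the exercise solution: `exhaustive_search`, `recorded_scores` and `lookup` are hypothetical names, and `lookup` stands in for the call to `cross_val_score(...).mean()` by reading back the mean accuracies printed above, so the sketch runs without the dataset.

```python
from itertools import product

# Mean cross-validation accuracies copied from the output above,
# keyed by (learning_rate, max_leaf_nodes).
recorded_scores = {
    (0.01, 3): 0.790, (0.01, 10): 0.814, (0.01, 30): 0.842,
    (0.1, 3): 0.849, (0.1, 10): 0.863, (0.1, 30): 0.861,
    (1, 3): 0.854, (1, 10): 0.839, (1, 30): 0.824,
    (10, 3): 0.288, (10, 10): 0.646, (10, 30): 0.534,
}


def exhaustive_search(param_grid, evaluate):
    """Return the (params, score) pair with the highest score on the grid."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    # product() yields every combination, like arbitrarily deep nested loops
    for values in product(*param_grid.values()):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def lookup(params):
    # Hypothetical stand-in for cross_val_score(model, ...).mean()
    return recorded_scores[(params["learning_rate"], params["max_leaf_nodes"])]


param_grid = {
    "learning_rate": [0.01, 0.1, 1, 10],
    "max_leaf_nodes": [3, 10, 30],
}
best_params, best_score = exhaustive_search(param_grid, lookup)
print(best_params, best_score)
# {'learning_rate': 0.1, 'max_leaf_nodes': 10} 0.863
```

In a real search, `evaluate` would set the parameters on the pipeline and return the mean cross-validation score on the training set, exactly as the loop body above does.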


Now use the test set to score the model with the best parameters found by
cross-validation. Because `cross_val_score` only fits clones of the model, you
will have to refit the model on the full training set.

Best parameters found: `learning_rate=0.1`, `max_leaf_nodes=10`

In [17]:
# Write your code here.
model.set_params(
    classifier__learning_rate=best_learning_rate,
    classifier__max_leaf_nodes=best_max_leaf_nodes_params,
)
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.8695807954138302