# 📝 Exercise M3.02

The goal is to find the set of hyperparameters that maximizes the
generalization performance of a model fitted on a training set.

Here again we limit the size of the training set to make computation
run faster. Feel free to increase the `train_size` value if your computer
is powerful enough.

In [1]:

import numpy as np
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")

target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=[target_name, "education-num"])

from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data, target, train_size=0.2, random_state=42)

In this exercise, we will progressively define the classification pipeline
and later tune its hyperparameters.

Our pipeline should:
* preprocess the categorical columns using a `OneHotEncoder` and use a
  `StandardScaler` to normalize the numerical data.
* use a `LogisticRegression` as a predictive model.

Start by defining the columns and the preprocessing pipelines to be applied
on each group of columns.

In [2]:
from sklearn.compose import make_column_selector as selector

# Write your code here.
num_cols = selector(dtype_include='number')(data)
cat_cols = selector(dtype_include='object')(data)
print(num_cols)
print(cat_cols)

['age', 'capital-gain', 'capital-loss', 'hours-per-week']
['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']


In [3]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

# Write your code here.
num_transformer = StandardScaler()
cat_transformer = OneHotEncoder(handle_unknown='ignore')

Subsequently, create a `ColumnTransformer` to redirect each group of columns
to its preprocessing pipeline.

In [4]:
from sklearn.compose import ColumnTransformer

# Write your code here.
processor = ColumnTransformer([
    ('scl', num_transformer, num_cols),
    ('one-hot', cat_transformer, cat_cols),  
])

Assemble the final pipeline by combining the above preprocessor
with a logistic regression classifier. Force the maximum number of
iterations to `10_000` to ensure that the model will converge.

In [5]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

# Write your code here.
model = make_pipeline(processor, LogisticRegression(max_iter=10_000))

Use `RandomizedSearchCV` with `n_iter=20` to find the best set of
hyperparameters by tuning the following parameters of the `model`:

- the parameter `C` of the `LogisticRegression` with values ranging from
  0.001 to 10. You can use a log-uniform distribution
  (i.e. `scipy.stats.loguniform`);
- the parameter `with_mean` of the `StandardScaler` with possible values
  `True` or `False`;
- the parameter `with_std` of the `StandardScaler` with possible values
  `True` or `False`.

Once the computation has completed, print the best combination of parameters
stored in the `best_params_` attribute.
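
As a rough illustration of why a log-uniform distribution suits `C`, the sketch below draws values whose logarithm is uniform between 0.001 and 10, using only the standard library. The helper `sample_loguniform` is a hypothetical name introduced here; it is not how `scipy.stats.loguniform` is implemented, only a way to see the behaviour the exercise relies on: each decade of the range receives roughly the same share of draws.

```python
import math
import random


def sample_loguniform(low, high, rng):
    """Draw one value whose logarithm is uniform on [log(low), log(high)].

    Hypothetical helper for illustration only (not scipy's implementation).
    """
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [sample_loguniform(1e-3, 10, rng) for _ in range(10_000)]

# Share of draws falling in each decade between 0.001 and 10.
decades = [(1e-3, 1e-2), (1e-2, 1e-1), (1e-1, 1), (1, 10)]
shares = [
    sum(lo <= s < hi for s in samples) / len(samples) for lo, hi in decades
]
for (lo, hi), share in zip(decades, shares):
    print(f"[{lo:g}, {hi:g}): {share:.2f}")
```

A plain uniform distribution over the same range would instead place about 90% of the draws between 1 and 10, leaving strongly regularized models (small `C`) almost unexplored.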

In [6]:
for p in model.get_params():
    print(p)

memory
steps
verbose
columntransformer
logisticregression
columntransformer__n_jobs
columntransformer__remainder
columntransformer__sparse_threshold
columntransformer__transformer_weights
columntransformer__transformers
columntransformer__verbose
columntransformer__verbose_feature_names_out
columntransformer__scl
columntransformer__one-hot
columntransformer__scl__copy
columntransformer__scl__with_mean
columntransformer__scl__with_std
columntransformer__one-hot__categories
columntransformer__one-hot__drop
columntransformer__one-hot__dtype
columntransformer__one-hot__handle_unknown
columntransformer__one-hot__sparse
logisticregression__C
logisticregression__class_weight
logisticregression__dual
logisticregression__fit_intercept
logisticregression__intercept_scaling
logisticregression__l1_ratio
logisticregression__max_iter
logisticregression__multi_class
logisticregression__n_jobs
logisticregression__penalty
logisticregression__random_state
logisticregression__solver
logisticregression__t
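
The names listed above follow the `<step name>__<parameter name>` convention, and these are the keys that a search over `model` expects. The sketch below shows the convention on a toy pipeline (`toy_model` is a name introduced here, not the exercise's `model`), assuming scikit-learn is installed. In the exercise the scaler sits inside the `ColumnTransformer` under the name `scl`, hence the extra level in `columntransformer__scl__with_mean`.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy pipeline for illustration: make_pipeline names each step after its
# lowercased class name ("standardscaler", "logisticregression").
toy_model = make_pipeline(StandardScaler(), LogisticRegression())

# Nested parameters are addressed as "<step name>__<parameter name>".
toy_model.set_params(
    logisticregression__C=0.1,
    standardscaler__with_mean=False,
)

params = toy_model.get_params()
print(params["logisticregression__C"])
print(params["standardscaler__with_mean"])
```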

In [9]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform

# Write your code here.
param_distributions = {
    'logisticregression__C': loguniform(.001, 10),
    'columntransformer__scl__with_mean': [True, False],
    'columntransformer__scl__with_std': [True, False],
}

search = RandomizedSearchCV(
    model,
    param_distributions=param_distributions,
    n_iter=20,
    n_jobs=-1,
    random_state=42,
)

In [12]:
search.fit(data_train, target_train)
accuracy = search.score(data_test, target_test)

print(f"The test accuracy score of the best model is "
      f"{accuracy:.2f}")

The test accuracy score of the best model is 0.85


In [13]:
from pprint import pprint

print("The best parameters are:")
pprint(search.best_params_)

The best parameters are:
{'columntransformer__scl__with_mean': True,
 'columntransformer__scl__with_std': False,
 'logisticregression__C': 6.3512210106406926}


In [None]:
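from sklearn.model_selection import cross_validate

# Nested cross-validation: the randomized search is re-run on the training
# part of each of the 2 outer folds and scored on the held-out part.
bm = cross_validate(search, data, target, cv=2)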

In [18]:
acc = bm['test_score'].mean().round(3)

In [19]:
acc

0.851