---
# Exercise M3.02
---
The goal is to find the set of hyperparameters that maximizes generalization performance, using only the training set for the search.

In [1]:
import numpy as np
import pandas as pd

adult_census = pd.read_csv("./adult.csv")

target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=[target_name, "education-num"])

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    data, 
    target,
    test_size=.2,
    random_state=42)

Create your machine learning pipeline

You should:

- preprocess the categorical columns using a `OneHotEncoder` and use a `StandardScaler` to normalize the numerical data.
- use a `LogisticRegression` as a predictive model.  

Start by defining the columns and the preprocessing pipelines to be applied to each set of columns.

In [2]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_selector as selector

categorical_column = selector(dtype_include='object')(data)
numerical_column = selector(dtype_include='number')(data)
categorical_preprocessor = OneHotEncoder(handle_unknown='ignore')
numerical_preprocessor = StandardScaler()
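As a small illustration of the column selection step above, the sketch below applies `make_column_selector` to a toy DataFrame. The column values are hypothetical and only loosely mimic the census data; the text column is given an explicit `object` dtype so the selection does not depend on how a particular pandas version infers string columns.

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Hypothetical toy frame: two numerical columns and one text column.
toy = pd.DataFrame({
    "age": [25, 38, 52],
    "hours-per-week": [40.0, 50.0, 45.0],
    "workclass": pd.Series(["Private", "State-gov", "Private"], dtype=object),
})

# Calling the selector on a DataFrame returns a list of column names.
toy_categorical = selector(dtype_include="object")(toy)
toy_numerical = selector(dtype_include="number")(toy)

print(toy_categorical)
print(toy_numerical)
```

The selector only inspects dtypes, so a numerically encoded categorical column would end up in the numerical list unless it is handled separately.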

Subsequently, create a `ColumnTransformer` to dispatch each set of columns to its preprocessing pipeline.

In [3]:
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    [
        ('standard-scaler', numerical_preprocessor, numerical_column),
        ('cat-prep', categorical_preprocessor, categorical_column),
    ],
    remainder='passthrough',
    sparse_threshold=0,
)

Finally, concatenate the preprocessing pipeline with a logistic regression.

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression()),
])

Use a `RandomizedSearchCV` with `n_iter=20` to find the best set of hyperparameters by tuning the following parameters of the `model`:

- the parameter `C` of the `LogisticRegression` with values ranging from 0.001 to 10. You can use a log-uniform distribution (i.e. `scipy.stats.loguniform`);
- the parameter `with_mean` of the `StandardScaler` with possible values `True` or `False`;
- the parameter `with_std` of the `StandardScaler` with possible values `True` or `False`.
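To see why a log-uniform distribution suits a range such as 0.001 to 10, the sketch below draws samples whose logarithm is uniformly distributed and counts how many land in each decade. It uses only the standard library rather than `scipy.stats.loguniform`; the helper name `sample_log_uniform` and the seed are assumptions made for this illustration.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw one value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
samples = [sample_log_uniform(1e-3, 10, rng) for _ in range(10_000)]

# Count how many samples fall into each decade of the range.
decades = [(1e-3, 1e-2), (1e-2, 1e-1), (1e-1, 1), (1, 10)]
counts = [sum(lo <= s < hi for s in samples) for lo, hi in decades]
print(counts)
```

Each decade receives roughly a quarter of the samples, whereas a plain uniform distribution on the same range would place about 90% of them between 1 and 10.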

Once the computation has completed, print the best combination of parameters stored in the `best_params_` attribute.

In [5]:
for p in model.get_params():
  print(p)

memory
steps
verbose
preprocessor
classifier
preprocessor__n_jobs
preprocessor__remainder
preprocessor__sparse_threshold
preprocessor__transformer_weights
preprocessor__transformers
preprocessor__verbose
preprocessor__verbose_feature_names_out
preprocessor__standard-scaler
preprocessor__cat-prep
preprocessor__standard-scaler__copy
preprocessor__standard-scaler__with_mean
preprocessor__standard-scaler__with_std
preprocessor__cat-prep__categories
preprocessor__cat-prep__drop
preprocessor__cat-prep__dtype
preprocessor__cat-prep__handle_unknown
preprocessor__cat-prep__sparse
classifier__C
classifier__class_weight
classifier__dual
classifier__fit_intercept
classifier__intercept_scaling
classifier__l1_ratio
classifier__max_iter
classifier__multi_class
classifier__n_jobs
classifier__penalty
classifier__random_state
classifier__solver
classifier__tol
classifier__verbose
classifier__warm_start
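The names listed above follow the `<step>__<parameter>` convention, with one double underscore per level of nesting. The sketch below shows the convention on a smaller, hypothetical pipeline (`toy_model` is not the `model` defined earlier). Because a step name such as `standard-scaler` contains a hyphen, it cannot be written as a keyword argument and is passed through dictionary unpacking instead.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical two-step pipeline used only to illustrate parameter naming.
toy_model = Pipeline([
    ("standard-scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# Valid Python identifiers can be passed as plain keyword arguments...
toy_model.set_params(classifier__C=0.1)
# ...while hyphenated step names need dictionary unpacking.
toy_model.set_params(**{"standard-scaler__with_mean": False})

print(toy_model.get_params()["classifier__C"])
print(toy_model.named_steps["standard-scaler"].with_mean)
```

The same strings are what a search object expects as keys of its parameter grid or distributions.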


In [6]:
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'classifier__C': loguniform(.001, 10),
    'preprocessor__standard-scaler__with_mean': [True, False],
    'preprocessor__standard-scaler__with_std': [True, False]
}

model_random_search = RandomizedSearchCV(model,
                                         param_distributions=param_distributions,
                                         n_iter=20,
                                         n_jobs=-1,
                                         cv=10,
                                         verbose=1)

model_random_search.fit(X_train, y_train)

Fitting 10 folds for each of 20 candidates, totalling 200 fits


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
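The warning above is raised when the solver stops at its iteration limit before converging, which plausibly happens here for the candidates that disable scaling. One option the message suggests is raising `max_iter`. The sketch below shows this on synthetic data; the value `max_iter=1000` and the generated dataset are assumptions for illustration, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem, unrelated to the census data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Allow more solver iterations than the default of 100.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# n_iter_ reports how many iterations the solver actually used.
print(clf.n_iter_)
```

Comparing `n_iter_` against `max_iter` is a quick way to check whether a fit stopped because it converged or because it ran out of iterations.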


RandomizedSearchCV(cv=10,
                   estimator=Pipeline(steps=[('preprocessor',
                                              ColumnTransformer(remainder='passthrough',
                                                                sparse_threshold=0,
                                                                transformers=[('standard-scaler',
                                                                               StandardScaler(),
                                                                               ['age',
                                                                                'fnlwgt',
                                                                                'capital-gain',
                                                                                'capital-loss',
                                                                                'hours-per-week']),
                                                                           

In [7]:
accuracy = model_random_search.score(X_test, y_test)
print(f"The test accuracy score of the best model is "
      f"{accuracy:.2f}")

The test accuracy score of the best model is 0.86


In [8]:
from pprint import pprint

print("The best parameters are:")
pprint(model_random_search.best_params_)

The best parameters are:
{'classifier__C': 0.24851639207507104,
 'preprocessor__standard-scaler__with_mean': False,
 'preprocessor__standard-scaler__with_std': True}


In [9]:
from sklearn.model_selection import cross_validate, KFold


cv = KFold(n_splits=10, shuffle=True, random_state=42)

cv_res = cross_validate(
    model_random_search,
    X_train,
    y_train,
    cv=cv,
    n_jobs=-1,
    return_estimator=True,
)

In [10]:
scores = cv_res["test_score"]
print(f"Accuracy score by cross-validation combined with hyperparameters "
      f"search:\n{scores.mean():.3f} +/- {scores.std():.3f}")

Accuracy score by cross-validation combined with hyperparameters search:
0.851 +/- 0.007


In [11]:
for fold_idx, estimator in enumerate(cv_res["estimator"]):
    print(f"Best parameter found on fold #{fold_idx + 1}")
    print(f"{estimator.best_params_}")

Best parameter found on fold #1
{'classifier__C': 0.14773993784221434, 'preprocessor__standard-scaler__with_mean': True, 'preprocessor__standard-scaler__with_std': True}
Best parameter found on fold #2
{'classifier__C': 9.652058909869588, 'preprocessor__standard-scaler__with_mean': True, 'preprocessor__standard-scaler__with_std': True}
Best parameter found on fold #3
{'classifier__C': 9.956447678929921, 'preprocessor__standard-scaler__with_mean': True, 'preprocessor__standard-scaler__with_std': True}
Best parameter found on fold #4
{'classifier__C': 0.1177306598697333, 'preprocessor__standard-scaler__with_mean': True, 'preprocessor__standard-scaler__with_std': True}
Best parameter found on fold #5
{'classifier__C': 3.1947560241376323, 'preprocessor__standard-scaler__with_mean': True, 'preprocessor__standard-scaler__with_std': True}
Best parameter found on fold #6
{'classifier__C': 0.11164462798002081, 'preprocessor__standard-scaler__with_mean': True, 'preprocessor__standard-scaler__wit

In [12]:
cv_res = pd.DataFrame(cv_res)

In [13]:
cv_res.to_csv('cv_res.csv')

In [14]:
cv_res = pd.read_csv('cv_res.csv')
cv_res.drop('Unnamed: 0', axis=1, inplace=True)
cv_res.head()

Unnamed: 0,fit_time,score_time,estimator,test_score
0,296.142516,0.036671,"RandomizedSearchCV(cv=10,\n ...",0.846981
1,316.958745,0.051523,"RandomizedSearchCV(cv=10,\n ...",0.847492
2,261.556571,0.034958,"RandomizedSearchCV(cv=10,\n ...",0.839048
3,297.887637,0.034838,"RandomizedSearchCV(cv=10,\n ...",0.848989
4,322.163099,0.048613,"RandomizedSearchCV(cv=10,\n ...",0.85462


In [17]:
cv_results = pd.DataFrame(model_random_search.cv_results_)

cv_results.to_csv('cv_results.csv')
cv_results = pd.read_csv('cv_results.csv')
cv_results.head()

Unnamed: 0.1,Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_classifier__C,param_preprocessor__standard-scaler__with_mean,param_preprocessor__standard-scaler__with_std,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
0,0,2.737492,0.800403,0.085691,0.014671,9.832512,True,False,"{'classifier__C': 9.832511885948751, 'preproce...",0.780962,0.808086,0.782497,0.783466,0.801638,0.785513,0.775531,0.770668,0.77937,0.77937,0.78471,0.010921,18
1,1,2.967874,0.91257,0.080524,0.037659,0.035763,True,False,"{'classifier__C': 0.03576254771847586, 'prepro...",0.805015,0.817042,0.795036,0.785001,0.773483,0.785513,0.775531,0.771692,0.804454,0.77937,0.789214,0.014698,16
2,2,1.972833,0.10513,0.035776,0.003354,0.116381,False,True,"{'classifier__C': 0.1163811664663427, 'preproc...",0.856704,0.850051,0.846213,0.85334,0.858715,0.852316,0.848989,0.846429,0.851037,0.851293,0.851509,0.003815,4
3,3,1.276877,0.220017,0.035645,0.002404,0.00159,False,False,"{'classifier__C': 0.0015901326740022466, 'prep...",0.798874,0.807318,0.794524,0.798567,0.794216,0.799079,0.793192,0.793448,0.795239,0.793704,0.796816,0.00414,11
4,4,1.314041,0.228484,0.036354,0.004174,1.581154,False,False,"{'classifier__C': 1.5811541454722464, 'preproc...",0.798874,0.806295,0.794524,0.798567,0.794216,0.799079,0.791144,0.793448,0.795239,0.793704,0.796509,0.004111,12


In [18]:
column_name_mapping = {
    "param_classifier__C": "C",
    "param_preprocessor__standard-scaler__with_mean": "centering",
    "param_preprocessor__standard-scaler__with_std": "scaling",
    "mean_test_score": "mean test accuracy",
}

cv_results = cv_results.rename(columns=column_name_mapping)
cv_results = cv_results[column_name_mapping.values()].sort_values(
    "mean test accuracy", ascending=False)

In [19]:
column_scaler = ["centering", "scaling"]
cv_results[column_scaler] = cv_results[column_scaler].astype(np.int64)
cv_results['log C'] = np.log10(cv_results['C'])

In [21]:
import plotly.express as px

fig = px.parallel_coordinates(
    cv_results,
    color="mean test accuracy",
    dimensions=["log C", "centering", "scaling", "mean test accuracy"],
    color_continuous_scale=px.colors.diverging.Tealrose,
)
fig.show()

We recall that it is possible to select a range of results by clicking and holding on any axis of the parallel coordinate plot. You can then slide (move) the range selection and cross two selections to see the intersections.

Selecting the best performing models (i.e. above an accuracy of ~0.845), we observe the following pattern:

- scaling the data is important. All the best performing models are scaling the data;
- centering the data does not have a strong impact. Both approaches, centering and not centering, can lead to good models;
- using some regularization is fine but using too much is a problem. Recall that a smaller value of C means a stronger regularization. In particular no pipeline with C lower than 0.001 can be found among the best models.
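The same selection can be made without the interactive plot by filtering the results table. The sketch below rebuilds a small DataFrame from the five rows of `cv_results.head()` displayed earlier (values copied as shown, with booleans written as 0/1) and keeps the rows above the 0.845 threshold. The variable names `shown` and `best` are assumptions for this illustration, and five rows are of course not the full search.

```python
import pandas as pd

# The five candidates displayed by cv_results.head() earlier in the notebook.
shown = pd.DataFrame({
    "C": [9.832512, 0.035763, 0.116381, 0.00159, 1.581154],
    "centering": [1, 1, 0, 0, 0],
    "scaling": [0, 0, 1, 0, 0],
    "mean test accuracy": [0.78471, 0.789214, 0.851509, 0.796816, 0.796509],
})

# Keep only the candidates above the accuracy threshold used in the text.
best = shown[shown["mean test accuracy"] > 0.845]
print(best)

# Check whether every selected candidate scales the data.
print(best["scaling"].eq(1).all())
```

Among these five rows, only the candidate with C of about 0.116 passes the threshold, and it is also the only one with scaling enabled.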