<a href="https://colab.research.google.com/github/dajebbar/FreeCodeCamp-python-data-analysis/blob/main/hyperparams.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
# Set and get hyperparameters in scikit-learn
---

This notebook shows how to get and set the value of a hyperparameter in a scikit-learn estimator. Recall that hyperparameters are the parameters that control the learning process; they are set by the user rather than learned from the data.

They should not be confused with the fitted parameters, which result from training. In scikit-learn, fitted parameters are recognizable by their trailing underscore `_`, for instance `model.coef_`.
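
To make the distinction concrete, here is a minimal sketch on a small synthetic dataset. The data comes from `make_classification`, an assumption made only so that the snippet runs on its own; it is not the census data used below. The hyperparameter `C` is available as soon as the estimator is created, whereas the fitted parameter `coef_` only appears after calling `fit`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only for illustration (not the adult census data).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

clf = LogisticRegression(C=0.5)

# The hyperparameter is set at construction time and readable immediately.
hyperparameter_C = clf.C
has_coef_before_fit = hasattr(clf, "coef_")

clf.fit(X, y)

# The fitted parameter only exists once the model has been trained.
has_coef_after_fit = hasattr(clf, "coef_")
coef_shape = clf.coef_.shape

print(f"C = {hyperparameter_C}")
print(f"coef_ before fit: {has_coef_before_fit}, after fit: {has_coef_after_fit}")
print(f"coef_ shape: {coef_shape}")
```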

We will start by loading the adult census dataset and only use the numerical features.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

In [None]:
adult_census = pd.read_csv("./adult.csv")

In [None]:
from sklearn.compose import make_column_selector as selector 

num_features = selector(dtype_include='number')(adult_census.drop('education-num', axis=1))
data = adult_census[num_features]
target = adult_census['class']

data.head()

   age  fnlwgt  capital-gain  capital-loss  hours-per-week
0   25  226802             0             0              40
1   38   89814             0             0              50
2   28  336951             0             0              40
3   44  160323          7688             0              40
4   18  103497             0             0              30


In [None]:
target.head()

0     <=50K
1     <=50K
2      >50K
3      >50K
4     <=50K
Name: class, dtype: object

Let's create a simple predictive model made of a scaler followed by a logistic regression classifier.

As mentioned in previous notebooks, many models, including linear ones, work better if all features have a similar scaling. For this purpose, we use a `StandardScaler`, which rescales each feature to zero mean and unit variance.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

model = Pipeline([
                  ('preprocessor', StandardScaler()),
                  ('classifier', LogisticRegression())
])
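
As a rough illustration of what the scaling step does, the sketch below applies a `StandardScaler` on its own to the `age` and `fnlwgt` values of the first four rows displayed by `data.head()`. The four-row array is only a toy input; in the pipeline above, the scaler is fitted on the training data of each fit.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# age and fnlwgt of the first four rows displayed by data.head()
X = np.array([
    [25.0, 226802.0],
    [38.0, 89814.0],
    [28.0, 336951.0],
    [44.0, 160323.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# The learned statistics are fitted parameters (note the trailing underscore).
print("means learned from the data:", scaler.mean_)
# After scaling, each column has mean ~0 and standard deviation ~1.
print("column means after scaling:", X_scaled.mean(axis=0))
print("column stds after scaling:", X_scaled.std(axis=0))
```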

We can evaluate the generalization performance of the model via cross-validation.

In [None]:
from sklearn.model_selection import cross_validate, KFold

cv = KFold(n_splits=10, shuffle=True, random_state=42)

results_cv = cross_validate(
    model,
    data,
    target,
    cv=cv,
    n_jobs=2
)

results_cv = pd.DataFrame(results_cv)
results_cv.head()

   fit_time  score_time  test_score
0  0.250015    0.019003    0.803889
1  0.249899    0.021099    0.810850
2  0.249450    0.019039    0.807944
3  0.258686    0.019263    0.793612
4  0.248251    0.017927    0.800369
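
For reference, `cross_validate` returns a plain dictionary of NumPy arrays with one entry per fold, which is why it converts cleanly to a `DataFrame`. The sketch below shows this on synthetic data; the dataset and the 5-fold split are assumptions chosen to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic data, used only for illustration (not the adult census data).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(LogisticRegression(), X, y, cv=cv)

# Each key maps to an array with one value per fold.
for key, values in results.items():
    print(key, values.shape)
```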


In [None]:
scores = results_cv.test_score
print(f"Accuracy score via cross-validation:\n"
      f"{scores.mean():.3f} +/- {scores.std():.3f}")

Accuracy score via cross-validation:
0.800 +/- 0.008
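
One detail is worth keeping in mind when reading the `+/-` value: `scores` is a pandas `Series` here, and `Series.std()` computes the sample standard deviation (`ddof=1`), whereas the `std` method of a NumPy array defaults to the population version (`ddof=0`). The sketch below uses the five fold scores displayed above purely as example numbers.

```python
import numpy as np
import pandas as pd

# The five test scores shown by results_cv.head(), used as example values.
fold_scores = [0.803889, 0.81085, 0.807944, 0.793612, 0.800369]

series_std = pd.Series(fold_scores).std()             # pandas default: ddof=1
array_std = np.array(fold_scores).std()               # NumPy default: ddof=0
array_std_ddof1 = np.array(fold_scores).std(ddof=1)   # matches pandas

print(f"pandas Series.std():    {series_std:.6f}")
print(f"NumPy std():            {array_std:.6f}")
print(f"NumPy std(ddof=1):      {array_std_ddof1:.6f}")
```

With only a handful of folds, the two conventions can differ in the last displayed digit, so it helps to stick to one of them when comparing runs.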


We created a model with the default `C` value, which is 1. To use a different value, we could have passed it when creating the `LogisticRegression` object, for example `LogisticRegression(C=1e-3)`.

We can also change the parameter of a model after it has been created with the `set_params` method, which is available for all scikit-learn estimators. For example, we can set `C=1e-3`, fit and evaluate the model:

In [None]:
model.set_params(classifier__C=1e-3)
results_cv = cross_validate(
    model,
    data,
    target,
    cv=cv,
    n_jobs=2
)

results_cv = pd.DataFrame(results_cv)
# Recompute the scores from the new results; otherwise the print below
# would reuse the scores of the previous run (C=1).
scores = results_cv.test_score

print(f"Accuracy score via cross-validation:\n"
      f"{scores.mean():.3f} +/- {scores.std():.3f}")


When the model of interest is a `Pipeline`, the parameter names are of the form `<model_name>__<parameter_name>` (note the double underscore in the middle). In our case, `classifier` is the name given to the step in the `Pipeline` definition and `C` is the parameter name of `LogisticRegression`.
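
The naming rule can be checked with a small standalone sketch. It rebuilds the same two-step pipeline, sets one parameter on each step, and shows that an unknown name is rejected with a `ValueError`. The name `not_a_parameter` is deliberately made up for the demonstration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("preprocessor", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# <step name>__<parameter name>: one parameter on each step.
pipe.set_params(classifier__C=1e-3, preprocessor__with_mean=False)

c_value = pipe.named_steps["classifier"].C
with_mean = pipe.named_steps["preprocessor"].with_mean
print(f"C = {c_value}, with_mean = {with_mean}")

# A name that does not match any step parameter raises a ValueError.
error_raised = False
try:
    pipe.set_params(classifier__not_a_parameter=1)  # hypothetical, invalid name
except ValueError as exc:
    error_raised = True
    print(f"Rejected: {exc}")
```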

In general, you can use the `get_params` method on scikit-learn models to list all the parameters with their values. For example, if you want to get all the parameter names, you can use:



In [None]:
for param in model.get_params():
    print(param)

memory
steps
verbose
preprocessor
classifier
preprocessor__copy
preprocessor__with_mean
preprocessor__with_std
classifier__C
classifier__class_weight
classifier__dual
classifier__fit_intercept
classifier__intercept_scaling
classifier__l1_ratio
classifier__max_iter
classifier__multi_class
classifier__n_jobs
classifier__penalty
classifier__random_state
classifier__solver
classifier__tol
classifier__verbose
classifier__warm_start


`.get_params()` returns a `dict` whose keys are the parameter names and whose values are the parameter values. If you want to get the value of a single parameter, for example `classifier__C`, you can use:

In [None]:
model.get_params()['classifier__C']

0.001
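
`get_params` also accepts a `deep` argument. The sketch below, built on a fresh pipeline with default settings, suggests how the two modes differ: `deep=False` only returns the parameters of the `Pipeline` object itself, while the default `deep=True` also includes the steps and their nested `<step>__<parameter>` entries.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("preprocessor", StandardScaler()),
    ("classifier", LogisticRegression()),
])

shallow = pipe.get_params(deep=False)  # parameters of the Pipeline only
deep = pipe.get_params()               # deep=True is the default

print("shallow keys:", sorted(shallow))
print("number of deep keys:", len(deep))
print("classifier__C:", deep["classifier__C"])

# The step name itself maps to the estimator object stored in the pipeline.
same_object = deep["classifier"] is pipe.named_steps["classifier"]
print("deep['classifier'] is the pipeline step:", same_object)
```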

We can systematically vary the value of `C` to see if there is an optimal value.

In [None]:
for C in [1e-3, 1e-2, 1e-1, 1, 5, 10, 100]:
    model.set_params(classifier__C=C)
    cv_results = cross_validate(
        model,
        data,
        target,
        cv=cv,
        n_jobs=2
    )
    scores = cv_results["test_score"]
    print(f"Accuracy score via cross-validation with C={C}:\n"
          f"{scores.mean():.3f} +/- {scores.std():.3f}")

Accuracy score via cross-validation with C=0.001:
0.788 +/- 0.007
Accuracy score via cross-validation with C=0.01:
0.800 +/- 0.007
Accuracy score via cross-validation with C=0.1:
0.800 +/- 0.007
Accuracy score via cross-validation with C=1:
0.800 +/- 0.007
Accuracy score via cross-validation with C=5:
0.800 +/- 0.007
Accuracy score via cross-validation with C=10:
0.800 +/- 0.007
Accuracy score via cross-validation with C=100:
0.800 +/- 0.007
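
The same `<step>__<parameter>` names can be reused with scikit-learn's `GridSearchCV`, which automates this kind of loop. This is a different tool from the manual loop used in this notebook, shown here only as a possible follow-up. The sketch runs on synthetic data from `make_classification` with a 5-fold split, both assumptions made to keep it self-contained, so its scores are unrelated to the ones printed above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration (not the adult census data).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

model = Pipeline([
    ("preprocessor", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# Keys of the grid use the same naming scheme as set_params.
param_grid = {"classifier__C": [1e-3, 1e-2, 1e-1, 1, 5, 10, 100]}
cv = KFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(model, param_grid=param_grid, cv=cv)
search.fit(X, y)

results = search.cv_results_
for C, mean, std in zip(results["param_classifier__C"],
                        results["mean_test_score"],
                        results["std_test_score"]):
    print(f"C={C}: {mean:.3f} +/- {std:.3f}")

best_C = search.best_params_["classifier__C"]
print(f"Best C on this synthetic data: {best_C}")
```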
