# Set and get hyperparameters in scikit-learn

Recall that hyperparameters are the parameters that control the learning
process of a predictive model and are specific to each family of models. In
addition, the optimal set of hyperparameters depends on the dataset, so
they generally need to be tuned for each new problem.

This notebook shows how to get and set the value of a hyperparameter in a
scikit-learn estimator.

Hyperparameters should not be confused with the fitted parameters that result
from training. Fitted parameters are recognizable in scikit-learn because
their names end with an underscore `_`, for instance `model.coef_`.
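
As a minimal sketch of this distinction, the snippet below contrasts a hyperparameter (`C`) with a fitted parameter (`coef_`). It assumes a small synthetic dataset from `make_classification` rather than the census data, and the names `X`, `y` and `clf` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used here only for illustration (not the census dataset).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

clf = LogisticRegression(C=0.1)

# Hyperparameter: chosen by the user and available before fitting.
print(clf.C)

# Fitted parameter: does not exist until `fit` has been called.
has_coef_before_fit = hasattr(clf, "coef_")

clf.fit(X, y)
has_coef_after_fit = hasattr(clf, "coef_")
print(has_coef_before_fit, has_coef_after_fit)

# One row of coefficients for a binary problem, one column per feature.
print(clf.coef_.shape)
```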

We start by loading the adult census dataset and only use the numerical
features.


In [1]:
import pandas as pd

adult_census = pd.read_csv("https://raw.githubusercontent.com/Viny2030/datasets/refs/heads/main/adult_census.csv")

target_name = "class"
numerical_columns = ["age", "capital-gain", "capital-loss", "hours-per-week"]

target = adult_census[target_name]
data = adult_census[numerical_columns]

Our data is only numerical.

In [2]:
data.head()

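|   | age | capital-gain | capital-loss | hours-per-week |
|---|-----|--------------|--------------|----------------|
| 0 | 25 | 0 | 0 | 40 |
| 1 | 38 | 0 | 0 | 50 |
| 2 | 28 | 0 | 0 | 40 |
| 3 | 44 | 7688 | 0 | 40 |
| 4 | 18 | 0 | 0 | 30 |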


Let's create a simple predictive model made of a scaler followed by a logistic
regression classifier.

As mentioned in previous notebooks, many models, including linear ones, work
better if all features have a similar scaling. For this purpose, we use a
`StandardScaler`, which transforms the data by rescaling features.
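
To make the rescaling concrete, here is a small sketch on a made-up array with two features on very different scales. The values are illustrative and are not taken from the dataset. `StandardScaler` subtracts each column's mean and divides by its standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data with two features on very different scales (made-up values).
X = np.array([[25.0, 0.0], [38.0, 7688.0], [28.0, 0.0], [44.0, 0.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, each column has mean ~0 and standard deviation ~1.
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))

# The statistics learned during `fit` are stored as fitted parameters.
print(scaler.mean_, scaler.scale_)
```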

In [3]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

model = Pipeline(
    steps=[
        ("preprocessor", StandardScaler()),
        ("classifier", LogisticRegression()),
    ]
)

In [29]:
# A second, identical pipeline (`model1`), built from the classes imported above.

model1 = Pipeline(
    steps=[
        ("preprocessor", StandardScaler()),
        ("classifier", LogisticRegression()),
    ]
)


In [30]:
model1.get_params()

{'memory': None,
 'steps': [('preprocessor', StandardScaler()),
  ('classifier', LogisticRegression())],
 'verbose': False,
 'preprocessor': StandardScaler(),
 'classifier': LogisticRegression(),
 'preprocessor__copy': True,
 'preprocessor__with_mean': True,
 'preprocessor__with_std': True,
 'classifier__C': 1.0,
 'classifier__class_weight': None,
 'classifier__dual': False,
 'classifier__fit_intercept': True,
 'classifier__intercept_scaling': 1,
 'classifier__l1_ratio': None,
 'classifier__max_iter': 100,
 'classifier__multi_class': 'deprecated',
 'classifier__n_jobs': None,
 'classifier__penalty': 'l2',
 'classifier__random_state': None,
 'classifier__solver': 'lbfgs',
 'classifier__tol': 0.0001,
 'classifier__verbose': 0,
 'classifier__warm_start': False}

We can evaluate the generalization performance of the model via
cross-validation.

In [32]:
from sklearn.model_selection import cross_validate

cv_results = cross_validate(model, data, target)
scores = cv_results["test_score"]
print(
    "Accuracy score via cross-validation:\n"
    f"{scores.mean():.3f} ± {scores.std():.3f}"
)

Accuracy score via cross-validation:
0.800 ± 0.003


In [33]:
cv_results

{'fit_time': array([0.10442591, 0.13993931, 0.13754535, 0.13829398, 0.13670921]),
 'score_time': array([0.04064608, 0.03798366, 0.03712845, 0.03743505, 0.03733563]),
 'test_score': array([0.79557785, 0.80049135, 0.79965192, 0.79873055, 0.80446355])}

In [36]:
from sklearn.model_selection import cross_validate

cv_results1 = cross_validate(model1, data, target)
scores1 = cv_results1["test_score"]
print(
    "Accuracy score via cross-validation:\n"
    f"{scores1.mean():.3f} ± {scores1.std():.3f}"
)

Accuracy score via cross-validation:
0.800 ± 0.003


In [37]:
cv_results1

{'fit_time': array([0.09746623, 0.13817811, 0.13940883, 0.13799977, 0.13284373]),
 'score_time': array([0.03794789, 0.03754663, 0.03743029, 0.03700733, 0.03672624]),
 'test_score': array([0.79557785, 0.80049135, 0.79965192, 0.79873055, 0.80456593])}

In [38]:
scores1

array([0.79557785, 0.80049135, 0.79965192, 0.79873055, 0.80456593])

We created a model with the default `C` value that is equal to 1. If we wanted
to use a different `C` hyperparameter we could have done so when we created the
`LogisticRegression` object with something like `LogisticRegression(C=1e-3)`.

<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last">For more information on the model hyperparameter <tt class="docutils literal">C</tt>, refer to the
<a class="reference external" href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html">documentation</a>.
Be aware that we will focus on linear models in an upcoming module.</p>
</div>

We can also change the hyperparameter of a model after it has been created
with the `set_params` method, which is available for all scikit-learn
estimators. For example, we can set `C=1e-3`, fit and evaluate the model:


In [9]:
model.set_params(classifier__C=1e-3)
cv_results = cross_validate(model, data, target)
scores = cv_results["test_score"]
print(
    "Accuracy score via cross-validation:\n"
    f"{scores.mean():.3f} ± {scores.std():.3f}"
)

Accuracy score via cross-validation:
0.787 ± 0.002


When the model of interest is a `Pipeline`, the hyperparameter names are of
the form `<model_name>__<hyperparameter_name>` (note the double underscore in
the middle). In our case, `classifier` comes from the `Pipeline` definition
and `C` is the hyperparameter name of `LogisticRegression`.
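
The naming rule can be sketched on a fresh pipeline with the same step names as above. The variable `pipe` and the deliberately wrong name `not_a_parameter` are illustrative assumptions.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(
    steps=[
        ("preprocessor", StandardScaler()),
        ("classifier", LogisticRegression()),
    ]
)

# "<step name>__<hyperparameter name>" reaches into the nested estimator.
pipe.set_params(classifier__C=0.01, preprocessor__with_mean=False)
print(pipe.named_steps["classifier"].C)
print(pipe.named_steps["preprocessor"].with_mean)

# A name that does not match any hyperparameter of the step is rejected.
try:
    pipe.set_params(classifier__not_a_parameter=1)
    error_raised = False
except ValueError:
    error_raised = True
print(error_raised)
```

Because misspelled names raise an error rather than being silently ignored, typos in hyperparameter names tend to surface early.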

In general, you can use the `get_params` method on scikit-learn models to list
all the hyperparameters with their values. For example, if you want to get all
the hyperparameter names, you can use:

In [10]:
for parameter in model.get_params():
    print(parameter)

memory
steps
verbose
preprocessor
classifier
preprocessor__copy
preprocessor__with_mean
preprocessor__with_std
classifier__C
classifier__class_weight
classifier__dual
classifier__fit_intercept
classifier__intercept_scaling
classifier__l1_ratio
classifier__max_iter
classifier__multi_class
classifier__n_jobs
classifier__penalty
classifier__random_state
classifier__solver
classifier__tol
classifier__verbose
classifier__warm_start


In [12]:
# Individual steps of the pipeline can be retrieved by name via `named_steps`.
preprocessor = model.named_steps["preprocessor"]
print(preprocessor)

StandardScaler()


In [21]:

params = model.get_params()
# Look up a single hyperparameter, checking first that the key exists.
if "classifier__l1_ratio" in params:
    l1_ratio = params["classifier__l1_ratio"]
    print(l1_ratio)
else:
    print("Classifier does not have an l1_ratio parameter.")

None


`.get_params()` returns a `dict` whose keys are the hyperparameter names and
whose values are the hyperparameter values. If you want to get the value of a
single hyperparameter, for example `classifier__C`, you can use:

In [13]:
model.get_params()["classifier__C"]

0.001

In [39]:
model1.get_params()["classifier__C"]

1.0
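
Since the result is a plain `dict`, ordinary dictionary tools apply. The sketch below filters the hyperparameters of one step by their prefix and contrasts this with `get_params(deep=False)`, which leaves out the nested names. The pipeline `pipe` and the value `C=0.5` are illustrative assumptions.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(
    steps=[
        ("preprocessor", StandardScaler()),
        ("classifier", LogisticRegression(C=0.5)),
    ]
)

params = pipe.get_params()

# Keep only the hyperparameters that belong to the "classifier" step.
classifier_params = {
    name: value
    for name, value in params.items()
    if name.startswith("classifier__")
}
print(classifier_params["classifier__C"])

# With deep=False, only the pipeline's own constructor arguments are returned.
shallow_params = pipe.get_params(deep=False)
print(sorted(shallow_params))
```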

We can systematically vary the value of `C` to see if there is an optimal
value.

In [22]:
for C in [1e-3, 1e-2, 1e-1, 1, 10]:
    model.set_params(classifier__C=C)
    cv_results = cross_validate(model, data, target)
    scores = cv_results["test_score"]
    print(
        f"Accuracy score via cross-validation with C={C}:\n"
        f"{scores.mean():.3f} ± {scores.std():.3f}"
    )

Accuracy score via cross-validation with C=0.001:
0.787 ± 0.002
Accuracy score via cross-validation with C=0.01:
0.799 ± 0.003
Accuracy score via cross-validation with C=0.1:
0.800 ± 0.003
Accuracy score via cross-validation with C=1:
0.800 ± 0.003
Accuracy score via cross-validation with C=10:
0.800 ± 0.003


In [41]:
for C in [1e-3, 1e-2, 1e-1, 1, 10]:
    model1.set_params(classifier__C=C)
    cv_results1 = cross_validate(model1, data, target)
    scores1 = cv_results1["test_score"]
    print(
        f"Accuracy score via cross-validation with C={C}:\n"
        f"{scores1.mean():.3f} ± {scores1.std():.3f}"
    )

Accuracy score via cross-validation with C=0.001:
0.787 ± 0.003
Accuracy score via cross-validation with C=0.01:
0.799 ± 0.003
Accuracy score via cross-validation with C=0.1:
0.800 ± 0.003
Accuracy score via cross-validation with C=1:
0.800 ± 0.003
Accuracy score via cross-validation with C=10:
0.800 ± 0.003


We can see that for `C` of 0.1 or more the accuracy plateaus at about 0.800,
while the strongly regularized model with `C=0.001` scores lower (0.787).

What we did here is very manual: it involves scanning the values for C and
picking the best one manually. In the next lesson, we will see how to do this
automatically.

<div class="admonition warning alert alert-danger">
<p class="first admonition-title" style="font-weight: bold;">Warning</p>
<p class="last">When we evaluate a family of models on test data and pick the best performer,
we cannot trust the corresponding prediction accuracy; to estimate it, we need to apply
the selected model to new data. Indeed, the test data has been used to select
the model, and is thus no longer independent of it.</p>
</div>
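
One way to act on this warning, sketched here under assumptions, is to set aside a held-out test set before choosing `C` and to score the selected model on it only once. The snippet uses synthetic data from `make_classification` instead of the census data, and names such as `pipe`, `mean_scores` and `best_C` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only.
X, y = make_classification(n_samples=500, n_features=4, random_state=0)

# Set aside data that plays no role in choosing C.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline(
    steps=[
        ("preprocessor", StandardScaler()),
        ("classifier", LogisticRegression()),
    ]
)

# Select C using cross-validation on the training portion only.
mean_scores = {}
for C in [1e-3, 1e-2, 1e-1, 1, 10]:
    pipe.set_params(classifier__C=C)
    cv_results = cross_validate(pipe, X_train, y_train)
    mean_scores[C] = cv_results["test_score"].mean()
best_C = max(mean_scores, key=mean_scores.get)

# Refit with the selected value and score once on the untouched data.
pipe.set_params(classifier__C=best_C).fit(X_train, y_train)
held_out_accuracy = pipe.score(X_test, y_test)
print(best_C, held_out_accuracy)
```

The held-out accuracy is the figure to report, since the cross-validation scores were already used to pick `best_C`.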

In this notebook we have seen:

- how to use `get_params` and `set_params` to get the hyperparameters of a model
  and set them.
