## Hyperparameter optimization

After hypothesis testing, we will optimize the hyperparameters of:

1. Neural Network
2. XGBoost
3. Random Forest

We will use RandomizedSearch to narrow down some hyperparameters and a Genetic Algorithm to tune the others.

In [5]:
import pickle
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import metrics
from xgboost import XGBClassifier
from sklearn_genetic import GASearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn_genetic.space import Continuous, Categorical, Integer
from sklearn_genetic.plots import plot_fitness_evolution, plot_search_space

sns.set_style("darkgrid")
sns.set_palette("Set1")

### Data loading

In [6]:
filename = "../Models/ParametrizationData/parametrization.pkl"
with open(filename, "rb") as f:  # context manager closes the file after loading
    X_train, X_test, y_train, y_test = pickle.load(f)

### RandomizedSearch

Here, we run a first, coarse optimization of each model's hyperparameters using RandomizedSearch. This broad pass gives us a better idea of which hyperparameters to focus on later, when we use the Genetic Algorithm.
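As a rough illustration of how this coarse pass behaves, the sketch below runs `RandomizedSearchCV` on synthetic data (an assumption; the notebook itself uses the pickled training split). With the default `n_iter=10`, the search samples 10 parameter combinations, which is where the "10 candidates" in the logs comes from, and `random_state` makes that sample reproducible. The toy grid and the helper `summarize_by_param` are hypothetical and only meant to show one way of reading the results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the notebook's training data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# Small illustrative grid, not the one used in the notebook.
demo_grid = {
    "n_estimators": [10, 20, 40],
    "max_depth": [3, 5, 10],
    "min_samples_split": [3, 5, 7],
}

demo_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    demo_grid,
    n_iter=10,        # number of sampled combinations (this is the default)
    cv=3,
    scoring="accuracy",
    random_state=42,  # makes the sampled candidates reproducible
)
demo_search.fit(X_demo, y_demo)


def summarize_by_param(search, param):
    """Mean cross-validated score for each sampled value of one hyperparameter."""
    results = pd.DataFrame(search.cv_results_)
    return results.groupby(f"param_{param}")["mean_test_score"].mean()


print(demo_search.best_params_)
print(summarize_by_param(demo_search, "max_depth"))
```

A hyperparameter whose values produce clearly different mean scores is a natural candidate for finer tuning; one whose values score about the same can probably be fixed.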

In [7]:
mlp_param_grid = {
    "hidden_layer_sizes": [(25,), (50,), (25, 25), (50, 50)],
    "activation": ["relu", "tanh"],
    "solver": ["adam", "sgd"],
    "alpha": [0.001, 0.01],
    "learning_rate": ["adaptive"],
    "max_iter": [500, 1000],
}

xgb_param_grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.1, 0.2],
    "max_depth": [3, 5, 7],
    "subsample": [0.7, 0.8, 0.9],
    "colsample_bytree": [0.7, 0.8, 0.9],
}

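rf_param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, 20],
    "min_samples_split": [3, 5, 7],
    "min_samples_leaf": [2],
    # NOTE: "auto" is no longer accepted by recent scikit-learn releases and is
    # the likely cause of the failed fits reported in the search output.
    "max_features": ["auto", "sqrt", "log2"],
}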

In [8]:
mlp = MLPClassifier(random_state=42)
mlp_grid_search = RandomizedSearchCV(
    mlp, mlp_param_grid, cv=3, scoring="accuracy", verbose=2, n_jobs=-1
)
mlp_grid_search.fit(X_train, y_train)
print("Best parameters for MLPClassifier:", mlp_grid_search.best_params_)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
Best parameters for MLPClassifier: {'solver': 'adam', 'max_iter': 1000, 'learning_rate': 'adaptive', 'hidden_layer_sizes': (50, 50), 'alpha': 0.001, 'activation': 'relu'}


In [9]:
xgb = XGBClassifier(random_state=42)
xgb_grid_search = RandomizedSearchCV(
    xgb, xgb_param_grid, cv=3, scoring="accuracy", verbose=2, n_jobs=-1
)
xgb_grid_search.fit(X_train, y_train)
print("Best parameters for XGBoostClassifier:", xgb_grid_search.best_params_)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
Best parameters for XGBoostClassifier: {'subsample': 0.8, 'n_estimators': 100, 'max_depth': 7, 'learning_rate': 0.2, 'colsample_bytree': 0.7}


In [None]:
rf = RandomForestClassifier(random_state=42)
rf_grid_search = RandomizedSearchCV(
    rf, rf_param_grid, cv=3, scoring="accuracy", verbose=2, n_jobs=-1
)
rf_grid_search.fit(X_train, y_train)
print("Best parameters for RandomForest:", rf_grid_search.best_params_)

Fitting 3 folds for each of 10 candidates, totalling 30 fits


3 fits failed out of a total of 30.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
3 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\juanj\OneDrive - UPB\Estructurados\Final\.venv\Lib\site-packages\sklearn\model_selection\_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\juanj\OneDrive - UPB\Estructurados\Final\.venv\Lib\site-packages\sklearn\base.py", line 1466, in wrapper
    estimator._validate_params()
  File "c:\Users\juanj\OneDrive - UPB\Estructurados\Final\.venv\Lib\site-packages\sklearn\base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "c:\Users\juanj\OneDrive - UPB\Estructurados\Final\.venv\Li

Best parameters for RandomForest: {'n_estimators': 50, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 20}
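The failed fits above go through scikit-learn's parameter validation, which suggests that one sampled candidate carried a value the installed version rejects. The most likely culprit is `"auto"` for `max_features`, which recent scikit-learn releases no longer accept for `RandomForestClassifier`; the truncated traceback does not confirm this, so treat it as an assumption. A minimal sketch, using synthetic data and a hypothetical helper `accepted_values`, of how a candidate list could be filtered before running the search:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Tiny synthetic dataset, only used to trigger parameter validation (assumption).
X_tiny, y_tiny = make_classification(n_samples=60, n_features=8, random_state=0)


def accepted_values(estimator_cls, param, candidates):
    """Return the candidate values that the installed library accepts for `param`."""
    accepted = []
    for value in candidates:
        # Keep the forest small so each check is fast.
        params = {"n_estimators": 5, "random_state": 0, param: value}
        try:
            estimator_cls(**params).fit(X_tiny, y_tiny)
        except (ValueError, TypeError):
            # Invalid values are rejected by fit-time parameter validation.
            continue
        accepted.append(value)
    return accepted


print(accepted_values(RandomForestClassifier, "max_features", ["auto", "sqrt", "log2"]))
```

Alternatively, passing `error_score="raise"` to the search, as the warning suggests, stops at the first failing candidate and shows the full error message.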


### Genetic Algorithm

Here we run a more thorough optimization of the models' hyperparameters using the Genetic Algorithm. This is a more targeted approach that focuses on the hyperparameters the RandomizedSearch showed to be most important.

In [12]:
# RandomizedSearch selected: solver="adam", learning_rate="adaptive", activation="relu"

mlp_param_grid = {
    "hidden_layer_sizes": [(25,), (50,), (25, 25), (50, 50)],
    "activation": ["relu", "tanh"],
    "solver": ["adam", "sgd"],
    "alpha": [0.001, 0.01],
    "learning_rate": ["adaptive"],
    "max_iter": [500, 1000],
}

xgb_param_grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.1, 0.2],
    "max_depth": [3, 5, 7],
    "subsample": [0.7, 0.8, 0.9],
    "colsample_bytree": [0.7, 0.8, 0.9],
}

# RandomizedSearch selected: max_features="sqrt"
rf_param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, 20],
    "min_samples_split": [3, 5, 7],
    "min_samples_leaf": [2],
    "max_features": ["auto", "sqrt", "log2"],
}

### Hyperparameter optimization results

### Model Selection