<a href="https://colab.research.google.com/github/andres-merino/AprendizajeAutomaticoInicial-05-N0105/blob/main/2-Notebooks/22-Optimizacion-Hiperparametros.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<table style="border: none; border-collapse: collapse;">
    <tr>
        <td style="width: 20%; vertical-align: middle; padding-right: 10px;">
            <img src="https://i.imgur.com/nt7hloA.png" width="100">
        </td>
        <td style="width: 2px; text-align: center;">
            <font color="#0030A1" size="7">|</font><br>
            <font color="#0030A1" size="7">|</font>
        </td>
        <td>
            <p style="font-variant: small-caps;"><font color="#0030A1" size="5">
                <b>Escuela de Ciencias Físicas y Matemática</b>
            </font> </p>
            <p style="font-variant: small-caps;"><font color="#0030A1" size="4">
                Aprendizaje Automático Inicial &bull; Hyperparameter Optimization
            </font></p>
            <p style="font-style: oblique;"><font color="#0030A1" size="3">
                Andrés Merino &bull; 2024-02
            </font></p>
        </td>  
    </tr>
</table>

---
## <font color='264CC7'> Introduction </font>

This notebook is an introductory guide to implementing hyperparameter optimization for machine learning models.


The required packages are:

In [79]:
import pandas as pd  # Data handling
import matplotlib.pyplot as plt  # Visualization

from sklearn.model_selection import train_test_split  # Train/test split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report  # Evaluation metrics

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, KFold  # Hyperparameter search
from sklearn.ensemble import RandomForestClassifier  # Model


---
## <font color='264CC7'> Practical example </font>


### <font color='264CC7'> Data preprocessing </font>

First, we read the data and select the columns we will use:

In [80]:
# Read the data
data = pd.read_csv('https://raw.githubusercontent.com/andres-merino/AprendizajeAutomaticoInicial-05-N0105/refs/heads/main/2-Notebooks/datos/Pokemon.csv')
# Keep the columns of interest
numeric_cols = ['Attack', 'Defense', 'Speed', 'Sp. Atk', 'Sp. Def', 'HP']
class_col = ['Stage']
data = data[['Name', *numeric_cols, *class_col]]
# Show the first rows
display(data.head())

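|   | Name | Attack | Defense | Speed | Sp. Atk | Sp. Def | HP | Stage |
|---|---|---|---|---|---|---|---|---|
| 0 | Bulbasaur | 49 | 49 | 45 | 65 | 65 | 45 | 1 |
| 1 | Ivysaur | 62 | 63 | 60 | 80 | 80 | 60 | 2 |
| 2 | Venusaur | 82 | 83 | 80 | 100 | 100 | 80 | 3 |
| 3 | Charmander | 52 | 43 | 65 | 60 | 50 | 39 | 1 |
| 4 | Charmeleon | 64 | 58 | 80 | 80 | 65 | 58 | 2 |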


Let's review summary statistics of the data:

In [81]:
data.describe()

|   | Attack | Defense | Speed | Sp. Atk | Sp. Def | HP | Stage |
|---|---|---|---|---|---|---|---|
| count | 151.0 | 151.0 | 151.0 | 151.0 | 151.0 | 151.0 | 151.0 |
| mean | 72.549669 | 68.225166 | 68.933775 | 67.139073 | 66.019868 | 64.211921 | 1.582781 |
| std | 26.596162 | 26.916704 | 26.74688 | 28.534199 | 24.197926 | 28.590117 | 0.676832 |
| min | 5.0 | 5.0 | 15.0 | 15.0 | 20.0 | 10.0 | 1.0 |
| 25% | 51.0 | 50.0 | 46.5 | 45.0 | 49.0 | 45.0 | 1.0 |
| 50% | 70.0 | 65.0 | 70.0 | 65.0 | 65.0 | 60.0 | 1.0 |
| 75% | 90.0 | 84.0 | 90.0 | 87.5 | 80.0 | 80.0 | 2.0 |
| max | 134.0 | 180.0 | 140.0 | 154.0 | 125.0 | 250.0 | 3.0 |


We split the data into training and test sets, stratifying by `Stage` so that both sets keep similar class proportions.

In [82]:
X = data[numeric_cols]

# Split the data into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, data['Stage'], test_size=0.2, random_state=42, stratify=data['Stage'])
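
The effect of `stratify` can be checked on a small synthetic example. The sketch below does not use the Pokémon data: `X_demo` and `y_demo` are hypothetical, built only to mimic an imbalanced three-class target, and the names `X_tr`, `X_te`, `y_tr`, `y_te` are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced 3-class target (not the notebook's real data)
y_demo = pd.Series([1] * 60 + [2] * 30 + [3] * 10, name="Stage")
X_demo = pd.DataFrame({"feature": range(len(y_demo))})

# Same call pattern as in the notebook: 20% test, stratified by the target
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Compare class proportions in the full data and in each subset
proportions = pd.DataFrame({
    "full": y_demo.value_counts(normalize=True).sort_index(),
    "train": y_tr.value_counts(normalize=True).sort_index(),
    "test": y_te.value_counts(normalize=True).sort_index(),
})
print(proportions)
```

With stratification, the three columns should be close to each other; without it, a rare class such as the third one could end up with very few samples in the test set, or none.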

### <font color='264CC7'> Model </font>

We define the model:

In [83]:
# Create a random forest classifier (it is not trained yet)
modelo_base = RandomForestClassifier(random_state=42)

# Model hyperparameters and their current values
modelo_base.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}
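
The keys returned by `get_params()` are the names that a search grid can refer to. As a minimal sketch (the variables `model`, `candidate_grid` and `unknown` are illustrative, not part of the notebook), one can check a grid against those names before launching a search:

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42)
valid_names = set(model.get_params())

# Hypothetical grid to validate before running a search
candidate_grid = {'n_estimators': [10, 20], 'max_depth': [None, 3, 5]}

# Any key that is not a hyperparameter of the estimator would make the search fail
unknown = set(candidate_grid) - valid_names
print("Unknown hyperparameter names:", unknown)

# set_params updates hyperparameters by name
model.set_params(n_estimators=20, max_depth=3)
print(model.get_params()['n_estimators'], model.get_params()['max_depth'])
```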

We define the hyperparameter grid, the cross-validation scheme, and the grid search:

In [84]:
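# Hyperparameter grid: 2 x 3 = 6 combinations
parametros = {'n_estimators': [10, 20],
              'max_depth': [None, 3, 5]}
# 5-fold cross-validation with shuffling
k_fold = KFold(n_splits=5, shuffle=True, random_state=42)

# Exhaustive search over the grid, scored by accuracy
modelo = GridSearchCV(modelo_base, parametros, cv=k_fold, scoring='accuracy')
modelo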
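
To see what the grid search will evaluate, the combinations can be enumerated with scikit-learn's `ParameterGrid`. This is only an illustrative sketch; `param_grid`, `n_splits`, `combinations` and `n_fits` are names chosen here and mirror the values used in the cell above.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid defined in the notebook
param_grid = {'n_estimators': [10, 20], 'max_depth': [None, 3, 5]}
n_splits = 5  # number of folds used by KFold

# Every combination that an exhaustive search evaluates
combinations = list(ParameterGrid(param_grid))
for combo in combinations:
    print(combo)

# Each combination is fitted once per fold
n_fits = len(combinations) * n_splits
print("Combinations:", len(combinations))
print("Cross-validation fits:", n_fits)
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best combination on the whole training set, which is why `predict` can be called on the search object afterwards.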

We run the hyperparameter search:

In [85]:
# Run the grid search: fit every combination on every fold
modelo.fit(X_train, y_train)

# Show the results
pd.DataFrame(modelo.cv_results_)

|   | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_max_depth | param_n_estimators | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.024022 | 0.008167 | 0.0 | 0.0 | None | 10 | {'max_depth': None, 'n_estimators': 10} | 0.708333 | 0.625 | 0.791667 | 0.791667 | 0.541667 | 0.691667 | 0.097183 | 3 |
| 1 | 0.046158 | 0.012347 | 0.007125 | 0.006512 | None | 20 | {'max_depth': None, 'n_estimators': 20} | 0.708333 | 0.666667 | 0.75 | 0.791667 | 0.541667 | 0.691667 | 0.085797 | 3 |
| 2 | 0.029427 | 0.0048 | 0.005188 | 0.004554 | 3.0 | 10 | {'max_depth': 3, 'n_estimators': 10} | 0.75 | 0.75 | 0.708333 | 0.708333 | 0.458333 | 0.675 | 0.109924 | 5 |
| 3 | 0.033515 | 0.002899 | 0.006428 | 0.005891 | 3.0 | 20 | {'max_depth': 3, 'n_estimators': 20} | 0.666667 | 0.75 | 0.75 | 0.791667 | 0.541667 | 0.7 | 0.088976 | 1 |
| 4 | 0.018828 | 0.00624 | 0.003153 | 0.006306 | 5.0 | 10 | {'max_depth': 5, 'n_estimators': 10} | 0.708333 | 0.708333 | 0.75 | 0.708333 | 0.458333 | 0.666667 | 0.105409 | 6 |
| 5 | 0.047832 | 0.00839 | 0.001493 | 0.002985 | 5.0 | 20 | {'max_depth': 5, 'n_estimators': 20} | 0.708333 | 0.666667 | 0.833333 | 0.75 | 0.541667 | 0.7 | 0.096465 | 1 |


We inspect the best hyperparameters and their mean cross-validation score:

In [86]:
# Best parameters
print("Best parameters", modelo.best_params_)
print("Best score", modelo.best_score_)

Best parameters {'max_depth': 3, 'n_estimators': 20}
Best score 0.7
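
In the results table above, two combinations share the top mean score of 0.7 (both have rank 1), and the one reported as best is the one listed first. A compact, ranked view of `cv_results_` makes such ties easier to spot. The sketch below is an assumption-laden illustration: it uses synthetic data from `make_classification` instead of the Pokémon data, and `search` and `summary` are names chosen here.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic 3-class data standing in for the notebook's training set
X_demo, y_demo = make_classification(
    n_samples=120, n_features=6, n_informative=4, n_classes=3, random_state=42
)

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {'n_estimators': [10, 20], 'max_depth': [None, 3, 5]},
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring='accuracy',
)
search.fit(X_demo, y_demo)

# Keep only the columns needed to compare candidates, sorted by rank
summary = (
    pd.DataFrame(search.cv_results_)
    [['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
    .sort_values('rank_test_score')
    .reset_index(drop=True)
)
print(summary)
```

Looking at `std_test_score` next to `mean_test_score` is useful here: when the spread across folds is larger than the gap between candidates, the ranking should be read with caution.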


<div style="background-color: #edf1f8; border-color: #264CC7; border-left: 5px solid #264CC7; padding: 0.5em;">
<strong>Exercise:</strong><br>
Try other values of <code>cv</code>, <code>scoring</code>, and the hyperparameters to see how the results change.
</div>
<br>
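
As one possible starting point for this exercise, the sketch below repeats a small search with two different `scoring` values and a stratified 3-fold split. It runs on synthetic data so that it is self-contained; `X_demo`, `y_demo`, `param_grid`, `cv` and `results` are illustrative names, and the chosen values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic 3-class data standing in for the notebook's training set
X_demo, y_demo = make_classification(
    n_samples=120, n_features=6, n_informative=4, n_classes=3, random_state=0
)

param_grid = {'n_estimators': [10, 20], 'max_depth': [None, 3]}
# Stratified folds keep class proportions similar in every fold
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

results = {}
for scoring in ['accuracy', 'f1_macro']:
    search = GridSearchCV(
        RandomForestClassifier(random_state=0), param_grid, cv=cv, scoring=scoring
    )
    search.fit(X_demo, y_demo)
    results[scoring] = (search.best_params_, search.best_score_)
    print(scoring, search.best_params_, round(search.best_score_, 3))
```

A metric such as `f1_macro` weights every class equally, so it may favor a different combination than accuracy when the classes are imbalanced.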

We make predictions on the test set and evaluate the model:

In [87]:
# Make predictions and evaluate the model
y_pred = modelo.predict(X_test)

# Model accuracy rounded to two decimals
accuracy = round(accuracy_score(y_test, y_pred), 2)
print("Model accuracy:", accuracy)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix:")
print(cm)

# Classification report
print("Classification report:")
print(classification_report(y_test, y_pred))

Model accuracy: 0.74
Confusion matrix:
[[12  4  0]
 [ 1 11  0]
 [ 0  3  0]]
Classification report:
              precision    recall  f1-score   support

           1       0.92      0.75      0.83        16
           2       0.61      0.92      0.73        12
           3       0.00      0.00      0.00         3

    accuracy                           0.74        31
   macro avg       0.51      0.56      0.52        31
weighted avg       0.71      0.74      0.71        31



Note: scikit-learn emits an `UndefinedMetricWarning` for this report. No test sample was predicted as Stage 3 (the last column of the confusion matrix is all zeros), so the precision of that class is undefined and is reported as 0.00.
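
When a class is never predicted, as happens with Stage 3 here, `classification_report` and `precision_score` accept a `zero_division` argument that fixes the value to report and silences the warning. In the minimal sketch below, `y_true` and `y_hat` are not the actual test rows: they are label vectors rebuilt from the confusion matrix printed above, which is enough to reproduce the per-class metrics.

```python
from sklearn.metrics import classification_report, precision_score

# Labels rebuilt from the confusion matrix (rows = true class, columns = predicted class)
y_true = [1] * 16 + [2] * 12 + [3] * 3
y_hat = (
    [1] * 12 + [2] * 4      # true Stage 1: 12 predicted as 1, 4 as 2
    + [1] * 1 + [2] * 11    # true Stage 2: 1 predicted as 1, 11 as 2
    + [2] * 3               # true Stage 3: all 3 predicted as 2
)

# zero_division=0 reports 0.0 for undefined ratios without raising a warning
print(classification_report(y_true, y_hat, zero_division=0))

# Per-class precision: Stage 3 has no predicted samples, so its precision is set to 0.0
prec = precision_score(y_true, y_hat, average=None, zero_division=0)
print(prec)
```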


We can also run a randomized hyperparameter search, which evaluates only `n_iter` randomly sampled combinations instead of the full grid:

In [88]:
parametros = {'n_estimators': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
              'max_depth': [None, 3, 5, 10, 15, 20, 25, 30, 35, 40]}
k_fold = KFold(n_splits=5, shuffle=True, random_state=42)

modelo = RandomizedSearchCV(modelo_base, parametros, cv=k_fold, scoring='accuracy', n_iter=10, random_state=42)
modelo
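
The space above contains 10 x 10 = 100 combinations, of which the randomized search tries only `n_iter=10`. The candidates can be previewed with scikit-learn's `ParameterSampler`. This is a sketch for illustration: `param_space`, `total` and `sampled` are names chosen here, and whether the preview matches the candidates of a given search run depends on using the same `random_state` and library version.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same values as the search space defined in the notebook
param_space = {'n_estimators': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
               'max_depth': [None, 3, 5, 10, 15, 20, 25, 30, 35, 40]}

# Size of the full grid that an exhaustive search would have to cover
total = len(ParameterGrid(param_space))

# Draw 10 candidates; with list-valued parameters, sampling is without replacement
sampled = list(ParameterSampler(param_space, n_iter=10, random_state=42))
for candidate in sampled:
    print(candidate)

print(f"Sampled {len(sampled)} of {total} combinations")
```

With 5 folds, this means 50 cross-validation fits instead of the 500 that an exhaustive search over the same space would need.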

We run the hyperparameter search:

In [89]:
# Run the randomized search: fit each sampled combination on every fold
modelo.fit(X_train, y_train)

# Show the results
pd.DataFrame(modelo.cv_results_)

|   | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_n_estimators | param_max_depth | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.085521 | 0.016936 | 0.010231 | 0.008693 | 40 | 35.0 | {'n_estimators': 40, 'max_depth': 35} | 0.708333 | 0.708333 | 0.833333 | 0.791667 | 0.541667 | 0.716667 | 0.1 | 4 |
| 1 | 0.087447 | 0.02253 | 0.009389 | 0.007666 | 40 | 20.0 | {'n_estimators': 40, 'max_depth': 20} | 0.708333 | 0.708333 | 0.833333 | 0.791667 | 0.541667 | 0.716667 | 0.1 | 4 |
| 2 | 0.03157 | 0.001327 | 0.003103 | 0.006206 | 10 | 30.0 | {'n_estimators': 10, 'max_depth': 30} | 0.708333 | 0.625 | 0.791667 | 0.791667 | 0.541667 | 0.691667 | 0.097183 | 6 |
| 3 | 0.149687 | 0.008168 | 0.013418 | 0.004451 | 60 | 15.0 | {'n_estimators': 60, 'max_depth': 15} | 0.708333 | 0.708333 | 0.791667 | 0.791667 | 0.583333 | 0.716667 | 0.076376 | 2 |
| 4 | 0.14026 | 0.010174 | 0.006464 | 0.007926 | 50 | 15.0 | {'n_estimators': 50, 'max_depth': 15} | 0.708333 | 0.708333 | 0.833333 | 0.791667 | 0.583333 | 0.725 | 0.085797 | 1 |
| 5 | 0.213006 | 0.027389 | 0.02049 | 0.006304 | 100 | 10.0 | {'n_estimators': 100, 'max_depth': 10} | 0.708333 | 0.708333 | 0.791667 | 0.833333 | 0.541667 | 0.716667 | 0.1 | 2 |
| 6 | 0.071386 | 0.014973 | 0.007146 | 0.006974 | 30 | 5.0 | {'n_estimators': 30, 'max_depth': 5} | 0.75 | 0.666667 | 0.791667 | 0.708333 | 0.541667 | 0.691667 | 0.085797 | 6 |
| 7 | 0.02207 | 0.007502 | 0.0124 | 0.006204 | 10 | 35.0 | {'n_estimators': 10, 'max_depth': 35} | 0.708333 | 0.625 | 0.791667 | 0.791667 | 0.541667 | 0.691667 | 0.097183 | 6 |
| 8 | 0.028951 | 0.006795 | 0.003132 | 0.006264 | 10 | 3.0 | {'n_estimators': 10, 'max_depth': 3} | 0.75 | 0.75 | 0.708333 | 0.708333 | 0.458333 | 0.675 | 0.109924 | 10 |
| 9 | 0.034257 | 0.005984 | 0.0 | 0.0 | 10 | None | {'n_estimators': 10, 'max_depth': None} | 0.708333 | 0.625 | 0.791667 | 0.791667 | 0.541667 | 0.691667 | 0.097183 | 6 |


We inspect the best hyperparameters and their mean cross-validation score:

In [90]:
# Best parameters
print("Best parameters", modelo.best_params_)
print("Best score", modelo.best_score_)

Best parameters {'n_estimators': 50, 'max_depth': 15}
Best score 0.725


We make predictions on the test set and evaluate the model:

In [91]:
# Make predictions and evaluate the model
y_pred = modelo.predict(X_test)

# Model accuracy rounded to two decimals
accuracy = round(accuracy_score(y_test, y_pred), 2)
print("Model accuracy:", accuracy)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix:")
print(cm)

# Classification report
print("Classification report:")
print(classification_report(y_test, y_pred))

Model accuracy: 0.81
Confusion matrix:
[[12  3  1]
 [ 0 11  1]
 [ 0  1  2]]
Classification report:
              precision    recall  f1-score   support

           1       1.00      0.75      0.86        16
           2       0.73      0.92      0.81        12
           3       0.50      0.67      0.57         3

    accuracy                           0.81        31
   macro avg       0.74      0.78      0.75        31
weighted avg       0.85      0.81      0.81        31

