# LAB | Hyperparameter Tuning

**Load the data**

This is the final step in maximizing the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before and then compare the best-performing models you have so far, this time fine-tuning their hyperparameters.
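To make "default values for hyperparameters" concrete, the short sketch below lists a few defaults that an untuned `RandomForestClassifier` uses. The exact values depend on the installed scikit-learn version, so the printout should be read as what your environment reports rather than as fixed constants.

```python
from sklearn.ensemble import RandomForestClassifier

# Inspect the hyperparameters an untuned model would use.
default_params = RandomForestClassifier().get_params()

# These are the same hyperparameters tuned later in this lab.
for name in ("n_estimators", "max_depth", "min_samples_split",
             "min_samples_leaf", "max_features"):
    print(name, "=", default_params[name])
```

Any of these names can be used as a key in a search grid.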

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same steps as before:
- Feature Scaling
- Feature Selection
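As a reminder of how the scaling step is usually ordered, here is a minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` are made up for illustration and are not the Spaceship Titanic features). The scaler learns its mean and standard deviation from the training split only and then reuses them on the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: 200 rows, 3 numeric features).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)  # learn mean/std from the training split only
X_te_scaled = demo_scaler.transform(X_te)      # reuse the training statistics on the test split

print(X_tr_scaled.std(axis=0))   # close to 1 for every column
print(X_te_scaled.mean(axis=0))  # close to, but not exactly, 0
```

The test-split means are not exactly zero because the test rows never influenced the fitted statistics, which is the point of fitting on the training split alone.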


In [3]:
# Split into train and test sets first, then normalize

In [4]:
spaceship.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64
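Given missing-value counts like the ones above, one possible way to impute without letting test rows influence the fill values is scikit-learn's `SimpleImputer`. The sketch below uses two tiny hypothetical frames, `train` and `test`, rather than the real dataset: the imputers are fit on the training rows and then applied to the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical miniature train/test frames with missing values.
train = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0],
    "HomePlanet": pd.Series(["Europa", "Earth", "Europa"], dtype=object),
})
test = pd.DataFrame({
    "Age": [np.nan, 16.0],
    "HomePlanet": pd.Series([np.nan, "Earth"], dtype=object),
})

# Mean for numeric columns, most frequent value for categorical columns.
num_imputer = SimpleImputer(strategy="mean").fit(train[["Age"]])
cat_imputer = SimpleImputer(strategy="most_frequent").fit(train[["HomePlanet"]])

# The fill values come from the training rows only.
test_age = num_imputer.transform(test[["Age"]])
test_planet = cat_imputer.transform(test[["HomePlanet"]])

print(test_age.ravel())
print(test_planet.ravel())
```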

In [5]:
# First, separate the target variable
X = spaceship.drop(columns=['Transported'])
y = spaceship['Transported']

# Identify the numeric and categorical columns of X (the features)
numerical_columns = X.select_dtypes(include=['float64', 'int64']).columns
categorical_columns = X.select_dtypes(include=['object']).columns

# Fill missing values in numeric columns with the mean
for col in numerical_columns:
    X[col] = X[col].fillna(X[col].mean())

# Fill missing values in categorical columns with the mode
for col in categorical_columns:
    X[col] = X[col].fillna(X[col].mode()[0])



In [6]:
# Encode the categorical columns
label_encoder = LabelEncoder()
for col in categorical_columns:
    if X[col].dtype == 'object':  # Only encode columns that still hold object (string) values
        X[col] = label_encoder.fit_transform(X[col])

# Scale the numeric columns
# Note: this fit uses the full dataset; the scaler is refit on the training split after the train/test split
scaler = StandardScaler()
X[numerical_columns] = scaler.fit_transform(X[numerical_columns])
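The scikit-learn documentation describes `LabelEncoder` as intended for target labels, and refitting a single instance per column, as done above, discards each column's mapping. A possible alternative for feature columns is `OrdinalEncoder`, sketched below on made-up category values. The arrays `train_cats` and `test_cats` are illustrative, and the `handle_unknown="use_encoded_value"` option assumes scikit-learn 0.24 or later.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Illustrative single-column category data.
train_cats = np.array([["Earth"], ["Europa"], ["Mars"], ["Earth"]], dtype=object)
test_cats = np.array([["Mars"], ["Titan"]], dtype=object)  # "Titan" never appears in training

# Categories unseen during fit are mapped to -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(train_cats)
test_codes = encoder.transform(test_cats)

print(encoder.categories_[0])  # learned categories, in sorted order
print(train_codes.ravel())
print(test_codes.ravel())
```

Because the encoder keeps its learned categories, the same mapping can be reapplied to new data later.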

In [7]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [8]:
# Create a fresh StandardScaler
scaler = StandardScaler()

# Scale every feature, fitting on the training split only and reusing those statistics for the test split
# (the numeric columns were already standardized once above, so this mainly rescales the encoded columns)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Show the first rows of the scaled training set
print("First rows of scaled X_train:\n", X_train_scaled[:5])

First rows of scaled X_train:
 [[-0.80517603 -0.81343264 -0.72595401  1.45963893  0.62183455 -0.05794566
  -0.15148573 -0.333264   -0.25773351 -0.28924819  0.30744539 -0.26274904
   0.37592734]
 [-0.70270416 -0.81343264 -0.72595401  0.35985883  0.62183455 -0.82767203
  -0.15148573 -0.333264    0.4736217  -0.23814341 -0.28505687 -0.26274904
  -0.85480844]
 [ 1.58409965  0.44695296  1.37749773 -1.14709938 -1.82815834 -0.05794566
  -0.15148573 -0.333264   -0.29301819 -0.28924819 -0.28505687 -0.26274904
  -1.49351925]
 [ 1.53406455  1.70733855 -0.72595401  0.0509823   0.62183455 -0.61774665
  -0.15148573  0.00273796 -0.29173511  0.18718024  0.59647088 -0.26274904
   1.50443718]
 [-1.5388906   0.44695296  1.37749773 -1.25338485 -1.82815834  0.50185534
  -0.15148573 -0.333264   -0.29301819 -0.28924819 -0.28505687 -0.26274904
  -1.65048032]]


In [9]:
# First, make sure the Random Forest model is trained

# Create the Random Forest model
rf_model = RandomForestClassifier(random_state=42)

# Train the model on the training set
rf_model.fit(X_train_scaled, y_train)

# Get the feature importances
importances = rf_model.feature_importances_

# Build a DataFrame with the importances
importances_df = pd.DataFrame({'Feature': X.columns, 'Importance': importances})

# Sort the features by importance
importances_df = importances_df.sort_values(by='Importance', ascending=False)

# Show the most important features
print(importances_df.head())

        Feature  Importance
3         Cabin    0.134712
10          Spa    0.110485
0   PassengerId    0.102269
12         Name    0.098754
7   RoomService    0.096778
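One way to turn an importance ranking like the one above into an actual feature selection is to keep only the top-ranked columns. The sketch below does this on synthetic data from `make_classification`. The feature names `f0`..`f7` and the cutoff `top_k = 4` are arbitrary choices for illustration. Impurity-based importances can favor high-cardinality columns (identifier-like fields such as `PassengerId` or `Name`, for example), so the ranking is worth treating with some caution.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=42
)
feature_names = [f"f{i}" for i in range(X_demo.shape[1])]

forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)

# Rank features by impurity-based importance (the values sum to 1).
ranking = pd.Series(forest.feature_importances_, index=feature_names)
ranking = ranking.sort_values(ascending=False)

# Keep the top_k highest-ranked features.
top_k = 4
selected = ranking.head(top_k).index.tolist()
selected_idx = [feature_names.index(name) for name in selected]
X_selected = X_demo[:, selected_idx]

print(ranking.round(3))
print("Selected:", selected)
```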


- Now let's use the best model we have so far to see how much it improves when we fine-tune its hyperparameters.

In [10]:
# Apply RandomizedSearchCV

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Define the parameters for the randomized search
param_dist = {
    'n_estimators': [100],  # Number of trees
    'max_depth': [10, 20],  # Maximum tree depth
    'min_samples_split': [2, 5],  # Minimum number of samples required to split a node
    'min_samples_leaf': [1, 2],  # Minimum number of samples per leaf
    'max_features': ['sqrt'],  # Number of features considered at each split
}
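The size of a list-based search space like this one can be counted before any model is fit. The sketch below copies the same values into a dictionary named `search_space` (a name chosen here for illustration) and uses `ParameterGrid` to enumerate the combinations.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the search space defined above.
search_space = {
    "n_estimators": [100],
    "max_depth": [10, 20],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "max_features": ["sqrt"],
}

# 1 * 2 * 2 * 2 * 1 combinations
n_candidates = len(ParameterGrid(search_space))
print(n_candidates)      # 8
print(n_candidates * 3)  # 24 fits with 3-fold cross-validation
```

With only 8 distinct combinations, a randomized search asked for more than 8 iterations has nothing extra to sample, so it ends up evaluating the full grid.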

In [11]:

# Create the Random Forest classifier
rf_model = RandomForestClassifier(random_state=42)

# Create the RandomizedSearchCV object; n_iter=10 is requested, but the search space above
# has only 8 combinations, so all 8 are evaluated
random_search = RandomizedSearchCV(estimator=rf_model, param_distributions=param_dist, 
                                   n_iter=10, cv=3, n_jobs=-1, verbose=2, scoring='accuracy', random_state=42)

# Fit RandomizedSearchCV on the training data
random_search.fit(X_train_scaled, y_train)

# Show the best parameters found and the best cross-validation score
print("Best parameters found: ", random_search.best_params_)
print("Best validation score: ", random_search.best_score_)

Fitting 3 folds for each of 8 candidates, totalling 24 fits




[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   1.9s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   2.0s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time=   2.0s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   2.1s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time=   2.2s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time=   2.1s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time=   2.2s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time=   2.1s
[CV] END max_depth=10, max_features=sqrt
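Randomized search differs from grid search mainly when the parameters are given as distributions rather than short lists. The following is a minimal sketch on synthetic data, not the lab's dataset. The ranges passed to `scipy.stats.randint` and the small `n_iter=5` are arbitrary assumptions chosen to keep the example fast.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Each value is a distribution to sample from (randint excludes the upper bound).
param_distributions = {
    "n_estimators": randint(20, 60),
    "max_depth": randint(3, 15),
    "min_samples_split": randint(2, 10),
    "min_samples_leaf": randint(1, 5),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,          # number of sampled candidates
    cv=3,
    scoring="accuracy",
    random_state=42,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Here `n_iter` controls the budget directly: the search draws exactly that many candidates, however large the underlying space is.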

- Evaluate your model

In [12]:

# Get the best model found by RandomizedSearchCV
best_rf_model = random_search.best_estimator_

# Predict on the test set
y_pred_best = best_rf_model.predict(X_test_scaled)

# Evaluate the model on the test set
accuracy = accuracy_score(y_test, y_pred_best)
print("Accuracy with the best hyperparameters:", accuracy)

# Show the classification report and the confusion matrix
print("\nClassification Report:\n", classification_report(y_test, y_pred_best))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_best))

Accuracy with the best hyperparameters: 0.78953421506613

Classification Report:
               precision    recall  f1-score   support

       False       0.80      0.77      0.78       861
        True       0.78      0.81      0.79       878

    accuracy                           0.79      1739
   macro avg       0.79      0.79      0.79      1739
weighted avg       0.79      0.79      0.79      1739


Confusion Matrix:
 [[666 195]
 [171 707]]
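The headline numbers in the report can be recomputed from the confusion matrix, which helps when checking that the metrics are being read correctly. The sketch below assumes scikit-learn's convention (rows are true classes, columns are predicted classes, ordered `False` then `True`).

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class.
cm = np.array([[666, 195],
               [171, 707]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # correct predictions over all test rows
precision_true = tp / (tp + fp)   # of the rows predicted True, how many were True
recall_true = tp / (tp + fn)      # of the rows that were True, how many were found

print(round(accuracy, 4), round(precision_true, 2), round(recall_true, 2))
```

The row sums (861 and 878) match the `support` column of the classification report.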


**Grid/Random Search**

For this lab we will use Grid Search.

- Define hyperparameters to fine tune.

In [13]:
# GridSearchCV and RandomForestClassifier were already imported in the first cell
# from sklearn.model_selection import GridSearchCV
# from sklearn.ensemble import RandomForestClassifier

# Define the parameter grid for the search
param_grid = {
    'n_estimators': [100],  # Number of trees
    'max_depth': [10, 20],  # Maximum tree depth
    'min_samples_split': [2, 5],  # Minimum number of samples required to split a node
    'min_samples_leaf': [1, 2],    # Minimum number of samples per leaf
    'max_features': ['sqrt'],  # Number of features considered at each split
}

- Run Grid Search

In [14]:
# Create the Random Forest classifier
rf_model = RandomForestClassifier(random_state=42)

# Create the GridSearchCV object
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, 
                           cv=5, n_jobs=-1, verbose=2, scoring='accuracy')

# Fit GridSearchCV on the training data
grid_search.fit(X_train_scaled, y_train)

# Show the best parameters and the best cross-validation score
print("Best parameters found: ", grid_search.best_params_)
print("Best validation score: ", grid_search.best_score_)

Fitting 5 folds for each of 8 candidates, totalling 40 fits
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   2.3s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   2.3s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   2.3s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   2.3s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time=   2.4s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   2.4s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time=   2.4s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=100;
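Beyond `best_params_`, a fitted search object exposes `cv_results_`, which shows how close the candidates were to each other. The sketch below runs a deliberately tiny grid on synthetic data. The grid `demo_grid` and the data are illustrative assumptions, not the lab's configuration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data and a small illustrative grid (2 x 2 = 4 candidates).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
demo_grid = {"max_depth": [3, 6], "min_samples_leaf": [1, 2]}

demo_search = GridSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=42),
    demo_grid,
    cv=5,
    scoring="accuracy",
)
demo_search.fit(X_demo, y_demo)

# One row per candidate, best-ranked first.
results = pd.DataFrame(demo_search.cv_results_)
summary = results[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(summary.to_string(index=False))
```

When the mean scores differ by less than their standard deviations, the choice between candidates is largely down to fold-to-fold noise.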

- Evaluate your model

In [15]:
# Get the best model found by GridSearchCV
best_rf_model = grid_search.best_estimator_

# Predict on the test set
y_pred_best = best_rf_model.predict(X_test_scaled)

# Evaluate the model on the test set
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

accuracy = accuracy_score(y_test, y_pred_best)
print("Accuracy with the best hyperparameters:", accuracy)
print("\nClassification Report:\n", classification_report(y_test, y_pred_best))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_best))

Accuracy with the best hyperparameters: 0.7878090856814262

Classification Report:
               precision    recall  f1-score   support

       False       0.81      0.75      0.78       861
        True       0.77      0.82      0.80       878

    accuracy                           0.79      1739
   macro avg       0.79      0.79      0.79      1739
weighted avg       0.79      0.79      0.79      1739


Confusion Matrix:
 [[647 214]
 [155 723]]
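To put the two test accuracies side by side, the sketch below recomputes them from the two confusion matrices reported in this lab and adds a rough binomial standard error for a test set of 1739 rows. The standard-error formula is a common approximation that treats test predictions as independent, which is an assumption rather than something the lab establishes.

```python
import math

n_test = 1739
correct_random = 666 + 707  # diagonal of the randomized-search confusion matrix
correct_grid = 647 + 723    # diagonal of the grid-search confusion matrix

acc_random = correct_random / n_test
acc_grid = correct_grid / n_test

# Rough binomial standard error of an accuracy estimate on n_test rows.
std_err = math.sqrt(acc_grid * (1 - acc_grid) / n_test)

print("Randomized search accuracy:", round(acc_random, 4))
print("Grid search accuracy:", round(acc_grid, 4))
print("Difference in correct predictions:", correct_random - correct_grid)
print("Approximate standard error:", round(std_err, 4))
```

The two models differ by only a handful of test rows, well inside one standard error, so under this approximation neither search can be said to have produced a clearly better model.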
