# LAB | Hyperparameter Tuning

**Load the data**

This is the final step: maximizing the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before, then take the best-performing models you have so far and fine-tune their hyperparameters.
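
To make "default values for hyperparameters" concrete, the sketch below prints the defaults of a few settings. It assumes the model in question is scikit-learn's `RandomForestClassifier`, which is the estimator tuned later in this lab; the list `tuned_names` is only an illustrative selection.

```python
from sklearn.ensemble import RandomForestClassifier

# Constructor arguments tuned later in this lab (illustrative selection)
tuned_names = ["n_estimators", "max_depth", "min_samples_split", "min_samples_leaf"]

# get_params() returns every constructor argument with its current value;
# on a freshly created estimator these are the library defaults
defaults = RandomForestClassifier().get_params()
default_subset = {name: defaults[name] for name in tuned_names}

print(default_subset)
```

Any value not passed explicitly when the model is created stays at these defaults, which is what "training with default hyperparameters" means here.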

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [3]:
# Identify null values
print("Null values per column:")
print(spaceship.isnull().sum())

# 1. Fill null values (mean for numeric columns, mode for categorical columns)
for column in spaceship.columns:
    if spaceship[column].dtype in ['int64', 'float64']:
        # Fill numeric columns with the mean
        spaceship[column] = spaceship[column].fillna(spaceship[column].mean())
    else:
        # Fill categorical columns with the mode
        spaceship[column] = spaceship[column].fillna(spaceship[column].mode()[0])

# Check the null values again
print("\nNull values after handling:")
print(spaceship.isnull().sum())

# Quick look at the cleaned dataset
print("\nData after cleaning:")
print(spaceship.head())

Null values per column:
PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

Null values after handling:
PassengerId     0
HomePlanet      0
CryoSleep       0
Cabin           0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
Name            0
Transported     0
dtype: int64

Data after cleaning:
  PassengerId HomePlanet  CryoSleep  Cabin  Destination   Age    VIP  \
0     0001_01     Europa      False  B/0/P  TRAPPIST-1e  39.0  False   
1     0002_01      Earth      False  F/0/S  TRAPPIST-1e  24.0  False   
2     0003_01     Europa      False  A/0/S  TRAPPIST-1e  58.0   True   
3     0003_02     Europa      False  A/0/S  TRAPPIST-1e  33.0  F

In [4]:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler, LabelEncoder

In [5]:
# Encode categorical variables
spaceship = pd.get_dummies(spaceship, drop_first=True)
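
One-hot encoding every text column also expands identifier-like columns such as `PassengerId`, `Name`, and `Cabin` into one column per distinct value. The sketch below illustrates the effect on a tiny made-up frame (not the real data) and shows one possible alternative: dropping the identifier and keeping only the cabin deck. The `Deck` column and the choice of which columns to drop are assumptions for illustration, not part of the lab.

```python
import pandas as pd

# Tiny made-up frame that mimics the shape of the problem (not the real data)
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0003_01", "0003_02"],
    "HomePlanet": ["Europa", "Earth", "Europa", "Europa"],
    "Cabin": ["B/0/P", "F/0/S", "A/0/S", "A/0/S"],
    "Age": [39.0, 24.0, 58.0, 33.0],
})

# Encoding everything: each distinct ID and cabin becomes its own column
encoded_all = pd.get_dummies(toy, drop_first=True)

# Alternative (assumption): drop the identifier, keep only the deck letter of the cabin
reduced = (
    toy.drop(columns=["PassengerId"])
       .assign(Deck=toy["Cabin"].str.split("/").str[0])
       .drop(columns=["Cabin"])
)
encoded_reduced = pd.get_dummies(reduced, drop_first=True)

print(encoded_all.shape, encoded_reduced.shape)
```

On the real dataset the difference would be far larger, since nearly every passenger has a unique ID and name.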

In [6]:
x = spaceship.drop("Transported", axis=1)
y = spaceship["Transported"]

Now perform the same as before:
- Feature Scaling
- Feature Selection


In [7]:
#your code here
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)

In [9]:
selector = SelectKBest(score_func=f_classif, k=10)
x_selected = selector.fit_transform(x_scaled, y)
selected_features = x.columns[selector.get_support()]
print("Selected Features:", selected_features)

Selected Features: Index(['CryoSleep', 'Age', 'VIP', 'RoomService', 'FoodCourt', 'Spa', 'VRDeck',
       'HomePlanet_Europa', 'Cabin_G/981/S', 'Destination_TRAPPIST-1e'],
      dtype='object')
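
Here the scaler and the selector are fit on the whole dataset before any split, so information from rows later used for validation can leak into preprocessing. A common way to avoid this is to put the steps in a scikit-learn `Pipeline`, so each cross-validation fold refits them on its own training part. The sketch below uses synthetic data from `make_classification` as a stand-in for the encoded features; the step names and sizes are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded Spaceship Titanic features (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=42
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# Each fold refits the scaler and the selector on that fold's training rows only
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")
print(scores.mean())
```

The same `pipe` object can be passed to `GridSearchCV` as the estimator, with parameter names prefixed by the step name, for example `model__max_depth`.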


- Now let's take the best model we have so far and see how much it improves when we fine-tune its hyperparameters.

In [18]:
#your code here
X_train, X_test, y_train, y_test = train_test_split(x_selected, y, test_size=0.2, random_state=42)

- Evaluate your model

In [19]:
#your code here
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Define the base model
model = RandomForestClassifier(random_state=42)

# Define the hyperparameter ranges to tune
param_grid = {
    'n_estimators': [50, 100, 200],  # Number of trees
    'max_depth': [None, 10, 20, 30],  # Maximum depth of each tree
    'min_samples_split': [2, 5, 10],  # Minimum number of samples required to split a node
    'min_samples_leaf': [1, 2, 4]     # Minimum number of samples required at a leaf
}

# Configure the grid search
grid_search = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    scoring='accuracy',  # Evaluation metric
    cv=5,                # 5-fold cross-validation
    verbose=2,
    n_jobs=-1            # Use all available cores
)

# Fit the grid search on the selected features (the full dataset, not only X_train)
grid_search.fit(x_selected, y)

# Grid search results
print("Best hyperparameters:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)


Fitting 5 folds for each of 108 candidates, totalling 540 fits
Best hyperparameters: {'max_depth': 30, 'min_samples_leaf': 2, 'min_samples_split': 10, 'n_estimators': 200}
Best cross-validation score: 0.7921344158349275


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

**Grid/Random Search**

For this lab we will use Grid Search.
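
For comparison, the other option named in the heading, random search, tries only a fixed number of randomly sampled combinations instead of all of them. The sketch below uses scikit-learn's `RandomizedSearchCV` on synthetic data; the dataset, `n_iter=5`, and `cv=3` are assumptions chosen to keep it quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption), so the example runs without downloads
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)

# Candidate values to sample from (illustrative)
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,          # only 5 sampled combinations are tried
    scoring="accuracy",
    cv=3,
    random_state=42,
)
random_search.fit(X_demo, y_demo)

print(random_search.best_params_)
```

Random search trades exhaustiveness for speed, which tends to matter more as the number of hyperparameters grows.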

- Define hyperparameters to fine tune.

In [14]:
#your code here

# Define the base model
model = RandomForestClassifier(random_state=42)

# Define the hyperparameters to tune
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

- Run Grid Search

In [15]:
# Configure GridSearchCV
grid_search = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    scoring='accuracy',
    cv=5,
    verbose=2,
    n_jobs=-1
)

# Fit the grid search (on the full dataset, not only X_train)
grid_search.fit(x_selected, y)

# Best parameters and best cross-validation score
print("Best hyperparameters:", grid_search.best_params_)
print("Best validation score:", grid_search.best_score_)

Fitting 5 folds for each of 108 candidates, totalling 540 fits
Best hyperparameters: {'max_depth': 30, 'min_samples_leaf': 2, 'min_samples_split': 10, 'n_estimators': 200}
Best validation score: 0.7921344158349275
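
The "108 candidates, totalling 540 fits" line in the output follows directly from the grid: 3 × 4 × 3 × 3 combinations, each fit once per fold. A small check with scikit-learn's `ParameterGrid`, using the same grid as above:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# ParameterGrid enumerates every combination that GridSearchCV will try
n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * 5  # cv=5 folds per candidate

print(n_candidates, n_fits)  # 108 540
```

Adding one more value to any list multiplies the total, so it is worth counting before launching a long search.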


- Evaluate your model

In [16]:
# Metrics used to evaluate the model
from sklearn.metrics import classification_report, confusion_matrix

In [17]:
# Best model
best_model = grid_search.best_estimator_

# Predictions on the full dataset (x_selected); the model was fit on these
# same rows, so the metrics below are optimistic compared with a held-out test set
y_pred = best_model.predict(x_selected)

# Evaluation metrics
print("Confusion matrix:")
print(confusion_matrix(y, y_pred))
print("\nClassification report:")
print(classification_report(y, y_pred))

Confusion matrix:
[[3528  787]
 [ 498 3880]]

Classification report:
              precision    recall  f1-score   support

       False       0.88      0.82      0.85      4315
        True       0.83      0.89      0.86      4378

    accuracy                           0.85      8693
   macro avg       0.85      0.85      0.85      8693
weighted avg       0.85      0.85      0.85      8693
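
The accuracy of 0.85 above is higher than the cross-validation score of about 0.79, which is consistent with predicting on the same rows the model was fit on. A held-out estimate would fit the search on `X_train` only and score on `X_test`. The sketch below shows that pattern on synthetic data; the dataset, the reduced grid `small_grid`, and `cv=3` are assumptions to keep it fast, so its numbers say nothing about the Spaceship Titanic model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Reduced grid (assumption) so the example finishes quickly
small_grid = {"max_depth": [None, 5], "min_samples_leaf": [1, 4]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    small_grid,
    scoring="accuracy",
    cv=3,
)
# Fit on the training split only; the test split stays unseen
search.fit(X_train, y_train)

train_acc = accuracy_score(y_train, search.best_estimator_.predict(X_train))
test_acc = accuracy_score(y_test, search.best_estimator_.predict(X_test))
print(train_acc, test_acc)
```

In the lab itself, the equivalent change would be `grid_search.fit(X_train, y_train)` followed by `best_model.predict(X_test)` and metrics computed against `y_test`.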

