# LAB | Hyperparameter Tuning

**Load the data**

This is the final step: maximizing the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before, then take the best-performing models you have obtained so far and fine-tune their hyperparameters.

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


In [3]:
# Drop rows with missing values; .copy() gives an independent DataFrame to modify below
spaceship = spaceship.dropna().copy()
# Extract the deck letter (the part before the first slash) using a regex
spaceship['Cabin'] = spaceship['Cabin'].str.extract(r'^([A-Z])', expand=False)
# Build separate objects for the features and the target
target = spaceship["Transported"]
spaceship = spaceship.drop(columns=["PassengerId", "Name", "Transported"])
features = pd.get_dummies(spaceship, columns=["HomePlanet", "Cabin", "Destination"])

In [4]:

# Split the data into training and test sets, with X for the features (independent variables)
# and y for the target (dependent variable to predict). 80% of the data is used for training and 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.20, random_state=42)

Now perform the same as before:
- Feature Scaling
- Feature Selection


In [5]:
# scaling the data 
from sklearn.preprocessing import MinMaxScaler, StandardScaler

scaler= StandardScaler()

scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [6]:
normalizer = MinMaxScaler()
X_train_norm = normalizer.fit_transform(X_train)
X_test_norm = normalizer.transform(X_test)
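
Both scalers above are fitted on the training split only and then applied to the test split. The sketch below uses made-up numbers (the arrays `train` and `test` are illustrative and not taken from the dataset) to show one consequence: a test value outside the training range lands outside [0, 1] after min-max scaling.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy single-feature data (hypothetical values, not from the Spaceship Titanic set)
train = np.array([[0.0], [10.0]])
test = np.array([[5.0], [20.0]])

# Min-max scaling learns min=0 and max=10 from the training data only
minmax = MinMaxScaler().fit(train)
test_minmax = minmax.transform(test)

# Standardization learns mean=5 and population std=5 from the training data only
standard = StandardScaler().fit(train)
test_standard = standard.transform(test)

print(test_minmax.ravel())    # 0.5 and 2.0: the second value exceeds 1
print(test_standard.ravel())  # 0.0 and 3.0
```

Fitting on the training split alone keeps information about the test set out of the preprocessing step, which is why `fit` is never called on `X_test` in this lab.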

In [8]:
# PCA to reduce the dimensionality of the Destination dummy columns: 'Destination_55 Cancri e', 'Destination_PSO J318.5-22', 'Destination_TRAPPIST-1e'
from sklearn.decomposition import PCA
pca_train_Destination=PCA(n_components=1)
pca_train_Destination.fit(X_train[['Destination_55 Cancri e', 'Destination_PSO J318.5-22','Destination_TRAPPIST-1e']])
X_train['Destination_PCA']=pca_train_Destination.transform(X_train[['Destination_55 Cancri e', 'Destination_PSO J318.5-22','Destination_TRAPPIST-1e']])
X_test['Destination_PCA']=pca_train_Destination.transform(X_test[['Destination_55 Cancri e', 'Destination_PSO J318.5-22','Destination_TRAPPIST-1e']])
#spaceship_2=spaceship_2.drop(columns=['Destination_55 Cancri e', 'Destination_PSO J318.5-22','Destination_TRAPPIST-1e'])

In [9]:
# PCA to reduce the dimensionality of the Cabin dummy columns
columns=['Cabin_A','Cabin_B', 'Cabin_C', 'Cabin_D', 'Cabin_E', 'Cabin_F', 'Cabin_G','Cabin_T']
from sklearn.decomposition import PCA
# n_components=0.7 keeps as many components as needed to explain at least 70% of the variance
pca_train_planet=PCA(n_components=0.7)
pca_train_planet.fit(X_train[columns])
pca_result_train=pca_train_planet.transform(X_train[columns])

num_components = pca_result_train.shape[1]

# Build generic column names based on the number of components
pca_column_names = [f'PCA_cabin_{i+1}' for i in range(num_components)]

# Add the new columns to the X_train DataFrame
X_train[pca_column_names] = pca_result_train

# Transform the test set with the PCA fitted on the training set
pca_result_test=pca_train_planet.transform(X_test[columns])

# Add the same columns to the X_test DataFrame
X_test[pca_column_names] = pca_result_test
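
Passing a float such as `0.7` to `PCA` means the number of components is decided by the data, not fixed in advance. The sketch below illustrates this on synthetic one-hot columns; the category frequencies in `probs` are invented for the example and do not come from the Cabin column.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic one-hot data: 500 rows, 8 mutually exclusive categories
# with uneven (made-up) frequencies, loosely mimicking dummy columns
rng = np.random.default_rng(0)
probs = [0.35, 0.25, 0.15, 0.10, 0.07, 0.04, 0.03, 0.01]
labels = rng.choice(8, size=500, p=probs)
one_hot = np.eye(8)[labels]

# A float in (0, 1) asks PCA for the smallest number of components
# whose cumulative explained variance exceeds that fraction
pca = PCA(n_components=0.7)
reduced = pca.fit_transform(one_hot)

print("components kept:", pca.n_components_)
print("cumulative variance:", np.cumsum(pca.explained_variance_ratio_))
```

Because the count depends on the data, any hard-coded list of names such as `PCA_cabin_1` to `PCA_cabin_3` implicitly assumes that exactly three components were kept; checking `pca.n_components_` is a cheap safeguard.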

In [10]:
# PCA to reduce the dimensionality of 'HomePlanet_Earth', 'HomePlanet_Europa', 'HomePlanet_Mars'
columns=['HomePlanet_Earth', 'HomePlanet_Europa','HomePlanet_Mars']
from sklearn.decomposition import PCA
pca_train_planet=PCA(n_components=1)
pca_train_planet.fit(X_train[columns])
X_train['homeplanet_PCA']=pca_train_planet.transform(X_train[columns])
X_test['homeplanet_PCA']=pca_train_planet.transform(X_test[columns])

In [11]:
features_selected=['CryoSleep', 'Age', 'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall',
       'Spa', 'VRDeck','Destination_PCA', 'PCA_cabin_1', 'PCA_cabin_2', 'PCA_cabin_3',
       'homeplanet_PCA']

- Now let's use the best model we got so far to see how much it improves when we fine-tune its hyperparameters.

In [7]:
import time

start_time = time.time()

In [15]:
from sklearn.ensemble import RandomForestClassifier 
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
forest = RandomForestClassifier(n_estimators=100,
                             max_depth=10)
forest.fit(X_train_norm, y_train)
predfores = forest.predict(X_test_norm)
# start_time was set in an earlier cell, so the elapsed time also includes any delay between running the two cells
end_time = time.time()
total_time = end_time - start_time
print(f"Total execution time: {total_time:.2f} seconds")


Total execution time: 219.15 seconds


- Evaluate your model

In [16]:
# Compute classification metrics
print("Accuracy:", accuracy_score(y_test, predfores))
print("\nClassification Report:\n", classification_report(y_test, predfores))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, predfores))

Accuracy: 0.8101361573373677

Classification Report:
               precision    recall  f1-score   support

       False       0.81      0.81      0.81       653
        True       0.81      0.81      0.81       669

    accuracy                           0.81      1322
   macro avg       0.81      0.81      0.81      1322
weighted avg       0.81      0.81      0.81      1322


Confusion Matrix:
 [[526 127]
 [124 545]]
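
The reported accuracy and the per-class figures can be recomputed by hand from the confusion matrix. The sketch below assumes scikit-learn's default layout (rows are true labels, columns are predictions, in the sorted order `False`, `True`); the variable names are chosen for this example.

```python
# Confusion matrix reported above: rows are true labels, columns are predictions,
# in the order [False, True] (scikit-learn's default sorted label order)
tn, fp = 526, 127
fn, tp = 124, 545

total = tn + fp + fn + tp

# Accuracy: share of all passengers classified correctly
accuracy = (tn + tp) / total

# Precision and recall for the True (transported) class
precision_true = tp / (tp + fp)
recall_true = tp / (tp + fn)
f1_true = 2 * precision_true * recall_true / (precision_true + recall_true)

print(f"total     = {total}")
print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision_true:.2f}")
print(f"recall    = {recall_true:.2f}")
print(f"f1        = {f1_true:.2f}")
```

The total of 1322 matches the support in the classification report, and the two error counts (127 and 124) are nearly equal, which is consistent with both classes showing the same rounded scores.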


**Grid/Random Search**

For this lab we will use Grid Search, and then compare it against Random Search.

- Define hyperparameters to fine tune.

In [18]:
# GridSearchCV: define the hyperparameters to search
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, 20]
}
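
Before launching a search it helps to know how many models will be trained. The sketch below counts the candidates in this grid with `ParameterGrid`; the fold count of 5 is an assumption matching the `cv=5` used in this lab, and the extra fit reflects `GridSearchCV`'s default `refit=True`.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, 20]
}

grid = ParameterGrid(param_grid)
n_candidates = len(grid)              # 3 * 4 combinations
n_folds = 5                           # assumed number of cross-validation folds
n_fits = n_candidates * n_folds + 1   # +1 for the final refit on all training data

print("candidates:", n_candidates)
print("total fits:", n_fits)

# Peek at the first few combinations that would be evaluated
for params in list(grid)[:3]:
    print(params)
```

Every value added to either list multiplies the cost, so it is usually worth starting with a coarse grid and refining around the best region.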


- Run Grid Search

In [27]:
#run grid search
grid_search = GridSearchCV(forest, param_grid, cv=5)
grid_search.fit(X_train_norm, y_train)



In [29]:
grid_search.best_estimator_
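
Besides `best_estimator_`, a fitted search exposes the winning parameters, the mean cross-validated score, and the full results table. The sketch below shows these attributes on a small synthetic dataset from `make_classification`, not on the Spaceship Titanic data; the reduced grid `small_grid` and the `cv=3` setting are chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: any binary classification set would do)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

small_grid = {'n_estimators': [10, 20], 'max_depth': [3, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      small_grid, cv=3, scoring='accuracy')
search.fit(X, y)

print("best params:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))

# Mean and spread of the cross-validated score for every candidate
results = search.cv_results_
for params, mean, std in zip(results['params'],
                             results['mean_test_score'],
                             results['std_test_score']):
    print(params, f"{mean:.3f} +/- {std:.3f}")
```

Comparing the spread across folds with the gap between candidates gives a sense of whether the "best" setting is meaningfully better or within noise.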

- Evaluate your model

In [30]:
#Evaluate the model 
pred = grid_search.predict(X_test_norm)
print("Accuracy:", accuracy_score(y_test, pred))

Accuracy: 0.8101361573373677


In [31]:
# Random search: define the hyperparameters to sample from
from sklearn.model_selection import RandomizedSearchCV
param_dist = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, 20]
}


In [32]:
#run Random search
random_search = RandomizedSearchCV(forest, param_dist, cv=5)
random_search.fit(X_train_norm, y_train)


In [33]:
random_search.best_estimator_

In [34]:
#Evaluate the model
pred = random_search.predict(X_test_norm)
print("Accuracy:", accuracy_score(y_test, pred))


Accuracy: 0.8146747352496218
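
`RandomizedSearchCV` evaluates `n_iter` sampled candidates (10 by default), so with the 12-combination lists above it covers most of the same grid. It becomes more useful when parameters are drawn from distributions. The sketch below is one possible setup on synthetic data; the ranges, `n_iter=5`, and `cv=3` are illustrative assumptions, not tuned recommendations.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data, not the Spaceship Titanic set
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# randint(low, high) samples integers in [low, high)
param_dist = {
    'n_estimators': randint(10, 50),
    'max_depth': randint(3, 10),
}

search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_dist, n_iter=5, cv=3, random_state=0)
search.fit(X, y)

sampled = search.cv_results_['params']
print("candidates tried:", len(sampled))
print("best params:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```

Setting `random_state` on both the estimator and the search makes the run repeatable, which matters when comparing accuracies that differ only in the third decimal place.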
