# LAB | Hyperparameter Tuning

**Load the data**

Final step: maximize the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before, then take the best-performing model you have obtained so far and fine-tune its hyperparameters.

In [28]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [29]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [30]:
# Load cleaned data
spaceship_clean = pd.read_csv("spaceship_clean.csv")
spaceship_clean

Unnamed: 0,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,HomePlanet_Earth,...,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,Cabin_A,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T
0,0,39.0,0,0.0,0.0,0.0,0.0,0.0,0,0,...,0,1,0,1,0,0,0,0,0,0
1,0,24.0,0,109.0,9.0,25.0,549.0,44.0,1,1,...,0,1,0,0,0,0,0,1,0,0
2,0,58.0,1,43.0,3576.0,0.0,6715.0,49.0,0,0,...,0,1,1,0,0,0,0,0,0,0
3,0,33.0,0,0.0,1283.0,371.0,3329.0,193.0,0,0,...,0,1,1,0,0,0,0,0,0,0
4,0,16.0,0,303.0,70.0,151.0,565.0,2.0,1,1,...,0,1,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6601,0,41.0,1,0.0,6819.0,0.0,1643.0,74.0,0,0,...,0,0,1,0,0,0,0,0,0,0
6602,1,18.0,0,0.0,0.0,0.0,0.0,0.0,0,1,...,1,0,0,0,0,0,0,0,1,0
6603,0,26.0,0,0.0,0.0,1872.0,1.0,0.0,1,1,...,0,1,0,0,0,0,0,0,1,0
6604,0,32.0,0,0.0,1049.0,0.0,353.0,3235.0,0,0,...,0,0,0,0,0,0,1,0,0,0


Now repeat the same preprocessing steps as before:
- Feature Scaling
- Feature Selection


In [31]:
# Feature Scaling
from sklearn.preprocessing import MinMaxScaler

# Initialize the MinMax scaler
minmax_scaler = MinMaxScaler()

# Define the numeric columns
num_cols = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

# Scale the numeric columns
spaceship_clean_scaling = spaceship_clean.copy()
spaceship_clean_scaling[num_cols] = minmax_scaler.fit_transform(spaceship_clean_scaling[num_cols])

In [32]:
# Train/test split on the scaled data
features_scaling = spaceship_clean_scaling.drop(columns='Transported')
target_scaling = spaceship_clean_scaling['Transported']
X_train_scaling, X_test_scaling, y_train_scaling, y_test_scaling = train_test_split(features_scaling, target_scaling, test_size=0.2, random_state=42)
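
The cells above fit `MinMaxScaler` on the full table before splitting, so the test rows contribute to the minimum and maximum used for scaling. A minimal sketch of the alternative order (split first, fit the scaler on the training rows only) is shown below. It runs on a small synthetic frame rather than on `spaceship_clean`; the names `demo`, `demo_num_cols`, `demo_scaler`, `X_tr` and `X_te` are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for spaceship_clean (hypothetical values, two numeric columns only)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Age": rng.uniform(0, 80, size=200),
    "Spa": rng.exponential(300, size=200),
    "Transported": rng.integers(0, 2, size=200),
})
demo_num_cols = ["Age", "Spa"]

# Split first, so the test rows never influence the scaler
X_demo = demo.drop(columns="Transported")
y_demo = demo["Transported"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both splits
demo_scaler = MinMaxScaler()
X_tr = X_tr.copy()
X_te = X_te.copy()
X_tr[demo_num_cols] = demo_scaler.fit_transform(X_tr[demo_num_cols])
X_te[demo_num_cols] = demo_scaler.transform(X_te[demo_num_cols])

print(X_tr[demo_num_cols].min().tolist(), X_tr[demo_num_cols].max().tolist())
```

With this order the training columns span exactly the range 0 to 1, while test values may fall slightly outside it, which is expected.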

In [33]:
# Random forest model (fitted in the next cell)
from sklearn.ensemble import RandomForestClassifier

# Make an instance of the model
random_forest = RandomForestClassifier(n_estimators=100, random_state=42)

In [34]:
# Feature selection
from sklearn.feature_selection import SelectFromModel

# Train the model
random_forest.fit(X_train_scaling, y_train_scaling)

# Get feature importances
feature_importances = random_forest.feature_importances_
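
The raw `feature_importances_` array is easier to read once it is paired with column names and sorted. The sketch below shows one way to do that on synthetic data from `make_classification`; the names `demo_forest`, `X_demo` and `importance_table` are hypothetical, and the values it prints say nothing about the Spaceship Titanic features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the scaled training set (hypothetical)
X_arr, y_arr = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(6)])

demo_forest = RandomForestClassifier(n_estimators=50, random_state=42)
demo_forest.fit(X_demo, y_arr)

# Pair each importance with its column name and sort from most to least important
importance_table = pd.Series(
    demo_forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importance_table)
```

On the lab data, the same pattern would use `X_train_scaling.columns` as the index.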

- Now let's take the best model we have so far and see how much it improves when we fine-tune its hyperparameters.

In [35]:
# Feature selector built on the random forest (importance threshold 0.1); it is not fitted or applied in this cell
sfm = SelectFromModel(random_forest, threshold=0.1)
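
The selector above is created but never fitted or applied, so the later cells still train on every column. As a rough sketch of how `SelectFromModel` could be applied, the example below fits a selector on synthetic data and keeps only the columns whose importance reaches the threshold. The names `demo_sfm`, `X_selected` and `kept_mask` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data standing in for the scaled training set (hypothetical)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=42
)

# fit() trains a clone of the forest and records which features pass the threshold
demo_sfm = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=42), threshold=0.1
)
demo_sfm.fit(X_demo, y_demo)

# Boolean mask of kept columns, and the reduced feature matrix
kept_mask = demo_sfm.get_support()
X_selected = demo_sfm.transform(X_demo)
print(X_selected.shape, kept_mask)
```

The same fitted selector should then transform the test split, so that both splits keep identical columns.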

- Evaluate your model

In [42]:
# Evaluate the model (accuracy, f1-score, confusion matrix)
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

# Predict the test data
y_pred_scaling = random_forest.predict(X_test_scaling)

# Accuracy
accuracy_scaling = accuracy_score(y_test_scaling, y_pred_scaling)
print(f"Accuracy: {accuracy_scaling}")

# F1-score
f1_scaling = f1_score(y_test_scaling, y_pred_scaling)
print(f"F1-score: {f1_scaling}")

# Confusion matrix
confusion_matrix_scaling = confusion_matrix(y_test_scaling, y_pred_scaling)
print(f"Confusion matrix:\n{confusion_matrix_scaling}")

Accuracy: 0.8093797276853253
F1-score: 0.8082191780821918
Confusion matrix:
[[539 114]
 [138 531]]
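
To see where these numbers come from, accuracy and F1 can be recomputed by hand from the confusion matrix. The sketch assumes scikit-learn's layout for binary labels, `[[TN, FP], [FN, TP]]`, with rows as true classes and columns as predicted classes.

```python
# Confusion matrix reported above, laid out as [[TN, FP], [FN, TP]]
tn, fp = 539, 114
fn, tp = 138, 531

# Accuracy: share of all predictions that are correct
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Precision and recall for the positive class (Transported = 1)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-score: {f1}")
```

The recomputed accuracy and F1 agree with the values printed by `accuracy_score` and `f1_score`.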


**Grid/Random Search**

For this lab we will use Grid Search.
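
For comparison, random search samples a fixed number of combinations instead of trying all of them. The sketch below shows `RandomizedSearchCV` on synthetic data with a deliberately small search space; the grid values, `n_iter=5` and `cv=3` are illustrative assumptions, not settings used in this lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the scaled training set (hypothetical)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# Candidate values to sample from (illustrative, kept small so the example runs fast)
param_distributions = {
    "n_estimators": [20, 40, 60],
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# n_iter controls how many random combinations are evaluated
random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    random_state=42,
)
random_search.fit(X_demo, y_demo)
print(random_search.best_params_)
```

Only 5 of the 81 possible combinations are evaluated here, which trades coverage of the search space for run time.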

- Define the hyperparameters to fine-tune.

In [37]:
# Tune the best model so far: random forest trained on the scaled data
from sklearn.model_selection import GridSearchCV

# Define the grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
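
Grid search evaluates every combination in the grid, so its cost grows multiplicatively with each added value. The short sketch below enumerates the combinations for the grid above using only the standard library. The `cv_folds = 2` value mirrors the `cv=2` setting used later in this lab, and `candidates` is a hypothetical helper name.

```python
from itertools import product

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Every combination of one value per hyperparameter
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]

n_candidates = len(candidates)
cv_folds = 2  # same number of folds as the grid search in this lab
print(n_candidates, n_candidates * cv_folds)  # 81 162
```

Adding a fourth value to any one of the lists would raise the count from 81 to 108 candidates.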

- Run Grid Search

In [38]:
# Make an instance of the GridSearchCV
grid_search = GridSearchCV(estimator=random_forest, param_grid=param_grid, cv=2, verbose=2)

# Fit the GridSearchCV
grid_search.fit(X_train_scaling, y_train_scaling)

Fitting 2 folds for each of 81 candidates, totalling 162 fits
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   0.1s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   0.1s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   0.2s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   0.2s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=300; total time=   0.4s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=300; total time=   0.4s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time=   0.1s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time=   0.1s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time=   0.2s
[CV] END max_depth=10, min_sa
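
Once a search has been fitted, the winning combination and its cross-validated score are available as attributes. The sketch below shows how they might be inspected on a small synthetic problem; `demo_search`, `demo_grid` and `results` are hypothetical names, and the printed values do not correspond to the lab's search.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data and a tiny grid so the example runs quickly (hypothetical)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
demo_grid = {"n_estimators": [20, 40], "max_depth": [3, 6]}

demo_search = GridSearchCV(RandomForestClassifier(random_state=42), demo_grid, cv=2)
demo_search.fit(X_demo, y_demo)

# Best combination and its mean cross-validated accuracy
print(demo_search.best_params_)
print(demo_search.best_score_)

# Full results table, ranked from best to worst candidate
results = pd.DataFrame(demo_search.cv_results_)
ranked = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score"]
]
print(ranked.head())
```

With the default `refit=True`, the search also exposes `best_estimator_`, a model retrained on the full training data with the best combination; that is the model `predict` uses.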

- Evaluate your model

In [40]:
# Evaluate the model
y_pred_grid = grid_search.predict(X_test_scaling)

# Accuracy
accuracy_grid = accuracy_score(y_test_scaling, y_pred_grid)
print(f'Accuracy: {accuracy_grid}')

# F1-score
f1_grid = f1_score(y_test_scaling, y_pred_grid)
print(f'F1 Score: {f1_grid}')

# Confusion matrix
confusion_matrix_grid = confusion_matrix(y_test_scaling, y_pred_grid)
print(f'Confusion matrix:\n{confusion_matrix_grid}')

Accuracy: 0.8161875945537065
F1 Score: 0.8198665678280208
