# LAB | Hyperparameter Tuning

**Load the data**

This is the final step: maximizing the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before, then take the best-performing models you have so far and fine-tune their hyperparameters.

In [13]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [15]:
spaceship_encoded = pd.read_csv(r"C:\Users\Eros\IH-labs\lab-feature-engineering\spaceship_encoded")
print(spaceship_encoded.head())

    Age  RoomService  FoodCourt  ShoppingMall     Spa  VRDeck  Transported  \
0  39.0          0.0        0.0           0.0     0.0     0.0        False   
1  24.0        109.0        9.0          25.0   549.0    44.0         True   
2  58.0         43.0     3576.0           0.0  6715.0    49.0        False   
3  33.0          0.0     1283.0         371.0  3329.0   193.0        False   
4  16.0        303.0       70.0         151.0   565.0     2.0         True   

   HomePlanet_Europa  HomePlanet_Mars  CryoSleep_True  Cabin_B  Cabin_C  \
0               True            False           False     True    False   
1              False            False           False    False    False   
2               True            False           False    False    False   
3               True            False           False    False    False   
4              False            False           False    False    False   

   Cabin_D  Cabin_E  Cabin_F  Cabin_G  Cabin_T  Destination_PSO J318.5-22  \
0  

Now perform the same as before:
- Feature Scaling
- Feature Selection


In [7]:
#your code here
from sklearn.preprocessing import StandardScaler

# Select the numerical features
numerical_features = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit and transform the numerical features
spaceship_encoded[numerical_features] = scaler.fit_transform(spaceship_encoded[numerical_features])

spaceship_encoded.head()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_True,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_True
0,0.695413,-0.345756,-0.285355,-0.309494,-0.273759,-0.269534,False,True,False,False,True,False,False,False,False,False,False,False,True,False
1,-0.336769,-0.176748,-0.279993,-0.266112,0.206165,-0.230494,True,False,False,False,False,False,False,False,True,False,False,False,True,False
2,2.002842,-0.279083,1.845163,-0.309494,5.596357,-0.226058,False,True,False,False,False,False,False,False,False,False,False,False,True,True
3,0.28254,-0.345756,0.479034,0.334285,2.636384,-0.098291,False,True,False,False,False,False,False,False,False,False,False,False,True,False
4,-0.887266,0.124056,-0.24365,-0.04747,0.220152,-0.267759,True,False,False,False,False,False,False,False,True,False,False,False,True,False


In [21]:
X = spaceship_encoded.drop(columns=['Transported'])
y = spaceship_encoded['Transported']

# Perform the train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the resulting datasets
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((5284, 19), (1322, 19), (5284,), (1322,))
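
The scaler above is fitted on the full dataset before the split, so statistics from the test rows influence the training features. A minimal sketch of one alternative, fitting the scaler on the training rows only, is shown below. The synthetic data from `make_classification` and every `*_demo` / `X_tr` / `X_te` name are assumptions used for illustration; they are not part of the lab's dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 rows, 6 numeric features
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

# Split first, so the test rows stay unseen during scaling
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)  # learn mean/std from training rows only
X_te_scaled = demo_scaler.transform(X_te)      # reuse the training statistics

print(X_tr_scaled.mean(axis=0).round(3))
print(X_te_scaled.mean(axis=0).round(3))
```

With this ordering the training columns have mean 0 and unit variance, while the test columns are only approximately centered, which is expected.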

- Now let's use the best model we got so far to see how much it improves when we fine-tune its hyperparameters.

In [23]:
#your code here
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Initialize the Random Forest model
rf_model = RandomForestClassifier(random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Make predictions
y_pred_rf = rf_model.predict(X_test)

# Evaluate the model
accuracy_rf = accuracy_score(y_test, y_pred_rf)

accuracy_rf

0.8063540090771558

- Evaluate your model

In [1]:
#your code here
0.8063540090771558
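
Accuracy alone hides how the errors are distributed between the two classes. The sketch below shows how `confusion_matrix` and `classification_report` (both already imported above) could complement it. It runs on synthetic data from `make_classification`, and the `*_demo` names are hypothetical, so the numbers it prints do not describe the Spaceship Titanic model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), binary target like Transported
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)
y_pred_demo = demo_model.predict(X_te)

demo_accuracy = accuracy_score(y_te, y_pred_demo)
demo_cm = confusion_matrix(y_te, y_pred_demo)  # rows: true class, columns: predicted class

print(f"accuracy: {demo_accuracy:.3f}")
print(demo_cm)
print(classification_report(y_te, y_pred_demo))  # precision, recall, F1 per class
```

In the lab itself the same three calls would take `y_test` and `y_pred_rf` as arguments.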

**Grid/Random Search**

For this lab we will use Grid Search.
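
For comparison, random search samples a fixed number of combinations instead of trying all of them. The following is a minimal sketch with `RandomizedSearchCV`, assuming synthetic data and a small hypothetical search space (`param_distributions`, `n_iter=5`, and `cv=3` are illustrative choices, not values from this lab).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# Hypothetical search space: 3 * 3 * 3 = 27 possible combinations
param_distributions = {
    "n_estimators": [20, 40, 60],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 2, 4],
}

random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,          # evaluate only 5 of the 27 combinations
    cv=3,
    random_state=42,   # makes the sampled combinations reproducible
)
random_search.fit(X_demo, y_demo)

print(random_search.best_params_, round(random_search.best_score_, 3))
```

The cost is controlled by `n_iter` rather than by the size of the grid, which is why random search is often preferred when the grid is large.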

- Define the hyperparameters to fine-tune.

In [None]:
#your code here
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Define the model
rf = RandomForestClassifier(random_state=42)  # fixed seed so the search is reproducible

# Define the hyperparameters grid
param_grid = {
    'n_estimators': [100, 200, 300],        # Number of trees in the forest
    'max_features': ['sqrt', 'log2', None], # Features considered at every split ('auto' was removed in scikit-learn 1.3; None uses all features)
    'max_depth': [10, 20, 30, None],        # Maximum number of levels in tree
    'min_samples_split': [2, 5, 10],        # Minimum number of samples required to split a node
    'min_samples_leaf': [1, 2, 4],          # Minimum number of samples required at each leaf node
    'bootstrap': [True, False]              # Method of selecting samples for training each tree
}

# Define the Grid Search with cross-validation
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)

# Fit the Grid Search to the data
grid_search.fit(X_train, y_train)

# Get the best parameters and the best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

best_params, best_score


Fitting 5 folds for each of 648 candidates, totalling 3240 fits
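
The candidate count reported above is the product of the number of values listed for each hyperparameter, and the number of fits is that product times the number of folds. A small sketch of the arithmetic follows; `grid_sizes` is a hypothetical helper dictionary that records only how many values each entry of `param_grid` has.

```python
from math import prod

# Number of values listed per hyperparameter in the grid above
grid_sizes = {
    "n_estimators": 3,
    "max_features": 3,
    "max_depth": 4,
    "min_samples_split": 3,
    "min_samples_leaf": 3,
    "bootstrap": 2,
}
cv_folds = 5

n_candidates = prod(grid_sizes.values())  # 3 * 3 * 4 * 3 * 3 * 2
n_fits = n_candidates * cv_folds          # one fit per candidate per fold

print(n_candidates, n_fits)  # 648 3240
```

By the same arithmetic, the reduced grid in the next cell has 2 * 2 * 3 * 2 * 2 * 2 = 96 candidates, or 480 fits with 5 folds.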


- Run Grid Search

In [None]:
# Define the model
rf = RandomForestClassifier(random_state=42)  # fixed seed so the search is reproducible

# Define a reduced hyperparameters grid
param_grid = {
    'n_estimators': [100, 200],                # Number of trees in the forest
    'max_features': ['sqrt', 'log2'],          # Features considered at every split ('auto' was removed in scikit-learn 1.3)
    'max_depth': [10, 20, None],               # Maximum number of levels in tree
    'min_samples_split': [2, 5],               # Minimum number of samples required to split a node
    'min_samples_leaf': [1, 2],                # Minimum number of samples required at each leaf node
    'bootstrap': [True, False]                 # Method of selecting samples for training each tree
}

# Define the Grid Search with cross-validation
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)

# Fit the Grid Search to the data
grid_search.fit(X_train, y_train)

# Get the best parameters and the best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print(best_params, best_score)

- Evaluate your model
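
One possible way to approach this step is sketched below: take `best_estimator_` from the fitted search and score it on the held-out test set. The sketch is self-contained and runs on synthetic data with a deliberately tiny hypothetical grid (`small_grid`), so its output says nothing about the Spaceship Titanic results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Tiny hypothetical grid: 2 * 2 = 4 candidates
small_grid = {"n_estimators": [20, 40], "max_depth": [5, None]}
demo_search = GridSearchCV(RandomForestClassifier(random_state=42), small_grid, cv=3)
demo_search.fit(X_tr, y_tr)

# best_estimator_ is refitted on all of X_tr with the best parameters (refit=True by default)
tuned_model = demo_search.best_estimator_
tuned_accuracy = accuracy_score(y_te, tuned_model.predict(X_te))

print(demo_search.best_params_)
print(f"cv score: {demo_search.best_score_:.3f}, test accuracy: {tuned_accuracy:.3f}")
```

In the lab, the equivalent calls would use `grid_search.best_estimator_` with `X_test` and `y_test`, and the resulting accuracy can be compared against the baseline of 0.8063540090771558 obtained with default hyperparameters.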