# LAB | Hyperparameter Tuning

**Load the data**

This is the final step in maximizing the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before, and then compare the best-performing models you have so far, this time fine-tuning their hyperparameters.
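
To see which default values have been in use so far, an estimator's current hyperparameters can be listed with `get_params()`. The snippet below is a minimal sketch that uses `RandomForestClassifier` purely as an illustration; the variable names `default_model` and `default_params` are placeholders, and the model you tune may be a different one.

```python
from sklearn.ensemble import RandomForestClassifier

# Instantiate with defaults only (random_state is fixed for reproducibility)
default_model = RandomForestClassifier(random_state=42)

# get_params() returns a dict mapping each hyperparameter to its current value
default_params = default_model.get_params()

# Print the three hyperparameters that are tuned later in this lab
for name in ("n_estimators", "max_depth", "min_samples_split"):
    print(f"{name}: {default_params[name]}")
```

Any key in this dictionary can be passed to the constructor or placed in a search grid.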

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same preprocessing as before:
- Feature Scaling
- Feature Selection


In [3]:
#your code here
from sklearn.preprocessing import StandardScaler

# Selecting the numerical columns to scale (you'll typically avoid columns like 'PassengerId')
numerical_features = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

# Initialize the scaler
scaler = StandardScaler()

# Fit and transform the data
spaceship[numerical_features] = scaler.fit_transform(spaceship[numerical_features])

# Check the scaled data
print(spaceship[numerical_features].head())

        Age  RoomService  FoodCourt  ShoppingMall       Spa    VRDeck
0  0.702095    -0.337025  -0.284274     -0.287317 -0.273736 -0.266098
1 -0.333233    -0.173528  -0.278689     -0.245971  0.209267 -0.227692
2  2.013510    -0.272527   1.934922     -0.287317  5.634034 -0.223327
3  0.287964    -0.337025   0.511931      0.326250  2.655075 -0.097634
4 -0.885407     0.117466  -0.240833     -0.037590  0.223344 -0.264352
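
As a rough check of what `StandardScaler` does, the sketch below compares it with a manual z-score, `(x - mean) / std`, on a toy column. The numbers are made up for illustration and are not taken from the dataset. It also shows that missing values are ignored when the statistics are computed and stay missing after the transform.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy column with one missing value (illustrative numbers only)
toy = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 50.0]})

# StandardScaler skips NaNs when fitting and leaves them as NaN in the output
scaled = StandardScaler().fit_transform(toy)

# Manual z-score using the population standard deviation (ddof=0)
mean = toy["Age"].mean()
std = toy["Age"].std(ddof=0)
manual = ((toy["Age"] - mean) / std).to_numpy()

print(np.allclose(scaled[:, 0], manual, equal_nan=True))
```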


In [4]:
# The numerical features were already standardized in the previous cell,
# so refitting the scaler here would be redundant. Inspect the full frame instead.
print(spaceship.head())


  PassengerId HomePlanet CryoSleep  Cabin  Destination       Age    VIP  \
0     0001_01     Europa     False  B/0/P  TRAPPIST-1e  0.702095  False   
1     0002_01      Earth     False  F/0/S  TRAPPIST-1e -0.333233  False   
2     0003_01     Europa     False  A/0/S  TRAPPIST-1e  2.013510   True   
3     0003_02     Europa     False  A/0/S  TRAPPIST-1e  0.287964  False   
4     0004_01      Earth     False  F/1/S  TRAPPIST-1e -0.885407  False   

   RoomService  FoodCourt  ShoppingMall       Spa    VRDeck  \
0    -0.337025  -0.284274     -0.287317 -0.273736 -0.266098   
1    -0.173528  -0.278689     -0.245971  0.209267 -0.227692   
2    -0.272527   1.934922     -0.287317  5.634034 -0.223327   
3    -0.337025   0.511931      0.326250  2.655075 -0.097634   
4     0.117466  -0.240833     -0.037590  0.223344 -0.264352   

                Name  Transported  
0    Maham Ofracculy        False  
1       Juanna Vines         True  
2      Altark Susent        False  
3       Solam Susent        False  
4  Willy Santantines         True  

In [5]:
# Define the target variable and features
X = spaceship.drop(['Transported','HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'Name', 'Cabin', 'PassengerId'], axis=1)
y = spaceship['Transported']

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)


(6954, 6) (1739, 6)
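
In this notebook the scaler is fitted on the whole dataset before the split, so statistics from the test rows influence the training features. A common alternative, sketched below on synthetic data, is to split first and fit the scaler on the training rows only. All names here (`X_demo`, `scaler_demo`, and so on) are hypothetical and exist only for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 100 rows, 3 numeric features, binary target
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50, scale=10, size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# Split first, so the test rows play no part in fitting the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # statistics learned from training rows only
X_te_scaled = scaler_demo.transform(X_te)      # test rows reuse the training statistics

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this ordering the test set stays unseen until evaluation, which tends to give a more honest estimate of performance.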


In [7]:
# Check for missing values in X_train and X_test
print(X_train.isnull().sum())
print(X_test.isnull().sum())

# Encode the boolean target as integers (0 and 1 instead of False/True)
y_train = y_train.astype(int)
y_test = y_test.astype(int)

# Fill missing values with 0. Because the features were standardized above,
# 0 corresponds to each column's mean over the full dataset. Reassigning
# (instead of inplace=True) avoids a possible SettingWithCopyWarning in pandas.
X_train = X_train.fillna(0)
X_test = X_test.fillna(0)

Age             148
RoomService     126
FoodCourt       140
ShoppingMall    165
Spa             134
VRDeck          151
dtype: int64
Age             31
RoomService     55
FoodCourt       43
ShoppingMall    43
Spa             49
VRDeck          37
dtype: int64
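
Filling with a constant is one option. Another, sketched below under the assumption that a median fill suits skewed spending columns, is `SimpleImputer`, which learns the fill value from the training rows and reuses it on the test rows. The toy frames `train_demo` and `test_demo` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames with missing values (illustrative numbers only)
train_demo = pd.DataFrame({"Spa": [0.0, 10.0, np.nan, 30.0]})
test_demo = pd.DataFrame({"Spa": [np.nan, 5.0]})

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train_demo)  # median learned from training rows
test_filled = imputer.transform(test_demo)        # same median applied to test rows

# statistics_ holds the learned fill value for each column
print(imputer.statistics_)
```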


- Now let's use the best model we got so far to see how much it improves when we fine-tune its hyperparameters.

In [9]:
#your code here
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Train the model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

- Evaluate your model

In [10]:
#your code here
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.77
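
Accuracy is a single number and hides which kind of error the model makes. The sketch below, on made-up labels rather than the lab's predictions, pairs `accuracy_score` with `confusion_matrix` to separate false positives from false negatives.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predictions (not the lab's results)
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

acc_demo = accuracy_score(y_true_demo, y_pred_demo)

# Rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

print(f"Accuracy: {acc_demo:.2f}")
print(cm_demo)
```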


**Grid/Random Search**

For this lab we will use Grid Search.
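
For comparison, a random search samples a fixed number of combinations instead of trying all of them. The following is a minimal sketch with `RandomizedSearchCV` on synthetic data from `make_classification`. The candidate values, `n_iter=5` and `cv=3` are arbitrary choices made to keep the example fast, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data with 6 features
X_syn, y_syn = make_classification(n_samples=200, n_features=6, random_state=42)

# Candidate values to sample from (27 possible combinations)
param_distributions = {
    "n_estimators": [10, 20, 30],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
}

# Only n_iter combinations are drawn and evaluated
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=5,
    cv=3,
    random_state=42,
)
random_search.fit(X_syn, y_syn)

print(random_search.best_params_)
```

Random search trades exhaustiveness for speed, which can matter once the grid grows beyond a few dozen combinations.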

- Define the hyperparameters to fine-tune.

In [13]:
#your code here
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}
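
Before running a search it helps to know how many models will be trained. The grid above has 3 × 4 × 3 = 36 combinations, and `ParameterGrid` can confirm the count. The sketch below repeats the grid so it runs on its own; `cv_folds = 5` is an assumption that matches the `cv=5` used in this lab's Grid Search cell.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as defined in the cell above, repeated to keep this snippet self-contained
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
}

# ParameterGrid enumerates every combination of the listed values
n_candidates = len(ParameterGrid(param_grid))

cv_folds = 5  # assumed number of cross-validation folds
total_fits = n_candidates * cv_folds

print(n_candidates, total_fits)  # 36 180
```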


- Run Grid Search

In [14]:
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, verbose=2, n_jobs=-1)

# Fit the grid search
grid_search.fit(X_train, y_train)

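# Report the best hyperparameter combination found
print(f"Best Parameters: {grid_search.best_params_}")

# best_estimator_ is the best model, refit on the full training set (refit=True by default)
best_model = grid_search.best_estimator_

# Predict on the held-out test set with the tuned model
y_pred_best = best_model.predict(X_test)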


Fitting 5 folds for each of 36 candidates, totalling 180 fits
Best Parameters: {'max_depth': 10, 'min_samples_split': 10, 'n_estimators': 100}
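
Beyond `best_params_`, a fitted search stores the cross-validated score of every candidate in `cv_results_`. The sketch below shows one way to rank them, using a small hypothetical grid (`small_grid`) and synthetic data so that it runs quickly. It is not the search from this lab.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data and a deliberately small grid (4 candidates)
X_syn, y_syn = make_classification(n_samples=200, n_features=6, random_state=42)
small_grid = {"n_estimators": [10, 20], "max_depth": [None, 5]}

search_demo = GridSearchCV(RandomForestClassifier(random_state=42), small_grid, cv=3)
search_demo.fit(X_syn, y_syn)

# cv_results_ is a dict of arrays; a DataFrame makes it easier to sort and read
results = pd.DataFrame(search_demo.cv_results_)
ranking = results[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(ranking)
print(f"Best CV score: {search_demo.best_score_:.3f}")
```

Comparing `mean_test_score` against `std_test_score` gives a sense of whether the top candidates differ by more than fold-to-fold noise.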


- Evaluate your model

In [15]:
# Evaluate the accuracy
best_accuracy = accuracy_score(y_test, y_pred_best)
print(f"Best Accuracy: {best_accuracy:.2f}")

Best Accuracy: 0.78
