# LAB | Hyperparameter Tuning

**Load the data**

This is the final step to maximize the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before, and then compare the best-performing models you have so far, this time fine-tuning their hyperparameters.
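To see what "default values" means in practice, the sketch below prints a few of the defaults scikit-learn assigns to `RandomForestClassifier`. The selection of parameters to print is ours, chosen because they are the ones tuned later in this lab; the exact defaults can differ between scikit-learn versions.

```python
from sklearn.ensemble import RandomForestClassifier

# A model created without arguments uses scikit-learn's default hyperparameters
default_model = RandomForestClassifier()
defaults = default_model.get_params()

# Print only the hyperparameters that are tuned later in this lab
tuned_names = ["n_estimators", "max_depth", "min_samples_split",
               "min_samples_leaf", "bootstrap"]
for name in tuned_names:
    print(f"{name}: {defaults[name]}")
```

Any of these can be overridden when the model is created, which is what the search classes below do for every candidate they try.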

In [21]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import RandomizedSearchCV

In [23]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


Now perform the same preprocessing steps as before:
- Feature Scaling
- Feature Selection


In [26]:
# Import the necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# Load the data
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")

# Display the first few rows of the dataset
spaceship.head()

# Define the numerical features that need to be scaled
numerical_features = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

# Initialize the StandardScaler
scaler = StandardScaler()

# Apply scaling to the numerical features
spaceship[numerical_features] = scaler.fit_transform(spaceship[numerical_features])

# Print the scaled data (optional)
print("Scaled numerical features:")
print(spaceship[numerical_features].head())


Scaled numerical features:
        Age  RoomService  FoodCourt  ShoppingMall       Spa    VRDeck
0  0.702095    -0.337025  -0.284274     -0.287317 -0.273736 -0.266098
1 -0.333233    -0.173528  -0.278689     -0.245971  0.209267 -0.227692
2  2.013510    -0.272527   1.934922     -0.287317  5.634034 -0.223327
3  0.287964    -0.337025   0.511931      0.326250  2.655075 -0.097634
4 -0.885407     0.117466  -0.240833     -0.037590  0.223344 -0.264352
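The cell above fits the scaler on the full dataset. A common alternative, sketched here on made-up data, is to fit `StandardScaler` on the training split only and reuse it on the test split, so that test-set statistics do not influence preprocessing. The `toy` DataFrame and its values are hypothetical; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for two numerical columns (not the real data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Age": rng.uniform(0, 80, size=200),
    "Spa": rng.exponential(300, size=200),
})

# Split first, then learn the mean and standard deviation from the training rows only
train, test = train_test_split(toy, test_size=0.2, random_state=42)
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)

# The test rows are transformed with the training statistics, never refitted
test_scaled = scaler.transform(test)

print("Train means after scaling:", train_scaled.mean(axis=0).round(3))
print("Test means after scaling: ", test_scaled.mean(axis=0).round(3))
```

The training columns end up with mean 0 by construction, while the test columns are only close to 0, which is the expected behaviour. Tree-based models such as Random Forest are largely insensitive to feature scaling, so this matters more for distance-based or linear models.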


- Now let's use the best model we have so far to see how much it improves when we fine-tune its hyperparameters.

In [33]:
# Import the necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Load the data
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")

# Remove irrelevant features (PassengerId and Name might not be useful for prediction)
spaceship = spaceship.drop(columns=['PassengerId', 'Name'])

# Handle categorical variables with One-Hot-Encoding
spaceship_encoded = pd.get_dummies(spaceship, columns=['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP'], drop_first=True)

# Define the features (X) and the target variable (y)
X = spaceship_encoded.drop('Transported', axis=1)  # Assuming 'Transported' is your target variable
y = spaceship_encoded['Transported']  # Target variable

# Convert the target variable to binary (True/False to 1/0)
y = y.astype(int)

# Train/Test Split (you can adjust test size if needed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the model (RandomForest in this case)
model = RandomForestClassifier()

# Define the parameter grid you want to search over
param_grid = {
    'n_estimators': [100, 200, 300],        # Number of trees in the forest
    'max_depth': [10, 20, 30],              # Maximum depth of the tree
    'min_samples_split': [2, 5, 10]         # Minimum number of samples required to split a node
}

# Initialize GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, verbose=2, n_jobs=-1)

# Fit the model to the training data
grid_search.fit(X_train, y_train)

# Best hyperparameters and best model performance
print("Best Hyperparameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_)

# Evaluate the model on the test set
y_pred = grid_search.best_estimator_.predict(X_test)

# Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))


Fitting 5 folds for each of 27 candidates, totalling 135 fits
Best Hyperparameters:  {'max_depth': 30, 'min_samples_split': 2, 'n_estimators': 300}
Best Score:  0.7434595472435854
Classification Report:
              precision    recall  f1-score   support

           0       0.73      0.80      0.76       861
           1       0.78      0.70      0.74       878

    accuracy                           0.75      1739
   macro avg       0.75      0.75      0.75      1739
weighted avg       0.76      0.75      0.75      1739



In [35]:
from sklearn.metrics import classification_report

# Use the best estimator from GridSearch to predict on the test set
y_pred = grid_search.best_estimator_.predict(X_test)

# Evaluate the model's performance
print("Classification Report:")
print(classification_report(y_test, y_pred))


Classification Report:
              precision    recall  f1-score   support

           0       0.73      0.80      0.76       861
           1       0.78      0.70      0.74       878

    accuracy                           0.75      1739
   macro avg       0.75      0.75      0.75      1739
weighted avg       0.76      0.75      0.75      1739



**Grid/Random Search**

For this lab we will use Grid Search, and then compare it with Randomized Search.
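The log line "Fitting 5 folds for each of 27 candidates, totalling 135 fits" follows directly from the size of the search space. The sketch below reproduces that arithmetic for the two search spaces used in this lab, using only the standard library; the helper `count_candidates` is a hypothetical name introduced for illustration.

```python
from itertools import product

# The grid used with GridSearchCV in this lab
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [10, 20, 30],
    "min_samples_split": [2, 5, 10],
}

# The wider space used with RandomizedSearchCV in this lab
param_dist = {
    "n_estimators": [100, 200, 300, 400, 500],
    "max_depth": [10, 20, 30, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

def count_candidates(space):
    """Count every combination of values in a dict of lists."""
    return sum(1 for _ in product(*space.values()))

cv_folds = 5
n_iter = 50  # number of combinations RandomizedSearchCV samples

grid_candidates = count_candidates(param_grid)
dist_candidates = count_candidates(param_dist)

print("Grid search:  ", grid_candidates, "candidates,", grid_candidates * cv_folds, "fits")
print("Full space:   ", dist_candidates, "candidates,", dist_candidates * cv_folds, "fits")
print("Random search:", n_iter, "candidates,", n_iter * cv_folds, "fits")
```

Grid search tries every combination, so its cost grows multiplicatively with each added hyperparameter. Randomized search caps the cost at `n_iter` candidates regardless of how large the space is.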

- Define hyperparameters to fine tune.

In [37]:
# Define the parameter grid you want to search over
param_grid = {
    'n_estimators': [100, 200, 300],        # Number of trees in the forest
    'max_depth': [10, 20, 30],              # Maximum depth of the tree
    'min_samples_split': [2, 5, 10]         # Minimum number of samples required to split a node
}

# Initialize GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, verbose=2, n_jobs=-1)

# Fit the model to the training data
grid_search.fit(X_train, y_train)

# Best hyperparameters and best model performance
print("Best Hyperparameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_)

# Evaluate the model on the test set
y_pred = grid_search.best_estimator_.predict(X_test)

# Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))


Fitting 5 folds for each of 27 candidates, totalling 135 fits
Best Hyperparameters:  {'max_depth': 30, 'min_samples_split': 2, 'n_estimators': 200}
Best Score:  0.7446099023010204
Classification Report:
              precision    recall  f1-score   support

           0       0.72      0.81      0.76       861
           1       0.79      0.69      0.74       878

    accuracy                           0.75      1739
   macro avg       0.75      0.75      0.75      1739
weighted avg       0.75      0.75      0.75      1739
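This run picked `n_estimators=200`, while the same grid in the earlier cell picked 300, with nearly identical scores. One plausible reason is that `RandomForestClassifier()` is created without a `random_state`, so every fit builds slightly different trees. The sketch below, on synthetic data rather than the Spaceship Titanic set, shows that fixing the seed makes two fits agree; the helper `fit_proba` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data, used only for illustration
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

def fit_proba(seed):
    """Fit a small forest with the given seed and return its predicted probabilities."""
    forest = RandomForestClassifier(n_estimators=25, random_state=seed)
    return forest.fit(X_toy, y_toy).predict_proba(X_toy)

# Two fits with the same seed produce identical forests
same_seed_equal = np.array_equal(fit_proba(42), fit_proba(42))
print("Same seed, identical predictions:", same_seed_equal)
```

Passing `RandomForestClassifier(random_state=42)` as the estimator would make repeated searches comparable. When the top candidates differ by about 0.001 in score, as here, the "best" setting is mostly decided by that randomness.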



In [39]:
# Define the parameter distributions to sample from
param_dist = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

# Initialize RandomizedSearchCV with 5-fold cross-validation
random_search = RandomizedSearchCV(estimator=model, param_distributions=param_dist, n_iter=50, cv=5, verbose=2, random_state=42, n_jobs=-1)

# Fit the model to the training data
random_search.fit(X_train, y_train)

# Best hyperparameters and best model performance
print("Best Hyperparameters: ", random_search.best_params_)
print("Best Score: ", random_search.best_score_)

# Evaluate the model on the test set
y_pred = random_search.best_estimator_.predict(X_test)

# Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))


Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best Hyperparameters:  {'n_estimators': 400, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_depth': None, 'bootstrap': True}
Best Score:  0.7993968419800465
Classification Report:
              precision    recall  f1-score   support

           0       0.79      0.78      0.78       861
           1       0.79      0.80      0.79       878

    accuracy                           0.79      1739
   macro avg       0.79      0.79      0.79      1739
weighted avg       0.79      0.79      0.79      1739
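`RandomizedSearchCV` can also sample from distributions instead of fixed lists, which lets it try values that a list would not include. A minimal sketch, assuming SciPy is installed, using synthetic data and deliberately small ranges so that it runs quickly; the ranges are illustrative, not recommendations.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data, used only for illustration
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# randint(low, high) draws integers from low up to high - 1
toy_dist = {
    "n_estimators": randint(10, 50),
    "min_samples_split": randint(2, 11),
}

toy_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=toy_dist,
    n_iter=5,          # number of sampled candidates
    cv=3,
    random_state=42,   # makes the sampled candidates reproducible
)
toy_search.fit(X_toy, y_toy)

print("Sampled candidates:", toy_search.cv_results_["params"])
print("Best Hyperparameters:", toy_search.best_params_)
```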



- Run Grid Search

In [41]:
# Import the necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Load the data
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")

# Remove irrelevant features (PassengerId and Name might not be useful for prediction)
spaceship = spaceship.drop(columns=['PassengerId', 'Name'])

# Handle categorical variables with One-Hot-Encoding
spaceship_encoded = pd.get_dummies(spaceship, columns=['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP'], drop_first=True)

# Define the features (X) and the target variable (y)
X = spaceship_encoded.drop('Transported', axis=1)  # Assuming 'Transported' is your target variable
y = spaceship_encoded['Transported']  # Target variable

# Convert the target variable to binary (True/False to 1/0)
y = y.astype(int)

# Train/Test Split (you can adjust test size if needed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the model (RandomForest in this case)
model = RandomForestClassifier()

# Define the parameter grid you want to search over
param_grid = {
    'n_estimators': [100, 200, 300],        # Number of trees in the forest
    'max_depth': [10, 20, 30],              # Maximum depth of the tree
    'min_samples_split': [2, 5, 10]         # Minimum number of samples required to split a node
}

# Initialize GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, verbose=2, n_jobs=-1)

# Fit the model to the training data
grid_search.fit(X_train, y_train)

# Best hyperparameters and best model performance
best_params = grid_search.best_params_
best_score = grid_search.best_score_

# Evaluate the model on the test set
y_pred = grid_search.best_estimator_.predict(X_test)

# Print classification report
print("Best Hyperparameters: ", best_params)
print("Best Score: ", best_score)
print("Classification Report:")
print(classification_report(y_test, y_pred))

Fitting 5 folds for each of 27 candidates, totalling 135 fits
Best Hyperparameters:  {'max_depth': 30, 'min_samples_split': 5, 'n_estimators': 100}
Best Score:  0.7447537871931068
Classification Report:
              precision    recall  f1-score   support

           0       0.72      0.81      0.76       861
           1       0.79      0.69      0.73       878

    accuracy                           0.75      1739
   macro avg       0.75      0.75      0.75      1739
weighted avg       0.75      0.75      0.75      1739



- Evaluate your model

In [44]:
from sklearn.metrics import classification_report

# Use the best estimator from GridSearch to predict on the test set
y_pred = grid_search.best_estimator_.predict(X_test)

# Evaluate the model's performance
print("Classification Report:")
print(classification_report(y_test, y_pred))


Classification Report:
              precision    recall  f1-score   support

           0       0.72      0.81      0.76       861
           1       0.79      0.69      0.73       878

    accuracy                           0.75      1739
   macro avg       0.75      0.75      0.75      1739
weighted avg       0.75      0.75      0.75      1739
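Beyond the classification report, a fitted search object keeps the score of every candidate in `cv_results_`, which helps judge whether the winner is clearly better or just marginally ahead. The sketch below shows one way to inspect it alongside a confusion matrix. It uses synthetic data and a tiny grid so it runs in seconds; the variable names and grid values are our own choices.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data, used only for illustration
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

toy_grid = {"max_depth": [2, 5], "min_samples_split": [2, 10]}
toy_search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid=toy_grid,
    cv=3,
)
toy_search.fit(X_tr, y_tr)

# One row per candidate, best-ranked first
results = (
    pd.DataFrame(toy_search.cv_results_)
    [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)
print(results)

# Compare the cross-validated score with performance on the held-out test set
y_pred_toy = toy_search.predict(X_te)
cm = confusion_matrix(y_te, y_pred_toy)
print("Confusion matrix (rows = true class, columns = predicted class):")
print(cm)
print("Best CV score:", toy_search.best_score_)
print("Test accuracy:", accuracy_score(y_te, y_pred_toy))
```

If `std_test_score` is larger than the gap between the top candidates, the ranking among them should not be over-interpreted.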

