# LAB | Hyperparameter Tuning

**Load the data**

This is the final step: maximizing the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before, and then compare the best-performing models you have so far, this time fine-tuning their hyperparameters.

In [16]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [17]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [18]:
# Drop columns that won't be used
spaceship = spaceship.drop(columns=['PassengerId', 'Name', 'Cabin'])


In [19]:
# Handle missing values
imputer = SimpleImputer(strategy='mean')
spaceship[['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']] = imputer.fit_transform(spaceship[['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']])

In [20]:
# Convert categorical variables to dummy/indicator variables
spaceship = pd.get_dummies(spaceship, drop_first=True)
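
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The values are hypothetical and only mimic two of the dataset's columns; the point is which indicator columns survive and what happens to a missing category.

```python
import pandas as pd

# Toy frame (hypothetical values) mimicking two Spaceship Titanic columns.
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Europa", "Mars", None],
    "Age": [24.0, 58.0, 33.0, 16.0],
})

# Only object/category columns are encoded; numeric columns pass through.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())

# The first category in sorted order ("Earth") is dropped, so a row of all
# zeros means "Earth" OR "missing": the two cases become indistinguishable.
print(encoded)
```

If that ambiguity matters, one option is to impute the categorical columns (for example with the most frequent value) before encoding.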

Now perform the same as before:
- Feature Scaling
- Feature Selection


In [21]:
# Scaling the features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(spaceship.drop(columns='Transported'))

In [22]:

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(scaled_features, spaceship['Transported'], test_size=0.2, random_state=42)
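
Note that the cells above fit the scaler on all rows before splitting, so statistics from the test rows leak into the training features. Random forests are largely insensitive to feature scaling, so the effect here is probably small, but a leakage-free variant is easy to sketch. The example below uses synthetic data from `make_classification` as a stand-in for the encoded features (an assumption, not the lab's data) and wraps the scaler and the model in a `Pipeline`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded Spaceship Titanic features (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Split first, so the test rows never influence the scaler's mean and variance.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# fit() learns the scaling statistics from the training rows only.
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.4f}")
```

When a pipeline like this is passed to `GridSearchCV`, the grid keys need the step prefix, for example `rf__max_depth` instead of `max_depth`, and the scaler is then refit inside each cross-validation fold.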


In [25]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Initialize the model
rf = RandomForestClassifier(random_state=42)

# Define hyperparameters to fine-tune
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}
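
Before launching the search it helps to know how much work the grid implies. The short sketch below multiplies the number of values per hyperparameter to get the number of candidates, then multiplies by the fold count used in this lab (`cv=5`). It reproduces the "216 candidates, totalling 1080 fits" line that the grid search prints.

```python
from math import prod

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

# Grid search tries every combination: 3 * 4 * 3 * 3 * 2 candidates.
n_candidates = prod(len(values) for values in param_grid.values())

# Each candidate is trained once per cross-validation fold.
cv_folds = 5
total_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates, {total_fits} fits")
```

Adding one more value to any list multiplies the cost, which is why grids are usually kept coarse.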

In [27]:
# Initialize Grid Search
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)


- Now let's take the best model we have so far and see how much it improves when we fine-tune its hyperparameters.

In [28]:
# Fit Grid Search
grid_search.fit(X_train, y_train)
# Best parameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print(f"Best Parameters: {best_params}")
print(f"Best Cross-validation Score: {best_score}")

Fitting 5 folds for each of 216 candidates, totalling 1080 fits
Best Parameters: {'bootstrap': True, 'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 200}
Best Cross-validation Score: 0.8014114373490424
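
`best_params_` and `best_score_` only describe the winner. To see how close the runners-up were, the fitted search also exposes `cv_results_`, which loads cleanly into a DataFrame. The sketch below uses synthetic data and a deliberately small grid (both assumptions, chosen so it runs in seconds); the same pattern should apply to the `grid_search` object in this lab.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data and a tiny grid, purely for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
small_grid = {"n_estimators": [10, 20], "max_depth": [None, 5]}

search = GridSearchCV(RandomForestClassifier(random_state=42), small_grid, cv=3)
search.fit(X, y)

# One row per candidate, with the mean and spread of its fold scores.
results = pd.DataFrame(search.cv_results_)
top = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(top.head())
```

If several candidates sit within one `std_test_score` of the best, the simpler or cheaper of them is often a reasonable choice.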


- Evaluate your model

In [29]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Predict using the best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Set Accuracy: {accuracy}")

print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))


Test Set Accuracy: 0.7860839562967222
Classification Report:
              precision    recall  f1-score   support

       False       0.81      0.74      0.77       861
        True       0.76      0.84      0.80       878

    accuracy                           0.79      1739
   macro avg       0.79      0.79      0.79      1739
weighted avg       0.79      0.79      0.79      1739

Confusion Matrix:
[[633 228]
 [144 734]]
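
The numbers in the classification report can be recovered from the confusion matrix by hand, which is a useful check on how to read it. In scikit-learn's layout, rows are the actual classes and columns are the predicted classes, in label order `[False, True]`. A small sketch using the matrix printed above:

```python
# Confusion matrix from the output above.
# Row 0 = actual False, row 1 = actual True; columns = predicted False, True.
tn, fp = 633, 228
fn, tp = 144, 734

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

# Metrics for the positive class (Transported == True).
precision_true = tp / (tp + fp)   # of predicted True, how many were right
recall_true = tp / (tp + fn)      # of actual True, how many were found
f1_true = 2 * precision_true * recall_true / (precision_true + recall_true)

print(f"accuracy={accuracy:.4f}")
print(f"precision={precision_true:.2f} recall={recall_true:.2f} f1={f1_true:.2f}")
```

The results agree with the report: accuracy 0.7861, and precision 0.76, recall 0.84, f1 0.80 for the `True` class.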


**Grid/Random Search**

For this lab we will use Grid Search.
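
For comparison, here is a minimal sketch of the other option in the heading, random search. Instead of trying every combination, `RandomizedSearchCV` samples a fixed number of candidates (`n_iter`) from lists or distributions. The data is synthetic and the ranges are illustrative assumptions, not tuned recommendations.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption), small enough to run in seconds.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Lists are sampled uniformly; randint(a, b) draws integers in [a, b).
param_distributions = {
    "n_estimators": randint(20, 60),
    "max_depth": [None, 5, 10],
    "min_samples_split": randint(2, 11),
    "min_samples_leaf": randint(1, 5),
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,          # number of sampled candidates, fixed in advance
    cv=3,
    random_state=42,   # makes the sampled candidates reproducible
)
random_search.fit(X, y)
print(random_search.best_params_)
```

The cost is `n_iter` times the number of folds regardless of how many hyperparameters are searched, which makes random search attractive when the full grid would be too large.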

- Define the hyperparameters to fine-tune.

In [30]:
# Define hyperparameters to fine-tune
param_grid = {
    'n_estimators': [50, 100, 200],       # Number of trees
    'max_depth': [None, 10, 20, 30],      # Maximum depth of the tree
    'min_samples_split': [2, 5, 10],      # Minimum number of samples required to split a node
    'min_samples_leaf': [1, 2, 4],        # Minimum number of samples required to be at a leaf node
    'bootstrap': [True, False]            # Whether bootstrap samples are used when building trees
}


- Run Grid Search

In [31]:
#  Initialize Grid Search
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)

# Fit Grid Search
grid_search.fit(X_train, y_train)

# Best parameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print(f"Best Parameters: {best_params}")
print(f"Best Cross-validation Score: {best_score}")

Fitting 5 folds for each of 216 candidates, totalling 1080 fits
Best Parameters: {'bootstrap': True, 'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 200}
Best Cross-validation Score: 0.8014114373490424


- Evaluate your model

In [32]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Predict using the best model found by Grid Search
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Set Accuracy: {accuracy:.4f}")

print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))


Test Set Accuracy: 0.7861
Classification Report:
              precision    recall  f1-score   support

       False       0.81      0.74      0.77       861
        True       0.76      0.84      0.80       878

    accuracy                           0.79      1739
   macro avg       0.79      0.79      0.79      1739
weighted avg       0.79      0.79      0.79      1739

Confusion Matrix:
[[633 228]
 [144 734]]


- Now let's take the best model we have so far and see how much it improves when we fine-tune its hyperparameters.

In [33]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the best model used so far
best_model_so_far = RandomForestClassifier(random_state=42)

# Fit the model on the training data (before tuning)
best_model_so_far.fit(X_train, y_train)




RandomForestClassifier(random_state=42)

- Evaluate your model

In [34]:
# Predict and evaluate the model before tuning
y_pred_before_tuning = best_model_so_far.predict(X_test)

# Evaluate the model before tuning
accuracy_before = accuracy_score(y_test, y_pred_before_tuning)
print(f"Accuracy before tuning: {accuracy_before:.4f}")

Accuracy before tuning: 0.7688


**Grid/Random Search**

For this lab we will use Grid Search.

- Define the hyperparameters to fine-tune.

In [35]:
from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid for tuning
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}




- Run Grid Search

In [36]:
# Initialize Grid Search
grid_search = GridSearchCV(estimator=best_model_so_far, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)

# Fit Grid Search to find the best hyperparameters
grid_search.fit(X_train, y_train)

# Extract the best model from Grid Search
best_model_after_tuning = grid_search.best_estimator_

Fitting 5 folds for each of 216 candidates, totalling 1080 fits


- Evaluate your model

In [37]:
# Predict and evaluate the model after tuning
y_pred_after_tuning = best_model_after_tuning.predict(X_test)

# Evaluate the model after tuning
accuracy_after = accuracy_score(y_test, y_pred_after_tuning)
print(f"Accuracy after tuning: {accuracy_after:.4f}")

# Compare the improvement
improvement = accuracy_after - accuracy_before
print(f"Improvement in accuracy: {improvement:.4f}")

print("Classification Report After Tuning:")
print(classification_report(y_test, y_pred_after_tuning))

print("Confusion Matrix After Tuning:")
print(confusion_matrix(y_test, y_pred_after_tuning))


Accuracy after tuning: 0.7861
Improvement in accuracy: 0.0173
Classification Report After Tuning:
              precision    recall  f1-score   support

       False       0.81      0.74      0.77       861
        True       0.76      0.84      0.80       878

    accuracy                           0.79      1739
   macro avg       0.79      0.79      0.79      1739
weighted avg       0.79      0.79      0.79      1739

Confusion Matrix After Tuning:
[[633 228]
 [144 734]]
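
To put the 0.0173 improvement in context, a rough sanity check is to compare it with the sampling noise of an accuracy measured on 1739 test rows. The sketch below uses a simple binomial approximation. It treats the test predictions as independent and ignores that both models were scored on the same rows, so it is a back-of-the-envelope estimate rather than a formal test.

```python
from math import sqrt

# Values taken from the outputs above.
n_test = 1739
acc_before = 0.7688
acc_after = 0.7861

improvement = acc_after - acc_before

# Binomial standard error of a proportion: sqrt(p * (1 - p) / n).
se_after = sqrt(acc_after * (1 - acc_after) / n_test)

print(f"Improvement: {improvement:.4f}")
print(f"Approx. standard error of tuned accuracy: {se_after:.4f}")
```

The gain is a little under two standard errors, so it is suggestive but not conclusive on a single split. Comparing cross-validated scores for the default and tuned models would give a steadier picture.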
