# LAB | Hyperparameter Tuning

**Load the data**

This is the final step in maximizing the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before, and then take the best-performing models you have so far and fine-tune their hyperparameters.

In [13]:
# Libraries
import time

import numpy as np
import pandas as pd
import scipy.stats as st
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

In [15]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [16]:
# Drop rows with missing values
spaceship.dropna(inplace=True)

# Keep only the deck letter from Cabin (the part before the first "/")
spaceship['Cabin'] = spaceship['Cabin'].str.split('/').str[0]

# Drop identifier columns that carry no predictive signal
spaceship = spaceship.drop(columns=['PassengerId', 'Name'])

# One-hot encode the remaining categorical columns
spaceship = pd.get_dummies(spaceship)

# Separate the target from the features
sp_targets = spaceship['Transported']
sp_features = spaceship.drop(columns=['Transported'])
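
To make the preprocessing steps above easier to follow, here is a minimal sketch on a tiny hand-made frame. The name `toy` and its three rows are hypothetical stand-ins shaped like the Spaceship Titanic columns, not the real dataset.

```python
import pandas as pd

# Hypothetical toy rows that mimic a few Spaceship Titanic columns
toy = pd.DataFrame({
    "Cabin": ["B/0/P", "F/0/S", "A/0/S"],
    "HomePlanet": ["Europa", "Earth", "Europa"],
    "Age": [39.0, 24.0, 58.0],
    "Transported": [False, True, False],
})

# Keep only the deck letter: the text before the first "/"
toy["Cabin"] = toy["Cabin"].str.split("/").str[0]

# One-hot encode the text columns; numeric and boolean columns pass through
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
```

After encoding, each distinct deck and home planet gets its own indicator column, while `Age` and `Transported` are left unchanged.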

- Now let's use the best model we have so far and see how much it improves when we fine-tune its hyperparameters.

In [17]:
x_train, x_test, y_train, y_test = train_test_split(sp_features, sp_targets, test_size=0.2, random_state=0)

# Fit the scaler on the training split only, then apply it to both splits
normalizer = MinMaxScaler()
normalizer.fit(x_train)
X_train_norm_np = normalizer.transform(x_train)
X_test_norm_np = normalizer.transform(x_test)

# Wrap the scaled arrays back into DataFrames with the original labels
X_train_norm_df = pd.DataFrame(X_train_norm_np, columns=x_train.columns, index=x_train.index)
X_test_norm_df = pd.DataFrame(X_test_norm_np, columns=x_test.columns, index=x_test.index)
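
In the cell above the scaler is fit once on the whole training split, so if that scaled data is later cross-validated, every validation fold has already contributed to the scaling. One possible alternative, sketched here on synthetic data from `make_classification` (an assumption, not the lab's dataset), is to put the scaler and the model in a scikit-learn pipeline so the scaler is re-fit inside each fold. The names `X_demo`, `y_demo` and `pipe` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Spaceship Titanic features)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# The scaler is fit only on the training part of each fold
pipe = make_pipeline(
    MinMaxScaler(),
    DecisionTreeClassifier(max_depth=5, random_state=0),
)

scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(f"Mean CV accuracy: {scores.mean():.4f}")
```

Tree-based models are largely insensitive to monotonic feature scaling, so the difference is likely small here; the pattern matters more for distance-based or linear models.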

- Evaluate your model

In [18]:
forest = RandomForestClassifier(n_estimators=100, max_depth=20, random_state=0)
forest.fit(X_train_norm_df, y_train)

print(f"Accuracy: {forest.score(X_test_norm_df, y_test) * 100:.2f}%")

Accuracy: 79.58%


**Grid/Random Search**

For this lab we will use Grid Search.
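
For contrast with Grid Search, here is a minimal sketch of the Random Search alternative using scikit-learn's `RandomizedSearchCV`. It samples a fixed number of combinations (`n_iter`) from distributions instead of trying every combination. The synthetic data, the ranges in `param_distributions`, and the name `rs` are assumptions chosen for illustration.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Spaceship Titanic features)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Integer distributions are sampled; lists are sampled uniformly
param_distributions = {
    "max_depth": randint(2, 20),
    "min_samples_split": randint(2, 20),
    "max_features": ["sqrt", "log2"],
}

rs = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,        # number of sampled candidates
    cv=5,
    random_state=0,
)
rs.fit(X_demo, y_demo)

print(f"Best hyperparameters: {rs.best_params_}")
print(f"Best CV mean accuracy: {rs.best_score_:.4f}")
```

Random Search tends to be useful when the grid would be too large to enumerate, since its cost is set by `n_iter` rather than by the product of all value counts.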

- Define the hyperparameters to fine-tune.

In [24]:
# Baseline for comparison: the random forest evaluated earlier
forest = RandomForestClassifier(n_estimators=100, max_depth=20, random_state=0)
forest.fit(X_train_norm_df, y_train)
print(f"Random Forest Accuracy: {forest.score(X_test_norm_df, y_test) * 100:.2f}%")

# Hyperparameter grid for the DecisionTreeClassifier tuned in the next cell
parameter_grid = {
    "max_depth": [10, 50, None],
    "min_samples_split": [4, 16],
    "max_leaf_nodes": [250, 100],
    "max_features": ["sqrt", "log2"],
}

Random Forest Accuracy: 79.58%
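
The cost of a grid search grows with the product of the number of values per hyperparameter. As a quick check, this sketch counts the combinations in the grid defined above using scikit-learn's `ParameterGrid`; the variable names `grid`, `n_candidates` and `n_folds` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the parameter grid defined in the lab
grid = {
    "max_depth": [10, 50, None],
    "min_samples_split": [4, 16],
    "max_leaf_nodes": [250, 100],
    "max_features": ["sqrt", "log2"],
}

n_candidates = len(ParameterGrid(grid))  # 3 * 2 * 2 * 2
n_folds = 10
print(n_candidates, n_candidates * n_folds)  # prints: 24 240
```

With 10-fold cross-validation, each of the 24 candidates is fit 10 times, for 240 fits in total.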


- Run Grid Search

In [25]:
dt = DecisionTreeClassifier(random_state=123)

folds = 10

# With no `scoring` argument, GridSearchCV uses the estimator's own score method (accuracy for classifiers)
gs = GridSearchCV(dt, param_grid=parameter_grid, cv=folds, verbose=1)

# Time the full search
start_time = time.time()
gs.fit(X_train_norm_df, y_train)
end_time = time.time()

print(f"\nTime taken: {end_time - start_time: .4f} seconds")
print(f"Best hyperparameters: {gs.best_params_}")
print(f"Best CV Mean Accuracy: {gs.best_score_: .4f}")

Fitting 10 folds for each of 24 candidates, totalling 240 fits

Time taken:  1.3348 seconds
Best hyperparameters: {'max_depth': 10, 'max_features': 'sqrt', 'max_leaf_nodes': 250, 'min_samples_split': 16}
Best CV Mean Accuracy:  0.7668
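
Beyond the single best score, a fitted `GridSearchCV` keeps per-candidate results in its `cv_results_` attribute. The sketch below shows one way to rank them, using synthetic data and a deliberately small grid as assumptions; `demo_gs` and `results` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Spaceship Titanic features)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

demo_gs = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [3, 10, None], "min_samples_split": [4, 16]},
    cv=5,
)
demo_gs.fit(X_demo, y_demo)

# One row per candidate, with mean and spread of the fold scores
results = pd.DataFrame(demo_gs.cv_results_)
cols = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score").head())
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the top candidates differ meaningfully or only by fold-to-fold noise.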


- Evaluate your model

In [26]:
# 1. Retrieve the best estimator (refit on the full training set by GridSearchCV)
best_model = gs.best_estimator_

print(f"Best Hyperparameters: {gs.best_params_}")
print(f"Best Validation Accuracy (during training): {gs.best_score_:.4f}")

# 2. Predict on the test set
y_pred = best_model.predict(X_test_norm_df)

# 3. Calculate final metrics
test_accuracy = accuracy_score(y_test, y_pred)

print("\n--- Final Test Evaluation ---")
print(f"Test Set Accuracy: {test_accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Best Hyperparameters: {'max_depth': 10, 'max_features': 'sqrt', 'max_leaf_nodes': 250, 'min_samples_split': 16}
Best Validation Accuracy (during training): 0.7668

--- Final Test Evaluation ---
Test Set Accuracy: 0.7451

Classification Report:
              precision    recall  f1-score   support

       False       0.78      0.69      0.73       661
        True       0.72      0.80      0.76       661

    accuracy                           0.75      1322
   macro avg       0.75      0.75      0.74      1322
weighted avg       0.75      0.75      0.74      1322
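
The search above tunes a single decision tree, whose test accuracy (0.7451) ends up below the random forest baseline (79.58%). A natural follow-up is to run the same kind of search on the random forest itself. The sketch below shows the pattern on synthetic data; the grid values in `rf_grid` and all variable names are assumptions for illustration, not tuned recommendations for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: not the Spaceship Titanic features)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Hypothetical grid: 2 * 2 * 2 = 8 candidates
rf_grid = {
    "n_estimators": [50, 100],
    "max_depth": [5, 20],
    "min_samples_split": [4, 16],
}

rf_gs = GridSearchCV(RandomForestClassifier(random_state=0), param_grid=rf_grid, cv=3)
rf_gs.fit(X_tr, y_tr)

# score() uses the best estimator, refit on the whole training split
test_acc = rf_gs.score(X_te, y_te)
print(f"Best hyperparameters: {rf_gs.best_params_}")
print(f"Best CV mean accuracy: {rf_gs.best_score_:.4f}")
print(f"Test accuracy: {test_acc:.4f}")
```

On the real data, swapping in `X_train_norm_df`, `y_train`, `X_test_norm_df` and `y_test` would give a like-for-like comparison with the baseline forest.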

