# LAB | Hyperparameter Tuning

**Load the data**

This is the final step: maximizing the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before, then take the best-performing model you have so far and fine-tune its hyperparameters.

In [60]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV

In [62]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same steps as before:
- Feature Scaling
- Feature Selection


In [65]:
#your code here

In [67]:
spaceship = spaceship.dropna()

In [69]:
spaceship = pd.get_dummies(spaceship, columns=['HomePlanet', 'Destination', 'Cabin'], drop_first=True)
spaceship

Unnamed: 0,PassengerId,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,...,Cabin_G/992/P,Cabin_G/992/S,Cabin_G/993/S,Cabin_G/994/S,Cabin_G/996/S,Cabin_G/998/S,Cabin_G/999/P,Cabin_G/999/S,Cabin_T/1/P,Cabin_T/3/P
0,0001_01,False,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,...,False,False,False,False,False,False,False,False,False,False
1,0002_01,False,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,...,False,False,False,False,False,False,False,False,False,False
2,0003_01,False,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,...,False,False,False,False,False,False,False,False,False,False
3,0003_02,False,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,...,False,False,False,False,False,False,False,False,False,False
4,0004_01,False,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,9276_01,False,41.0,True,0.0,6819.0,0.0,1643.0,74.0,Gravior Noxnuther,...,False,False,False,False,False,False,False,False,False,False
8689,9278_01,True,18.0,False,0.0,0.0,0.0,0.0,0.0,Kurta Mondalley,...,False,False,False,False,False,False,False,False,False,False
8690,9279_01,False,26.0,False,0.0,0.0,1872.0,1.0,0.0,Fayey Connon,...,False,False,False,False,False,False,False,False,False,False
8691,9280_01,False,32.0,False,0.0,1049.0,0.0,353.0,3235.0,Celeon Hontichre,...,False,False,False,False,False,False,False,False,False,False
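One-hot encoding the raw `Cabin` string produces one column per distinct cabin, which is why the frame above ends up with thousands of `Cabin_...` columns. Since the cabin values follow a `deck/num/side` pattern (for example `B/0/P`), one possible alternative is to split the string and encode only the deck and side. The sketch below uses a tiny hypothetical frame (`demo`) and hypothetical column names (`Deck`, `CabinNum`, `Side`); it is not part of the original lab solution.

```python
import pandas as pd

# Tiny hypothetical frame that mimics the Cabin format (deck/num/side).
demo = pd.DataFrame({"Cabin": ["B/0/P", "F/0/S", "A/0/S", "F/1/S"]})

# Split "deck/num/side" into three separate columns.
cabin_parts = demo["Cabin"].str.split("/", expand=True)
cabin_parts.columns = ["Deck", "CabinNum", "Side"]

# Keep only the two low-cardinality parts and one-hot encode them.
demo = pd.concat([demo.drop(columns="Cabin"), cabin_parts[["Deck", "Side"]]], axis=1)
encoded = pd.get_dummies(demo, columns=["Deck", "Side"], drop_first=True)

print(encoded.columns.tolist())
```

On the real data this would yield a handful of deck and side columns instead of one column per cabin, which keeps the feature matrix small and makes the grid search considerably cheaper.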


In [71]:
# Dropping unnecessary columns
spaceship.drop(['PassengerId', 'Name'], axis=1, inplace=True)

In [73]:
scaler = StandardScaler()
scaled_columns = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
spaceship[scaled_columns] = scaler.fit_transform(spaceship[scaled_columns])

In [75]:
X = spaceship.drop('Transported', axis=1)
y = spaceship['Transported']

In [77]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
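Note that the scaler above was fitted on the full dataset before the split, so statistics from the test rows leak into the training features. A common way to avoid this is to split first and fit the scaler on the training portion only. The following is a minimal sketch on synthetic data; the frame `demo` and the `_demo`, `X_tr`, `X_te` names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric Spaceship Titanic columns (assumption).
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Age": rng.uniform(0, 80, size=100),
    "Spa": rng.exponential(300, size=100),
    "Transported": rng.integers(0, 2, size=100).astype(bool),
})

X_demo = demo.drop("Transported", axis=1)
y_demo = demo["Transported"]

# Split first, so the test rows never influence the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
X_tr, X_te = X_tr.copy(), X_te.copy()

cols = ["Age", "Spa"]
scaler_demo = StandardScaler()
X_tr[cols] = scaler_demo.fit_transform(X_tr[cols])  # learn mean/std on train only
X_te[cols] = scaler_demo.transform(X_te[cols])      # reuse the train statistics
```

For tree-based models such as random forests, scaling has little effect on the result, so the leak is mild here; the pattern matters more for distance-based or linear models.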

- Now let's take the best model we have so far and see how much it improves when we fine-tune its hyperparameters.

In [80]:
#your code here

In [82]:
rf = RandomForestClassifier(random_state=42)

In [84]:
# Define the hyperparameters grid to search over
param_grid = {
    'n_estimators': [50, 100, 200],  # Number of trees in the forest
    'max_depth': [10, 20, None],     # Maximum depth of the tree
    'min_samples_split': [2, 5, 10], # Minimum number of samples required to split a node
    'min_samples_leaf': [1, 2, 4]    # Minimum number of samples required at each leaf node
}
param_grid

{'n_estimators': [50, 100, 200],
 'max_depth': [10, 20, None],
 'min_samples_split': [2, 5, 10],
 'min_samples_leaf': [1, 2, 4]}
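Before launching the search it can help to check how many fits the grid implies. As a quick sketch, `ParameterGrid` from scikit-learn enumerates every combination; the variable names `param_grid_demo` and `cv_folds` are only illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the lab, repeated so the snippet is self-contained.
param_grid_demo = {
    "n_estimators": [50, 100, 200],
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

n_candidates = len(ParameterGrid(param_grid_demo))  # 3 * 3 * 3 * 3 combinations
cv_folds = 3                                        # folds used per candidate
total_fits = n_candidates * cv_folds

print(n_candidates, total_fits)  # 81 243
```

Each extra value in any list multiplies the total, so grids grow quickly.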

In [86]:
# Grid Search
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)

Fitting 3 folds for each of 81 candidates, totalling 243 fits
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   0.7s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   1.1s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   1.1s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   1.5s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   1.6s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   1.6s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=5, n_estimators=50; total time=   0.7s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=5, n_estimators=50; total time=   0.7s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   2.3s
[... output truncated ...]

In [87]:
# Best parameters found by Grid Search
print("Best Parameters:", grid_search.best_params_)

Best Parameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 50}
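Besides `best_params_`, a fitted search also exposes `best_score_` (the mean cross-validated score of the best candidate) and `cv_results_` (scores for every candidate). The sketch below shows one way to inspect them; it runs on synthetic data from `make_classification` with a deliberately small grid, so the numbers it prints say nothing about the Spaceship Titanic results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data and a small grid, only to keep the example fast (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [10, 20], "max_depth": [3, None]},
    cv=3,
)
search.fit(X_demo, y_demo)

# Mean cross-validated accuracy of the best candidate.
print("Best CV score:", search.best_score_)

# One row per candidate, ranked from best to worst.
results = pd.DataFrame(search.cv_results_)
top = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score"]
].head()
print(top)
```

Comparing the top few rows shows whether the best combination is clearly ahead or whether several candidates score within one standard deviation of each other.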


- Evaluate your model

In [1]:
#your code here

In [88]:
# Best estimator from Grid Search
best_rf = grid_search.best_estimator_
best_rf

In [92]:
y_pred = best_rf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", conf_matrix)

Accuracy: 0.80

Classification Report:
               precision    recall  f1-score   support

       False       0.80      0.80      0.80       653
        True       0.81      0.80      0.81       669

    accuracy                           0.80      1322
   macro avg       0.80      0.80      0.80      1322
weighted avg       0.80      0.80      0.80      1322


Confusion Matrix:
 [[525 128]
 [131 538]]
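The headline metrics can be recomputed by hand from the confusion matrix, which makes it easier to see where each number comes from. This sketch hard-codes the matrix printed above and treats `True` (transported) as the positive class.

```python
import numpy as np

# Confusion matrix from the output above: rows = actual, columns = predicted,
# in the order [False, True].
conf = np.array([[525, 128],
                 [131, 538]])

tn, fp, fn, tp = conf.ravel()

accuracy = (tp + tn) / conf.sum()
precision = tp / (tp + fp)  # of the predicted "transported", how many were right
recall = tp / (tp + fn)     # of the actual "transported", how many were found

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# accuracy=0.80 precision=0.81 recall=0.80
```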


**Grid/Random Search**

For this lab we will use Grid Search.
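For comparison, random search samples a fixed number of combinations instead of trying all of them. The following is a rough sketch using `RandomizedSearchCV` with the same candidate values as the lab's grid; it runs on synthetic data, and `n_iter=5` is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data so the snippet runs on its own (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,          # evaluate 5 sampled combinations instead of all 81
    cv=3,
    random_state=42,   # make the sampled combinations reproducible
)
random_search.fit(X_demo, y_demo)

print("Best Parameters:", random_search.best_params_)
```

Random search trades exhaustiveness for speed: here it runs 15 fits rather than 243, at the risk of missing the best combination.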

- Define hyperparameters to fine tune.

In [None]:
#your code here

- Run Grid Search

- Precision and Recall Balance: Precision is 0.80 for passengers who were not transported (`False`) and 0.81 for those who were (`True`), and recall is 0.80 for both classes, so the model performs about equally well on the two classes.
- False Positives and False Negatives: Despite the 80% accuracy, the model still produces 128 false positives and 131 false negatives. The two error counts are nearly equal, so the model is not noticeably biased toward either class.
- False Negatives (131): These are passengers who were actually transported but were predicted as not transported.
- False Positives (128): These are passengers who were not transported but were predicted as transported.

- Evaluate your model