# LAB | Hyperparameter Tuning

**Load the data**

This is the final step toward maximizing the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

The metadata can be found here:

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before, then take the best-performing model you have obtained so far and fine-tune its hyperparameters.
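As a quick reminder of what "default values" means here, the sketch below prints a few of the defaults that scikit-learn's `RandomForestClassifier` uses when no arguments are passed. The choice of which parameters to print is ours, for illustration only, and the defaults may differ between scikit-learn versions.

```python
from sklearn.ensemble import RandomForestClassifier

# Instantiate without arguments so every hyperparameter takes its default value
default_forest = RandomForestClassifier()
defaults = default_forest.get_params()

# Print a handful of the hyperparameters that are commonly tuned
for name in ["n_estimators", "max_depth", "min_samples_split", "min_samples_leaf"]:
    print(f"{name}: {defaults[name]}")
```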

In [2]:
# Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [41]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship

,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,9276_01,Europa,False,A/98/P,55 Cancri e,41.0,True,0.0,6819.0,0.0,1643.0,74.0,Gravior Noxnuther,False
8689,9278_01,Earth,True,G/1499/S,PSO J318.5-22,18.0,False,0.0,0.0,0.0,0.0,0.0,Kurta Mondalley,False
8690,9279_01,Earth,False,G/1500/S,TRAPPIST-1e,26.0,False,0.0,0.0,1872.0,1.0,0.0,Fayey Connon,True
8691,9280_01,Europa,False,E/608/S,55 Cancri e,32.0,False,0.0,1049.0,0.0,353.0,3235.0,Celeon Hontichre,False


Now perform the same preprocessing steps as before:
- Feature Scaling
- Feature Selection
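As a reminder of what the scaling step does, the sketch below checks that min-max scaling is just `(x - min) / (max - min)` applied per column. The toy `ages` array is made up for illustration and is not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy column of ages (illustrative values, not taken from the dataset)
ages = np.array([[0.0], [20.0], [40.0], [80.0]])

# Manual min-max scaling: (x - min) / (max - min)
manual = (ages - ages.min()) / (ages.max() - ages.min())

# The same transformation through scikit-learn
scaled = MinMaxScaler().fit_transform(ages)

print(manual.ravel())
print(np.allclose(manual, scaled))
```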


In [44]:
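# Drop rows with missing values
spaceship.dropna(inplace=True)
# Keep only the first character of Cabin (the deck letter, e.g. 'B' from 'B/0/P')
spaceship['Cabin'] = spaceship['Cabin'].str[0]
# Drop the identifier and name columns
spaceship.drop(['PassengerId', 'Name'], axis=1, inplace=True)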

In [46]:
# Numerical features kept for this lab (a hand-picked subset of the numeric columns)
numerical_columns = ['Age', 'FoodCourt', 'VRDeck']

In [48]:
from sklearn.preprocessing import MinMaxScaler

# Rescale the selected numerical columns to the [0, 1] range.
# Note: the scaler is fit on the full dataset here, before the train/test split.
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(spaceship[numerical_columns])

# Convert scaled_data back to a DataFrame (this gives it a fresh 0..n-1 index)
numerical_scaled_df = pd.DataFrame(scaled_data, columns=numerical_columns)


In [50]:
numerical_scaled_df

,Age,FoodCourt,VRDeck
0,0.493671,0.000000,0.000000
1,0.303797,0.000302,0.002164
2,0.734177,0.119948,0.002410
3,0.417722,0.043035,0.009491
4,0.202532,0.002348,0.000098
...,...,...,...
6601,0.518987,0.228726,0.003639
6602,0.227848,0.000000,0.000000
6603,0.329114,0.000000,0.000000
6604,0.405063,0.035186,0.159077


In [56]:
# Categorical features to one-hot encode
categorical_columns = ['HomePlanet', 'CryoSleep', 'Destination', 'VIP']
# reset_index() renumbers the rows 0..n-1 but keeps the old row labels as an 'index' column
categorical_dummies_df = pd.get_dummies(spaceship[categorical_columns], columns=categorical_columns).reset_index()

In [58]:
categorical_dummies_df

,index,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_False,CryoSleep_True,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_False,VIP_True
0,0,False,True,False,True,False,False,False,True,True,False
1,1,True,False,False,True,False,False,False,True,True,False
2,2,False,True,False,True,False,False,False,True,False,True
3,3,False,True,False,True,False,False,False,True,True,False
4,4,True,False,False,True,False,False,False,True,True,False
...,...,...,...,...,...,...,...,...,...,...,...
6601,8688,False,True,False,True,False,True,False,False,False,True
6602,8689,True,False,False,False,True,False,True,False,True,False
6603,8690,True,False,False,True,False,False,False,True,True,False
6604,8691,False,True,False,True,False,True,False,False,True,False
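The `index` column in the table above is the original row label, which `reset_index()` preserves as a regular column. Because `features` is built from this frame, that column is also passed to the model as a feature. A minimal sketch of the difference, using a made-up toy frame: passing `drop=True` would discard the old labels instead of keeping them.

```python
import pandas as pd

# Toy frame whose index has gaps, as happens after dropna() (illustrative data)
df = pd.DataFrame({"VIP": [False, True, False]}, index=[0, 2, 5])

kept = df.reset_index()              # old index is kept as a column named "index"
dropped = df.reset_index(drop=True)  # old index is discarded

print(list(kept.columns))
print(list(dropped.columns))
```

Whether to drop the column in this lab is a modelling decision; doing so would change the feature set and therefore the scores reported later.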


In [60]:
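# Both frames share a fresh 0..n-1 index, so concat lines the rows up correctly
features = pd.concat([numerical_scaled_df, categorical_dummies_df], axis=1)
# target keeps the original row labels; train_test_split pairs rows by position
target = spaceship['Transported']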
features

,Age,FoodCourt,VRDeck,index,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_False,CryoSleep_True,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_False,VIP_True
0,0.493671,0.000000,0.000000,0,False,True,False,True,False,False,False,True,True,False
1,0.303797,0.000302,0.002164,1,True,False,False,True,False,False,False,True,True,False
2,0.734177,0.119948,0.002410,2,False,True,False,True,False,False,False,True,False,True
3,0.417722,0.043035,0.009491,3,False,True,False,True,False,False,False,True,True,False
4,0.202532,0.002348,0.000098,4,True,False,False,True,False,False,False,True,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6601,0.518987,0.228726,0.003639,8688,False,True,False,True,False,True,False,False,False,True
6602,0.227848,0.000000,0.000000,8689,True,False,False,False,True,False,True,False,True,False
6603,0.329114,0.000000,0.000000,8690,True,False,False,True,False,False,False,True,True,False
6604,0.405063,0.035186,0.159077,8691,False,True,False,True,False,True,False,False,True,False


In [62]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=0)

In [80]:
X_train

,Age,FoodCourt,VRDeck,index,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_False,CryoSleep_True,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_False,VIP_True
2584,0.405063,0.000000,0.000000,3432,False,False,True,False,True,False,False,True,True,False
5530,0.050633,0.000000,0.000000,7312,True,False,False,False,True,False,False,True,True,False
1526,0.379747,0.007916,0.000000,2042,True,False,False,True,False,False,False,True,True,False
3784,0.215190,0.000000,0.000049,4999,False,False,True,True,False,False,False,True,True,False
4357,0.329114,0.000000,0.000000,5755,True,False,False,False,True,False,False,True,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4931,0.670886,0.000000,0.000000,6518,False,False,True,True,False,False,False,True,True,False
3264,0.455696,0.000000,0.000098,4317,True,False,False,True,False,True,False,False,True,False
1653,0.455696,0.159528,0.004721,2214,False,True,False,True,False,False,False,True,True,False
2607,0.430380,0.000134,0.087480,3468,False,True,False,True,False,False,False,True,False,True


- Now let's take the best model we have so far and see how much it improves when we fine-tune its hyperparameters.

In [71]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Initialize Random Forest Classifier
random_forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Fit the model on the training data
random_forest.fit(X_train, y_train)

# Predict on the test data
y_pred = random_forest.predict(X_test)

- Evaluate your model

In [73]:
# Calculate and print performance metrics
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

print('Classification Report:')
print(classification_report(y_test, y_pred))

print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.72
Classification Report:
              precision    recall  f1-score   support

       False       0.71      0.75      0.73       661
        True       0.73      0.69      0.71       661

    accuracy                           0.72      1322
   macro avg       0.72      0.72      0.72      1322
weighted avg       0.72      0.72      0.72      1322

Confusion Matrix:
[[496 165]
 [207 454]]
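To see how the reported metrics relate to the confusion matrix, the sketch below recomputes accuracy, plus precision and recall for the `True` class, directly from the four counts printed above. It assumes scikit-learn's convention that rows are true classes and columns are predicted classes, in the order `False`, `True`.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predictions
cm = np.array([[496, 165],
               [207, 454]])

# Unpack in row-major order: true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_true = tp / (tp + fp)
recall_true = tp / (tp + fn)

print(f"accuracy={accuracy:.2f} precision={precision_true:.2f} recall={recall_true:.2f}")
# prints: accuracy=0.72 precision=0.73 recall=0.69
```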


**Grid/Random Search**

For this lab we will use Grid Search.
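For comparison, here is a minimal sketch of the random-search alternative using `RandomizedSearchCV`, which samples a fixed number of combinations instead of trying all of them. It runs on synthetic data from `make_classification` so that it is self-contained; the data, the candidate values, and `n_iter=5` are illustrative assumptions, not part of the lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data so the sketch runs on its own (not the Spaceship Titanic data)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Candidate values to sample from (illustrative choices)
param_distributions = {
    "n_estimators": [10, 20, 30],
    "max_depth": [None, 3, 5],
    "min_samples_leaf": [1, 2, 4],
}

# n_iter controls how many random combinations are tried (here 5 out of 27)
random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    random_state=0,
)
random_search.fit(X, y)

print(random_search.best_params_)
```

Random search tends to be the cheaper option when the grid is large, since the number of fits is set by `n_iter` rather than by the size of the grid.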

- Define hyperparameters to fine tune.

In [82]:
from sklearn.model_selection import GridSearchCV

# Initialize Random Forest Classifier
random_forest = RandomForestClassifier(random_state=0)

# Define the hyperparameters and their values
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10, 14],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

- Run Grid Search

In [86]:
# Initialize Grid Search
grid_search = GridSearchCV(estimator=random_forest, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)

# Fit Grid Search on the training data
grid_search.fit(X_train, y_train)

# Best parameters
best_params = grid_search.best_params_
print("Best parameters found: ", best_params)

# Predict on the test data with the best estimator
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

Fitting 5 folds for each of 108 candidates, totalling 540 fits
Best parameters found:  {'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 100}
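The candidate count in the log above follows directly from the grid: 3 × 4 × 3 × 3 = 108 combinations, each fitted once per fold. A small sketch that reproduces the arithmetic with scikit-learn's `ParameterGrid`, using the same `param_grid` as in this lab:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as defined earlier in the lab
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10, 14],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

n_candidates = len(ParameterGrid(param_grid))  # number of distinct combinations
n_folds = 5                                    # cv=5 in GridSearchCV

print(n_candidates, n_candidates * n_folds)  # prints: 108 540
```

Counting the fits up front is a cheap way to judge how long a search will take before launching it.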


- Evaluate your model

In [88]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: ", accuracy)

classification_rep = classification_report(y_test, y_pred)
print("Classification Report:\n", classification_rep)

conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)

Accuracy:  0.7443267776096822
Classification Report:
               precision    recall  f1-score   support

       False       0.73      0.78      0.75       661
        True       0.77      0.70      0.73       661

    accuracy                           0.74      1322
   macro avg       0.75      0.74      0.74      1322
weighted avg       0.75      0.74      0.74      1322

Confusion Matrix:
 [[518 143]
 [195 466]]
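The test accuracy above (about 0.74, against 0.72 before tuning) comes from a single train/test split. The cross-validated scores that grid search used to rank the candidates are stored in `cv_results_` and `best_score_`. The sketch below shows one way to inspect them; it uses synthetic data and a deliberately tiny grid (both illustrative assumptions) so that it runs on its own.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the Spaceship Titanic data)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Tiny illustrative grid: 2 x 2 = 4 candidates
small_grid = {"max_depth": [3, None], "min_samples_leaf": [1, 4]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    small_grid,
    cv=3,
)
search.fit(X, y)

# One row per candidate, with mean and spread of the fold scores
results = pd.DataFrame(search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score"))

# Mean cross-validated accuracy of the best candidate
print("Best cross-validated accuracy:", search.best_score_)
```

In the lab itself the same attributes are available on `grid_search` after fitting, so the ranking of all 108 candidates can be examined the same way.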
