# LAB | Hyperparameter Tuning

**Load the data**

This is the final step in maximizing the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

The metadata can be found here:

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before, then take the best-performing models you have obtained so far and fine-tune their hyperparameters.
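To make "default values" concrete, the following minimal sketch lists the defaults of the hyperparameters tuned later in this lab. It assumes only that scikit-learn is installed; `default_forest` and `default_params` are illustrative names, and the exact defaults can differ between scikit-learn versions.

```python
from sklearn.ensemble import RandomForestClassifier

# An untuned estimator: every hyperparameter keeps its library default.
default_forest = RandomForestClassifier()
default_params = default_forest.get_params()

# Print only the hyperparameters that are tuned later in this lab.
for name in ["n_estimators", "max_depth", "min_samples_split",
             "min_samples_leaf", "max_features"]:
    print(f"{name}: {default_params[name]}")
```

`get_params()` works on any scikit-learn estimator, fitted or not, so it is a quick way to check what a model was actually configured with.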

In [4]:
# Libraries
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

In [5]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


Now perform the same steps as before:
- Feature Scaling
- Feature Selection


In [7]:
# Drop rows with missing values
spaceship = spaceship.dropna()

# Keep only the first character of Cabin (the deck letter)
spaceship['Cabin'] = spaceship['Cabin'].str[0]

# Drop identifier columns and one-hot encode the categorical columns
spaceship.drop(columns=['PassengerId', 'Name'], inplace=True)
spaceship = pd.get_dummies(spaceship)

features = spaceship.drop(columns='Transported')
target = spaceship["Transported"]

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=0)

# Normalize the data after the train/test split

normalizer = MinMaxScaler()  # Define the normalizer

normalizer.fit(X_train)  # Fit on the training data only

X_train_norm = normalizer.transform(X_train)  # Normalize the 80% training data
X_test_norm = normalizer.transform(X_test)    # Normalize the 20% test data

# Convert the scaled arrays back to DataFrames with the original column names
X_train_norm = pd.DataFrame(X_train_norm, columns=X_train.columns)
X_test_norm = pd.DataFrame(X_test_norm, columns=X_test.columns)
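The scaler above is fitted on the training split only and then applied to both splits. A minimal sketch of what that implies, using a tiny hypothetical `Age` column (the frames `train` and `test` are made up for illustration, not taken from the dataset): the minimum and maximum come from the training rows, so test values outside that range map outside [0, 1].

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: training ages span 10 to 30.
train = pd.DataFrame({"Age": [10.0, 20.0, 30.0]})
test = pd.DataFrame({"Age": [40.0, 5.0]})

# Fit on the training data only, then transform both splits.
scaler = MinMaxScaler().fit(train)
train_scaled = scaler.transform(train).ravel()
test_scaled = scaler.transform(test).ravel()

print("Train scaled:", train_scaled)
print("Test scaled:", test_scaled)
```

With a training range of 10 to 30, the test ages 40 and 5 scale to 1.5 and -0.25. This is expected behavior: fitting the scaler on the test data as well would leak information about the test set into preprocessing.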


- Now let's use the best model we have so far to see how much it improves when we fine-tune its hyperparameters.

In [14]:
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

# Initialize the Random Forest Classifier
forest = RandomForestClassifier(n_estimators=100, max_depth=20)

# Fit the model on the training data
forest.fit(X_train_norm, y_train)

# Make predictions on the test data
pred = forest.predict(X_test_norm)

# Calculate and print evaluation metrics for classification
print("Accuracy:", accuracy_score(y_test, pred))
print("F1 Score:", f1_score(y_test, pred, average='weighted'))
print("Confusion Matrix:\n", confusion_matrix(y_test, pred))


Accuracy: 0.7829046898638427
F1 Score: 0.7828598375915614
Confusion Matrix:
 [[527 134]
 [153 508]]
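The reported accuracy can be recovered from the confusion matrix itself. As a short worked check, assuming scikit-learn's convention that rows are true classes and columns are predicted classes, the diagonal holds the correct predictions:

```python
import numpy as np

# Confusion matrix printed above (rows: true class, columns: predicted class).
cm = np.array([[527, 134],
               [153, 508]])

# Accuracy is the share of correct predictions (the diagonal) among all samples.
accuracy = np.trace(cm) / cm.sum()

# Recall per class is the diagonal divided by each row total.
recall_per_class = np.diag(cm) / cm.sum(axis=1)

print("Accuracy:", accuracy)
print("Recall per class:", recall_per_class)
```

Here 1035 of 1322 test samples are classified correctly, which matches the accuracy of roughly 0.783 printed by `accuracy_score`.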


- Evaluate your model

In [67]:
# best_params_ exists only on a fitted search object such as GridSearchCV,
# not on a plain RandomForestClassifier. Inspect the current settings instead:
forest.get_params()

**Grid/Random Search**

For this lab we will use Grid Search.
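Grid search evaluates every combination in a fixed grid, whereas random search samples a fixed number of combinations from given ranges. For comparison, here is a minimal sketch of the random-search alternative. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the lab's dataset), and the names `X_demo`, `y_demo`, and `param_distributions` as well as the ranges are chosen only for illustration.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: not the Spaceship Titanic set).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Distributions or lists to sample from, instead of an exhaustive grid.
param_distributions = {
    "n_estimators": randint(20, 60),   # integers from 20 to 59
    "max_depth": [5, 10, None],
    "min_samples_leaf": randint(1, 5), # integers from 1 to 4
}

random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,               # Only 5 sampled combinations are evaluated
    cv=3,                   # 3-fold cross-validation
    scoring="f1_weighted",
    random_state=0,
)
random_search.fit(X_demo, y_demo)

print("Best parameters:", random_search.best_params_)
print("Best CV score:", random_search.best_score_)
```

The cost of random search is set by `n_iter` rather than by the size of the grid, which can help when the grid grows large.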

- Define the hyperparameters to fine-tune.

In [19]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

# Define the parameter grid for GridSearchCV
param_grid = {
    'n_estimators': [50, 100, 150],        # Number of trees in the forest
    'max_depth': [10, 20, 30, None],       # Maximum depth of each tree
    'min_samples_split': [2, 5, 10],       # Minimum samples required to split a node
    'min_samples_leaf': [1, 2, 4],         # Minimum samples required at each leaf node
    'max_features': ['sqrt', 'log2', None] # Number of features to consider for best split
}

# Initialize the RandomForestClassifier
forest = RandomForestClassifier()

# Set up GridSearchCV
grid_search = GridSearchCV(
    estimator=forest, 
    param_grid=param_grid, 
    cv=5,                       # 5-fold cross-validation
    scoring='f1_weighted',      # Scoring based on weighted F1 score
    n_jobs=-1,                  # Use all available cores
    verbose=2                   # Print progress during grid search
)

# Fit GridSearchCV on the training data
grid_search.fit(X_train_norm, y_train)

# Get the best model from grid search
best_forest = grid_search.best_estimator_

# Make predictions with the best model
pred = best_forest.predict(X_test_norm)

# Calculate and print evaluation metrics
print("Best Parameters:", grid_search.best_params_)
print("Accuracy:", accuracy_score(y_test, pred))
print("F1 Score:", f1_score(y_test, pred, average='weighted'))
print("Confusion Matrix:\n", confusion_matrix(y_test, pred))

Fitting 5 folds for each of 324 candidates, totalling 1620 fits
Best Parameters: {'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 50}
Accuracy: 0.783661119515885
F1 Score: 0.783660624370594
Confusion Matrix:
 [[519 142]
 [144 517]]
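The "324 candidates, totalling 1620 fits" line in the log follows directly from the grid: 3 × 4 × 3 × 3 × 3 = 324 combinations, each fitted once per fold. A small sketch that reproduces the count with scikit-learn's `ParameterGrid` (the variable names `n_candidates` and `n_fits` are illustrative):

```python
from sklearn.model_selection import ParameterGrid

# Same grid as used for GridSearchCV above.
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None],
}

# Number of distinct hyperparameter combinations in the grid.
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fitted once per cross-validation fold (cv=5).
n_fits = n_candidates * 5

print(n_candidates, n_fits)  # 324 1620
```

Because the count is a product of the list lengths, adding one more value to any list or one more hyperparameter multiplies the number of fits, so it is worth checking before launching a search.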


- Run Grid Search

In [23]:
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

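# Initialize the Random Forest Classifier with the n_estimators and max_depth found by grid search
# (min_samples_leaf is left at its default of 1 here, not the tuned value of 4)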
forest = RandomForestClassifier(n_estimators=50, max_depth=10)

# Fit the model on the training data
forest.fit(X_train_norm, y_train)

# Make predictions on the test data
pred = forest.predict(X_test_norm)

# Calculate and print evaluation metrics for classification
print("Accuracy:", accuracy_score(y_test, pred))
print("F1 Score:", f1_score(y_test, pred, average='weighted'))
print("Confusion Matrix:\n", confusion_matrix(y_test, pred))


Accuracy: 0.7912254160363086
F1 Score: 0.7912211154648991
Confusion Matrix:
 [[520 141]
 [135 526]]
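None of the `RandomForestClassifier` calls above set `random_state`, so each run trains a slightly different forest. One way to gauge whether accuracy differences of under one percentage point (such as 0.783 versus 0.791) are meaningful is to repeat the same fit with several seeds and look at the spread. A minimal sketch on synthetic data (an assumption; `X_demo`, `y_demo`, and the split names are hypothetical stand-ins for the lab's data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the Spaceship Titanic set).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Train the same configuration with different seeds and record test accuracy.
scores = []
for seed in range(5):
    model = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=seed)
    model.fit(X_tr, y_tr)
    scores.append(accuracy_score(y_te, model.predict(X_te)))

print("Accuracy per seed:", scores)
print("Spread (max - min):", max(scores) - min(scores))
```

If the spread across seeds is comparable to the gap between two configurations, that gap may be noise rather than a real improvement. Fixing `random_state` in every model makes such comparisons reproducible.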


- Evaluate your model