# LAB | Hyperparameter Tuning

**Load the data**

This is the final step in maximizing the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before, then take the best-performing models you have so far and fine-tune their hyperparameters.
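To make "default values for hyperparameters" concrete, the short sketch below prints a few of the defaults scikit-learn applies when a model is created without arguments. The variable names are illustrative only and are not part of the lab.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier

# Instantiate models without arguments so every hyperparameter keeps its default
knn_defaults = KNeighborsClassifier().get_params()
ada_defaults = AdaBoostClassifier().get_params()

# These are the values a search would replace with tuned ones
print("KNN n_neighbors:", knn_defaults["n_neighbors"])
print("AdaBoost n_estimators:", ada_defaults["n_estimators"])
print("AdaBoost learning_rate:", ada_defaults["learning_rate"])
```

Calling `get_params()` on any estimator is a quick way to see which names can appear as keys in a search grid.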

In [14]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, chi2

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report

from sklearn.model_selection import GridSearchCV


In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [5]:
#Check the shape of the data
spaceship.shape

#Check for data types
spaceship.dtypes

#Check for missing values
spaceship.isnull().sum()

# Handle missing values by dropping rows.
# Because so few values are null, we drop every row with a missing value in any feature column.
spaceship.dropna(subset=['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Age', 'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck'], inplace=True)

# Cabin is too granular: reduce it to its deck letter, one of {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}
spaceship['Cabin'] = spaceship['Cabin'].str[0]

# Optionally, ensure that only the expected categories are present
valid_categories = {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}
spaceship = spaceship[spaceship['Cabin'].isin(valid_categories)]
 
# - Drop PassengerId and Name
spaceship.drop(['PassengerId', 'Name'], axis=1, inplace=True)

# Safety check: drop any rows that still contain missing values
spaceship.dropna(inplace=True)
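As a small illustration of the Cabin transformation, the sketch below works on a toy Series rather than the real data. It assumes cabin values follow the three-part form visible in the sample rows (for example `B/0/P`); the column names `deck`, `num` and `side` are hypothetical labels chosen for this example.

```python
import pandas as pd

# Toy cabin values in the same shape as the sample rows, plus one missing value
cabins = pd.Series(["B/0/P", "F/0/S", "A/0/S", None])

# Option 1: keep only the first character (the deck letter), as the lab does
deck_only = cabins.str[0]

# Option 2: split on "/" to keep all three parts as separate columns
parts = cabins.str.split("/", expand=True)
parts.columns = ["deck", "num", "side"]  # hypothetical column names

print(deck_only.tolist())
print(parts)
```

Keeping only the deck letter gives a small set of categories to one-hot encode; the split variant is shown only as a possible alternative if the other parts turn out to be useful.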

Now perform the same steps as before:
- Feature Scaling
- Feature Selection
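The two steps listed above can be sketched on synthetic data as follows. This is a minimal illustration, not the lab's solution: the DataFrame, its column names and the choice of `k=2` are assumptions made for the example. `MinMaxScaler` is used before `chi2` because `chi2` requires non-negative inputs.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic data: one feature tied to the target, two pure-noise features
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
X = pd.DataFrame({
    "informative": y * 50.0 + rng.normal(0, 5, size=n),
    "noise_a": rng.normal(0, 1, size=n),
    "noise_b": rng.normal(100, 20, size=n),
})

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Feature scaling: fit on the training split only, then transform both splits
scaler = MinMaxScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

# Feature selection: keep the k features with the highest chi-squared score
selector = SelectKBest(chi2, k=2).fit(X_tr_s, y_tr)
selected = X.columns[selector.get_support()].tolist()
X_tr_sel = selector.transform(X_tr_s)
X_te_sel = selector.transform(X_te_s)

print("Selected features:", selected)
```

Fitting both the scaler and the selector on the training split only keeps test-set information out of the preprocessing.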


In [6]:
#your code here

#one-hot encoding
print("Original dataframe")
df_categorical = spaceship.select_dtypes(include=['object'])
display(df_categorical)

#creating dummy variables > pd.get_dummies()
features = pd.get_dummies(spaceship, columns=['HomePlanet', 'CryoSleep','Cabin','Destination','VIP'])
features = features.drop(columns=['Transported'])

print('Dataframe with Dummy variables')
features

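# Perform train-test split (test_size=None falls back to scikit-learn's default of 0.25)
target = spaceship['Transported']
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=None, random_state=0)

# Fit the MinMaxScaler on the training split only to avoid leaking test-set statistics
normalizer = MinMaxScaler()
normalizer.fit(X_train)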

X_train_norm = normalizer.transform(X_train)
X_test_norm = normalizer.transform(X_test)

X_train_norm = pd.DataFrame(X_train_norm, columns = X_train.columns)
X_test_norm = pd.DataFrame(X_test_norm, columns = X_test.columns)

# Selecting numerical features
numerical_features = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

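# Initializing the StandardScaler
scaler = StandardScaler()

# Applying scaling to the numerical columns of `spaceship` itself.
# Note: `features`, X_train and X_test were created above and are not affected by this step.
spaceship[numerical_features] = scaler.fit_transform(spaceship[numerical_features])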

# Display the first few rows to check the scaling
spaceship.head()

Original dataframe


Unnamed: 0,HomePlanet,CryoSleep,Cabin,Destination,VIP
0,Europa,False,B,TRAPPIST-1e,False
1,Earth,False,F,TRAPPIST-1e,False
2,Europa,False,A,TRAPPIST-1e,True
3,Europa,False,A,TRAPPIST-1e,False
4,Earth,False,F,TRAPPIST-1e,False
...,...,...,...,...,...
8688,Europa,False,A,55 Cancri e,True
8689,Earth,True,G,PSO J318.5-22,False
8690,Earth,False,G,TRAPPIST-1e,False
8691,Europa,False,E,55 Cancri e,False


Dataframe with Dummy variables


Unnamed: 0,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported
0,Europa,False,B,TRAPPIST-1e,0.695365,False,-0.346316,-0.286103,-0.282915,-0.275577,-0.27029,False
1,Earth,False,F,TRAPPIST-1e,-0.337089,False,-0.178108,-0.280735,-0.243729,0.206465,-0.231242,True
2,Europa,False,A,TRAPPIST-1e,2.00314,True,-0.279959,1.846533,-0.282915,5.620436,-0.226804,False
3,Europa,False,A,TRAPPIST-1e,0.282383,False,-0.346316,0.479046,0.298603,2.647405,-0.09901,False
4,Earth,False,F,TRAPPIST-1e,-0.887732,False,0.121271,-0.244356,-0.046233,0.220513,-0.268515,True


- Now let's use the best model we have so far to see how much it improves when we fine-tune its hyperparameters.

In [7]:
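#your code here
# AdaBoost with depth-1 decision trees (stumps) as weak learners.
# AdaBoostClassifier and DecisionTreeClassifier were already imported in the first cell.
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=100,
    learning_rate=0.5, random_state=42)

# Train on the unscaled features; tree-based models do not depend on feature scaling
ada_clf.fit(X_train, y_train)
y_pred_ada = ada_clf.predict(X_test)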

print("AdaBoost Classifier Accuracy:", accuracy_score(y_test, y_pred_ada))
print("AdaBoost Classifier Report:")
print(classification_report(y_test, y_pred_ada))

AdaBoost Classifier Accuracy: 0.8083973979893554
AdaBoost Classifier Report:
              precision    recall  f1-score   support

       False       0.83      0.77      0.80       837
        True       0.79      0.85      0.82       854

    accuracy                           0.81      1691
   macro avg       0.81      0.81      0.81      1691
weighted avg       0.81      0.81      0.81      1691



- Evaluate your model

In [8]:
#your code here

# Evaluate the model's performance
accuracy_ada = accuracy_score(y_test, y_pred_ada)
classification_report_ada = classification_report(y_test, y_pred_ada)

# Display the results
accuracy_ada, classification_report_ada

(0.8083973979893554,
 '              precision    recall  f1-score   support\n\n       False       0.83      0.77      0.80       837\n        True       0.79      0.85      0.82       854\n\n    accuracy                           0.81      1691\n   macro avg       0.81      0.81      0.81      1691\nweighted avg       0.81      0.81      0.81      1691\n')

**Grid/Random Search**

For this lab we will use Grid Search.
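For comparison, the other option named in the heading, random search, could look roughly like the sketch below. It runs on synthetic data from `make_classification`, and the distributions, `n_iter=5` and `cv=3` are illustrative assumptions rather than recommended settings.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the lab's training data
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Distributions to sample from, instead of fixed lists of values
param_distributions = {
    "n_estimators": randint(20, 60),        # integers from 20 to 59
    "learning_rate": uniform(0.01, 0.99),   # uniform(loc, scale): values in [0.01, 1.0]
}

random_search = RandomizedSearchCV(
    AdaBoostClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,            # number of sampled candidates
    cv=3,
    scoring="accuracy",
    random_state=0,
)
random_search.fit(X, y)

print(random_search.best_params_, random_search.best_score_)
```

Random search evaluates a fixed number of sampled candidates, so its cost does not grow with the size of the search space the way an exhaustive grid does.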

- Define the hyperparameters to fine-tune.

In [12]:
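# Define the parameter grid.
# Keys prefixed with 'base_estimator__' are passed to the wrapped DecisionTreeClassifier.
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.5, 1.0],
    'base_estimator__max_depth': [1, 2, 3, 4, 5]
}

# Initialize the AdaBoost model with DecisionTreeClassifier as the base estimator.
# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator` and 1.4 removed the old name;
# on those versions use `estimator=` here and 'estimator__max_depth' in the grid.
ada_clf = AdaBoostClassifier(base_estimator=DecisionTreeClassifier())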
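Before running a grid search it helps to know how much work the grid implies. The sketch below counts the candidate combinations in the grid defined above with `ParameterGrid`, and multiplies by an assumed 5 cross-validation folds to estimate the number of model fits.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
param_grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.1, 0.5, 1.0],
    "base_estimator__max_depth": [1, 2, 3, 4, 5],
}

# Every combination of values is one candidate model: 3 * 4 * 5
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fitted once per fold (5 folds assumed here)
n_folds = 5
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
```

If the fit count looks too expensive, shrinking one of the value lists or switching to a randomized search are the usual ways to reduce it.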



- Run Grid Search

In [15]:
# Initialize Grid Search with cross-validation
grid_search = GridSearchCV(estimator=ada_clf, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Fit the Grid Search to the data
grid_search.fit(X_train, y_train)




- Evaluate your model

In [16]:
# Get the best parameters and the best mean cross-validated accuracy
best_params = grid_search.best_params_
best_score = grid_search.best_score_

best_params, best_score

({'base_estimator__max_depth': 5, 'learning_rate': 0.01, 'n_estimators': 200},
 0.7928243992965479)
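The score above is a cross-validated accuracy on the training data, so it is not directly comparable with the test-set accuracy reported earlier for the untuned model. One way to get a comparable number is to score the refitted best estimator on the held-out split, as in the sketch below. It uses synthetic data and a deliberately small grid; the `prefix` helper is an assumption added so the example runs on scikit-learn versions with either parameter name.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the lab's data
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pass the tree positionally so the call works whether the parameter
# is named `estimator` (newer releases) or `base_estimator` (older ones)
ada = AdaBoostClassifier(DecisionTreeClassifier(), random_state=0)
prefix = "estimator" if "estimator" in ada.get_params() else "base_estimator"

param_grid = {
    "n_estimators": [25, 50],
    "learning_rate": [0.1, 1.0],
    f"{prefix}__max_depth": [1, 2],
}

search = GridSearchCV(ada, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

# best_estimator_ is refitted on the whole training split by default (refit=True)
best_model = search.best_estimator_
test_accuracy = accuracy_score(y_test, best_model.predict(X_test))

print("CV accuracy:", search.best_score_)
print("Test accuracy:", test_accuracy)
```

In the lab itself the same pattern would apply to `grid_search.best_estimator_` with the existing `X_test` and `y_test`.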