# LAB | Hyperparameter Tuning

**Load the data**

This is the final step: maximizing the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before, then compare the best-performing models you have so far, this time fine-tuning their hyperparameters.

In [22]:
#Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import accuracy_score, precision_score, f1_score, recall_score

In [3]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same as before:
- Feature Scaling
- Feature Selection


**Step 1: Check the Shape**

In [4]:
spaceship.shape

(8693, 14)

**Step 2: Data Types/Null Values**

In [5]:
spaceship.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB


In [6]:
spaceship.isna().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

**Step 3: Removal/Modification**

In [7]:
spaceship.dropna(inplace=True)

In [8]:
# Keep only the first character of Cabin (its leading letter)
spaceship['Cabin'] = spaceship['Cabin'].str[0]
spaceship['Cabin'] = spaceship['Cabin'].str.upper()

In [9]:
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [10]:
spaceship.drop(['PassengerId', 'Name'], axis=1, inplace=True)

In [11]:
spaceship.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6606 entries, 0 to 8692
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   HomePlanet    6606 non-null   object 
 1   CryoSleep     6606 non-null   object 
 2   Cabin         6606 non-null   object 
 3   Destination   6606 non-null   object 
 4   Age           6606 non-null   float64
 5   VIP           6606 non-null   object 
 6   RoomService   6606 non-null   float64
 7   FoodCourt     6606 non-null   float64
 8   ShoppingMall  6606 non-null   float64
 9   Spa           6606 non-null   float64
 10  VRDeck        6606 non-null   float64
 11  Transported   6606 non-null   bool   
dtypes: bool(1), float64(6), object(5)
memory usage: 625.8+ KB


**AI Feedback**

One suggestion for the future is to consider imputing missing values instead of dropping rows, as this can help retain more data. Overall, well done!
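
As a rough sketch of what imputing could look like, the snippet below fills numeric gaps with the column median (via scikit-learn's `SimpleImputer`) and a categorical gap with the most frequent value. The small `demo` frame, its values, and the choice of strategies are illustrative assumptions, not part of this lab's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for a few spaceship columns (values are made up)
demo = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0],
    "Spa": [0.0, 549.0, np.nan, 3329.0],
    "HomePlanet": ["Europa", "Earth", np.nan, "Europa"],
})

num_cols = ["Age", "Spa"]

imputed = demo.copy()
# Numeric columns: replace missing values with the column median
imputed[num_cols] = SimpleImputer(strategy="median").fit_transform(demo[num_cols])
# Categorical column: replace missing values with the most frequent category
imputed["HomePlanet"] = demo["HomePlanet"].fillna(demo["HomePlanet"].mode()[0])

print(imputed)
print("Remaining missing values:", imputed.isna().sum().sum())
```

In a real workflow the imputer would normally be fitted on the training split only and then applied to the test split, for the same reason the scaler is.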

**Step 4: Make Dummies for Certain Columns**

In [12]:
spaceship = pd.get_dummies(spaceship, columns=['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP'])

In [13]:
spaceship.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6606 entries, 0 to 8692
Data columns (total 25 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Age                        6606 non-null   float64
 1   RoomService                6606 non-null   float64
 2   FoodCourt                  6606 non-null   float64
 3   ShoppingMall               6606 non-null   float64
 4   Spa                        6606 non-null   float64
 5   VRDeck                     6606 non-null   float64
 6   Transported                6606 non-null   bool   
 7   HomePlanet_Earth           6606 non-null   bool   
 8   HomePlanet_Europa          6606 non-null   bool   
 9   HomePlanet_Mars            6606 non-null   bool   
 10  CryoSleep_False            6606 non-null   bool   
 11  CryoSleep_True             6606 non-null   bool   
 12  Cabin_A                    6606 non-null   bool   
 13  Cabin_B                    6606 non-null   bool   
 1

In [14]:
spaceship.columns

Index(['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
       'Transported', 'HomePlanet_Earth', 'HomePlanet_Europa',
       'HomePlanet_Mars', 'CryoSleep_False', 'CryoSleep_True', 'Cabin_A',
       'Cabin_B', 'Cabin_C', 'Cabin_D', 'Cabin_E', 'Cabin_F', 'Cabin_G',
       'Cabin_T', 'Destination_55 Cancri e', 'Destination_PSO J318.5-22',
       'Destination_TRAPPIST-1e', 'VIP_False', 'VIP_True'],
      dtype='object')

**Step 5: Perform Train Test Split**

In [15]:
features = spaceship.drop(columns = ['Transported', 'HomePlanet_Earth', 'HomePlanet_Europa',
       'HomePlanet_Mars', 'CryoSleep_False', 'CryoSleep_True', 'Cabin_A',
       'Cabin_B', 'Cabin_C', 'Cabin_D', 'Cabin_E', 'Cabin_F', 'Cabin_G',
       'Cabin_T', 'Destination_55 Cancri e', 'Destination_PSO J318.5-22',
       'Destination_TRAPPIST-1e', 'VIP_False', 'VIP_True'])

target = spaceship['Transported']

**AI Feedback**

There seems to be a slight inconsistency in the features used for training. Ensure that dummy variables for categorical features are included in the training process. Despite this, your effort in tuning the hyperparameters is commendable.
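
One way to act on this feedback is to drop only the target column, so that every dummy column stays among the features. The sketch below shows the idea on a small stand-in frame; the names `encoded`, `features_all` and `target_all` and the rows themselves are made up for illustration.

```python
import pandas as pd

# Stand-in for the spaceship frame (made-up rows, subset of columns)
encoded = pd.DataFrame({
    "Age": [39.0, 24.0, 58.0],
    "Spa": [0.0, 549.0, 6715.0],
    "HomePlanet": ["Europa", "Earth", "Europa"],
    "Transported": [False, True, False],
})
encoded = pd.get_dummies(encoded, columns=["HomePlanet"])

# Drop only the target; numeric and dummy columns all remain as features
features_all = encoded.drop(columns=["Transported"])
target_all = encoded["Transported"]

print(list(features_all.columns))
```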

In [16]:
x_tr, x_t, y_tr, y_t = train_test_split(features, target, test_size=0.20, random_state=0)


In [17]:
normalizer = MinMaxScaler()

normalizer.fit(x_tr)

In [18]:
x_tr_norm = normalizer.transform(x_tr)

x_t_norm = normalizer.transform(x_t)

In [19]:
x_tr_norm = pd.DataFrame(x_tr_norm, columns=x_tr.columns)
x_tr_norm.head()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
0,0.405063,0.0,0.0,0.0,0.0,0.0
1,0.050633,0.0,0.0,0.0,0.0,0.0
2,0.379747,0.0,0.007916,0.0,0.051276,0.0
3,0.21519,0.00131,0.0,0.046111,0.016378,4.9e-05
4,0.329114,0.0,0.0,0.0,0.0,0.0


In [20]:
x_t_norm = pd.DataFrame(x_t_norm, columns=x_t.columns)
x_t_norm.head()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
0,0.632911,0.0,0.0,0.0,0.0,0.0
1,0.227848,0.0,0.0,0.0,0.0,0.0
2,0.189873,0.0,0.0,0.0,0.0,0.0
3,0.658228,0.0,0.0,0.0,0.0,0.0
4,0.78481,0.0,0.054775,0.0,0.07774,0.0


**Step 6: Model Selection**

- Now let's use the best model we have so far and see how much it improves when we fine-tune its hyperparameters.

### Bagging and Pasting

In [24]:
bagging_reg = BaggingClassifier(DecisionTreeClassifier(max_depth=20),
                               n_estimators=100, max_samples=1000)

In [25]:
bagging_reg.fit(x_tr_norm, y_tr)

- Evaluate your model

In [26]:
pred = bagging_reg.predict(x_t_norm)

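# scikit-learn metrics expect (y_true, y_pred). The output recorded below was
# produced with recall_score(pred, y_t), which is why its Recall equals Precision.
print("Accuracy", accuracy_score(y_t, pred))
print("Recall", recall_score(y_t, pred))
print("Precision", precision_score(y_t, pred, average="binary"))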


Accuracy 0.7806354009077155
Recall 0.7523809523809524
Precision 0.7523809523809524
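
scikit-learn's metric functions take the true labels first and the predictions second. Accuracy is unaffected by the order, but recall and precision are not: swapping the arguments turns one into the other, which would explain identical Recall and Precision values. A small illustration with made-up labels:

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels for illustration
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)   # TP / (TP + FP)
recall = recall_score(y_true, y_pred)         # TP / (TP + FN)
swapped = recall_score(y_pred, y_true)        # arguments reversed

print("precision:", precision)
print("recall:", recall)
print("recall with swapped arguments:", swapped)
```

With the arguments reversed, `recall_score` returns the same number as `precision_score` called in the correct order.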


**Grid/Random Search**

For this lab we will use Grid Search.

- Define hyperparameters to fine tune.

In [32]:
grid = {"n_estimators": [50, 100, 200,500],
        "estimator__max_leaf_nodes": [250, 500, 1000, None],
        "estimator__max_depth":[10,30,50]}

In [30]:
ada_cla = AdaBoostClassifier(DecisionTreeClassifier())

In [33]:
model = GridSearchCV(estimator = ada_cla, param_grid = grid, cv=5)
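
Before running the search, it can help to count how many models it will fit: one per hyperparameter combination per fold. A quick sketch with scikit-learn's `ParameterGrid`, repeating the grid and the `cv=5` setting from the cells above:

```python
from sklearn.model_selection import ParameterGrid

grid = {"n_estimators": [50, 100, 200, 500],
        "estimator__max_leaf_nodes": [250, 500, 1000, None],
        "estimator__max_depth": [10, 30, 50]}

n_combinations = len(ParameterGrid(grid))   # 4 * 4 * 3 combinations
n_folds = 5                                 # matches cv=5 above
print(n_combinations, "combinations,", n_combinations * n_folds, "fits")
```

GridSearchCV also refits the best combination once on the full training set by default, so the total is one fit more than the number printed.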

- Run Grid Search

In [35]:
model.fit(x_tr_norm, y_tr)

- After training, check which of the tested hyperparameter values performed best.

In [36]:
model.best_params_

{'estimator__max_depth': 10,
 'estimator__max_leaf_nodes': 1000,
 'n_estimators': 200}

- You can retrieve the best model, refit with the best parameters, through the **best_estimator_** attribute.

In [37]:
best_model = model.best_estimator_

- Evaluate your model

In [48]:
pred = best_model.predict(x_t_norm)

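# scikit-learn metrics expect (y_true, y_pred). The output recorded below was
# produced with recall_score(pred, y_t), which is why its Recall equals Precision.
print("Accuracy", accuracy_score(y_t, pred))
print("Recall", recall_score(y_t, pred))
print("Precision", precision_score(y_t, pred, average="binary"))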

Accuracy 0.7700453857791225
Recall 0.7468879668049793
Precision 0.7468879668049793


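After the grid search, accuracy and precision are slightly lower than with the bagging model (accuracy 0.770 vs. 0.781, precision 0.747 vs. 0.752). Note that the grid search tuned an AdaBoost classifier rather than the bagging model, so the comparison is between two different ensembles.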

**AI Feedback**

Remember, while accuracy is important, it's also crucial to consider other metrics such as F1-score, especially in imbalanced datasets. Continue to explore different evaluation metrics based on the problem context.
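
A minimal sketch of reporting F1 alongside the other metrics, using `f1_score` and `classification_report` from scikit-learn. The labels below are made up for illustration; in this lab they would be `y_t` and `pred`.

```python
from sklearn.metrics import classification_report, f1_score

# Made-up labels for illustration
y_true = [True, True, True, False, False, False, True, False]
y_hat = [True, False, True, False, True, False, True, False]

f1 = f1_score(y_true, y_hat)   # harmonic mean of precision and recall
print("F1", f1)

# Per-class precision, recall, F1 and support in one table
print(classification_report(y_true, y_hat))
```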