# LAB | Hyperparameter Tuning

**Load the data**

This is the final step in maximizing the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

The metadata can be found here:

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before, then take the best-performing model you have obtained so far and fine-tune its hyperparameters.
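To make "default values" concrete, the short sketch below lists the hyperparameters that a scikit-learn estimator exposes. `KNeighborsClassifier` is used only as an illustrative assumption here; any estimator offers the same `get_params()` method.

```python
from sklearn.neighbors import KNeighborsClassifier

# An estimator created without arguments uses scikit-learn's default hyperparameters.
default_knn = KNeighborsClassifier()
default_params = default_knn.get_params()

# Print every hyperparameter name together with its default value.
for name, value in sorted(default_params.items()):
    print(f"{name}: {value}")
```

Each of these names is a candidate for tuning; the keys of a search grid must match them exactly.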

In [21]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [22]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same steps as before:
- Feature Scaling
- Feature Selection


In [25]:
spaceship.dropna(axis=0, inplace=True)
spaceship.isna().sum()

PassengerId     0
HomePlanet      0
CryoSleep       0
Cabin           0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
Name            0
Transported     0
dtype: int64

In [27]:
spaceship['New_cabin'] = spaceship['Cabin'].str[0]
valid_letters = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'T']
spaceship['New_cabin'] = spaceship['New_cabin'].apply(lambda x: x if x in valid_letters else None)
spaceship['New_cabin']

0       B
1       F
2       A
3       A
4       F
       ..
8688    A
8689    G
8690    G
8691    E
8692    E
Name: New_cabin, Length: 6606, dtype: object

In [29]:
new_spaceship=spaceship.drop(columns=['PassengerId','Name','Cabin'])
new_spaceship

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,New_cabin
0,Europa,False,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False,B
1,Earth,False,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True,F
2,Europa,False,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False,A
3,Europa,False,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False,A
4,Earth,False,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True,F
...,...,...,...,...,...,...,...,...,...,...,...,...
8688,Europa,False,55 Cancri e,41.0,True,0.0,6819.0,0.0,1643.0,74.0,False,A
8689,Earth,True,PSO J318.5-22,18.0,False,0.0,0.0,0.0,0.0,0.0,False,G
8690,Earth,False,TRAPPIST-1e,26.0,False,0.0,0.0,1872.0,1.0,0.0,True,G
8691,Europa,False,55 Cancri e,32.0,False,0.0,1049.0,0.0,353.0,3235.0,False,E


In [31]:
categorical_cols = new_spaceship.select_dtypes(include=['object']).columns
categorical_cols

Index(['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'New_cabin'], dtype='object')

In [33]:
df_dummies = pd.get_dummies(new_spaceship, columns=categorical_cols,dtype=int)
df_dummies

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,...,VIP_False,VIP_True,New_cabin_A,New_cabin_B,New_cabin_C,New_cabin_D,New_cabin_E,New_cabin_F,New_cabin_G,New_cabin_T
0,39.0,0.0,0.0,0.0,0.0,0.0,False,0,1,0,...,1,0,0,1,0,0,0,0,0,0
1,24.0,109.0,9.0,25.0,549.0,44.0,True,1,0,0,...,1,0,0,0,0,0,0,1,0,0
2,58.0,43.0,3576.0,0.0,6715.0,49.0,False,0,1,0,...,0,1,1,0,0,0,0,0,0,0
3,33.0,0.0,1283.0,371.0,3329.0,193.0,False,0,1,0,...,1,0,1,0,0,0,0,0,0,0
4,16.0,303.0,70.0,151.0,565.0,2.0,True,1,0,0,...,1,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,41.0,0.0,6819.0,0.0,1643.0,74.0,False,0,1,0,...,0,1,1,0,0,0,0,0,0,0
8689,18.0,0.0,0.0,0.0,0.0,0.0,False,1,0,0,...,1,0,0,0,0,0,0,0,1,0
8690,26.0,0.0,0.0,1872.0,1.0,0.0,True,1,0,0,...,1,0,0,0,0,0,0,0,1,0
8691,32.0,0.0,1049.0,0.0,353.0,3235.0,False,0,1,0,...,1,0,0,0,0,0,1,0,0,0


In [35]:
features=df_dummies.drop(columns=['Transported'])

In [37]:
target=new_spaceship['Transported']
target

0       False
1        True
2       False
3       False
4        True
        ...  
8688    False
8689    False
8690     True
8691    False
8692     True
Name: Transported, Length: 6606, dtype: bool

In [43]:
from sklearn.model_selection import train_test_split

# Split the data into train (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=0)

In [48]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
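The cell above fits the scaler on the training set only and then reuses it for the test set. The sketch below illustrates the effect on toy data (the arrays and their names are hypothetical, not the Spaceship Titanic features): the training columns end up with mean 0 and standard deviation 1, while the test columns are only approximately standardized because they reuse the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy numeric data standing in for real features (assumption for illustration).
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
toy_test = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

# Learn mean and standard deviation from the training data only.
toy_scaler = StandardScaler().fit(toy_train)
toy_train_scaled = toy_scaler.transform(toy_train)
toy_test_scaled = toy_scaler.transform(toy_test)

# Training columns are exactly standardized; test columns only approximately.
print("train mean:", toy_train_scaled.mean(axis=0).round(3))
print("train std: ", toy_train_scaled.std(axis=0).round(3))
print("test mean: ", toy_test_scaled.mean(axis=0).round(3))
```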

In [50]:
X_train_scaled = pd.DataFrame(X_train_scaled, columns = X_train.columns)
X_train_scaled.head()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_False,...,VIP_False,VIP_True,New_cabin_A,New_cabin_B,New_cabin_C,New_cabin_D,New_cabin_E,New_cabin_F,New_cabin_G,New_cabin_T
0,0.220515,-0.347046,-0.282099,-0.305892,-0.271543,-0.271123,-1.081675,-0.583761,1.959293,-1.353973,...,0.163149,-0.163149,-0.183429,-0.322931,-0.31319,-0.248362,2.954847,-0.69021,-0.6543,-0.019459
1,-1.704525,-0.347046,-0.282099,-0.305892,-0.271543,-0.271123,0.924492,-0.583761,-0.510388,-1.353973,...,0.163149,-0.163149,-0.183429,-0.322931,-0.31319,-0.248362,-0.338427,-0.69021,1.528352,-0.019459
2,0.083012,-0.347046,-0.143345,-0.305892,0.71807,-0.271123,0.924492,-0.583761,-0.510388,0.738567,...,0.163149,-0.163149,-0.183429,-0.322931,-0.31319,-0.248362,-0.338427,1.448834,-0.6543,-0.019459
3,-0.810757,-0.326846,-0.282099,0.6542,0.044548,-0.270259,-1.081675,-0.583761,1.959293,0.738567,...,0.163149,-0.163149,-0.183429,-0.322931,-0.31319,-0.248362,2.954847,-0.69021,-0.6543,-0.019459
4,-0.191994,-0.347046,-0.282099,-0.305892,-0.271543,-0.271123,0.924492,-0.583761,-0.510388,-1.353973,...,0.163149,-0.163149,-0.183429,-0.322931,-0.31319,-0.248362,-0.338427,-0.69021,1.528352,-0.019459


In [52]:
X_test_scaled = pd.DataFrame(X_test_scaled, columns = X_test.columns)
X_test_scaled.head()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_False,...,VIP_False,VIP_True,New_cabin_A,New_cabin_B,New_cabin_C,New_cabin_D,New_cabin_E,New_cabin_F,New_cabin_G,New_cabin_T
0,1.45804,-0.347046,-0.282099,-0.305892,-0.271543,-0.271123,-1.081675,-0.583761,1.959293,-1.353973,...,0.163149,-0.163149,-0.183429,-0.322931,-0.31319,-0.248362,-0.338427,1.448834,-0.6543,-0.019459
1,-0.742005,-0.347046,-0.282099,-0.305892,-0.271543,-0.271123,0.924492,-0.583761,-0.510388,-1.353973,...,0.163149,-0.163149,-0.183429,-0.322931,-0.31319,-0.248362,-0.338427,-0.69021,1.528352,-0.019459
2,-0.94826,-0.347046,-0.282099,-0.305892,-0.271543,-0.271123,0.924492,-0.583761,-0.510388,-1.353973,...,0.163149,-0.163149,-0.183429,-0.322931,-0.31319,-0.248362,-0.338427,-0.69021,1.528352,-0.019459
3,1.595543,-0.347046,-0.282099,-0.305892,-0.271543,-0.271123,0.924492,-0.583761,-0.510388,-1.353973,...,0.163149,-0.163149,-0.183429,-0.322931,-0.31319,-0.248362,-0.338427,-0.69021,1.528352,-0.019459
4,2.283058,-0.347046,0.678012,-0.305892,1.228811,-0.271123,-1.081675,1.713031,-0.510388,0.738567,...,-6.129384,6.129384,-0.183429,3.09664,-0.31319,-0.248362,-0.338427,-0.69021,-0.6543,-0.019459


- Now let's take the best model we have obtained so far and see how much it improves when we fine-tune its hyperparameters.

In [56]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=15)

In [58]:
knn.fit(X_train_scaled, y_train)

In [60]:
pred = knn.predict(X_test_scaled)
pred

array([ True,  True,  True, ...,  True,  True,  True])

In [71]:
# Cast the boolean target to 0/1 so that numeric metrics can be computed on it
y_test = y_test.astype(int)

In [73]:
knn.score(X_test_scaled, y_test)

0.783661119515885
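The baseline above fixes `n_neighbors=15`. As a minimal sketch of what tuning this classifier could look like, the example below runs a grid search over a few KNN hyperparameters. It uses synthetic data from `make_classification` as a stand-in for the lab's features, and the grid values and variable names are assumptions chosen for illustration, not settings from the lab.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled Spaceship Titanic features (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Putting the scaler in a pipeline refits it inside every cross-validation fold.
knn_pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])

# Illustrative grid; "knn__" routes each key to the pipeline step named "knn".
knn_grid = {
    "knn__n_neighbors": [5, 15, 25],
    "knn__weights": ["uniform", "distance"],
    "knn__p": [1, 2],
}

knn_search = GridSearchCV(knn_pipe, param_grid=knn_grid, cv=5, scoring="accuracy")
knn_search.fit(Xd_train, yd_train)

print("Best parameters:", knn_search.best_params_)
print("Best CV accuracy:", knn_search.best_score_)
print("Test accuracy:", knn_search.score(Xd_test, yd_test))
```

On the real data, `X_train` and `y_train` would take the place of the synthetic arrays, and the unscaled features can be passed in because the pipeline handles scaling.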

- Evaluate your model

In [77]:
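from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

pred = knn.predict(X_test_scaled)

# With 0/1 labels, MAE is the misclassification rate and RMSE is its square root.
# Metric functions take (y_true, y_pred); np.sqrt is used because the `squared`
# argument is deprecated in recent scikit-learn releases.
print("MAE", mean_absolute_error(y_test, pred))
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
# For a classifier, .score() returns the mean accuracy, not R2.
print("Accuracy", knn.score(X_test_scaled, y_test))

MAE 0.21633888048411498
RMSE 0.4651224360145563
Accuracy 0.783661119515885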
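Note that the MAE and the `knn.score` value printed above sum to 1: on 0/1 labels the MAE is simply the share of wrong predictions. The sketch below checks this on a small hand-made example (the toy arrays are hypothetical) and shows metrics that are more commonly used for classifiers.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, mean_absolute_error

# Hand-made labels and predictions; 2 of the 8 predictions are wrong.
toy_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
toy_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

toy_acc = accuracy_score(toy_true, toy_pred)
toy_mae = mean_absolute_error(toy_true, toy_pred)
# Rows are true classes, columns are predicted classes.
toy_cm = confusion_matrix(toy_true, toy_pred)

print("Accuracy:", toy_acc)
print("MAE:", toy_mae)
print("Accuracy + MAE:", toy_acc + toy_mae)
print(toy_cm)
```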


**Grid/Random Search**

For this lab we will use Grid Search.
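For comparison, random search samples a fixed number of combinations instead of trying all of them. The following is a minimal sketch on synthetic data; the decision tree, the search space, and all variable names are assumptions for illustration and are not part of the lab's solution.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the lab's features (assumption).
X_rs, y_rs = make_classification(n_samples=300, n_features=8, random_state=0)

# 5 * 4 * 4 = 80 possible combinations in total.
rs_space = {
    "max_depth": [3, 5, 10, 30, None],
    "max_leaf_nodes": [10, 50, 250, None],
    "min_samples_leaf": [1, 2, 5, 10],
}

# n_iter controls how many of the 80 combinations are actually evaluated.
random_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=rs_space,
    n_iter=10,
    cv=5,
    random_state=0,
)
random_search.fit(X_rs, y_rs)

print("Combinations tried:", len(random_search.cv_results_["params"]))
print("Best parameters:", random_search.best_params_)
```

Random search trades exhaustiveness for speed, which tends to matter more as the grid grows.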

- Define hyperparameters to fine tune.

In [90]:
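# Keys prefixed with "estimator__" are passed to the base DecisionTreeRegressor;
# "n_estimators" belongs to AdaBoostRegressor itself.
grid = {"n_estimators": [50, 100, 500],
        "estimator__max_leaf_nodes": [250, 500, 1000, None],
        "estimator__max_depth": [10, 30, 50]}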

- Run Grid Search

In [98]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import GridSearchCV

# AdaBoost over decision trees. This is a regressor, so it predicts
# continuous values for the 0/1 target rather than class labels.
ada_reg = AdaBoostRegressor(DecisionTreeRegressor())

In [100]:
model = GridSearchCV(estimator = ada_reg, param_grid = grid, cv=5, verbose=4, n_jobs=-1)

In [104]:
model.fit(X_train_scaled, y_train)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
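The numbers in this log follow directly from the grid: 3 × 4 × 3 = 36 combinations, each fitted once per fold. A quick way to check the size of a grid before launching a long search is sketched below with `ParameterGrid`; `demo_grid` simply repeats the grid defined earlier so that the snippet runs on its own.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid defined earlier, repeated here for a standalone check.
demo_grid = {
    "n_estimators": [50, 100, 500],
    "estimator__max_leaf_nodes": [250, 500, 1000, None],
    "estimator__max_depth": [10, 30, 50],
}

n_candidates = len(ParameterGrid(demo_grid))  # every combination of the lists
n_folds = 5                                   # matches cv=5 in GridSearchCV
print(n_candidates, n_candidates * n_folds)   # prints: 36 180
```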


In [106]:
model.best_params_

{'estimator__max_depth': 10,
 'estimator__max_leaf_nodes': 1000,
 'n_estimators': 50}

In [108]:
best_model = model.best_estimator_

- Evaluate your model

In [115]:
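pred = best_model.predict(X_test_scaled)

# Metric functions take (y_true, y_pred); np.sqrt is used because the `squared`
# argument is deprecated in recent scikit-learn releases.
print("MAE", mean_absolute_error(y_test, pred))
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
# For a regressor, .score() returns R2.
print("R2 score", best_model.score(X_test_scaled, y_test))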

MAE 0.33737052035113213
RMSE 0.40777275612631414
R2 score 0.33488551744459816
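The R2 of 0.33 reported here is not directly comparable with the 0.78 obtained from `knn.score`, because the regressor outputs continuous values while the KNN score is an accuracy. One possible way to put both on the same scale, sketched below on hypothetical toy values, is to threshold the continuous predictions and compute accuracy; the 0.5 cut-off is an assumption, not something prescribed by the lab.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical continuous predictions and the matching 0/1 labels.
toy_scores = np.array([0.91, 0.12, 0.55, 0.40, 0.73, 0.08])
toy_labels = np.array([1, 0, 1, 1, 1, 0])

# Turn scores into class labels with an assumed 0.5 threshold.
toy_classes = (toy_scores >= 0.5).astype(int)
toy_accuracy = accuracy_score(toy_labels, toy_classes)

print("Predicted classes:", toy_classes)
print("Accuracy:", toy_accuracy)
```

With the lab's objects, the same idea would be applied to `best_model.predict(X_test_scaled)` and `y_test`.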


