# LAB | Hyperparameter Tuning

**Load the data**

This is the final step: maximizing the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before, and then compare the best-performing models you have obtained so far, this time fine-tuning their hyperparameters.
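To see which hyperparameters an estimator exposes and what their default values are, `get_params()` can be called on an unfitted instance. The following is a minimal sketch, using the bagging classifier that appears later in this lab; the exact defaults may differ between scikit-learn versions.

```python
from sklearn.ensemble import BaggingClassifier

# An unfitted estimator already carries its default hyperparameters.
default_params = BaggingClassifier().get_params()

# Print only the two hyperparameters that are tuned later in this lab.
for name in ("n_estimators", "max_samples"):
    print(f"{name} = {default_params[name]}")
```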

In [24]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [25]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [26]:
#your code here
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler

cat_cols = ["HomePlanet", "CryoSleep", "Cabin", "Destination", "VIP"]
num_cols = ["Age", "RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

# Drop rows with missing values, keep only the first character of Cabin, remove identifier columns
spaceship.dropna(inplace=True)
spaceship["Cabin"] = spaceship["Cabin"].apply(lambda x: x[0])
spaceship.drop(columns=["PassengerId", "Name"], inplace=True)

# Dummies: one-hot encode the categorical columns
ohe = OneHotEncoder(sparse_output=False)
df = ohe.fit_transform(spaceship[cat_cols])
df_total = pd.DataFrame(df, columns=ohe.get_feature_names_out(), index=spaceship.index)
spaceship_df = pd.concat([spaceship, df_total], axis=1)
spaceship_df.drop(columns=cat_cols, inplace=True)

# Normalization: scale the numeric columns to the [0, 1] range
normalizer = MinMaxScaler()
train_norm = normalizer.fit_transform(spaceship_df[num_cols])
train_norm_df = pd.DataFrame(train_norm, columns=num_cols, index=spaceship_df.index)
final_df = pd.concat([spaceship_df, train_norm_df], axis=1)
# Caution: the concat leaves two columns per numeric name, so this drop removes the raw
# AND the scaled copies; final_df keeps only the target and the one-hot columns.
final_df.drop(columns=num_cols, inplace=True)
final_df

Unnamed: 0,Transported,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_False,CryoSleep_True,Cabin_A,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_False,VIP_True
0,False,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
1,True,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
2,False,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
3,False,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
4,True,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,False,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
8689,False,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
8690,True,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0
8691,False,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
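The table above contains no `Age`, `RoomService`, or other numeric columns. A likely cause is that concatenating the scaled frame next to the original one produces duplicate column names, and `drop(columns=...)` then removes every column with a matching name. The sketch below reproduces this on a toy frame (the values and the `toy` name are hypothetical) and shows one possible alternative, assigning the scaled values in place.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame standing in for spaceship_df (hypothetical values).
toy = pd.DataFrame({"Age": [10.0, 20.0, 30.0], "Cabin_A": [1.0, 0.0, 0.0]})
num_cols = ["Age"]

scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(toy[num_cols]), columns=num_cols, index=toy.index
)

# concat keeps both "Age" columns; drop(columns=...) then removes every match.
dropped = pd.concat([toy, scaled], axis=1).drop(columns=num_cols)
print(dropped.columns.tolist())

# Assigning in place keeps a single, scaled "Age" column.
kept = toy.copy()
kept[num_cols] = scaled
print(kept.columns.tolist())
```

With the in-place variant the scaled numeric features stay available to the model, which would change the scores reported in the rest of this notebook.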


- Now let's use the best model we got so far to see how much it improves when we fine-tune its hyperparameters.

In [27]:
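#your code here
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import mean_absolute_error, mean_squared_error, root_mean_squared_error

# Split features and target. X_train etc. were not defined in this notebook;
# test_size and random_state are assumptions, not the original split.
X = final_df.drop(columns="Transported")
y = final_df["Transported"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

# Bagging of 100 depth-limited trees, each trained on 1000 sampled rows
bagging_clas = BaggingClassifier(DecisionTreeClassifier(max_depth=20), n_estimators=100, max_samples=1000)
bagging_clas.fit(X_train, y_train)
y_pred_test_bag = bagging_clas.predict(X_test)
final_df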



- Evaluate your model

In [29]:
#your code here
# Targets are boolean, so MAE and MSE both equal the misclassification rate (1 - accuracy)
print(f"MAE {mean_absolute_error(y_test, y_pred_test_bag): .2f}")
print(f"MSE {mean_squared_error(y_test, y_pred_test_bag): .2f}")
print(f"RMSE {root_mean_squared_error(y_test, y_pred_test_bag): .2f}")
# score() on a classifier returns mean accuracy, not R2
print(f"Accuracy {bagging_clas.score(X_test, y_test): .2f}")

MAE  0.27
MSE  0.27
RMSE  0.52
Accuracy  0.73
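Since `Transported` is a boolean target, the `score` method of a classifier returns mean accuracy, and MAE on 0/1 labels is simply the share of wrong predictions. That is why 0.27 and 0.73 add up to 1 above. The sketch below illustrates this with classification metrics on a handful of made-up labels (the `y_true` and `y_hat` arrays are hypothetical, not taken from the dataset).

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, mean_absolute_error

# Hypothetical labels and predictions, only for illustration.
y_true = np.array([True, False, True, True, False])
y_hat = np.array([True, True, True, False, False])

acc = accuracy_score(y_true, y_hat)
# Cast to int so the regression metric treats the labels as 0/1.
mae = mean_absolute_error(y_true.astype(int), y_hat.astype(int))

print(f"Accuracy: {acc:.2f}")
print(f"MAE: {mae:.2f}")
# Rows are true classes, columns are predicted classes (False first, then True).
print(confusion_matrix(y_true, y_hat))
```

For a classification task, accuracy and the confusion matrix are usually easier to interpret than the regression metrics.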


**Grid/Random Search**

For this lab we will use Grid Search.
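For comparison, random search samples a fixed number of combinations from distributions instead of trying every point of a grid. The following is a rough sketch of that alternative on synthetic data; the data set, the sampling ranges, and the names `X_demo`, `y_demo`, and `rs` are illustrative assumptions and not part of the lab.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data; in the lab this would be X_train / y_train.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Distributions to sample from (ranges are illustrative assumptions).
param_distributions = {
    "n_estimators": randint(10, 50),
    "max_samples": randint(50, 200),
}

rs = RandomizedSearchCV(
    BaggingClassifier(random_state=123),
    param_distributions=param_distributions,
    n_iter=5,  # number of sampled combinations
    cv=3,
    random_state=123,
)
rs.fit(X_demo, y_demo)
print(rs.best_params_, f"{rs.best_score_:.3f}")
```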

- Define hyperparameters to fine tune.

In [30]:
#your code here
parameter_grid = {"n_estimators": [50, 250],
                  "max_samples": [500,1500]}

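Grid search evaluates every combination in the grid, so the cost grows with the product of the list lengths. As a quick check, `ParameterGrid` can enumerate the candidates for the grid defined above; with the 10 folds used in the next cell, this gives the total number of model fits.

```python
from sklearn.model_selection import ParameterGrid

parameter_grid = {"n_estimators": [50, 250],
                  "max_samples": [500, 1500]}

# Every combination of the listed values is one candidate.
candidates = list(ParameterGrid(parameter_grid))
for params in candidates:
    print(params)

folds = 10  # number of cross-validation folds used in the next cell
print(f"{len(candidates)} candidates x {folds} folds = {len(candidates) * folds} fits")
```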
- Run Grid Search

In [33]:
from sklearn.model_selection import GridSearchCV
import time
import scipy.stats as st

bagging = BaggingClassifier(random_state=123)
confidence_level = 0.95
folds = 10

# Create an instance of the GridSearchCV class; "cv" sets the number of folds.
# For a classifier the default scoring is accuracy.
gs = GridSearchCV(bagging, param_grid=parameter_grid, cv=folds, verbose=10)

start_time = time.time()
gs.fit(X_train, y_train)
end_time = time.time()

print("\n")
print(f"Time taken to find the best combination of hyperparameters among the given ones: {end_time - start_time: .4f} seconds")
print("\n")

print(f"The best combination of hyperparameters has been: {gs.best_params_}")
print(f"The mean cross-validated accuracy is: {gs.best_score_: .4f}")

results_gs_df = pd.DataFrame(gs.cv_results_).sort_values(by="mean_test_score", ascending=False)

# Mean and standard error of the fold scores for the best candidate
gs_mean_score = results_gs_df.iloc[0]["mean_test_score"]
gs_sem = results_gs_df.iloc[0]["std_test_score"] / np.sqrt(folds)

# Two-sided t-based confidence interval with folds - 1 degrees of freedom
gs_tc = st.t.ppf(1 - ((1 - confidence_level) / 2), df=folds - 1)
gs_lower_bound = gs_mean_score - (gs_tc * gs_sem)
gs_upper_bound = gs_mean_score + (gs_tc * gs_sem)

print(f"The accuracy confidence interval for the best combination of hyperparameters is: \
    ({gs_lower_bound: .4f}, {gs_mean_score: .4f}, {gs_upper_bound: .4f}) ")

# Let's store the best model
best_model = gs.best_estimator_

# Now it is time to evaluate the model on the test set
y_pred_test_df = best_model.predict(X_test)

print("\n")
print(f"Test MAE: {mean_absolute_error(y_test, y_pred_test_df): .4f}")
print(f"Test MSE: {mean_squared_error(y_test, y_pred_test_df): .4f}")
print(f"Test RMSE: {root_mean_squared_error(y_test, y_pred_test_df): .4f}")
print(f"Test accuracy:  {best_model.score(X_test, y_test): .4f}")
print("\n")


Fitting 10 folds for each of 4 candidates, totalling 40 fits
[CV 1/10; 1/4] START max_samples=500, n_estimators=50...........................
[CV 1/10; 1/4] END max_samples=500, n_estimators=50;, score=0.682 total time=   0.1s
[CV 2/10; 1/4] START max_samples=500, n_estimators=50...........................
[CV 2/10; 1/4] END max_samples=500, n_estimators=50;, score=0.724 total time=   0.1s
[CV 3/10; 1/4] START max_samples=500, n_estimators=50...........................
[CV 3/10; 1/4] END max_samples=500, n_estimators=50;, score=0.749 total time=   0.1s
[CV 4/10; 1/4] START max_samples=500, n_estimators=50...........................
[CV 4/10; 1/4] END max_samples=500, n_estimators=50;, score=0.730 total time=   0.1s
[CV 5/10; 1/4] START max_samples=500, n_estimators=50...........................
[CV 5/10; 1/4] END max_samples=500, n_estimators=50;, score=0.725 total time=   0.1s
[CV 6/10; 1/4] START max_samples=500, n_estimators=50...........................
[CV 6/10; 1/4] END max_sampl
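The confidence interval computed in the cell above can also be derived directly from per-fold scores. The sketch below uses only the five fold scores of the first candidate that are visible in the log, so it is a partial illustration and not the interval for the full 10-fold run. It uses the sample standard deviation (`ddof=1`), whereas `std_test_score` in `cv_results_` appears to be a population standard deviation, so the two intervals can differ slightly.

```python
import numpy as np
import scipy.stats as st

# First five fold scores of the first candidate, copied from the log above
# (the remaining folds are not visible, so this is only a partial illustration).
fold_scores = np.array([0.682, 0.724, 0.749, 0.730, 0.725])

n = len(fold_scores)
mean_score = fold_scores.mean()
# ddof=1 gives the sample standard deviation, as the t-interval assumes.
sem = fold_scores.std(ddof=1) / np.sqrt(n)

confidence_level = 0.95
t_crit = st.t.ppf(1 - (1 - confidence_level) / 2, df=n - 1)
lower, upper = mean_score - t_crit * sem, mean_score + t_crit * sem

print(f"mean={mean_score:.4f}, 95% CI=({lower:.4f}, {upper:.4f})")
```

Fold scores from cross-validation are not fully independent, so an interval of this kind is best read as a rough guide.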

- Evaluate your model

In [None]:
# Conclusion: grid search did not produce a higher score or a lower RMSE than the initial bagging model, which suggests the initial hyperparameters were already a reasonable choice.