# LAB | Hyperparameter Tuning

**Load the data**

This is the final step: maximizing the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before, and then compare the best-performing models you have so far, this time fine-tuning their hyperparameters.
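To see what "default values" means in practice, the short sketch below lists a few of the defaults scikit-learn assigns when no hyperparameters are passed. It only inspects freshly constructed estimators; the variable names are illustrative.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# get_params() returns every hyperparameter of an estimator with its current value
tree_defaults = DecisionTreeClassifier().get_params()
ada_defaults = AdaBoostClassifier().get_params()

# A default tree is unconstrained: no depth limit and no cap on leaf nodes
print("tree max_depth:", tree_defaults["max_depth"])
print("tree max_leaf_nodes:", tree_defaults["max_leaf_nodes"])

# AdaBoost defaults to 50 boosting rounds with a learning rate of 1.0
print("ada n_estimators:", ada_defaults["n_estimators"])
print("ada learning_rate:", ada_defaults["learning_rate"])
```

These are the values that hyperparameter tuning replaces with ones chosen by cross-validation.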

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# import matplotlib.pyplot as plt
# import seaborn as sns

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.metrics import precision_score, accuracy_score, recall_score, f1_score

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same steps as before:
- Feature Scaling
- Feature Selection


In [3]:
# drop every row that has at least one missing value, then rebuild the index
spaceship = spaceship.dropna().reset_index(drop=True)

# keep only the deck letter from Cabin (e.g. 'B/0/P' -> 'B')
spaceship['Cabin'] = spaceship['Cabin'].str.split('/').str[0]

# split into features and target
features = spaceship.drop(['PassengerId', 'Name','Transported'], axis=1)
target = spaceship.Transported

# get one hot encode vars
def encode_via_one_hot(features_df):
    features_cat = features_df.select_dtypes(include='object')
    print(features_cat)

    encoder = OneHotEncoder(drop= 'if_binary').fit(features_cat)
    cols = encoder.get_feature_names_out(input_features=features_cat.columns)
    spaceship_encode = pd.DataFrame(encoder.transform(features_cat).toarray(),columns=cols)
    spaceship_encode.reset_index(drop=True, inplace=True)
    return spaceship_encode

# rebuild the feature table with one-hot encoding for the categorical features
cat_features_one_hot = encode_via_one_hot(features)
num_features = features.select_dtypes(include='number')

features = pd.concat([num_features, cat_features_one_hot],axis=1)

# scale feature data
def scale_data(feature_data, normalizer):
    normalizer.fit(feature_data)

    norm_arr = normalizer.transform(feature_data)

    feature_data_norm = pd.DataFrame(norm_arr, columns = feature_data.columns)

    return feature_data_norm

scaler = MinMaxScaler()
features = scale_data(features, normalizer=scaler)

     HomePlanet CryoSleep Cabin    Destination    VIP
0        Europa     False     B    TRAPPIST-1e  False
1         Earth     False     F    TRAPPIST-1e  False
2        Europa     False     A    TRAPPIST-1e   True
3        Europa     False     A    TRAPPIST-1e  False
4         Earth     False     F    TRAPPIST-1e  False
...         ...       ...   ...            ...    ...
6601     Europa     False     A    55 Cancri e   True
6602      Earth      True     G  PSO J318.5-22  False
6603      Earth     False     G    TRAPPIST-1e  False
6604     Europa     False     E    55 Cancri e  False
6605     Europa     False     E    TRAPPIST-1e  False

[6606 rows x 5 columns]


- Now let's take the best model we have so far and see how much it improves when we fine-tune its hyperparameters.

In [4]:
#your code here
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=0)
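Note that the scaling step earlier fits `MinMaxScaler` on the full dataset before the split, so the test rows influence the scaling ranges. A common alternative, sketched here on synthetic data (the arrays and the `demo_` names are assumptions for illustration, not the lab's data), is to split first and fit the scaler on the training portion only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the feature table and target (illustrative only)
rng = np.random.default_rng(0)
demo_X = rng.normal(size=(100, 3))
demo_y = rng.integers(0, 2, size=100)

# Split first, so the test rows stay unseen during preprocessing
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.20, random_state=0
)

# Fit the scaler on the training rows only, then apply it to both splits
demo_scaler = MinMaxScaler().fit(demo_X_train)
demo_X_train_scaled = demo_scaler.transform(demo_X_train)
demo_X_test_scaled = demo_scaler.transform(demo_X_test)

# Training columns span [0, 1]; test values may fall slightly outside that range
print(demo_X_train_scaled.min(), demo_X_train_scaled.max())
print(demo_X_test_scaled.shape)
```

Wrapping the scaler and the classifier in a `sklearn.pipeline.Pipeline` would apply the same idea inside each cross-validation fold.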

- Evaluate your model

In [5]:
#your code here
# for model evaluation
def eval_model(model, test_data, test_labels):
    """Print classification metrics for a fitted model on held-out data."""
    pred = model.predict(test_data)

    # format each score inside the f-string (the old form printed a set, e.g. {0.78})
    print(f"Precision: {precision_score(test_labels, pred, average='binary'):.4f}")
    print(f"Accuracy: {accuracy_score(test_labels, pred):.4f}")
    print(f"Recall: {recall_score(test_labels, pred):.4f}")
    print(f"F1: {f1_score(test_labels, pred):.4f}")

**Grid/Random Search**

For this lab we will use Grid Search.

- Define the hyperparameters to fine-tune.

In [6]:
#your code here
# keys prefixed with "estimator__" are forwarded to the DecisionTreeClassifier inside AdaBoost
grid = {"n_estimators": [50, 100, 200, 500],
        "estimator__max_leaf_nodes": [250, 500],  # wider option: [250, 500, 1000, None]
        "estimator__max_depth": [10, 30, 50]}
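Before launching the search it helps to know how much work it implies. The sketch below counts the candidates in the grid above with `ParameterGrid`; the 5 folds are an assumption matching the `cv=5` used when the search is run.

```python
from sklearn.model_selection import ParameterGrid

grid = {"n_estimators": [50, 100, 200, 500],
        "estimator__max_leaf_nodes": [250, 500],
        "estimator__max_depth": [10, 30, 50]}

# Grid search tries the Cartesian product of all value lists: 4 * 2 * 3
n_candidates = len(ParameterGrid(grid))

# Each candidate is trained once per cross-validation fold
cv_folds = 5
total_fits = n_candidates * cv_folds

print(n_candidates, total_fits)  # 24 candidates, 120 fits
```

With the default `refit=True`, `GridSearchCV` adds one more fit of the best candidate on the whole training set.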

- Run Grid Search

In [7]:
import warnings

# Ignore all warnings
warnings.filterwarnings('ignore')

In [None]:
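# AdaBoost with a decision tree as base learner; the grid tunes both levels
ada_class = AdaBoostClassifier(DecisionTreeClassifier())

# 5-fold cross-validated search over every combination in the grid
model = GridSearchCV(estimator = ada_class, param_grid = grid, cv=5)
model.fit(X_train, y_train)

# best configuration, refit on the full training set (refit=True by default)
best_model = model.best_estimator_
best_model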

- Evaluate your model

In [None]:
# evaluate the refit best estimator on the held-out test set
eval_model(best_model, X_test, y_test)