# LAB | Hyperparameter Tuning

**Load the data**

This is the final step: maximizing the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before, and then compare the best-performing models you have so far, this time fine-tuning their hyperparameters.

In [2]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [3]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


Now perform the same steps as before:
- Feature Scaling
- Feature Selection


In [5]:
# Perform feature scaling and feature selection
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# Handling missing values
spaceship = spaceship.apply(lambda col: col.fillna(col.median()) if col.dtype in ['int64', 'float64'] else col.fillna(col.mode()[0]))

# Split the data into X and y
X = spaceship.drop(columns = ['Transported'])
y = spaceship['Transported']

# Split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# Without encoding, scikit-learn raises: could not convert string to float: 'Earth'
# The error occurs because the column 'HomePlanet' holds strings and must be converted to numbers
# Convert the column 'HomePlanet' to integer category codes
X_train['HomePlanet'] = X_train['HomePlanet'].astype('category').cat.codes
X_test['HomePlanet'] = X_test['HomePlanet'].astype('category').cat.codes

# could not convert string to float: 'F/575/P'
# The error is because the column 'Cabin' is a string and we need to convert it to a number
# Convert the column 'Cabin' to a number
X_train['Cabin'] = X_train['Cabin'].astype('category').cat.codes
X_test['Cabin'] = X_test['Cabin'].astype('category').cat.codes

# Convert the column 'Destination' to a number
X_train['Destination'] = X_train['Destination'].astype('category').cat.codes
X_test['Destination'] = X_test['Destination'].astype('category').cat.codes

# Convert the column 'Name' to a number
X_train['Name'] = X_train['Name'].astype('category').cat.codes
X_test['Name'] = X_test['Name'].astype('category').cat.codes



# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Feature selection
selector = SelectKBest(f_classif, k = 3)
X_train_selected = selector.fit_transform(X_train_scaled, y_train)
X_test_selected = selector.transform(X_test_scaled)

# Display the selected feature indices
selected_features = selector.get_support(indices=True)
print("Selected feature indices:", selected_features)

# Optionally, if you want to see the names of the selected features
selected_feature_names = X.columns[selected_features]
print("Selected feature names:", selected_feature_names)

# Check the shapes of the resulting datasets
print("X_train_selected shape:", X_train_selected.shape)
print("X_test_selected shape:", X_test_selected.shape)




Selected feature indices: [ 2  7 10]
Selected feature names: Index(['CryoSleep', 'RoomService', 'Spa'], dtype='object')
X_train_selected shape: (6954, 3)
X_test_selected shape: (1739, 3)
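
One caveat with the encoding used above: calling `.astype('category').cat.codes` separately on the train and test splits builds a separate mapping for each split, so the same string can receive different integers in each. A minimal sketch of one way to keep the mapping consistent, assuming scikit-learn's `OrdinalEncoder` is acceptable for this lab; the toy frames and the `"Venus"` value are illustrative only and not part of the dataset:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-ins for the train/test splits (values are illustrative only)
train = pd.DataFrame({"HomePlanet": ["Earth", "Europa", "Mars", "Earth"]})
test = pd.DataFrame({"HomePlanet": ["Mars", "Earth", "Venus"]})  # "Venus" never appears in train

# Learn the string -> integer mapping on the training split only;
# categories unseen during fit are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(train[["HomePlanet"]])
test_codes = encoder.transform(test[["HomePlanet"]])

print("Categories learned from train:", encoder.categories_[0])
print("Train codes:", train_codes.ravel())
print("Test codes:", test_codes.ravel())
```

Here "Mars" maps to the same code in both splits because the test split reuses the mapping fitted on the training split.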




In [6]:
# Perform Train and Test split
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Handling missing values
spaceship = spaceship.apply(lambda col: col.fillna(col.median()) if col.dtype in ['int64', 'float64'] else col.fillna(col.mode()[0]))

# Split the data into X and y
X = spaceship.drop(columns = ['Transported'])
y = spaceship['Transported']

# Split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# Without encoding, scikit-learn raises: could not convert string to float: 'Earth'
# The error occurs because the column 'HomePlanet' holds strings and must be converted to numbers
# Convert the column 'HomePlanet' to integer category codes
X_train['HomePlanet'] = X_train['HomePlanet'].astype('category').cat.codes
X_test['HomePlanet'] = X_test['HomePlanet'].astype('category').cat.codes

# could not convert string to float: 'F/575/P'
# The error is because the column 'Cabin' is a string and we need to convert it to a number
# Convert the column 'Cabin' to a number
X_train['Cabin'] = X_train['Cabin'].astype('category').cat.codes
X_test['Cabin'] = X_test['Cabin'].astype('category').cat.codes

# Convert the column 'Destination' to a number
X_train['Destination'] = X_train['Destination'].astype('category').cat.codes
X_test['Destination'] = X_test['Destination'].astype('category').cat.codes

# Convert the column 'Name' to a number
X_train['Name'] = X_train['Name'].astype('category').cat.codes
X_test['Name'] = X_test['Name'].astype('category').cat.codes

# Create a pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('feature_selection', SelectKBest(f_classif, k = 3)),
    ('model', RandomForestClassifier(random_state = 42))
])

# Fit the model
pipe.fit(X_train, y_train)

# Predict the target
y_pred = pipe.predict(X_test)

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy
print("Accuracy:", accuracy)

Accuracy: 0.7061529614721104


- Now let's take the best model we have so far and see how much it improves when we fine-tune its hyperparameters.

In [8]:
# Random Forest
from sklearn.ensemble import RandomForestClassifier

# Create a RandomForestClassifier and fit it on all encoded columns of X_train
# (not the 3-feature SelectKBest subset), so that it matches X_test used for prediction below
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

# Predict the target
y_pred = rf_clf.predict(X_test)

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy
print("Accuracy:", accuracy)

Accuracy: 0.7809085681426107


- Evaluate your model

In [9]:
# Evaluate your model
from sklearn.metrics import classification_report

# Print the classification report
print(classification_report(y_test, y_pred))

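# Print the feature importances (one value per column of X_train, since rf_clf was fitted on all columns)
print("Feature importances:", rf_clf.feature_importances_)
# Note: these are the 3 features picked earlier by SelectKBest, not the columns rf_clf was trained on
print("Selected feature names:", selected_feature_names)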




              precision    recall  f1-score   support

       False       0.83      0.70      0.76       861
        True       0.74      0.86      0.80       878

    accuracy                           0.78      1739
   macro avg       0.79      0.78      0.78      1739
weighted avg       0.79      0.78      0.78      1739

Feature importances: [0.10508564 0.0315417  0.08604683 0.13717838 0.01758201 0.09181163
 0.00178289 0.09599075 0.07452939 0.06209901 0.10951763 0.0854912
 0.10134295]
Selected feature names: Index(['CryoSleep', 'RoomService', 'Spa'], dtype='object')
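
The raw importance array above is hard to read because the values are not labelled. A minimal sketch of pairing each importance with its column name, assuming the model was fitted on a pandas DataFrame; the synthetic data and the `feature_i` column names are hypothetical stand-ins for the encoded `X_train`:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the encoded training data (column names are hypothetical)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=42
)
X_demo = pd.DataFrame(X_demo, columns=[f"feature_{i}" for i in range(5)])

rf_demo = RandomForestClassifier(n_estimators=50, random_state=42)
rf_demo.fit(X_demo, y_demo)

# Label each importance with the column it belongs to and sort from most to least important
importances = pd.Series(rf_demo.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

In the lab itself, the same pattern would use `rf_clf.feature_importances_` with `X_train.columns` as the index.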


**Grid/Random Search**

For this lab we will use Grid Search.
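
For comparison, random search evaluates a fixed number of randomly sampled candidates instead of every combination. A minimal sketch of that alternative using `RandomizedSearchCV` on synthetic data; the sampling ranges, `n_iter=5` and `cv=3` are assumptions chosen to keep the example fast, not values from this lab:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the lab's training data
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# Distributions to sample from (randint excludes the upper bound)
param_distributions = {
    "n_estimators": randint(20, 60),
    "max_depth": randint(2, 10),
}

# Evaluate only n_iter sampled candidates, each with 3-fold cross-validation
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=5,
    cv=3,
    random_state=42,
)
random_search.fit(X_demo, y_demo)

print("Best hyperparameters:", random_search.best_params_)
print("Best CV accuracy:", random_search.best_score_)
```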

- Define hyperparameters to fine tune.

In [10]:
# define hyperparameters to fine tune
param_grid = {
    'model__n_estimators': [50, 100, 200],
    'model__max_depth': [10, 20, 30, 40, 50]
}
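
The `model__` prefix in the grid keys routes each hyperparameter to the pipeline step named `model`. The sketch below rebuilds the same pipeline and grid under hypothetical `_demo` names, checks that every grid key is a parameter the pipeline exposes, and counts how many fits the search will need with the 5-fold cross-validation used in this lab:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same pipeline structure as in the lab (rebuilt here so the sketch is self-contained)
pipe_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("feature_selection", SelectKBest(f_classif, k=3)),
    ("model", RandomForestClassifier(random_state=42)),
])

param_grid_demo = {
    "model__n_estimators": [50, 100, 200],
    "model__max_depth": [10, 20, 30, 40, 50],
}

# Every grid key must match a name returned by get_params(), i.e. <step name>__<parameter>
valid_names = pipe_demo.get_params().keys()
unknown = [name for name in param_grid_demo if name not in valid_names]
print("Unknown parameter names:", unknown)

# Grid search tries every combination once per cross-validation fold
n_candidates = len(ParameterGrid(param_grid_demo))
cv_folds = 5
print("Candidates:", n_candidates)
print("Total fits:", n_candidates * cv_folds)
```

With 3 values for `n_estimators` and 5 for `max_depth`, this gives 15 candidates and 75 fits, plus one final refit of the best candidate on the full training set.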


- Run Grid Search

In [11]:
# Fine tune the model
from sklearn.model_selection import GridSearchCV

# Create a GridSearchCV
grid_search = GridSearchCV(pipe, param_grid, cv = 5, n_jobs = -1)

# Fit the model
grid_search.fit(X_train, y_train)

# Print the best hyperparameters
print("Best hyperparameters:", grid_search.best_params_)

# Predict the target
y_pred = grid_search.predict(X_test)

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy
print("Accuracy:", accuracy)

# Print the classification report
print(classification_report(y_test, y_pred))

# Print the feature importances
print("Feature importances:", grid_search.best_estimator_.named_steps['model'].feature_importances_)
print("Selected feature names:", selected_feature_names)

Best hyperparameters: {'model__max_depth': 10, 'model__n_estimators': 100}
Accuracy: 0.7274295572167913
              precision    recall  f1-score   support

       False       0.76      0.66      0.71       861
        True       0.70      0.79      0.75       878

    accuracy                           0.73      1739
   macro avg       0.73      0.73      0.73      1739
weighted avg       0.73      0.73      0.73      1739

Feature importances: [0.33413448 0.33562842 0.3302371 ]
Selected feature names: Index(['CryoSleep', 'RoomService', 'Spa'], dtype='object')


- Evaluate your model

In [12]:
# evaluate the model

# Print the classification report
print(classification_report(y_test, y_pred))

# Print the feature importances
print("Feature importances:", grid_search.best_estimator_.named_steps['model'].feature_importances_)

# Print the selected feature names
print("Selected feature names:", selected_feature_names)



              precision    recall  f1-score   support

       False       0.76      0.66      0.71       861
        True       0.70      0.79      0.75       878

    accuracy                           0.73      1739
   macro avg       0.73      0.73      0.73      1739
weighted avg       0.73      0.73      0.73      1739

Feature importances: [0.33413448 0.33562842 0.3302371 ]
Selected feature names: Index(['CryoSleep', 'RoomService', 'Spa'], dtype='object')
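
Note that the tuned pipeline (accuracy about 0.73) scores below the untuned Random Forest trained on all columns (about 0.78). One plausible reason is that the pipeline keeps only `k = 3` features. A minimal sketch of treating `k` as one more hyperparameter in the grid; the synthetic data and the candidate values are assumptions for illustration, and no claim is made about which `k` would win on the Spaceship Titanic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the lab's encoded training data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=42
)

pipe_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("feature_selection", SelectKBest(f_classif)),
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# Tune the number of selected features together with the model's depth
param_grid_demo = {
    "feature_selection__k": [3, 5, "all"],
    "model__max_depth": [5, 10],
}

search_demo = GridSearchCV(pipe_demo, param_grid_demo, cv=3)
search_demo.fit(X_demo, y_demo)

print("Best hyperparameters:", search_demo.best_params_)
print("Best CV accuracy:", search_demo.best_score_)
```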
