# LAB | Hyperparameter Tuning

**Load the data**

Final step: maximize the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

The metadata can be found here:

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before, and then compare the best-performing models you have so far, this time fine-tuning their hyperparameters.
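To make "default values" concrete, here is a minimal sketch that inspects the defaults of one of the models used later in this lab. It assumes a standard scikit-learn installation; `default_model` and `defaults` are illustrative names that do not appear elsewhere in the notebook.

```python
from sklearn.ensemble import AdaBoostClassifier

# Instantiate with no arguments so every hyperparameter takes its default value.
default_model = AdaBoostClassifier()
defaults = default_model.get_params()

# Show the two hyperparameters that are tuned later in this lab.
for name in ("n_estimators", "learning_rate"):
    print(f"{name} = {defaults[name]}")
```

Any value not passed explicitly to the constructor falls back to these defaults, which is what tuning sets out to improve on.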

In [53]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, chi2

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report

from sklearn.model_selection import GridSearchCV

In [54]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


In [55]:
#Check the shape of the data
spaceship.shape

#Check for data types
spaceship.dtypes

#Check for missing values
spaceship.isnull().sum()

# Handle missing data.
# Only a small share of values is missing, so drop every row with a missing value in these feature columns.
spaceship.dropna(subset=['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Age', 'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck'], inplace=True)

# Cabin is too granular: keep only the first letter, one of {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}
spaceship['Cabin'] = spaceship['Cabin'].str[0]

# Optionally, ensure that only the expected categories are present
# (.copy() avoids modifying a filtered view in the next step)
valid_categories = {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}
spaceship = spaceship[spaceship['Cabin'].isin(valid_categories)].copy()

# Drop PassengerId and Name
spaceship.drop(['PassengerId', 'Name'], axis=1, inplace=True)
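As a small illustration of the Cabin transformation, the sketch below applies it to a few hand-written values in the `deck/number/side` format seen in the preview above. The `cabins` series is made up for the demonstration and is not taken from the dataset.

```python
import pandas as pd

# Hypothetical cabin strings, plus one missing value.
cabins = pd.Series(["B/0/P", "F/0/S", "A/0/S", None])

# Two equivalent ways to keep only the leading letter.
deck_by_index = cabins.str[0]
deck_by_split = cabins.str.split("/").str[0]

print(deck_by_index.tolist())
print(deck_by_index.equals(deck_by_split))
```

Missing cabins stay missing after either call, which is why the rows with missing values are dropped before the transformation.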

Now perform the same as before:
- Feature Scaling
- Feature Selection


In [56]:
# After preprocessing (dummy encoding and missing value handling)
# Note: X and y are created in the next code cell, so that cell must have been run first
num_features = X.shape[1]
print(f"Number of features available: {num_features}")

# Adjust 'k' to a value less than or equal to the available number of features
k = min(10, num_features)  # Adjust 'k' accordingly, here set to 10 or lower if fewer features are available

# Apply SelectKBest with Chi-Squared Test
select_k_best = SelectKBest(score_func=chi2, k=k)
X_kbest = select_k_best.fit_transform(X, y)
selected_features_kbest = X.columns[select_k_best.get_support()]

print("Features selected by SelectKBest:", selected_features_kbest)

# Continue with RFE and Feature Importance as before...

Number of features available: 6
Features selected by SelectKBest: Index(['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck'], dtype='object')


In [76]:
# Step 0: Build the feature matrix X and the target y from the cleaned spaceship DataFrame

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Select the features and target variable (the categorical columns are dropped, leaving only numeric features)
X = spaceship.drop(columns=['Transported','HomePlanet','CryoSleep','Cabin','Destination','VIP'])
y = spaceship['Transported']

# Fill in any missing values (a safety net: rows with missing values were already dropped above)
X = X.fillna(X.median())
y = y.fillna(y.mode()[0])

# Convert categorical variables to dummies (one-hot encoding); no categorical columns remain, so X is unchanged
X = pd.get_dummies(X)

# Step 5: Apply SelectKBest with Chi-Squared Test
k = min(10, X.shape[1])  # Ensure 'k' does not exceed the number of features
select_k_best = SelectKBest(score_func=chi2, k=k)
X_kbest = select_k_best.fit_transform(X, y)
selected_features_kbest = X.columns[select_k_best.get_support()]

print("Features selected by SelectKBest:", selected_features_kbest)

# Step 6: Recursive Feature Elimination (RFE)
log_reg = LogisticRegression(solver='liblinear')
rfe = RFE(log_reg, n_features_to_select=5)  # Further refine to top 5 features
X_rfe = rfe.fit_transform(X_kbest, y)
selected_features_rfe = selected_features_kbest[rfe.get_support()]

print("Features selected by RFE:", selected_features_rfe)

# Step 7: Feature Importance using RandomForest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_rfe, y)
importances = rf.feature_importances_

# Combine feature names and their importances
feature_importance = list(zip(selected_features_rfe, importances))
feature_importance.sort(key=lambda x: x[1], reverse=True)  # Sort by importance

print("Feature importances from RandomForest:")
for feature, importance in feature_importance:
    print(f"{feature}: {importance:.4f}")

# Optional: Select top features based on importance threshold
threshold = 0.05  # Example threshold for feature importance
top_features_final = [f for f, importance in feature_importance if importance >= threshold]

print("Final top features selected based on importance threshold:", top_features_final)

# Final refined feature set
X_final_refined = X_rfe[:, [i for i, f in enumerate(selected_features_rfe) if f in top_features_final]]

# Step 8: Split the final refined data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_final_refined, y, test_size=0.20, random_state=42)

# Step 9: Feature scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

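# Step 10: Initialize and train the AdaBoost classifier
# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator`; on newer versions pass `estimator=` instead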
ada_clf = AdaBoostClassifier(
    base_estimator=DecisionTreeClassifier(max_depth=1), 
    n_estimators=100, 
    learning_rate=0.5, 
    random_state=42
)

ada_clf.fit(X_train, y_train)

# Step 11: Make predictions on the test set
y_pred_ada = ada_clf.predict(X_test)

# Step 12: Evaluate the model's performance
accuracy_ada = accuracy_score(y_test, y_pred_ada)
print(f"AdaBoost Classifier Accuracy: {accuracy_ada:.4f}")

print("AdaBoost Classifier Report:")
print(classification_report(y_test, y_pred_ada))

# Final refined feature set shape
print("Shape of the final refined feature set:", X_final_refined.shape)



# Step 5 (comparison run): split the full feature set X, without the importance-based selection, into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Step 6: Feature scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

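# Step 7: Initialize and train the AdaBoost classifier
# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator`; on newer versions pass `estimator=` instead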
ada_clf = AdaBoostClassifier(
    base_estimator=DecisionTreeClassifier(max_depth=1), 
    n_estimators=100, 
    learning_rate=0.5, 
    random_state=42
)

ada_clf.fit(X_train, y_train)

# Step 8: Make predictions on the test set
y_pred_ada = ada_clf.predict(X_test)

# Step 9: Evaluate the model's performance
accuracy_ada = accuracy_score(y_test, y_pred_ada)
print(f"AdaBoost Classifier Accuracy: {accuracy_ada:.4f}")

print("AdaBoost Classifier Report:")
print(classification_report(y_test, y_pred_ada))

# Step 1: SelectKBest with Chi-Squared Test
k = 6  # Equals the number of available features, so every feature is retained at this stage
select_k_best = SelectKBest(score_func=chi2, k=k)
X_kbest = select_k_best.fit_transform(X, y)
selected_features_kbest = X.columns[select_k_best.get_support()]

print("Features selected by SelectKBest:", selected_features_kbest)

# Step 2: Recursive Feature Elimination (RFE)
log_reg = LogisticRegression(solver='liblinear')
rfe = RFE(log_reg, n_features_to_select=10)  # Asks for 10 features, but only 6 are available, so all 6 are kept
X_rfe = rfe.fit_transform(X_kbest, y)
selected_features_rfe = selected_features_kbest[rfe.get_support()]

print("Features selected by RFE:", selected_features_rfe)

# Step 3: Feature Importance using RandomForest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_rfe, y)
importances = rf.feature_importances_

# Combine feature names and their importances
feature_importance = list(zip(selected_features_rfe, importances))
feature_importance.sort(key=lambda x: x[1], reverse=True)  # Sort by importance

print("Feature importances from RandomForest:")
for feature, importance in feature_importance:
    print(f"{feature}: {importance:.4f}")

# Optional: Select top features based on importance threshold
threshold = 0.05  # Example threshold for feature importance
top_features_final = [f for f, importance in feature_importance if importance >= threshold]

print("Final top features selected based on importance threshold:", top_features_final)

# Final refined feature set
X_final_refined = X_rfe[:, [i for i, f in enumerate(selected_features_rfe) if f in top_features_final]]

# You can now use X_final_refined for further modeling
print("Shape of the final refined feature set:", X_final_refined.shape)



Features selected by SelectKBest: Index(['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck'], dtype='object')
Features selected by RFE: Index(['Age', 'RoomService', 'FoodCourt', 'Spa', 'VRDeck'], dtype='object')
Feature importances from RandomForest:
RoomService: 0.2439
Spa: 0.2378
VRDeck: 0.2093
FoodCourt: 0.1771
Age: 0.1318
Final top features selected based on importance threshold: ['RoomService', 'Spa', 'VRDeck', 'FoodCourt', 'Age']




AdaBoost Classifier Accuracy: 0.7783
AdaBoost Classifier Report:
              precision    recall  f1-score   support

       False       0.84      0.71      0.77       699
        True       0.73      0.85      0.79       654

    accuracy                           0.78      1353
   macro avg       0.78      0.78      0.78      1353
weighted avg       0.79      0.78      0.78      1353

Shape of the final refined feature set: (6764, 5)




AdaBoost Classifier Accuracy: 0.7938
AdaBoost Classifier Report:
              precision    recall  f1-score   support

       False       0.85      0.73      0.79       699
        True       0.75      0.86      0.80       654

    accuracy                           0.79      1353
   macro avg       0.80      0.80      0.79      1353
weighted avg       0.80      0.79      0.79      1353

Features selected by SelectKBest: Index(['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck'], dtype='object')
Features selected by RFE: Index(['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck'], dtype='object')
Feature importances from RandomForest:
Spa: 0.2089
RoomService: 0.2001
VRDeck: 0.1827
FoodCourt: 0.1598
ShoppingMall: 0.1345
Age: 0.1141
Final top features selected based on importance threshold: ['Spa', 'RoomService', 'VRDeck', 'FoodCourt', 'ShoppingMall', 'Age']
Shape of the final refined feature set: (6764, 6)


- Now let's take the best model we have so far and see how much it improves when we fine-tune its hyperparameters.

In [78]:
#your code here
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=100,
    learning_rate=0.5, random_state=42)

ada_clf.fit(X_train, y_train)
y_pred_ada = ada_clf.predict(X_test)

print("AdaBoost Classifier Accuracy:", accuracy_score(y_test, y_pred_ada))
print("AdaBoost Classifier Report:")
print(classification_report(y_test, y_pred_ada))

AdaBoost Classifier Accuracy: 0.7937915742793792
AdaBoost Classifier Report:
              precision    recall  f1-score   support

       False       0.85      0.73      0.79       699
        True       0.75      0.86      0.80       654

    accuracy                           0.79      1353
   macro avg       0.80      0.80      0.79      1353
weighted avg       0.80      0.79      0.79      1353



- Evaluate your model

In [79]:
#your code here

# Evaluate the model's performance
accuracy_ada = accuracy_score(y_test, y_pred_ada)
classification_report_ada = classification_report(y_test, y_pred_ada)

# Display the results
accuracy_ada, classification_report_ada

(0.7937915742793792,
 '              precision    recall  f1-score   support\n\n       False       0.85      0.73      0.79       699\n        True       0.75      0.86      0.80       654\n\n    accuracy                           0.79      1353\n   macro avg       0.80      0.80      0.79      1353\nweighted avg       0.80      0.79      0.79      1353\n')

**Grid/Random Search**

For this lab we will use Grid Search.
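For contrast with the grid search used in this lab, here is a minimal sketch of the random-search alternative named in the heading. It assumes synthetic data from `make_classification` and made-up candidate values. Rather than trying every combination, `RandomizedSearchCV` evaluates only `n_iter` sampled candidates.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: NOT the spaceship dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)

# 4 x 4 = 16 possible combinations; the values are illustrative only.
param_distributions = {
    "n_estimators": [25, 50, 75, 100],
    "learning_rate": [0.05, 0.1, 0.5, 1.0],
}

random_search = RandomizedSearchCV(
    AdaBoostClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,          # evaluate 5 sampled candidates instead of all 16
    cv=3,
    scoring="accuracy",
    random_state=42,
)
random_search.fit(X_demo, y_demo)

print("Candidates evaluated:", len(random_search.cv_results_["params"]))
print("Best parameters:", random_search.best_params_)
```

Random search trades exhaustiveness for speed, which tends to matter more as the number of hyperparameters grows.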

- Define hyperparameters to fine tune.

In [80]:
# Define an initial, coarse parameter grid (kept for reference; the search below uses the refined grid)
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.5, 1.0],
    'base_estimator__max_depth': [1, 2, 3, 4]
}

# Define a more refined parameter grid
param_grid_refined = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.05, 0.1, 0.2, 0.3],
    'base_estimator__max_depth': [2, 3, 4]
}

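# Initialize the AdaBoost model with DecisionTreeClassifier as the base estimator
# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator`; on newer versions use `estimator=` here
# and `estimator__max_depth` as the key in the grids above
ada_clf = AdaBoostClassifier(base_estimator=DecisionTreeClassifier())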
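Before launching a search it helps to know how many models it will train. The sketch below counts the candidates in the refined grid with `ParameterGrid`. The value `cv_folds = 5` is assumed to match the `cv=5` used in the grid search cell, and the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the refined grid defined above.
param_grid_refined = {
    "n_estimators": [100, 200, 300],
    "learning_rate": [0.05, 0.1, 0.2, 0.3],
    "base_estimator__max_depth": [2, 3, 4],
}

cv_folds = 5  # assumed to match cv=5 in the grid search

# Every combination of values is one candidate; each candidate is fitted once per fold.
n_candidates = len(ParameterGrid(param_grid_refined))
n_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

With the default `refit=True`, `GridSearchCV` also trains the winning candidate once more on the full training set.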



- Run Grid Search

In [65]:
# Initialize Grid Search with cross-validation
grid_search = GridSearchCV(estimator=ada_clf, param_grid=param_grid_refined, cv=5, scoring='accuracy', n_jobs=-1)

# Fit the Grid Search to the data
grid_search.fit(X_train, y_train)




- Evaluate your model

In [81]:
# Get the best parameters and the best score (mean cross-validated accuracy over the 5 folds of the training set)
best_params = grid_search.best_params_
best_score = grid_search.best_score_

best_params, best_score

({'base_estimator__max_depth': 2, 'learning_rate': 0.1, 'n_estimators': 200},
 0.7924291446837867)
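`best_score_` is a cross-validation average computed on the training data, so it is not directly comparable with the test-set accuracy reported earlier. A minimal sketch of scoring the tuned model on held-out data follows. It assumes synthetic data from `make_classification` and a deliberately small, made-up grid so that it runs quickly. The `prefix` line is a workaround for the `base_estimator`/`estimator` rename across scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: NOT the spaceship dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Passing the tree positionally works whichever name the installed version uses.
ada = AdaBoostClassifier(DecisionTreeClassifier(), random_state=42)
prefix = "estimator" if "estimator" in ada.get_params() else "base_estimator"

# Small illustrative grid, not the one used in the lab.
small_grid = {"n_estimators": [25, 50], f"{prefix}__max_depth": [1, 2]}

search = GridSearchCV(ada, small_grid, cv=3, scoring="accuracy")
search.fit(X_tr, y_tr)

# best_estimator_ is the winning configuration refitted on all of X_tr.
best_model = search.best_estimator_
test_accuracy = accuracy_score(y_te, best_model.predict(X_te))

print("Best parameters:", search.best_params_)
print(f"Mean CV accuracy: {search.best_score_:.4f}")
print(f"Test accuracy:    {test_accuracy:.4f}")
```

Applied to this lab, the same pattern would call `grid_search.best_estimator_.predict(X_test)` and compare the result with the untuned AdaBoost accuracy on the same test split.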