# LAB | Hyperparameter Tuning

**Load the data**

Finally step in order to maximize the performance on your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before, and then compare the best working models you got so far, but now fine tuning it's hyperparameters.

In [2]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [3]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [4]:
# Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif

In [5]:
# Handling missing values - filling numerical columns with median and categorical with mode
spaceship['Age'].fillna(spaceship['Age'].median(), inplace=True)
spaceship['RoomService'].fillna(spaceship['RoomService'].median(), inplace=True)
spaceship['FoodCourt'].fillna(spaceship['FoodCourt'].median(), inplace=True)
spaceship['ShoppingMall'].fillna(spaceship['ShoppingMall'].median(), inplace=True)
spaceship['Spa'].fillna(spaceship['Spa'].median(), inplace=True)
spaceship['VRDeck'].fillna(spaceship['VRDeck'].median(), inplace=True)

In [6]:
spaceship['HomePlanet'].fillna(spaceship['HomePlanet'].mode()[0], inplace=True)
spaceship['CryoSleep'].fillna(spaceship['CryoSleep'].mode()[0], inplace=True)
spaceship['Cabin'].fillna(spaceship['Cabin'].mode()[0], inplace=True)
spaceship['Destination'].fillna(spaceship['Destination'].mode()[0], inplace=True)
spaceship['VIP'].fillna(spaceship['VIP'].mode()[0], inplace=True)
spaceship['Name'].fillna(spaceship['Name'].mode()[0], inplace=True)

In [7]:
# Converting categorical columns to category type
categorical_columns = ['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP', 'Name']
for col in categorical_columns:
    spaceship[col] = spaceship[col].astype('category')

In [8]:
spaceship['Transported'] = spaceship['Transported'].astype('int')


In [11]:
X = spaceship.drop(columns=['Transported', 'PassengerId', 'Name'])
y = spaceship['Transported']

- Now let's use the best model we got so far in order to see how it can improve when we fine tune it's hyperparameters.

In [13]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [14]:
# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Feature selection using mutual_info_classif
selector = SelectKBest(score_func=mutual_info_classif, k=10)  # Select top 10 features
X_train_selected = selector.fit_transform(X_train_scaled, y_train)
X_test_selected = selector.transform(X_test_scaled)

# Random Forest with Decision Tree Classifier
random_forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
random_forest_clf.fit(X_train_selected, y_train)
y_pred_rf = random_forest_clf.predict(X_test_selected)

# Calculate accuracy
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy: {accuracy_rf:.4f}")

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred_rf)
print("Confusion Matrix:\n", conf_matrix)

# Classification Report
class_report = classification_report(y_test, y_pred_rf)
print("Classification Report:\n", class_report)

Random Forest Accuracy: 0.7809
Confusion Matrix:
 [[621 240]
 [141 737]]
Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.72      0.77       861
           1       0.75      0.84      0.79       878

    accuracy                           0.78      1739
   macro avg       0.78      0.78      0.78      1739
weighted avg       0.78      0.78      0.78      1739



- Evaluate your model

**Grid/Random Search**

For this lab we will use Grid Search.

- Define hyperparameters to fine tune.

In [20]:
# Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV

# Load the data
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")

# Handling missing values - filling numerical columns with median and categorical with mode
spaceship['Age'].fillna(spaceship['Age'].median(), inplace=True)
spaceship['RoomService'].fillna(spaceship['RoomService'].median(), inplace=True)
spaceship['FoodCourt'].fillna(spaceship['FoodCourt'].median(), inplace=True)
spaceship['ShoppingMall'].fillna(spaceship['ShoppingMall'].median(), inplace=True)
spaceship['Spa'].fillna(spaceship['Spa'].median(), inplace=True)
spaceship['VRDeck'].fillna(spaceship['VRDeck'].median(), inplace=True)
spaceship['HomePlanet'].fillna(spaceship['HomePlanet'].mode()[0], inplace=True)
spaceship['CryoSleep'].fillna(spaceship['CryoSleep'].mode()[0], inplace=True)
spaceship['Cabin'].fillna(spaceship['Cabin'].mode()[0], inplace=True)
spaceship['Destination'].fillna(spaceship['Destination'].mode()[0], inplace=True)
spaceship['VIP'].fillna(spaceship['VIP'].mode()[0], inplace=True)

# Converting categorical columns to category type
categorical_columns = ['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP']
for col in categorical_columns:
    spaceship[col] = spaceship[col].astype('category')
spaceship['Transported'] = spaceship['Transported'].astype('int')

# Splitting data into features and target variable
X = spaceship.drop(columns=['Transported', 'PassengerId', 'Name'])
y = spaceship['Transported']

# Splitting data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_columns)
    ])

# Create pipeline with preprocessing and feature selection
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('selector', SelectKBest(score_func=mutual_info_classif, k=10))
])

# Fit and transform the training data
X_train_transformed = pipeline.fit_transform(X_train, y_train)
X_test_transformed = pipeline.transform(X_test)

# Hyperparameter tuning for Random Forest Classifier
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(estimator=RandomForestClassifier(random_state=42), param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X_train_transformed, y_train)

# Best parameters
print("Best Parameters: ", grid_search.best_params_)

# Train the best model with the tuned hyperparameters
best_rf_clf = grid_search.best_estimator_
best_rf_clf.fit(X_train_transformed, y_train)

# Predictions and evaluation
y_pred_rf = best_rf_clf.predict(X_test_transformed)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy after Hyperparameter Tuning: {accuracy_rf:.4f}")

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred_rf)
print("Confusion Matrix:\n", conf_matrix)

# Classification Report
class_report = classification_report(y_test, y_pred_rf)
print("Classification Report:\n", class_report)




Fitting 5 folds for each of 324 candidates, totalling 1620 fits


KeyboardInterrupt: 

- Run Grid Search

In [1]:
# Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV

# Load the data
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")

# Handling missing values - filling numerical columns with median and categorical with mode
spaceship['Age'].fillna(spaceship['Age'].median(), inplace=True)
spaceship['RoomService'].fillna(spaceship['RoomService'].median(), inplace=True)
spaceship['FoodCourt'].fillna(spaceship['FoodCourt'].median(), inplace=True)
spaceship['ShoppingMall'].fillna(spaceship['ShoppingMall'].median(), inplace=True)
spaceship['Spa'].fillna(spaceship['Spa'].median(), inplace=True)
spaceship['VRDeck'].fillna(spaceship['VRDeck'].median(), inplace=True)
spaceship['HomePlanet'].fillna(spaceship['HomePlanet'].mode()[0], inplace=True)
spaceship['CryoSleep'].fillna(spaceship['CryoSleep'].mode()[0], inplace=True)
spaceship['Cabin'].fillna(spaceship['Cabin'].mode()[0], inplace=True)
spaceship['Destination'].fillna(spaceship['Destination'].mode()[0], inplace=True)
spaceship['VIP'].fillna(spaceship['VIP'].mode()[0], inplace=True)

# Converting categorical columns to category type
categorical_columns = ['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP']
for col in categorical_columns:
    spaceship[col] = spaceship[col].astype('category')
spaceship['Transported'] = spaceship['Transported'].astype('int')

# Splitting data into features and target variable
X = spaceship.drop(columns=['Transported', 'PassengerId', 'Name'])
y = spaceship['Transported']

# Splitting data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_columns)
    ])

# Create pipeline with preprocessing and feature selection
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('selector', SelectKBest(score_func=mutual_info_classif, k=10))
])

# Fit and transform the training data
X_train_transformed = pipeline.fit_transform(X_train, y_train)
X_test_transformed = pipeline.transform(X_test)

# Hyperparameter tuning for Random Forest Classifier
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(estimator=RandomForestClassifier(random_state=42), param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X_train_transformed, y_train)

# Best parameters
print("Best Parameters: ", grid_search.best_params_)

# Train the best model with the tuned hyperparameters
best_rf_clf = grid_search.best_estimator_
best_rf_clf.fit(X_train_transformed, y_train)

# Predictions and evaluation
y_pred_rf = best_rf_clf.predict(X_test_transformed)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy after Hyperparameter Tuning: {accuracy_rf:.4f}")

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred_rf)
print("Confusion Matrix:\n", conf_matrix)

# Classification Report
class_report = classification_report(y_test, y_pred_rf)
print("Classification Report:\n", class_report)




Fitting 5 folds for each of 324 candidates, totalling 1620 fits


540 fits failed out of a total of 1620.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
225 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\gianf\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\gianf\anaconda3\Lib\site-packages\sklearn\base.py", line 1466, in wrapper
    estimator._validate_params()
  File "c:\Users\gianf\anaconda3\Lib\site-packages\sklearn\base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "c:\Users\gianf\anaconda3\Lib\site-packages\sklearn\utils\_param_validation.py", line 95, in validate_parameter_constraints
    raise InvalidParam

Best Parameters:  {'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100}
Random Forest Accuracy after Hyperparameter Tuning: 0.7832
Confusion Matrix:
 [[654 207]
 [170 708]]
Classification Report:
               precision    recall  f1-score   support

           0       0.79      0.76      0.78       861
           1       0.77      0.81      0.79       878

    accuracy                           0.78      1739
   macro avg       0.78      0.78      0.78      1739
weighted avg       0.78      0.78      0.78      1739



- Evaluate your model