# LAB | Hyperparameter Tuning

**Load the data**

Finally step in order to maximize the performance on your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before, and then compare the best working models you got so far, but now fine tuning it's hyperparameters.

In [2]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [3]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [4]:
### Dropping missing values
spaceship.dropna(inplace=True)

### Drop unnecessary columns
spaceship.drop(columns=['Name', 'PassengerId'], inplace=True)

### Transforming 'Cabin' to just the first letter
spaceship['Cabin'] = spaceship['Cabin'].str[0]

# Label encoding
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
spaceship['HomePlanet'] = le.fit_transform(spaceship['HomePlanet'])
spaceship['CryoSleep'] = le.fit_transform(spaceship['CryoSleep'])
spaceship['Cabin'] = le.fit_transform(spaceship['Cabin'])
spaceship['Destination'] = le.fit_transform(spaceship['Destination'])
spaceship['VIP'] = le.fit_transform(spaceship['VIP'])

spaceship.head()

Unnamed: 0,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported
0,1,0,1,2,39.0,0,0.0,0.0,0.0,0.0,0.0,False
1,0,0,5,2,24.0,0,109.0,9.0,25.0,549.0,44.0,True
2,1,0,0,2,58.0,1,43.0,3576.0,0.0,6715.0,49.0,False
3,1,0,0,2,33.0,0,0.0,1283.0,371.0,3329.0,193.0,False
4,0,0,5,2,16.0,0,303.0,70.0,151.0,565.0,2.0,True


In [5]:
### Splitting the data into features and target

features = spaceship.drop(columns = ["Transported", "FoodCourt", "Spa", "ShoppingMall", "VIP", "Destination"])
target = spaceship["Transported"].astype(int)

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.20, random_state=42)

In [6]:
### Normalizing and transforming the data

from sklearn.preprocessing import MinMaxScaler

normalizer = MinMaxScaler()
normalizer.fit(X_train)

X_train_norm = normalizer.transform(X_train)
X_test_norm = normalizer.transform(X_test)

- Now let's use the best model we got so far in order to see how it can improve when we fine tune it's hyperparameters.

In [7]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score

In [29]:
### Bagging Classifier
base_tree = DecisionTreeClassifier(random_state=42)

bagging = BaggingClassifier(
    estimator = base_tree,
    random_state = 42,
    n_jobs = -1
)

In [30]:
bagging.fit(X_train_norm, y_train)

- Evaluate your model

In [31]:
### Accuracy score

pred = bagging1.predict(X_train_norm)
print("Train Accuracy:", accuracy_score(y_train, pred))

pred = bagging1.predict(X_test_norm)
print("Test Accuracy:", accuracy_score(y_test, pred))

Train Accuracy: 0.7738455715367146
Test Accuracy: 0.7465960665658093


**Grid/Random Search**

For this lab we will use Grid Search.

- Define hyperparameters to fine tune.

In [11]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

- Run Grid Search

In [40]:
grid = {"n_estimators": [50, 100, 200],                         # belongs to BaggingClassifier
        "estimator__max_leaf_nodes": [10, 20, 50, 100],         # belongs to DecisionTreeClassifier
        "estimator__max_depth":[5, 10, 20]}                     # belongs to DecisionTreeClassifier

In [38]:
model = GridSearchCV(estimator = bagging, param_grid = grid, cv=5, verbose=10, n_jobs=-1)

In [39]:
model.fit(X_train_norm, y_train)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
[CV 4/5; 1/36] START estimator__max_depth=5, estimator__max_leaf_nodes=10, n_estimators=50
[CV 1/5; 1/36] START estimator__max_depth=5, estimator__max_leaf_nodes=10, n_estimators=50
[CV 2/5; 1/36] START estimator__max_depth=5, estimator__max_leaf_nodes=10, n_estimators=50
[CV 3/5; 1/36] START estimator__max_depth=5, estimator__max_leaf_nodes=10, n_estimators=50
[CV 5/5; 1/36] START estimator__max_depth=5, estimator__max_leaf_nodes=10, n_estimators=50
[CV 2/5; 2/36] START estimator__max_depth=5, estimator__max_leaf_nodes=10, n_estimators=100
[CV 1/5; 2/36] START estimator__max_depth=5, estimator__max_leaf_nodes=10, n_estimators=100
[CV 3/5; 2/36] START estimator__max_depth=5, estimator__max_leaf_nodes=10, n_estimators=100
[CV 4/5; 1/36] END estimator__max_depth=5, estimator__max_leaf_nodes=10, n_estimators=50;, score=0.732 total time=   0.2s
[CV 2/5; 1/36] END estimator__max_depth=5, estimator__max_leaf_nodes=10, n_estimators

In [42]:
model.best_params_

{'estimator__max_depth': 20,
 'estimator__max_leaf_nodes': 20,
 'n_estimators': 50}

In [46]:
# Let's run another iteration of the model

grid = {"n_estimators": [30, 40, 50, 60, 70],                   # belongs to BaggingClassifier
        "estimator__max_leaf_nodes": [15, 20, 25, 30, 40],         # belongs to DecisionTreeClassifier
        "estimator__max_depth":[25, 20, 30, 50]}                     # belongs to DecisionTreeClassifier

In [47]:
model = GridSearchCV(estimator = bagging, param_grid = grid, cv=5, verbose=10, n_jobs=-1)

In [48]:
model.fit(X_train_norm, y_train)

Fitting 5 folds for each of 100 candidates, totalling 500 fits
[CV 1/5; 1/100] START estimator__max_depth=25, estimator__max_leaf_nodes=15, n_estimators=30
[CV 2/5; 1/100] START estimator__max_depth=25, estimator__max_leaf_nodes=15, n_estimators=30
[CV 3/5; 1/100] START estimator__max_depth=25, estimator__max_leaf_nodes=15, n_estimators=30
[CV 4/5; 1/100] START estimator__max_depth=25, estimator__max_leaf_nodes=15, n_estimators=30
[CV 5/5; 1/100] START estimator__max_depth=25, estimator__max_leaf_nodes=15, n_estimators=30
[CV 1/5; 2/100] START estimator__max_depth=25, estimator__max_leaf_nodes=15, n_estimators=40
[CV 2/5; 2/100] START estimator__max_depth=25, estimator__max_leaf_nodes=15, n_estimators=40
[CV 3/5; 2/100] START estimator__max_depth=25, estimator__max_leaf_nodes=15, n_estimators=40
[CV 1/5; 1/100] END estimator__max_depth=25, estimator__max_leaf_nodes=15, n_estimators=30;, score=0.766 total time=   0.1s
[CV 3/5; 1/100] END estimator__max_depth=25, estimator__max_leaf_node

In [49]:
model.best_params_

{'estimator__max_depth': 25,
 'estimator__max_leaf_nodes': 25,
 'n_estimators': 50}

In [None]:
best_model = model.best_estimator_
best_model

- Evaluate your model

In [52]:
### Accuracy score

pred = best_model.predict(X_train_norm)
print("Train Accuracy:", accuracy_score(y_train, pred))

pred = best_model.predict(X_test_norm)
print("Test Accuracy:", accuracy_score(y_test, pred))

Train Accuracy: 0.7628690386071159
Test Accuracy: 0.7473524962178517


In [None]:
# Well, accuracy got worse :(

Exception ignored in: <function ResourceTracker.__del__ at 0x1049f9c60>
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.13/multiprocessing/resource_tracker.py", line 82, in __del__
  File "/opt/anaconda3/lib/python3.13/multiprocessing/resource_tracker.py", line 91, in _stop
  File "/opt/anaconda3/lib/python3.13/multiprocessing/resource_tracker.py", line 116, in _stop_locked
ChildProcessError: [Errno 10] No child processes
Exception ignored in: <function ResourceTracker.__del__ at 0x106fc1c60>
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.13/multiprocessing/resource_tracker.py", line 82, in __del__
  File "/opt/anaconda3/lib/python3.13/multiprocessing/resource_tracker.py", line 91, in _stop
  File "/opt/anaconda3/lib/python3.13/multiprocessing/resource_tracker.py", line 116, in _stop_locked
ChildProcessError: [Errno 10] No child processes
Exception ignored in: <function ResourceTracker.__del__ at 0x118c81c60>
Traceback (most recent call last

- Let's try for another method. One with good train result but with overfitting issues to see if we can make it better. 

Random forest:
- Train Accuracy: 0.9163512490537472
- Test Accuracy: 0.7420574886535553

In [54]:
forest = RandomForestClassifier()
forest.fit(X_train_norm, y_train)

In [56]:
### Accuracy score

pred = forest.predict(X_train_norm)
print("Train Accuracy:", accuracy_score(y_train, pred))

pred = forest.predict(X_test_norm)
print("Test Accuracy:", accuracy_score(y_test, pred))

Train Accuracy: 0.9163512490537472
Test Accuracy: 0.7443267776096822


In [66]:
grid = {
    "n_estimators": [150, 200, 300, 500],
    "max_depth": [10, 15],
    "max_leaf_nodes": [40, 50, 60, 70]
}

In [67]:
model = GridSearchCV(estimator = forest, 
                     param_grid = grid, 
                     cv=5, 
                     verbose=10, 
                     n_jobs=-1)

In [68]:
model.fit(X_train_norm, y_train)

Fitting 5 folds for each of 32 candidates, totalling 160 fits
[CV 1/5; 1/32] START max_depth=10, max_leaf_nodes=40, n_estimators=150..........
[CV 2/5; 1/32] START max_depth=10, max_leaf_nodes=40, n_estimators=150..........
[CV 3/5; 1/32] START max_depth=10, max_leaf_nodes=40, n_estimators=150..........
[CV 4/5; 1/32] START max_depth=10, max_leaf_nodes=40, n_estimators=150..........
[CV 5/5; 1/32] START max_depth=10, max_leaf_nodes=40, n_estimators=150..........
[CV 1/5; 2/32] START max_depth=10, max_leaf_nodes=40, n_estimators=200..........
[CV 2/5; 2/32] START max_depth=10, max_leaf_nodes=40, n_estimators=200..........
[CV 2/5; 1/32] END max_depth=10, max_leaf_nodes=40, n_estimators=150;, score=0.730 total time=   0.3s
[CV 1/5; 1/32] END max_depth=10, max_leaf_nodes=40, n_estimators=150;, score=0.764 total time=   0.4s
[CV 3/5; 2/32] START max_depth=10, max_leaf_nodes=40, n_estimators=200..........
[CV 3/5; 1/32] END max_depth=10, max_leaf_nodes=40, n_estimators=150;, score=0.737 tot

In [69]:
model.best_params_

{'max_depth': 10, 'max_leaf_nodes': 60, 'n_estimators': 200}

In [70]:
best_model = model.best_estimator_
best_model

In [71]:
### Accuracy score

pred = best_model.predict(X_train_norm)
print("Train Accuracy:", accuracy_score(y_train, pred))

pred = best_model.predict(X_test_norm)
print("Test Accuracy:", accuracy_score(y_test, pred))

Train Accuracy: 0.7747918243754731
Test Accuracy: 0.7602118003025718


In [72]:
# Better