# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see if you can obtain a better model than before. For a fair comparison, you should apply the same feature scaling and feature engineering as in the previous Lab.

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [3]:
# Dropping na values
spaceship = spaceship.dropna()

In [4]:
# Retrieve Cabin's first letter
spaceship['Cabin'] = spaceship['Cabin'].str[0]

In [5]:
# Drop PassengerId and Name
spaceship.drop(columns=['PassengerId', 'Name'], inplace=True)

In [6]:
# Perform one-hot encoding on categorical columns
spaceship = pd.get_dummies(spaceship, columns=['HomePlanet', 'Cabin', 'Destination'])

In [7]:
spaceship.head()

Unnamed: 0,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,HomePlanet_Earth,...,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e
0,False,39.0,False,0.0,0.0,0.0,0.0,0.0,False,False,...,True,False,False,False,False,False,False,False,False,True
1,False,24.0,False,109.0,9.0,25.0,549.0,44.0,True,True,...,False,False,False,False,True,False,False,False,False,True
2,False,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False,False,...,False,False,False,False,False,False,False,False,False,True
3,False,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False,False,...,False,False,False,False,False,False,False,False,False,True
4,False,16.0,False,303.0,70.0,151.0,565.0,2.0,True,True,...,False,False,False,False,True,False,False,False,False,True


In [None]:
# I am going to scale the data after splitting it

**Perform Train Test Split**

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
# Splitting the data into features (X) and target variable (y)
X = spaceship.drop(columns=['Transported'])
y = spaceship['Transported']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=22)

In [10]:
from sklearn.preprocessing import StandardScaler

In [11]:
# Scaling the data
scaler = StandardScaler()

# Fit the scaler only on the training data
X_train_scaled = scaler.fit_transform(X_train)

# Use the same scaler to transform the test data
X_test_scaled = scaler.transform(X_test)

print("Original Training Data:\n", X_train)
print("Scaled Training Data:\n", X_train_scaled)
print("Original Test Data:\n", X_test)
print("Scaled Test Data:\n", X_test_scaled)

Original Training Data:
      CryoSleep   Age    VIP  RoomService  FoodCourt  ShoppingMall     Spa  \
7537      True  26.0  False          0.0        0.0           0.0     0.0   
6310     False  30.0  False         77.0       71.0        1147.0     0.0   
1277     False  39.0  False       1535.0        0.0         340.0     0.0   
4047     False  25.0  False        412.0        0.0         567.0   775.0   
1609     False  23.0  False       2210.0        0.0          89.0     0.0   
...        ...   ...    ...          ...        ...           ...     ...   
7241     False  63.0  False          0.0      243.0           0.0   777.0   
470       True  18.0  False          0.0        0.0           0.0     0.0   
6491     False  22.0  False          0.0      752.0          18.0     0.0   
8258      True  13.0  False          0.0        0.0           0.0     0.0   
3908     False  44.0  False         36.0      588.0           0.0  1675.0   

      VRDeck  HomePlanet_Earth  HomePlanet_Europa 

In [15]:
scaled_df = pd.DataFrame(X_train_scaled, columns=X_train.columns)

In [16]:
scaled_df.head()

Unnamed: 0,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,HomePlanet_Earth,HomePlanet_Europa,...,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e
0,1.354535,-0.197988,-0.158191,-0.348809,-0.293194,-0.30638,-0.26684,-0.274003,-1.095787,1.724219,...,3.072896,-0.310633,-0.243603,-0.339468,-0.695595,-0.655775,-0.013758,1.927083,-0.329323,-1.49194
1,-0.738261,0.076237,-0.158191,-0.229224,-0.249256,1.706069,-0.26684,-0.274003,-1.095787,-0.579973,...,-0.325426,-0.310633,-0.243603,2.945786,-0.695595,-0.655775,-0.013758,-0.518919,3.036528,-1.49194
2,-0.738261,0.693244,-0.158191,2.03513,-0.293194,0.290161,-0.26684,0.393211,-1.095787,-0.579973,...,-0.325426,-0.310633,-0.243603,-0.339468,1.437617,-0.655775,-0.013758,-0.518919,-0.329323,0.670268
3,-0.738261,-0.266544,-0.158191,0.29105,-0.293194,0.68844,0.405894,-0.274003,-1.095787,-0.579973,...,-0.325426,-0.310633,-0.243603,-0.339468,1.437617,-0.655775,-0.013758,-0.518919,-0.329323,0.670268
4,-0.738261,-0.403657,-0.158191,3.083442,-0.293194,-0.150227,-0.26684,-0.274003,-1.095787,-0.579973,...,-0.325426,-0.310633,4.10504,-0.339468,-0.695595,-0.655775,-0.013758,-0.518919,-0.329323,0.670268


In [19]:
scaled_df['VIP'].unique()

array([-0.15819054,  6.32149036])
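
The split-then-scale pattern above (fit the scaler on the training set only, then reuse it on the test set) can also be expressed with a scikit-learn `Pipeline`, which makes it harder to leak test statistics by accident. The following is a minimal sketch on synthetic data, not the Spaceship Titanic frame; the toy dataset from `make_classification` and the names `X_demo`, `y_demo`, and `pipe` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: this is not the lab's dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=22)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=22
)

# The pipeline fits the scaler on whatever is passed to fit(),
# which here is the training split only
pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=22))
pipe.fit(X_tr, y_tr)

# The scaler's learned means come from the training split alone
scaler_means = pipe.named_steps["standardscaler"].mean_
print(np.allclose(scaler_means, X_tr.mean(axis=0)))

# score() applies the already-fitted scaler to the test data before predicting
test_accuracy = pipe.score(X_te, y_te)
print("Test accuracy:", test_accuracy)
```

Passing a pipeline like this to an ensemble or to cross-validation keeps the scaling step tied to each training fold.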

**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [21]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

In [22]:
# bagging classifier
bagging_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # Base estimator (decision tree); named `base_estimator` in scikit-learn < 1.2
    n_estimators=100,  # Number of base estimators
    max_samples=0.8,  # Proportion of training samples to use for each base estimator
    bootstrap=True,   # Whether to use bootstrap sampling (True for bagging)
    random_state=22   # Random seed for reproducibility
)

In [24]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

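# Fit the BaggingClassifier to the training data.
# Note: the model is fit on the unscaled X_train. Decision trees split on
# thresholds, so standardizing the features is not expected to change the result.
bagging_clf.fit(X_train, y_train)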

# Make predictions on the test data
pred = bagging_clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, pred)

# Calculate precision
precision = precision_score(y_test, pred)

# Calculate recall
recall = recall_score(y_test, pred)

# Calculate F1-score
f1 = f1_score(y_test, pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

Accuracy: 0.8048411497730711
Precision: 0.8304821150855366
Recall: 0.7818448023426061
F1-score: 0.8054298642533937
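
The same four metrics are computed for every model in this Lab, so one option is to wrap them in a small helper. The sketch below assumes binary labels; `summarize_metrics` is a hypothetical helper name, and the toy labels are made up purely to show the call.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


def summarize_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels as a dict."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }


# Toy labels (made up for illustration): 3 true positives, 1 false positive,
# 1 false negative, 1 true negative
y_true_toy = [1, 0, 1, 1, 0, 1]
y_pred_toy = [1, 0, 0, 1, 1, 1]

toy_metrics = summarize_metrics(y_true_toy, y_pred_toy)
for name, value in toy_metrics.items():
    print(f"{name}: {value:.4f}")
```

With the objects defined in this Lab, the call would look like `summarize_metrics(y_test, bagging_clf.predict(X_test))`.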


In [25]:
# Pasting classifier
pasting_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # Base estimator (decision tree); named `base_estimator` in scikit-learn < 1.2
    n_estimators=100,  # Number of base estimators
    max_samples=0.8,  # Proportion of training samples to use for each base estimator
    bootstrap=False,  # Whether to use bootstrap sampling (False for pasting)
    random_state=22   # Random seed for reproducibility
)

In [27]:
# Fit the pasting classifier (a BaggingClassifier with bootstrap=False) to the training data
pasting_clf.fit(X_train, y_train)



In [28]:
# Make predictions on the test data
pred = pasting_clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, pred)

# Calculate precision
precision = precision_score(y_test, pred)

# Calculate recall
recall = recall_score(y_test, pred)

# Calculate F1-score
f1 = f1_score(y_test, pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

Accuracy: 0.783661119515885
Precision: 0.802130898021309
Recall: 0.7715959004392386
F1-score: 0.7865671641791045
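
The only difference between the two classifiers above is the `bootstrap` flag, which controls whether each base estimator's training sample is drawn with or without replacement. The snippet below is a rough illustration of that idea using Python's standard `random` module; it is not how `BaggingClassifier` is implemented internally, and the ten row ids are a made-up stand-in for the training set.

```python
import random

rng = random.Random(22)                  # seeded generator for reproducibility
row_ids = list(range(10))                # stand-in for 10 training rows
sample_size = int(0.8 * len(row_ids))    # mirrors max_samples=0.8

# Bagging: draw WITH replacement, so a row can appear more than once
bagging_sample = rng.choices(row_ids, k=sample_size)

# Pasting: draw WITHOUT replacement, so every drawn row is distinct
pasting_sample = rng.sample(row_ids, k=sample_size)

print("Bagging sample:", sorted(bagging_sample))
print("Pasting sample:", sorted(pasting_sample))
print("Unique rows (bagging):", len(set(bagging_sample)))
print("Unique rows (pasting):", len(set(pasting_sample)))
```

Because a bagging sample may contain repeats, it can hold fewer distinct rows than a pasting sample of the same size, which leaves the individual trees more varied.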


- Random Forests

In [29]:
from sklearn.ensemble import RandomForestClassifier

In [30]:
# Create a Random Forest classifier
random_forest_clf = RandomForestClassifier(
    n_estimators=100,  # Number of trees in the forest
    max_features='sqrt',  # Number of features to consider when looking for the best split
    random_state=22  # Random seed for reproducibility
)

# Train the Random Forest classifier
random_forest_clf.fit(X_train, y_train)

# Make predictions on the test data
pred = random_forest_clf.predict(X_test)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, pred)
precision = precision_score(y_test, pred)
recall = recall_score(y_test, pred)
f1 = f1_score(y_test, pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

Accuracy: 0.8048411497730711
Precision: 0.8294573643410853
Recall: 0.7833089311859444
F1-score: 0.8057228915662651
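
A random forest is closely related to the bagging ensemble used earlier: it is roughly a bagged set of decision trees in which each split considers only a random subset of the features. The sketch below places the two constructions side by side on synthetic data. It is an approximation for intuition only, since the two classes are not guaranteed to produce identical models; the dataset and the names `forest` and `bagged_trees` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: this is not the lab's dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=22)

# Random forest: bootstrap samples plus per-split feature subsampling
forest = RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=22)

# Rough analogue: bagged trees that also subsample features at each split.
# The base estimator is passed positionally because its keyword name
# differs between scikit-learn versions.
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt"),
    n_estimators=50,
    bootstrap=True,
    random_state=22,
)

forest.fit(X_demo, y_demo)
bagged_trees.fit(X_demo, y_demo)

print("Trees in forest:", len(forest.estimators_))
print("Trees in bagging ensemble:", len(bagged_trees.estimators_))
```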


- Gradient Boosting

In [31]:
from sklearn.ensemble import GradientBoostingClassifier

# Create a Gradient Boosting classifier
gradient_boosting_clf = GradientBoostingClassifier(
    n_estimators=100,  # Number of boosting stages (trees)
    learning_rate=0.1,  # Learning rate (shrinkage parameter)
    max_depth=3,  # Maximum depth of the individual trees
    random_state=22  # Random seed for reproducibility
)

# Train the Gradient Boosting classifier
gradient_boosting_clf.fit(X_train, y_train)

# Make predictions on the test data
pred = gradient_boosting_clf.predict(X_test)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, pred)
precision = precision_score(y_test, pred)
recall = recall_score(y_test, pred)
f1 = f1_score(y_test, pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

Accuracy: 0.8010590015128594
Precision: 0.7924791086350975
Recall: 0.8330893118594437
F1-score: 0.8122769450392576
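
Since gradient boosting adds trees one stage at a time, it can be useful to see how test accuracy evolves as stages are added. The following sketch uses `staged_predict` on synthetic data; the dataset and the names `gb`, `stage_accuracies`, and `best_stage` are assumptions for illustration, and the numbers it prints say nothing about the Spaceship Titanic results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: this is not the lab's dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=22)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=22
)

gb = GradientBoostingClassifier(
    n_estimators=50, learning_rate=0.1, max_depth=3, random_state=22
)
gb.fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., n stages
stage_accuracies = [
    accuracy_score(y_te, stage_pred) for stage_pred in gb.staged_predict(X_te)
]

# 1-based index of the first stage that reaches the highest test accuracy
best_stage = stage_accuracies.index(max(stage_accuracies)) + 1
print("Best number of stages on this split:", best_stage)
print("Accuracy with all stages:", stage_accuracies[-1])
```

Choosing the number of stages this way should be done on a validation set rather than the final test set, to avoid tuning against the data used for reporting.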


- Adaptive Boosting

In [32]:
from sklearn.ensemble import AdaBoostClassifier

# Create an AdaBoost classifier
adaboost_clf = AdaBoostClassifier(
    n_estimators=100,  # Number of weak learners (usually decision trees)
    learning_rate=1.0,  # Learning rate (contribution of each weak learner)
    random_state=22  # Random seed for reproducibility
)

# Train the AdaBoost classifier
adaboost_clf.fit(X_train, y_train)

# Make predictions on the test data
pred = adaboost_clf.predict(X_test)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, pred)
precision = precision_score(y_test, pred)
recall = recall_score(y_test, pred)
f1 = f1_score(y_test, pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

Accuracy: 0.7912254160363086
Precision: 0.7919655667144907
Recall: 0.808199121522694
F1-score: 0.8
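
The AdaBoost classifier above does not specify a base estimator, so scikit-learn falls back to its default weak learner, a decision tree of depth 1 (a "stump"). The short check below confirms this on synthetic data; the dataset and the names `ada` and `first_learner` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in data (assumption: this is not the lab's dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=22)

ada = AdaBoostClassifier(n_estimators=20, random_state=22)
ada.fit(X_demo, y_demo)

# Inspect the first fitted weak learner in the ensemble
first_learner = ada.estimators_[0]
learner_name = type(first_learner).__name__
stump_depth = first_learner.get_depth()

print(learner_name, "with max_depth =", first_learner.max_depth)
```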


Which model is the best and why?

* BaggingClassifier and Random Forest achieve the highest accuracy (both 0.8048).
* BaggingClassifier achieves the highest precision (0.8305), closely followed by Random Forest (0.8295).
* GradientBoosting achieves the highest recall (0.8331).
* GradientBoosting also achieves the highest F1-score (0.8123), ahead of Random Forest (0.8057) and BaggingClassifier (0.8054).

Based on these metrics, no single model is best on every measure: BaggingClassifier and Random Forest lead on accuracy and precision, while GradientBoosting leads on recall and F1-score. The differences between the models are relatively small (all five fall within about two percentage points of accuracy) and come from a single train/test split, so the ranking should be treated as tentative.
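
To make the comparison easier to check, the metrics printed in this Lab can be collected in one place and the top model per metric looked up programmatically. The values below are copied from the outputs above and rounded to four decimals; `results` and `best_models` are hypothetical names introduced for this sketch.

```python
# Metrics copied from the outputs printed in this Lab, rounded to 4 decimals
results = {
    "Bagging":           {"accuracy": 0.8048, "precision": 0.8305, "recall": 0.7818, "f1": 0.8054},
    "Pasting":           {"accuracy": 0.7837, "precision": 0.8021, "recall": 0.7716, "f1": 0.7866},
    "Random Forest":     {"accuracy": 0.8048, "precision": 0.8295, "recall": 0.7833, "f1": 0.8057},
    "Gradient Boosting": {"accuracy": 0.8011, "precision": 0.7925, "recall": 0.8331, "f1": 0.8123},
    "AdaBoost":          {"accuracy": 0.7912, "precision": 0.7920, "recall": 0.8082, "f1": 0.8000},
}


def best_models(results, metric):
    """Return every model that reaches the top score for a metric (ties included)."""
    top = max(scores[metric] for scores in results.values())
    return [name for name, scores in results.items() if scores[metric] == top]


winners = {
    metric: best_models(results, metric)
    for metric in ("accuracy", "precision", "recall", "f1")
}
for metric, names in winners.items():
    print(f"{metric}: {', '.join(names)}")
# accuracy: Bagging, Random Forest
# precision: Bagging
# recall: Gradient Boosting
# f1: Gradient Boosting
```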