# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see if you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same as before:
- Feature Scaling
- Feature Selection
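
The notebook below does not include an explicit scaling step, so here is a minimal sketch of what one could look like. It assumes `StandardScaler` is the scaler used in the previous Lab (the source does not say), and the `toy` frame with its values is a hypothetical stand-in, not the real data. The point it illustrates is fitting the scaler on the training rows only and reusing those statistics on the test rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for two spending columns (illustrative values only)
toy = pd.DataFrame({
    "RoomService": [0.0, 109.0, 43.0, 0.0, 303.0, 0.0, 42.0, 0.0],
    "Spa": [0.0, 549.0, 6715.0, 3329.0, 565.0, 291.0, 0.0, 0.0],
})
toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=42)

scaler = StandardScaler()
toy_train_scaled = scaler.fit_transform(toy_train)  # learn mean and std from training rows only
toy_test_scaled = scaler.transform(toy_test)        # reuse the training statistics on the test rows

print(toy_train_scaled.mean(axis=0).round(6))  # each training column is centred near 0
print(toy_train_scaled.std(axis=0).round(6))   # and has unit standard deviation
```

Tree-based ensembles split on thresholds, so they are largely insensitive to this kind of rescaling; the step matters mainly for keeping the comparison with earlier, scale-sensitive models consistent.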


In [3]:
spaceship.shape

(8693, 14)

In [4]:
#your code here
spaceship.dtypes

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object

In [5]:
spaceship.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

In [6]:
spaceship = spaceship.dropna()

In [7]:
spaceship['Cabin'] = spaceship['Cabin'].str.extract('([A-Z])')[0]
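
To make the `Cabin` step concrete, the sketch below applies the same `str.extract` pattern to the four distinct cabin strings visible in the `head()` output above. It keeps the first capital letter of each string, which is the deck. The `str.split` variant is an optional alternative, shown on the assumption that cabins follow a deck/number/side layout as they do in these samples.

```python
import pandas as pd

# Cabin strings copied from the head() output above
cabins = pd.Series(["B/0/P", "F/0/S", "A/0/S", "F/1/S"])

# Same pattern as the notebook: first capital letter, i.e. the deck
deck = cabins.str.extract(r"([A-Z])")[0]
print(deck.tolist())

# Alternative (assumes a deck/number/side layout): keep every part as its own column
parts = cabins.str.split("/", expand=True)
parts.columns = ["deck", "num", "side"]
print(parts)
```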

In [8]:
spaceship = spaceship.drop(columns=['PassengerId', 'Name'])

In [9]:
spaceship = pd.get_dummies(spaceship, columns=['HomePlanet', 'Destination', 'Cabin'], drop_first=True)
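
As a small illustration of what `drop_first=True` does, the sketch below encodes a hypothetical four-row `HomePlanet` column. The first category in sorted order is dropped and becomes the implicit baseline, so a row from that category has no indicator set.

```python
import pandas as pd

# Hypothetical toy column, not the real data
toy = pd.DataFrame({"HomePlanet": ["Europa", "Earth", "Mars", "Earth"]})
encoded = pd.get_dummies(toy, columns=["HomePlanet"], drop_first=True)

# 'Earth' sorts first, so it is dropped and becomes the baseline category
print(encoded.columns.tolist())
print(encoded)
```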

**Perform Train Test Split**

In [10]:
from sklearn.model_selection import train_test_split

# Define features and target variable
X = spaceship.drop(columns=['Transported'])  # Features: every column except the target
y = spaceship['Transported']                 # Target: whether the passenger was transported

# Train-test split
# With fewer than 100,000 rows I use an 80/20 split; with more I would use 70/30.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

# Check the split proportions from the actual set sizes instead of hardcoded counts
n_total = len(X_train) + len(X_test)
train_proportion = len(X_train) / n_total
test_proportion = len(X_test) / n_total

print(f"Proportion of training set: {train_proportion:.2f}")
print(f"Proportion of test set: {test_proportion:.2f}")

X_train shape: (5284, 19)
X_test shape: (1322, 19)
y_train shape: (5284,)
y_test shape: (1322,)
Proportion of training set: 0.80
Proportion of test set: 0.20



**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [13]:
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Bagging with the default base estimator (a DecisionTreeClassifier).
# bootstrap=True is the default, so this is bagging; pasting would use bootstrap=False.
bagging_model = BaggingClassifier(n_estimators=100, random_state=42)

# Fit the model
bagging_model.fit(X_train, y_train)

# Make predictions
y_pred_bagging = bagging_model.predict(X_test)

# Evaluate the model
print("Bagging Model Performance:")
print(confusion_matrix(y_test, y_pred_bagging))
print(classification_report(y_test, y_pred_bagging))

Bagging Model Performance:
[[528 125]
 [137 532]]
              precision    recall  f1-score   support

       False       0.79      0.81      0.80       653
        True       0.81      0.80      0.80       669

    accuracy                           0.80      1322
   macro avg       0.80      0.80      0.80      1322
weighted avg       0.80      0.80      0.80      1322
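
The cell above covers bagging only. As a sketch of the pasting half of this section, the block below trains both variants on synthetic data from `make_classification` (an assumption made so the example runs on its own; it is not the Spaceship Titanic data). The only difference is `bootstrap`: `True` samples rows with replacement (bagging) and `False` samples without replacement (pasting). The `max_samples=0.8` value is an arbitrary illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data so the block is self-contained
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Bagging: each tree is trained on rows drawn with replacement
bagging_demo = BaggingClassifier(
    n_estimators=50, max_samples=0.8, bootstrap=True, random_state=42
)
# Pasting: same subsample size, but rows are drawn without replacement
pasting_demo = BaggingClassifier(
    n_estimators=50, max_samples=0.8, bootstrap=False, random_state=42
)

scores = {}
for name, model in [("bagging", bagging_demo), ("pasting", pasting_demo)]:
    model.fit(Xd_train, yd_train)
    scores[name] = model.score(Xd_test, yd_test)  # test-set accuracy
    print(name, round(scores[name], 3))
```

To try pasting on the lab data, the same `bootstrap=False` and `max_samples` arguments could be passed to `bagging_model` and fitted on `X_train`, `y_train`.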



- Random Forests

In [17]:
from sklearn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier,
    AdaBoostClassifier,
)
from sklearn.metrics import classification_report, confusion_matrix

# Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
print("Random Forest Model Performance:")
print(confusion_matrix(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))

Random Forest Model Performance:
[[536 117]
 [134 535]]
              precision    recall  f1-score   support

       False       0.80      0.82      0.81       653
        True       0.82      0.80      0.81       669

    accuracy                           0.81      1322
   macro avg       0.81      0.81      0.81      1322
weighted avg       0.81      0.81      0.81      1322



- Gradient Boosting

In [18]:
#your code here
# Gradient Boosting
gb_model = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_model.fit(X_train, y_train)
y_pred_gb = gb_model.predict(X_test)
print("Gradient Boosting Model Performance:")
print(confusion_matrix(y_test, y_pred_gb))
print(classification_report(y_test, y_pred_gb))

Gradient Boosting Model Performance:
[[494 159]
 [ 96 573]]
              precision    recall  f1-score   support

       False       0.84      0.76      0.79       653
        True       0.78      0.86      0.82       669

    accuracy                           0.81      1322
   macro avg       0.81      0.81      0.81      1322
weighted avg       0.81      0.81      0.81      1322



- Adaptive Boosting

In [21]:
#your code here
# Adaptive Boosting (AdaBoost) with the default base estimator (a depth-1 decision tree)
ada_model = AdaBoostClassifier(n_estimators=100, random_state=42)
ada_model.fit(X_train, y_train)
y_pred_ada = ada_model.predict(X_test)
print("AdaBoost Model Performance:")
print(confusion_matrix(y_test, y_pred_ada))
print(classification_report(y_test, y_pred_ada))



AdaBoost Model Performance:
[[495 158]
 [112 557]]
              precision    recall  f1-score   support

       False       0.82      0.76      0.79       653
        True       0.78      0.83      0.80       669

    accuracy                           0.80      1322
   macro avg       0.80      0.80      0.80      1322
weighted avg       0.80      0.80      0.80      1322



Which model is the best and why?

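Random Forest, by a narrow margin:

Overall Performance: Random Forest has the highest test accuracy, with 1071 of 1322 passengers classified correctly (81.0%), just ahead of Gradient Boosting (1067, 80.7%), Bagging (1060, 80.2%), and AdaBoost (1052, 79.6%). The gap to Gradient Boosting is only four test passengers, so the ranking could change with a different split.
F1-Score: With an F1-score of 0.81 for both classes, Random Forest is the most balanced model; its precision and recall are within two points of each other for each class.
Generalization: Random Forest averages many decorrelated decision trees, which reduces variance relative to a single tree. Here it also improves slightly on plain Bagging (0.81 vs 0.80 accuracy).
Performance on True Positives: Random Forest has the highest precision for the transported class (0.82), but Gradient Boosting has the highest recall (0.86 vs 0.80) and a slightly higher F1-score for that class (0.82 vs 0.81). If missing a transported passenger is the costlier error, Gradient Boosting is the better choice.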
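
The rounded classification reports hide how close the models are, so the sketch below recomputes accuracy, precision, and recall for the transported class directly from the four confusion matrices printed above. It assumes scikit-learn's layout for `confusion_matrix` (rows are true labels, columns are predictions, with `False` before `True`).

```python
import numpy as np

# Confusion matrices copied from the outputs above: [[TN, FP], [FN, TP]]
confusion = {
    "Bagging": np.array([[528, 125], [137, 532]]),
    "Random Forest": np.array([[536, 117], [134, 535]]),
    "Gradient Boosting": np.array([[494, 159], [96, 573]]),
    "AdaBoost": np.array([[495, 158], [112, 557]]),
}

results = {}
for name, cm in confusion.items():
    tn, fp, fn, tp = cm.ravel()
    accuracy = (tp + tn) / cm.sum()
    precision = tp / (tp + fp)  # of predicted transported, how many really were
    recall = tp / (tp + fn)     # of truly transported, how many were found
    results[name] = (accuracy, precision, recall)
    print(f"{name:18s} accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

Differences this small rest on a handful of test rows from a single split, so a cross-validated comparison would give a more reliable ranking.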