# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see if you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [3]:
#your code here

# 1. Clean missing values
spaceship_clean = spaceship.dropna().copy()

# 2. Cabin Transformation (Cardinality Reduction)
# Extract the 'Deck' from the Cabin string (e.g., 'B/0/P' -> 'B')
spaceship_clean['Cabin_Deck'] = spaceship_clean['Cabin'].str[0]

# 3. Create Total Spending Feature (Aggregation)
spending_cols = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
spaceship_clean['Total_Spent'] = spaceship_clean[spending_cols].sum(axis=1)

# 4. Feature Selection: Drop non-predictive IDs and original Cabin string
spaceship_clean = spaceship_clean.drop(['PassengerId', 'Name', 'Cabin'], axis=1)

# 5. Encoding Categorical Variables
# Convert HomePlanet, CryoSleep, Destination, VIP, and Cabin_Deck into dummies
spaceship_final = pd.get_dummies(spaceship_clean, drop_first=True)

# 6. Target Conversion
spaceship_final['Transported'] = spaceship_final['Transported'].astype(int)
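
To make step 5 more concrete, here is a minimal sketch of what `pd.get_dummies(..., drop_first=True)` does on a toy frame. The rows below are hypothetical values, not taken from the dataset; only the column names mirror the ones used above.

```python
import pandas as pd

# Hypothetical toy rows, only to illustrate the encoding step
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Europa", "Mars", "Earth"],
    "Cabin": ["F/0/S", "B/0/P", "A/1/S", "G/3/P"],
    "Transported": [True, False, True, False],
})

# Same cardinality reduction as above: keep only the deck letter
toy["Cabin_Deck"] = toy["Cabin"].str[0]
toy = toy.drop(columns=["Cabin"])

# drop_first=True removes the first level of each categorical column,
# so k categories become k-1 indicator columns
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

The boolean `Transported` column is left untouched by `get_dummies`, which is why the explicit `astype(int)` conversion in step 6 is still needed.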

**Perform Train Test Split**

In [6]:
# 1. Separate features and target
X = spaceship_final.drop(['Transported'], axis=1)
y = spaceship_final['Transported'].astype(int)

# 2. Stratified Split (The "Fair" Comparison)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Scaling (kept for a fair comparison with the previous Lab;
#    tree-based ensembles are largely insensitive to feature scaling)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
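
As a quick check of how much scaling matters for tree-based models, the sketch below fits the same decision tree on raw and standardized versions of a synthetic dataset and compares the predictions. The data comes from `make_classification` and all `*_demo` names are illustrative assumptions, not part of the Lab.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Spaceship Titanic set)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Fit the scaler on the training split only, as in the cell above
demo_scaler = StandardScaler().fit(Xtr)

tree_raw = DecisionTreeClassifier(max_depth=4, random_state=0).fit(Xtr, ytr)
tree_scaled = DecisionTreeClassifier(max_depth=4, random_state=0).fit(
    demo_scaler.transform(Xtr), ytr
)

# Share of test rows on which both trees predict the same class
agreement = np.mean(
    tree_raw.predict(Xte) == tree_scaled.predict(demo_scaler.transform(Xte))
)
print(f"Share of identical predictions: {agreement:.3f}")
```

Standardization is a monotonic transformation of each feature, so the trees should choose equivalent splits and the two prediction sets should agree almost everywhere.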


**Model Selection** - now apply different ensemble methods to try to get a better model

- Bagging and Pasting

In [8]:
#your code here

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Initialize the "Base Expert" (usually a Decision Tree)
base_estimator = DecisionTreeClassifier(random_state=42)

# 2. BAGGING (bootstrap=True)
bagging_model = BaggingClassifier(
    estimator=base_estimator, 
    n_estimators=100, 
    bootstrap=True, 
    random_state=42
)

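# 3. PASTING (bootstrap=False)
# Note: max_samples keeps its default of 1.0 here, so every tree is trained on
# the full training set; set max_samples below 1.0 to draw distinct subsets.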
pasting_model = BaggingClassifier(
    estimator=base_estimator, 
    n_estimators=100, 
    bootstrap=False, 
    random_state=42
)

# 4. Train and Evaluate
bagging_model.fit(X_train_scaled, y_train)
pasting_model.fit(X_train_scaled, y_train)

print(f"Bagging Accuracy: {accuracy_score(y_test, bagging_model.predict(X_test_scaled)):.4f}")
print(f"Pasting Accuracy: {accuracy_score(y_test, pasting_model.predict(X_test_scaled)):.4f}")

Bagging Accuracy: 0.7753
Pasting Accuracy: 0.7602


- Random Forests

In [9]:
#your code here

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# 1. Initialize the Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)

# 2. Train the model
rf_model.fit(X_train_scaled, y_train)

# 3. Predict and Evaluate
rf_preds = rf_model.predict(X_test_scaled)

print(f"Random Forest Accuracy: {accuracy_score(y_test, rf_preds):.4f}")
print("\nDetailed Report:\n", classification_report(y_test, rf_preds))

Random Forest Accuracy: 0.7708

Detailed Report:
               precision    recall  f1-score   support

           0       0.77      0.77      0.77       656
           1       0.77      0.77      0.77       666

    accuracy                           0.77      1322
   macro avg       0.77      0.77      0.77      1322
weighted avg       0.77      0.77      0.77      1322
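
A fitted random forest also exposes impurity-based feature importances, which can help explain what drives the predictions. The sketch below uses synthetic data and hypothetical feature names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

rf_demo = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Importances are normalized to sum to 1 across all features
importances = pd.Series(
    rf_demo.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

In this notebook the equivalent call would pair `rf_model.feature_importances_` with `X.columns`, since `StandardScaler` returns a NumPy array but preserves the column order.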



- Gradient Boosting

In [10]:
#your code here

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report

# 1. Initialize the Gradient Boosting Classifier
gb_model = GradientBoostingClassifier(
    n_estimators=100, 
    learning_rate=0.1, 
    max_depth=3, 
    random_state=42
)

# 2. Train the model
gb_model.fit(X_train_scaled, y_train)

# 3. Predict and Evaluate
gb_preds = gb_model.predict(X_test_scaled)

print(f"Gradient Boosting Accuracy: {accuracy_score(y_test, gb_preds):.4f}")
print("\nClassification Report:\n", classification_report(y_test, gb_preds))

Gradient Boosting Accuracy: 0.7844

Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.73      0.77       656
           1       0.76      0.84      0.80       666

    accuracy                           0.78      1322
   macro avg       0.79      0.78      0.78      1322
weighted avg       0.79      0.78      0.78      1322
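
Because gradient boosting adds trees one at a time, accuracy can be tracked after each stage with `staged_predict`. This is a minimal sketch on synthetic data; the held-out split here plays the role of a validation set, and `best_n` is an illustrative name.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the Spaceship Titanic set)
X_demo, y_demo = make_classification(n_samples=800, n_features=10, random_state=0)
Xtr, Xval, ytr, yval = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

gb_demo = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
).fit(Xtr, ytr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 100 trees
staged_acc = [accuracy_score(yval, preds) for preds in gb_demo.staged_predict(Xval)]

# Number of trees at which validation accuracy first peaks
best_n = max(range(len(staged_acc)), key=staged_acc.__getitem__) + 1
print(f"Best validation accuracy {max(staged_acc):.4f} at {best_n} trees")
```

Choosing `n_estimators` this way should rely on a validation split rather than the test set, otherwise the final test accuracy becomes optimistic.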



- Adaptive Boosting

In [11]:
#your code here

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Initialize the base "stump" (a depth-1 tree; optional, but good for control)
base_stump = DecisionTreeClassifier(max_depth=1)

# 2. Initialize AdaBoost
ada_model = AdaBoostClassifier(
    estimator=base_stump,
    n_estimators=100, 
    learning_rate=1.0, 
    random_state=42
)

# 3. Train and Predict
ada_model.fit(X_train_scaled, y_train)
ada_preds = ada_model.predict(X_test_scaled)

print(f"AdaBoost Accuracy: {accuracy_score(y_test, ada_preds):.4f}")

AdaBoost Accuracy: 0.7829
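
The core idea of AdaBoost is the reweighting of samples after each weak learner. The sketch below works through one round of the classic discrete AdaBoost update by hand, using made-up labels and predictions. scikit-learn's implementation differs in some details, so this is only an illustration of the principle.

```python
import numpy as np

# Hypothetical labels and stump predictions in {-1, +1}; two mistakes out of eight
y_true = np.array([1, 1, -1, -1, 1, -1, 1, -1])
y_pred = np.array([1, -1, -1, -1, 1, 1, 1, -1])

# Start with uniform sample weights
w = np.full(len(y_true), 1 / len(y_true))

# Weighted error of this weak learner
miss = y_true != y_pred
err = w[miss].sum()

# Learner weight: larger when the weighted error is smaller
alpha = 0.5 * np.log((1 - err) / err)

# Increase weights of misclassified samples, decrease the rest, then normalize
w_new = w * np.exp(-alpha * y_true * y_pred)
w_new /= w_new.sum()

print(f"weighted error = {err:.2f}, alpha = {alpha:.3f}")
print(f"weight now on misclassified samples = {w_new[miss].sum():.2f}")
```

After the update the misclassified samples carry half of the total weight, which forces the next stump to concentrate on them.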


In [12]:

# Comparing all results
final_comparison = {
    "Bagging": accuracy_score(y_test, bagging_model.predict(X_test_scaled)),
    "Random Forest": accuracy_score(y_test, rf_model.predict(X_test_scaled)),
    "Gradient Boosting": accuracy_score(y_test, gb_model.predict(X_test_scaled)),
    "AdaBoost": accuracy_score(y_test, ada_preds)
}


comparison_df = pd.DataFrame.from_dict(final_comparison, orient='index', columns=['Accuracy'])
print(comparison_df.sort_values(by='Accuracy', ascending=False))

                   Accuracy
Gradient Boosting  0.784418
AdaBoost           0.782905
Bagging            0.775340
Random Forest      0.770802
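
The accuracies above come from a single train/test split and differ by little, so a cross-validated comparison can give a steadier picture. A minimal sketch on synthetic data follows; the smaller `n_estimators=50` and the 5-fold setup are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption: not the Spaceship Titanic set)
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)

# Same folds for every model, so the scores are comparable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
}

cv_summary = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```

If the standard deviation across folds is larger than the gap between two models, the ranking between them is not reliable.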


Which model is the best and why?

In [14]:
#comment here
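print("Gradient Boosting is the best model on this test split (accuracy 0.7844), narrowly ahead of AdaBoost (0.7829); both boosting methods outperform Bagging (0.7753) and Random Forest (0.7708). Boosting trains weak learners sequentially, so each new learner focuses on the errors of the previous ones, whereas Bagging and Random Forest average independently trained trees. The gap is under 1.5 percentage points on a single split, so the ranking should be treated as tentative.")


Gradient Boosting is the best model on this test split (accuracy 0.7844), narrowly ahead of AdaBoost (0.7829); both boosting methods outperform Bagging (0.7753) and Random Forest (0.7708). Boosting trains weak learners sequentially, so each new learner focuses on the errors of the previous ones, whereas Bagging and Random Forest average independently trained trees. The gap is under 1.5 percentage points on a single split, so the ranking should be treated as tentative.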
