# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [12]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.metrics import accuracy_score, roc_auc_score


In [13]:
spaceship = pd.read_csv('https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv')
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


Now perform the same as before:
- Feature Scaling
- Feature Selection
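As a reminder of what the scaling step can look like, here is a minimal sketch. It assumes standardization with `StandardScaler`, which may differ from what you used in the previous Lab, and it runs on a small made-up table (`toy`) rather than the real data. The point it illustrates is fitting the scaler on the training split only, so that no test-set statistics leak into training.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for a few numeric columns (values are illustrative only)
toy = pd.DataFrame({
    "Age": [39.0, 24.0, 58.0, 33.0, 16.0, 44.0, 26.0, 28.0],
    "RoomService": [0.0, 109.0, 43.0, 0.0, 303.0, 0.0, 42.0, 0.0],
    "Spa": [0.0, 549.0, 6715.0, 3329.0, 565.0, 291.0, 0.0, 0.0],
})

toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=42)

scaler = StandardScaler()
# learn mean and standard deviation from the training rows only
toy_train_scaled = scaler.fit_transform(toy_train)
# reuse those training statistics on the test rows
toy_test_scaled = scaler.transform(toy_test)

print(toy_train_scaled.mean(axis=0))  # approximately zero for every column
```

Tree-based ensembles are largely insensitive to feature scale, so this step mainly matters for keeping the comparison with the previous Lab consistent.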

In [14]:
# drop rows with missing values and record how many were removed
initial_row_count = len(spaceship)
spaceship = spaceship.dropna()
final_row_count = len(spaceship)
dropped_rows = initial_row_count - final_row_count
# print(f'Dropped rows: {dropped_rows}\n{spaceship}')

In [15]:
# 'Deck' column to binaries

# extract the deck, i.e. the first letter of 'Cabin' (e.g. 'B' from 'B/0/P')
spaceship['Deck'] = spaceship['Cabin'].str.extract(r'^([A-GT])', expand=False)

# create one binary column per deck (Deck_A, Deck_B, ...)
spaceship = pd.get_dummies(spaceship, columns=['Deck'], prefix='Deck')

# spaceship.head()

In [16]:
# drop columns
spaceship = spaceship.drop(columns=['PassengerId', 'Name'])

# spaceship.head()

In [24]:
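# dummies from categorical data
# note: 'Cabin' is nearly unique per passenger, so encoding it adds one column
# per cabin value and accounts for most of the 5329 feature columns
spaceship_dummies = pd.get_dummies(spaceship, columns=['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP'])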

spaceship_dummies.head()

| | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported | Deck_A | Deck_B | Deck_C | ... | Cabin_G/998/S | Cabin_G/999/P | Cabin_G/999/S | Cabin_T/1/P | Cabin_T/3/P | Destination_55 Cancri e | Destination_PSO J318.5-22 | Destination_TRAPPIST-1e | VIP_False | VIP_True |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | False | False | True | False | ... | False | False | False | False | False | False | False | True | True | False |
| 1 | 24.0 | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | True | False | False | False | ... | False | False | False | False | False | False | False | True | True | False |
| 2 | 58.0 | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | False | True | False | False | ... | False | False | False | False | False | False | False | True | False | True |
| 3 | 33.0 | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | False | True | False | False | ... | False | False | False | False | False | False | False | True | True | False |
| 4 | 16.0 | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | True | False | False | False | ... | False | False | False | False | False | False | False | True | True | False |
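One-hot encoding the full `Cabin` string produces a column for every distinct cabin. A possible alternative, sketched here on a few sample values, is to split the string into its parts and encode only the low-cardinality ones. This assumes every cabin follows the `deck/number/side` pattern visible in the sample rows; the column names `CabinNum` and `Side` are illustrative and not part of the original notebook.

```python
import pandas as pd

# A few cabin values copied from the sample rows above
toy = pd.DataFrame({"Cabin": ["B/0/P", "F/0/S", "A/0/S", "A/0/S", "F/1/S"]})

# split 'deck/number/side' into three separate columns
parts = toy["Cabin"].str.split("/", expand=True)
parts.columns = ["Deck", "CabinNum", "Side"]
parts["CabinNum"] = parts["CabinNum"].astype(int)

# encode only the low-cardinality parts; keep the cabin number numeric
encoded = pd.get_dummies(parts, columns=["Deck", "Side"])
print(encoded.columns.tolist())
```

Switching to this encoding would change the feature matrix, so the scores reported later in this notebook would no longer apply as written.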


**Perform Train Test Split**

In [18]:
# separate features (X) and target (y)
X = spaceship_dummies.drop(columns=['Transported'])  # drop 'Transported' column as it's the target
y = spaceship_dummies['Transported']  # define target column

# train-test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# display resulting shapes
print(f'Training Features Shape: {X_train.shape}')
print(f'Testing Features Shape: {X_test.shape}')
print(f'Training Target Shape: {y_train.shape}')
print(f'Testing Target Shape: {y_test.shape}')

Training Features Shape: (5284, 5329)
Testing Features Shape: (1322, 5329)
Training Target Shape: (5284,)
Testing Target Shape: (1322,)


**Model Selection** - now apply different ensemble methods to try to get a better model

- Bagging and Pasting

In [19]:
# initialize models
bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.8,
    bootstrap=True,  # Bagging
    random_state=42
)

pasting_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.8,
    bootstrap=False,  # Pasting
    random_state=42
)

# fit models
bagging_model.fit(X_train, y_train)
pasting_model.fit(X_train, y_train)

# predict & evaluate
y_pred_bagging = bagging_model.predict(X_test)
y_pred_pasting = pasting_model.predict(X_test)

print('Bagging Accuracy:', accuracy_score(y_test, y_pred_bagging))
print('Bagging ROC-AUC:', roc_auc_score(y_test, bagging_model.predict_proba(X_test)[:, 1]))
print('Pasting Accuracy:', accuracy_score(y_test, y_pred_pasting))
print('Pasting ROC-AUC:', roc_auc_score(y_test, pasting_model.predict_proba(X_test)[:, 1]))


Bagging Accuracy: 0.8093797276853253
Bagging ROC-AUC: 0.882405455332065
Pasting Accuracy: 0.81089258698941
Pasting ROC-AUC: 0.8730351121762957


### Results

You ran two methods, *Bagging* and *Pasting*, to build a model that predicts whether someone was 'transported' or not (a *yes* or *no* question). Here's what happened:

1. **Bagging:**
   - **Accuracy:** ~81% (0.809)
     - The model classified about 81 of every 100 test passengers correctly (1,070 of 1,322).
   - **ROC-AUC:** ~0.88 (0.882)
     - This measures how well the model's predicted probabilities rank *yes* cases above *no* cases. A perfect score is 1.0 and random guessing scores 0.5, so 0.88 is pretty good.

2. **Pasting:**
   - **Accuracy:** ~81% (0.811)
     - Nearly identical to Bagging: 1,072 of 1,322 correct, a difference of only two predictions.
   - **ROC-AUC:** ~0.87 (0.873)
     - Slightly lower than Bagging, meaning it's a tiny bit worse at ranking *yes* above *no*.
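To make the ROC-AUC number more concrete, it can be read as the fraction of (*yes*, *no*) pairs in which the *yes* case receives the higher predicted score. The sketch below computes that fraction directly on a tiny made-up example; `pairwise_auc` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y]
    neg = [s for y, s in zip(y_true, scores) if not y]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities
y_true = [True, True, True, False, False]
scores = [0.9, 0.8, 0.3, 0.4, 0.1]

auc = pairwise_auc(y_true, scores)
print(auc)  # 5 of the 6 pairs are ordered correctly
```

This quadratic loop is only practical for small inputs; on real data, `roc_auc_score` is the better choice.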



#### What Does This Mean?
- Both methods did almost equally well at predicting whether someone was 'transported'.
- **Bagging**:
  - Is slightly better at separating the two groups (yes/no), based on the ROC-AUC score.
  - It uses 'sampling with replacement', meaning each tree's training sample can contain the same row more than once while leaving other rows out.
- **Pasting**:
  - Has marginally better accuracy (0.811 vs. 0.809) but a lower ROC-AUC score, meaning it's not as good at distinguishing between the groups.
  - It uses 'sampling without replacement', meaning no row appears twice within a single tree's sample. With `max_samples=0.8`, each tree still sees a different 80% subset of the training rows.
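The difference between the two sampling schemes can be seen by drawing row indices directly. This is a sketch of the idea using NumPy, not a reproduction of scikit-learn's internal random draws. The row count matches the training split above, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

n_rows = 5284                # size of the training split
n_draw = int(0.8 * n_rows)   # max_samples=0.8

# Bagging: indices drawn with replacement, so repeats are possible
bagging_idx = rng.choice(n_rows, size=n_draw, replace=True)

# Pasting: indices drawn without replacement, so every index is distinct
pasting_idx = rng.choice(n_rows, size=n_draw, replace=False)

print("rows drawn per tree:    ", n_draw)
print("distinct rows (bagging):", len(np.unique(bagging_idx)))
print("distinct rows (pasting):", len(np.unique(pasting_idx)))
```

With replacement, a noticeable share of the draws are repeats, so each bagged tree sees fewer distinct rows than a pasted tree does. This tends to make the individual trees more diverse.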



#### Which one is Better?
- **Accuracy (~81%) is very similar for both,** so they perform about the same in terms of how often they are 'right'. A gap of 0.0015 on 1,322 test rows amounts to two predictions, which is too small to favor either method.
- However, if you care more about **ROC-AUC (0.882 vs. 0.873),** Bagging is slightly better at telling apart *yes* and *no*.
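To judge how much weight the accuracy gap deserves, it helps to convert the reported accuracies back into counts and compare the gap with the sampling noise of a single test split. The sketch below uses the scores printed above. The standard-error formula is the usual binomial approximation, which treats test rows as independent and is only a rough guide.

```python
import math

n_test = 1322                      # rows in the test split
bagging_acc = 0.8093797276853253   # reported above
pasting_acc = 0.81089258698941     # reported above

# convert accuracies back to numbers of correct predictions
bagging_correct = round(bagging_acc * n_test)
pasting_correct = round(pasting_acc * n_test)

# rough binomial standard error of one accuracy estimate
std_err = math.sqrt(bagging_acc * (1 - bagging_acc) / n_test)

print("correct (bagging):", bagging_correct)
print("correct (pasting):", pasting_correct)
print("accuracy gap:     ", round(pasting_acc - bagging_acc, 4))
print("approx. std. err.:", round(std_err, 4))
```

The gap is several times smaller than the approximate standard error, which is consistent with treating the two methods as tied on accuracy.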

- Random Forests

In [20]:
#your code here
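One possible starting point is sketched below on synthetic data from `make_classification`, so that it runs on its own. The names `X_demo`, `X_tr`, `rf_model` and the hyperparameters are illustrative assumptions. In the Lab you would fit on `X_train`, `y_train` and evaluate on `X_test`, `y_test` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and target
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# A random forest is bagging of decision trees plus random feature subsets per split
rf_model = RandomForestClassifier(
    n_estimators=100,
    oob_score=True,   # also score each row using only the trees that never saw it
    random_state=42,
)
rf_model.fit(X_tr, y_tr)

rf_acc = accuracy_score(y_te, rf_model.predict(X_te))
rf_auc = roc_auc_score(y_te, rf_model.predict_proba(X_te)[:, 1])

print("Random Forest Accuracy:", rf_acc)
print("Random Forest ROC-AUC:", rf_auc)
print("Out-of-bag accuracy:", rf_model.oob_score_)
```

The out-of-bag score gives a validation-style estimate without touching the test set, which is useful when comparing several hyperparameter settings.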

- Gradient Boosting

In [21]:
#your code here
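A minimal gradient boosting sketch follows, again on synthetic data so it is self-contained. The variable names and hyperparameter values are illustrative assumptions rather than tuned choices. In the Lab, swap in `X_train`, `X_test`, `y_train` and `y_test`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and target
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Trees are added one at a time, each fitted to the errors of the ensemble so far
gb_model = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,  # shrinks each tree's contribution
    max_depth=3,        # shallow trees act as weak learners
    random_state=42,
)
gb_model.fit(X_tr, y_tr)

gb_acc = accuracy_score(y_te, gb_model.predict(X_te))
gb_auc = roc_auc_score(y_te, gb_model.predict_proba(X_te)[:, 1])

print("Gradient Boosting Accuracy:", gb_acc)
print("Gradient Boosting ROC-AUC:", gb_auc)
```

A smaller `learning_rate` usually needs a larger `n_estimators` to reach the same fit, so the two are best tuned together.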

- Adaptive Boosting

In [22]:
#your code here
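An AdaBoost sketch in the same style is shown below, on synthetic data with illustrative names and settings. It relies on scikit-learn's default base learner, a depth-1 decision tree (a "stump"). In the Lab, replace the synthetic split with `X_train`, `X_test`, `y_train` and `y_test`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and target
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Each round reweights the training rows so the next stump focuses on past mistakes
ada_model = AdaBoostClassifier(n_estimators=100, random_state=42)
ada_model.fit(X_tr, y_tr)

ada_acc = accuracy_score(y_te, ada_model.predict(X_te))
ada_auc = roc_auc_score(y_te, ada_model.predict_proba(X_te)[:, 1])

print("AdaBoost Accuracy:", ada_acc)
print("AdaBoost ROC-AUC:", ada_auc)
```

Reporting the same two metrics for every model keeps the final comparison with Bagging and Pasting like-for-like.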

Which model is the best and why?

In [23]:
#comment here