# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see if you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [8]:
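# Drop every row that has at least one missing value
spaceship = spaceship.dropna()
# Keep only the deck letter from Cabin (values look like deck/num/side, e.g. B/0/P)
spaceship['Cabin'] = spaceship['Cabin'].str.split('/').str[0]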

In [10]:
# Select only the numerical columns as features
spaceship_numerical = spaceship.select_dtypes(include=['number'])
X = spaceship_numerical

In [12]:
#define target
y = spaceship["Transported"]


In [17]:
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Fit the scaler on X and transform the data
X_scaled = scaler.fit_transform(X)


In [21]:
# Wrap the scaled array in a DataFrame, keeping X's column names and row index
X_scaled = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)


In [23]:
from sklearn.feature_selection import SelectKBest, f_classif

# Choose how many top features you want to keep (e.g., 5 or 10)
selector = SelectKBest(score_func=f_classif, k=5)

# Fit the selector on the scaled features
X_selected = selector.fit_transform(X_scaled, y)

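# Get the names of the selected features and rebuild a labelled DataFrame
# (index=X.index keeps the rows aligned with y, which still has the post-dropna index)
selected_features = X.columns[selector.get_support()]
X_selected_df = pd.DataFrame(X_selected, columns=selected_features, index=X.index)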

# Show selected features
print("Selected features:", list(selected_features))


Selected features: ['Age', 'RoomService', 'FoodCourt', 'Spa', 'VRDeck']
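
To see why one feature was dropped, it can help to look at the scores behind the selection. The sketch below is only an illustration: it uses a synthetic dataset from `make_classification` as a stand-in for the six scaled numeric columns, and the names `X_demo`, `y_demo`, `selector_demo` and `ranking` are made up for this example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for six numeric features (assumption, not the lab data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
X_demo = pd.DataFrame(X_demo, columns=[f"feat_{i}" for i in range(6)])

# Same selector settings as in the lab: ANOVA F-test, keep the top 5
selector_demo = SelectKBest(score_func=f_classif, k=5).fit(X_demo, y_demo)

# One row per feature: F-score, p-value, and whether the selector kept it
ranking = pd.DataFrame(
    {
        "f_score": selector_demo.scores_,
        "p_value": selector_demo.pvalues_,
        "selected": selector_demo.get_support(),
    },
    index=X_demo.columns,
).sort_values("f_score", ascending=False)
print(ranking)
```

Running the same three attributes (`scores_`, `pvalues_`, `get_support()`) on the lab's fitted `selector` would show how far the dropped column sits below the five that were kept.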


**Perform Train Test Split**

In [25]:
# Split the data (80% train, 20% test is common)
X_train, X_test, y_train, y_test = train_test_split(
    X_selected_df, y, test_size=0.2, random_state=42, stratify=y
)

# Check the shape
print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)

Train shape: (5284, 5)
Test shape: (1322, 5)
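
Note that the scaler and the selector above were fitted on all rows before the split, so the test rows influenced the preprocessing. The effect is probably small here, but one common way to avoid it is to split first and put the preprocessing steps in a `Pipeline`. This is a minimal sketch on synthetic data; `X_demo`, `y_demo`, `pipe` and the step names are assumptions for illustration, not part of the lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the six numeric features (assumption, not the lab data)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=0, random_state=42
)

# Split first, so the test rows never reach the scaler or the selector
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

pipe = Pipeline(
    [
        ("scale", StandardScaler()),
        ("select", SelectKBest(score_func=f_classif, k=5)),
        ("model", GradientBoostingClassifier(random_state=42)),
    ]
)

# fit() learns scaling statistics and feature scores from the training rows only
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.4f}")
```

Any of the ensemble classifiers used in this lab could take the place of the final `"model"` step.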


**Model Selection** - now apply different ensemble methods and check whether any of them gives a better model

- Bagging and Pasting

In [31]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# scikit-learn >= 1.2 names this parameter 'estimator'; older releases call it 'base_estimator'
bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True,  # Bagging: with replacement
    random_state=42
)

bagging_model.fit(X_train, y_train)
y_pred_bagging = bagging_model.predict(X_test)
acc_bagging = accuracy_score(y_test, y_pred_bagging)
print("Bagging Accuracy:", acc_bagging)


Bagging Accuracy: 0.7723146747352496


- Random Forests

In [29]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
acc_rf = accuracy_score(y_test, y_pred_rf)
print("Random Forest Accuracy:", acc_rf)


Random Forest Accuracy: 0.7715582450832073
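
A random forest can also estimate its own generalization accuracy from the rows each tree did not see (the out-of-bag rows), and report how much each feature contributed to the splits. The sketch below uses synthetic data; `X_demo`, `y_demo`, `rf_demo` and the feature names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the five selected features (assumption, not the lab data)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=0, random_state=42
)
feature_names = [f"feat_{i}" for i in range(5)]

# oob_score=True scores each row using only the trees that did not train on it
rf_demo = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf_demo.fit(X_demo, y_demo)
print(f"Out-of-bag accuracy: {rf_demo.oob_score_:.4f}")

# Impurity-based importances, normalized to sum to 1
importances = pd.Series(rf_demo.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```

The out-of-bag score needs no separate validation split, which makes it a cheap sanity check next to the test accuracy.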


- Gradient Boosting

In [33]:
from sklearn.ensemble import GradientBoostingClassifier

gb_model = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_model.fit(X_train, y_train)
y_pred_gb = gb_model.predict(X_test)
acc_gb = accuracy_score(y_test, y_pred_gb)
print("Gradient Boosting Accuracy:", acc_gb)


Gradient Boosting Accuracy: 0.7829046898638427
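
Because gradient boosting adds trees one at a time, accuracy can be tracked after each added tree to check whether 100 estimators is too few or too many. The sketch below does this with `staged_predict` on a validation split of synthetic data; `X_demo`, `y_demo`, `gb_demo`, `staged_acc` and `best_n` are names invented for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the five selected features (assumption, not the lab data)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=0, random_state=42
)
# Hold out a validation set so the test set is not used to pick n_estimators
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

gb_demo = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_demo.fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 100 trees
staged_acc = [
    accuracy_score(y_val, y_stage) for y_stage in gb_demo.staged_predict(X_val)
]
best_n = int(np.argmax(staged_acc)) + 1  # first tree count reaching the best score
print(f"Best validation accuracy {max(staged_acc):.4f} at {best_n} trees")
```

If the best score is reached well before the last tree, a smaller `n_estimators` (or a lower `learning_rate` with more trees) may be worth trying.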


- Adaptive Boosting

In [35]:
from sklearn.ensemble import AdaBoostClassifier

ada_model = AdaBoostClassifier(n_estimators=100, random_state=42)
ada_model.fit(X_train, y_train)
y_pred_ada = ada_model.predict(X_test)
acc_ada = accuracy_score(y_test, y_pred_ada)
print("AdaBoost Accuracy:", acc_ada)




AdaBoost Accuracy: 0.7783661119515886


Which model is the best and why?

In [38]:
print("\nModel Comparison:")
print(f"Bagging Accuracy           : {acc_bagging:.4f}")
print(f"Random Forest Accuracy     : {acc_rf:.4f}")
print(f"Gradient Boosting Accuracy : {acc_gb:.4f}")
print(f"AdaBoost Accuracy          : {acc_ada:.4f}")



Model Comparison:
Bagging Accuracy           : 0.7723
Random Forest Accuracy     : 0.7716
Gradient Boosting Accuracy : 0.7829
AdaBoost Accuracy          : 0.7784


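Gradient Boosting is the best model on this test split, with an accuracy of 0.7829, followed by AdaBoost (0.7784), Bagging (0.7723) and Random Forest (0.7716).

The margin is small: on 1,322 test rows, the gap between the best and the worst model is 15 predictions, so a different `random_state` could change the ranking.

Both boosting methods train their trees sequentially, each one focusing on the errors left by the previous ones, which mainly reduces bias.

Bagging and Random Forest average many independently trained deep trees, which mainly reduces variance compared with a single tree.

With only five features, Random Forest's per-split feature subsampling has little room to decorrelate the trees, which is consistent with it scoring almost the same as plain Bagging.

Gradient Boosting is more sensitive to noise and harder to tune than Random Forest, and AdaBoost is sensitive to outliers, so neither is guaranteed to win on other data.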

