# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see if you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.tree import DecisionTreeClassifier

In [3]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same preprocessing as before:
- Feature Scaling
- Feature Selection


In [6]:
#your code here
# Drop 'Name', 'Cabin' and 'PassengerId': identifiers and free text that are too specific or messy to use directly
spaceship.drop(['Name', 'Cabin', 'PassengerId'], axis=1, inplace=True)

# Fill numerical features with median
num_features = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
spaceship[num_features] = spaceship[num_features].fillna(spaceship[num_features].median())

# Fill categorical features with mode
cat_features = ['HomePlanet', 'CryoSleep', 'Destination', 'VIP']
spaceship[cat_features] = spaceship[cat_features].fillna(spaceship[cat_features].mode().iloc[0])

# Encode categorical columns
spaceship['CryoSleep'] = spaceship['CryoSleep'].map({True: 1, False: 0})
spaceship['VIP'] = spaceship['VIP'].map({True: 1, False: 0})
spaceship['Transported'] = spaceship['Transported'].map({True: 1, False: 0})
spaceship = pd.get_dummies(spaceship, columns=['HomePlanet', 'Destination'], drop_first=True)

# Scale numerical features
scaler = StandardScaler()
spaceship[num_features] = scaler.fit_transform(spaceship[num_features])

# Check cleaned data
spaceship.head()




Unnamed: 0,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,HomePlanet_Europa,HomePlanet_Mars,Destination_PSO J318.5-22,Destination_TRAPPIST-1e
0,0,0.711945,0,-0.333105,-0.281027,-0.283579,-0.270626,-0.263003,0,True,False,False,True
1,0,-0.334037,0,-0.168073,-0.275387,-0.241771,0.217158,-0.224205,1,False,False,False,True
2,0,2.036857,1,-0.268001,1.959998,-0.283579,5.695623,-0.219796,0,True,False,False,True
3,0,0.293552,0,-0.333105,0.52301,0.336851,2.687176,-0.092818,0,True,False,False,True
4,0,-0.891895,0,0.125652,-0.237159,-0.031059,0.231374,-0.26124,1,False,False,False,True


**Perform Train Test Split**

In [9]:
#your code here
# Define features and target
X = spaceship.drop('Transported', axis=1)
y = spaceship['Transported']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
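
In the cells above, `StandardScaler` is fit on the full dataset before the split, so the test rows contribute to the scaling statistics. Tree-based ensembles are largely insensitive to feature scaling, so the effect here is probably small. Still, a minimal sketch of the leak-free ordering may be useful. It uses synthetic data from `make_classification` as a stand-in for the Spaceship Titanic features, and every name ending in `_demo` or starting with `X_tr`/`X_te` is illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the lab's feature matrix (assumption: not the real data)
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=42)

# Split first, so the test rows stay unseen during preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit the scaler on the training rows only, then reuse it on the test rows
scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

# Training columns are centered near zero; test columns generally are not exactly
print(X_tr_scaled.mean(axis=0).round(6))
print(X_te_scaled.mean(axis=0).round(3))
```

The same ordering applies to imputation: medians and modes would be computed on the training rows and then applied to the test rows.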

**Model Selection** - now apply different ensemble methods to try to get a better model

- Bagging and Pasting

In [27]:
#your code here
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bagging_model = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=100, random_state=42)
bagging_model.fit(X_train, y_train)
y_pred_bagging = bagging_model.predict(X_test)
print("Bagging Accuracy:", accuracy_score(y_test, y_pred_bagging))

Bagging Accuracy: 0.7780333525014376
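
The cell above uses bagging, meaning each tree is trained on a sample drawn with replacement (the `BaggingClassifier` default). Pasting draws each tree's training subset without replacement instead, which in scikit-learn corresponds to `bootstrap=False` together with `max_samples` below 1.0. The following is a minimal sketch on synthetic data. The value `max_samples=0.5` and the `_demo` names are assumptions chosen for illustration, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Spaceship Titanic set)
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

pasting_demo = BaggingClassifier(
    DecisionTreeClassifier(),  # base estimator, passed positionally
    n_estimators=50,
    max_samples=0.5,           # each tree sees half of the training rows
    bootstrap=False,           # sample WITHOUT replacement -> pasting
    random_state=42,
)
pasting_demo.fit(X_tr, y_tr)

pasting_acc = accuracy_score(y_te, pasting_demo.predict(X_te))
print("Pasting accuracy (synthetic data):", pasting_acc)
```

With `bootstrap=False` and the default `max_samples=1.0`, every tree would receive the full training set, so a smaller `max_samples` is what gives the ensemble its diversity.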


- Random Forests

In [16]:
#your code here
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))


Random Forest Accuracy: 0.7791834387579069
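
A fitted `RandomForestClassifier` exposes impurity-based `feature_importances_`, which is one possible input to the feature selection step listed earlier. The sketch below ranks features on synthetic data. The column names `f0` to `f5` are placeholders, and impurity-based importances can be biased toward features with many distinct values, so the ranking is best read as a rough guide.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with placeholder column names (assumption)
X_arr, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

rf_demo = RandomForestClassifier(n_estimators=100, random_state=42)
rf_demo.fit(X_demo, y_demo)

# Importances are normalized to sum to 1; sort to see the strongest features first
importances = pd.Series(
    rf_demo.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```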


- Gradient Boosting

In [19]:
#your code here
gb_model = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_model.fit(X_train, y_train)
y_pred_gb = gb_model.predict(X_test)
print("Gradient Boosting Accuracy:", accuracy_score(y_test, y_pred_gb))

Gradient Boosting Accuracy: 0.7837837837837838
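
Gradient boosting adds trees one at a time, each fit to the errors of the ensemble built so far. `GradientBoostingClassifier.staged_predict` yields the predictions after each stage, which makes it possible to inspect how accuracy evolves as trees are added. A minimal sketch on synthetic data follows; the `_demo` names are illustrative and the numbers it prints say nothing about the Spaceship Titanic results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the Spaceship Titanic set)
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

gb_demo = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_demo.fit(X_tr, y_tr)

# One accuracy value per boosting stage (1 tree, 2 trees, ..., 100 trees)
staged_acc = [
    accuracy_score(y_te, y_stage) for y_stage in gb_demo.staged_predict(X_te)
]

# 1-based index of the first stage that reaches the highest held-out accuracy
best_stage = max(range(len(staged_acc)), key=staged_acc.__getitem__) + 1
print(f"Best held-out accuracy {staged_acc[best_stage - 1]:.3f} at {best_stage} trees")
```

In practice the number of trees would be chosen on a separate validation set rather than the test set, so that the final test score stays an honest estimate.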


- Adaptive Boosting

In [22]:
#your code here
ada_model = AdaBoostClassifier(n_estimators=100, random_state=42)
ada_model.fit(X_train, y_train)
y_pred_ada = ada_model.predict(X_test)
print("AdaBoost Accuracy:", accuracy_score(y_test, y_pred_ada))

AdaBoost Accuracy: 0.7711328349626222


Which model is the best and why?

In [29]:
#comment here
# Classification reports
print("\nClassification Reports:\n")
print("Bagging:\n", classification_report(y_test, y_pred_bagging))
print("Random Forest:\n", classification_report(y_test, y_pred_rf))
print("Gradient Boosting:\n", classification_report(y_test, y_pred_gb))
print("AdaBoost:\n", classification_report(y_test, y_pred_ada))


Classification Reports:

Bagging:
               precision    recall  f1-score   support

           0       0.78      0.77      0.77       861
           1       0.78      0.79      0.78       878

    accuracy                           0.78      1739
   macro avg       0.78      0.78      0.78      1739
weighted avg       0.78      0.78      0.78      1739

Random Forest:
               precision    recall  f1-score   support

           0       0.79      0.76      0.77       861
           1       0.77      0.80      0.78       878

    accuracy                           0.78      1739
   macro avg       0.78      0.78      0.78      1739
weighted avg       0.78      0.78      0.78      1739

Gradient Boosting:
               precision    recall  f1-score   support

           0       0.83      0.71      0.77       861
           1       0.75      0.85      0.80       878

    accuracy                           0.78      1739
   macro avg       0.79      0.78      0.78      1739
we

Which model is best?

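Gradient Boosting has the highest test accuracy (0.784), followed by Random Forest (0.779), Bagging (0.778), and AdaBoost (0.771).

The margin is small: all four models fall within about 1.3 percentage points of each other on a single 1,739-row test split, so the ranking could change with a different split or random seed.

Accuracy alone does not settle the question. Bagging and Random Forest have nearly balanced precision and recall across both classes, whereas Gradient Boosting trades recall on class 0 (0.71) for recall on class 1 (0.85). Its F1-score for class 1 (0.80) is the highest among the reports shown.

On this split, Gradient Boosting is therefore the best model by accuracy. A plausible reason is that it builds trees sequentially, each one correcting the errors of the ensemble so far, and handles non-linear relationships well. If balanced errors across both classes matter more than raw accuracy, Random Forest is a reasonable alternative.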
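
The accuracy and F1 comparison discussed above can be collected into a single table. The sketch below is a stand-in under stated assumptions: it trains the same four model types on synthetic data rather than reusing the lab's variables, and the `_demo` names and `n_estimators=50` are illustrative. With the lab's own objects, the loop would iterate over the predictions already computed (`y_pred_bagging`, `y_pred_rf`, `y_pred_gb`, `y_pred_ada`) instead of refitting.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Spaceship Titanic set)
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

models_demo = {
    "Bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=42),
}

# Fit each model and record accuracy and F1 (positive class) on the held-out rows
rows = []
for name, model in models_demo.items():
    y_pred = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_te, y_pred),
        "f1": f1_score(y_te, y_pred),
    })

summary = (
    pd.DataFrame(rows)
    .set_index("model")
    .sort_values("accuracy", ascending=False)
)
print(summary.round(3))
```

Because the differences between models can be small, repeating the comparison with cross-validation (for example `sklearn.model_selection.cross_val_score`) would give a more stable ranking than a single split.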

