# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [1]:
# Libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score, confusion_matrix

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


Now perform the same steps as before:
- Feature Scaling
- Feature Selection


In [3]:
#your code here

# Split 'Cabin' into Deck, Num, and Side
spaceship[['Deck', 'Num', 'Side']] = spaceship['Cabin'].str.split('/', expand=True)
spaceship.drop(columns=['Cabin', 'Name'], inplace=True)  # Drop Cabin and Name columns

# Handle missing values
imputer = SimpleImputer(strategy='most_frequent')

# Encode categorical variables
encoder = OneHotEncoder(handle_unknown='ignore')

# Scale numerical features
scaler = StandardScaler()

# Define columns
num_features = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
cat_features = ['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'Deck', 'Side']

# Create pipelines for numerical and categorical data
num_pipeline = Pipeline([
    ('imputer', imputer),
    ('scaler', scaler)
])

cat_pipeline = Pipeline([
    ('imputer', imputer),
    ('encoder', encoder)
])

# Combine pipelines
preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_features),
    ('cat', cat_pipeline, cat_features)
])

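# Apply transformations
# Note: fitting on the full dataset before splitting lets test-set statistics
# influence imputation and scaling; fitting on the training split alone avoids this.
X = preprocessor.fit_transform(spaceship)
y = spaceship['Transported'].astype(int)  # Convert the boolean target to 0/1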

**Perform Train Test Split**

In [4]:
#your code here

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
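
The cells above transform the whole dataset and only then split it. A minimal sketch of the alternative order (split the raw rows first, then fit the preprocessor on the training rows only) is shown below. The small `demo` frame and every `demo_*` / `Xd_*` name are made up for illustration and are not part of the lab; the imputation strategy mirrors the one used above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up frame standing in for the real data (hypothetical values).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Age": rng.integers(1, 80, size=50).astype(float),
    "Spa": rng.exponential(200.0, size=50),
    "HomePlanet": rng.choice(["Earth", "Europa", "Mars"], size=50),
    "Transported": rng.integers(0, 2, size=50),
})
demo.loc[demo.index % 7 == 0, "Age"] = np.nan  # inject a few missing values

demo_num = ["Age", "Spa"]
demo_cat = ["HomePlanet"]

demo_preprocessor = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                      ("scaler", StandardScaler())]), demo_num),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                      ("encoder", OneHotEncoder(handle_unknown="ignore"))]), demo_cat),
])

# Split the raw rows first ...
raw_train, raw_test, t_train, t_test = train_test_split(
    demo.drop(columns="Transported"), demo["Transported"],
    test_size=0.2, random_state=42)

# ... then learn imputation values, scaling statistics and categories
# from the training rows only, and reuse them unchanged on the test rows.
Xd_train = demo_preprocessor.fit_transform(raw_train)
Xd_test = demo_preprocessor.transform(raw_test)

print(Xd_train.shape, Xd_test.shape)
```

Applying this order to the real data would probably change the accuracies reported in this notebook slightly, so the numbers below should be read as coming from the original fit-then-split order.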

**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [5]:
#your code here

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging with Decision Tree
bagging_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100, max_samples=0.8, bootstrap=True, random_state=42
)
bagging_clf.fit(X_train, y_train)
y_pred_bagging = bagging_clf.predict(X_test)

bagging_accuracy = accuracy_score(y_test, y_pred_bagging)
print(f"Bagging Accuracy: {bagging_accuracy}")

Bagging Accuracy: 0.7860839562967222
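
The cell above covers bagging only (`bootstrap=True`, sampling with replacement). Pasting uses the same estimator with `bootstrap=False`, so each tree sees a subset drawn without replacement. The sketch below compares the two on synthetic data from `make_classification`, which is an assumption made so the snippet runs on its own; the scores it prints say nothing about the Spaceship Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption), not the lab dataset.
Xs, ys = make_classification(n_samples=500, n_features=10, random_state=42)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    Xs, ys, test_size=0.2, random_state=42)

sampling_scores = {}
for label, use_bootstrap in [("bagging", True), ("pasting", False)]:
    clf = BaggingClassifier(
        DecisionTreeClassifier(),
        n_estimators=50,
        max_samples=0.8,          # each tree sees 80% of the training rows
        bootstrap=use_bootstrap,  # True: with replacement, False: without
        random_state=42,
    )
    clf.fit(Xs_train, ys_train)
    sampling_scores[label] = accuracy_score(ys_test, clf.predict(Xs_test))

print(sampling_scores)
```

To try pasting in this lab, the same `bootstrap=False` switch could be applied to `bagging_clf` with `X_train` and `y_train`.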


- Random Forests

In [6]:
#your code here

from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_test)

rf_accuracy = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy: {rf_accuracy}")


Random Forest Accuracy: 0.7837837837837838


- Gradient Boosting

In [7]:
#your code here

from sklearn.ensemble import GradientBoostingClassifier

gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gb_clf.fit(X_train, y_train)
y_pred_gb = gb_clf.predict(X_test)

gb_accuracy = accuracy_score(y_test, y_pred_gb)
print(f"Gradient Boosting Accuracy: {gb_accuracy}")


Gradient Boosting Accuracy: 0.7780333525014376


- Adaptive Boosting

In [8]:
#your code here

from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
ada_clf.fit(X_train, y_train)
y_pred_ada = ada_clf.predict(X_test)

ada_accuracy = accuracy_score(y_test, y_pred_ada)
print(f"AdaBoost Accuracy: {ada_accuracy}")




AdaBoost Accuracy: 0.757906843013226
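
AdaBoost is run above with a single `learning_rate=0.1`. Because the learning rate scales each weak learner's contribution, it interacts with `n_estimators`, and a low rate may need more estimators to reach the same fit. The sketch below sweeps a few rates on synthetic data; the data and the chosen rates are assumptions for illustration, not tuned values for this lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption), not the lab dataset.
Xa, ya = make_classification(n_samples=500, n_features=10, random_state=1)
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    Xa, ya, test_size=0.2, random_state=42)

rate_scores = {}
for rate in (0.1, 0.5, 1.0):
    ada = AdaBoostClassifier(n_estimators=100, learning_rate=rate, random_state=42)
    ada.fit(Xa_train, ya_train)
    rate_scores[rate] = ada.score(Xa_test, ya_test)  # test-set accuracy

print(rate_scores)
```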


In [9]:
# Compare the accuracies
print(f"Bagging Accuracy: {bagging_accuracy}")
print(f"Random Forest Accuracy: {rf_accuracy}")
print(f"Gradient Boosting Accuracy: {gb_accuracy}")
print(f"AdaBoost Accuracy: {ada_accuracy}")


Bagging Accuracy: 0.7860839562967222
Random Forest Accuracy: 0.7837837837837838
Gradient Boosting Accuracy: 0.7780333525014376
AdaBoost Accuracy: 0.757906843013226


Which model is the best and why?

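Bagging with a Decision Tree as the base estimator achieved the highest test accuracy (about 0.786), narrowly ahead of Random Forest (about 0.784) and Gradient Boosting (about 0.778), with AdaBoost last (about 0.758). Averaging many deep trees trained on bootstrap samples reduces variance, which likely explains why the two tree-averaging methods come out on top. The gap between Bagging and Random Forest is only about 0.2 percentage points, measured on a single train/test split with near-default hyperparameters, so Bagging is the best model among those tested here, but the ranking should not be treated as conclusive.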
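
Since the accuracies above are close and come from one split, k-fold cross-validation gives a steadier comparison by reporting a mean and spread per model. The sketch below shows the pattern with `cross_val_score` on synthetic data; the data, the reduced `n_estimators=25`, and the `candidates` / `cv_summary` names are assumptions chosen to keep the example small and self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption), not the lab dataset.
Xc, yc = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    "Bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=25, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=25, random_state=42),
    "AdaBoost": AdaBoostClassifier(n_estimators=25, random_state=42),
}

cv_summary = {}
for name, model in candidates.items():
    # Five folds: each model is trained five times and scored on the held-out fold.
    scores = cross_val_score(model, Xc, yc, cv=5, scoring="accuracy")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

On the lab data, the same loop could be run with `X` and `y` in place of `Xc` and `yc`; if the mean accuracies differ by less than their standard deviations, the models are hard to tell apart on accuracy alone.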