# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see if you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [141]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [142]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [144]:
#your code here
from sklearn.preprocessing import StandardScaler

# Cleaning
spaceship = spaceship.drop(['PassengerId', 'Name'], axis=1)

spaceship.fillna(spaceship.median(numeric_only=True), inplace=True)
spaceship['CryoSleep'] = spaceship['CryoSleep'].fillna(False)
spaceship['VIP'] = spaceship['VIP'].fillna(False)

spaceship['HomePlanet'] = spaceship['HomePlanet'].fillna(spaceship['HomePlanet'].mode()[0])
spaceship['Destination'] = spaceship['Destination'].fillna(spaceship['Destination'].mode()[0])

# OneHotEncoding for categorical columns
spaceship = pd.get_dummies(spaceship, columns=['HomePlanet', 'Cabin', 'Destination'], drop_first=True)

# Scaling
scaler = StandardScaler()
numerical_cols = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
spaceship[numerical_cols] = scaler.fit_transform(spaceship[numerical_cols])


X = spaceship.drop(['Transported'], axis=1)
y = spaceship['Transported']

print(spaceship.head())

   CryoSleep       Age    VIP  RoomService  FoodCourt  ShoppingMall       Spa  \
0      False  0.711945  False    -0.333105  -0.281027     -0.283579 -0.270626   
1      False -0.334037  False    -0.168073  -0.275387     -0.241771  0.217158   
2      False  2.036857   True    -0.268001   1.959998     -0.283579  5.695623   
3      False  0.293552  False    -0.333105   0.523010      0.336851  2.687176   
4      False -0.891895  False     0.125652  -0.237159     -0.031059  0.231374   

     VRDeck  Transported  HomePlanet_Europa  ...  Cabin_G/998/S  \
0 -0.263003        False               True  ...          False   
1 -0.224205         True              False  ...          False   
2 -0.219796        False               True  ...          False   
3 -0.092818        False               True  ...          False   
4 -0.261240         True              False  ...          False   

   Cabin_G/999/P  Cabin_G/999/S  Cabin_T/0/P  Cabin_T/1/P  Cabin_T/2/P  \
0          False          False     

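
One-hot encoding the raw `Cabin` string creates one column per distinct cabin, which is why columns such as `Cabin_G/998/S` appear and the feature matrix ends up with 6571 columns. The sample rows (`B/0/P`, `F/0/S`) suggest a deck/number/side pattern; this is an assumption to check against the linked metadata. Under that assumption, a minimal sketch of a more compact encoding is shown here on toy rows. It is not what this notebook did, and using it would change the scores reported further down.

```python
import pandas as pd

# Toy rows in the assumed "deck/num/side" format (values copied from the head() output)
toy = pd.DataFrame({"Cabin": ["B/0/P", "F/0/S", "A/0/S", None, "F/1/S"]})

# Split the string into three parts; rows with a missing cabin stay missing
parts = toy["Cabin"].str.split("/", expand=True)
toy["Deck"] = parts[0]
toy["Side"] = parts[2]

# Fill missing values with the most frequent category, as done for HomePlanet
for col in ["Deck", "Side"]:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

# Encode the two low-cardinality columns instead of the raw Cabin string
encoded = pd.get_dummies(toy.drop(columns="Cabin"), columns=["Deck", "Side"], drop_first=True)
print(encoded.columns.tolist())
```

With this approach the number of cabin-related columns depends on the number of distinct decks and sides rather than on the number of distinct cabins.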


**Perform Train Test Split**

In [146]:
#your code here
# Perform train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(f"Training data shape: {X_train.shape}")
print(f"Test data shape: {X_test.shape}")


Training data shape: (6085, 6571)
Test data shape: (2608, 6571)
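
Note that the scaler earlier in this notebook is fitted on the full dataset before the split, so statistics from the test rows influence the training features. A common alternative is to split first and fit the scaler on the training rows only. The sketch below illustrates the idea on synthetic data; the column names are reused for readability, but the values are random stand-ins and not the Spaceship Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the real dataset)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Age": rng.normal(30, 10, size=200),
    "Spa": rng.exponential(300, size=200),
})
target = rng.integers(0, 2, size=200)

# Split first ...
X_tr, X_te, y_tr, y_te = train_test_split(demo, target, test_size=0.3, random_state=42)

# ... then learn mean and standard deviation from the training rows only
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
# The test rows are transformed with the training statistics (no refit)
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(3))  # close to 0 by construction
print(X_te_scaled.mean(axis=0).round(3))  # generally not exactly 0
```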


**Model Selection** - now you will apply different ensemble methods to try to get a better model

- Bagging and Pasting

In [149]:
#your code here
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging
bagging_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, max_samples=0.8, bootstrap=True, random_state=42)
bagging_clf.fit(X_train, y_train)
bagging_score = bagging_clf.score(X_test, y_test)
print(f"Bagging Classifier Score: {bagging_score * 100:.2f}%")

# Pasting (bootstrap=False)
pasting_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, max_samples=0.8, bootstrap=False, random_state=42)
pasting_clf.fit(X_train, y_train)
pasting_score = pasting_clf.score(X_test, y_test)
print(f"Pasting Classifier Score: {pasting_score * 100:.2f}%")


Bagging Classifier Score: 78.07%
Pasting Classifier Score: 77.80%
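
The two classifiers above differ only in `bootstrap`: with `bootstrap=True` each tree's sample is drawn with replacement (bagging), with `bootstrap=False` it is drawn without replacement (pasting). The sketch below mimics that sampling idea with NumPy to show how many distinct training rows a single tree would see. It is an illustration of the concept, not scikit-learn's internal code.

```python
import numpy as np

rng = np.random.default_rng(42)
n_train = 6085                     # training rows reported by the split
n_draw = int(0.8 * n_train)        # corresponds to max_samples=0.8

# Bagging: draw row indices with replacement, so duplicates are possible
bag_idx = rng.choice(n_train, size=n_draw, replace=True)
# Pasting: draw row indices without replacement, so every index is distinct
paste_idx = rng.choice(n_train, size=n_draw, replace=False)

bag_unique = np.unique(bag_idx).size
paste_unique = np.unique(paste_idx).size

print(f"Bagging: {bag_unique} distinct rows out of {n_draw} draws")
print(f"Pasting: {paste_unique} distinct rows out of {n_draw} draws")
```

With these settings a bagged tree is expected to see about 55% of the training rows as distinct examples (1 - e^(-0.8)), while a pasted tree sees exactly 80%.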


- Random Forests

In [151]:
#your code here

from sklearn.ensemble import RandomForestClassifier

# Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
rf_score = rf_clf.score(X_test, y_test)
print(f"Random Forest Classifier Score: {rf_score * 100:.2f}%")

Random Forest Classifier Score: 78.14%


- Gradient Boosting

In [153]:
#your code here

from sklearn.ensemble import GradientBoostingClassifier

# Gradient Boosting Classifier
gb_clf = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_clf.fit(X_train, y_train)
gb_score = gb_clf.score(X_test, y_test)
print(f"Gradient Boosting Classifier Score: {gb_score * 100:.2f}%")

Gradient Boosting Classifier Score: 78.87%


- Adaptive Boosting

In [155]:
#your code here
from sklearn.ensemble import AdaBoostClassifier

# AdaBoost Classifier
ada_clf = AdaBoostClassifier(n_estimators=100, random_state=42)
ada_clf.fit(X_train, y_train)
ada_score = ada_clf.score(X_test, y_test)
print(f"AdaBoost Classifier Score: {ada_score * 100:.2f}%")



AdaBoost Classifier Score: 78.49%
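
Each score so far comes from a single 70/30 split, so small differences between models may depend on which rows landed in the test set. k-fold cross-validation gives a mean and a spread instead of one number. The sketch below shows the pattern on synthetic data from `make_classification`, with fewer estimators to keep it fast; to use it in this notebook, the stand-in `X_demo` and `y_demo` would be replaced with the real `X` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: not the Spaceship Titanic features)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=42)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
}

cv_results = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy: one score per held-out fold
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[name] = fold_scores
    print(f"{name}: {fold_scores.mean() * 100:.2f}% (+/- {fold_scores.std() * 100:.2f})")
```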


Which model is the best and why?

#comment here
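The Gradient Boosting Classifier achieved the highest test accuracy at 78.87%, making it the best-performing model among those tried.
A likely reason is that gradient boosting builds its trees sequentially, fitting each new tree to the errors of the ensemble so far, which mainly reduces bias. Bagging, pasting and random forests instead average independently trained trees, which mainly reduces variance.
AdaBoost came second at 78.49%, followed by Random Forest (78.14%), Bagging (78.07%) and Pasting (77.80%).
All five scores lie within about 1.1 percentage points of each other and come from a single train/test split, so the ranking should be treated as tentative.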
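
To get a rough sense of whether the gap between the top two models is meaningful, the binomial standard error of an accuracy measured on 2608 test rows can be computed. This is only an approximation: it treats test predictions as independent, and because both models are scored on the same test set, a paired comparison would be more precise.

```python
import math

n_test = 2608  # test rows reported by the split
scores = {"Gradient Boosting": 0.7887, "AdaBoost": 0.7849}  # accuracies printed above

def accuracy_std_error(acc, n):
    """Binomial standard error of an accuracy estimated on n samples."""
    return math.sqrt(acc * (1 - acc) / n)

for name, acc in scores.items():
    se = accuracy_std_error(acc, n_test)
    # Approximate 95% interval: accuracy +/- 1.96 standard errors
    print(f"{name}: {acc * 100:.2f}% +/- {1.96 * se * 100:.2f} points")

gap = scores["Gradient Boosting"] - scores["AdaBoost"]
print(f"Gap: {gap * 100:.2f} points")
```

Under this approximation each interval is roughly 1.6 points wide on either side, which is larger than the 0.38-point gap between the two models.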