# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see if you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [4]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [6]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


Now perform the same preprocessing as before:
- Feature Scaling
- Feature Selection


In [9]:
from sklearn.preprocessing import StandardScaler

# Selecting the numerical columns for scaling
numerical_features = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

# Fill missing values in numerical columns with 0 (or any suitable value)
spaceship[numerical_features] = spaceship[numerical_features].fillna(0)

# Apply StandardScaler
scaler = StandardScaler()
spaceship[numerical_features] = scaler.fit_transform(spaceship[numerical_features])

# Display the scaled features
spaceship[numerical_features].head()


| | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck |
|---|---|---|---|---|---|---|
| 0 | 0.721984 | -0.333105 | -0.281027 | -0.283579 | -0.270626 | -0.263003 |
| 1 | -0.283969 | -0.168073 | -0.275387 | -0.241771 | 0.217158 | -0.224205 |
| 2 | 1.996191 | -0.268001 | 1.959998 | -0.283579 | 5.695623 | -0.219796 |
| 3 | 0.319603 | -0.333105 | 0.52301 | 0.336851 | 2.687176 | -0.092818 |
| 4 | -0.820477 | 0.125652 | -0.237159 | -0.031059 | 0.231374 | -0.26124 |


In [11]:
# Drop PassengerId and Name
spaceship = spaceship.drop(columns=['PassengerId', 'Name'])

# Convert categorical features using one-hot encoding
spaceship = pd.get_dummies(spaceship, columns=['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'Cabin'], drop_first=True)

# Display the updated dataframe with selected features
spaceship.head()


| | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported | HomePlanet_Europa | HomePlanet_Mars | CryoSleep_True | ... | Cabin_G/996/S | Cabin_G/998/P | Cabin_G/998/S | Cabin_G/999/P | Cabin_G/999/S | Cabin_T/0/P | Cabin_T/1/P | Cabin_T/2/P | Cabin_T/2/S | Cabin_T/3/P |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.721984 | -0.333105 | -0.281027 | -0.283579 | -0.270626 | -0.263003 | False | True | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | -0.283969 | -0.168073 | -0.275387 | -0.241771 | 0.217158 | -0.224205 | True | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | 1.996191 | -0.268001 | 1.959998 | -0.283579 | 5.695623 | -0.219796 | False | True | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | 0.319603 | -0.333105 | 0.52301 | 0.336851 | 2.687176 | -0.092818 | False | True | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | -0.820477 | 0.125652 | -0.237159 | -0.031059 | 0.231374 | -0.26124 | True | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
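One-hot encoding `Cabin` directly creates one column per distinct cabin, which is why the table above is so wide. A possible alternative, sketched here on a toy frame, is to split the cabin string into its parts and encode only the low-cardinality ones. The column names `Deck`, `CabinNum`, and `Side` are illustrative assumptions based on the `B/0/P` pattern visible in the data; this is not necessarily what the previous Lab did, so use it only if it keeps the comparison fair.

```python
import pandas as pd

# Toy frame mirroring the Cabin format shown above (assumed layout: deck/num/side).
cabins = pd.DataFrame({"Cabin": ["B/0/P", "F/0/S", "A/0/S", None, "F/1/S"]})

# Split each cabin string into three parts; a missing cabin stays missing in every part.
parts = cabins["Cabin"].str.split("/", expand=True)
parts.columns = ["Deck", "CabinNum", "Side"]  # hypothetical column names

# One-hot encode only the low-cardinality parts instead of the full cabin string.
encoded = pd.get_dummies(parts[["Deck", "Side"]], drop_first=True)
print(encoded.shape)
print(list(encoded.columns))
```

On the full dataset this would replace thousands of `Cabin_*` columns with a handful of deck and side indicators.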


**Perform Train Test Split**

In [13]:
from sklearn.model_selection import train_test_split

# Define the target variable 'Transported' and features
X = spaceship.drop(columns=['Transported'])
y = spaceship['Transported']

# Perform the train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the shapes of the resulting data
X_train.shape, X_test.shape, y_train.shape, y_test.shape


((6954, 6571), (1739, 6571), (6954,), (1739,))
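In the cells above, the scaler is fitted on the whole dataset before the split, so test-set statistics leak into the training features. The sketch below shows the usual alternative on synthetic data: split first, then fit the scaler on the training rows only. The names `X_demo`, `y_demo`, and `scaler_demo` are placeholders, not part of the Lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numerical features (assumption: any numeric matrix works).
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

# Split first, so the test rows never influence the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # learn mean/std from training rows only
X_te_scaled = scaler_demo.transform(X_te)      # reuse the training mean/std on test rows

print(X_tr_scaled.mean(axis=0).round(6))
```

Tree-based models are largely insensitive to feature scaling, so this mainly matters when the same preprocessing is shared with scale-sensitive models.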

**Model Selection** - now apply different ensemble methods and check whether any of them gives a better model

- Bagging and Pasting

In [18]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Define the base estimator (Decision Tree)
tree_clf = DecisionTreeClassifier()

# Bagging Classifier (with replacement)
bagging_clf = BaggingClassifier(estimator=tree_clf, n_estimators=100, bootstrap=True, n_jobs=-1, random_state=42)

# Train the bagging model
bagging_clf.fit(X_train, y_train)

# Make predictions
y_pred_bagging = bagging_clf.predict(X_test)

# Evaluate the model
accuracy_bagging = accuracy_score(y_test, y_pred_bagging)
print(f"Bagging Accuracy: {accuracy_bagging}")


Bagging Accuracy: 0.7740080506037953


In [19]:
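# Pasting Classifier (without replacement)
# Note: max_samples defaults to 1.0, so with bootstrap=False every tree is trained on the
# full training set; pass max_samples < 1.0 to give each tree its own distinct subset.
pasting_clf = BaggingClassifier(estimator=tree_clf, n_estimators=100, bootstrap=False, n_jobs=-1, random_state=42)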

# Train the pasting model
pasting_clf.fit(X_train, y_train)

# Make predictions
y_pred_pasting = pasting_clf.predict(X_test)

# Evaluate the model
accuracy_pasting = accuracy_score(y_test, y_pred_pasting)
print(f"Pasting Accuracy: {accuracy_pasting}")


Pasting Accuracy: 0.7469810235767682


- Random Forests

In [22]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Define the Random Forest Classifier
random_forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the Random Forest model
random_forest_clf.fit(X_train, y_train)

# Make predictions
y_pred_rf = random_forest_clf.predict(X_test)

# Evaluate the model
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy: {accuracy_rf}")


Random Forest Accuracy: 0.7740080506037953
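A random forest also offers two diagnostics that need no separate test set: the out-of-bag score and the per-feature importances. The sketch below shows both on synthetic data; `rf_demo` and the dataset shape are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with a few informative features (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=3, random_state=42
)

# oob_score=True scores each row using only the trees that did not train on it.
rf_demo = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf_demo.fit(X_demo, y_demo)

print(f"OOB accuracy: {rf_demo.oob_score_:.3f}")

# Impurity-based importances, one value per feature, normalized to sum to 1.
importances = rf_demo.feature_importances_
print(importances.round(3))
```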


- Gradient Boosting

In [24]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Define the Gradient Boosting Classifier
gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)

# Train the Gradient Boosting model
gb_clf.fit(X_train, y_train)

# Make predictions
y_pred_gb = gb_clf.predict(X_test)

# Evaluate the model
accuracy_gb = accuracy_score(y_test, y_pred_gb)
print(f"Gradient Boosting Accuracy: {accuracy_gb}")


Gradient Boosting Accuracy: 0.7855089131684876
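With gradient boosting, the number of trees interacts with the learning rate, so it can help to see how validation accuracy evolves as trees are added. A minimal sketch using `staged_predict` on synthetic data follows; the validation split and the `*_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data and a held-out validation split (illustrative only).
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

gb_demo = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
)
gb_demo.fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., n_estimators trees.
val_scores = [accuracy_score(y_val, y_stage) for y_stage in gb_demo.staged_predict(X_val)]

# Smallest number of trees that reaches the best validation accuracy.
best_n = int(np.argmax(val_scores)) + 1
print(best_n, max(val_scores))
```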


- Adaptive Boosting

In [28]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

# Define the AdaBoost Classifier with the SAMME algorithm
ada_clf = AdaBoostClassifier(n_estimators=100, learning_rate=0.1, algorithm='SAMME', random_state=42)

# Train the AdaBoost model
ada_clf.fit(X_train, y_train)

# Make predictions
y_pred_ada = ada_clf.predict(X_test)

# Evaluate the model
accuracy_ada = accuracy_score(y_test, y_pred_ada)
print(f"AdaBoost Accuracy: {accuracy_ada}")


AdaBoost Accuracy: 0.7435307648073606


Which model is the best and why?

In [None]:
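# Gradient Boosting is the best model here: it reaches the highest test accuracy (about 0.786),
# ahead of Bagging and Random Forest (both about 0.774), Pasting (about 0.747), and AdaBoost (about 0.744).
# The gaps are small and come from a single train/test split, so the ranking should be read with caution.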