# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this lab, try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous lab.

In [11]:
# Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Show every column when displaying DataFrames
pd.set_option('display.max_columns', None)

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


Now perform the same as before:
- Feature Scaling
- Feature Selection


**Perform Train Test Split**

In [3]:
# Drop rows with missing values; copy so that later column assignments
# do not trigger a SettingWithCopyWarning
spaceship = spaceship.dropna().copy()

# Bin cabins by deck: keep only the leading letter (A-G or T),
# e.g. "B/0/P" becomes "B"
spaceship['Cabin'] = spaceship['Cabin'].str[0]

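# Drop the identifier columns
spaceship = spaceship.drop(columns=['PassengerId', 'Name'])

# One-hot encode the categorical columns (including the target)
spaceship_dummies = pd.get_dummies(spaceship, columns=['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP', 'Transported'])

# Drop both target dummies, plus one dummy per categorical feature to avoid redundant columns
features = spaceship_dummies.drop(columns=['Transported_True', 'Transported_False', 'VIP_True', 'Destination_TRAPPIST-1e', 'Cabin_T', 'CryoSleep_True', 'HomePlanet_Mars'])
# Note: the target is Transported_False, so a positive value means the passenger was NOT transported
target = spaceship_dummies["Transported_False"]

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=0)

# Fit the scaler on the training set only, then apply it to both sets
scaler = MinMaxScaler()
scaler.fit(X_train)

X_train_scale = scaler.transform(X_train)
X_test_scale = scaler.transform(X_test)

# Restore the column names lost by the scaler
X_train_scale = pd.DataFrame(X_train_scale, columns=X_train.columns)
X_test_scale = pd.DataFrame(X_test_scale, columns=X_test.columns)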
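
The scaler above is fitted on the training set and then reused on the test set. A minimal sketch of what that implies, using hypothetical toy values rather than the Spaceship Titanic data: training values land exactly in [0, 1], while test values outside the training range can fall outside it.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical): one feature whose training range is [0, 10]
X_train_toy = np.array([[0.0], [5.0], [10.0]])
X_test_toy = np.array([[2.5], [15.0]])  # 15.0 lies outside the training range

scaler_toy = MinMaxScaler()
scaler_toy.fit(X_train_toy)  # learns min=0 and max=10 from the training set only

train_scaled = scaler_toy.transform(X_train_toy)
test_scaled = scaler_toy.transform(X_test_toy)

print(train_scaled.ravel())  # training values span exactly [0, 1]
print(test_scaled.ravel())   # test values are not guaranteed to stay in [0, 1]
```

Fitting on the training set only keeps information about the test set out of the preprocessing step, which is why the test values are allowed to leave the unit interval.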

**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [20]:
bagging_reg = BaggingRegressor(DecisionTreeRegressor(max_depth=60),
                               n_estimators=200,
                               max_samples = 1000)

bagging_reg.fit(X_train_scale, y_train)

pred = bagging_reg.predict(X_test_scale)

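# scikit-learn metrics expect (y_true, y_pred); RMSE is computed with np.sqrt
# because the squared=False argument is deprecated in recent scikit-learn releases
print("MAE", mean_absolute_error(y_test, pred))
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 score", bagging_reg.score(X_test_scale, y_test))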

MAE 0.27809046170042767
RMSE 0.377083528445944
R2 score 0.4312320502990278
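
The cell above only covers bagging, since `BaggingRegressor` samples with replacement by default. As a rough sketch of the difference, pasting can be obtained by setting `bootstrap=False`, so each tree sees a subset drawn without replacement. The data below is synthetic (generated with `make_regression`) and the helper `fit_ensemble` and all hyperparameters are illustrative assumptions, not the lab's setup.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the Spaceship Titanic set)
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

def fit_ensemble(bootstrap):
    """Hypothetical helper: bootstrap=True is bagging, bootstrap=False is pasting."""
    model = BaggingRegressor(
        DecisionTreeRegressor(max_depth=10),
        n_estimators=20,
        max_samples=100,      # each tree is trained on 100 rows
        bootstrap=bootstrap,  # sample with (True) or without (False) replacement
        random_state=0,
    )
    return model.fit(X_tr, y_tr)

bagging = fit_ensemble(bootstrap=True)
pasting = fit_ensemble(bootstrap=False)

bagging_r2 = bagging.score(X_te, y_te)
pasting_r2 = pasting.score(X_te, y_te)
print("Bagging R2", bagging_r2)
print("Pasting R2", pasting_r2)
```

Running both variants on the real features would show whether the sampling scheme matters for this dataset; the synthetic scores here say nothing about that.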


- Random Forests

In [22]:
forest = RandomForestRegressor(n_estimators=100,
                             max_depth=20)

forest.fit(X_train_scale, y_train)

pred = forest.predict(X_test_scale)

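# scikit-learn metrics expect (y_true, y_pred); RMSE is computed with np.sqrt
# because the squared=False argument is deprecated in recent scikit-learn releases
print("MAE", mean_absolute_error(y_test, pred))
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 score", forest.score(X_test_scale, y_test))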

MAE 0.267969458431422
RMSE 0.38331854915291613
R2 score 0.41226755950121374
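
The target in this lab is binary, so a classifier scored with accuracy is a natural alternative to the regressors used here. The sketch below is not what the lab ran: it uses `RandomForestClassifier` on synthetic data from `make_classification`, and the hyperparameters simply mirror the regressor cell above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data (a stand-in, not the Spaceship Titanic set)
X, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=100, max_depth=20, random_state=0)
clf.fit(X_tr, y_tr)

pred_labels = clf.predict(X_te)              # hard 0/1 predictions
pred_proba = clf.predict_proba(X_te)[:, 1]   # estimated probability of class 1

accuracy = accuracy_score(y_te, pred_labels)
print("Accuracy", accuracy)
```

Accuracy is directly comparable with a classifier from an earlier lab, whereas MAE, RMSE and R2 on a 0/1 target are harder to interpret.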


- Gradient Boosting

In [17]:
gb_reg = GradientBoostingRegressor(max_depth=20,
                                   n_estimators=100)

gb_reg.fit(X_train_scale, y_train)

pred = gb_reg.predict(X_test_scale)

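# scikit-learn metrics expect (y_true, y_pred); RMSE is computed with np.sqrt
# because the squared=False argument is deprecated in recent scikit-learn releases
print("MAE", mean_absolute_error(y_test, pred))
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 score", gb_reg.score(X_test_scale, y_test))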

MAE 0.25448174346141916
RMSE 0.41220653053270145
R2 score 0.32034310474477223
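
Gradient boosting is usually run with shallow trees (scikit-learn's default is `max_depth=3`), so `max_depth=20` may be overfitting here. One way to check, sketched below on synthetic data, is to track the validation error after each boosting stage with `staged_predict` and see where it bottoms out. The data, split and hyperparameters are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Spaceship Titanic set)
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

gb = GradientBoostingRegressor(max_depth=3, n_estimators=100, random_state=0)
gb.fit(X_tr, y_tr)

# staged_predict yields the ensemble's prediction after 1, 2, ..., n_estimators trees
val_rmse = [
    np.sqrt(mean_squared_error(y_val, stage_pred))
    for stage_pred in gb.staged_predict(X_val)
]

best_n = int(np.argmin(val_rmse)) + 1  # stages are 1-indexed
print("Best number of trees on the validation set:", best_n)
```

If the validation error starts rising well before the last stage, fewer or shallower trees are worth trying.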


- Adaptive Boosting

In [18]:
ada_reg = AdaBoostRegressor(DecisionTreeRegressor(max_depth=20),
                            n_estimators=100)

ada_reg.fit(X_train_scale, y_train)

pred = ada_reg.predict(X_test_scale)

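# scikit-learn metrics expect (y_true, y_pred); RMSE is computed with np.sqrt
# because the squared=False argument is deprecated in recent scikit-learn releases
print("MAE", mean_absolute_error(y_test, pred))
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 score", ada_reg.score(X_test_scale, y_test))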

MAE 0.2433126332174434
RMSE 0.423469507189407
R2 score 0.282694305923043


Which model is the best and why?

In [9]:
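# Bagging looks like the best model on this split: it has the highest R2 score (0.431) and the lowest RMSE (0.377).
# It does not have the lowest MAE, though: its MAE (0.278) is the highest of the four, and AdaBoost has the lowest (0.243).
# RMSE and R2 penalize large errors more heavily than MAE, which suggests bagging makes fewer large errors on this 0/1 target.
# None of the models sets random_state, so these numbers will vary slightly between runs.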
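
To compare the four runs side by side, the metrics printed above can be collected in a small table. The values below are copied from the cell outputs in this notebook and rounded to four decimals; the `results` DataFrame is just a convenience for this comparison.

```python
import pandas as pd

# Metrics copied from the outputs above, rounded to four decimals
results = pd.DataFrame(
    {
        "MAE": [0.2781, 0.2680, 0.2545, 0.2433],
        "RMSE": [0.3771, 0.3833, 0.4122, 0.4235],
        "R2": [0.4312, 0.4123, 0.3203, 0.2827],
    },
    index=["Bagging", "Random Forest", "Gradient Boosting", "AdaBoost"],
)

print(results)
print("Lowest MAE:", results["MAE"].idxmin())
print("Lowest RMSE:", results["RMSE"].idxmin())
print("Highest R2:", results["R2"].idxmax())
```

The metrics disagree on the ranking, so the choice of "best" model depends on which error measure matters for the task.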