# LAB | Ensemble Methods

**Load the data**

In this challenge, we will be working with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see if you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [8]:
# Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor

from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

In [59]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")


In [107]:
spaceship_num.columns

Index(['Earth', 'Europa', 'Mars', 'CryoSleep', 'A', 'B', 'C', 'D', 'E', 'F',
       'G', 'T', '55 Cancri e', 'PSO J318.5-22', 'TRAPPIST-1e', 'Age', 'VIP',
       'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
       'Transported'],
      dtype='object')

Now perform the same as before:
- Feature Scaling
- Feature Selection


In [63]:
# Drop rows with missing values
spaceship = spaceship.dropna(how='any')
# Keep only the first character of Cabin (the deck)
spaceship['Cabin'] = spaceship['Cabin'].str[0]
# Drop identifier columns
spaceship = spaceship.drop(columns=['PassengerId', 'Name'])
# One-hot encode the categorical columns
HomePlanet_d = pd.get_dummies(spaceship['HomePlanet'])
Cabin_d = pd.get_dummies(spaceship['Cabin'])
Destination_d = pd.get_dummies(spaceship['Destination'])
# Combine the encoded and remaining columns into one DataFrame
spaceship_num = pd.concat([HomePlanet_d, spaceship['CryoSleep'], Cabin_d, Destination_d,
                           spaceship[['Age', 'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'Transported']]], axis=1)
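
The encoding step above can be illustrated on a tiny example. This is a minimal sketch with made-up rows (the `toy` DataFrame and its values are assumptions, not taken from the data set); it only shows how the first character of `Cabin` is extracted and how `pd.get_dummies` turns each category into its own column.

```python
import pandas as pd

# Toy rows (hypothetical values) that mimic two of the categorical columns
toy = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", "Earth"],
    "Cabin": ["B/0/P", "F/1/S", "F/2/P"],
})

# Keep only the first character of Cabin
toy["Cabin"] = toy["Cabin"].str[0]

# One column per category; get_dummies orders the categories alphabetically
encoded = pd.concat(
    [pd.get_dummies(toy["HomePlanet"]), pd.get_dummies(toy["Cabin"])],
    axis=1,
)

print(encoded.columns.tolist())  # ['Earth', 'Europa', 'B', 'F']
```

Each row has exactly one active column per original categorical variable, which is why the column list of `spaceship_num` contains planet, deck, and destination names.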

**Perform Train Test Split**

In [150]:
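# Note: CryoSleep is the target here, so Transported stays among the features;
# all metrics below refer to predicting CryoSleep
features = spaceship_num.drop(columns='CryoSleep')
target = spaceship_num['CryoSleep']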

In [152]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.20, random_state=15)

In [154]:
normalizer = MinMaxScaler()

normalizer.fit(X_train)

X_train_norm = normalizer.transform(X_train)

X_test_norm = normalizer.transform(X_test)

X_train_norm = pd.DataFrame(X_train_norm, columns = X_train.columns, index = X_train.index)
X_test_norm = pd.DataFrame(X_test_norm, columns = X_test.columns, index = X_test.index)
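
The scaler above is fitted on the training split only and then applied to both splits. A minimal sketch with toy numbers (the arrays and variable names below are assumptions for illustration) shows what that means in practice: the test data is scaled with the training minimum and maximum, so test values can fall outside the range [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical values, not from the Spaceship Titanic set)
X_train_toy = np.array([[0.0], [5.0], [10.0]])
X_test_toy = np.array([[2.5], [12.0]])

scaler = MinMaxScaler()
scaler.fit(X_train_toy)  # learns min=0 and max=10 from the training data only

train_scaled = scaler.transform(X_train_toy)
test_scaled = scaler.transform(X_test_toy)

print(train_scaled.ravel())  # training values map onto 0, 0.5 and 1
print(test_scaled.ravel())   # roughly 0.25 and 1.2; 12 lies above the training max
```

Fitting on the training split alone keeps information about the test set out of the preprocessing step.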

**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [157]:
bagging_reg = BaggingRegressor(DecisionTreeRegressor(max_depth=20),
                               n_estimators=100,
                               max_samples = 1000)

In [158]:
bagging_reg.fit(X_train_norm, y_train)

In [159]:
pred = bagging_reg.predict(X_test_norm)

print("MAE", mean_absolute_error(y_test, pred))
# RMSE via np.sqrt, since the `squared` argument is deprecated in recent scikit-learn
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 score", bagging_reg.score(X_test_norm, y_test))

MAE 0.06039094192914163
RMSE 0.17653662645361975
R2 score 0.8654739990087119
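
In scikit-learn, the difference between bagging and pasting comes down to the `bootstrap` parameter of `BaggingRegressor`: `bootstrap=True` (the default, used in the cell above) draws each estimator's training sample with replacement, while `bootstrap=False` draws without replacement. The following is a minimal sketch on synthetic data; the data set, the helper `fit_ensemble`, and the hyperparameter values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption: stands in for the scaled lab features)
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

def fit_ensemble(bootstrap):
    """Fit a tree ensemble; bootstrap=True is bagging, bootstrap=False is pasting."""
    model = BaggingRegressor(
        DecisionTreeRegressor(max_depth=5),
        n_estimators=50,
        max_samples=100,      # each tree sees 100 training rows
        bootstrap=bootstrap,  # with (True) or without (False) replacement
        random_state=0,
    )
    return model.fit(X_tr, y_tr)

bagging = fit_ensemble(bootstrap=True)
pasting = fit_ensemble(bootstrap=False)

print("bagging R2:", bagging.score(X_te, y_te))
print("pasting R2:", pasting.score(X_te, y_te))
```

Running both variants on the lab data would show whether sampling without replacement changes the result there.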


- Random Forests

In [161]:
forest = RandomForestRegressor(n_estimators=100,
                             max_depth=20)

In [162]:
forest.fit(X_train_norm, y_train)

In [163]:
pred = forest.predict(X_test_norm)

print("MAE", mean_absolute_error(y_test, pred))
# RMSE via np.sqrt, since the `squared` argument is deprecated in recent scikit-learn
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 score", forest.score(X_test_norm, y_test))

MAE 0.05755008765468217
RMSE 0.18425997799341748
R2 score 0.8534456829700737


- Gradient Boosting

In [165]:
gb_reg = GradientBoostingRegressor(max_depth=20,
                                   n_estimators=100)

In [166]:
gb_reg.fit(X_train_norm, y_train)

In [167]:
pred = gb_reg.predict(X_test_norm)

print("MAE", mean_absolute_error(y_test, pred))
# RMSE via np.sqrt, since the `squared` argument is deprecated in recent scikit-learn
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 score", gb_reg.score(X_test_norm, y_test))

MAE 0.05558064782359087
RMSE 0.1910971065518263
R2 score 0.8423678465158336


- Adaptive Boosting

In [169]:
ada_reg = AdaBoostRegressor(DecisionTreeRegressor(max_depth=20),
                            n_estimators=100)

In [170]:
ada_reg.fit(X_train_norm, y_train)

In [171]:
pred = ada_reg.predict(X_test_norm)

print("MAE", mean_absolute_error(y_test, pred))
# RMSE via np.sqrt, since the `squared` argument is deprecated in recent scikit-learn
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 score", ada_reg.score(X_test_norm, y_test))

MAE 0.06186270972697822
RMSE 0.19762324587477217
R2 score 0.8314174408845312


Which model is the best and why?

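The bagging model was the most effective: it has the highest R2 score (0.865) and the lowest RMSE (0.177), although gradient boosting has a slightly lower MAE (0.056 vs. 0.060). That said, the differences between the four models are small and come from a single train/test split with untuned hyperparameters, so it is not a particularly reliable model and should not be used for any serious predictions.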
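
One way to make such a comparison easier to read is to collect the metrics of all models in a single table. The sketch below does this on synthetic data; `compare_models` is a hypothetical helper, and the data set and hyperparameters are assumptions rather than the lab's settings.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import (AdaBoostRegressor, BaggingRegressor,
                              GradientBoostingRegressor, RandomForestRegressor)
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def compare_models(models, X_tr, y_tr, X_te, y_te):
    """Fit each model and return test-set MAE, RMSE and R2, best R2 first."""
    rows = []
    for name, model in models.items():
        pred = model.fit(X_tr, y_tr).predict(X_te)
        rows.append({
            "model": name,
            "MAE": mean_absolute_error(y_te, pred),
            "RMSE": np.sqrt(mean_squared_error(y_te, pred)),
            "R2": r2_score(y_te, pred),
        })
    return pd.DataFrame(rows).set_index("model").sort_values("R2", ascending=False)

# Synthetic data (assumption: stands in for X_train_norm / X_test_norm)
X, y = make_regression(n_samples=300, n_features=6, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "Bagging": BaggingRegressor(DecisionTreeRegressor(max_depth=5), n_estimators=20, random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=20, max_depth=5, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=20, max_depth=3, random_state=0),
    "AdaBoost": AdaBoostRegressor(DecisionTreeRegressor(max_depth=5), n_estimators=20, random_state=0),
}

results = compare_models(models, X_tr, y_tr, X_te, y_te)
print(results)
```

Fixing `random_state` on every model, as done here, makes the ranking reproducible between runs.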