# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this lab, try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous lab.

In [39]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (BaggingRegressor, RandomForestRegressor,
                              AdaBoostRegressor, GradientBoostingRegressor)
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

import warnings
# Suppress future warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [24]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


In [25]:
# Label-encode every non-numeric column (including the boolean target).
# Note: this runs before dropna(), so missing values in these columns appear
# to be encoded as a category of their own (see the non-null counts below)
# rather than being dropped.
categorical_cols = ['PassengerId', 'HomePlanet', 'CryoSleep', 'Cabin',
                    'Destination', 'VIP', 'Name', 'Transported']
for col in categorical_cols:
    # A fresh encoder per column; each one maps that column's values to integers
    spaceship[col] = LabelEncoder().fit_transform(spaceship[col])

In [26]:
#your code here
spaceship_cleaned = spaceship.dropna()
spaceship_cleaned.shape

(7620, 14)

In [27]:
spaceship_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7620 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   7620 non-null   int32  
 1   HomePlanet    7620 non-null   int32  
 2   CryoSleep     7620 non-null   int32  
 3   Cabin         7620 non-null   int32  
 4   Destination   7620 non-null   int32  
 5   Age           7620 non-null   float64
 6   VIP           7620 non-null   int32  
 7   RoomService   7620 non-null   float64
 8   FoodCourt     7620 non-null   float64
 9   ShoppingMall  7620 non-null   float64
 10  Spa           7620 non-null   float64
 11  VRDeck        7620 non-null   float64
 12  Name          7620 non-null   int32  
 13  Transported   7620 non-null   int64  
dtypes: float64(6), int32(7), int64(1)
memory usage: 684.6 KB


Now perform the same steps as before:
- Feature Scaling
- Feature Selection


In [28]:
#your code here
features = spaceship_cleaned.drop(columns = ["Transported"])
target = spaceship_cleaned["Transported"]

**Perform Train Test Split**

In [29]:
#your code here
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.20, random_state=0)

In [30]:
normalizer = MinMaxScaler()

normalizer.fit(X_train)

In [31]:
X_train_norm = normalizer.transform(X_train)

X_test_norm = normalizer.transform(X_test)
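
The scaler above is fitted on the training split only and then applied to both splits. As a rough illustration of what that means for a single column, the sketch below reimplements min-max scaling in plain Python. The helper names `fit_min_max` and `apply_min_max` and the sample values are made up for this example and are not part of scikit-learn.

```python
def fit_min_max(values):
    """Learn the minimum and maximum of a training column."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Scale values with the training min/max: (x - lo) / (hi - lo).

    This sketch assumes hi > lo (the column is not constant).
    """
    span = hi - lo
    return [(v - lo) / span for v in values]


# Toy values standing in for one numeric column (hypothetical data)
train_age = [16.0, 24.0, 33.0, 39.0, 58.0]
test_age = [10.0, 30.0, 64.0]

lo, hi = fit_min_max(train_age)                     # learned from train only
train_scaled = apply_min_max(train_age, lo, hi)     # always within [0, 1]
test_scaled = apply_min_max(test_age, lo, hi)       # may fall outside [0, 1]

print(train_scaled)
print(test_scaled)
```

Test values below the training minimum or above the training maximum end up outside the [0, 1] range. That is expected when the scaler never sees the test split.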

In [32]:
X_train_norm = pd.DataFrame(X_train_norm, columns = X_train.columns)
X_train_norm.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.796825 | 0.0 | 0.0 | 0.42439 | 0.333333 | 0.192308 | 0.0 | 0.0 | 0.0 | 0.000383 | 0.0 | 0.046814 | 0.47138 |
| 1 | 0.656121 | 0.0 | 0.0 | 0.367073 | 1.0 | 0.410256 | 0.0 | 0.000209 | 0.0 | 0.0 | 0.0 | 0.037225 | 0.684881 |
| 2 | 0.672112 | 0.333333 | 0.0 | 0.112195 | 0.666667 | 0.461538 | 0.0 | 0.0 | 0.088015 | 0.070535 | 0.150711 | 4.9e-05 | 0.292812 |
| 3 | 0.322711 | 0.0 | 0.5 | 0.882927 | 0.666667 | 0.282051 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.090405 |
| 4 | 0.422572 | 0.0 | 0.5 | 0.91814 | 0.666667 | 0.358974 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.484952 |


In [33]:
X_test_norm = pd.DataFrame(X_test_norm, columns = X_test.columns)
X_test_norm.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.10665 | 0.333333 | 0.0 | 0.199695 | 0.0 | 0.589744 | 0.0 | 0.0832 | 0.05823 | 0.000724 | 0.001131 | 0.0 | 0.636492 |
| 1 | 0.493902 | 0.0 | 0.0 | 0.662652 | 0.666667 | 0.282051 | 0.0 | 0.047882 | 0.001509 | 0.000681 | 0.000377 | 0.000197 | 0.990912 |
| 2 | 0.736769 | 0.666667 | 0.0 | 0.399695 | 0.666667 | 0.410256 | 0.0 | 0.204998 | 0.002013 | 0.029627 | 0.0 | 0.0 | 0.837366 |
| 3 | 0.921077 | 0.0 | 0.5 | 0.778354 | 0.333333 | 0.24359 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.261891 |
| 4 | 0.205821 | 0.0 | 0.0 | 0.839939 | 0.666667 | 0.24359 | 0.0 | 0.002932 | 0.0 | 0.014686 | 0.077051 | 0.0 | 0.125457 |


**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [37]:
#your code here
bagging_reg = BaggingRegressor(DecisionTreeRegressor(max_depth=20),
                               n_estimators=100,
                               max_samples = 1000)
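
The heading mentions both bagging and pasting. The two differ only in how each estimator's training subset is drawn: bagging samples rows with replacement, pasting samples them without. In scikit-learn this is controlled by the `bootstrap` argument of `BaggingRegressor`, which defaults to `True`, so the cell above performs bagging. The sketch below shows the two sampling schemes in plain Python; the row indices, subset size, and seed are toy values chosen for illustration.

```python
import random

rng = random.Random(0)          # fixed seed so the sketch is repeatable
row_ids = list(range(10))       # stand-in for the training row indices
subset_size = 6                 # plays the role of max_samples

# Bagging: draw with replacement, so the same row can appear more than once
bagging_sample = rng.choices(row_ids, k=subset_size)

# Pasting: draw without replacement, so every row appears at most once
pasting_sample = rng.sample(row_ids, subset_size)

print("bagging:", sorted(bagging_sample))
print("pasting:", sorted(pasting_sample))
```

To try pasting with the model above, one could pass `bootstrap=False` to `BaggingRegressor` and compare the resulting metrics.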

In [38]:
bagging_reg.fit(X_train_norm, y_train)

In [40]:
pred = bagging_reg.predict(X_test_norm)

print("MAE", mean_absolute_error(y_test, pred))
# RMSE via np.sqrt: the `squared` argument is deprecated in newer scikit-learn releases
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 score", bagging_reg.score(X_test_norm, y_test))

MAE 0.28581530954464024
RMSE 0.3759168967382884
R2 score 0.4345813784264594
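
To make the three reported metrics concrete, here is a small from-scratch sketch of MAE, RMSE, and R2 in plain Python. The function names and the toy values are assumptions for illustration only; the notebook itself relies on the scikit-learn implementations.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy 0/1 targets and continuous predictions (hypothetical values)
y_true_toy = [0, 1, 1, 0]
y_pred_toy = [0.2, 0.7, 0.9, 0.4]

print("MAE", mae(y_true_toy, y_pred_toy))
print("RMSE", rmse(y_true_toy, y_pred_toy))
print("R2 score", r2(y_true_toy, y_pred_toy))
```

Because RMSE squares each error before averaging, it penalizes a few large misses more heavily than MAE does. This is why two models can rank differently under the two metrics.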


- Random Forests

In [41]:
#your code here
forest = RandomForestRegressor(n_estimators=100,
                             max_depth=20)

In [42]:
forest.fit(X_train_norm, y_train)

In [43]:
pred = forest.predict(X_test_norm)

print("MAE", mean_absolute_error(y_test, pred))
# RMSE via np.sqrt: the `squared` argument is deprecated in newer scikit-learn releases
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 score", forest.score(X_test_norm, y_test))

MAE 0.2724119214776643
RMSE 0.3831585372427741
R2 score 0.412587171242756


- Gradient Boosting

In [44]:
#your code here
gb_reg = GradientBoostingRegressor(max_depth=20,
                                   n_estimators=100)

In [45]:
gb_reg.fit(X_train_norm, y_train)

In [46]:
pred = gb_reg.predict(X_test_norm)

print("MAE", mean_absolute_error(y_test, pred))
# RMSE via np.sqrt: the `squared` argument is deprecated in newer scikit-learn releases
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 score", gb_reg.score(X_test_norm, y_test))

MAE 0.2651601450128624
RMSE 0.45196693793509335
R2 score 0.18266565787750033
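
For squared-error loss, gradient boosting builds its ensemble by repeatedly fitting a small tree to the residuals of the current predictions and adding that tree's output, scaled by a learning rate. The sketch below illustrates this idea on one feature with one-split trees ("stumps"). It is a simplified teaching version, not how scikit-learn implements `GradientBoostingRegressor`, and all names and data in it are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals (least squares)."""
    best = None
    # Skip the largest value so the right-hand side of a split is never empty
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then add stumps fitted to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        preds = [p + learning_rate * (left_mean if x <= threshold else right_mean)
                 for x, p in zip(xs, preds)]
    return base, stumps, preds


def mse(ys, preds):
    """Mean squared error on the given predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Toy one-feature data with a 0/1 target (hypothetical values)
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

base, stumps, boosted = boost(xs, ys)
baseline_mse = mse(ys, [base] * len(ys))
boosted_mse = mse(ys, boosted)
print("baseline MSE", baseline_mse)
print("boosted MSE", boosted_mse)
```

Boosting usually combines many shallow trees; `GradientBoostingRegressor` defaults to `max_depth=3`. The `max_depth=20` used above is much deeper, which may be one reason its RMSE and R2 are weaker here. Trying a smaller depth would be a reasonable experiment.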


- Adaptive Boosting

In [47]:
ada_reg = AdaBoostRegressor(DecisionTreeRegressor(max_depth=20),
                            n_estimators=100)

In [48]:
ada_reg.fit(X_train_norm, y_train)

In [49]:
pred = ada_reg.predict(X_test_norm)

print("MAE", mean_absolute_error(y_test, pred))
# RMSE via np.sqrt: the `squared` argument is deprecated in newer scikit-learn releases
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 score", ada_reg.score(X_test_norm, y_test))

MAE 0.21805344134619942
RMSE 0.4213599538483868
R2 score 0.28961639566340047


Which model is the best and why?

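Metrics with Bagging
MAE 0.28581530954464024
RMSE 0.3759168967382884
R2 score 0.4345813784264594

Metrics with Random Forest
MAE 0.2724119214776643
RMSE 0.3831585372427741
R2 score 0.412587171242756

Metrics with Gradient Boosting
MAE 0.2651601450128624
RMSE 0.45196693793509335
R2 score 0.18266565787750033

Metrics with Adaptive Boosting
MAE 0.21805344134619942
RMSE 0.4213599538483868
R2 score 0.28961639566340047

AdaBoost has the lowest MAE (0.218), but Bagging has the lowest RMSE (0.376) and the highest R2 score (0.435), with Random Forest close behind on both. Between the two boosting models, AdaBoost beats Gradient Boosting on all three metrics. The answer therefore depends on the metric: AdaBoost is best by MAE, while Bagging is best by RMSE and R2. None of the models sets a `random_state`, so these numbers will vary slightly between runs.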
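
Since `Transported` is a 0/1 label after encoding, a regressor's continuous output can also be turned into a class prediction by thresholding, which makes accuracy available as an additional metric. A minimal sketch, assuming a cutoff of 0.5 and toy values (neither comes from the lab):

```python
def to_labels(scores, threshold=0.5):
    """Map continuous scores to 0/1 labels using a fixed cutoff (assumed 0.5)."""
    return [1 if s >= threshold else 0 for s in scores]


def accuracy(y_true, y_label):
    """Fraction of labels that match the true values."""
    return sum(t == p for t, p in zip(y_true, y_label)) / len(y_true)


# Toy targets and regressor outputs (hypothetical values)
y_true_toy = [0, 1, 1, 0, 1]
scores_toy = [0.2, 0.7, 0.45, 0.6, 0.9]

labels_toy = to_labels(scores_toy)
acc_toy = accuracy(y_true_toy, labels_toy)
print("labels", labels_toy)
print("accuracy", acc_toy)
```

With the real models, the same idea could be applied to `pred` and `y_test`. Another option is to swap the regressors for their classifier counterparts in `sklearn.ensemble`, which predict class labels directly.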