# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this lab, you should try different ensemble methods to see if you can obtain a better model than before. To make the comparison fair, apply the same feature scaling and feature engineering as in the previous lab.

In [101]:
# Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (BaggingRegressor, RandomForestRegressor,
                              AdaBoostRegressor, GradientBoostingRegressor)

In [102]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


Now perform the same steps as before:
- Feature Scaling
- Feature Selection


In [104]:
spaceship.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

In [105]:
spaceship.dropna(inplace=True)

In [106]:
spaceship.select_dtypes("object")

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | VIP | Name |
|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | False | Maham Ofracculy |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | False | Juanna Vines |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | True | Altark Susent |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | False | Solam Susent |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | False | Willy Santantines |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 8688 | 9276_01 | Europa | False | A/98/P | 55 Cancri e | True | Gravior Noxnuther |
| 8689 | 9278_01 | Earth | True | G/1499/S | PSO J318.5-22 | False | Kurta Mondalley |
| 8690 | 9279_01 | Earth | False | G/1500/S | TRAPPIST-1e | False | Fayey Connon |
| 8691 | 9280_01 | Europa | False | E/608/S | 55 Cancri e | False | Celeon Hontichre |


In [107]:
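# Keep only the numeric columns as features: drop the target, the identifier, and the categorical columns
features = spaceship.drop(columns=["Transported", "HomePlanet", "CryoSleep", "Cabin", "Destination", "VIP", "Name", "PassengerId"])
target = spaceship["Transported"]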

**Perform Train Test Split**

In [109]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=0)

In [110]:
# Create a MinMaxScaler instance (scales each feature to the [0, 1] range of the training data)
normalizer = MinMaxScaler()

In [111]:
normalizer.fit(X_train)

In [112]:
X_train_norm = normalizer.transform(X_train)
X_test_norm = normalizer.transform(X_test)
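
As a quick check of what the scaler does, the sketch below reproduces MinMaxScaler by hand on a tiny made-up array (the values and the `toy_*` names are illustrative, not taken from the dataset). Each column is mapped to (x - min) / (max - min), using the minimum and maximum learned from the training rows only, so a test value can fall outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical values, not from the Spaceship Titanic file)
toy_train = np.array([[10.0, 0.0], [20.0, 50.0], [30.0, 100.0]])
toy_test = np.array([[40.0, 25.0]])

# fit() learns the per-column min and max from the training rows only
toy_scaler = MinMaxScaler().fit(toy_train)
sk_test = toy_scaler.transform(toy_test)

# Manual version of the same transform
col_min = toy_train.min(axis=0)
col_max = toy_train.max(axis=0)
manual_test = (toy_test - col_min) / (col_max - col_min)

print(sk_test)      # first value is above 1 because 40 exceeds the training max of 30
print(manual_test)  # same numbers as the scaler output
```

Fitting on the training split and only transforming the test split, as the cells above do, keeps test-set statistics out of the preprocessing.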

In [113]:
X_train_norm

array([[4.05063291e-01, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00],
       [5.06329114e-02, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00],
       [3.79746835e-01, 0.00000000e+00, 7.91600979e-03, 0.00000000e+00,
        5.12763299e-02, 0.00000000e+00],
       ...,
       [4.55696203e-01, 0.00000000e+00, 1.59527723e-01, 0.00000000e+00,
        3.48893252e-01, 4.72069237e-03],
       [4.30379747e-01, 0.00000000e+00, 1.34169658e-04, 0.00000000e+00,
        3.05694395e-02, 8.74803304e-02],
       [1.77215190e-01, 2.01612903e-04, 2.95508671e-02, 0.00000000e+00,
        2.57943592e-02, 1.05232101e-02]])

In [114]:
X_train_norm = pd.DataFrame(X_train_norm,columns = X_train.columns)
X_train_norm

| | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck |
|---|---|---|---|---|---|---|
| 0 | 0.405063 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 1 | 0.050633 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 2 | 0.379747 | 0.000000 | 0.007916 | 0.000000 | 0.051276 | 0.000000 |
| 3 | 0.215190 | 0.001310 | 0.000000 | 0.046111 | 0.016378 | 0.000049 |
| 4 | 0.329114 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... |
| 5279 | 0.670886 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 5280 | 0.455696 | 0.000000 | 0.000000 | 0.000000 | 0.032355 | 0.000098 |
| 5281 | 0.455696 | 0.000000 | 0.159528 | 0.000000 | 0.348893 | 0.004721 |
| 5282 | 0.430380 | 0.000000 | 0.000134 | 0.000000 | 0.030569 | 0.087480 |


In [115]:
X_test_norm = pd.DataFrame(X_test_norm,columns= X_test.columns)
X_test_norm

| | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck |
|---|---|---|---|---|---|---|
| 0 | 0.632911 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 |
| 1 | 0.227848 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 |
| 2 | 0.189873 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 |
| 3 | 0.658228 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 |
| 4 | 0.784810 | 0.000000 | 0.054775 | 0.00000 | 0.077740 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... |
| 1317 | 0.240506 | 0.000000 | 0.000000 | 0.05468 | 0.000045 | 0.001672 |
| 1318 | 0.468354 | 0.030242 | 0.115185 | 0.00000 | 0.000045 | 0.008409 |
| 1319 | 0.544304 | 0.000202 | 0.178748 | 0.00000 | 0.000312 | 0.000000 |
| 1320 | 0.177215 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 |


**Model Selection** - Now apply different ensemble methods to try to get a better model.

- Bagging and Pasting

In [118]:
bagging_reg = BaggingRegressor(DecisionTreeRegressor(max_depth=20),
                               n_estimators=100,
                               max_samples = 1000)

In [119]:
bagging_reg.fit(X_train_norm, y_train)

In [155]:
pred = bagging_reg.predict(X_test_norm)

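# scikit-learn metrics take (y_true, y_pred); np.sqrt avoids the `squared=False` argument, which newer releases no longer accept
print(f"MAE, {mean_absolute_error(y_test, pred): .2f}")
print(f"RMSE, {np.sqrt(mean_squared_error(y_test, pred)): .2f}")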
print(f"R2 score, {bagging_reg.score(X_test_norm, y_test): .2f}")

MAE,  0.31
RMSE,  0.40
R2 score,  0.36
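
The cell above leaves `bootstrap` at its default of True, so it is bagging (rows drawn with replacement). Pasting is the same estimator with `bootstrap=False` (rows drawn without replacement). The sketch below contrasts the two on synthetic data from `make_regression`; the data, the `*_demo` names, and the hyperparameters are assumptions for illustration, not the lab's setup.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the scaled features (assumption: not the lab data)
X_demo, y_demo = make_regression(n_samples=600, n_features=6, noise=10.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# Bagging: each tree is trained on 200 rows drawn WITH replacement
bagging_demo = BaggingRegressor(DecisionTreeRegressor(max_depth=5), n_estimators=50,
                                max_samples=200, bootstrap=True, random_state=0)

# Pasting: each tree is trained on 200 rows drawn WITHOUT replacement
pasting_demo = BaggingRegressor(DecisionTreeRegressor(max_depth=5), n_estimators=50,
                                max_samples=200, bootstrap=False, random_state=0)

demo_scores = {}
for name, model in [("bagging", bagging_demo), ("pasting", pasting_demo)]:
    model.fit(Xd_train, yd_train)
    demo_scores[name] = model.score(Xd_test, yd_test)  # R2 on the held-out rows
    print(f"{name}: R2 = {demo_scores[name]:.2f}")
```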


- Random Forests

In [157]:
forest = RandomForestRegressor(n_estimators=100,
                             max_depth=20)

In [159]:
forest.fit(X_train_norm, y_train)

In [161]:
pred = forest.predict(X_test_norm)
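# scikit-learn metrics take (y_true, y_pred); np.sqrt avoids the `squared=False` argument, which newer releases no longer accept
print(f"MAE, {mean_absolute_error(y_test, pred): .2f}")
print(f"RMSE, {np.sqrt(mean_squared_error(y_test, pred)): .2f}")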
print(f"R2 score, {forest.score(X_test_norm, y_test): .2f}")

MAE,  0.31
RMSE,  0.41
R2 score,  0.34
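
A fitted random forest also exposes per-feature importances and, if requested, an out-of-bag (OOB) score computed from the rows each tree did not see. The sketch below shows both on synthetic data; the `feature_i` column names and the `*_demo` variables are hypothetical, and the numbers it prints say nothing about the Spaceship Titanic features.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with 3 informative columns out of 6 (assumption: not the lab data)
X_demo, y_demo = make_regression(n_samples=500, n_features=6, n_informative=3,
                                 noise=5.0, random_state=0)
demo_cols = [f"feature_{i}" for i in range(6)]

# oob_score=True asks the forest to score itself on the out-of-bag rows
forest_demo = RandomForestRegressor(n_estimators=100, max_depth=20,
                                    oob_score=True, random_state=0)
forest_demo.fit(X_demo, y_demo)

# Impurity-based importances, normalized so they sum to 1
importances = pd.Series(forest_demo.feature_importances_, index=demo_cols)
print(importances.sort_values(ascending=False))
print(f"OOB R2: {forest_demo.oob_score_:.2f}")
```

The same two attributes could be read from a forest fitted on the lab's own training split to see which spending columns it relies on.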


- Gradient Boosting

In [163]:
gb_reg = GradientBoostingRegressor(max_depth=20,
                                   n_estimators=100)

In [167]:
gb_reg.fit(X_train_norm,y_train)

In [169]:
pred = gb_reg.predict(X_test_norm)
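# scikit-learn metrics take (y_true, y_pred); np.sqrt avoids the `squared=False` argument, which newer releases no longer accept
print(f"MAE, {mean_absolute_error(y_test, pred): .2f}")
print(f"RMSE, {np.sqrt(mean_squared_error(y_test, pred)): .2f}")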
print(f"R2 score, {gb_reg.score(X_test_norm, y_test): .2f}")

MAE,  0.31
RMSE,  0.47
R2 score,  0.11
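
Gradient boosting is usually run with shallow trees (scikit-learn's default for GradientBoostingRegressor is max_depth=3), so the depth of 20 used above may be one reason for the low test score. The sketch below compares train and test R2 for both depths on synthetic data; the data and the `*_demo` names are assumptions, and the result may not carry over to the lab's data without trying it.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Noisy synthetic data (assumption: not the lab data)
X_demo, y_demo = make_regression(n_samples=400, n_features=6, noise=20.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

depth_scores = {}
for depth in (3, 20):
    gb_demo = GradientBoostingRegressor(max_depth=depth, n_estimators=100, random_state=0)
    gb_demo.fit(Xd_train, yd_train)
    # Store (train R2, test R2); a large gap between them suggests overfitting
    depth_scores[depth] = (gb_demo.score(Xd_train, yd_train),
                           gb_demo.score(Xd_test, yd_test))
    print(f"max_depth={depth}: train R2 = {depth_scores[depth][0]:.2f}, "
          f"test R2 = {depth_scores[depth][1]:.2f}")
```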


- Adaptive Boosting

In [171]:
ada_reg = AdaBoostRegressor(DecisionTreeRegressor(max_depth=20),
                            n_estimators=100)

In [173]:
ada_reg.fit(X_train_norm,y_train)

In [175]:
pred = ada_reg.predict(X_test_norm)
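# scikit-learn metrics take (y_true, y_pred); np.sqrt avoids the `squared=False` argument, which newer releases no longer accept
print(f"MAE, {mean_absolute_error(y_test, pred): .2f}")
print(f"RMSE, {np.sqrt(mean_squared_error(y_test, pred)): .2f}")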
print(f"R2 score, {ada_reg.score(X_test_norm, y_test): .2f}")

MAE,  0.34
RMSE,  0.48
R2 score,  0.07
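
Transported is a True/False label, and each of the four ensembles above also has a classifier counterpart in scikit-learn. The sketch below fits those classifiers on a synthetic 0/1 target and reports accuracy; the data, the hyperparameters, and the `*_demo` names are assumptions for illustration, and the scores are not comparable with the regression metrics in this lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary target standing in for Transported (assumption: not the lab data)
X_demo, y_demo = make_classification(n_samples=600, n_features=6, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

classifiers_demo = {
    "bagging": BaggingClassifier(DecisionTreeClassifier(max_depth=5),
                                 n_estimators=50, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, max_depth=3,
                                                    random_state=0),
    "adaboost": AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                                   n_estimators=50, random_state=0),
}

accuracies = {}
for name, clf in classifiers_demo.items():
    clf.fit(Xd_train, yd_train)
    # Fraction of held-out rows whose class is predicted correctly
    accuracies[name] = accuracy_score(yd_test, clf.predict(Xd_test))
    print(f"{name}: accuracy = {accuracies[name]:.2f}")
```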


Which model is the best and why?

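Bagging (BaggingRegressor) performs best on the test set: it has the highest R2 score (0.36) and the lowest RMSE (0.40). The random forest is close behind (R2 0.34, RMSE 0.41), while gradient boosting (R2 0.11) and AdaBoost (R2 0.07) are clearly worse. A likely reason, not verified here, is that boosting with depth-20 trees overfits the training data.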
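
To compare the models side by side rather than reading four separate outputs, the metrics can be collected into one table. The sketch below does this with a hypothetical helper, `evaluate_models`, on a synthetic 0/1 target; the data and the smaller `n_estimators` values are assumptions chosen to keep it quick, so its numbers will differ from the results above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostRegressor, BaggingRegressor,
                              GradientBoostingRegressor, RandomForestRegressor)
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def evaluate_models(models, X_tr, y_tr, X_te, y_te):
    """Fit each model and return its test metrics, best R2 first (hypothetical helper)."""
    rows = []
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        pred = model.predict(X_te)
        rows.append({
            "model": name,
            "MAE": mean_absolute_error(y_te, pred),
            "RMSE": np.sqrt(mean_squared_error(y_te, pred)),
            "R2": r2_score(y_te, pred),
        })
    return pd.DataFrame(rows).set_index("model").sort_values("R2", ascending=False)


# Synthetic 0/1 target standing in for Transported (assumption: not the lab data)
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

demo_models = {
    "bagging": BaggingRegressor(DecisionTreeRegressor(max_depth=20),
                                n_estimators=20, max_samples=100, random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=20, max_depth=20, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(n_estimators=20, max_depth=20,
                                                   random_state=0),
    "adaboost": AdaBoostRegressor(DecisionTreeRegressor(max_depth=20),
                                  n_estimators=20, random_state=0),
}

results = evaluate_models(demo_models, Xd_train, yd_train, Xd_test, yd_test)
print(results.round(2))
```

Passing the lab's own models and the `X_train_norm`, `y_train`, `X_test_norm`, `y_test` splits to the same helper should give the four result blocks above as a single ranked table.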