# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see if you can obtain a better model than before. To make the comparison fair, apply the same feature scaling and feature engineering as in the previous Lab.

In [71]:
#Libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor

In [72]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same as before:
- Feature Scaling
- Feature Selection
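The feature scaling step can be sketched as follows. This is a minimal illustration, assuming min-max scaling with `MinMaxScaler` fitted on the training split only; the toy arrays are hypothetical stand-ins for two numeric columns, not the real data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-ins for two numeric columns (e.g. Age and a spending column).
X_train_num = np.array([[39.0, 0.0], [24.0, 109.0], [58.0, 43.0], [16.0, 303.0]])
X_test_num = np.array([[33.0, 0.0], [41.0, 606.0]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train_num)  # learn min/max from the train split only
X_test_scaled = scaler.transform(X_test_num)        # reuse the train min/max on the test split

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
print(X_test_scaled)
```

Fitting on the training split alone keeps test-set information out of the preprocessing. A side effect is that test values outside the training range can fall outside [0, 1], as the second test row does here.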


In [74]:
spaceship = spaceship.dropna()
spaceship.shape

(6606, 14)

In [75]:
spaceship['Cabin'] = spaceship['Cabin'].str[0]
spaceship['Cabin'].unique()

array(['B', 'F', 'A', 'G', 'E', 'C', 'D', 'T'], dtype=object)

In [76]:
X = spaceship.drop(columns=['PassengerId', 'Name', 'Transported'])
X

Unnamed: 0,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
0,Europa,False,B,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0
1,Earth,False,F,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0
2,Europa,False,A,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0
3,Europa,False,A,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0
4,Earth,False,F,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...
8688,Europa,False,A,55 Cancri e,41.0,True,0.0,6819.0,0.0,1643.0,74.0
8689,Earth,True,G,PSO J318.5-22,18.0,False,0.0,0.0,0.0,0.0,0.0
8690,Earth,False,G,TRAPPIST-1e,26.0,False,0.0,0.0,1872.0,1.0,0.0
8691,Europa,False,E,55 Cancri e,32.0,False,0.0,1049.0,0.0,353.0,3235.0


In [77]:
y = spaceship['Transported']

In [78]:
from sklearn.preprocessing import OneHotEncoder

categorical_cols = X.select_dtypes(include=['object']).columns
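# `sparse_output` replaced `sparse` in scikit-learn 1.2; use `sparse=False` on older versions
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)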
X_cat_encoded = pd.DataFrame(ohe.fit_transform(X[categorical_cols]), columns=ohe.get_feature_names_out(), index=X.index)
X_cat_encoded



Unnamed: 0,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_False,CryoSleep_True,Cabin_A,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_False,VIP_True
0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
1,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
2,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
3,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
4,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
8689,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
8690,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0
8691,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
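As an alternative that does not depend on the scikit-learn version, the same kind of one-hot table can be built with `pandas.get_dummies`. This is a sketch on a hypothetical three-row frame, not the notebook's method; note that, unlike a fitted `OneHotEncoder`, `get_dummies` does not remember the categories for later use on new data.

```python
import pandas as pd

# Hypothetical three-row frame with two categorical columns and one numeric column.
toy = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", "Mars"],
    "Cabin": ["B", "F", "A"],
    "Age": [39.0, 24.0, 58.0],
})

# One column per category; dtype=float gives 0.0/1.0 values like the table above.
encoded = pd.get_dummies(toy, columns=["HomePlanet", "Cabin"], dtype=float)
print(encoded)
```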


In [79]:
X_non_cat = X.select_dtypes(exclude=['object'])
X_non_cat

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
0,39.0,0.0,0.0,0.0,0.0,0.0
1,24.0,109.0,9.0,25.0,549.0,44.0
2,58.0,43.0,3576.0,0.0,6715.0,49.0
3,33.0,0.0,1283.0,371.0,3329.0,193.0
4,16.0,303.0,70.0,151.0,565.0,2.0
...,...,...,...,...,...,...
8688,41.0,0.0,6819.0,0.0,1643.0,74.0
8689,18.0,0.0,0.0,0.0,0.0,0.0
8690,26.0,0.0,0.0,1872.0,1.0,0.0
8691,32.0,0.0,1049.0,0.0,353.0,3235.0


In [80]:
X_final = pd.concat([X_cat_encoded, X_non_cat], axis=1)
X_final

Unnamed: 0,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_False,CryoSleep_True,Cabin_A,Cabin_B,Cabin_C,Cabin_D,Cabin_E,...,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_False,VIP_True,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,1.0,0.0,39.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,1.0,0.0,24.0,109.0,9.0,25.0,549.0,44.0
2,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,58.0,43.0,3576.0,0.0,6715.0,49.0
3,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,1.0,0.0,33.0,0.0,1283.0,371.0,3329.0,193.0
4,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,1.0,0.0,16.0,303.0,70.0,151.0,565.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,41.0,0.0,6819.0,0.0,1643.0,74.0
8689,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,18.0,0.0,0.0,0.0,0.0,0.0
8690,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,1.0,0.0,26.0,0.0,0.0,1872.0,1.0,0.0
8691,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,1.0,0.0,32.0,0.0,1049.0,0.0,353.0,3235.0


**Perform Train Test Split**

In [82]:
X_train, X_test, y_train, y_test = train_test_split(X_final, y, test_size = 0.20, random_state=0)

In [83]:
X_train_final_np = X_train.values
X_test_final_np = X_test.values

In [84]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()

In [85]:
knn.fit(X_train_final_np, y_train)

In [86]:
# KNeighborsClassifier.score returns mean accuracy, not R2
print(f"The accuracy of the model on the TRAIN set is: {knn.score(X_train_final_np, y_train): .2f}")
print(f"The accuracy of the model on the TEST set is: {knn.score(X_test_final_np, y_test): .2f}")

The accuracy of the model on the TRAIN set is:  0.83
The accuracy of the model on the TEST set is:  0.77
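For a classifier such as `KNeighborsClassifier`, `.score` reports mean accuracy, whereas for a regressor it reports R². The two metrics can differ a lot on the same predictions. The sketch below computes both by hand on hypothetical labels (1 = transported) to make the difference concrete; the helper names are illustrative only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the label exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical labels and hard 0/1 predictions: 8 of 10 are correct.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

print(f"accuracy: {accuracy(y_true, y_pred):.2f}")
print(f"R2:       {r2(y_true, y_pred):.2f}")
```

Here the same predictions give an accuracy of 0.80 but an R² of only 0.20, so a score from one metric should not be ranked against a score from the other.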


**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [235]:
bagging_reg = BaggingRegressor(DecisionTreeRegressor(max_depth=20),
                               n_estimators=200,
                               max_samples = 1000) 

In [237]:
bagging_reg.fit(X_train_final_np, y_train)

In [238]:
y_pred_test_bag = bagging_reg.predict(X_test_final_np)

# y_true goes first; np.sqrt(MSE) gives the RMSE without `squared=False`, deprecated in scikit-learn 1.4
print(f"MAE {mean_absolute_error(y_test, y_pred_test_bag): .2f}")
print(f"RMSE {np.sqrt(mean_squared_error(y_test, y_pred_test_bag)): .2f}")
print(f"R2 score {bagging_reg.score(X_test_final_np, y_test): .2f}")

MAE  0.28
RMSE  0.38
R2 score  0.43


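The Bagging model's R² (0.43) is well below the KNN score (0.77), but the two numbers are not directly comparable: `KNeighborsClassifier.score` returns accuracy, while `BaggingRegressor.score` returns R² computed on the boolean target treated as 0/1. The MAE of 0.28 is the average absolute gap between the predicted value and the 0/1 label, not the share of misclassified passengers.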
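A regressor trained on a 0/1 target outputs continuous values, so one way to obtain an accuracy figure is to threshold those values. The sketch below uses hypothetical predictions and an assumed cut-off of 0.5; neither comes from the notebook.

```python
# Hypothetical labels and continuous outputs of a regressor trained on a 0/1 target.
y_true = [1, 0, 1, 1, 0, 0]
y_score = [0.9, 0.2, 0.6, 0.4, 0.1, 0.7]

# MAE: average distance between the continuous output and the 0/1 label.
mae = sum(abs(t - s) for t, s in zip(y_true, y_score)) / len(y_true)

# Threshold at 0.5 (an assumed cut-off) to get hard class predictions.
y_label = [1 if s >= 0.5 else 0 for s in y_score]
acc = sum(t == p for t, p in zip(y_true, y_label)) / len(y_true)

print(f"MAE:      {mae:.2f}")
print(f"accuracy: {acc:.2f}")
```

In this toy case the MAE is 0.35 while the accuracy is 0.67 (4 of 6 correct), which shows that the MAE of a regressor is not an error rate.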

- Random Forests

In [131]:
forest = RandomForestRegressor(n_estimators=100,
                             max_depth=20)

In [136]:
forest.fit(X_train_final_np, y_train)

In [140]:
y_pred_test_rf = forest.predict(X_test_final_np)

# y_true goes first; np.sqrt(MSE) gives the RMSE without `squared=False`, deprecated in scikit-learn 1.4
print(f"MAE, {mean_absolute_error(y_test, y_pred_test_rf): .2f}")
print(f"RMSE, {np.sqrt(mean_squared_error(y_test, y_pred_test_rf)): .2f}")
print(f"R2 score, {forest.score(X_test_final_np, y_test): .2f}")

MAE,  0.27
RMSE,  0.38
R2 score,  0.41


Bagging and Random Forest give similar errors (MAE 0.28 vs 0.27, RMSE 0.38 for both). The Random Forest's R² of 0.41 is close to the Bagging model's 0.43, meaning the model explains only 41% of the variance in the test target. This is below the KNN score (0.77), but that score is an accuracy rather than an R², so on its own it does not show that KNN is the better model for this dataset.

- Gradient Boosting

In [195]:
gb_reg = GradientBoostingRegressor(max_depth=5,
                                   n_estimators=100)

In [197]:
gb_reg.fit(X_train_final_np, y_train)

In [198]:
y_pred_test_gb = gb_reg.predict(X_test_final_np)

# y_true goes first; np.sqrt(MSE) gives the RMSE without `squared=False`, deprecated in scikit-learn 1.4
print(f"MAE, {mean_absolute_error(y_test, y_pred_test_gb): .2f}")
print(f"RMSE, {np.sqrt(mean_squared_error(y_test, y_pred_test_gb)): .2f}")
print(f"R2 score, {gb_reg.score(X_test_final_np, y_test): .2f}")

MAE,  0.28
RMSE,  0.38
R2 score,  0.43


Lowering `max_depth` to 5 gives a better R² (0.43) than `max_depth=200` (0.27), which is consistent with very deep trees overfitting the training data.
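One way to check the effect of `max_depth` is to compare train and test R² for a few depths. The sketch below mirrors the notebook's use of `GradientBoostingRegressor` on a 0/1 target, but runs on synthetic data from `make_classification`; the depths, sample size, and `n_estimators` are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic 0/1 target; NOT the Spaceship Titanic data.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scores = {}
for depth in (2, 5, 200):  # candidate depths are arbitrary choices
    model = GradientBoostingRegressor(max_depth=depth, n_estimators=50, random_state=0)
    model.fit(X_train, y_train)
    # .score is R2 for regressors; a large train/test gap suggests overfitting
    scores[depth] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"max_depth={depth}: train R2={scores[depth][0]:.2f}, test R2={scores[depth][1]:.2f}")
```

A train R² close to 1 combined with a much lower test R² is the usual sign that the trees are too deep for the data.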

- Adaptive Boosting

In [222]:
ada_reg = AdaBoostRegressor(DecisionTreeRegressor(max_depth=20),
                            n_estimators=100)

In [224]:
ada_reg.fit(X_train_final_np, y_train)

In [225]:
y_pred_test_ada = ada_reg.predict(X_test_final_np)

# y_true goes first; np.sqrt(MSE) gives the RMSE without `squared=False`, deprecated in scikit-learn 1.4
print(f"MAE, {mean_absolute_error(y_test, y_pred_test_ada): .2f}")
print(f"RMSE, {np.sqrt(mean_squared_error(y_test, y_pred_test_ada)): .2f}")
print(f"R2 score, {ada_reg.score(X_test_final_np, y_test): .2f}")

MAE,  0.25
RMSE,  0.43
R2 score,  0.27


AdaBoost has the lowest MAE of all the methods (0.25), but also the highest RMSE (0.43) and the lowest R² (0.27).

Which model is the best and why?

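Bagging and Gradient Boosting are tied on every metric at this precision: R² 0.43, MAE 0.28, RMSE 0.38. Random Forest has a slightly lower MAE (0.27) but also a slightly lower R² (0.41), and AdaBoost has the lowest R² (0.27).

The highest score belongs to the KNN model: 0.83 on the train set and 0.77 on the test set, with only a small gap between the two. However, that score is an accuracy, while the ensemble scores are R² values, so ranking KNN against the ensembles requires evaluating all models on the same metric.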
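A like-for-like comparison could use the classifier variants of each ensemble, so that every `.score` call returns accuracy. The sketch below is one possible setup on synthetic data from `make_classification`, not on the Spaceship Titanic set; all hyperparameters are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data; NOT the Spaceship Titanic set.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Hyperparameters are illustrative assumptions, not tuned values.
models = {
    "KNN": KNeighborsClassifier(),
    "Bagging": BaggingClassifier(DecisionTreeClassifier(max_depth=5),
                                 n_estimators=50, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, max_depth=3,
                                                    random_state=0),
    "AdaBoost": AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                                   n_estimators=50, random_state=0),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = model.score(X_test, y_test)  # mean accuracy for classifiers
    print(f"{name}: {accuracies[name]:.2f}")
```

With every model scored on accuracy over the same test split, the numbers can be ranked directly.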