# LAB | Ensemble Methods

**Load the data**

In this challenge, we will be working with the same Spaceship Titanic data, like the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods in order to see if can obtain a better model than before. In order to do a fair comparison, you should perform the same feature scaling, engineering applied in previous Lab.

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [4]:
#your code here
spaceship.dropna(inplace = True)
#spaceship.head()
spaceship["Cabin"] = spaceship["Cabin"].apply(lambda x: x[0])
spaceship.drop("PassengerId", axis=1, inplace=True)
spaceship.drop("Name", axis=1, inplace=True)
categorical_spaceship = spaceship.select_dtypes(include='object')
categorical_spaceship.head()
dummies_categ = pd.get_dummies(categorical_spaceship)
dummies_categ['Transported'] = spaceship['Transported']
dummies_categ['Transported']

0       False
1        True
2       False
3       False
4        True
        ...  
8688    False
8689    False
8690     True
8691    False
8692     True
Name: Transported, Length: 6606, dtype: bool

In [5]:
X = spaceship._get_numeric_data().drop(columns='Transported')
y = spaceship['Transported']



**Perform Train Test Split**

In [6]:
#your code here
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor,AdaBoostRegressor, GradientBoostingRegressor

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

normalizer = MinMaxScaler()
normalizer.fit(X_train)

In [13]:
X_train_norm = normalizer.transform(X_train)
X_test_norm = normalizer.transform(X_test)

In [14]:
X_train_norm = pd.DataFrame(X_train_norm, columns=X_train.columns)
X_train_norm.head()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
0,0.405063,0.0,0.0,0.0,0.0,0.0
1,0.050633,0.0,0.0,0.0,0.0,0.0
2,0.379747,0.0,0.007916,0.0,0.051276,0.0
3,0.21519,0.00131,0.0,0.046111,0.016378,4.9e-05
4,0.329114,0.0,0.0,0.0,0.0,0.0


In [15]:
X_test_norm = pd.DataFrame(X_test_norm, columns= X_test.columns)
X_test_norm.head()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
0,0.632911,0.0,0.0,0.0,0.0,0.0
1,0.227848,0.0,0.0,0.0,0.0,0.0
2,0.189873,0.0,0.0,0.0,0.0,0.0
3,0.658228,0.0,0.0,0.0,0.0,0.0
4,0.78481,0.0,0.054775,0.0,0.07774,0.0


**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [16]:
#your code here
# Bagging involves training multiple instances of the same base model on different subsets of the training data. 
# The final prediction is obtained by averaging or voting over predictions from these models.
bagging_reg = BaggingRegressor(DecisionTreeRegressor(max_depth=20),
                           n_estimators=100,
                             max_samples= 1000  )

In [18]:
bagging_reg.fit(X_train_norm, y_train)

In [32]:
# Evaluate model perf
pred = bagging_reg.predict(X_test_norm)
print("MAE Bagging Pasting" , mean_absolute_error(pred, y_test))
print("RMSE Bagging Pasting" , mean_squared_error(pred, y_test, squared=False))
print("R2 score Bagging Pasting" , bagging_reg.score(X_test_norm, y_test))

MAE Bagging Pasting 0.31592267142378333
RMSE Bagging Pasting 0.40107717931193615
R2 score Bagging Pasting 0.35654838494072405




- Random Forests

In [20]:
#your code here
# Random patches : randomizing the features that each predictor trains with.
#init a random forest
forest = RandomForestRegressor(n_estimators=100, max_depth=20)

In [21]:
#training model
forest.fit(X_train_norm, y_train)

In [31]:
#Evaluate the model
forest.predict(X_test_norm)
print("MAE random forests" , mean_absolute_error(pred, y_test))
print("RMSE random forests", mean_squared_error(pred, y_test, squared=False))
print("R2 score random forests" , forest.score(X_test_norm, y_test))

MAE random forests 0.31180330215445456
RMSE random forests 0.473048925645427
R2 score random forests 0.33874113443405485




- Gradient Boosting

In [23]:
#your code here
# each estimator will predict the error caused by its predecessor.
# Initialize a AdaBoost model
gb_reg = GradientBoostingRegressor(max_depth=20, n_estimators=100)

In [24]:
#Training the model
gb_reg.fit(X_train_norm, y_train)

In [30]:
#Evaluate the model
pred = gb_reg.predict(X_test_norm)

print("MAE Gradient Boosting", mean_absolute_error(pred, y_test))
print("RMSE Gradient Boosting", mean_squared_error(pred, y_test, squared=False))
print("R2 score Gradient Boosting", gb_reg.score(X_test_norm, y_test))

MAE Gradient Boosting 0.31180330215445456
RMSE Gradient Boosting 0.473048925645427
R2 score Gradient Boosting 0.1048988557828292




- Adaptive Boosting

In [26]:
#your code here
# instead of training our estimators independently by training them in parallel, 
# each estimators will learn at its predecessor's errors and focus on those datapoints where it failed.
ada_reg = AdaBoostRegressor(DecisionTreeRegressor(max_depth=20), n_estimators=100)

In [27]:
#Training model
ada_reg.fit(X_train_norm, y_train)

In [29]:
# Evaluate ADA:
pred = ada_reg.predict(X_test_norm)

print("MAE ADA", mean_absolute_error(pred, y_test))
print("RMSE ADA", mean_squared_error(pred, y_test, squared=False))
print("R2 score ADA", ada_reg.score(X_test_norm, y_test))

MAE ADA 0.33796629307171155
RMSE ADA 0.47462307114102936
R2 score ADA 0.09893176136262949




Which model is the best and why?

In [None]:
#comment here
# We have : 
"""
MAE Bagging Pasting 0.31592267142378333
RMSE Bagging Pasting 0.40107717931193615
R2 score Bagging Pasting 0.35654838494072405

MAE random forests 0.31180330215445456
RMSE random forests 0.473048925645427
R2 score random forests 0.33874113443405485

MAE Gradient Boosting 0.31180330215445456
RMSE Gradient Boosting 0.473048925645427
R2 score Gradient Boosting 0.1048988557828292

MAE ADA 0.33796629307171155
RMSE ADA 0.47462307114102936
R2 score ADA 0.09893176136262949

In term of MAE , Gradient Boosting and random forests are the best 
with an equally lowest value , so the predictions are closer to the real values.

"""