# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.
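One way to keep the comparison fair is to wrap the shared preprocessing and each candidate model in the same pipeline, so that only the model changes. The sketch below is a minimal illustration under assumptions: it uses synthetic data from `make_regression` rather than the Spaceship Titanic features, and the `candidates` dictionary and its settings are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); the lab itself uses the Spaceship Titanic features.
X, y = make_regression(n_samples=300, n_features=6, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical candidates: every model gets the same split and the same scaler.
candidates = {
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(n_estimators=50, random_state=0),
}

scores = {}
for name, model in candidates.items():
    pipeline = make_pipeline(StandardScaler(), model)
    pipeline.fit(X_train, y_train)  # the scaler is fitted on the training split only
    scores[name] = pipeline.score(X_test, y_test)  # R2 on the held-out split

print(scores)
```

Because the scaler lives inside the pipeline, it is refitted on the training split for every candidate and never sees the test rows.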

In [138]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

In [140]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [142]:
spaceship.dropna(axis=0, inplace=True)
spaceship.isna().sum()

PassengerId     0
HomePlanet      0
CryoSleep       0
Cabin           0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
Name            0
Transported     0
dtype: int64

In [144]:
spaceship['New_cabin'] = spaceship['Cabin'].str[0]
valid_letters = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'T']
spaceship['New_cabin'] = spaceship['New_cabin'].apply(lambda x: x if x in valid_letters else None)
spaceship['New_cabin']

0       B
1       F
2       A
3       A
4       F
       ..
8688    A
8689    G
8690    G
8691    E
8692    E
Name: New_cabin, Length: 6606, dtype: object

In [146]:
new_spaceship=spaceship.drop(columns=['PassengerId','Name','Cabin'])
new_spaceship

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,New_cabin
0,Europa,False,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False,B
1,Earth,False,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True,F
2,Europa,False,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False,A
3,Europa,False,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False,A
4,Earth,False,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True,F
...,...,...,...,...,...,...,...,...,...,...,...,...
8688,Europa,False,55 Cancri e,41.0,True,0.0,6819.0,0.0,1643.0,74.0,False,A
8689,Earth,True,PSO J318.5-22,18.0,False,0.0,0.0,0.0,0.0,0.0,False,G
8690,Earth,False,TRAPPIST-1e,26.0,False,0.0,0.0,1872.0,1.0,0.0,True,G
8691,Europa,False,55 Cancri e,32.0,False,0.0,1049.0,0.0,353.0,3235.0,False,E


In [148]:
categorical_cols = new_spaceship.select_dtypes(include=['object']).columns
categorical_cols

Index(['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'New_cabin'], dtype='object')

In [150]:
df_dummies = pd.get_dummies(new_spaceship, columns=categorical_cols,dtype=int)
df_dummies

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,...,VIP_False,VIP_True,New_cabin_A,New_cabin_B,New_cabin_C,New_cabin_D,New_cabin_E,New_cabin_F,New_cabin_G,New_cabin_T
0,39.0,0.0,0.0,0.0,0.0,0.0,False,0,1,0,...,1,0,0,1,0,0,0,0,0,0
1,24.0,109.0,9.0,25.0,549.0,44.0,True,1,0,0,...,1,0,0,0,0,0,0,1,0,0
2,58.0,43.0,3576.0,0.0,6715.0,49.0,False,0,1,0,...,0,1,1,0,0,0,0,0,0,0
3,33.0,0.0,1283.0,371.0,3329.0,193.0,False,0,1,0,...,1,0,1,0,0,0,0,0,0,0
4,16.0,303.0,70.0,151.0,565.0,2.0,True,1,0,0,...,1,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,41.0,0.0,6819.0,0.0,1643.0,74.0,False,0,1,0,...,0,1,1,0,0,0,0,0,0,0
8689,18.0,0.0,0.0,0.0,0.0,0.0,False,1,0,0,...,1,0,0,0,0,0,0,0,1,0
8690,26.0,0.0,0.0,1872.0,1.0,0.0,True,1,0,0,...,1,0,0,0,0,0,0,0,1,0
8691,32.0,0.0,1049.0,0.0,353.0,3235.0,False,0,1,0,...,1,0,0,0,0,0,1,0,0,0


In [152]:
features=df_dummies.drop(columns=['Transported'])

In [154]:
target=new_spaceship['Transported']
target

0       False
1        True
2       False
3       False
4        True
        ...  
8688    False
8689    False
8690     True
8691    False
8692     True
Name: Transported, Length: 6606, dtype: bool

**Perform Train Test Split**

In [156]:
from sklearn.model_selection import train_test_split

# Split the data into train (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=0)

In [158]:
scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [72]:
#normalizer = MinMaxScaler()

#normalizer.fit(X_train)

In [74]:
#X_train_norm = normalizer.transform(X_train)

#X_test_norm = normalizer.transform(X_test)

In [160]:
X_train_scaled = pd.DataFrame(X_train_scaled, columns = X_train.columns)
X_train_scaled.head()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_False,...,VIP_False,VIP_True,New_cabin_A,New_cabin_B,New_cabin_C,New_cabin_D,New_cabin_E,New_cabin_F,New_cabin_G,New_cabin_T
0,0.220515,-0.347046,-0.282099,-0.305892,-0.271543,-0.271123,-1.081675,-0.583761,1.959293,-1.353973,...,0.163149,-0.163149,-0.183429,-0.322931,-0.31319,-0.248362,2.954847,-0.69021,-0.6543,-0.019459
1,-1.704525,-0.347046,-0.282099,-0.305892,-0.271543,-0.271123,0.924492,-0.583761,-0.510388,-1.353973,...,0.163149,-0.163149,-0.183429,-0.322931,-0.31319,-0.248362,-0.338427,-0.69021,1.528352,-0.019459
2,0.083012,-0.347046,-0.143345,-0.305892,0.71807,-0.271123,0.924492,-0.583761,-0.510388,0.738567,...,0.163149,-0.163149,-0.183429,-0.322931,-0.31319,-0.248362,-0.338427,1.448834,-0.6543,-0.019459
3,-0.810757,-0.326846,-0.282099,0.6542,0.044548,-0.270259,-1.081675,-0.583761,1.959293,0.738567,...,0.163149,-0.163149,-0.183429,-0.322931,-0.31319,-0.248362,2.954847,-0.69021,-0.6543,-0.019459
4,-0.191994,-0.347046,-0.282099,-0.305892,-0.271543,-0.271123,0.924492,-0.583761,-0.510388,-1.353973,...,0.163149,-0.163149,-0.183429,-0.322931,-0.31319,-0.248362,-0.338427,-0.69021,1.528352,-0.019459


In [162]:
X_test_scaled = pd.DataFrame(X_test_scaled, columns = X_test.columns)
X_test_scaled.head()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_False,...,VIP_False,VIP_True,New_cabin_A,New_cabin_B,New_cabin_C,New_cabin_D,New_cabin_E,New_cabin_F,New_cabin_G,New_cabin_T
0,1.45804,-0.347046,-0.282099,-0.305892,-0.271543,-0.271123,-1.081675,-0.583761,1.959293,-1.353973,...,0.163149,-0.163149,-0.183429,-0.322931,-0.31319,-0.248362,-0.338427,1.448834,-0.6543,-0.019459
1,-0.742005,-0.347046,-0.282099,-0.305892,-0.271543,-0.271123,0.924492,-0.583761,-0.510388,-1.353973,...,0.163149,-0.163149,-0.183429,-0.322931,-0.31319,-0.248362,-0.338427,-0.69021,1.528352,-0.019459
2,-0.94826,-0.347046,-0.282099,-0.305892,-0.271543,-0.271123,0.924492,-0.583761,-0.510388,-1.353973,...,0.163149,-0.163149,-0.183429,-0.322931,-0.31319,-0.248362,-0.338427,-0.69021,1.528352,-0.019459
3,1.595543,-0.347046,-0.282099,-0.305892,-0.271543,-0.271123,0.924492,-0.583761,-0.510388,-1.353973,...,0.163149,-0.163149,-0.183429,-0.322931,-0.31319,-0.248362,-0.338427,-0.69021,1.528352,-0.019459
4,2.283058,-0.347046,0.678012,-0.305892,1.228811,-0.271123,-1.081675,1.713031,-0.510388,0.738567,...,-6.129384,6.129384,-0.183429,3.09664,-0.31319,-0.248362,-0.338427,-0.69021,-0.6543,-0.019459


**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [213]:
bagging_reg = BaggingRegressor(DecisionTreeRegressor(max_depth=20),
                               n_estimators=100,
                               max_samples=1000,
                               bootstrap=True)

In [215]:
X_train_scaled.shape[0]/100

52.84

In [217]:
bagging_reg.fit(X_train_scaled, y_train)

In [219]:
pred = bagging_reg.predict(X_test_scaled)

print("MAE", mean_absolute_error(y_test, pred))
# `squared=False` is deprecated in recent scikit-learn releases; take the square root of the MSE instead
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 score", bagging_reg.score(X_test_scaled, y_test))

MAE 0.2793487854911705
RMSE 0.37875709260300444
R2 score 0.42617225921167645
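The cell above uses `bootstrap=True`, which is bagging (rows sampled with replacement). Pasting is the same ensemble with `bootstrap=False` (rows sampled without replacement). The sketch below contrasts the two on synthetic data; the dataset, the helper `fit_ensemble`, and the hyperparameters are assumptions for illustration, not the lab's settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption), not the Spaceship Titanic set.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)


def fit_ensemble(bootstrap):
    """Hypothetical helper: fit the same ensemble with or without replacement."""
    model = BaggingRegressor(
        DecisionTreeRegressor(max_depth=5),
        n_estimators=25,
        max_samples=200,       # each tree is trained on 200 drawn rows
        bootstrap=bootstrap,   # True = bagging, False = pasting
        random_state=0,
    )
    return model.fit(X_tr, y_tr)


bagging = fit_ensemble(bootstrap=True)
pasting = fit_ensemble(bootstrap=False)

# Count the distinct training rows drawn for the first tree of each ensemble.
bag_unique = len(np.unique(bagging.estimators_samples_[0]))
paste_unique = len(np.unique(pasting.estimators_samples_[0]))

print("bagging: distinct rows in first tree =", bag_unique)
print("pasting: distinct rows in first tree =", paste_unique)
print("bagging R2:", bagging.score(X_te, y_te))
print("pasting R2:", pasting.score(X_te, y_te))
```

With `bootstrap=True` the first tree usually sees repeated rows, so its count of distinct rows falls below `max_samples`. With `bootstrap=False` every drawn row is distinct.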




- Random Forests

In [173]:
forest = RandomForestRegressor(n_estimators=100,
                               max_depth=20)

In [175]:
forest.fit(X_train_scaled, y_train)

In [177]:
pred = forest.predict(X_test_scaled)

print("MAE", mean_absolute_error(y_test, pred))
# `squared=False` is deprecated in recent scikit-learn releases; take the square root of the MSE instead
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 score", forest.score(X_test_scaled, y_test))

MAE 0.2693068901628681
RMSE 0.3840146075812158
R2 score 0.4101311246569793




In [179]:
df_compare=pd.DataFrame(y_test.values, columns=["y_true"])
df_compare["pred"]=pred
df_compare

Unnamed: 0,y_true,pred
0,True,1.000000
1,False,0.803760
2,True,0.877560
3,False,0.862167
4,True,0.400000
...,...,...
1317,False,0.661774
1318,True,0.880000
1319,True,1.000000
1320,True,0.727128
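The regressor returns continuous scores, while `y_true` is Boolean. One way to read such scores as class predictions is to threshold them and compute accuracy. The sketch below does this for the first five rows of the table above; the 0.5 threshold is an assumption, not something the lab prescribes.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# First five rows of the comparison table shown above.
y_true = np.array([True, False, True, False, True])
pred = np.array([1.000000, 0.803760, 0.877560, 0.862167, 0.400000])

# Assumed decision rule: a score of at least 0.5 counts as "transported".
threshold = 0.5
pred_label = pred >= threshold

accuracy = accuracy_score(y_true, pred_label)
print(pred_label.tolist())  # [True, True, True, True, False]
print(accuracy)             # 0.4
```

On the full test set, the same two lines would apply to `y_test` and the model's `pred` array.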


- Gradient Boosting

In [181]:
gb_reg = GradientBoostingRegressor(n_estimators=100)     

In [183]:
gb_reg.fit(X_train_scaled, y_train)

In [185]:
pred = gb_reg.predict(X_test_scaled)

print("MAE", mean_absolute_error(y_test, pred))
# `squared=False` is deprecated in recent scikit-learn releases; take the square root of the MSE instead
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 score", gb_reg.score(X_test_scaled, y_test))

MAE 0.28800235366689086
RMSE 0.3766654386243346
R2 score 0.43249258938375046




In [187]:
df_compare=pd.DataFrame(y_test.values, columns=["y_true"])
df_compare["pred"]=pred
df_compare

Unnamed: 0,y_true,pred
0,True,0.941826
1,False,0.700116
2,True,0.755980
3,False,0.664291
4,True,0.212686
...,...,...
1317,False,0.519390
1318,True,0.753303
1319,True,1.024660
1320,True,0.700116


- Adaptive Boosting

In [189]:
ada_reg = AdaBoostRegressor(DecisionTreeRegressor(max_depth=20),
                            n_estimators=100)

In [191]:
ada_reg.fit(X_train_scaled, y_train)

In [192]:
pred = ada_reg.predict(X_test_scaled)

print("MAE", mean_absolute_error(y_test, pred))
# `squared=False` is deprecated in recent scikit-learn releases; take the square root of the MSE instead
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 score", ada_reg.score(X_test_scaled, y_test))

MAE 0.24545032518934629
RMSE 0.42549383838504407
R2 score 0.275819973985448




In [195]:
df_compare=pd.DataFrame(y_test.values, columns=["y_true"])
df_compare["pred"]=pred
df_compare

Unnamed: 0,y_true,pred
0,True,1.000000
1,False,0.516545
2,True,0.514706
3,False,0.540323
4,True,1.000000
...,...,...
1317,False,0.497811
1318,True,1.000000
1319,True,1.000000
1320,True,0.500000


Which model is the best and why?

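Judged by RMSE and R2, Gradient Boosting is the best model in this notebook: it has the lowest RMSE (0.377) and the highest R2 score (0.432), with Bagging a close second (RMSE 0.379, R2 0.426). AdaBoost has the lowest MAE (0.245) but also the highest RMSE (0.425) and the lowest R2 (0.276), which suggests its errors are less frequent but larger. Even so, none of these is a good model (the best R2 is only about 0.43), so we should try other options.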
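To check a comparison like this, it can help to put the metrics printed in the cells above side by side. The sketch below copies those values (rounded to four decimals) into a DataFrame; the variable names are hypothetical.

```python
import pandas as pd

# Metrics copied from the printed outputs above, rounded to four decimals.
results = pd.DataFrame(
    {
        "MAE": [0.2793, 0.2693, 0.2880, 0.2455],
        "RMSE": [0.3788, 0.3840, 0.3767, 0.4255],
        "R2": [0.4262, 0.4101, 0.4325, 0.2758],
    },
    index=["Bagging", "Random Forest", "Gradient Boosting", "AdaBoost"],
)

# Lower is better for MAE and RMSE; higher is better for R2.
best = {
    "MAE": results["MAE"].idxmin(),
    "RMSE": results["RMSE"].idxmin(),
    "R2": results["R2"].idxmax(),
}

print(results.sort_values("RMSE"))
print(best)
```

Sorting by a single metric makes it easy to see where the rankings disagree between MAE and RMSE.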