# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [51]:
# Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
# New in this lab: ensemble estimators
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import r2_score, mean_absolute_error, root_mean_squared_error

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [52]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


In [53]:
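# Clean the data: drop rows with missing values and reset the index
spaceship.dropna(inplace=True)
spaceship.reset_index(drop=True, inplace=True)

# Keep only the deck letter from the Cabin value (e.g. "B/0/P" -> "B")
spaceship["Cabin"] = spaceship["Cabin"].str.split("/").str[0]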

In [54]:
# Drop columns that are not needed
spaceship.drop(["PassengerId", "Name"], axis=1, inplace=True)
spaceship.head(3)

| | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Europa | False | B | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | False |
| 1 | Earth | False | F | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | True |
| 2 | Europa | False | A | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | False |


In [55]:
# One-hot encode the categorical columns (dummy variables)
categorical_cols = spaceship.select_dtypes(include="object").columns
spaceship = pd.get_dummies(spaceship,
                        columns=categorical_cols,
                        drop_first=True,
                        dtype=int)

In [56]:
display(spaceship.dtypes)
display(spaceship.head(3))

Age                          float64
RoomService                  float64
FoodCourt                    float64
ShoppingMall                 float64
Spa                          float64
VRDeck                       float64
Transported                     bool
HomePlanet_Europa              int64
HomePlanet_Mars                int64
CryoSleep_True                 int64
Cabin_B                        int64
Cabin_C                        int64
Cabin_D                        int64
Cabin_E                        int64
Cabin_F                        int64
Cabin_G                        int64
Cabin_T                        int64
Destination_PSO J318.5-22      int64
Destination_TRAPPIST-1e        int64
VIP_True                       int64
dtype: object

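| | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported | HomePlanet_Europa | HomePlanet_Mars | CryoSleep_True | Cabin_B | Cabin_C | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Destination_PSO J318.5-22 | Destination_TRAPPIST-1e | VIP_True |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | False | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 24.0 | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | True | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 58.0 | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | False | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |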


Now perform the same steps as before:
- Feature Scaling
- Feature Selection
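Feature selection is listed above, but the cells that follow go straight to the split and scaling. As a minimal sketch of what a selection step could look like, the block below uses synthetic data from `make_classification` as a stand-in for the Spaceship Titanic features (an assumption so that it runs offline). `SelectKBest` with `f_classif` is only one option and may not be what the previous Lab used; `k=5` is an arbitrary illustrative value.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features (assumption, not the lab data).
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit the scaler on the training split only, then reuse it on the test split.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Keep the k features with the highest ANOVA F-score against the target.
selector = SelectKBest(score_func=f_classif, k=5)
X_train_sel = selector.fit_transform(X_train_scaled, y_train)
X_test_sel = selector.transform(X_test_scaled)

print(X_train_sel.shape, X_test_sel.shape)
print("Selected column indices:", selector.get_support(indices=True))
```

Like the scaler, the selector is fitted on the training split only, so no information from the test split leaks into the choice of features.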


In [57]:
features = spaceship.drop("Transported", axis=1)
target = spaceship["Transported"]

In [58]:
display(features.head(3))
display(target.head(3))

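| | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | HomePlanet_Europa | HomePlanet_Mars | CryoSleep_True | Cabin_B | Cabin_C | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Destination_PSO J318.5-22 | Destination_TRAPPIST-1e | VIP_True |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 24.0 | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 58.0 | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |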


0    False
1     True
2    False
Name: Transported, dtype: bool

**Perform Train Test Split**

In [59]:
X_train, X_test, y_train, y_test = train_test_split(
                                                     features, 
                                                     target, 
                                                     test_size=0.2, 
                                                     random_state=0
)

In [60]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)



In [61]:
# Train the baseline KNN classifier

knn = KNeighborsClassifier(n_neighbors=5)

knn.fit(X_train_scaled, y_train)

In [62]:
knn.score(X_test_scaled, y_test)

0.7586989409984871
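`knn.score` reports accuracy only. The notebook also imports `confusion_matrix` and `classification_report`, which show where the errors fall. The following is a minimal sketch on synthetic data (an assumption; swap in the lab's `X_train_scaled`, `X_test_scaled`, `y_train`, `y_test` to inspect the real baseline).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem standing in for the lab data (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
```

The diagonal of the confusion matrix counts correct predictions, so its sum divided by the total equals the accuracy returned by `knn.score`.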

**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [82]:
# Transported is boolean, so the regressor below predicts a score between 0 and 1;
# the cells that follow evaluate it with regression metrics rather than accuracy.
baggin_class = BaggingRegressor(DecisionTreeRegressor(max_depth=15),
                                 n_estimators=100,
                                 max_samples=2000,
                                 bootstrap=True)  # sample with replacement (bagging)




In [83]:
baggin_class.fit(X_train_scaled, y_train)

In [84]:
pred = baggin_class.predict(X_test_scaled)

In [85]:
print("R2: ", r2_score(y_test, pred))
print("RMSE: ", root_mean_squared_error(y_test, pred))
print("MAE: ", mean_absolute_error(y_test, pred))

R2:  0.42889129927954384
RMSE:  0.3778586709076742
MAE:  0.27245242083990995
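The regression metrics above cannot be compared directly with the KNN accuracy. Two ways to put a bagging ensemble on the same scale are sketched below: thresholding the regressor's score at 0.5, or using `BaggingClassifier` directly. The data is synthetic, and the smaller `max_depth`, `n_estimators` and `max_samples` values are assumptions chosen to keep the example fast; they are not the lab's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Synthetic 0/1 target standing in for Transported (assumption).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Option 1: keep the regressor and turn its score into a class label.
reg = BaggingRegressor(DecisionTreeRegressor(max_depth=5),
                       n_estimators=50, max_samples=200,
                       bootstrap=True, random_state=0)
reg.fit(X_train, y_train)
scores = reg.predict(X_test)                   # averaged tree outputs in [0, 1]
labels_from_scores = (scores >= 0.5).astype(int)
acc_thresholded = accuracy_score(y_test, labels_from_scores)

# Option 2: use the classifier version of the same ensemble.
clf = BaggingClassifier(DecisionTreeClassifier(max_depth=5),
                        n_estimators=50, max_samples=200,
                        bootstrap=True, random_state=0)
clf.fit(X_train, y_train)
acc_classifier = clf.score(X_test, y_test)

print("Accuracy (regressor, thresholded):", acc_thresholded)
print("Accuracy (classifier):", acc_classifier)
```

The 0.5 threshold is a convention, not a tuned value; on the real data either accuracy can be set next to the KNN score for a like-for-like comparison.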


- Pasting

In [87]:
# Same ensemble as above, but bootstrap=False draws each subset without replacement (pasting)
baggin_class = BaggingRegressor(DecisionTreeRegressor(max_depth=15),
                                 n_estimators=100,
                                 max_samples=2000,
                                 bootstrap=False)

In [88]:
baggin_class.fit(X_train_scaled, y_train)

In [89]:
pred = baggin_class.predict(X_test_scaled)

print("R2: ", r2_score(y_test, pred))
print("RMSE: ", root_mean_squared_error(y_test, pred))
print("MAE: ", mean_absolute_error(y_test, pred))

R2:  0.42437064034310745
RMSE:  0.37935120919040594
MAE:  0.2728732511884679


- Random Forests

In [90]:
# Random forest of 100 regression trees, each limited to depth 15
forest = RandomForestRegressor(n_estimators=100,
                               max_depth=15)

In [91]:
forest.fit(X_train_scaled, y_train)

In [92]:
pred = forest.predict(X_test_scaled)

In [93]:
print("R2: ", r2_score(y_test, pred))
print("RMSE: ", root_mean_squared_error(y_test, pred))
print("MAE: ", mean_absolute_error(y_test, pred))

R2:  0.41784920905087863
RMSE:  0.38149403368503726
MAE:  0.2680519226637408
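A random forest also exposes impurity-based feature importances, which can help explain what the model relies on. Below is a minimal sketch with `RandomForestClassifier` on synthetic data; the feature names `f0` to `f9` are hypothetical placeholders for the lab's column names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab data; column names are hypothetical.
X, y = make_classification(n_samples=600, n_features=10,
                           n_informative=4, random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

forest_clf = RandomForestClassifier(n_estimators=100, max_depth=15,
                                    random_state=0)
forest_clf.fit(X_train, y_train)

# Impurity-based importances, normalised to sum to 1.
importances = pd.Series(forest_clf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(5))
print("Test accuracy:", forest_clf.score(X_test, y_test))
```

Impurity-based importances tend to favour features with many distinct values, so they are best read as a rough ranking rather than an exact measure.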


- Gradient Boosting

In [94]:
# Gradient boosting with 100 sequential regression trees of depth 15
gb_class = GradientBoostingRegressor(max_depth=15,
                                     n_estimators=100)

In [95]:
gb_class.fit(X_train_scaled, y_train)

In [96]:
pred = gb_class.predict(X_test_scaled)

In [97]:
print("R2: ", r2_score(y_test, pred))
print("RMSE: ", root_mean_squared_error(y_test, pred))
print("MAE: ", mean_absolute_error(y_test, pred))

R2:  0.34760468415619217
RMSE:  0.40385496030252244
MAE:  0.2631564219395186
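Gradient boosting is usually run with shallow trees (scikit-learn's default is `max_depth=3`), so depth 15 may be part of the reason the score above trails bagging. The sketch below, on synthetic data (an assumption), uses `staged_predict` to track test accuracy after each boosting stage, which is one way to see how many trees are actually useful.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab data (assumption).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

gb = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)
gb.fit(X_train, y_train)

# staged_predict yields the ensemble's prediction after 1, 2, ..., 100 trees.
staged_acc = [accuracy_score(y_test, stage_pred)
              for stage_pred in gb.staged_predict(X_test)]
best_n = int(np.argmax(staged_acc)) + 1

print("Accuracy with all trees:", staged_acc[-1])
print("Best stage on this split:", best_n, "accuracy:", max(staged_acc))
```

Picking `best_n` on the test split is only for illustration; in practice a separate validation split would be used for that choice.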


- Adaptive Boosting

In [98]:
# AdaBoost over regression trees of depth 15
ada_class = AdaBoostRegressor(DecisionTreeRegressor(max_depth=15),
                              n_estimators=100)

In [99]:
ada_class.fit(X_train_scaled, y_train)

In [100]:
pred = ada_class.predict(X_test_scaled)

In [101]:
print("R2: ", r2_score(y_test, pred))
print("RMSE: ", root_mean_squared_error(y_test, pred))
print("MAE: ", mean_absolute_error(y_test, pred))

R2:  0.3219961278432252
RMSE:  0.4117049526532244
MAE:  0.27910094868642443
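AdaBoost is typically paired with very weak learners such as depth-1 trees ("stumps"), since each round reweights the samples the previous learners got wrong. The sketch below shows that setup with `AdaBoostClassifier` on synthetic data; the data and hyperparameters are assumptions for illustration, not the lab's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the lab data (assumption).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Depth-1 trees as the weak learner; boosting combines up to 100 of them.
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         n_estimators=100, random_state=0)
ada.fit(X_train, y_train)
ada_acc = ada.score(X_test, y_test)

print("Weak learners fitted:", len(ada.estimators_))
print("Test accuracy:", ada_acc)
```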


Which model is the best and why?

In [None]:
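# On this split, bagging scores best: highest R2 (0.429) and lowest RMSE (0.378).
# Pasting (R2 0.424) and the random forest (R2 0.418) are close behind.
# Gradient boosting has the lowest MAE (0.263) but a lower R2 (0.348),
# and AdaBoost has the lowest R2 (0.322).
# A likely reason: depth-15 trees are high-variance learners, which averaging
# (bagging, pasting, random forest) smooths out, while boosting such deep trees
# tends to overfit.
# Caveat: these are regression metrics on a boolean target, so they are not
# directly comparable with the KNN accuracy of 0.759.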
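To answer the question on the same scale as the KNN baseline, one option is to fit the classifier version of every ensemble and compare test accuracy in a single loop. The sketch below does this on synthetic data; the dataset, the model names in the dictionary and all hyperparameters are assumptions for illustration, so the ranking it prints says nothing about the Spaceship Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the lab data (assumption).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

tree = DecisionTreeClassifier(max_depth=5)
models = {
    # KNN is distance-based, so it is the only model that needs scaling here.
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Bagging": BaggingClassifier(tree, n_estimators=50, max_samples=0.5,
                                 bootstrap=True, random_state=0),
    "Pasting": BaggingClassifier(tree, n_estimators=50, max_samples=0.5,
                                 bootstrap=False, random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=50, max_depth=5,
                                           random_state=0),
    "GradientBoosting": GradientBoostingClassifier(n_estimators=50,
                                                   random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
}

# Fit each model on the same split and record its test accuracy.
results = {name: model.fit(X_train, y_train).score(X_test, y_test)
           for name, model in models.items()}

for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>16}: {acc:.3f}")
```

A single train/test split can be noisy; `cross_val_score` on the training data would give a more stable ranking at the cost of extra fitting time.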