# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

#classifiers rather than regressors, since the target (Transported) is boolean
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import accuracy_score

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [3]:
#dropping rows with null values
spaceship.dropna(inplace=True)
#keeping only the first character (the deck) of the values in column Cabin
spaceship['Cabin'] = spaceship['Cabin'].str[0]
#dropping PassengerId & Name
spaceship.drop(['PassengerId', 'Name'], axis=1, inplace=True)
#casting CryoSleep to boolean so it does not need get_dummies
spaceship['CryoSleep'] = spaceship['CryoSleep'].astype(bool)
#same approach for VIP
spaceship['VIP'] = spaceship['VIP'].astype(bool)
#one-hot encoding the 3 remaining categorical columns with get_dummies
spaceship_dd = pd.get_dummies(spaceship, columns=['HomePlanet', 'Cabin', 'Destination'], drop_first=True)
#displaying the final df
spaceship_dd.head(30)

Unnamed: 0,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,HomePlanet_Europa,HomePlanet_Mars,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_PSO J318.5-22,Destination_TRAPPIST-1e
0,False,39.0,False,0.0,0.0,0.0,0.0,0.0,False,True,False,True,False,False,False,False,False,False,False,True
1,False,24.0,False,109.0,9.0,25.0,549.0,44.0,True,False,False,False,False,False,False,True,False,False,False,True
2,False,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False,True,False,False,False,False,False,False,False,False,False,True
3,False,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False,True,False,False,False,False,False,False,False,False,False,True
4,False,16.0,False,303.0,70.0,151.0,565.0,2.0,True,False,False,False,False,False,False,True,False,False,False,True
5,False,44.0,False,0.0,483.0,0.0,291.0,0.0,True,False,False,False,False,False,False,True,False,False,True,False
6,False,26.0,False,42.0,1539.0,3.0,0.0,0.0,True,False,False,False,False,False,False,True,False,False,False,True
8,False,35.0,False,0.0,785.0,17.0,216.0,0.0,True,False,False,False,False,False,False,True,False,False,False,True
9,True,14.0,False,0.0,0.0,0.0,0.0,0.0,True,True,False,True,False,False,False,False,False,False,False,False
11,False,45.0,False,39.0,7295.0,589.0,110.0,124.0,True,True,False,True,False,False,False,False,False,False,False,False


**Perform Train Test Split**

In [4]:
#creating the features & target dfs
X = spaceship_dd.drop('Transported', axis=1)
y = spaceship_dd['Transported']
#perform train / test split (train_test_split was imported in the first cell)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

In [5]:
X_train

Unnamed: 0,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,HomePlanet_Europa,HomePlanet_Mars,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_PSO J318.5-22,Destination_TRAPPIST-1e
3432,True,32.0,False,0.0,0.0,0.0,0.0,0.0,False,True,False,False,False,True,False,False,False,False,True
7312,True,4.0,False,0.0,0.0,0.0,0.0,0.0,False,False,False,False,False,False,False,True,False,False,True
2042,False,30.0,False,0.0,236.0,0.0,1149.0,0.0,False,False,False,False,False,False,True,False,False,False,True
4999,False,17.0,False,13.0,0.0,565.0,367.0,1.0,False,True,False,False,False,True,False,False,False,False,True
5755,True,26.0,False,0.0,0.0,0.0,0.0,0.0,False,False,False,False,False,False,False,True,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6518,False,53.0,False,0.0,0.0,0.0,0.0,0.0,False,True,False,False,False,True,False,False,False,False,True
4317,False,36.0,False,0.0,0.0,0.0,725.0,2.0,False,False,False,False,False,False,True,False,False,False,False
2214,False,36.0,False,0.0,4756.0,0.0,7818.0,96.0,True,False,False,False,True,False,False,False,False,False,True
3468,False,34.0,True,0.0,4.0,0.0,685.0,1779.0,True,False,False,False,True,False,False,False,False,False,True


In [11]:
#feature scaling
normalizer = MinMaxScaler()

normalizer.fit(X_train)

In [12]:
X_train_norm = normalizer.transform(X_train)

X_test_norm = normalizer.transform(X_test)

In [13]:
X_train_norm = pd.DataFrame(X_train_norm, columns = X_train.columns)
X_train_norm.head()

Unnamed: 0,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,HomePlanet_Europa,HomePlanet_Mars,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_PSO J318.5-22,Destination_TRAPPIST-1e
0,1.0,0.405063,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
1,1.0,0.050633,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
2,0.0,0.379747,0.0,0.0,0.007916,0.0,0.051276,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
3,0.0,0.21519,0.0,0.00131,0.0,0.046111,0.016378,4.9e-05,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
4,1.0,0.329114,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0


In [14]:
X_test_norm = pd.DataFrame(X_test_norm, columns = X_test.columns)
X_test_norm.head()

Unnamed: 0,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,HomePlanet_Europa,HomePlanet_Mars,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_PSO J318.5-22,Destination_TRAPPIST-1e
0,1.0,0.632911,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,1.0,0.227848,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
2,1.0,0.189873,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,1.0,0.658228,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
4,0.0,0.78481,1.0,0.0,0.054775,0.0,0.07774,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [17]:
#let's first apply KNN as a baseline for comparison (fit on the features without normalization)
#since the target value is boolean (True / False) we should apply the KNN classifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

correct_predictions = (y_test == y_pred).sum()
print(f"Correct Predictions: {correct_predictions} out of {len(y_test)}")

Accuracy: 78.44%
Correct Predictions: 1037 out of 1322
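
The KNN baseline above is fit on the unscaled features. Because KNN relies on distances, columns with large ranges (such as the spending columns) can dominate the neighbour search, so it may be worth checking the scaled features as well. The following is only a minimal sketch on synthetic stand-in data, not the Spaceship Titanic data; the names `X_demo`, `knn_raw` and `knn_scaled` are hypothetical and not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption: this is NOT the Spaceship Titanic dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
# Inflate one column so it dominates raw Euclidean distances,
# similar in spirit to large spending columns next to 0/1 dummy columns
X_demo[:, 0] *= 1000

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

# KNN on the raw features
knn_raw = KNeighborsClassifier(n_neighbors=10).fit(Xd_train, yd_train)

# KNN on min-max scaled features; the pipeline fits the scaler on the training data only
knn_scaled = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=10))
knn_scaled.fit(Xd_train, yd_train)

acc_raw = knn_raw.score(Xd_test, yd_test)
acc_scaled = knn_scaled.score(Xd_test, yd_test)
print(f"Accuracy (raw features):    {acc_raw * 100:.2f}%")
print(f"Accuracy (scaled features): {acc_scaled * 100:.2f}%")
```

In this notebook, the equivalent check would be to fit `knn` on `X_train_norm` and predict on `X_test_norm`, so that the baseline and the ensembles see the same inputs.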


**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [18]:
#Bagging trains multiple instances of the same base model on different random subsets of the training data
#(sampled with replacement; pasting samples without replacement)
#The final prediction is obtained by averaging or voting over the predictions of these models
#we will use the bagging classifier since the target is boolean

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bagging_class = BaggingClassifier(DecisionTreeClassifier(max_depth=20),
                               n_estimators=100,
                               max_samples = 1000)

In [19]:
bagging_class.fit(X_train_norm, y_train)

In [21]:
from sklearn.metrics import accuracy_score

pred = bagging_class.predict(X_test_norm)

accuracy = accuracy_score(y_test, pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

correct_predictions = (y_test == pred).sum()
print(f"Correct Predictions: {correct_predictions} out of {len(y_test)}")

Accuracy: 79.20%
Correct Predictions: 1047 out of 1322
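
The cell above covers bagging only. Pasting uses the same estimator but draws each subset without replacement, which in scikit-learn corresponds to `bootstrap=False`. Below is a minimal sketch on synthetic stand-in data (not the Spaceship Titanic data); `pasting_class`, the reduced `n_estimators=50` and `max_samples=200` are illustrative assumptions chosen to fit the small demo set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: this is NOT the Spaceship Titanic dataset)
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

# bootstrap=False switches from bagging (with replacement) to pasting (without replacement)
pasting_class = BaggingClassifier(DecisionTreeClassifier(max_depth=20),
                                  n_estimators=50,
                                  max_samples=200,
                                  bootstrap=False,
                                  random_state=0)
pasting_class.fit(Xd_train, yd_train)

acc_pasting = pasting_class.score(Xd_test, yd_test)
print(f"Accuracy (pasting): {acc_pasting * 100:.2f}%")

# With pasting, no training row appears twice in the subset of any single tree
no_repeats = all(len(np.unique(idx)) == len(idx)
                 for idx in pasting_class.estimators_samples_)
print(f"Subsets free of repeated rows: {no_repeats}")
```

On the real data, the same idea would be `BaggingClassifier(DecisionTreeClassifier(max_depth=20), n_estimators=100, max_samples=1000, bootstrap=False)` fit on `X_train_norm`.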


- Random Forests

In [23]:
#random forest classifier
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=100,
                             max_depth=20)
forest.fit(X_train_norm, y_train)

In [25]:
pred_rf = forest.predict(X_test_norm)

accuracy = accuracy_score(y_test, pred_rf)
print(f"Accuracy: {accuracy * 100:.2f}%")

correct_predictions = (y_test == pred_rf).sum()
print(f"Correct Predictions: {correct_predictions} out of {len(y_test)}")

Accuracy: 78.67%
Correct Predictions: 1040 out of 1322


- Gradient Boosting

In [30]:
#gradient boosting classifier
from sklearn.ensemble import GradientBoostingClassifier

fgb_clf = GradientBoostingClassifier(n_estimators=100, max_depth=20)
fgb_clf.fit(X_train_norm, y_train)

In [31]:
#Evaluate results
pred_gb = fgb_clf.predict(X_test_norm)

accuracy = accuracy_score(y_test, pred_gb)
print(f"Accuracy: {accuracy * 100:.2f}%")

correct_predictions = (y_test == pred_gb).sum()
print(f"Correct Predictions: {correct_predictions} out of {len(y_test)}")

Accuracy: 78.29%
Correct Predictions: 1035 out of 1322
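
Gradient boosting adds trees sequentially and is commonly run with shallow trees (scikit-learn's default is `max_depth=3`), so `max_depth=20` may overfit. One way to check is to compare train and test accuracy for both depths. This is a rough sketch on synthetic stand-in data, not the Spaceship Titanic data; `depth_scores` is a hypothetical name, and the result on the real data may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: this is NOT the Spaceship Titanic dataset)
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

# Fit one model per depth and record (train accuracy, test accuracy)
depth_scores = {}
for depth in (3, 20):
    gb = GradientBoostingClassifier(n_estimators=100, max_depth=depth, random_state=0)
    gb.fit(Xd_train, yd_train)
    depth_scores[depth] = (gb.score(Xd_train, yd_train), gb.score(Xd_test, yd_test))

for depth, (train_acc, test_acc) in depth_scores.items():
    print(f"max_depth={depth}: train {train_acc * 100:.2f}%, test {test_acc * 100:.2f}%")
```

A large gap between train and test accuracy would suggest overfitting at that depth.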


- Adaptive Boosting

In [34]:
#adaptive boosting

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada_clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),n_estimators=100,learning_rate=0.1)
ada_clf.fit(X_train_norm, y_train)



In [35]:
#Evaluate results
pred_ab = ada_clf.predict(X_test_norm)

accuracy = accuracy_score(y_test, pred_ab)
print(f"Accuracy: {accuracy * 100:.2f}%")

correct_predictions = (y_test == pred_ab).sum()
print(f"Correct Predictions: {correct_predictions} out of {len(y_test)}")

Accuracy: 78.14%
Correct Predictions: 1033 out of 1322


Which model is the best and why?

In [None]:
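#the bagging model gives the best test accuracy (79.20%, 1047 of 1322 correct), ahead of random forest (78.67%), KNN (78.44%), gradient boosting (78.29%) and AdaBoost (78.14%)
#the gap is small (at most 14 predictions) and the models were fit without a fixed random_state, so the ranking may change between runs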
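
Since the accuracies above come from a single train/test split and differ by at most 14 predictions out of 1322, cross-validation may give a steadier basis for choosing a model. The sketch below uses synthetic stand-in data, not the Spaceship Titanic data; the `models` and `cv_results` names, the reduced `n_estimators=50`, `max_samples=200` and the fixed `random_state=0` are assumptions made to keep the example small and repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: this is NOT the Spaceship Titanic dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Same model families as in the notebook, with fewer estimators to keep the demo fast
models = {
    "Bagging": BaggingClassifier(DecisionTreeClassifier(max_depth=20),
                                 n_estimators=50, max_samples=200, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, max_depth=20, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, max_depth=20,
                                                    random_state=0),
    "AdaBoost": AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                                   n_estimators=50, learning_rate=0.1, random_state=0),
}

# 5-fold cross-validated accuracy: mean and standard deviation per model
cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean() * 100:.2f}% (+/- {scores.std() * 100:.2f})")
```

If the mean accuracies of two models differ by less than their standard deviations, the data does not clearly favour one over the other. When applying this to the real data, wrapping the scaler and the model in a pipeline keeps the scaling inside each fold.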