# LAB | Ensemble Methods

**Load the data**

In this challenge, we will be working with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this lab, try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous lab.

In [1]:
# Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Base learner and ensemble classifiers (the target, Transported, is boolean)
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

# Classification metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score


In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same steps as before:
- Feature Scaling
- Feature Selection


In [3]:
spaceship = spaceship.dropna()

In [4]:
spaceship.Cabin = spaceship.Cabin.apply(lambda x: x[0])
spaceship = spaceship.drop(['PassengerId', 'Name'], axis=1)
spaceship.head()

Unnamed: 0,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported
0,Europa,False,B,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False
1,Earth,False,F,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True
2,Europa,False,A,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False
3,Europa,False,A,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False
4,Earth,False,F,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True


In [5]:
# Establish 'features' and 'target'.
features = spaceship.drop('Transported', axis=1)
target = spaceship.Transported

**Perform Train Test Split**

In [6]:
# Perform the split
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.20, random_state=0)

In [7]:
# Initiate OneHotEncoder
ohe = OneHotEncoder(sparse_output=False)

# Split training set into categorical and numerical
X_train_cat = X_train.select_dtypes(['object', 'bool'])
X_train_num = X_train.drop(columns=X_train_cat.columns)

# Fit OneHotEncoder with the categorical data and transform it into numerical values
ohe.fit(X_train_cat)
X_train_trans_np = ohe.transform(X_train_cat)

# Create a dataframe using the transformed values and the original index
X_train_trans_df = pd.DataFrame(X_train_trans_np, columns=ohe.get_feature_names_out(), index=X_train.index)

# Concatenate the newly transformed train dataframe with the train numerical dataframe
df_train = pd.concat([X_train_trans_df, X_train_num], axis=1)
df_train.head()

Unnamed: 0,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_False,CryoSleep_True,Cabin_A,Cabin_B,Cabin_C,Cabin_D,Cabin_E,...,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_False,VIP_True,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
3432,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,1.0,1.0,0.0,32.0,0.0,0.0,0.0,0.0,0.0
7312,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,1.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0
2042,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,1.0,0.0,30.0,0.0,236.0,0.0,1149.0,0.0
4999,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,1.0,1.0,0.0,17.0,13.0,0.0,565.0,367.0,1.0
5755,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,1.0,0.0,26.0,0.0,0.0,0.0,0.0,0.0


In [8]:
# Split test set into categorical and numerical
X_test_cat = X_test.select_dtypes(['object', 'bool'])
X_test_num = X_test.drop(columns=X_test_cat.columns)

# Transform the categorical data into numerical values
X_test_trans_np = ohe.transform(X_test_cat)

# Create a dataframe using the transformed values and the original index
X_test_trans_df = pd.DataFrame(X_test_trans_np, columns=ohe.get_feature_names_out(), index=X_test.index)

# Concatenate the newly transformed test dataframe with the test numerical dataframe
df_test = pd.concat([X_test_trans_df, X_test_num], axis=1)
df_test.head()

Unnamed: 0,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_False,CryoSleep_True,Cabin_A,Cabin_B,Cabin_C,Cabin_D,Cabin_E,...,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_False,VIP_True,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
2453,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,50.0,0.0,0.0,0.0,0.0,0.0
1334,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,18.0,0.0,0.0,0.0,0.0,0.0
8272,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,15.0,0.0,0.0,0.0,0.0,0.0
5090,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,52.0,0.0,0.0,0.0,0.0,0.0
4357,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,62.0,0.0,1633.0,0.0,1742.0,0.0


In [9]:
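# Standardize every column (zero mean, unit variance); fit on the training set only
normalizer = StandardScaler()
normalizer.fit(df_train)

df_train_norm = normalizer.transform(df_train)
df_test_norm = normalizer.transform(df_test)

# Wrap the scaled arrays back into dataframes, keeping column names and index
df_train_norm = pd.DataFrame(df_train_norm, columns = df_train.columns, index = df_train.index)
df_test_norm = pd.DataFrame(df_test_norm, columns = df_test.columns, index = df_test.index)

# Note: the models below are fit on the unscaled df_train / df_test.
# Tree-based models split on thresholds, so they do not need standardized inputs.
# Only the last expression in a cell is displayed.
df_test_norm.head()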

Unnamed: 0,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_False,CryoSleep_True,Cabin_A,Cabin_B,Cabin_C,Cabin_D,Cabin_E,...,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_False,VIP_True,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
2453,-1.081675,-0.583761,1.959293,-1.353973,1.353973,-0.183429,-0.322931,-0.31319,-0.248362,-0.338427,...,-0.328263,-1.476894,0.163149,-0.163149,1.45804,-0.347046,-0.282099,-0.305892,-0.271543,-0.271123
1334,0.924492,-0.583761,-0.510388,-1.353973,1.353973,-0.183429,-0.322931,-0.31319,-0.248362,-0.338427,...,3.046335,-1.476894,0.163149,-0.163149,-0.742005,-0.347046,-0.282099,-0.305892,-0.271543,-0.271123
8272,0.924492,-0.583761,-0.510388,-1.353973,1.353973,-0.183429,-0.322931,-0.31319,-0.248362,-0.338427,...,-0.328263,-1.476894,0.163149,-0.163149,-0.94826,-0.347046,-0.282099,-0.305892,-0.271543,-0.271123
5090,0.924492,-0.583761,-0.510388,-1.353973,1.353973,-0.183429,-0.322931,-0.31319,-0.248362,-0.338427,...,3.046335,-1.476894,0.163149,-0.163149,1.595543,-0.347046,-0.282099,-0.305892,-0.271543,-0.271123
4357,-1.081675,1.713031,-0.510388,0.738567,-0.738567,-0.183429,3.09664,-0.31319,-0.248362,-0.338427,...,-0.328263,0.677097,-6.129384,6.129384,2.283058,-0.347046,0.678012,-0.305892,1.228811,-0.271123
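
The manual encode-then-scale steps above can also be bundled into a single scikit-learn `Pipeline`, which helps ensure the training and test data receive identical preprocessing. The following is only a minimal sketch under two assumptions: it uses a tiny made-up dataframe `toy` rather than the real data, and it scales only the numeric columns, whereas the cells above standardize the one-hot columns too.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up frame mimicking a few Spaceship Titanic columns (illustrative only).
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Europa", "Mars", "Earth", "Europa", "Mars"],
    "Age": [24.0, 39.0, 32.0, 16.0, 58.0, 4.0],
    "Spa": [549.0, 0.0, 0.0, 565.0, 6715.0, 0.0],
    "Transported": [True, False, True, True, False, True],
})
X = toy.drop(columns="Transported")
y = toy["Transported"]

categorical = ["HomePlanet"]
numerical = ["Age", "Spa"]

# One-hot encode the categorical columns and standardize the numeric ones.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", StandardScaler(), numerical),
])

# Preprocessing and model are fitted together, so the encoder and scaler
# only ever learn their parameters from the data passed to fit().
model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(n_estimators=10, random_state=0)),
])

model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

With the real data, one would call `model.fit(X_train, y_train)` and `model.predict(X_test)` on the raw split frames instead of `toy`.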


**Model Selection** - now apply different ensemble methods to try to get a better model

- Bagging and Pasting

In [10]:
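# bootstrap=True is the default, so this is bagging (sampling with replacement);
# passing bootstrap=False would give pasting instead
bagging_reg = BaggingClassifier(DecisionTreeClassifier(max_depth=20),
                               n_estimators=100,
                               max_samples = 1000)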

In [11]:
bagging_reg.fit(df_train, y_train)

In [12]:
pred = bagging_reg.predict(df_test)

print("Accuracy:", accuracy_score(y_test, pred))
print("Precision:", precision_score(y_test, pred))
print("Recall:", recall_score(y_test, pred))

Accuracy: 0.7783661119515886
Precision: 0.7822085889570553
Recall: 0.7715582450832073
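
The cell above covers bagging only. For the pasting half of this section, the sketch below trains the same kind of ensemble twice, toggling `bootstrap` between `True` (bagging, sampling with replacement) and `False` (pasting, sampling without replacement). It runs on synthetic data from `make_classification` so that it is self-contained; the dataset, the small hyperparameters, and the `random_state` values are illustrative assumptions, not the lab's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features and target (assumption).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

scores = {}
for name, bootstrap in [("bagging", True), ("pasting", False)]:
    clf = BaggingClassifier(
        DecisionTreeClassifier(max_depth=5),
        n_estimators=25,
        max_samples=100,      # each tree sees 100 training rows
        bootstrap=bootstrap,  # True: with replacement, False: without
        random_state=0,
    )
    clf.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, clf.predict(X_te))

print(scores)
```

In the lab, the same comparison could be run by passing `df_train`, `y_train`, and `df_test` in place of the synthetic arrays.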


- Random Forests

In [13]:
forest = RandomForestClassifier(n_estimators=100,
                             max_depth=20)

In [14]:
forest.fit(df_train, y_train)

In [15]:
pred = forest.predict(df_test)

print("Accuracy:", accuracy_score(y_test, pred))
print("Precision:", precision_score(y_test, pred))
print("Recall:", recall_score(y_test, pred))

Accuracy: 0.7844175491679274
Precision: 0.79375
Recall: 0.7685325264750378


- Gradient Boosting

In [16]:
gb_reg = GradientBoostingClassifier(max_depth=20,
                                   n_estimators=100)

In [17]:
gb_reg.fit(df_train, y_train)

In [18]:
pred = gb_reg.predict(df_test)

print("Accuracy:", accuracy_score(y_test, pred))
print("Precision:", precision_score(y_test, pred))
print("Recall:", recall_score(y_test, pred))

Accuracy: 0.7889561270801816
Precision: 0.7808823529411765
Recall: 0.8033282904689864
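
Gradient boosting builds its ensemble sequentially, fitting each new tree to the errors left by the trees before it. The toy sketch below shows that idea in plain Python. It is a deliberately simplified stand-in: it uses squared-error regression on made-up 1-D data with decision stumps, not the log-loss objective that `GradientBoostingClassifier` optimizes, and the helper names `fit_stump` and `predict_stump` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the split x <= t that best fits the residuals with two constants."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest x so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        sse = (sum((r - left_value) ** 2 for r in left)
               + sum((r - right_value) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_value, right_value)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def mse(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


# Made-up data with three rough plateaus (illustrative only).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]

learning_rate = 0.5
prediction = [sum(ys) / len(ys)] * len(xs)  # start from the mean of the targets
mse_history = [mse(ys, prediction)]

for _ in range(10):
    # Each round fits a stump to the current errors and adds a damped correction.
    residuals = [y - p for y, p in zip(ys, prediction)]
    stump = fit_stump(xs, residuals)
    prediction = [p + learning_rate * predict_stump(stump, x)
                  for x, p in zip(xs, prediction)]
    mse_history.append(mse(ys, prediction))

print([round(m, 4) for m in mse_history])
```

The training error in `mse_history` never increases from one round to the next, because each stump is fitted to whatever the current ensemble still gets wrong.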


- Adaptive Boosting

In [20]:
ada_reg = AdaBoostClassifier(DecisionTreeClassifier(max_depth=20),
                            n_estimators=100)

In [21]:
ada_reg.fit(df_train, y_train)

In [22]:
pred = ada_reg.predict(df_test)

print("Accuracy:", accuracy_score(y_test, pred))
print("Precision:", precision_score(y_test, pred))
print("Recall:", recall_score(y_test, pred))

Accuracy: 0.7692889561270801
Precision: 0.7572254335260116
Recall: 0.7927382753403933


Which model is the best and why?

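GradientBoostingClassifier has the highest accuracy (0.789) and recall (0.803) on the test set, while RandomForestClassifier has the highest precision (0.794). One plausible reason for gradient boosting's edge is that each new tree is fitted to the errors of the trees before it. That cannot be the whole explanation, though, because AdaBoostClassifier also learns sequentially from earlier mistakes and has the lowest accuracy (0.769). The gap between the best and worst model is only about two percentage points, and none of the models sets a `random_state`, so the ranking may change from run to run.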
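
The scores above come from a single train/test split with unseeded models, so k-fold cross-validation with fixed seeds is one way to make the comparison more robust. The sketch below shows the pattern on synthetic data from `make_classification`, an assumption made to keep it self-contained. The small hyperparameters are chosen only so it runs quickly and are not the lab's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed training data (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Fixed random_state values make repeated runs give the same scores.
models = {
    "bagging": BaggingClassifier(
        DecisionTreeClassifier(max_depth=5), n_estimators=20, random_state=0),
    "random_forest": RandomForestClassifier(
        n_estimators=20, max_depth=5, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(
        n_estimators=20, max_depth=3, random_state=0),
    "ada_boost": AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=1), n_estimators=20, random_state=0),
}

results = {}
for name, model in models.items():
    # Five accuracy scores per model, one per held-out fold.
    fold_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[name] = (fold_scores.mean(), fold_scores.std())

for name, (mean, std) in results.items():
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

If the fold-to-fold standard deviation is as large as the gap between two models, the data does not clearly favor either one.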