# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [58]:
#Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

In [5]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [6]:
spaceship.shape

(8693, 14)

In [7]:
spaceship.dtypes

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object

In [8]:
spaceship.isna().any()

PassengerId     False
HomePlanet       True
CryoSleep        True
Cabin            True
Destination      True
Age              True
VIP              True
RoomService      True
FoodCourt        True
ShoppingMall     True
Spa              True
VRDeck           True
Name             True
Transported     False
dtype: bool

In [9]:
cleaned_df = spaceship.dropna()

In [10]:
cleaned_df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [11]:
# Keep only the first part of the Cabin value (the text before the first '/')
def get_first_value(value):
    # Split only when the value is a string that contains '/'
    if isinstance(value, str) and '/' in value:
        return value.split('/')[0]
    # Fallback for anything else; not expected after dropna()
    return 0

# Work on an explicit copy so the assignment below does not trigger SettingWithCopyWarning
cleaned_df = cleaned_df.copy()
# Apply the function above to update the column
cleaned_df['Cabin'] = cleaned_df['Cabin'].apply(get_first_value)
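The same extraction can also be written with pandas string methods instead of `.apply`. The sketch below uses a small made-up frame (`toy`) rather than the real data, and the new column name `Deck` is only illustrative.

```python
import pandas as pd

# Toy frame standing in for cleaned_df; the Cabin values are made up for illustration
toy = pd.DataFrame({"Cabin": ["B/0/P", "F/0/S", "A/0/S", "G/12/P"]})

# .str.split("/") yields a list per row; .str[0] keeps the first element
toy["Deck"] = toy["Cabin"].str.split("/").str[0]
print(toy)
```

Writing the result to a new column keeps the original `Cabin` value available for later checks.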


In [12]:
cleaned_df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [13]:
# Dropping the columns that will not be used
cleaned_df = cleaned_df.drop(['PassengerId', 'Name'], axis=1)

In [14]:
cleaned_df.head()

Unnamed: 0,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported
0,Europa,False,B,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False
1,Earth,False,F,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True
2,Europa,False,A,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False
3,Europa,False,A,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False
4,Earth,False,F,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True


In [15]:
# Since we only accept numeric values, we have to turn the categorical variables into numerics
categorical_cols = ['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP']
spaceship_dummy = pd.get_dummies(cleaned_df[categorical_cols], drop_first=False, dtype=int)
spaceship_dummy

Unnamed: 0,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_False,CryoSleep_True,Cabin_A,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_False,VIP_True
0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,1,1,0
1,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,1,0
2,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,1
3,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,1,1,0
4,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,0,1,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,1
8689,1,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,1,0
8690,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,1,0
8691,0,1,0,1,0,0,0,0,0,1,0,0,0,1,0,0,1,0


In [16]:
cleaned_df.shape[0]

6606

In [17]:
spaceship_dummy.shape[0]

6606

In [18]:
cleaned_df.shape[0] == spaceship_dummy.shape[0]

True

In [19]:
# Joining the data
spaceship_transformed = pd.merge(
    cleaned_df,
    spaceship_dummy,
    left_index = True,
    right_index= True
)

In [20]:
spaceship_transformed.head()

Unnamed: 0,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,...,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_False,VIP_True
0,Europa,False,B,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,1,1,0
1,Earth,False,F,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,...,0,0,1,0,0,0,0,1,1,0
2,Europa,False,A,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,...,0,0,0,0,0,0,0,1,0,1
3,Europa,False,A,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,...,0,0,0,0,0,0,0,1,1,0
4,Earth,False,F,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,...,0,0,1,0,0,0,0,1,1,0


In [21]:
spaceship_transformed.drop(columns= ['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP'], inplace=True)

In [22]:
spaceship_transformed.head()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,...,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_False,VIP_True
0,39.0,0.0,0.0,0.0,0.0,0.0,False,0,1,0,...,0,0,0,0,0,0,0,1,1,0
1,24.0,109.0,9.0,25.0,549.0,44.0,True,1,0,0,...,0,0,1,0,0,0,0,1,1,0
2,58.0,43.0,3576.0,0.0,6715.0,49.0,False,0,1,0,...,0,0,0,0,0,0,0,1,0,1
3,33.0,0.0,1283.0,371.0,3329.0,193.0,False,0,1,0,...,0,0,0,0,0,0,0,1,1,0
4,16.0,303.0,70.0,151.0,565.0,2.0,True,1,0,0,...,0,0,1,0,0,0,0,1,1,0
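The encode, merge and drop steps above can probably be collapsed into a single `pd.get_dummies` call by passing the `columns` argument, which replaces the listed columns with their dummies and leaves the others untouched. The frame below is a made-up three-row stand-in, not the real data.

```python
import pandas as pd

# Made-up stand-in for cleaned_df with two categorical columns and one numeric column
toy = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", "Mars"],
    "VIP": [False, False, True],
    "Age": [39.0, 24.0, 58.0],
})

# columns= encodes only the listed columns and drops the originals in the same step
encoded = pd.get_dummies(toy, columns=["HomePlanet", "VIP"], dtype=int)
print(encoded.columns.tolist())
```

Because everything happens on one frame, there is no index alignment to check between the dummies and the numeric columns.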


**Perform Train Test Split**

In [24]:
X_train, X_test, y_train, y_test = train_test_split(spaceship_transformed.drop(columns=['Transported']), spaceship_transformed['Transported'])
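The split above uses the defaults, so it holds out 25% of the rows and changes on every run. A minimal sketch of a reproducible, stratified split is shown below on synthetic data; the seed `42` and the array names are arbitrary choices, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and boolean labels standing in for the real data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = rng.random(200) < 0.5

# random_state fixes the shuffle; stratify keeps the class ratio similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Repeating the call with the same seed returns the same rows
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(X_tr.shape, X_te.shape, np.array_equal(X_te, X_te2))
```

Fixing the seed matters here because the models compared later differ by only a few test rows.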

In [25]:
X_train.head()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_False,...,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_False,VIP_True
5618,28.0,0.0,0.0,0.0,0.0,0.0,0,1,0,1,...,0,0,0,0,0,0,0,1,1,0
1711,17.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,...,0,0,0,1,0,0,0,1,1,0
1570,20.0,127.0,0.0,0.0,542.0,0.0,1,0,0,1,...,0,0,1,0,0,0,1,0,1,0
519,33.0,0.0,165.0,1.0,2525.0,0.0,0,1,0,1,...,0,0,0,0,0,0,0,1,1,0
2423,20.0,0.0,0.0,0.0,0.0,0.0,0,0,1,0,...,0,0,1,0,0,0,0,1,1,0


In [26]:
y_train.head()

5618     True
1711     True
1570    False
519     False
2423     True
Name: Transported, dtype: bool

In [27]:
X_test.head()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_False,...,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_False,VIP_True
4492,31.0,588.0,1261.0,0.0,228.0,6.0,0,1,0,1,...,0,0,0,0,0,0,0,1,1,0
198,13.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,...,0,0,0,1,0,0,0,1,1,0
6165,22.0,151.0,6.0,26.0,0.0,533.0,1,0,0,1,...,0,0,1,0,0,0,0,1,1,0
8139,40.0,76.0,0.0,1169.0,0.0,64.0,0,0,1,1,...,0,0,1,0,0,0,0,1,1,0
1003,38.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,...,0,0,0,1,0,0,1,0,1,0


In [28]:
y_test.head()

4492    False
198     False
6165    False
8139     True
1003    False
Name: Transported, dtype: bool

In [38]:
from sklearn.preprocessing import MinMaxScaler
normalizer = MinMaxScaler()

In [39]:
# Fit the scaler on the training data only
normalizer.fit(X_train)

In [40]:
X_train_norm = normalizer.transform(X_train)
X_test_norm = normalizer.transform(X_test)

In [41]:
# transform() returns a NumPy array rather than a DataFrame,
# so convert the result back to a DataFrame and keep it

X_train_norm = pd.DataFrame(X_train_norm, columns=normalizer.feature_names_in_)
X_train_norm

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_False,...,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_False,VIP_True
0,0.354430,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
1,0.215190,0.000000,0.000000,0.000000,0.000000,0.000000,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0
2,0.253165,0.012802,0.000000,0.000000,0.024188,0.000000,1.0,0.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
3,0.417722,0.000000,0.005534,0.000096,0.112683,0.000000,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
4,0.253165,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4949,0.215190,0.032157,0.006172,0.000000,0.000000,0.022022,1.0,0.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
4950,0.468354,0.158972,0.382249,0.000000,0.045252,0.032798,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
4951,0.392405,0.059778,0.018113,0.022832,0.000000,0.004510,1.0,0.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
4952,0.493671,0.000000,0.015228,0.000000,0.020261,0.000000,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0


In [42]:
X_test_norm = pd.DataFrame(X_test_norm, columns=X_test.columns)
X_test_norm.head()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_False,...,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_False,VIP_True
0,0.392405,0.059274,0.042297,0.0,0.010175,0.000351,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
1,0.164557,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0
2,0.278481,0.015222,0.000201,0.002494,0.0,0.031217,1.0,0.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
3,0.506329,0.007661,0.0,0.112145,0.0,0.003748,0.0,0.0,1.0,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
4,0.481013,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0


In [44]:
from sklearn.neighbors import KNeighborsClassifier

# Create knn,
knn = KNeighborsClassifier(n_neighbors=3)

In [45]:
knn.fit(X_train_norm, y_train)

In [47]:
# Predict on the scaled test set, since the model was fit on scaled features
y_pred = knn.predict(X_test_norm)

In [48]:
y_test[:5]

4492    False
198     False
6165    False
8139     True
1003    False
Name: Transported, dtype: bool

In [49]:
knn.score(X_test_norm, y_test)



0.7530266343825666

In [50]:
from sklearn.metrics import mean_squared_error

# With boolean labels, the MSE equals the share of misclassified rows (1 - accuracy)
mse_transport = mean_squared_error(y_test, y_pred)
print(f"{mse_transport:.4f}")

0.2470
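One way to make sure the model always sees features scaled the same way at fit and at predict time is to bundle the scaler and the classifier in a pipeline. The sketch below runs on synthetic data; `knn_pipe` and the inflated first feature are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data; the first feature is inflated to mimic columns on very different scales
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X[:, 0] *= 1000
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The pipeline fits the scaler on the training data and reuses it inside predict/score
knn_pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=3))
knn_pipe.fit(X_tr, y_tr)

y_hat = knn_pipe.predict(X_te)  # raw features go in; scaling happens internally
print("accuracy:", knn_pipe.score(X_te, y_te))
```

With this setup there is no separate scaled copy of the test set that could be passed by mistake.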

## Model Selection

Now try different ensemble methods to see whether they produce a better model.

- Bagging and Pasting

In [None]:
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier


In [55]:
bagging_cls = BaggingClassifier(DecisionTreeClassifier(max_depth=20), n_estimators=100, max_samples=1000)

In [56]:
# Training bagging model with our normalized data

bagging_cls.fit(X_train_norm, y_train)

In [80]:
pred = bagging_cls.predict(X_test_norm)

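# The second metric is the MSE (no square root is taken);
# for a classifier, .score() returns the mean accuracy, not R2
print("MAE", mean_absolute_error(y_test, pred))
print("MSE", mean_squared_error(y_test, pred))
print("Accuracy", bagging_cls.score(X_test_norm, y_test))

MAE 0.20278450363196127
MSE 0.20278450363196127
Accuracy 0.7972154963680388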
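The first two values printed above are identical, and each adds up to 1 with the third. That is expected for binary labels: every error has size 1, so the mean absolute error, the mean squared error and the misclassification rate coincide. A small check with made-up labels, encoded as 0/1 integers:

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error

# Made-up binary labels and predictions; two of the eight predictions are wrong
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_hat = np.array([1, 1, 1, 0, 0, 0, 1, 0])

mae = mean_absolute_error(y_true, y_hat)
mse = mean_squared_error(y_true, y_hat)
acc = accuracy_score(y_true, y_hat)

# |error| and error**2 are both 0 or 1, so the two averages are the same number
print(mae, mse, 1 - acc)
```

For a classification task, accuracy alone already carries the information in all three numbers.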




In [76]:
bagging_cls.score(X_test_norm, y_test)



0.7972154963680388
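The model above uses the default `bootstrap=True`, so it is bagging (sampling with replacement). Pasting is the same ensemble with `bootstrap=False`. The sketch below compares the two on synthetic data and also reads the out-of-bag estimate, which is only available when bootstrapping; the hyperparameters are illustrative, not tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the scaled Spaceship Titanic features
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging: each tree sees 200 rows drawn with replacement; oob_score uses the rows left out
bagging = BaggingClassifier(
    DecisionTreeClassifier(max_depth=5), n_estimators=50, max_samples=200,
    bootstrap=True, oob_score=True, random_state=0,
).fit(X_tr, y_tr)

# Pasting: each tree sees 200 rows drawn without replacement
pasting = BaggingClassifier(
    DecisionTreeClassifier(max_depth=5), n_estimators=50, max_samples=200,
    bootstrap=False, random_state=0,
).fit(X_tr, y_tr)

print("bagging test accuracy:", bagging.score(X_te, y_te))
print("bagging OOB accuracy: ", bagging.oob_score_)
print("pasting test accuracy:", pasting.score(X_te, y_te))
```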

- Random Forests

In [61]:
rf = RandomForestClassifier(n_estimators=100, max_depth=20)

In [62]:
rf.fit(X_train_norm, y_train)

In [66]:
# Evaluating my model
pred = rf.predict(X_test_norm)

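# The second metric is the MSE (no square root is taken);
# for a classifier, .score() returns the mean accuracy, not R2
print("MAE", mean_absolute_error(y_test, pred))
print("MSE", mean_squared_error(y_test, pred))
print("Accuracy", rf.score(X_test_norm, y_test))

MAE 0.2009685230024213
MSE 0.2009685230024213
Accuracy 0.7990314769975787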




In [77]:
rf.score(X_test_norm, y_test)



0.7990314769975787
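A fitted random forest also exposes `feature_importances_`, which can help with the feature selection step of this Lab. The sketch below shows the pattern on synthetic data with placeholder column names (`feature_0` and so on); on the real data the index would be the training columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder column names
X_arr, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

forest = RandomForestClassifier(n_estimators=100, max_depth=20, random_state=0).fit(X, y)

# Impurity-based importances, one value per column, normalised to sum to 1
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

These importances are impurity-based, so they are best read as a rough ranking rather than an exact measure of each feature's effect.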

- Gradient Boosting

In [67]:
gb_cls = GradientBoostingClassifier(max_depth=20, n_estimators=100)

In [68]:
# Training the model
gb_cls.fit(X_train_norm, y_train)

In [70]:
# Evaluating the model
pred = gb_cls.predict(X_test_norm)

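# The second metric is the MSE (no square root is taken);
# for a classifier, .score() returns the mean accuracy, not R2
print("MAE", mean_absolute_error(y_test, pred))
print("MSE", mean_squared_error(y_test, pred))
print("Accuracy", gb_cls.score(X_test_norm, y_test))

MAE 0.2179176755447942
MSE 0.2179176755447942
Accuracy 0.7820823244552058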




In [78]:
gb_cls.score(X_test_norm, y_test)



0.7820823244552058
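Gradient boosting is usually run with shallow trees (scikit-learn's default is `max_depth=3`), so `max_depth=20` may be part of the reason it trails the bagged models here. The sketch below, on synthetic data, uses `staged_predict` to track validation accuracy as trees are added; the validation split and variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data split into a training part and a validation part
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

gb = GradientBoostingClassifier(max_depth=3, n_estimators=100, random_state=0)
gb.fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 100 trees
staged_acc = [accuracy_score(y_val, stage_pred) for stage_pred in gb.staged_predict(X_val)]
best_n = int(np.argmax(staged_acc)) + 1
print("best number of trees on the validation set:", best_n)
print("validation accuracy with all trees:", staged_acc[-1])
```

The tree count should be chosen on a validation set rather than the final test set, otherwise the reported test accuracy is optimistic.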

- Adaptive Boosting

In [71]:
ada_cls = AdaBoostClassifier(DecisionTreeClassifier(max_depth=20), n_estimators=100)

In [73]:
# Training the model
ada_cls.fit(X_train_norm, y_train)

In [74]:
# Evaluating the model
pred = ada_cls.predict(X_test_norm)

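# The second metric is the MSE (no square root is taken);
# for a classifier, .score() returns the mean accuracy, not R2
print("MAE", mean_absolute_error(y_test, pred))
print("MSE", mean_squared_error(y_test, pred))
print("Accuracy", ada_cls.score(X_test_norm, y_test))

MAE 0.24092009685230023
MSE 0.24092009685230023
Accuracy 0.7590799031476998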




In [79]:
ada_cls.score(X_test_norm, y_test)



0.7590799031476998
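AdaBoost is designed around weak learners, and scikit-learn's default base estimator is a depth-1 tree (a stump). To my understanding, its implementation also stops adding estimators once one of them reaches zero weighted training error, so a very deep base tree can leave the ensemble with few members. The sketch below compares stumps with a deep tree on synthetic data and prints how many estimators each ensemble kept; it does not claim which one scores better.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the scaled features
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Weak learners: depth-1 trees (the library default base estimator)
ada_stumps = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=100, random_state=0
).fit(X_tr, y_tr)

# Strong learners: depth-20 trees, as used in the cell above
ada_deep = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=20), n_estimators=100, random_state=0
).fit(X_tr, y_tr)

for name, model in [("stumps", ada_stumps), ("depth 20", ada_deep)]:
    print(name, "| estimators kept:", len(model.estimators_),
          "| test accuracy:", model.score(X_te, y_te))
```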

Which model is the best and why?

In [83]:
# MSE is the mean squared error (no square root); Accuracy is the value returned by .score()
data = {'Model': ['Bagging', 'Random Forest', 'Gradient Boosting', 'Adaptive Boosting'],
        'MAE': [0.20278450363196127, 0.2009685230024213, 0.2179176755447942, 0.24092009685230023],
        'MSE': [0.20278450363196127, 0.2009685230024213, 0.2179176755447942, 0.24092009685230023],
        'Accuracy': [0.7972154963680388, 0.7990314769975787, 0.7820823244552058, 0.7590799031476998]}

df = pd.DataFrame(data)
df

Unnamed: 0,Model,MAE,MSE,Accuracy
0,Bagging,0.202785,0.202785,0.797215
1,Random Forest,0.200969,0.200969,0.799031
2,Gradient Boosting,0.217918,0.217918,0.782082
3,Adaptive Boosting,0.24092,0.24092,0.75908


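Of the models above, the Random Forest performs best on this test split.
Rationale:
Highest accuracy: the value returned by `.score()` is the mean accuracy (not R2). Random Forest reaches 0.799, compared with 0.797 for bagging, 0.782 for gradient boosting and 0.759 for adaptive boosting.
Lowest error: with boolean labels, the absolute-error and squared-error metrics both equal the share of misclassified rows (1 - accuracy), so they carry the same information as the accuracy. Random Forest has the lowest value (0.200969).
Caveat: the margin over bagging is about 0.2 percentage points on a single random split, so the ranking of the top two models could change with a different split.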
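Because the gaps between the models are small, a single train/test split is a weak basis for ranking them. A common way to get a steadier comparison is k-fold cross-validation. The sketch below shows the pattern on synthetic data; the hyperparameters are kept small for speed and are not the ones used in this Lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features and target
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

models = {
    "Bagging": BaggingClassifier(DecisionTreeClassifier(max_depth=5), n_estimators=30, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=30, max_depth=5, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=30, max_depth=3, random_state=0),
    "Adaptive Boosting": AdaBoostClassifier(n_estimators=30, random_state=0),
}

# Five-fold accuracy per model: the mean ranks the models, the std shows how noisy the ranking is
cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If two models differ by less than the spread across folds, the data does not really separate them.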