# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this lab, try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous lab.

In [65]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder,  MinMaxScaler, StandardScaler

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import precision_score, accuracy_score, recall_score, f1_score

In [66]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [67]:
#your code here
# Drop every row that contains at least one missing value, then renumber the index
spaceship.dropna(inplace=True)
spaceship.reset_index(drop=True, inplace=True)

# Keep only the deck letter from Cabin (values look like "B/0/P")
spaceship['Cabin'] = spaceship['Cabin'].str.split('/').str[0]

# split into features and target
features = spaceship.drop(['PassengerId', 'Name','Transported'], axis=1)
target = spaceship.Transported

In [68]:
# One-hot encode the categorical (object-dtype) columns
def encode_via_one_hot(features_df):
    """Return a DataFrame with the one-hot encoding of the object-dtype columns."""
    features_cat = features_df.select_dtypes(include='object')
    print(features_cat)

    # drop='if_binary' keeps a single column for two-valued features (CryoSleep, VIP)
    encoder = OneHotEncoder(drop='if_binary').fit(features_cat)
    cols = encoder.get_feature_names_out(input_features=features_cat.columns)
    spaceship_encode = pd.DataFrame(encoder.transform(features_cat).toarray(), columns=cols)
    spaceship_encode.reset_index(drop=True, inplace=True)
    return spaceship_encode


# Rebuild the feature matrix: numeric columns plus one-hot encoded categorical columns
cat_features_one_hot = encode_via_one_hot(features)
num_features = features.select_dtypes(include='number')

features = pd.concat([num_features, cat_features_one_hot], axis=1)
features

     HomePlanet CryoSleep Cabin    Destination    VIP
0        Europa     False     B    TRAPPIST-1e  False
1         Earth     False     F    TRAPPIST-1e  False
2        Europa     False     A    TRAPPIST-1e   True
3        Europa     False     A    TRAPPIST-1e  False
4         Earth     False     F    TRAPPIST-1e  False
...         ...       ...   ...            ...    ...
6601     Europa     False     A    55 Cancri e   True
6602      Earth      True     G  PSO J318.5-22  False
6603      Earth     False     G    TRAPPIST-1e  False
6604     Europa     False     E    55 Cancri e  False
6605     Europa     False     E    TRAPPIST-1e  False

[6606 rows x 5 columns]


Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_True,...,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_True
0,39.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,24.0,109.0,9.0,25.0,549.0,44.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
2,58.0,43.0,3576.0,0.0,6715.0,49.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
3,33.0,0.0,1283.0,371.0,3329.0,193.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,16.0,303.0,70.0,151.0,565.0,2.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6601,41.0,0.0,6819.0,0.0,1643.0,74.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
6602,18.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
6603,26.0,0.0,0.0,1872.0,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
6604,32.0,0.0,1049.0,0.0,353.0,3235.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [69]:
# scale feature data
def scale_data(feature_data, normalizer):
    normalizer.fit(feature_data)

    norm_arr = normalizer.transform(feature_data)

    feature_data_norm = pd.DataFrame(norm_arr, columns = feature_data.columns)

    return feature_data_norm

scaler = MinMaxScaler()
features = scale_data(features, normalizer=scaler)
features

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_True,...,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_True
0,0.493671,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.303797,0.010988,0.000302,0.002040,0.024500,0.002164,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.734177,0.004335,0.119948,0.000000,0.299670,0.002410,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
3,0.417722,0.000000,0.043035,0.030278,0.148563,0.009491,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.202532,0.030544,0.002348,0.012324,0.025214,0.000098,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6601,0.518987,0.000000,0.228726,0.000000,0.073322,0.003639,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
6602,0.227848,0.000000,0.000000,0.000000,0.000000,0.000000,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
6603,0.329114,0.000000,0.000000,0.152779,0.000045,0.000000,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
6604,0.405063,0.000000,0.035186,0.000000,0.015753,0.159077,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


**Perform Train Test Split**

In [70]:
#your code here
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=0)
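
In the cells above, `MinMaxScaler` is fitted on all rows before `train_test_split`, so the test rows contribute to the learned minimum and maximum. A minimal sketch of a common alternative, splitting first and fitting the scaler on the training rows only, is shown here on synthetic data. The column values and every `demo_`/`Xtr`/`Xte` name are assumptions for illustration, not the lab's data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for two numeric columns (not the real Spaceship Titanic data).
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    "Age": rng.integers(0, 80, size=200).astype(float),
    "Spa": rng.exponential(scale=300.0, size=200),
})
demo_y = rng.integers(0, 2, size=200)

# Split first, so the test rows stay unseen while the scaler is fitted.
Xtr, Xte, ytr, yte = train_test_split(demo_X, demo_y, test_size=0.20, random_state=0)

# Fit the scaler on the training rows only, then apply it to both sets.
demo_scaler = MinMaxScaler().fit(Xtr)
Xtr_scaled = pd.DataFrame(demo_scaler.transform(Xtr), columns=Xtr.columns, index=Xtr.index)
Xte_scaled = pd.DataFrame(demo_scaler.transform(Xte), columns=Xte.columns, index=Xte.index)

# Training values span roughly [0, 1]; test values may fall slightly outside that range.
print(Xtr_scaled.min().min(), Xtr_scaled.max().max())
print(Xte_scaled.min().min(), Xte_scaled.max().max())
```

Tree-based models split on thresholds and are largely insensitive to this kind of rescaling, so the effect on the ensembles in this lab is probably small. The ordering matters more for distance-based or linear models.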

In [73]:
# Helper for model evaluation
def eval_model(model, test_data, test_labels):
    """Print precision, accuracy, recall and F1 for a fitted binary classifier."""
    pred = model.predict(test_data)

    # Put each metric inside the f-string so the value prints without set braces
    print(f"Precision: {precision_score(test_labels, pred, average='binary')}")
    print(f"Accuracy: {accuracy_score(test_labels, pred)}")
    print(f"Recall: {recall_score(test_labels, pred)}")
    print(f"F1: {f1_score(test_labels, pred)}")

**Model Selection** - now apply different ensemble methods to try to get a better model

- Bagging and Pasting

In [74]:
# Bagging: each tree is trained on 1000 rows drawn with replacement (bootstrap=True is the default)
bagging_clf = BaggingClassifier(DecisionTreeClassifier(max_depth=20),
                                n_estimators=100,
                                max_samples=1000)

bagging_clf.fit(X_train, y_train)
eval_model(bagging_clf, X_test, y_test)

Precision: 0.7766116941529235
Accuracy: 0.7791225416036308
Recall: 0.783661119515885
F1: 0.7801204819277107
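
The heading mentions both bagging and pasting, while the cell above only uses bagging. In scikit-learn the two differ by the `bootstrap` flag of `BaggingClassifier`: `True` samples rows with replacement (bagging), `False` samples without replacement (pasting). The following is a small sketch on synthetic data; the dataset, the helper name `fit_subsampled_trees`, and the hyperparameters are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem standing in for the lab data.
X_demo, y_demo = make_classification(n_samples=2000, n_features=20,
                                     n_informative=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.20, random_state=0)


def fit_subsampled_trees(bootstrap):
    """Train 50 trees on 500-row subsets drawn with (bagging) or without (pasting) replacement."""
    model = BaggingClassifier(
        DecisionTreeClassifier(max_depth=20),
        n_estimators=50,
        max_samples=500,
        bootstrap=bootstrap,
        random_state=0,
    )
    return model.fit(Xtr, ytr)


bagging_demo = fit_subsampled_trees(bootstrap=True)
pasting_demo = fit_subsampled_trees(bootstrap=False)

print("bagging accuracy:", bagging_demo.score(Xte, yte))
print("pasting accuracy:", pasting_demo.score(Xte, yte))
```

On the lab data the same comparison would amount to adding `bootstrap=False` to the existing call and evaluating with `eval_model`.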


- Random Forests

In [75]:
#your code here
forest = RandomForestClassifier(n_estimators=100,
                             max_depth=20)
forest.fit(X_train, y_train)

eval_model(forest, X_test, y_test)

Precision: 0.7908396946564885
Accuracy: 0.7881996974281392
Recall: 0.783661119515885
F1: 0.7872340425531914
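
Because every tree in a random forest sees only a bootstrap sample, the rows left out of each sample can serve as a built-in validation set. The sketch below, on synthetic data with hypothetical column names, shows the out-of-bag accuracy (`oob_score_`) and the impurity-based `feature_importances_` that `RandomForestClassifier` exposes. It is an illustration, not part of the lab's required solution.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; the feature names are made up for the example.
X_demo, y_demo = make_classification(n_samples=1500, n_features=10,
                                     n_informative=4, random_state=0)
demo_cols = [f"feature_{i}" for i in range(X_demo.shape[1])]

forest_demo = RandomForestClassifier(n_estimators=100, max_depth=20,
                                     oob_score=True, random_state=0)
forest_demo.fit(X_demo, y_demo)

# Accuracy measured on the rows each tree did not see during bootstrapping.
print("OOB accuracy:", forest_demo.oob_score_)

# Impurity-based importances, sorted from most to least influential.
importances = pd.Series(forest_demo.feature_importances_, index=demo_cols)
importances = importances.sort_values(ascending=False)
print(importances.head())
```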


- Adaptive Boosting

In [76]:
#your code here
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=20),
                             n_estimators=100)

ada_clf.fit(X_train, y_train)

eval_model(ada_clf, X_test, y_test)

Precision: 0.7642752562225475
Accuracy: 0.773071104387292
Recall: 0.789712556732224
F1: 0.7767857142857143
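
AdaBoost is usually paired with weak learners, and scikit-learn's default base estimator is a depth-1 decision tree. With `max_depth=20`, each tree may fit the training rows closely, which can leave later boosting rounds little to correct. The sketch below compares a few base-tree depths on synthetic data; the dataset and the chosen depths are assumptions for illustration, and the results need not carry over to the lab data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=2000, n_features=20,
                                     n_informative=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.20, random_state=0)

ada_results = {}
for depth in (1, 3, 20):
    # The base tree is passed positionally, as in the lab cell.
    model = AdaBoostClassifier(DecisionTreeClassifier(max_depth=depth),
                               n_estimators=100, random_state=0)
    model.fit(Xtr, ytr)
    ada_results[depth] = {
        "rounds_used": len(model.estimators_),  # boosting can stop before n_estimators
        "train_acc": model.score(Xtr, ytr),
        "test_acc": model.score(Xte, yte),
    }

for depth, row in ada_results.items():
    print(depth, row)
```

A large gap between `train_acc` and `test_acc` for the deepest setting would suggest overfitting of the base learner.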


- Gradient Boosting

In [77]:
#your code here
gb_clf = GradientBoostingClassifier(max_depth=20,
                                    n_estimators=100)

gb_clf.fit(X_train, y_train)

eval_model(gb_clf, X_test, y_test)

Precision: 0.7822222222222223
Accuracy: 0.7881996974281392
Recall: 0.7987897125567323
F1: 0.7904191616766468
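
Gradient boosting adds trees one at a time, so the number of trees acts as a tuning knob. `GradientBoostingClassifier.staged_predict` yields the predictions after each added tree, which makes it possible to see where validation accuracy peaks. This is a sketch on synthetic data with shallow trees (`max_depth=3`, the scikit-learn default); the dataset, the separate validation split, and the hyperparameters are assumptions, not the lab's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=2000, n_features=20,
                                     n_informative=8, random_state=0)
# Use a validation split (not the final test set) to choose the number of trees.
Xtr, Xval, ytr, yval = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

gb_demo = GradientBoostingClassifier(max_depth=3, n_estimators=200,
                                     learning_rate=0.1, random_state=0)
gb_demo.fit(Xtr, ytr)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees.
stage_acc = [accuracy_score(yval, pred) for pred in gb_demo.staged_predict(Xval)]
best_n_trees = stage_acc.index(max(stage_acc)) + 1

print("best number of trees:", best_n_trees)
print("validation accuracy at that point:", max(stage_acc))
```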


Which model is the best and why?

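Gradient Boosting performed best overall on this split: it has the highest F1 (0.790) and recall (0.799), and it ties Random Forest for the highest accuracy (0.788). Random Forest has the highest precision (0.791), while Bagging and AdaBoost trail slightly on F1 and accuracy. The margins are small, at most about three percentage points on any metric, and the models were trained without a fixed `random_state` on a single split, so the ranking may change between runs.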
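
One way to make such a comparison more robust is k-fold cross-validation with fixed seeds, which reports a mean and spread per model instead of a single number. The sketch below does this on synthetic data; the dataset, the `candidates` dictionary, and the hyperparameters are illustrative assumptions rather than the settings used in this lab.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=1000, n_features=20,
                                     n_informative=8, random_state=0)

# Hypothetical candidate models, all seeded so reruns give the same numbers.
candidates = {
    "bagging": BaggingClassifier(DecisionTreeClassifier(max_depth=5),
                                 n_estimators=50, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0),
    "adaboost": AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                                   n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(max_depth=3, n_estimators=50,
                                                    random_state=0),
}

# The same five stratified folds are reused for every model.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
rows = []
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1")
    rows.append({"model": name, "f1_mean": scores.mean(), "f1_std": scores.std()})

cv_summary = (pd.DataFrame(rows)
              .sort_values("f1_mean", ascending=False)
              .reset_index(drop=True))
print(cv_summary)
```

If the gap between two models is smaller than their `f1_std`, the data do not clearly favor either one.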