# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [2]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [3]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [4]:
spaceship.dropna(inplace=True)

In [5]:
spaceship['Cabin'] = spaceship['Cabin'].apply(lambda x: x.split('/')[0])

In [6]:
spaceship.drop(columns=['PassengerId', 'Name'], inplace=True)

Now perform the same steps as before:
- Feature Scaling
- Feature Selection


In [7]:
df_dummies = pd.get_dummies(spaceship)
display(df_dummies.head())
display(df_dummies.columns)

,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,...,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_False,VIP_True
0,39.0,0.0,0.0,0.0,0.0,0.0,False,False,True,False,...,False,False,False,False,False,False,False,True,True,False
1,24.0,109.0,9.0,25.0,549.0,44.0,True,True,False,False,...,False,False,True,False,False,False,False,True,True,False
2,58.0,43.0,3576.0,0.0,6715.0,49.0,False,False,True,False,...,False,False,False,False,False,False,False,True,False,True
3,33.0,0.0,1283.0,371.0,3329.0,193.0,False,False,True,False,...,False,False,False,False,False,False,False,True,True,False
4,16.0,303.0,70.0,151.0,565.0,2.0,True,True,False,False,...,False,False,True,False,False,False,False,True,True,False


Index(['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
       'Transported', 'HomePlanet_Earth', 'HomePlanet_Europa',
       'HomePlanet_Mars', 'CryoSleep_False', 'CryoSleep_True', 'Cabin_A',
       'Cabin_B', 'Cabin_C', 'Cabin_D', 'Cabin_E', 'Cabin_F', 'Cabin_G',
       'Cabin_T', 'Destination_55 Cancri e', 'Destination_PSO J318.5-22',
       'Destination_TRAPPIST-1e', 'VIP_False', 'VIP_True'],
      dtype='object')

**Perform Train Test Split**

In [8]:
# Caution: 'Transported' is not in this drop list, so the target column stays in `features`.
features = df_dummies.drop(columns = ['HomePlanet_Earth', 'HomePlanet_Europa', 'HomePlanet_Mars', 'VIP_False', 'VIP_True', 'Cabin_A', 'Cabin_B',
                                      'Cabin_C', 'Cabin_D', 'Cabin_E', 'Cabin_F', 'Cabin_G', 'Cabin_T', 'FoodCourt', 'ShoppingMall', 'Age'])
target = df_dummies["Transported"]
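
The `features` frame built above is derived from `df_dummies` without removing `Transported`, so the target remains among the inputs. A minimal sketch of a leakage check follows. The toy frame, its values, and the helper `has_target_leak` are hypothetical and only mirror the lab's column names; this is an illustration, not the lab's reference solution.

```python
import pandas as pd

# Toy stand-in for df_dummies: column names mirror the lab, values are made up.
toy = pd.DataFrame({
    "RoomService": [0.0, 109.0, 43.0, 0.0],
    "Spa": [0.0, 549.0, 6715.0, 3329.0],
    "CryoSleep_True": [False, False, False, True],
    "Transported": [False, True, False, True],
})

target_col = "Transported"

# Feature selection that forgets to remove the target.
leaky_features = toy.drop(columns=["Spa"])
# Feature selection that removes the target explicitly.
clean_features = toy.drop(columns=["Spa", target_col])


def has_target_leak(features: pd.DataFrame, target_name: str) -> bool:
    """Return True if the target column is still present among the features."""
    return target_name in features.columns


print(has_target_leak(leaky_features, target_col))
print(has_target_leak(clean_features, target_col))
```

Running a check like this before `train_test_split` is a cheap guard: if it reports a leak, any score obtained afterwards says nothing about how well the model generalizes.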


In [9]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.20, random_state=0)

In [10]:
from sklearn.preprocessing import StandardScaler

normalizer = StandardScaler()
normalizer.fit(X_train)

In [11]:
X_train_norm = pd.DataFrame(normalizer.transform(X_train), columns=X_train.columns)
X_test_norm = pd.DataFrame(normalizer.transform(X_test), columns=X_test.columns)
display(X_train_norm)
display(X_test_norm)


,RoomService,Spa,VRDeck,Transported,CryoSleep_False,CryoSleep_True,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e
0,-0.347046,-0.271543,-0.271123,-1.009126,-1.353973,1.353973,-0.526549,-0.328263,0.677097
1,-0.347046,-0.271543,-0.271123,0.990957,-1.353973,1.353973,-0.526549,-0.328263,0.677097
2,-0.347046,0.718070,-0.271123,-1.009126,0.738567,-0.738567,-0.526549,-0.328263,0.677097
3,-0.326846,0.044548,-0.270259,-1.009126,0.738567,-0.738567,-0.526549,-0.328263,0.677097
4,-0.347046,-0.271543,-0.271123,0.990957,-1.353973,1.353973,-0.526549,-0.328263,0.677097
...,...,...,...,...,...,...,...,...,...
5279,-0.347046,-0.271543,-0.271123,-1.009126,0.738567,-0.738567,-0.526549,-0.328263,0.677097
5280,-0.347046,0.352887,-0.269396,-1.009126,0.738567,-0.738567,1.899158,-0.328263,-1.476894
5281,-0.347046,6.461962,-0.188247,-1.009126,0.738567,-0.738567,-0.526549,-0.328263,0.677097
5282,-0.347046,0.318435,1.264664,-1.009126,0.738567,-0.738567,-0.526549,-0.328263,0.677097


,RoomService,Spa,VRDeck,Transported,CryoSleep_False,CryoSleep_True,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e
0,-0.347046,-0.271543,-0.271123,0.990957,-1.353973,1.353973,1.899158,-0.328263,-1.476894
1,-0.347046,-0.271543,-0.271123,-1.009126,-1.353973,1.353973,-0.526549,3.046335,-1.476894
2,-0.347046,-0.271543,-0.271123,0.990957,-1.353973,1.353973,1.899158,-0.328263,-1.476894
3,-0.347046,-0.271543,-0.271123,-1.009126,-1.353973,1.353973,-0.526549,3.046335,-1.476894
4,-0.347046,1.228811,-0.271123,0.990957,0.738567,-0.738567,-0.526549,-0.328263,0.677097
...,...,...,...,...,...,...,...,...,...
1317,-0.347046,-0.270682,-0.241771,-1.009126,0.738567,-0.738567,-0.526549,-0.328263,0.677097
1318,0.119115,-0.270682,-0.123501,0.990957,0.738567,-0.738567,-0.526549,-0.328263,0.677097
1319,-0.343939,-0.265514,-0.271123,0.990957,0.738567,-0.738567,-0.526549,-0.328263,0.677097
1320,-0.347046,-0.271543,-0.271123,0.990957,-1.353973,1.353973,-0.526549,3.046335,-1.476894


**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

In [24]:
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

- Bagging and Pasting

In [13]:

bagging_clas = BaggingClassifier(DecisionTreeClassifier(),
                               n_estimators=100,
                               max_samples = 1000,
                               bootstrap=True) 
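
The heading mentions pasting, but the cell above only configures bagging (`bootstrap=True`). As a rough sketch of the difference, the block below fits both variants on synthetic data from `make_classification`. The data, the reduced `n_estimators=50`, and the variable names are assumptions for illustration and do not come from the lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the lab's features (an assumption).
X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Bagging: each tree is trained on 1000 rows drawn WITH replacement.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            max_samples=1000, bootstrap=True, random_state=0)
# Pasting: each tree is trained on 1000 rows drawn WITHOUT replacement.
pasting = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            max_samples=1000, bootstrap=False, random_state=0)

scores = {}
for name, model in [("bagging", bagging), ("pasting", pasting)]:
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)
    print(name, round(scores[name], 3))
```

The only change between the two ensembles is the `bootstrap` flag. Which one performs better depends on the data, so it is worth trying both on the lab's train and test split.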

In [14]:
bagging_clas.fit(X_train_norm, y_train)

In [15]:
# Predict class labels for the held-out test set (regression metrics are not needed here)
pred = bagging_clas.predict(X_test_norm)

In [16]:
from sklearn.metrics import classification_report

print(classification_report(y_pred = pred, y_true = y_test))

              precision    recall  f1-score   support

       False       1.00      1.00      1.00       661
        True       1.00      1.00      1.00       661

    accuracy                           1.00      1322
   macro avg       1.00      1.00      1.00      1322
weighted avg       1.00      1.00      1.00      1322



- Random Forests

In [25]:
forest = RandomForestClassifier(n_estimators=100)

In [27]:
forest.fit(X_train_norm, y_train)

In [28]:
pred = forest.predict(X_test_norm)

In [30]:
forest.score(X_test_norm, y_test)

1.0

In [31]:
print(classification_report(y_pred = pred, y_true = y_test))

              precision    recall  f1-score   support

       False       1.00      1.00      1.00       661
        True       1.00      1.00      1.00       661

    accuracy                           1.00      1322
   macro avg       1.00      1.00      1.00      1322
weighted avg       1.00      1.00      1.00      1322



- Gradient Boosting

In [46]:
gb = GradientBoostingClassifier(n_estimators=100)

In [47]:
gb.fit(X_train_norm, y_train)

In [48]:
pred = gb.predict(X_test_norm)

In [49]:
gb.score(X_test_norm, y_test)

1.0

In [50]:
print(classification_report(y_pred = pred, y_true = y_test))

              precision    recall  f1-score   support

       False       1.00      1.00      1.00       661
        True       1.00      1.00      1.00       661

    accuracy                           1.00      1322
   macro avg       1.00      1.00      1.00      1322
weighted avg       1.00      1.00      1.00      1322



- Adaptive Boosting

In [43]:
ada = AdaBoostClassifier(n_estimators=100)

In [44]:
ada.fit(X_train_norm, y_train)



In [45]:
pred = ada.predict(X_test_norm)

In [51]:
ada.score(X_test_norm, y_test)

1.0

In [52]:
print(classification_report(y_pred = pred, y_true = y_test))

              precision    recall  f1-score   support

       False       1.00      1.00      1.00       661
        True       1.00      1.00      1.00       661

    accuracy                           1.00      1322
   macro avg       1.00      1.00      1.00      1322
weighted avg       1.00      1.00      1.00      1322



Which model is the best and why?

In [23]:
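# The reports cannot rank the models: every one scores 1.00 on all metrics.
# Perfect scores across four different ensembles point to target leakage. The scaled
# feature tables above still contain a `Transported` column, so each model is given the answer.
# A meaningful comparison requires dropping `Transported` from `features` and rerunning.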
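
When a single train/test split gives near-identical reports, cross-validation can give a more informative comparison, because it reports a mean and a spread per model. The block below is a sketch on synthetic data from `make_classification`. The data, the reduced `n_estimators=25`, and the `results` dictionary are assumptions for illustration; the numbers it prints say nothing about the Spaceship Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the scaled lab features (an assumption).
X, y = make_classification(n_samples=600, n_features=8, random_state=0)

models = {
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=25, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=25, random_state=0),
    "ada_boost": AdaBoostClassifier(n_estimators=25, random_state=0),
}

# 5-fold cross-validated accuracy: store (mean, standard deviation) per model.
results = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

If the means differ by less than the fold-to-fold standard deviation, the models are effectively tied and the simpler or faster one is a reasonable choice.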