# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see if you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [41]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier

from sklearn.ensemble import (
    BaggingRegressor, RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor,
    RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, BaggingClassifier,
)
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error, classification_report, confusion_matrix, accuracy_score

import matplotlib.pyplot as plt
import seaborn as sns


In [42]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
pd.options.display.max_columns= None
df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [43]:
df['Cabin'].unique()

array(['B/0/P', 'F/0/S', 'A/0/S', ..., 'G/1499/S', 'G/1500/S', 'E/608/S'],
      dtype=object)

Now perform the same steps as before:
- Feature Scaling
- Feature Selection


In [44]:
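# Keep only the deck letter (first character) of each cabin code
df['Cabin'] = df['Cabin'].str[0]

# get_dummies returns its columns in sorted order and the assignment below is positional,
# so the new column names must be listed in that same sorted order
df[['A','B','C','D','E','F','G','T']] = pd.get_dummies(df.Cabin, dtype=int)
df[['55 Cancri e','PSO J318.5-22','TRAPPIST-1e']] = pd.get_dummies(df.Destination, dtype=int)
df[["Earth", "Europa", "Mars"]] = pd.get_dummies(df.HomePlanet, dtype=int)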

df = df.drop(columns= ['PassengerId', 'Name', 'Cabin', 'Destination', 'HomePlanet'])
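When a `DataFrame` is assigned to a list of new column names, pandas matches the names to the dummy columns by position, and `pd.get_dummies` emits its columns in sorted order. A hand-typed name list can therefore label a column with the wrong category without raising an error. The sketch below shows one way to avoid typing names at all, by letting `get_dummies` name the columns through `prefix` and joining them with `pd.concat`. The `toy` frame, the `Deck` helper column, and the `Deck_`/`Dest_` prefixes are illustrative assumptions, not part of the lab data.

```python
import pandas as pd

# Toy frame standing in for the lab's data (values are illustrative only).
toy = pd.DataFrame({
    "Cabin": ["B/0/P", "F/0/S", "A/0/S", None],
    "Destination": ["TRAPPIST-1e", "55 Cancri e", "PSO J318.5-22", "TRAPPIST-1e"],
})

# Keep only the deck letter; a missing cabin stays missing.
toy["Deck"] = toy["Cabin"].str[0]

# Let get_dummies generate the column names, so labels always match the data.
deck_dummies = pd.get_dummies(toy["Deck"], prefix="Deck", dtype=int)
dest_dummies = pd.get_dummies(toy["Destination"], prefix="Dest", dtype=int)

# Replace the original categorical columns with their one-hot encodings.
encoded = pd.concat(
    [toy.drop(columns=["Cabin", "Deck", "Destination"]), deck_dummies, dest_dummies],
    axis=1,
)
print(encoded.columns.tolist())
```

Rows with a missing category get zeros in every dummy column, so they are not removed by a later `dropna`.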

**Perform Train Test Split**

In [45]:
# Drop rows that still contain missing values
df.dropna(inplace=True)

# Separate the features from the target
X = df.drop(columns=['Transported'])
Y = df['Transported']

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=0)

In [46]:
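# Fit the scaler on the training set only, so no test-set statistics leak into training
normalizer = MinMaxScaler()
normalizer.fit(x_train)

# Apply the same fitted transformation to both splits
x_train_norm = normalizer.transform(x_train)
x_test_norm = normalizer.transform(x_test)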

**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [48]:
base_estimator = DecisionTreeClassifier(max_depth=None, random_state=42)

bagging_classifier = BaggingClassifier(estimator=base_estimator, n_estimators=50, random_state=42)

bagging_classifier.fit(x_train_norm, y_train)

predbagging = bagging_classifier.predict(x_test_norm)

print("Bagging Classifier:")
print(classification_report(y_test, predbagging))



Bagging Classifier:
              precision    recall  f1-score   support

       False       0.77      0.80      0.78       699
        True       0.81      0.78      0.79       752

    accuracy                           0.79      1451
   macro avg       0.79      0.79      0.79      1451
weighted avg       0.79      0.79      0.79      1451



- Random Forests

In [None]:
random_forest_class = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
random_forest_class.fit(x_train_norm, y_train)
predforest = random_forest_class.predict(x_test_norm)
conf_matrix_forest = confusion_matrix(y_test, predforest)

print(classification_report(y_test, predforest))



              precision    recall  f1-score   support

       False       0.82      0.78      0.80       699
        True       0.80      0.84      0.82       752

    accuracy                           0.81      1451
   macro avg       0.81      0.81      0.81      1451
weighted avg       0.81      0.81      0.81      1451
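The random forest cell stores `conf_matrix_forest` but never displays it. As a small illustration of how such a matrix is read for a boolean target like `Transported`, the sketch below uses six hand-made labels (purely illustrative, not taken from the lab data). With `labels=[False, True]`, rows are the true classes and columns the predicted classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-made labels for illustration only.
y_true_demo = np.array([False, False, True, True, True, False])
y_pred_demo = np.array([False, True, True, True, False, False])

# Rows: true class (False, True); columns: predicted class (False, True).
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[False, True])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

Applied to the notebook, `print(conf_matrix_forest)` would show the same layout for the 1451 test rows.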



- Gradient Boosting

In [None]:
gb_classifier = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gb_classifier.fit(x_train_norm, y_train)
prediction = gb_classifier.predict(x_test_norm)

print("Gradient Boosting Classifier:")
print(classification_report(y_test, prediction))


Gradient Boosting Classifier:
              precision    recall  f1-score   support

       False       0.83      0.76      0.79       699
        True       0.79      0.86      0.82       752

    accuracy                           0.81      1451
   macro avg       0.81      0.81      0.81      1451
weighted avg       0.81      0.81      0.81      1451



- Adaptive Boosting

In [None]:
base_estimator = DecisionTreeClassifier(max_depth=1)
ada_boost = AdaBoostClassifier(estimator=base_estimator, n_estimators=50, learning_rate=1, random_state=42)

ada_boost.fit(x_train_norm, y_train)

prediction_ada = ada_boost.predict(x_test_norm)
print("AdaBoost Classifier:")
print(classification_report(y_test, prediction_ada))




AdaBoost Classifier:
              precision    recall  f1-score   support

       False       0.80      0.74      0.77       699
        True       0.77      0.83      0.80       752

    accuracy                           0.78      1451
   macro avg       0.79      0.78      0.78      1451
weighted avg       0.79      0.78      0.78      1451



Which model is the best and why?

In [None]:
# According to the tests above, the best model for this data is gradient boosting, followed closely by random forest.
# Both reach 0.81 accuracy, ahead of bagging (0.79) and AdaBoost (0.78), so the gap between the top two is within rounding.