# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [16]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.tree import DecisionTreeClassifier

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import classification_report

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier


In [21]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


Now perform the same steps as before:
- Feature Scaling
- Feature Selection


In [22]:
spaceship.dropna(inplace=True)

In [23]:
spaceship['Cabin'] = spaceship['Cabin'].str.split('/').str[0]


In [24]:
spaceship.drop(columns=['PassengerId', 'Name'], inplace=True)   

In [25]:
# Select the numeric columns (feature selection)
numeric_columns = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']


# extract the numerical variables
X = spaceship[numeric_columns]
var_target = spaceship['Transported']

# Split the data into training and testing sets 
X_train, X_test, y_train, y_test = train_test_split(X, var_target, test_size=0.2, random_state=0)


# Create a scaler object (feature standardizer)
scaler = StandardScaler()

# Fit the scaler on the training data and transform it
X_train_scaled = scaler.fit_transform(X_train)

# Transform data in the test set
X_test_scaled = scaler.transform(X_test)


**Perform Train Test Split**

In [26]:
#create an instance of the normalizer

normalizer = MinMaxScaler()

normalizer.fit(X_train)

X_train_norm = normalizer.transform(X_train)

X_test_norm = normalizer.transform(X_test)

X_train_norm = pd.DataFrame(X_train_norm, columns = X_train.columns)
X_test_norm = pd.DataFrame(X_test_norm, columns = X_test.columns)
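
To make the two scalers concrete, here is a minimal sketch of what `StandardScaler` and `MinMaxScaler` compute, written with plain NumPy. The toy array is made up for illustration and is not taken from the Spaceship Titanic data. The statistics come from the training rows only and are then reused on the test rows, which is why a test value can land outside [0, 1] after min-max scaling.

```python
import numpy as np

# Hypothetical toy data: a single feature with made-up values.
train = np.array([[0.0], [10.0], [20.0], [30.0]])
test = np.array([[15.0], [40.0]])

# Standardization: subtract the training mean, divide by the training std.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), which StandardScaler also uses
train_std = (train - mean) / std
test_std = (test - mean) / std

# Min-max normalization: map the training range onto [0, 1].
lo = train.min(axis=0)
hi = train.max(axis=0)
train_mm = (train - lo) / (hi - lo)
test_mm = (test - lo) / (hi - lo)

print(train_std.ravel())
print(test_mm.ravel())  # 40.0 is above the training maximum, so it maps above 1
```

Both transforms are monotonic per feature, so the ordering of values within a column is unchanged.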

**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [17]:
# Initialize the bagging model(base)
bagging_reg = BaggingClassifier(DecisionTreeClassifier(max_depth=20),
                               n_estimators=100,
                               max_samples = 1000)

bagging_reg.fit(X_train, y_train)

pred = bagging_reg.predict(X_test)

print(classification_report(y_pred = pred, y_true = y_test))

              precision    recall  f1-score   support

       False       1.00      1.00      1.00       661
        True       1.00      1.00      1.00       661

    accuracy                           1.00      1322
   macro avg       1.00      1.00      1.00      1322
weighted avg       1.00      1.00      1.00      1322



In [28]:
# Initialize the bagging model(normalized)
bagging_reg = BaggingClassifier(DecisionTreeClassifier(max_depth=20),
                               n_estimators=100,
                               max_samples = 1000)

bagging_reg.fit(X_train_norm, y_train)

pred = bagging_reg.predict(X_test_norm)

print(classification_report(y_pred = pred, y_true = y_test))

              precision    recall  f1-score   support

       False       0.80      0.73      0.77       661
        True       0.75      0.82      0.79       661

    accuracy                           0.78      1322
   macro avg       0.78      0.78      0.78      1322
weighted avg       0.78      0.78      0.78      1322



In [29]:
# Initialize the bagging model(standardized)
bagging_reg = BaggingClassifier(DecisionTreeClassifier(max_depth=20),
                               n_estimators=100,
                               max_samples = 1000)

bagging_reg.fit(X_train_scaled, y_train)

pred = bagging_reg.predict(X_test_scaled)

print(classification_report(y_pred = pred, y_true = y_test))

              precision    recall  f1-score   support

       False       0.81      0.73      0.77       661
        True       0.76      0.83      0.79       661

    accuracy                           0.78      1322
   macro avg       0.78      0.78      0.78      1322
weighted avg       0.78      0.78      0.78      1322



In [30]:
# Initialize the Pasting model (base)
pasting_reg = BaggingClassifier(
  estimator=DecisionTreeClassifier(max_depth=20),
  n_estimators=100,
  max_samples=1000,
  bootstrap=False  # This ensures that no bootstrap is done, which is characteristic of Pasting
)

# Train the model on the unscaled data
pasting_reg.fit(X_train, y_train)

# Evaluate the model's performance
pred = pasting_reg.predict(X_test)

print(classification_report(y_pred = pred, y_true = y_test))

              precision    recall  f1-score   support

       False       0.82      0.72      0.77       661
        True       0.75      0.84      0.79       661

    accuracy                           0.78      1322
   macro avg       0.78      0.78      0.78      1322
weighted avg       0.78      0.78      0.78      1322



In [31]:
# Initialize the Pasting model (normalized)
pasting_reg = BaggingClassifier(
  estimator=DecisionTreeClassifier(max_depth=20),
  n_estimators=100,
  max_samples=1000,
  bootstrap=False  # This ensures that no bootstrap is done, which is characteristic of Pasting
)

# Train the model with normalized data
pasting_reg.fit(X_train_norm, y_train)

# Evaluate the model's performance
pred = pasting_reg.predict(X_test_norm)

print(classification_report(y_pred = pred, y_true = y_test))

              precision    recall  f1-score   support

       False       0.82      0.73      0.77       661
        True       0.76      0.84      0.79       661

    accuracy                           0.78      1322
   macro avg       0.79      0.78      0.78      1322
weighted avg       0.79      0.78      0.78      1322



In [32]:
# Initialize the Pasting model (standardized)
pasting_reg = BaggingClassifier(
  estimator=DecisionTreeClassifier(max_depth=20),
  n_estimators=100,
  max_samples=1000,
  bootstrap=False  # This ensures that no bootstrap is done, which is characteristic of Pasting
)

# Train the model on the standardized data
pasting_reg.fit(X_train_scaled, y_train)

# Evaluate the model's performance
pred = pasting_reg.predict(X_test_scaled)

print(classification_report(y_pred = pred, y_true = y_test))

              precision    recall  f1-score   support

       False       0.81      0.73      0.77       661
        True       0.75      0.84      0.79       661

    accuracy                           0.78      1322
   macro avg       0.78      0.78      0.78      1322
weighted avg       0.78      0.78      0.78      1322
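
The only difference between the bagging and pasting cells is how each tree's training subset is drawn (`bootstrap=True` versus `bootstrap=False`). The sketch below illustrates that difference with the standard library. The helpers `draw_indices` and `majority_vote` and the toy sizes are hypothetical, chosen for illustration. This is not scikit-learn's internal implementation, and `BaggingClassifier` averages predicted probabilities when the base estimator provides them, so the hard vote here is a simplification.

```python
import random
from collections import Counter


def draw_indices(n_rows, n_draws, bootstrap, rng):
    """Return the row indices for one base estimator's training subset."""
    if bootstrap:
        # Bagging: sample with replacement, so a row can appear several times.
        return rng.choices(range(n_rows), k=n_draws)
    # Pasting: sample without replacement, so every drawn row is distinct.
    return rng.sample(range(n_rows), n_draws)


def majority_vote(votes):
    """Combine the class predictions of several estimators for one sample."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(0)
bagging_idx = draw_indices(n_rows=10, n_draws=8, bootstrap=True, rng=rng)
pasting_idx = draw_indices(n_rows=10, n_draws=8, bootstrap=False, rng=rng)

print("bagging:", sorted(bagging_idx))
print("pasting:", sorted(pasting_idx))
print("vote:", majority_vote([True, False, True]))
```

With `max_samples=1000` and several thousand training rows, both schemes give each tree a small subset, which may explain why the bagging and pasting scores are so close.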



- Random Forests

In [33]:
#initialize a random forest (base)

forest = RandomForestClassifier(n_estimators=100,
                             max_depth=20)
# train the model
forest.fit(X_train, y_train)
#evaluate the model
pred = forest.predict(X_test)

print(classification_report(y_pred = pred, y_true = y_test))

              precision    recall  f1-score   support

       False       0.81      0.73      0.77       661
        True       0.75      0.83      0.79       661

    accuracy                           0.78      1322
   macro avg       0.78      0.78      0.78      1322
weighted avg       0.78      0.78      0.78      1322



In [34]:
#initialize a random forest(normalized)

forest = RandomForestClassifier(n_estimators=100,
                             max_depth=20)
# train the model
forest.fit(X_train_norm, y_train)
#evaluate the model
pred = forest.predict(X_test_norm)

print(classification_report(y_pred = pred, y_true = y_test))

              precision    recall  f1-score   support

       False       0.81      0.73      0.76       661
        True       0.75      0.82      0.79       661

    accuracy                           0.78      1322
   macro avg       0.78      0.78      0.77      1322
weighted avg       0.78      0.78      0.77      1322



In [35]:
#initialize a random forest(standardized)

forest = RandomForestClassifier(n_estimators=100,
                             max_depth=20)
# train the model
forest.fit(X_train_scaled, y_train)

#evaluate the model
pred = forest.predict(X_test_scaled)

print(classification_report(y_pred = pred, y_true = y_test))

              precision    recall  f1-score   support

       False       0.81      0.73      0.77       661
        True       0.75      0.82      0.79       661

    accuracy                           0.78      1322
   macro avg       0.78      0.78      0.78      1322
weighted avg       0.78      0.78      0.78      1322
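
A fitted random forest also exposes impurity-based feature importances, which can help judge which of the numeric columns drive the predictions. The sketch below uses synthetic data from `make_classification` as a stand-in for the six numeric columns. The data, the `feature_names`, and the hyperparameters are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in with six numeric features; not the lab's data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # hypothetical names

forest_demo = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)
forest_demo.fit(X_demo, y_demo)

# Impurity-based importances, normalized so that they sum to 1.
importances = forest_demo.feature_importances_
for idx in np.argsort(importances)[::-1]:
    print(f"{feature_names[idx]}: {importances[idx]:.3f}")
```

On the lab data, the same attribute could be read from `forest` after fitting, paired with `numeric_columns`.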



- Gradient Boosting

In [36]:
#initialize the Gradient Boosting model (base)
gb_reg = GradientBoostingClassifier(max_depth=20,
                                   n_estimators=100)
#train the model
gb_reg.fit(X_train, y_train)

#evaluate the model
pred = gb_reg.predict(X_test)

print(classification_report(y_pred = pred, y_true = y_test))

              precision    recall  f1-score   support

       False       0.78      0.66      0.71       661
        True       0.71      0.81      0.75       661

    accuracy                           0.74      1322
   macro avg       0.74      0.74      0.73      1322
weighted avg       0.74      0.74      0.73      1322



In [37]:
#initialize the Gradient Boosting model (normalized)
gb_reg = GradientBoostingClassifier(max_depth=20,
                                   n_estimators=100)
#train the model
gb_reg.fit(X_train_norm, y_train)

#evaluate the model
pred = gb_reg.predict(X_test_norm)


print(classification_report(y_pred = pred, y_true = y_test))

              precision    recall  f1-score   support

       False       0.78      0.66      0.71       661
        True       0.70      0.82      0.76       661

    accuracy                           0.74      1322
   macro avg       0.74      0.74      0.73      1322
weighted avg       0.74      0.74      0.73      1322



In [38]:
#initialize the Gradient Boosting model (standardized)
gb_reg = GradientBoostingClassifier(max_depth=20,
                                   n_estimators=100)
#train the model
gb_reg.fit(X_train_scaled, y_train)

#evaluate the model
pred = gb_reg.predict(X_test_scaled)


print(classification_report(y_pred = pred, y_true = y_test))

              precision    recall  f1-score   support

       False       0.78      0.66      0.71       661
        True       0.71      0.81      0.75       661

    accuracy                           0.74      1322
   macro avg       0.74      0.74      0.73      1322
weighted avg       0.74      0.74      0.73      1322
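
Gradient boosting is commonly run with shallow trees (scikit-learn's default is `max_depth=3`), so `max_depth=20` may be one reason it scores lowest here. One way to investigate is `staged_predict`, which returns the predictions after each boosting stage. The sketch below runs on synthetic data, and the dataset and hyperparameters are assumptions for illustration, not a tuned configuration for this lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; not the lab's data.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

gb_demo = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)
gb_demo.fit(X_tr, y_tr)

# staged_predict yields the predictions after 1, 2, ..., n_estimators trees.
staged_acc = [accuracy_score(y_te, pred) for pred in gb_demo.staged_predict(X_te)]

# Picking the stage on the held-out set is optimistic; a separate
# validation split would be the safer choice in practice.
best_n = staged_acc.index(max(staged_acc)) + 1
print(f"best held-out accuracy {max(staged_acc):.3f} with {best_n} trees")
```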



- Adaptive Boosting

In [39]:
# Initialize the AdaBoost model (base)
ada_reg = AdaBoostClassifier(DecisionTreeClassifier(max_depth=20),
                            n_estimators=200)
# Training the model
ada_reg.fit(X_train, y_train)
# Evaluate the model
pred = ada_reg.predict(X_test) 

print(classification_report(y_pred = pred, y_true = y_test))



              precision    recall  f1-score   support

       False       0.81      0.71      0.75       661
        True       0.74      0.83      0.78       661

    accuracy                           0.77      1322
   macro avg       0.77      0.77      0.77      1322
weighted avg       0.77      0.77      0.77      1322



In [40]:
# Initialize the AdaBoost model (normalized)
ada_reg = AdaBoostClassifier(DecisionTreeClassifier(max_depth=20),
                            n_estimators=200)
# Training the model
ada_reg.fit(X_train_norm, y_train)
# Evaluate the model
pred = ada_reg.predict(X_test_norm) 


print(classification_report(y_pred = pred, y_true = y_test))



              precision    recall  f1-score   support

       False       0.81      0.70      0.75       661
        True       0.74      0.84      0.78       661

    accuracy                           0.77      1322
   macro avg       0.77      0.77      0.77      1322
weighted avg       0.77      0.77      0.77      1322



In [41]:
# Initialize the AdaBoost model (standardized)
ada_reg = AdaBoostClassifier(DecisionTreeClassifier(max_depth=20),
                            n_estimators=200)
# Training the model
ada_reg.fit(X_train_scaled, y_train)
# Evaluate the model
pred = ada_reg.predict(X_test_scaled) 


print(classification_report(y_pred = pred, y_true = y_test))



              precision    recall  f1-score   support

       False       0.82      0.71      0.76       661
        True       0.74      0.84      0.79       661

    accuracy                           0.78      1322
   macro avg       0.78      0.78      0.78      1322
weighted avg       0.78      0.78      0.78      1322
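
AdaBoost is usually paired with weak learners. scikit-learn's default base estimator is a depth-1 tree, and as far as I understand its behavior, boosting can stop early when a base estimator fits the weighted training set perfectly, which a depth-20 tree may do. The sketch below checks how many trees are actually fitted in each case. It uses synthetic data, and the settings are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the lab's data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0, random_state=0
)

stumps = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0
)
deep = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=20), n_estimators=50, random_state=0
)
stumps.fit(X_demo, y_demo)
deep.fit(X_demo, y_demo)

# estimators_ holds only the trees that were actually fitted.
n_stumps = len(stumps.estimators_)
n_deep = len(deep.estimators_)
print("trees fitted with depth-1 stumps:", n_stumps)
print("trees fitted with depth-20 trees:", n_deep)
```

Running the same check on `ada_reg` would show whether all 200 requested estimators are in use on the lab data.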



Which model is the best and why?

In [27]:
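# Random forest is a reasonable choice, but not because of a clear accuracy lead:
# it reaches 0.78 test accuracy, the same as the scaled bagging runs and all pasting runs.
# AdaBoost follows at 0.77 to 0.78, and gradient boosting is lowest at 0.74.
# Scaling (normalized or standardized) barely changes any result, as expected for tree-based models.
# The 1.00 scores of the unscaled bagging run are out of line with every other run and appear
# to come from an earlier execution (In [17] precedes the data loading in In [21]),
# so they are left out of this comparison.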
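
Since the test accuracies are within a point or two of each other and the models are not seeded, a single train/test split may not be enough to rank them. A hedged sketch of a more stable comparison uses 5-fold cross-validation with fixed `random_state` values. It runs on synthetic data, and the model settings are assumptions for illustration rather than the lab's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the lab's data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0, random_state=0
)

models = {
    "bagging": BaggingClassifier(
        DecisionTreeClassifier(max_depth=5), n_estimators=30, random_state=0
    ),
    "random_forest": RandomForestClassifier(
        n_estimators=30, max_depth=5, random_state=0
    ),
    "gradient_boosting": GradientBoostingClassifier(
        n_estimators=30, max_depth=3, random_state=0
    ),
    "ada_boost": AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=1), n_estimators=30, random_state=0
    ),
}

# Mean and spread of accuracy across the five folds for each model.
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the fold-to-fold spread is larger than the gap between two models, the data does not support calling one of them better.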