# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [3]:
spaceship.shape


(8693, 14)

In [4]:
spaceship.dtypes


PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object

In [5]:
spaceship.isnull().sum()


PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

In [6]:
spaceship = spaceship.dropna()
spaceship.isnull().sum()  # Check again to ensure there are no missing values


PassengerId     0
HomePlanet      0
CryoSleep       0
Cabin           0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
Name            0
Transported     0
dtype: int64

Reduce Cabin to its deck letter (the first character)

In [7]:
spaceship['Cabin'] = spaceship['Cabin'].str[0]  # keep only the deck letter, e.g. 'B/0/P' -> 'B'
spaceship['Cabin'].unique()  # Check the unique values


array(['B', 'F', 'A', 'G', 'E', 'C', 'D', 'T'], dtype=object)

Drop PassengerId and Name

In [8]:
spaceship = spaceship.drop(['PassengerId', 'Name'], axis=1)


Convert Non-Numerical Columns to Dummies


In [9]:
spaceship = pd.get_dummies(spaceship)
spaceship.head()


Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,...,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_False,VIP_True
0,39.0,0.0,0.0,0.0,0.0,0.0,False,0,1,0,...,0,0,0,0,0,0,0,1,1,0
1,24.0,109.0,9.0,25.0,549.0,44.0,True,1,0,0,...,0,0,1,0,0,0,0,1,1,0
2,58.0,43.0,3576.0,0.0,6715.0,49.0,False,0,1,0,...,0,0,0,0,0,0,0,1,0,1
3,33.0,0.0,1283.0,371.0,3329.0,193.0,False,0,1,0,...,0,0,0,0,0,0,0,1,1,0
4,16.0,303.0,70.0,151.0,565.0,2.0,True,1,0,0,...,0,0,1,0,0,0,0,1,1,0
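Because `CryoSleep` and `VIP` contain only True/False once the missing rows are dropped, `get_dummies` turns each of them into two mirrored columns (for example `VIP_False` and `VIP_True`). One possible alternative, sketched here on a tiny hypothetical frame (`toy` and its values are invented for illustration), is to cast such columns to 0/1 first and one-hot encode only the truly categorical columns.

```python
import pandas as pd

# Hypothetical toy frame that mimics the column types of the lab data
toy = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", "Mars"],
    "VIP": pd.Series([False, True, False], dtype=object),
    "Age": [39.0, 24.0, 58.0],
})

# Cast the True/False object column to 0/1 so it stays a single feature
toy["VIP"] = toy["VIP"].astype(bool).astype(int)

# One-hot encode only the remaining categorical column
encoded = pd.get_dummies(toy, columns=["HomePlanet"], dtype=int)
print(encoded.columns.tolist())
```

Whether this changes the scores of the tree-based models below is not something the lab measured; it mainly keeps the feature matrix smaller and easier to read.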


Perform Train Test Split

In [15]:
# pandas, numpy and train_test_split were already imported in the first cell
from sklearn.preprocessing import StandardScaler




In [17]:
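# Define features and target variable
X = spaceship.drop('Transported', axis=1)
y = spaceship['Transported']

# Split the data first so the test set stays unseen during preprocessing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling: fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)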


**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [18]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Bagging
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, random_state=42
)
bag_clf.fit(X_train, y_train)
y_pred_bag = bag_clf.predict(X_test)

# Evaluate Bagging
accuracy_bag = accuracy_score(y_test, y_pred_bag)
report_bag = classification_report(y_test, y_pred_bag)
conf_matrix_bag = confusion_matrix(y_test, y_pred_bag)

print(f"Bagging Accuracy: {accuracy_bag}")
print(f"Bagging Classification Report:\n{report_bag}")
print(f"Bagging Confusion Matrix:\n{conf_matrix_bag}")


Bagging Accuracy: 0.802571860816944
Bagging Classification Report:
              precision    recall  f1-score   support

       False       0.81      0.79      0.80       653
        True       0.80      0.82      0.81       669

    accuracy                           0.80      1322
   macro avg       0.80      0.80      0.80      1322
weighted avg       0.80      0.80      0.80      1322

Bagging Confusion Matrix:
[[514 139]
 [122 547]]
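The cell above covers bagging only (`bootstrap=True`). A minimal sketch of the pasting variant, where each tree samples rows without replacement, could look like the following. It also shows out-of-bag scoring for the bagging case. The data here is a synthetic stand-in from `make_classification` so the snippet runs on its own; the names `X_demo`, `paste_clf` and `oob_clf` are assumptions and not part of the lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: swap in the lab's X_train, X_test, y_train, y_test)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Pasting: each tree sees 100 training rows drawn WITHOUT replacement
paste_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    max_samples=100, bootstrap=False, random_state=42
)
paste_clf.fit(Xd_train, yd_train)
paste_acc = paste_clf.score(Xd_test, yd_test)

# Bagging with out-of-bag scoring: rows a tree never sampled act as validation data
oob_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    max_samples=100, bootstrap=True, oob_score=True, random_state=42
)
oob_clf.fit(Xd_train, yd_train)

print(f"Pasting test accuracy: {paste_acc:.3f}")
print(f"Bagging OOB accuracy:  {oob_clf.oob_score_:.3f}")
```

With only 100 rows per tree, bagging and pasting draw almost the same kind of subsets, so large differences between the two should not be expected.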


- Random Forests

In [19]:
from sklearn.ensemble import RandomForestClassifier

# Random Forest
rf_clf = RandomForestClassifier(n_estimators=500, random_state=42)
rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_test)

# Evaluate Random Forest
accuracy_rf = accuracy_score(y_test, y_pred_rf)
report_rf = classification_report(y_test, y_pred_rf)
conf_matrix_rf = confusion_matrix(y_test, y_pred_rf)

print(f"Random Forest Accuracy: {accuracy_rf}")
print(f"Random Forest Classification Report:\n{report_rf}")
print(f"Random Forest Confusion Matrix:\n{conf_matrix_rf}")


Random Forest Accuracy: 0.8055975794251135
Random Forest Classification Report:
              precision    recall  f1-score   support

       False       0.80      0.81      0.80       653
        True       0.81      0.80      0.81       669

    accuracy                           0.81      1322
   macro avg       0.81      0.81      0.81      1322
weighted avg       0.81      0.81      0.81      1322

Random Forest Confusion Matrix:
[[530 123]
 [134 535]]
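A fitted random forest also exposes `feature_importances_`, which can help explain what drives the predictions. The sketch below uses synthetic data and made-up feature names so that it is self-contained. In the lab, the names would presumably come from `X.columns`, since the scaled training data is a plain NumPy array.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

rf_demo = RandomForestClassifier(n_estimators=100, random_state=42)
rf_demo.fit(X_demo, y_demo)

# Impurity-based importances sum to 1; sort them to see the strongest features first
importances = pd.Series(
    rf_demo.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

These impurity-based values can overstate features with many distinct values, so they are best read as a rough ranking rather than an exact measure.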


- Gradient Boosting

In [20]:
from sklearn.ensemble import GradientBoostingClassifier

# Gradient Boosting
gb_clf = GradientBoostingClassifier(n_estimators=500, random_state=42)
gb_clf.fit(X_train, y_train)
y_pred_gb = gb_clf.predict(X_test)

# Evaluate Gradient Boosting
accuracy_gb = accuracy_score(y_test, y_pred_gb)
report_gb = classification_report(y_test, y_pred_gb)
conf_matrix_gb = confusion_matrix(y_test, y_pred_gb)

print(f"Gradient Boosting Accuracy: {accuracy_gb}")
print(f"Gradient Boosting Classification Report:\n{report_gb}")
print(f"Gradient Boosting Confusion Matrix:\n{conf_matrix_gb}")


Gradient Boosting Accuracy: 0.8093797276853253
Gradient Boosting Classification Report:
              precision    recall  f1-score   support

       False       0.81      0.80      0.80       653
        True       0.81      0.82      0.81       669

    accuracy                           0.81      1322
   macro avg       0.81      0.81      0.81      1322
weighted avg       0.81      0.81      0.81      1322

Gradient Boosting Confusion Matrix:
[[520 133]
 [119 550]]
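The value `n_estimators=500` is a fixed choice in the cell above. One way to check whether fewer trees would do is `staged_predict`, which yields the ensemble's predictions after each added tree. The sketch below is an illustration on synthetic data with a separate validation split. The variable names are assumptions, and in practice the number of trees should be chosen on validation data rather than on the test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: replace with a train/validation split of the lab data)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=42)
Xd_train, Xd_val, yd_train, yd_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

gb_demo = GradientBoostingClassifier(n_estimators=200, random_state=42)
gb_demo.fit(Xd_train, yd_train)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees
val_acc = [accuracy_score(yd_val, y_stage) for y_stage in gb_demo.staged_predict(Xd_val)]
best_n = int(np.argmax(val_acc)) + 1

print(f"Best number of trees: {best_n}")
print(f"Validation accuracy at best_n: {val_acc[best_n - 1]:.3f}")
print(f"Validation accuracy with all trees: {val_acc[-1]:.3f}")
```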


- Adaptive Boosting

In [21]:
from sklearn.ensemble import AdaBoostClassifier

# AdaBoost
ada_clf = AdaBoostClassifier(n_estimators=500, random_state=42)
ada_clf.fit(X_train, y_train)
y_pred_ada = ada_clf.predict(X_test)

# Evaluate AdaBoost
accuracy_ada = accuracy_score(y_test, y_pred_ada)
report_ada = classification_report(y_test, y_pred_ada)
conf_matrix_ada = confusion_matrix(y_test, y_pred_ada)

print(f"AdaBoost Accuracy: {accuracy_ada}")
print(f"AdaBoost Classification Report:\n{report_ada}")
print(f"AdaBoost Confusion Matrix:\n{conf_matrix_ada}")


AdaBoost Accuracy: 0.7904689863842662
AdaBoost Classification Report:
              precision    recall  f1-score   support

       False       0.81      0.75      0.78       653
        True       0.77      0.83      0.80       669

    accuracy                           0.79      1322
   macro avg       0.79      0.79      0.79      1322
weighted avg       0.79      0.79      0.79      1322

AdaBoost Confusion Matrix:
[[488 165]
 [112 557]]
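AdaBoost is run with its default `learning_rate` above. Since the learning rate controls how strongly each weak learner contributes, a small sweep is a natural next experiment. The following is a sketch under stated assumptions: synthetic data, an arbitrary grid of three rates, and illustrative variable names. It does not reproduce the lab's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: replace with a train/validation split of the lab data)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=42)
Xd_train, Xd_val, yd_train, yd_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Train one AdaBoost model per learning rate and record its validation accuracy
lr_scores = {}
for lr in (0.1, 0.5, 1.0):
    ada_demo = AdaBoostClassifier(n_estimators=100, learning_rate=lr, random_state=42)
    ada_demo.fit(Xd_train, yd_train)
    lr_scores[lr] = ada_demo.score(Xd_val, yd_val)

best_lr = max(lr_scores, key=lr_scores.get)
for lr, score in lr_scores.items():
    print(f"learning_rate={lr}: validation accuracy {score:.3f}")
print(f"Best learning rate in this sweep: {best_lr}")
```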


Which model is the best and why?

Compare the accuracy, classification report, and confusion matrix for each model to determine which performs the best. Comment on which model is the best and why.

Model Evaluation
- Bagging: trains many decision trees on random subsets of the training data and combines their votes, which reduces variance and overfitting. Test accuracy: 0.803.
- Random Forests: bagged decision trees that also pick a random subset of features at each split, which usually beats a single decision tree. Test accuracy: 0.806.
- Gradient Boosting: adds shallow trees sequentially, each one fitted to the errors of the current ensemble. Test accuracy: 0.809.
- AdaBoost: reweights the training samples so that each new weak learner concentrates on the examples misclassified so far. Test accuracy: 0.790.

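Conclusion

Gradient Boosting achieves the highest test accuracy (0.809), followed closely by Random Forest (0.806) and Bagging (0.803). AdaBoost is last (0.790), mainly because it misclassifies more passengers who were not transported (recall of 0.75 for the False class). The top three models are separated by fewer than ten test passengers out of 1322 (1070, 1065, and 1061 correct predictions), so Gradient Boosting is the best model on this split, but the margin is too small to call it clearly superior without cross-validation or hyperparameter tuning.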
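To make the comparison concrete, the accuracies can be recomputed from the confusion matrices printed above: the diagonal holds the correct predictions, so accuracy is the trace divided by the total. The dictionary below simply copies those four matrices. The variable names are illustrative.

```python
import numpy as np

# Confusion matrices copied from the outputs above (rows: actual, columns: predicted)
conf_matrices = {
    "Bagging": np.array([[514, 139], [122, 547]]),
    "Random Forest": np.array([[530, 123], [134, 535]]),
    "Gradient Boosting": np.array([[520, 133], [119, 550]]),
    "AdaBoost": np.array([[488, 165], [112, 557]]),
}

# Accuracy = correct predictions (diagonal) / all predictions
accuracies = {name: np.trace(cm) / cm.sum() for name, cm in conf_matrices.items()}

# Print the models from best to worst
for name, acc in sorted(accuracies.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<18} {acc:.4f}")

best_model = max(accuracies, key=accuracies.get)
print(f"Best on this split: {best_model}")
```

The gap between the first and third model is 9 of 1322 test rows, which is why a single split is weak evidence for ranking them.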