# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [67]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, chi2

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report


In [68]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [69]:
#Check the shape of the data
spaceship.shape

#Check for data types
spaceship.dtypes

#Check for missing values
spaceship.isnull().sum()

# Drop rows with missing data.
# For this exercise, because there are relatively few null values, we drop every row
# that has a missing value in any of the columns listed below.
# dropna() returns a new dataframe, so the result has to be assigned back.
spaceship = spaceship.dropna(subset=['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Age', 'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck'])

#**Cabin** is too granular - transform it in order to obtain {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}
# Transform the Cabin column to extract the first letter
spaceship['Cabin'] = spaceship['Cabin'].str[0]

# Optionally, ensure that only the expected categories are present.
# .copy() makes the filtered result an independent dataframe, which avoids
# SettingWithCopyWarning when it is modified later.
valid_categories = {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}
spaceship = spaceship[spaceship['Cabin'].isin(valid_categories)].copy()

# Drop PassengerId and Name
spaceship = spaceship.drop(columns=['PassengerId', 'Name'])
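
The Cabin transformation above keeps only the first character of each value. As a minimal sketch on a toy dataframe (the `toy` frame and the `Deck` column name are illustrative assumptions, not part of the lab data), the same deck letter can also be obtained by splitting on `/`, which makes the `deck/num/side` layout seen in the `head()` output explicit and leaves missing cabins missing:

```python
import pandas as pd

# Toy frame mimicking the "deck/num/side" cabin strings shown in the head() output.
toy = pd.DataFrame({"Cabin": ["B/0/P", "F/0/S", None, "A/0/S"]})

# Approach used in the lab: first character of the string.
deck_by_index = toy["Cabin"].str[0]

# Alternative: split on "/" and keep the first part.
deck_by_split = toy["Cabin"].str.split("/").str[0]

# Both approaches agree; the missing cabin stays missing in each.
toy["Deck"] = deck_by_split
print(toy)
```

Either form works for this column; the split version is mostly useful if the other two parts of the cabin string are needed later.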

Now perform the same as before:
- Feature Scaling
- Feature Selection


In [70]:
# Cabin was already reduced to its deck letter in the previous cell,
# so only the remaining missing values need to be dropped here.
spaceship = spaceship.dropna()

In [71]:
#one-hot encoding
print("Original dataframe")
df_categorical = spaceship.select_dtypes(include=['object'])
display(df_categorical)

#creating dummy variables > pd.get_dummies()
features = pd.get_dummies(spaceship, columns=['HomePlanet', 'CryoSleep','Cabin','Destination','VIP'])
features = features.drop(columns=['Transported'])

print('Dataframe with Dummy variables')
features

Original dataframe


Unnamed: 0,HomePlanet,CryoSleep,Cabin,Destination,VIP
0,Europa,False,B,TRAPPIST-1e,False
1,Earth,False,F,TRAPPIST-1e,False
2,Europa,False,A,TRAPPIST-1e,True
3,Europa,False,A,TRAPPIST-1e,False
4,Earth,False,F,TRAPPIST-1e,False
...,...,...,...,...,...
8688,Europa,False,A,55 Cancri e,True
8689,Earth,True,G,PSO J318.5-22,False
8690,Earth,False,G,TRAPPIST-1e,False
8691,Europa,False,E,55 Cancri e,False


Dataframe with Dummy variables


Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_False,...,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_False,VIP_True
0,39.0,0.0,0.0,0.0,0.0,0.0,False,True,False,True,...,False,False,False,False,False,False,False,True,True,False
1,24.0,109.0,9.0,25.0,549.0,44.0,True,False,False,True,...,False,False,True,False,False,False,False,True,True,False
2,58.0,43.0,3576.0,0.0,6715.0,49.0,False,True,False,True,...,False,False,False,False,False,False,False,True,False,True
3,33.0,0.0,1283.0,371.0,3329.0,193.0,False,True,False,True,...,False,False,False,False,False,False,False,True,True,False
4,16.0,303.0,70.0,151.0,565.0,2.0,True,False,False,True,...,False,False,True,False,False,False,False,True,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,41.0,0.0,6819.0,0.0,1643.0,74.0,False,True,False,True,...,False,False,False,False,False,True,False,False,False,True
8689,18.0,0.0,0.0,0.0,0.0,0.0,True,False,False,False,...,False,False,False,True,False,False,True,False,True,False
8690,26.0,0.0,0.0,1872.0,1.0,0.0,True,False,False,True,...,False,False,False,True,False,False,False,True,True,False
8691,32.0,0.0,1049.0,0.0,353.0,3235.0,False,True,False,True,...,False,True,False,False,False,True,False,False,True,False


**Perform Train Test Split**

In [72]:
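# Perform train-test split
target = spaceship['Transported']
# test_size=None falls back to scikit-learn's default of 0.25, so it is written out explicitly here
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.25, random_state=0)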
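
The split above is purely random. A minimal sketch of a stratified variant, on synthetic data rather than the Spaceship Titanic frame (`X_toy` and `y_toy` are made-up placeholders), shows how the `stratify` argument of `train_test_split` keeps the class proportions the same in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, perfectly balanced toy data (an assumption for illustration only).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = np.array([True] * 100 + [False] * 100)

# stratify=y_toy asks for the same True/False ratio in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

print("train share of True:", y_tr.mean())
print("test share of True:", y_te.mean())
```

With classes as balanced as in this lab the effect is small, so this is optional rather than a required change.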

**Normalization**

In [75]:
normalizer = MinMaxScaler()
normalizer.fit(X_train)

In [76]:
X_train_norm = normalizer.transform(X_train)
X_test_norm = normalizer.transform(X_test)

In [77]:
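# Keep the original row index so the normalized frames still line up with y_train / y_test
X_train_norm = pd.DataFrame(X_train_norm, columns=X_train.columns, index=X_train.index)
X_test_norm = pd.DataFrame(X_test_norm, columns=X_test.columns, index=X_test.index)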
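
The normalization cells above fit the scaler on the training split only and then reuse it for the test split. A minimal sketch of why that order matters, using a tiny made-up `Age` column (the values and index labels are assumptions for illustration):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Tiny stand-ins for X_train / X_test with non-contiguous row labels.
train = pd.DataFrame({"Age": [10.0, 20.0, 30.0]}, index=[5, 7, 9])
test = pd.DataFrame({"Age": [40.0]}, index=[11])

# Learn min and max from the training rows only.
scaler = MinMaxScaler().fit(train)

# Apply the same transformation to both splits, keeping column names and index.
train_scaled = pd.DataFrame(scaler.transform(train), columns=train.columns, index=train.index)
test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns, index=test.index)

print(train_scaled)
print(test_scaled)
```

Because the test value lies above the training maximum, it is mapped outside the [0, 1] range. That is expected: the test data must not influence the scaling parameters.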

In [82]:
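# Standardize the numerical columns of the original dataframe.
# Note: `features`, `X_train` and `X_test` were created before this cell, so the models
# below are still trained on the unscaled split. Tree-based ensembles are largely
# insensitive to monotonic feature scaling, so this does not affect the comparison.
# Note also that the scaler is fit on all rows here, not only on the training rows.
numerical_features = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

# Initializing the StandardScaler
scaler = StandardScaler()

# Applying scaling to numerical features
spaceship[numerical_features] = scaler.fit_transform(spaceship[numerical_features])

# Display the first few rows to check the scaling
spaceship.head()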

Unnamed: 0,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported
0,Europa,False,B,TRAPPIST-1e,0.695365,False,-0.346316,-0.286103,-0.282915,-0.275577,-0.27029,False
1,Earth,False,F,TRAPPIST-1e,-0.337089,False,-0.178108,-0.280735,-0.243729,0.206465,-0.231242,True
2,Europa,False,A,TRAPPIST-1e,2.00314,True,-0.279959,1.846533,-0.282915,5.620436,-0.226804,False
3,Europa,False,A,TRAPPIST-1e,0.282383,False,-0.346316,0.479046,0.298603,2.647405,-0.09901,False
4,Earth,False,F,TRAPPIST-1e,-0.887732,False,0.121271,-0.244356,-0.046233,0.220513,-0.268515,True


**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [83]:
#your code here
# Bagging
bagging_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100,
    max_samples=0.8, bootstrap=True, random_state=42)

bagging_clf.fit(X_train, y_train)
y_pred_bagging = bagging_clf.predict(X_test)

print("Bagging Classifier Accuracy:", accuracy_score(y_test, y_pred_bagging))
print("Bagging Classifier Report:")
print(classification_report(y_test, y_pred_bagging))

# Pasting
pasting_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100,
    max_samples=0.8, bootstrap=False, random_state=42)

pasting_clf.fit(X_train, y_train)
y_pred_pasting = pasting_clf.predict(X_test)

print("Pasting Classifier Accuracy:", accuracy_score(y_test, y_pred_pasting))
print("Pasting Classifier Report:")
print(classification_report(y_test, y_pred_pasting))

Bagging Classifier Accuracy: 0.8007096392667061
Bagging Classifier Report:
              precision    recall  f1-score   support

       False       0.79      0.82      0.80       837
        True       0.81      0.79      0.80       854

    accuracy                           0.80      1691
   macro avg       0.80      0.80      0.80      1691
weighted avg       0.80      0.80      0.80      1691

Pasting Classifier Accuracy: 0.7900650502661147
Pasting Classifier Report:
              precision    recall  f1-score   support

       False       0.78      0.80      0.79       837
        True       0.80      0.78      0.79       854

    accuracy                           0.79      1691
   macro avg       0.79      0.79      0.79      1691
weighted avg       0.79      0.79      0.79      1691
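
Bagging samples the training rows with replacement, so each tree leaves some rows out, and those out-of-bag rows can serve as a built-in validation set. The sketch below uses synthetic data from `make_classification` (an assumption, not the lab data) and the `oob_score` option of `BaggingClassifier`. It also checks that the option is rejected for pasting, where rows are drawn without replacement:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem, for illustration only.
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)

# Bagging: bootstrap=True, so an out-of-bag accuracy estimate is available.
bag = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    max_samples=0.8, bootstrap=True, oob_score=True, random_state=42)
bag.fit(X_toy, y_toy)
oob = bag.oob_score_
print("Out-of-bag accuracy:", oob)

# Pasting: bootstrap=False, so scikit-learn refuses to compute an OOB score.
try:
    BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=50,
        max_samples=0.8, bootstrap=False, oob_score=True, random_state=42
    ).fit(X_toy, y_toy)
    pasting_supports_oob = True
except ValueError:
    pasting_supports_oob = False
print("Pasting supports OOB scoring:", pasting_supports_oob)
```

The out-of-bag score is an estimate obtained without touching the test split, which can be handy when comparing settings such as `n_estimators` or `max_samples`.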



- Random Forests

In [None]:
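#your code here
# Random Forest (RandomForestClassifier is already imported at the top of the notebook)
# random_state is fixed so the result is reproducible, as in the other cells
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_test)

print("Random Forest Classifier Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Random Forest Classifier Report:")
print(classification_report(y_test, y_pred_rf))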
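
A fitted random forest also exposes impurity-based feature importances, which can help decide which columns matter. The following is a minimal sketch on synthetic data (the feature names `f0` to `f5` are placeholders, not columns of the lab dataset):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 3 informative features out of 6, for illustration only.
X_arr, y_toy = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_toy, y_toy)

# feature_importances_ sums to 1; larger values mean the feature was more useful for splits.
importances = pd.Series(rf.feature_importances_, index=X_toy.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

On the lab data the same two lines after fitting would use `X_train.columns` as the index. Impurity-based importances tend to favor high-cardinality numeric features, so they are best read as a rough guide.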

- Gradient Boosting


In [85]:
#pip install xgboost

In [86]:
#your code here
# Gradient boosting via XGBoost's scikit-learn compatible wrapper, with default settings
from xgboost import XGBClassifier

model = XGBClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.7989355410999409
              precision    recall  f1-score   support

       False       0.81      0.78      0.79       837
        True       0.79      0.82      0.80       854

    accuracy                           0.80      1691
   macro avg       0.80      0.80      0.80      1691
weighted avg       0.80      0.80      0.80      1691
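
The cell above uses XGBoost. scikit-learn ships its own `GradientBoostingClassifier`, which is imported at the top of this notebook but not used. As a rough sketch of how it could be tried without an extra dependency, here on synthetic data and with hyperparameter values chosen only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem, not the lab data.
X_toy, y_toy = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

# Each new shallow tree is fit to the errors of the ensemble built so far;
# learning_rate shrinks the contribution of each tree.
gb = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gb.fit(X_tr, y_tr)

gb_acc = accuracy_score(y_te, gb.predict(X_te))
print("Gradient Boosting accuracy on toy data:", gb_acc)
```

On the lab data, the same estimator would be fit with `X_train`, `y_train` and scored on `X_test`, `y_test`; whether it beats XGBoost there is not something this sketch can tell.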



- Adaptive Boosting

In [87]:
#your code here
# AdaBoost with decision stumps (max_depth=1) as weak learners;
# AdaBoostClassifier is already imported at the top of the notebook

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=100,
    learning_rate=0.5, random_state=42)

ada_clf.fit(X_train, y_train)
y_pred_ada = ada_clf.predict(X_test)

print("AdaBoost Classifier Accuracy:", accuracy_score(y_test, y_pred_ada))
print("AdaBoost Classifier Report:")
print(classification_report(y_test, y_pred_ada))

AdaBoost Classifier Accuracy: 0.8083973979893554
AdaBoost Classifier Report:
              precision    recall  f1-score   support

       False       0.83      0.77      0.80       837
        True       0.79      0.85      0.82       854

    accuracy                           0.81      1691
   macro avg       0.81      0.81      0.81      1691
weighted avg       0.81      0.81      0.81      1691
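
AdaBoost adds one weak learner per round, so it can be useful to see how accuracy evolves with the number of rounds. A minimal sketch using `staged_score` on synthetic data (the dataset and the name `best_n` are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem, not the lab data.
X_toy, y_toy = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=100,
    learning_rate=0.5, random_state=42)
ada.fit(X_tr, y_tr)

# staged_score yields the accuracy after 1, 2, ..., n boosting rounds.
staged = list(ada.staged_score(X_te, y_te))
best_n = int(np.argmax(staged)) + 1
print("Rounds with the highest held-out accuracy:", best_n)
print("Accuracy at that point:", staged[best_n - 1])
```

In practice the number of rounds should be chosen on a validation split or by cross-validation rather than on the test set; the test split is used here only to keep the sketch short.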



Which model is the best and why?

In [89]:
# Comment on model performance
"""
Adaptive Boosting is the best model in this comparison: it has the highest test accuracy (0.808), ahead of Bagging (0.801), Gradient Boosting with XGBoost (0.799) and Pasting (0.790). Each boosting round concentrates on the examples its predecessors misclassified, which fits the result seen here. Its errors are less balanced than those of the other models, though: recall is 0.85 for the True class but 0.77 for the False class, while Bagging reaches 0.79 and 0.82.

The margins are small. On the 1,691 test rows, AdaBoost's lead over Bagging is 13 rows and its lead over Pasting is 31 rows, so a different split or random seed could change the order.

Gradient Boosting is effectively tied with Bagging. It was run with default settings and may improve with tuning.

Bagging and Pasting offer a good trade-off between simplicity and performance, especially when reducing the variance of fully grown decision trees is the main concern.

The Random Forest cell has no recorded output, and KNN was only evaluated in the previous Lab, so neither is ranked here.

The choice of the best model ultimately depends on the specific requirements of the task, including the nature of the data, computational resources, and the importance of different performance metrics.
"""
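
Because the accuracies above come from a single train/test split and differ by only a few rows, cross-validation gives a steadier basis for choosing a model. The sketch below compares three ensembles with `cross_val_score` on synthetic data; the dataset, the `candidates` dictionary and the hyperparameters are assumptions for illustration, not the lab's results:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem, not the lab data.
X_toy, y_toy = make_classification(n_samples=400, n_features=10, random_state=0)

candidates = {
    "Bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "AdaBoost": AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=42),
}

# 5-fold cross-validated accuracy: mean and spread across folds for each model.
cv_results = {}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X_toy, y_toy, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

If the fold-to-fold standard deviation is larger than the gap between two models, the ranking between them should be treated as inconclusive.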