# LAB | Ensemble Methods

**Load the data**

In this challenge, we will be working with the same Spaceship Titanic data, like the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods in order to see if can obtain a better model than before. In order to do a fair comparison, you should perform the same feature scaling, engineering applied in previous Lab.

In [78]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

In [80]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [83]:
#Data cleaning
# Check types and missing values
print(spaceship.dtypes)
print(spaceship.isnull().sum())

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object
PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64


In [85]:
#Data cleaning
spaceship_cleaned = spaceship.drop(columns=['PassengerId', 'Name']) 
spaceship_cleaned = spaceship.dropna()
print(spaceship_cleaned.isnull().sum())

PassengerId     0
HomePlanet      0
CryoSleep       0
Cabin           0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
Name            0
Transported     0
dtype: int64


In [87]:
#Data cleaning
spaceship_cleaned['Cabin'] = spaceship_cleaned['Cabin'].apply(lambda x: x.split('/')[0] if isinstance(x, str) else x)
print(spaceship_cleaned['Cabin'].unique())

['B' 'F' 'A' 'G' 'E' 'C' 'D' 'T']


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  spaceship_cleaned['Cabin'] = spaceship_cleaned['Cabin'].apply(lambda x: x.split('/')[0] if isinstance(x, str) else x)


**Perform Train Test Split**

In [90]:
# Encodage des variables catégorielles
spaceship_cleaned = pd.get_dummies(spaceship_cleaned, drop_first=True)

In [91]:
# Séparation des données en ensembles d'entraînement et de test
X = spaceship_cleaned.drop(columns=['Transported'])
y = spaceship_cleaned['Transported']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [94]:
# Mise à l'échelle des caractéristiques
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [96]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1
)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

- Random Forests

In [98]:
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)

- Gradient Boosting

In [100]:
gbrt = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=42)
gbrt.fit(X_train, y_train)
y_pred_gbrt = gbrt.predict(X_test)

- Adaptive Boosting

In [102]:
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    algorithm="SAMME.R", learning_rate=0.5
)
ada_clf.fit(X_train, y_train)
y_pred_ada = ada_clf.predict(X_test)

Which model is the best and why?

In [104]:
def evaluate_model(model, X_test, y_test, y_pred):
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    print(f"Model: {model.__class__.__name__}")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}")
    print(classification_report(y_test, y_pred))
    print("\n")

# Évaluer les modèles
evaluate_model(bag_clf, X_test, y_test, y_pred)
evaluate_model(rnd_clf, X_test, y_test, y_pred_rf)
evaluate_model(gbrt, X_test, y_test, y_pred_gbrt)
evaluate_model(ada_clf, X_test, y_test, y_pred_ada)

Model: BaggingClassifier
Accuracy: 0.8003
Precision: 0.7965
Recall: 0.8132
F1 Score: 0.8047
              precision    recall  f1-score   support

       False       0.80      0.79      0.80       653
        True       0.80      0.81      0.80       669

    accuracy                           0.80      1322
   macro avg       0.80      0.80      0.80      1322
weighted avg       0.80      0.80      0.80      1322



Model: RandomForestClassifier
Accuracy: 0.7436
Precision: 0.7895
Recall: 0.6726
F1 Score: 0.7264
              precision    recall  f1-score   support

       False       0.71      0.82      0.76       653
        True       0.79      0.67      0.73       669

    accuracy                           0.74      1322
   macro avg       0.75      0.74      0.74      1322
weighted avg       0.75      0.74      0.74      1322



Model: GradientBoostingClassifier
Accuracy: 0.8003
Precision: 0.7939
Recall: 0.8176
F1 Score: 0.8056
              precision    recall  f1-score   suppor

In [None]:
#Reminder

#Precision: If false positives are very costly or problematic 
#for example, in spam filtering where every legitimate email classified as spam is a problem

#Recall: If false negatives are very costly or problematic
#for example, in medical diagnosis where every missed case of disease is critical

#Accuracy: If you have a balanced distribution of classes and each error has the same cost

#The F1 Score is the harmonic average of precision and recall, and it is particularly useful when you have unbalanced classes 
#for example, many more positive classes than negative classes, or vice versa

In [None]:
#The GradientBoostingClassifier seems to be the best model among those tested
#with a good balance between precision (0.7939), recall (0.8176) and F1 Score (0.8056).

In [None]:
#The BaggingClassifier also performs well with metrics very close to those of the GradientBoostingClassifier

#The AdaBoostClassifier has the best recall, but its precision is slightly lower,
#making it a little less balanced than the GradientBoostingClassifier and the BaggingClassifier

#The RandomForestClassifier has the lowest performance, mainly due to its lower recall