# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, you should apply the same feature scaling and feature engineering as in the previous Lab.

In [2]:
#Libraries
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error, accuracy_score, classification_report
from sklearn.neighbors import KNeighborsClassifier


In [3]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [4]:
# Drop nulls, convert object-typed variables to numeric, and select the target and features

spaceship = spaceship.dropna()
spaceship["Cabin"] = spaceship["Cabin"].str.split("/").str[0]
spaceship = spaceship.drop(columns=["PassengerId", "Name"])
columnas_categoricas = spaceship.select_dtypes(include="object").columns
spaceship = pd.get_dummies(spaceship, columns=columnas_categoricas, drop_first=True)
target = spaceship['Transported']
features = spaceship.drop(columns=["Transported"])
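
To make the preprocessing above concrete, here is a minimal sketch on a small hand-made frame. The toy rows are invented for illustration and only the column names mirror the lab. The categorical columns are listed explicitly here, which is an assumption made to keep the sketch independent of dtype detection. It shows that `Cabin` is reduced to its deck letter and that `drop_first=True` leaves one dummy column fewer than there are categories.

```python
import pandas as pd

# Toy rows invented for illustration; only the column names mirror the lab.
toy = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", "Mars", None],
    "Cabin": ["B/0/P", "F/0/S", "A/1/S", "G/3/P"],
    "Age": [39.0, 24.0, 58.0, 33.0],
    "Transported": [False, True, False, True],
})

toy = toy.dropna().copy()                            # drop the row with a missing HomePlanet
toy["Cabin"] = toy["Cabin"].str.split("/").str[0]    # "B/0/P" -> "B" (deck letter only)

# One-hot encode; drop_first=True drops the first category of each column
# to avoid a redundant (perfectly collinear) dummy.
toy_encoded = pd.get_dummies(toy, columns=["HomePlanet", "Cabin"], drop_first=True)

print(toy_encoded.columns.tolist())
print(toy_encoded.shape)
```

Each encoded column keeps three categories after `dropna`, so each contributes two dummy columns.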


In [5]:
# Evaluate the KNN model before scaling

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.20, random_state=42)

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)

acc = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {acc:.4f}")
print(classification_report(y_test, y_pred))

Test accuracy: 0.7935
              precision    recall  f1-score   support

       False       0.79      0.79      0.79       653
        True       0.80      0.79      0.80       669

    accuracy                           0.79      1322
   macro avg       0.79      0.79      0.79      1322
weighted avg       0.79      0.79      0.79      1322



In [6]:
# Feature Scaling

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)       # fit the scaler on the training features and scale them
X_test_scaled = scaler.transform(X_test)             # scale X_test with the parameters fitted on X_train

# Feature Selection

selector = SelectKBest(score_func=f_classif, k=10)                   # keeps the 10 features with the highest ANOVA F-score against the target
X_train_selected = selector.fit_transform(X_train_scaled, y_train)   # for each column of X_train_scaled, tests whether its values differ between y_train = False and y_train = True
X_test_selected = selector.transform(X_test_scaled)                  # X_test_scaled reduced to the same 10 columns selected on the training set
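
Because `StandardScaler` returns a NumPy array, the column names are lost before `SelectKBest` runs, so it is not obvious which features were kept. The sketch below shows one way to recover them with `get_support()`. It uses synthetic data from `make_classification` as a stand-in for the lab's `X_train` and `y_train`, and the names `X_demo`, `selector_demo` and `feat_i` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the lab's training data (an assumption).
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
X_demo = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(8)])

X_scaled = StandardScaler().fit_transform(X_demo)    # NumPy array, names are gone
selector_demo = SelectKBest(score_func=f_classif, k=3).fit(X_scaled, y)

# get_support() is a boolean mask over the input columns, in their original order.
mask = selector_demo.get_support()
selected_names = X_demo.columns[mask].tolist()

# F-scores per feature, re-attached to the original column names.
scores = pd.Series(selector_demo.scores_, index=X_demo.columns)

print(selected_names)
print(scores.sort_values(ascending=False).round(2))
```

In the lab itself, the same pattern would presumably be `features.columns[selector.get_support()]`.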


**Perform Train Test Split**

In [7]:
# Evaluate the KNN model after scaling and feature selection

knn = KNeighborsClassifier()
knn.fit(X_train_selected, y_train)

y_pred = knn.predict(X_test_selected)

acc = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {acc:.4f}")
print(classification_report(y_test, y_pred))

Test accuracy: 0.7504
              precision    recall  f1-score   support

       False       0.72      0.81      0.76       653
        True       0.79      0.70      0.74       669

    accuracy                           0.75      1322
   macro avg       0.75      0.75      0.75      1322
weighted avg       0.75      0.75      0.75      1322



##### Classification report cheat sheet:

1. **Precision** → Of the samples predicted as *true*, how many really are *true*?
2. **Recall** → Of the actual *true* samples, how many are detected?
3. **F1** → Harmonic mean of precision and recall.
4. **Accuracy** → Fraction of all predictions that are correct.
5. **Macro avg** → Unweighted mean across classes.
6. **Weighted avg** → Mean across classes, weighted by the number of samples (support) in each class.
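
The definitions above can be checked by hand from confusion-matrix counts. The sketch below uses made-up counts (`tp`, `fp`, `fn`, `tn` are invented for illustration, not taken from the lab's results) and plain Python so each formula is visible.

```python
# Hypothetical confusion-matrix counts for the "True" class.
tp, fp, fn, tn = 8, 2, 4, 6

precision = tp / (tp + fp)                            # predicted True that are really True
recall = tp / (tp + fn)                               # actual True that were detected
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of the two
accuracy = (tp + tn) / (tp + fp + fn + tn)            # all correct predictions

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

Note that F1 sits closer to the lower of precision and recall, which is why it penalizes a model that does well on only one of them.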


###### With rescaling

With scaling and feature selection applied, the KNN test accuracy drops from 0.7935 to 0.7504.

**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [8]:
from sklearn.ensemble import BaggingClassifier

# Bagging 
bagging = BaggingClassifier(
    estimator=KNeighborsClassifier(),
    n_estimators=50,
    max_samples=0.8,     # fraction of training samples drawn for each model
    bootstrap=True,      # Bagging (sampling with replacement)
    random_state=42
)

bagging.fit(X_train_selected, y_train)
y_pred_bag = bagging.predict(X_test_selected)

print("Bagging results")
print("Accuracy:", accuracy_score(y_test, y_pred_bag))
print(classification_report(y_test, y_pred_bag))


#  Pasting 
pasting = BaggingClassifier(
    estimator=KNeighborsClassifier(),
    n_estimators=50,
    max_samples=0.8,
    bootstrap=False,     # Pasting (sampling without replacement)
    random_state=42
)

pasting.fit(X_train_selected, y_train)
y_pred_past = pasting.predict(X_test_selected)

print("Pasting results")
print("Accuracy:", accuracy_score(y_test, y_pred_past))
print(classification_report(y_test, y_pred_past))


Bagging results
Accuracy: 0.7844175491679274
              precision    recall  f1-score   support

       False       0.79      0.76      0.78       653
        True       0.78      0.81      0.79       669

    accuracy                           0.78      1322
   macro avg       0.78      0.78      0.78      1322
weighted avg       0.78      0.78      0.78      1322

Pasting results
Accuracy: 0.7881996974281392
              precision    recall  f1-score   support

       False       0.81      0.74      0.78       653
        True       0.77      0.83      0.80       669

    accuracy                           0.79      1322
   macro avg       0.79      0.79      0.79      1322
weighted avg       0.79      0.79      0.79      1322
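
The only difference between the two ensembles above is the `bootstrap` flag, that is, whether each model's training subset is drawn with or without replacement. The sketch below illustrates that sampling idea with NumPy alone. It is not scikit-learn's internal code, and `n_rows` is a hypothetical training-set size.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 1000                   # hypothetical training-set size
n_draw = int(0.8 * n_rows)      # mirrors max_samples=0.8

# Bagging: draw with replacement, so some rows repeat and others are left out.
bagging_idx = rng.choice(n_rows, size=n_draw, replace=True)
# Pasting: draw without replacement, so every drawn row is distinct.
pasting_idx = rng.choice(n_rows, size=n_draw, replace=False)

n_unique_bag = np.unique(bagging_idx).size
n_unique_past = np.unique(pasting_idx).size

# In expectation, bagging with 0.8 * n draws sees about 1 - exp(-0.8), roughly 55%, of the rows.
print(f"Bagging: {n_unique_bag} distinct rows out of {n_draw} draws")
print(f"Pasting: {n_unique_past} distinct rows out of {n_draw} draws")
```

With pasting, each model sees more distinct rows, so the individual models tend to be more alike than under bagging.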



- Random Forests

In [9]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=None,
    random_state=42,
    n_jobs=-1
)

rf.fit(X_train_selected, y_train)
y_pred_rf = rf.predict(X_test_selected)

print("Random Forest results")
print("Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))


Random Forest results
Accuracy: 0.7821482602118003
              precision    recall  f1-score   support

       False       0.80      0.74      0.77       653
        True       0.76      0.82      0.79       669

    accuracy                           0.78      1322
   macro avg       0.78      0.78      0.78      1322
weighted avg       0.78      0.78      0.78      1322



- Gradient Boosting

In [10]:
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=3,
    random_state=42
)

gb.fit(X_train_selected, y_train)
y_pred_gb = gb.predict(X_test_selected)

print("Gradient Boosting results")
print("Accuracy:", accuracy_score(y_test, y_pred_gb))
print(classification_report(y_test, y_pred_gb))


Gradient Boosting results
Accuracy: 0.7851739788199698
              precision    recall  f1-score   support

       False       0.83      0.71      0.77       653
        True       0.75      0.86      0.80       669

    accuracy                           0.79      1322
   macro avg       0.79      0.78      0.78      1322
weighted avg       0.79      0.79      0.78      1322
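
Gradient Boosting adds trees one at a time, so it can be useful to see how accuracy evolves as trees are added rather than only at the final `n_estimators`. The sketch below uses `staged_predict` for that. It runs on synthetic data from `make_classification` as a stand-in for the lab's features, and the names `gb_demo`, `X_tr` and `X_val` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab's data (an assumption).
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

gb_demo = GradientBoostingClassifier(n_estimators=100, learning_rate=0.05,
                                     max_depth=3, random_state=42)
gb_demo.fit(X_tr, y_tr)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees.
staged_acc = [accuracy_score(y_val, y_stage)
              for y_stage in gb_demo.staged_predict(X_val)]

best_n = staged_acc.index(max(staged_acc)) + 1   # stages are 1-based
print(f"Best validation accuracy {max(staged_acc):.4f} at {best_n} trees")
print(f"Accuracy with all 100 trees: {staged_acc[-1]:.4f}")
```

The number of trees should be chosen on a validation split, as here, and not on the final test set, otherwise the reported test accuracy becomes optimistic.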



- Adaptive Boosting

In [11]:
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(
    n_estimators=200,
    learning_rate=0.05,
    random_state=42
)

ada.fit(X_train_selected, y_train)
y_pred_ada = ada.predict(X_test_selected)

print("AdaBoost results")
print("Accuracy:", accuracy_score(y_test, y_pred_ada))
print(classification_report(y_test, y_pred_ada))


AdaBoost results
Accuracy: 0.7465960665658093
              precision    recall  f1-score   support

       False       0.70      0.86      0.77       653
        True       0.83      0.63      0.72       669

    accuracy                           0.75      1322
   macro avg       0.76      0.75      0.74      1322
weighted avg       0.76      0.75      0.74      1322



Which model is the best and why?

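Judging by test accuracy among the ensembles trained on the scaled and selected features, Pasting with KNN is the best (0.7882), followed closely by Gradient Boosting (0.7852), Bagging with KNN (0.7844) and Random Forest (0.7821). AdaBoost is the weakest (0.7466). Every ensemble except AdaBoost improves on the single KNN trained on the same features (0.7504), which is consistent with bagging-style ensembles reducing variance by averaging many models, and with tree-based ensembles capturing non-linear relationships and interactions between features. However, none of them beats the KNN trained on the unscaled, full feature set (0.7935), and the top four ensembles are within one percentage point of each other on a test set of 1,322 rows, so the ranking among them is not conclusive from a single split.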