# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering that you applied in the previous Lab.

In [10]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, f1_score

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [3]:
#your code here

# Impute missing values with SimpleImputer (mode for categorical variables, mean for numeric ones)
imputer_cat = SimpleImputer(strategy='most_frequent')  # Categorical variables
spaceship[['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP']] = imputer_cat.fit_transform(
    spaceship[['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP']])

imputer_num = SimpleImputer(strategy='mean')  # Numeric variables
spaceship[['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']] = imputer_num.fit_transform(
    spaceship[['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']])

# Convert categorical variables to numeric with one-hot encoding
# Note: Cabin has many distinct values, so encoding it this way produces a very wide feature matrix
spaceship = pd.get_dummies(spaceship, columns=['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP'], drop_first=True)

# Separate features (X) and labels (y)
X = spaceship.drop(['PassengerId', 'Name', 'Transported'], axis=1)
y = spaceship['Transported'].astype(int)

# Scale the features (StandardScaler is applied to every column, including the one-hot columns)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
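
The cell above fits the imputers and the scaler on the full dataset before the train/test split, so statistics from the test rows influence the training features. One possible way to avoid that is to wrap the same steps in a `Pipeline` and `ColumnTransformer` and fit them on the training rows only. The sketch below is not the lab's required solution: the `toy` DataFrame and its values are made up for illustration, and the names `preprocess`, `cat_cols` and `num_cols` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for a few Spaceship Titanic columns (values are made up).
toy = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", np.nan, "Mars", "Earth", "Europa"],
    "Age": [39.0, 24.0, 58.0, np.nan, 16.0, 44.0],
    "Spa": [0.0, 549.0, 6715.0, 3329.0, np.nan, 0.0],
})
cat_cols = ["HomePlanet"]
num_cols = ["Age", "Spa"]

# Same steps as in the lab (impute, one-hot encode, scale), bundled per column type.
preprocess = ColumnTransformer(
    [
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), cat_cols),
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="mean")),
            ("scale", StandardScaler()),
        ]), num_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on the training rows only, then reuse the fitted statistics on the test rows.
train_rows, test_rows = toy.iloc[:4], toy.iloc[4:]
X_train_prep = preprocess.fit_transform(train_rows)
X_test_prep = preprocess.transform(test_rows)

print(X_train_prep.shape, X_test_prep.shape)
```

With this layout the test rows are imputed and scaled using means and modes learned from the training rows, which is what a deployed model would see.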

**Perform Train Test Split**

In [4]:
#your code here

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)


**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [5]:
#your code here

# Bagging
bagging = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100, max_samples=0.8, bootstrap=True, random_state=42)
bagging.fit(X_train, y_train)
y_pred_bagging = bagging.predict(X_test)

# Bagging evaluation
bagging_accuracy = accuracy_score(y_test, y_pred_bagging)
bagging_f1 = f1_score(y_test, y_pred_bagging)
print(f"Bagging Accuracy: {bagging_accuracy:.2f}, F1 Score: {bagging_f1:.2f}")

# Pasting (bootstrap=False)
pasting = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100, max_samples=0.8, bootstrap=False, random_state=42)
pasting.fit(X_train, y_train)
y_pred_pasting = pasting.predict(X_test)

# Pasting evaluation
pasting_accuracy = accuracy_score(y_test, y_pred_pasting)
pasting_f1 = f1_score(y_test, y_pred_pasting)
print(f"Pasting Accuracy: {pasting_accuracy:.2f}, F1 Score: {pasting_f1:.2f}")


Bagging Accuracy: 0.78, F1 Score: 0.79
Pasting Accuracy: 0.78, F1 Score: 0.79
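
The only difference between the two models above is `bootstrap`: bagging draws each estimator's training rows with replacement, while pasting draws them without replacement. A minimal sketch on synthetic data (from `make_classification`, not the Spaceship Titanic set) makes that visible by counting the distinct rows seen by the first estimator. It also requests the out-of-bag score, which is only available when `bootstrap=True`. The `*_demo` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, used only for illustration.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

# Bagging: rows are sampled WITH replacement, so out-of-bag scoring is possible.
bag_demo = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50, max_samples=0.8,
    bootstrap=True, oob_score=True, random_state=42)
bag_demo.fit(X_demo, y_demo)

# Pasting: rows are sampled WITHOUT replacement.
paste_demo = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50, max_samples=0.8,
    bootstrap=False, random_state=42)
paste_demo.fit(X_demo, y_demo)

# Row indices drawn for the first base estimator of each ensemble.
n_draw = len(bag_demo.estimators_samples_[0])
unique_bag = len(set(bag_demo.estimators_samples_[0]))
unique_paste = len(set(paste_demo.estimators_samples_[0]))

print(f"Rows drawn per estimator: {n_draw}")
print(f"Distinct rows (bagging): {unique_bag}, distinct rows (pasting): {unique_paste}")
print(f"Bagging out-of-bag accuracy: {bag_demo.oob_score_:.2f}")
```

Bagging sees fewer distinct rows per estimator because some rows are repeated, which tends to make the individual trees more diverse.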


- Random Forests

In [6]:
#your code here

# Random Forest import (already imported in the first cell)
from sklearn.ensemble import RandomForestClassifier

# Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

# Random Forest evaluation
rf_accuracy = accuracy_score(y_test, y_pred_rf)
rf_f1 = f1_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy: {rf_accuracy:.2f}, F1 Score: {rf_f1:.2f}")


Random Forest Accuracy: 0.79, F1 Score: 0.79
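
A random forest can be read as bagging applied to decision trees that also consider only a random subset of features at each split. The sketch below, on synthetic data rather than the lab dataset, builds that ensemble by hand with `BaggingClassifier` and compares its predictions with `RandomForestClassifier`. The two are close in spirit but not identical implementations, so the predictions are expected to agree on most rows, not all. The `*_demo` and `bag_of_random_trees` names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Off-the-shelf random forest.
rf_demo = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Hand-built analogue: bagged trees, each split restricted to sqrt(n_features) candidates.
bag_of_random_trees = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt"),
    n_estimators=100, bootstrap=True, random_state=42).fit(X_tr, y_tr)

# Fraction of test rows on which the two ensembles predict the same class.
agreement = np.mean(rf_demo.predict(X_te) == bag_of_random_trees.predict(X_te))
print(f"Prediction agreement: {agreement:.2f}")
```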


- Gradient Boosting

In [7]:
#your code here

# Gradient Boosting import (already imported in the first cell)
from sklearn.ensemble import GradientBoostingClassifier

# Gradient Boosting
gb = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb.fit(X_train, y_train)
y_pred_gb = gb.predict(X_test)

# Gradient Boosting evaluation
gb_accuracy = accuracy_score(y_test, y_pred_gb)
gb_f1 = f1_score(y_test, y_pred_gb)
print(f"Gradient Boosting Accuracy: {gb_accuracy:.2f}, F1 Score: {gb_f1:.2f}")



Gradient Boosting Accuracy: 0.79, F1 Score: 0.80
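
Gradient boosting adds trees one after another, so the number of trees matters more than in bagging-style ensembles. As a rough sketch (synthetic data, illustrative `*_demo` names), `staged_predict` yields the ensemble's predictions after each boosting stage, which shows how test accuracy evolves as trees are added without refitting the model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration.
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

gb_demo = GradientBoostingClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Test accuracy after 1, 2, ..., 100 trees.
staged_acc = [accuracy_score(y_te, y_hat) for y_hat in gb_demo.staged_predict(X_te)]

# Number of trees at which test accuracy first reaches its maximum.
best_n = int(np.argmax(staged_acc)) + 1
print(f"Best test accuracy {max(staged_acc):.2f} reached with {best_n} trees")
```

In practice the number of trees would be chosen on a validation set or with cross-validation, not on the test set; the test set is used here only to keep the sketch short.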


- Adaptive Boosting

In [11]:
#your code here

# Use the SAMME algorithm instead of SAMME.R
ada = AdaBoostClassifier(n_estimators=100, algorithm='SAMME', random_state=42)
ada.fit(X_train, y_train)
y_pred_ada = ada.predict(X_test)

# AdaBoost evaluation
ada_accuracy = accuracy_score(y_test, y_pred_ada)
ada_f1 = f1_score(y_test, y_pred_ada)
print(f"AdaBoost (SAMME) Accuracy: {ada_accuracy:.2f}, F1 Score: {ada_f1:.2f}")


AdaBoost (SAMME) Accuracy: 0.77, F1 Score: 0.77
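
The AdaBoost model above does not pass a base estimator, so scikit-learn falls back to its default weak learner, a decision tree with `max_depth=1` (a decision stump). The sketch below, on synthetic data with illustrative names, checks the depth of the fitted stumps and shows how a slightly deeper tree could be supplied instead. The base estimator is passed positionally, and the `algorithm` argument is left out because its accepted values differ between scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=42)

# Default weak learner: decision stumps.
ada_default = AdaBoostClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)
depths_default = {tree.get_depth() for tree in ada_default.estimators_}

# Alternative weak learner: trees of depth up to 3, passed as the first positional argument.
ada_deeper = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=3), n_estimators=50, random_state=42
).fit(X_demo, y_demo)
max_depth_deeper = max(tree.get_depth() for tree in ada_deeper.estimators_)

print(f"Depths of default base estimators: {depths_default}")
print(f"Maximum depth with the custom base estimator: {max_depth_deeper}")
```

Whether deeper base trees help on the Spaceship Titanic data is an open question that would need to be tested on that data.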


Which model is the best and why?

#comment here

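Based on these results, Gradient Boosting is marginally the best model: its F1 score of 0.80 is the highest of all models tested, and its accuracy of 0.79 ties with Random Forest (0.79 accuracy, 0.79 F1). Bagging and Pasting follow at 0.78 accuracy and 0.79 F1, and AdaBoost (SAMME) is last at 0.77 on both metrics. A plausible reason is that gradient boosting fits each new tree to the errors of the current ensemble, which reduces bias, whereas bagging-style methods mainly reduce variance. The gaps are only one to three hundredths on a single train/test split, so this ranking should be treated as tentative.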
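
Because the scores above differ by only a few hundredths on one split, a steadier comparison could use k-fold cross-validation and report the mean and spread of the F1 score per model. The following is a minimal sketch on synthetic data, with small ensembles to keep it fast. The `models` dictionary and `cv_results` name are illustrative, and the numbers it prints say nothing about the Spaceship Titanic results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=42)

models = {
    "Bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=30, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=30, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=30, random_state=42),
    "AdaBoost": AdaBoostClassifier(n_estimators=30, random_state=42),
}

# 5-fold cross-validated F1 score for every model on the same data.
cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="f1")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: F1 = {scores.mean():.2f} +/- {scores.std():.2f}")
```

If the standard deviations overlap the differences between means, the models are hard to separate on that data.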