# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see if you can obtain a better model than before. For a fair comparison, you should apply the same feature scaling and feature engineering as in the previous Lab.

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split 

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [21]:
# Identify missing values
print("Missing values per column:")
print(spaceship.isnull().sum())

# Fill missing values: mean for numeric columns, mode for categorical columns
for column in spaceship.columns:
    if spaceship[column].dtype in ['int64', 'float64']:
        # Assign the result back instead of calling fillna(..., inplace=True)
        # on a column selection, which pandas deprecates because it acts on a copy
        spaceship[column] = spaceship[column].fillna(spaceship[column].mean())
    else:
        # Fill categorical columns with the most frequent value
        spaceship[column] = spaceship[column].fillna(spaceship[column].mode()[0])

# Check the missing values again
print("\nMissing values after imputation:")
print(spaceship.isnull().sum())

# Quick look at the cleaned dataset
print("\nData after cleaning:")
print(spaceship.head())

Missing values per column:
Age                       179
RoomService               181
FoodCourt                 183
ShoppingMall              208
Spa                       183
                         ... 
Name_Zosmark Unaasor        0
Name_Zosmas Ineedeve        0
Name_Zosmas Mormonized      0
Name_Zubeneb Flesping       0
Name_Zubeneb Pasharne       0
Length: 23736, dtype: int64

Missing values after imputation:
Age                       0
RoomService               0
FoodCourt                 0
ShoppingMall              0
Spa                       0
                         ..
Name_Zosmark Unaasor      0
Name_Zosmas Ineedeve      0
Name_Zosmas Mormonized    0
Name_Zubeneb Flesping     0
Name_Zubeneb Pasharne     0
Length: 23736, dtype: int64

Data after cleaning:
    Age  RoomService  FoodCourt  ShoppingMall     Spa  VRDeck  Transported  \
0  39.0          0.0        0.0           0.0     0.0     0.0        False   
1  24.0        109.0        9.0          25.0   549.0    44.0         True   
2  58.0         43.0     3576.0           0.0  6715.0    49.0        False   
3  33.0          0.0     1283.0         371.0  3329.0   193.0        False   
4  16.0        303.0       70.0         151.0   565.0     2.0         True   

   PassengerId_0002_01  PassengerId_0003_01  PassengerId_0003_02  ...  \
0                False                False                

In [3]:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler, LabelEncoder


In [22]:
# Encode categorical variables
spaceship = pd.get_dummies(spaceship, drop_first=True)
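The null-count output above reports 23736 columns, which suggests that identifier-like columns such as `PassengerId`, `Name`, and `Cabin` were one-hot encoded along with the real categorical features. The following is a minimal sketch, on a made-up toy frame, of how dropping identifiers and splitting `Cabin` into its parts could keep the encoded matrix small. The `Deck` and `Side` column names and the deck/num/side reading of `Cabin` are assumptions for illustration, not steps taken in this Lab.

```python
import pandas as pd

# Toy frame mimicking the dataset's layout (values are made up for illustration)
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0003_01", "0003_02"],
    "Name": ["A B", "C D", "E F", "G H"],
    "HomePlanet": ["Europa", "Earth", "Europa", "Earth"],
    "Cabin": ["B/0/P", "F/0/S", "A/0/S", "A/0/S"],
    "Age": [39.0, 24.0, 58.0, 33.0],
})

# Encoding everything: every unique ID, name, and cabin becomes its own column
encoded_all = pd.get_dummies(toy, drop_first=True)

# Assumption: identifiers carry no generalizable signal, so drop them
reduced = toy.drop(columns=["PassengerId", "Name"])

# Assumption: Cabin reads as deck/num/side; keep only deck and side
cabin_parts = reduced["Cabin"].str.split("/", expand=True)
reduced["Deck"] = cabin_parts[0]
reduced["Side"] = cabin_parts[2]
reduced = reduced.drop(columns=["Cabin"])

encoded_reduced = pd.get_dummies(reduced, drop_first=True)

print(encoded_all.shape[1], "columns when encoding everything")
print(encoded_reduced.shape[1], "columns after dropping identifiers")
```

On the full dataset the gap would be far larger than on this toy frame, since the number of dummy columns grows with the number of unique IDs and names.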

In [23]:
x = spaceship.drop("Transported", axis=1)
y = spaceship["Transported"]

Now perform the same as before:
- Feature Scaling
- Feature Selection

In [24]:
#Feature Scaling
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)

In [26]:
# Feature Selection
selector = SelectKBest(score_func=f_classif, k=10)
x_selected = selector.fit_transform(x_scaled, y)
selected_features = x.columns[selector.get_support()]
print("Selected Features:", selected_features)

Selected Features: Index(['Age', 'RoomService', 'FoodCourt', 'Spa', 'VRDeck', 'HomePlanet_Europa',
       'CryoSleep_True', 'Cabin_G/981/S', 'Destination_TRAPPIST-1e',
       'VIP_True'],
      dtype='object')


**Perform Train Test Split**

In [27]:
#your code here

X_train, X_test, y_train, y_test = train_test_split(x_selected, y, test_size=0.2, random_state=42)
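In the cells above, the scaler and the feature selector are fitted on the full dataset before the split, so the test rows influence both the scaling statistics and the selected features. One common way to avoid this is to split first and wrap the preprocessing in a `Pipeline`, so that every step is fitted on the training rows only. The sketch below uses synthetic data from `make_classification` as a stand-in, and the variable names (`X_demo`, `pipe`, and so on) are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded feature matrix (not the Spaceship Titanic data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=42
)

# Split first, so the test rows stay unseen during preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Scaling, selection, and the model are all fitted on the training split only
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("model", RandomForestClassifier(random_state=42)),
])
pipe.fit(X_tr, y_tr)

# score() applies the already-fitted scaler and selector to the test rows
demo_accuracy = pipe.score(X_te, y_te)
print(f"Held-out accuracy: {demo_accuracy:.4f}")
```

Any of the ensemble classifiers used later in this Lab could take the place of the final `"model"` step.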

In [37]:
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [38]:
# Bagging
bagging_model = BaggingClassifier(random_state=42)
bagging_model.fit(X_train, y_train)
bagging_pred = bagging_model.predict(X_test)

In [42]:
pasting_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.8,
    bootstrap=False,  # Without replacement
    random_state=42
)
pasting_model.fit(X_train, y_train)
pasting_pred = pasting_model.predict(X_test)

- Random Forests

In [43]:
# Random Forest
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)


- Gradient Boosting

In [44]:
# Gradient Boosting
gb_model = GradientBoostingClassifier(random_state=42)
gb_model.fit(X_train, y_train)
gb_pred = gb_model.predict(X_test)

- Adaptive Boosting

In [45]:
# AdaBoost
ab_model = AdaBoostClassifier(random_state=42)
ab_model.fit(X_train, y_train)
ab_pred = ab_model.predict(X_test)




In [47]:
# Evaluation
models = {
    "Bagging": bagging_pred,
    "Pasting": pasting_pred,
    "Random Forest": rf_pred,
    "Gradient Boosting": gb_pred,
    "AdaBoost": ab_pred
}

for name, pred in models.items():
    print(f"{name} Accuracy: {accuracy_score(y_test, pred):.4f}")

Bagging Accuracy: 0.7596
Pasting Accuracy: 0.7614
Random Forest Accuracy: 0.7625
Gradient Boosting Accuracy: 0.7769
AdaBoost Accuracy: 0.7786
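The accuracies above come from a single train/test split, and the gaps between some models are small. Cross-validation is one way to check whether a ranking like this holds across several splits. The sketch below compares three of the ensemble types on synthetic stand-in data; the `candidates` and `cv_results` names and the reduced `n_estimators` values are assumptions chosen to keep the example quick, not settings from this Lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not the Spaceship Titanic features)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# Small ensembles so the example runs quickly
candidates = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=42),
}

# 5-fold cross-validated accuracy: mean and spread across the folds
cv_results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```

If the standard deviation across folds is larger than the gap between two models' mean scores, the single-split ranking between them should be read with caution.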


Which model is the best and why?

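The best model is AdaBoost, since it has the highest accuracy (0.7786). This means it correctly classifies a larger proportion of the test data than the other models. The margin over Gradient Boosting (0.7769) is small, however, so on a single train/test split the two boosting models are close to tied; both outperform the bagging-based models (0.7596 to 0.7625).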