# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this lab, try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous lab.

In [18]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [19]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [23]:
# Check for categorical columns that have not been encoded yet
categorical_columns = spaceship.select_dtypes(include=['object']).columns
print(f"Categorical columns: {categorical_columns}")

# One-hot encode any remaining categorical columns
# (note: 'Cabin' is high-cardinality, so this creates thousands of dummy columns)
if len(categorical_columns) > 0:
    spaceship = pd.get_dummies(spaceship, columns=categorical_columns, drop_first=True)

# Confirm that the categorical columns have been encoded
print(spaceship.dtypes)

# Drop identifier columns such as 'PassengerId' and 'Name' if they are still present
spaceship.drop(columns=['PassengerId', 'Name'], errors='ignore', inplace=True)

# Separate the predictors (X) from the target (y)
X = spaceship.drop('Transported', axis=1)
y = spaceship['Transported']

# Scale the features
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Feature selection
from sklearn.ensemble import RandomForestClassifier

# Fit a RandomForestClassifier to rank the features
model = RandomForestClassifier()
model.fit(X_scaled, y)

# Feature importances, sorted from most to least important
importances = model.feature_importances_
indices = np.argsort(importances)[::-1]

# Print the features in order of importance
for i in range(X.shape[1]):
    print(f"{X.columns[indices[i]]}: {importances[indices[i]]}")


Categorical columns: Index(['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP'], dtype='object')
Age                          float64
RoomService                  float64
FoodCourt                    float64
ShoppingMall                 float64
Spa                          float64
                              ...   
Cabin_T/2/S                     bool
Cabin_T/3/P                     bool
Destination_PSO J318.5-22       bool
Destination_TRAPPIST-1e         bool
VIP_True                        bool
Length: 6572, dtype: object
RoomService: 0.07714139847978409
Spa: 0.07495211990298205
VRDeck: 0.0690469561047325
CryoSleep_True: 0.06466709156775627
FoodCourt: 0.06374396955453386
Age: 0.059292363439910935
ShoppingMall: 0.05649306930707603
HomePlanet_Europa: 0.02593335618151677
HomePlanet_Mars: 0.011509160336452313
Destination_TRAPPIST-1e: 0.008496361994604493
Destination_PSO J318.5-22: 0.004239668272240165
VIP_True: 0.0021889150738613346
Cabin_G/974/P: 0.0012105313610774754
Cabin_E/0
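
The dtypes output above reports 6572 columns, almost all of them `Cabin_*` dummies that each describe one or a few passengers and receive negligible importance. One possible alternative, which is not what this lab does, is to split `Cabin` into its parts before encoding. The sketch below assumes the `deck/number/side` pattern suggested by sample values such as `B/0/P`; the column names `Deck`, `CabinNum` and `Side` are hypothetical, and the toy frame stands in for the real data.

```python
import pandas as pd

# Toy frame using Cabin values copied from the head() output above
cabin_demo = pd.DataFrame({"Cabin": ["B/0/P", "F/0/S", "A/0/S", "F/1/S"]})

# Split "deck/number/side" into three separate columns (assumed format)
parts = cabin_demo["Cabin"].str.split("/", expand=True)
cabin_demo["Deck"] = parts[0]
cabin_demo["CabinNum"] = pd.to_numeric(parts[1])
cabin_demo["Side"] = parts[2]

# One-hot encode only the low-cardinality parts; keep the number as numeric
cabin_encoded = pd.get_dummies(
    cabin_demo.drop(columns="Cabin"), columns=["Deck", "Side"], drop_first=True
)
print(cabin_encoded.columns.tolist())
```

Encoding this way yields a handful of columns instead of one per distinct cabin, which may make the feature-importance ranking easier to read.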

**Perform Train Test Split**

In [24]:
#your code here
from sklearn.model_selection import train_test_split

# Separate the predictors (X) from the target (y)
# Note: the unscaled features are used here; tree-based models are largely insensitive to feature scale
X = spaceship.drop('Transported', axis=1)
y = spaceship['Transported']

# Train-test split (80% training, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the sizes of the training and test sets
print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")


Training set size: (6954, 6571)
Test set size: (1739, 6571)
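
The scaling cell earlier fits `StandardScaler` on all rows before the split, which lets test-set statistics influence the transformation. A minimal sketch of the usual alternative, fitting the scaler on the training split only, is shown here on synthetic data from `make_classification`; the `*_demo` names are placeholders and not part of the lab.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the lab data (assumption: 500 rows, 8 numeric features)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Learn the mean and standard deviation from the training rows only
scaler_demo = StandardScaler().fit(X_tr)

# Apply the same training-set statistics to both splits
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print("Train means:", np.round(X_tr_scaled.mean(axis=0), 3))
print("Test means: ", np.round(X_te_scaled.mean(axis=0), 3))
```

The training means come out at zero by construction, while the test means are only close to zero, which is the expected behaviour when the test set is treated as unseen data.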


**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [29]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Create the base classifier (a decision tree)
tree_clf = DecisionTreeClassifier()

# Bagging (sampling with replacement)
bagging_clf = BaggingClassifier(
    estimator=tree_clf, n_estimators=100, bootstrap=True, n_jobs=-1, random_state=42
)

# Train the model
bagging_clf.fit(X_train, y_train)

# Predict on the test set
y_pred_bagging = bagging_clf.predict(X_test)

# Evaluate model accuracy
print("Accuracy with Bagging:", accuracy_score(y_test, y_pred_bagging))


Accuracy with Bagging: 0.7745830937320299


In [30]:
# Pasting (sampling without replacement)
# Note: with the default max_samples=1.0 and bootstrap=False, every tree is trained on the
# full training set; set max_samples below 1.0 so that each tree sees a different subset
pasting_clf = BaggingClassifier(
    estimator=tree_clf, n_estimators=100, bootstrap=False, n_jobs=-1, random_state=42
)

# Train the model
pasting_clf.fit(X_train, y_train)

# Predict on the test set
y_pred_pasting = pasting_clf.predict(X_test)

# Evaluate model accuracy
print("Accuracy with Pasting:", accuracy_score(y_test, y_pred_pasting))


Accuracy with Pasting: 0.7567567567567568


- Random Forests

In [31]:
#your code here
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Create the Random Forest model
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf_clf.fit(X_train, y_train)

# Predict on the test set
y_pred_rf = rf_clf.predict(X_test)

# Evaluate model accuracy
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print("Accuracy with Random Forest:", accuracy_rf)

# Confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred_rf)
print("\nConfusion Matrix:\n", conf_matrix)

# Classification report
class_report = classification_report(y_test, y_pred_rf)
print("\nClassification Report:\n", class_report)


Accuracy with Random Forest: 0.7711328349626222

Confusion Matrix:
 [[680 181]
 [217 661]]

Classification Report:
               precision    recall  f1-score   support

       False       0.76      0.79      0.77       861
        True       0.79      0.75      0.77       878

    accuracy                           0.77      1739
   macro avg       0.77      0.77      0.77      1739
weighted avg       0.77      0.77      0.77      1739
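
Because a random forest trains each tree on a bootstrap sample, the rows left out of each sample can serve as a built-in validation set. The sketch below shows the out-of-bag (OOB) estimate via `oob_score=True` on synthetic data; the dataset and the choice of 200 trees are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the lab data
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=42)

# oob_score=True scores each row using only the trees that did not train on it
rf_demo = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf_demo.fit(X_demo, y_demo)

print("OOB accuracy estimate (demo):", rf_demo.oob_score_)
```

The OOB estimate can be compared with the held-out test accuracy as a sanity check without setting aside additional data.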



- Gradient Boosting

In [32]:
#your code here
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Create the Gradient Boosting model
gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)

# Train the model
gb_clf.fit(X_train, y_train)

# Predict on the test set
y_pred_gb = gb_clf.predict(X_test)

# Evaluate model accuracy
accuracy_gb = accuracy_score(y_test, y_pred_gb)
print("Accuracy with Gradient Boosting:", accuracy_gb)

# Confusion matrix
conf_matrix_gb = confusion_matrix(y_test, y_pred_gb)
print("\nConfusion Matrix:\n", conf_matrix_gb)

# Classification report
class_report_gb = classification_report(y_test, y_pred_gb)
print("\nClassification Report:\n", class_report_gb)


Accuracy with Gradient Boosting: 0.7855089131684876

Confusion Matrix:
 [[628 233]
 [140 738]]

Classification Report:
               precision    recall  f1-score   support

       False       0.82      0.73      0.77       861
        True       0.76      0.84      0.80       878

    accuracy                           0.79      1739
   macro avg       0.79      0.78      0.78      1739
weighted avg       0.79      0.79      0.78      1739
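
Gradient boosting adds trees one at a time, so accuracy can be inspected after every stage to see whether 100 trees is too few or too many. The sketch below uses `staged_predict` on a separate validation split of synthetic data; the split and the `*_demo` names are assumptions, and the number it reports applies to the toy data only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab data, with a validation split
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

gb_demo = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, random_state=42
)
gb_demo.fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 100 trees
staged_acc = [
    accuracy_score(y_val, y_stage) for y_stage in gb_demo.staged_predict(X_val)
]

# Smallest number of trees that reaches the best validation accuracy
best_n = int(np.argmax(staged_acc)) + 1
print("Best number of trees (demo):", best_n)
print("Validation accuracy at that stage:", staged_acc[best_n - 1])
```

Selecting the number of trees on a validation split rather than the test split keeps the final test accuracy an honest estimate.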



- Adaptive Boosting

In [35]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Create the AdaBoost model with the default base classifier
# (a depth-1 decision tree, i.e. a decision stump) and the SAMME algorithm
ada_clf = AdaBoostClassifier(
    n_estimators=100,   # Number of estimators (stumps) to use
    learning_rate=1,    # Learning rate
    algorithm='SAMME',  # Use the SAMME algorithm instead of SAMME.R
    random_state=42
)

# Train the model
ada_clf.fit(X_train, y_train)

# Predict on the test set
y_pred_ada = ada_clf.predict(X_test)

# Evaluate model accuracy
accuracy_ada = accuracy_score(y_test, y_pred_ada)
print("Accuracy with AdaBoost:", accuracy_ada)

# Confusion matrix
conf_matrix_ada = confusion_matrix(y_test, y_pred_ada)
print("\nConfusion Matrix:\n", conf_matrix_ada)

# Classification report
class_report_ada = classification_report(y_test, y_pred_ada)
print("\nClassification Report:\n", class_report_ada)




Accuracy with AdaBoost: 0.7705577918343876

Confusion Matrix:
 [[654 207]
 [192 686]]

Classification Report:
               precision    recall  f1-score   support

       False       0.77      0.76      0.77       861
        True       0.77      0.78      0.77       878

    accuracy                           0.77      1739
   macro avg       0.77      0.77      0.77      1739
weighted avg       0.77      0.77      0.77      1739



Which model is the best and why?

In [None]:
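# Gradient Boosting is the best model here: it has the highest test accuracy (0.786),
# ahead of Bagging (0.775), Random Forest (0.771), AdaBoost (0.771) and Pasting (0.757).
# It also has the highest recall for the True class (0.84, versus 0.75 for Random Forest
# and 0.78 for AdaBoost), at the cost of lower recall for the False class (0.73).
# The margin over Bagging is about one percentage point and comes from a single
# train/test split, so the ranking should be read with some caution.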