# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [3]:
spaceship.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB


In [4]:
spaceship.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

In [5]:
# Take an explicit copy so that adding columns later does not trigger SettingWithCopyWarning
spaceship_clean = spaceship.dropna().copy()

In [6]:
# Extract the deck component from the Cabin column
spaceship_clean['Deck'] = spaceship_clean['Cabin'].str[0]

# Now that we have the Deck, drop the original Cabin column
spaceship_clean = spaceship_clean.drop(columns=['Cabin'])

# Check unique values to ensure we have the desired {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}
print(spaceship_clean['Deck'].unique())

['B' 'F' 'A' 'G' 'E' 'C' 'D' 'T']
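Adding a column to the result of `dropna()` can make pandas warn that the assignment may target a copy of a slice. The following is a minimal sketch of the explicit-copy pattern on a hypothetical three-row frame (the frame and the names `raw_demo` and `clean_demo` are illustrative assumptions, not part of the Lab data):

```python
import pandas as pd

# Hypothetical toy frame standing in for the Spaceship Titanic data
raw_demo = pd.DataFrame({"Cabin": ["B/0/P", None, "F/1/S"]})

# .copy() makes clean_demo an independent DataFrame, so the column
# assignment below is unambiguous and leaves raw_demo untouched
clean_demo = raw_demo.dropna().copy()
clean_demo["Deck"] = clean_demo["Cabin"].str[0]

print(clean_demo)
```

The original frame keeps its columns, and only the cleaned copy gains `Deck`.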


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [7]:
#your code here

# Drop PassengerId and Name
spaceship_clean = spaceship_clean.drop(columns=['PassengerId', 'Name'])

# Convert CryoSleep and VIP into numerical format
spaceship_clean['CryoSleep'] = spaceship_clean['CryoSleep'].astype(int)
spaceship_clean['VIP'] = spaceship_clean['VIP'].astype(int)

# Perform one-hot encoding on categorical columns
# Columns that need to be encoded: HomePlanet, Destination, Deck
spaceship_clean = pd.get_dummies(spaceship_clean, columns=['HomePlanet', 'Destination', 'Deck'], drop_first=True)

In [8]:
spaceship_clean

Unnamed: 0,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,HomePlanet_Europa,HomePlanet_Mars,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,Deck_B,Deck_C,Deck_D,Deck_E,Deck_F,Deck_G,Deck_T
0,0,39.0,0,0.0,0.0,0.0,0.0,0.0,False,1,0,0,1,1,0,0,0,0,0,0
1,0,24.0,0,109.0,9.0,25.0,549.0,44.0,True,0,0,0,1,0,0,0,0,1,0,0
2,0,58.0,1,43.0,3576.0,0.0,6715.0,49.0,False,1,0,0,1,0,0,0,0,0,0,0
3,0,33.0,0,0.0,1283.0,371.0,3329.0,193.0,False,1,0,0,1,0,0,0,0,0,0,0
4,0,16.0,0,303.0,70.0,151.0,565.0,2.0,True,0,0,0,1,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,0,41.0,1,0.0,6819.0,0.0,1643.0,74.0,False,1,0,0,0,0,0,0,0,0,0,0
8689,1,18.0,0,0.0,0.0,0.0,0.0,0.0,False,0,0,1,0,0,0,0,0,0,1,0
8690,0,26.0,0,0.0,0.0,1872.0,1.0,0.0,True,0,0,0,1,0,0,0,0,0,1,0
8691,0,32.0,0,0.0,1049.0,0.0,353.0,3235.0,False,1,0,0,0,0,0,0,1,0,0,0


**Perform Train Test Split**

In [9]:
#your code here

x = spaceship_clean.drop(columns=['Transported'])
y = spaceship_clean['Transported']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

Normalized data (MinMaxScaler)

In [15]:
# normalize the data
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_norm = scaler.fit_transform(X_train)
X_test_norm = scaler.transform(X_test)

Scaled data (StandardScaler)

In [16]:
# Scale the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
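Decision trees split on per-feature thresholds, so a monotonic rescaling such as the two scalers above generally leaves the predictions of tree-based models unchanged, up to floating-point effects. The sketch below checks this on synthetic data from `make_classification` (an assumption, used here so the example is self-contained; all `*_demo` names are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the Spaceship Titanic set
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Fit the scaler on the training split only, as in the cells above
demo_scaler = StandardScaler().fit(Xtr)

# Same tree settings, once on raw features and once on standardized features
tree_raw = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
tree_std = DecisionTreeClassifier(random_state=0).fit(demo_scaler.transform(Xtr), ytr)

pred_raw = tree_raw.predict(Xte)
pred_std = tree_std.predict(demo_scaler.transform(Xte))

# Share of test rows on which both trees predict the same class
agreement = (pred_raw == pred_std).mean()
print(f"Share of identical predictions: {agreement:.3f}")
```

If the agreement is close to 1, small accuracy differences between the raw, normalized, and scaled runs of a tree ensemble are more likely numerical noise than a real effect of scaling.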

**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [29]:
#your code here
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Create and train the Bagging model
bagging_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, max_samples=0.8, bootstrap=True, random_state=42)
bagging_clf.fit(X_train, y_train)

# Predict and evaluate the Bagging model
y_pred_bagging = bagging_clf.predict(X_test)
bagging_accuracy = accuracy_score(y_test, y_pred_bagging)
print(f"Bagging accuracy: {bagging_accuracy}")
print("Classification report for Bagging:\n")
print(classification_report(y_test, y_pred_bagging))

Bagging accuracy: 0.8071104387291982
Classification report for Bagging:

              precision    recall  f1-score   support

       False       0.80      0.81      0.81       653
        True       0.82      0.80      0.81       669

    accuracy                           0.81      1322
   macro avg       0.81      0.81      0.81      1322
weighted avg       0.81      0.81      0.81      1322
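With `bootstrap=True`, each estimator is trained on a random sample of the training rows, so the rows it never saw can serve as a built-in validation set. The following sketch shows the out-of-bag estimate via `oob_score=True`, on synthetic data (an assumption; the Lab itself scores on the held-out test split):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the Spaceship Titanic set
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# oob_score=True requires bootstrap=True
oob_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    max_samples=0.8,
    bootstrap=True,
    oob_score=True,
    random_state=42,
)
oob_clf.fit(X_demo, y_demo)

# Accuracy measured only on rows each estimator did not train on
print(f"Out-of-bag accuracy: {oob_clf.oob_score_:.3f}")
```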



Testing the model with normalized data

In [30]:
# Create and train the Bagging model
bagging_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, max_samples=0.8, bootstrap=True, random_state=42)
bagging_clf.fit(X_train_norm, y_train)

# Predict and evaluate the Bagging model
y_pred_bagging = bagging_clf.predict(X_test_norm)
bagging_accuracy_norm = accuracy_score(y_test, y_pred_bagging)
print(f"Bagging accuracy: {bagging_accuracy_norm}")
print("Classification report for Bagging, normalized data:\n")
print(classification_report(y_test, y_pred_bagging))

Bagging accuracy: 0.8063540090771558
Classification report for Bagging, normalized data:

              precision    recall  f1-score   support

       False       0.80      0.81      0.81       653
        True       0.81      0.80      0.81       669

    accuracy                           0.81      1322
   macro avg       0.81      0.81      0.81      1322
weighted avg       0.81      0.81      0.81      1322



Testing the model with scaled data

In [31]:
# Create and train the Bagging model
bagging_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, max_samples=0.8, bootstrap=True, random_state=42)
bagging_clf.fit(X_train_scaled, y_train)

# Predict and evaluate the Bagging model
y_pred_bagging = bagging_clf.predict(X_test_scaled)
bagging_accuracy_scaled = accuracy_score(y_test, y_pred_bagging)
print(f"Bagging accuracy: {bagging_accuracy_scaled}")
print("Classification report for Bagging, scaled data:\n")
print(classification_report(y_test, y_pred_bagging))

Bagging accuracy: 0.8078668683812406
Classification report for Bagging, scaled data:

              precision    recall  f1-score   support

       False       0.80      0.81      0.81       653
        True       0.82      0.80      0.81       669

    accuracy                           0.81      1322
   macro avg       0.81      0.81      0.81      1322
weighted avg       0.81      0.81      0.81      1322
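The cells above use `bootstrap=True`, which is bagging (sampling with replacement). Pasting is the same ensemble with `bootstrap=False`, so each estimator sees a subset of distinct rows. A minimal sketch comparing the two on synthetic data (an assumption; the `*_demo` names and the smaller `n_estimators` are illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the Spaceship Titanic set
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Bagging: rows drawn with replacement
bag_demo = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                             max_samples=0.8, bootstrap=True, random_state=42)
# Pasting: rows drawn without replacement
paste_demo = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                               max_samples=0.8, bootstrap=False, random_state=42)

for name, clf in [("Bagging", bag_demo), ("Pasting", paste_demo)]:
    clf.fit(Xtr, ytr)
    acc = accuracy_score(yte, clf.predict(Xte))
    # Row indices drawn for the first estimator; duplicates only occur with bagging
    drawn = clf.estimators_samples_[0]
    print(f"{name}: accuracy={acc:.3f}, rows drawn={len(drawn)}, distinct rows={len(np.unique(drawn))}")
```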



- Random Forests

In [32]:
#your code here
from sklearn.ensemble import RandomForestClassifier

# Create and train the Random Forest model
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

# Predict and evaluate the Random Forest model
y_pred_rf = rf_clf.predict(X_test)
rf_accuracy = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest accuracy: {rf_accuracy}")
print("Classification report for Random Forest:\n")
print(classification_report(y_test, y_pred_rf))

Random Forest accuracy: 0.8101361573373677
Classification report for Random Forest:

              precision    recall  f1-score   support

       False       0.80      0.82      0.81       653
        True       0.82      0.80      0.81       669

    accuracy                           0.81      1322
   macro avg       0.81      0.81      0.81      1322
weighted avg       0.81      0.81      0.81      1322



Testing the model with normalized data

In [33]:
# Create and train the Random Forest model
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train_norm, y_train)

# Predict and evaluate the Random Forest model
y_pred_rf = rf_clf.predict(X_test_norm)
rf_accuracy_norm = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest accuracy: {rf_accuracy_norm}")
print("Classification report for Random Forest, normalized data:\n")
print(classification_report(y_test, y_pred_rf))

Random Forest accuracy: 0.8116490166414524
Classification report for Random Forest, normalized data:

              precision    recall  f1-score   support

       False       0.80      0.82      0.81       653
        True       0.82      0.80      0.81       669

    accuracy                           0.81      1322
   macro avg       0.81      0.81      0.81      1322
weighted avg       0.81      0.81      0.81      1322



Testing the model with scaled data

In [34]:
# Create and train the Random Forest model
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train_scaled, y_train)

# Predict and evaluate the Random Forest model
y_pred_rf = rf_clf.predict(X_test_scaled)
rf_accuracy_scaled = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest accuracy: {rf_accuracy_scaled}")
print("Classification report for Random Forest, scaled data:\n")
print(classification_report(y_test, y_pred_rf))

Random Forest accuracy: 0.8093797276853253
Classification report for Random Forest, scaled data:

              precision    recall  f1-score   support

       False       0.80      0.82      0.81       653
        True       0.82      0.80      0.81       669

    accuracy                           0.81      1322
   macro avg       0.81      0.81      0.81      1322
weighted avg       0.81      0.81      0.81      1322
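A fitted random forest also reports how much each feature contributed to its splits, which can help when revisiting feature selection. A minimal sketch of reading `feature_importances_`, on synthetic data with hypothetical feature names (both are assumptions; with the Lab data one would pass `X_train.columns` as the index instead):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, not the Spaceship Titanic set
X_demo, y_demo = make_classification(n_samples=500, n_features=6,
                                     n_informative=3, random_state=0)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # hypothetical names

rf_demo = RandomForestClassifier(n_estimators=100, random_state=42)
rf_demo.fit(X_demo, y_demo)

# Impurity-based importances, normalized by scikit-learn to sum to 1
importances = pd.Series(rf_demo.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so they are best read as a rough ranking rather than a precise measurement.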



- Gradient Boosting

In [35]:
#your code here
from sklearn.ensemble import GradientBoostingClassifier

# Create and train the Gradient Boosting model
gb_clf = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_clf.fit(X_train, y_train)

# Predict and evaluate the Gradient Boosting model
y_pred_gb = gb_clf.predict(X_test)
gb_accuracy = accuracy_score(y_test, y_pred_gb)
print(f"Gradient Boosting accuracy: {gb_accuracy}")
print("Classification report for Gradient Boosting:\n")
print(classification_report(y_test, y_pred_gb))

Gradient Boosting accuracy: 0.8071104387291982
Classification report for Gradient Boosting:

              precision    recall  f1-score   support

       False       0.84      0.76      0.79       653
        True       0.78      0.86      0.82       669

    accuracy                           0.81      1322
   macro avg       0.81      0.81      0.81      1322
weighted avg       0.81      0.81      0.81      1322



Testing the model with normalized data

In [36]:
# Create and train the Gradient Boosting model
gb_clf = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_clf.fit(X_train_norm, y_train)

# Predict and evaluate the Gradient Boosting model
y_pred_gb = gb_clf.predict(X_test_norm)
gb_accuracy_norm = accuracy_score(y_test, y_pred_gb)
print(f"Gradient Boosting accuracy: {gb_accuracy_norm}")
print("Classification report for Gradient Boosting, normalized data:\n")
print(classification_report(y_test, y_pred_gb))

Gradient Boosting accuracy: 0.8071104387291982
Classification report for Gradient Boosting, normalized data:

              precision    recall  f1-score   support

       False       0.84      0.76      0.79       653
        True       0.78      0.86      0.82       669

    accuracy                           0.81      1322
   macro avg       0.81      0.81      0.81      1322
weighted avg       0.81      0.81      0.81      1322



Testing the model with scaled data

In [37]:
# Create and train the Gradient Boosting model
gb_clf = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_clf.fit(X_train_scaled, y_train)

# Predict and evaluate the Gradient Boosting model
y_pred_gb = gb_clf.predict(X_test_scaled)
gb_accuracy_scaled = accuracy_score(y_test, y_pred_gb)
print(f"Gradient Boosting accuracy: {gb_accuracy_scaled}")
print("Classification report for Gradient Boosting, scaled data:\n")
print(classification_report(y_test, y_pred_gb))

Gradient Boosting accuracy: 0.8071104387291982
Classification report for Gradient Boosting, scaled data:

              precision    recall  f1-score   support

       False       0.84      0.76      0.79       653
        True       0.78      0.86      0.82       669

    accuracy                           0.81      1322
   macro avg       0.81      0.81      0.81      1322
weighted avg       0.81      0.81      0.81      1322
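Gradient boosting adds trees one at a time, so `n_estimators=100` is a choice rather than a given. The sketch below uses `staged_predict` to trace accuracy after each added tree on a separate validation split. It runs on synthetic data (an assumption), and the names `Xval`, `yval`, and `best_n` are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the Spaceship Titanic set
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
Xtr, Xval, ytr, yval = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

gb_demo = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_demo.fit(Xtr, ytr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 100 trees
stage_acc = [accuracy_score(yval, pred) for pred in gb_demo.staged_predict(Xval)]

# First stage count that reaches the best validation accuracy
best_n = int(np.argmax(stage_acc)) + 1
print(f"Best validation accuracy {max(stage_acc):.3f} reached with {best_n} trees")
```

Choosing the tree count this way should use a validation split, not the final test set, so that the reported test accuracy stays an honest estimate.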



- Adaptive Boosting

In [38]:
#your code here
from sklearn.ensemble import AdaBoostClassifier

# Create and train the AdaBoost model
ada_clf = AdaBoostClassifier(n_estimators=100, random_state=42)
ada_clf.fit(X_train, y_train)

# Predict and evaluate the AdaBoost model
y_pred_ada = ada_clf.predict(X_test)
ada_accuracy = accuracy_score(y_test, y_pred_ada)
print(f"AdaBoost accuracy: {ada_accuracy}")
print("Classification report for AdaBoost:\n")
print(classification_report(y_test, y_pred_ada))

AdaBoost accuracy: 0.7957639939485628
Classification report for AdaBoost:

              precision    recall  f1-score   support

       False       0.82      0.76      0.79       653
        True       0.78      0.83      0.80       669

    accuracy                           0.80      1322
   macro avg       0.80      0.80      0.80      1322
weighted avg       0.80      0.80      0.80      1322



Testing the model with normalized data

In [39]:
# Create and train the AdaBoost model
ada_clf = AdaBoostClassifier(n_estimators=100, random_state=42)
ada_clf.fit(X_train_norm, y_train)

# Predict and evaluate the AdaBoost model
y_pred_ada = ada_clf.predict(X_test_norm)
ada_accuracy_norm = accuracy_score(y_test, y_pred_ada)
print(f"AdaBoost accuracy: {ada_accuracy_norm}")
print("Classification report for AdaBoost, normalized data:\n")
print(classification_report(y_test, y_pred_ada))

AdaBoost accuracy: 0.7957639939485628
Classification report for AdaBoost, normalized data:

              precision    recall  f1-score   support

       False       0.82      0.76      0.79       653
        True       0.78      0.83      0.80       669

    accuracy                           0.80      1322
   macro avg       0.80      0.80      0.80      1322
weighted avg       0.80      0.80      0.80      1322



Testing the model with scaled data

In [40]:
# Create and train the AdaBoost model
ada_clf = AdaBoostClassifier(n_estimators=100, random_state=42)
ada_clf.fit(X_train_scaled, y_train)

# Predict and evaluate the AdaBoost model
y_pred_ada = ada_clf.predict(X_test_scaled)
ada_accuracy_scaled = accuracy_score(y_test, y_pred_ada)
print(f"AdaBoost accuracy: {ada_accuracy_scaled}")
print("Classification report for AdaBoost, scaled data:\n")
print(classification_report(y_test, y_pred_ada))

AdaBoost accuracy: 0.7957639939485628
Classification report for AdaBoost, scaled data:

              precision    recall  f1-score   support

       False       0.82      0.76      0.79       653
        True       0.78      0.83      0.80       669

    accuracy                           0.80      1322
   macro avg       0.80      0.80      0.80      1322
weighted avg       0.80      0.80      0.80      1322
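The AdaBoost cells rely on the default weak learner, which in scikit-learn is a depth-1 decision tree (a stump). One way to explore beyond that default is to pass a deeper tree explicitly. The following is a sketch on synthetic data (an assumption), not a tuned configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the Spaceship Titanic set
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# The weak learner is passed positionally, as BaggingClassifier is above
ada_variants = {
    "stumps (max_depth=1)": AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=42),
    "deeper trees (max_depth=3)": AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=3), n_estimators=50, random_state=42),
}

ada_scores = {}
for name, clf in ada_variants.items():
    clf.fit(Xtr, ytr)
    ada_scores[name] = accuracy_score(yte, clf.predict(Xte))
    print(f"AdaBoost with {name}: accuracy={ada_scores[name]:.3f}")
```

Deeper weak learners fit the training data more closely, which can help or hurt on held-out data, so the comparison is worth checking rather than assuming.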



Which model is the best and why?

In [41]:
#comment here
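# Random Forest on normalized data has the highest accuracy (0.8116), but the margin is small:
# all twelve runs fall between 0.796 and 0.812, a spread of under two percentage points.
# Feature scaling barely matters for these tree-based ensembles: Gradient Boosting and AdaBoost
# score identically on the raw, normalized, and scaled data.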
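Accuracies this close on a single train/test split are hard to rank reliably. A sketch of comparing the four ensembles with 5-fold cross-validation follows, on synthetic data (an assumption; with the Lab data, `x` and `y` from the split cell would take the place of `X_demo` and `y_demo`). The smaller `n_estimators` is an illustrative choice to keep the example fast:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the Spaceship Titanic set
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    "Bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                                 max_samples=0.8, bootstrap=True, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=42),
}

# One accuracy per fold for each model
cv_scores = {name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
             for name, model in models.items()}

for name, scores in cv_scores.items():
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

If the fold-to-fold standard deviation is larger than the gap between two models' means, the data does not support calling one of them better.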
