# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see whether you can obtain a better model than before. To make the comparison fair, apply the same feature scaling and feature engineering as in the previous Lab.

In [67]:
#Libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
# The target (Transported) is a boolean label, so the classifier variants are the ones needed here
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

In [45]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [46]:
spaceship = spaceship.dropna()

Now perform the same as before:
- Feature Scaling
- Feature Selection


In [47]:
#your code here
features = spaceship.drop(columns = ["Transported", "PassengerId", "Name"])
target = spaceship["Transported"]

**Perform Train Test Split**

In [48]:
#your code here


In [49]:
spaceship["Cabin"].unique()

array(['B/0/P', 'F/0/S', 'A/0/S', ..., 'G/1499/S', 'G/1500/S', 'E/608/S'],
      dtype=object)

In [50]:
spaceship['Cabin'].apply(lambda x: isinstance(x, float)).any()

False

In [51]:
spaceship = spaceship[spaceship['Cabin'].apply(lambda x: not isinstance(x, float))]

In [52]:
spaceship['Cabin'] = spaceship['Cabin'].apply(lambda x: x.split('/')[0])
spaceship['Cabin'].unique()

array(['B', 'F', 'A', 'G', 'E', 'C', 'D', 'T'], dtype=object)

In [53]:
spaceship.drop(columns=['PassengerId', 'Name'], inplace=True)

In [54]:
spaceship = pd.get_dummies(spaceship, columns=['Cabin', 'HomePlanet', 'Destination', 'VIP', 'CryoSleep'])
spaceship.head()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,Cabin_A,Cabin_B,Cabin_C,...,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_False,VIP_True,CryoSleep_False,CryoSleep_True
0,39.0,0.0,0.0,0.0,0.0,0.0,False,False,True,False,...,False,True,False,False,False,True,True,False,True,False
1,24.0,109.0,9.0,25.0,549.0,44.0,True,False,False,False,...,True,False,False,False,False,True,True,False,True,False
2,58.0,43.0,3576.0,0.0,6715.0,49.0,False,True,False,False,...,False,True,False,False,False,True,False,True,True,False
3,33.0,0.0,1283.0,371.0,3329.0,193.0,False,True,False,False,...,False,True,False,False,False,True,True,False,True,False
4,16.0,303.0,70.0,151.0,565.0,2.0,True,False,False,False,...,True,False,False,False,False,True,True,False,True,False


In [55]:
features = spaceship.drop(columns = ["Transported"])
target = spaceship["Transported"]

In [56]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.20, random_state=0)

In [57]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

In [58]:
normalizer = MinMaxScaler()

normalizer.fit(X_train)

In [59]:
X_train_norm = normalizer.transform(X_train)

X_test_norm = normalizer.transform(X_test)

In [60]:
X_train_norm = pd.DataFrame(X_train_norm, columns = X_train.columns)
X_train_norm.head()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Cabin_A,Cabin_B,Cabin_C,Cabin_D,...,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_False,VIP_True,CryoSleep_False,CryoSleep_True
0,0.405063,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0
1,0.050633,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0
2,0.379747,0.0,0.007916,0.0,0.051276,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0
3,0.21519,0.00131,0.0,0.046111,0.016378,4.9e-05,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0
4,0.329114,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0


In [66]:
X_test_norm = pd.DataFrame(X_test_norm, columns = X_test.columns)
X_test_norm.head()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Cabin_A,Cabin_B,Cabin_C,Cabin_D,...,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_False,VIP_True,CryoSleep_False,CryoSleep_True
0,0.632911,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
1,0.227848,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
2,0.189873,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
3,0.658228,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,0.78481,0.0,0.054775,0.0,0.07774,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0


**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [86]:
#your code here
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Create the BaggingClassifier model
bagging_clf = BaggingClassifier(DecisionTreeClassifier(max_depth=4),
                                n_estimators=100,
                                max_samples=1000)


In [87]:
bagging_clf.fit(X_train_norm, y_train)

In [88]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Predict with the BaggingClassifier model
pred = bagging_clf.predict(X_test_norm)

# Compute and print the classification metrics
print("Accuracy:", accuracy_score(y_test, pred))
print("Precision:", precision_score(y_test, pred, average='weighted'))
print("Recall:", recall_score(y_test, pred, average='weighted'))
print("F1 Score:", f1_score(y_test, pred, average='weighted'))

# Confusion matrix (optional)
print("Confusion Matrix:\n", confusion_matrix(y_test, pred))


Accuracy: 0.7776096822995462
Precision: 0.7780398629210651
Recall: 0.7776096822995462
F1 Score: 0.7775236289702164
Confusion Matrix:
 [[527 134]
 [160 501]]
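
The heading above names both bagging and pasting, but the cell only exercises bagging, since `BaggingClassifier` samples with replacement by default. As a minimal sketch, the two can be compared by toggling `bootstrap`. The data here is a synthetic stand-in from `make_classification` (an assumption, not the Spaceship Titanic set), and the `demo_` and `Xd_`/`yd_` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): replace with X_train_norm / y_train in the notebook.
X_demo, y_demo = make_classification(n_samples=2000, n_features=20, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

demo_scores = {}
for name, use_bootstrap in [("bagging", True), ("pasting", False)]:
    # bootstrap=True samples with replacement (bagging);
    # bootstrap=False samples without replacement (pasting).
    clf = BaggingClassifier(
        DecisionTreeClassifier(max_depth=4),
        n_estimators=100,
        max_samples=1000,
        bootstrap=use_bootstrap,
        random_state=0,
    )
    clf.fit(Xd_train, yd_train)
    demo_scores[name] = clf.score(Xd_test, yd_test)

print(demo_scores)
```

Fixing `random_state` makes the comparison repeatable. The notebook cells above do not set it, so their scores can shift slightly between runs.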


- Random Forests

In [99]:
#your code here
from sklearn.ensemble import RandomForestClassifier

# Create the RandomForestClassifier model
forest_clf = RandomForestClassifier(n_estimators=100,
                                    max_depth=7)


In [100]:
# Fit the model to the training data
forest_clf.fit(X_train_norm, y_train)

In [101]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Predict with the RandomForestClassifier model
pred = forest_clf.predict(X_test_norm)

# Compute and print the classification metrics
print("Accuracy:", accuracy_score(y_test, pred))
print("Precision:", precision_score(y_test, pred, average='weighted'))
print("Recall:", recall_score(y_test, pred, average='weighted'))
print("F1 Score:", f1_score(y_test, pred, average='weighted'))

# Confusion matrix (optional)
print("Confusion Matrix:\n", confusion_matrix(y_test, pred))


Accuracy: 0.7844175491679274
Precision: 0.7849660612731609
Recall: 0.7844175491679274
F1 Score: 0.7843137591643895
Confusion Matrix:
 [[533 128]
 [157 504]]
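
Since the Lab also asks for feature selection, a fitted random forest can hint at which columns matter through its `feature_importances_` attribute. The sketch below is illustrative only: it runs on synthetic data (an assumption), and the `feature_i` column names are hypothetical placeholders for the real one-hot encoded columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): in the notebook, use X_train_norm / y_train instead.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # hypothetical names

forest_demo = RandomForestClassifier(n_estimators=100, max_depth=7, random_state=0)
forest_demo.fit(X_demo, y_demo)

# Impurity-based importances, one value per column; they sum to 1.
importances = pd.Series(
    forest_demo.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances.head())
```

In the notebook, passing `X_train_norm.columns` as the index would label each importance with its actual column name.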


- Gradient Boosting

In [133]:
#your code here
from sklearn.ensemble import GradientBoostingClassifier

# Create the GradientBoostingClassifier model
gb_clf = GradientBoostingClassifier(max_depth=15,
                                    n_estimators=100)


In [134]:
gb_clf.fit(X_train_norm, y_train)

In [135]:
pred = gb_clf.predict(X_test_norm)

# Compute and print the classification metrics
print("Accuracy:", accuracy_score(y_test, pred))
print("Precision:", precision_score(y_test, pred, average='weighted'))
print("Recall:", recall_score(y_test, pred, average='weighted'))
print("F1 Score:", f1_score(y_test, pred, average='weighted'))

# Confusion matrix (optional)
print("Confusion Matrix:\n", confusion_matrix(y_test, pred))

Accuracy: 0.7866868381240545
Precision: 0.7866894627583729
Recall: 0.7866868381240545
F1 Score: 0.7866863499038724
Confusion Matrix:
 [[521 140]
 [142 519]]
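
The model above uses `max_depth=15`, which is deep for gradient boosting, where shallow trees are the more common choice. One way to check whether the depth helps or just memorises the training set is to compare train and test accuracy at two depths. This is a rough sketch on synthetic data (an assumption), and the `depth_results` name is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): swap in X_train_norm / X_test_norm in the notebook.
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

depth_results = {}
for depth in (3, 15):
    gb_demo = GradientBoostingClassifier(
        max_depth=depth, n_estimators=100, random_state=0
    )
    gb_demo.fit(Xd_train, yd_train)
    # A large gap between train and test accuracy suggests overfitting.
    depth_results[depth] = {
        "train": gb_demo.score(Xd_train, yd_train),
        "test": gb_demo.score(Xd_test, yd_test),
    }

for depth, scores in depth_results.items():
    print(f"max_depth={depth}: train={scores['train']:.3f}, test={scores['test']:.3f}")
```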


- Adaptive Boosting

In [124]:
#your code here
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Create the AdaBoostClassifier model
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=8),
                             n_estimators=100)


In [125]:
ada_clf.fit(X_train_norm, y_train)



In [126]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Predict with the AdaBoostClassifier model
pred = ada_clf.predict(X_test_norm)

# Compute and print the classification metrics
print("Accuracy:", accuracy_score(y_test, pred))
print("Precision:", precision_score(y_test, pred, average='weighted'))
print("Recall:", recall_score(y_test, pred, average='weighted'))
print("F1 Score:", f1_score(y_test, pred, average='weighted'))

# Confusion matrix (optional)
print("Confusion Matrix:\n", confusion_matrix(y_test, pred))


Accuracy: 0.7783661119515886
Precision: 0.7784177273143484
Recall: 0.7783661119515886
F1 Score: 0.7783558393983074
Confusion Matrix:
 [[519 142]
 [151 510]]
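
AdaBoost adds one estimator per round, so it can be useful to see how accuracy evolves as rounds accumulate. The sketch below uses `staged_score` for that. It runs on synthetic data and boosts depth-1 stumps rather than the depth-8 trees used above; both choices are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): not the Spaceship Titanic set.
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Depth-1 stumps as weak learners (assumption; the notebook uses max_depth=8).
ada_demo = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=100, random_state=0
)
ada_demo.fit(Xd_train, yd_train)

# staged_score yields the accuracy after each boosting round.
staged_accuracy = list(ada_demo.staged_score(Xd_test, yd_test))
best_round = max(range(len(staged_accuracy)), key=staged_accuracy.__getitem__) + 1

print(f"rounds evaluated: {len(staged_accuracy)}")
print(f"best round: {best_round}, accuracy: {staged_accuracy[best_round - 1]:.3f}")
```

Choosing the number of rounds this way should be done on a separate validation split, not on the final test set, to avoid an optimistic estimate.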


Which model is the best and why?

In [65]:
#comment here
# The best model is the GradientBoostingClassifier, with accuracy, precision, recall and F1 score all around 0.787 on the test set.
# The RandomForestClassifier is very close (accuracy around 0.784), while the AdaBoostClassifier and the BaggingClassifier score slightly lower (both around 0.778).
# The margins are small (less than 0.01 in accuracy between the best and the worst model), and the GradientBoostingClassifier is also the slowest to train.
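
To make the comparison concrete, the accuracies can be recomputed from the confusion matrices printed in the four evaluation cells above. The counts below are copied from those outputs; the variable names are hypothetical.

```python
import numpy as np

# Confusion matrices copied from the evaluation outputs above.
reported_confusion = {
    "BaggingClassifier": np.array([[527, 134], [160, 501]]),
    "RandomForestClassifier": np.array([[533, 128], [157, 504]]),
    "GradientBoostingClassifier": np.array([[521, 140], [142, 519]]),
    "AdaBoostClassifier": np.array([[519, 142], [151, 510]]),
}

# Accuracy = correct predictions (diagonal) / all predictions.
accuracy_by_model = {
    name: cm.trace() / cm.sum() for name, cm in reported_confusion.items()
}

best_model = max(accuracy_by_model, key=accuracy_by_model.get)
worst_model = min(accuracy_by_model, key=accuracy_by_model.get)
gap_in_predictions = int(
    reported_confusion[best_model].trace() - reported_confusion[worst_model].trace()
)

for name, acc in sorted(accuracy_by_model.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {acc:.4f}")
print(f"best: {best_model}, worst: {worst_model}, gap: {gap_in_predictions} predictions")
```

The best and worst models differ by 12 correct predictions out of 1322 test rows. Because none of the models sets `random_state`, a gap this small could plausibly change order on a rerun.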
