# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see whether you can obtain a better model than before. To make the comparison fair, apply the same feature scaling and feature engineering that you used in the previous Lab.

In [18]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

from sklearn.metrics import f1_score, accuracy_score, confusion_matrix

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same as before:
- Feature Scaling
- Feature Selection
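
The scaling step can be sketched as follows. This is a minimal illustration on a toy frame, not the exact preprocessing from the previous Lab: the `toy` DataFrame, the choice of `MinMaxScaler`, and all variable names are assumptions made for the example. The point it shows is that the scaler is fitted on the training rows only and then reused on the test rows. Tree-based models are largely insensitive to feature scale, so this matters mostly for keeping the comparison with the previous Lab consistent.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy frame standing in for two numeric columns of the dataset.
toy = pd.DataFrame({
    "Age": [39.0, 24.0, 58.0, 33.0, 16.0, 41.0, 18.0, 26.0],
    "Spa": [0.0, 549.0, 6715.0, 3329.0, 565.0, 1643.0, 0.0, 1.0],
})

toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=0)

# Fit the scaler on the training rows only, then apply it to both splits,
# so that no information from the test rows leaks into the scaling.
scaler = MinMaxScaler()
toy_train_scaled = pd.DataFrame(
    scaler.fit_transform(toy_train), columns=toy.columns, index=toy_train.index
)
toy_test_scaled = pd.DataFrame(
    scaler.transform(toy_test), columns=toy.columns, index=toy_test.index
)

print(toy_train_scaled.describe().loc[["min", "max"]])
```

Test rows can fall slightly outside the [0, 1] range after scaling, because the minimum and maximum come from the training rows alone.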


In [3]:
# Drop every row that contains a missing value
spaceship.dropna(inplace=True)

In [4]:
# Keep only the first character (the deck letter) of each cabin code
spaceship['Cabin'] = spaceship['Cabin'].apply(lambda x: str(x)[0] if pd.notna(x) else x)
spaceship['Cabin'].value_counts()

Cabin
F    2152
G    1973
E     683
B     628
C     587
D     374
A     207
T       2
Name: count, dtype: int64

In [5]:
# Drop the identifier columns
spaceship = spaceship.drop(columns = ["PassengerId", "Name"])
spaceship

Unnamed: 0,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported
0,Europa,False,B,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False
1,Earth,False,F,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True
2,Europa,False,A,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False
3,Europa,False,A,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False
4,Earth,False,F,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True
...,...,...,...,...,...,...,...,...,...,...,...,...
8688,Europa,False,A,55 Cancri e,41.0,True,0.0,6819.0,0.0,1643.0,74.0,False
8689,Earth,True,G,PSO J318.5-22,18.0,False,0.0,0.0,0.0,0.0,0.0,False
8690,Earth,False,G,TRAPPIST-1e,26.0,False,0.0,0.0,1872.0,1.0,0.0,True
8691,Europa,False,E,55 Cancri e,32.0,False,0.0,1049.0,0.0,353.0,3235.0,False


In [6]:
# One-hot encode the categorical columns
spaceship = pd.get_dummies(spaceship, columns=['HomePlanet', 'Cabin', 'Destination'])
spaceship

Unnamed: 0,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,HomePlanet_Earth,...,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e
0,False,39.0,False,0.0,0.0,0.0,0.0,0.0,False,False,...,True,False,False,False,False,False,False,False,False,True
1,False,24.0,False,109.0,9.0,25.0,549.0,44.0,True,True,...,False,False,False,False,True,False,False,False,False,True
2,False,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False,False,...,False,False,False,False,False,False,False,False,False,True
3,False,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False,False,...,False,False,False,False,False,False,False,False,False,True
4,False,16.0,False,303.0,70.0,151.0,565.0,2.0,True,True,...,False,False,False,False,True,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,False,41.0,True,0.0,6819.0,0.0,1643.0,74.0,False,False,...,False,False,False,False,False,False,False,True,False,False
8689,True,18.0,False,0.0,0.0,0.0,0.0,0.0,False,True,...,False,False,False,False,False,True,False,False,True,False
8690,False,26.0,False,0.0,0.0,1872.0,1.0,0.0,True,True,...,False,False,False,False,False,True,False,False,False,True
8691,False,32.0,False,0.0,1049.0,0.0,353.0,3235.0,False,False,...,False,False,False,True,False,False,False,True,False,False


**Perform Train Test Split**

In [7]:
# Separate the features from the target column
features = spaceship.drop(columns = ['Transported'])
target = spaceship["Transported"]

In [8]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.20, random_state=22)

In [9]:
X_train.head()


Unnamed: 0,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,HomePlanet_Earth,HomePlanet_Europa,...,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e
7537,True,26.0,False,0.0,0.0,0.0,0.0,0.0,False,True,...,True,False,False,False,False,False,False,True,False,False
6310,False,30.0,False,77.0,71.0,1147.0,0.0,0.0,False,False,...,False,False,False,True,False,False,False,False,True,False
1277,False,39.0,False,1535.0,0.0,340.0,0.0,723.0,False,False,...,False,False,False,False,True,False,False,False,False,True
4047,False,25.0,False,412.0,0.0,567.0,775.0,0.0,False,False,...,False,False,False,False,True,False,False,False,False,True
1609,False,23.0,False,2210.0,0.0,89.0,0.0,0.0,False,False,...,False,False,True,False,False,False,False,False,False,True


**Model Selection** - now try different ensemble methods to see whether any of them gives a better model

- Bagging and Pasting

In [15]:
# Bagging: 100 decision trees, each trained on a bootstrap sample of 1000 rows
bagging_cls = BaggingClassifier(DecisionTreeClassifier(max_depth=20),
                               n_estimators=100,
                               max_samples = 1000)

In [16]:
bagging_cls.fit(X_train, y_train)

In [19]:
# Make predictions
pred = bagging_cls.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, pred))
print("F1 Score:", f1_score(y_test, pred, average='weighted'))
print("Confusion Matrix:\n", confusion_matrix(y_test, pred))

Accuracy: 0.8101361573373677
F1 Score: 0.8101495198061605
Confusion Matrix:
 [[534 105]
 [146 537]]
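
The cell above only covers bagging, since `BaggingClassifier` samples with replacement by default. Pasting is the same idea with sampling without replacement, which in scikit-learn means passing `bootstrap=False`. The sketch below compares the two on synthetic data generated with `make_classification`; the data, the hyperparameters, and the variable names are assumptions for illustration and say nothing about which variant does better on Spaceship Titanic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Spaceship Titanic set).
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

results = {}
for name, use_bootstrap in [("bagging", True), ("pasting", False)]:
    model = BaggingClassifier(
        DecisionTreeClassifier(max_depth=5),
        n_estimators=25,
        max_samples=200,
        bootstrap=use_bootstrap,  # True: with replacement, False: without
        random_state=0,
    )
    model.fit(Xd_train, yd_train)
    results[name] = accuracy_score(yd_test, model.predict(Xd_test))

print(results)
```

To try pasting on the Lab data, the same `bootstrap=False` argument could be added to `bagging_cls`.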


- Random Forests

In [20]:
# Random forest: 100 trees, each limited to a depth of 20
forest = RandomForestClassifier(n_estimators=100,
                             max_depth=20)

In [21]:
forest.fit(X_train, y_train)

In [22]:
forest_pred = forest.predict(X_test)

# Evaluate the model
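# Score the forest's own predictions (forest_pred), not the bagging predictions in pred
print("Accuracy:", accuracy_score(y_test, forest_pred))
print("F1 Score:", f1_score(y_test, forest_pred, average='weighted'))
print("Confusion Matrix:\n", confusion_matrix(y_test, forest_pred))

Accuracy: 0.8033  (= (529 + 533) / 1322, from the confusion matrix below)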
F1 Score: 0.80334629587592
Confusion Matrix:
 [[529 110]
 [150 533]]
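
Two extras that a random forest offers are an out-of-bag (OOB) accuracy estimate and per-feature importances. The sketch below shows both on synthetic data; the dataset, the hyperparameters, and the names `rf_demo` and `importances` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption, not the Spaceship Titanic set).
X_rf, y_rf = make_classification(
    n_samples=500, n_features=8, n_informative=3, random_state=1
)

# oob_score=True scores each row using only the trees that did not see it
# in their bootstrap sample, giving a validation estimate without a test set.
rf_demo = RandomForestClassifier(
    n_estimators=100, max_depth=10, oob_score=True, random_state=1
)
rf_demo.fit(X_rf, y_rf)

print("OOB accuracy:", rf_demo.oob_score_)

# Impurity-based importances, one value per feature, summing to 1.
importances = rf_demo.feature_importances_
for idx, value in enumerate(importances):
    print(f"feature {idx}: {value:.3f}")
```

On the Lab data, `forest.feature_importances_` could be paired with `X_train.columns` in the same way to see which columns the forest relies on.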


- Gradient Boosting

In [23]:
# Gradient boosting: 100 trees with a maximum depth of 20
gb_reg = GradientBoostingClassifier(max_depth=20,
                                   n_estimators=100)

In [24]:
gb_reg.fit(X_train, y_train)

In [25]:
gb_pred = gb_reg.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, gb_pred))
print("F1 Score:", f1_score(y_test, gb_pred, average='weighted'))
print("Confusion Matrix:\n", confusion_matrix(y_test, gb_pred))

Accuracy: 0.7798789712556732
F1 Score: 0.7799337859399373
Confusion Matrix:
 [[501 138]
 [153 530]]
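
Gradient boosting is usually run with shallow trees (scikit-learn's default is `max_depth=3`), so a depth of 20 is an unusual setting that may be worth checking. The sketch below compares the two depths on synthetic, slightly noisy data; everything in it (data, `n_estimators=50`, variable names) is an assumption for illustration, and the outcome on the Lab data may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 10% label noise (an assumption).
Xg, yg = make_classification(
    n_samples=600, n_features=10, flip_y=0.1, random_state=2
)
Xg_train, Xg_test, yg_train, yg_test = train_test_split(
    Xg, yg, test_size=0.2, random_state=2
)

depth_scores = {}
for depth in (3, 20):
    gb_demo = GradientBoostingClassifier(
        max_depth=depth, n_estimators=50, random_state=2
    )
    gb_demo.fit(Xg_train, yg_train)
    depth_scores[depth] = accuracy_score(yg_test, gb_demo.predict(Xg_test))

print(depth_scores)
```

Rerunning the Lab's `GradientBoostingClassifier` with a smaller `max_depth` would show whether the depth setting explains part of its lower score.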


- Adaptive Boosting

In [27]:
# AdaBoost: up to 100 decision trees of depth 20 as base estimators
ada_reg = AdaBoostClassifier(DecisionTreeClassifier(max_depth=20),
                               n_estimators=100)

In [29]:
ada_reg.fit(X_train, y_train)



In [30]:
ada_pred = ada_reg.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, ada_pred))
print("F1 Score:", f1_score(y_test, ada_pred, average='weighted'))
print("Confusion Matrix:\n", confusion_matrix(y_test, ada_pred))

Accuracy: 0.7571860816944024
F1 Score: 0.7570829393512
Confusion Matrix:
 [[472 167]
 [154 529]]
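
AdaBoost is normally paired with very weak learners such as depth-1 stumps, which is also scikit-learn's default base estimator. As far as I know, scikit-learn stops boosting as soon as a base estimator reaches zero weighted training error, so a tree deep enough to fit the training set perfectly can leave the ensemble with a single tree. The sketch below checks how many estimators are actually kept for a stump and for an unrestricted tree; the synthetic data and all names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Spaceship Titanic set).
Xa, ya = make_classification(n_samples=500, n_features=10, random_state=3)

# Weak base learner: a decision stump.
stump_boost = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=3
)
# Strong base learner: a tree with no depth limit.
deep_boost = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=None), n_estimators=50, random_state=3
)

stump_boost.fit(Xa, ya)
deep_boost.fit(Xa, ya)

# estimators_ holds the trees that were actually kept after fitting.
print("stumps kept:", len(stump_boost.estimators_))
print("deep trees kept:", len(deep_boost.estimators_))
```

Printing `len(ada_reg.estimators_)` after fitting would show how many of the requested 100 trees the Lab's model really uses.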


Which model is the best and why?

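For this dataset and these settings, the BaggingClassifier produces the best scores (weighted F1 of about 0.810), narrowly ahead of the random forest (0.803) and clearly ahead of gradient boosting (0.780) and AdaBoost (0.757). Bagging randomizes only the rows each tree is trained on, while the random forest also randomizes the features considered at each split, and that extra randomization did not improve the result here. Overall, the estimators that learn in parallel did better than those that learn sequentially. Both boosting models were built on trees with max_depth=20, which is much deeper than boosting normally uses, so part of the gap may come from that setting rather than from the dataset itself. The scores also come from a single train/test split, so the small difference between bagging and the random forest should be read with caution.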
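
A single train/test split can make small differences between models look more meaningful than they are. One way to check the ranking is cross-validation, sketched below on synthetic data. The dataset, the hyperparameters, and names such as `candidates` and `cv_means` are assumptions for illustration; on the Lab data, `features` and `target` would take the place of `Xc` and `yc`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Spaceship Titanic set).
Xc, yc = make_classification(n_samples=400, n_features=10, random_state=4)

candidates = {
    "bagging": BaggingClassifier(
        DecisionTreeClassifier(max_depth=5), n_estimators=30, random_state=4
    ),
    "random_forest": RandomForestClassifier(
        n_estimators=30, max_depth=5, random_state=4
    ),
    "gradient_boosting": GradientBoostingClassifier(
        n_estimators=30, max_depth=3, random_state=4
    ),
    "adaboost": AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=1), n_estimators=30, random_state=4
    ),
}

cv_means = {}
for name, model in candidates.items():
    # 5-fold cross-validation, scored with the same weighted F1 used above.
    scores = cross_val_score(model, Xc, yc, cv=5, scoring="f1_weighted")
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the spread across folds is larger than the gap between two models, the ranking between them is not well established.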