# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see if you can obtain a better model than before. To make the comparison fair, you should apply the same feature scaling and feature engineering as in the previous Lab.

In [144]:
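# Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score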
pd.set_option('display.max_columns', None)

In [145]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same as before:
- Feature Scaling
- Feature Selection
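The cells that follow do not show a scaling step. As a minimal sketch, assuming the previous Lab used standardization (an assumption, since that Lab is not reproduced here), the scaler would be fitted on the training rows only and then reused on the test rows. The arrays `X_train_toy` and `X_test_toy` are illustrative stand-ins, not the real features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative stand-ins for two numeric spending columns (hypothetical values)
X_train_toy = np.array([[0.0, 0.0], [109.0, 9.0], [43.0, 3576.0], [303.0, 70.0]])
X_test_toy = np.array([[0.0, 1283.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_toy)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test_toy)        # reuse the training statistics

print("train means after scaling:", X_train_scaled.mean(axis=0))
print("train stds after scaling: ", X_train_scaled.std(axis=0))
```

Tree-based ensembles are generally insensitive to this kind of per-feature rescaling, so the step matters mainly for keeping the comparison with the previous Lab consistent.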


In [147]:
# Dropping rows with NaN values
spaceship = spaceship.dropna()
spaceship.isnull().sum()

# Reducing 'Cabin' to its deck letter (e.g. 'B/0/P' -> 'B'), one of {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}
spaceship['Cabin'] = spaceship['Cabin'].str[0]

spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [148]:
spaceship['HomePlanet'].unique()

array(['Europa', 'Earth', 'Mars'], dtype=object)

**Perform Train Test Split**

In [150]:
# Create dummy variables for non-numerical columns

spaceship_dummies = pd.get_dummies(spaceship, columns=['HomePlanet', 'Cabin', 'Destination'])


# Dropping the identifier columns and the target from the features
features = spaceship_dummies.drop(columns=['PassengerId', 'Name', 'Transported'])
target = spaceship_dummies['Transported']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=0)

# Display the sizes of the resulting datasets
print("Training set size:", X_train.shape[0])
print("Test set size:", X_test.shape[0])

spaceship_dummies.head()

Training set size: 4624
Test set size: 1982


Unnamed: 0,PassengerId,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,Cabin_A,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e
0,0001_01,False,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False,False,True,False,False,True,False,False,False,False,False,False,False,False,True
1,0002_01,False,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True,True,False,False,False,False,False,False,False,True,False,False,False,False,True
2,0003_01,False,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False,False,True,False,True,False,False,False,False,False,False,False,False,False,True
3,0003_02,False,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False,False,True,False,True,False,False,False,False,False,False,False,False,False,True
4,0004_01,False,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True,True,False,False,False,False,False,False,False,True,False,False,False,False,True


**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

### Bagging and Pasting
[Bagging and Pasting video](https://www.youtube.com/watch?v=tjZcb4nrX2g)
1. Bagging (Bootstrap Aggregating)
Bagging involves sampling the dataset with replacement, so some data points might be selected multiple times in a single subset, while others might not appear.
Each subset is used to train a base model (in this case, a Decision Tree).
After training, predictions from all models are combined (often through majority voting for classification) to make the final prediction.
Bagging can help reduce variance and improve the stability of models, especially those prone to overfitting, like decision trees.

2. Pasting
Pasting is similar to Bagging, but it samples without replacement, so no data point appears more than once within a single subset.
For the subsets to differ from one another, each must be smaller than the full training set; otherwise every model sees exactly the same rows.
Like Bagging, Pasting reduces variance by combining predictions from several models. It can work well when the dataset is large enough, and not too noisy, for each subset to remain representative.

- Why Use Them?
Both methods build multiple models and combine their predictions, which helps increase accuracy and stability compared to using a single model. This is especially useful when working with a dataset that might cause individual models to overfit, as combining the outputs from multiple models reduces this risk.
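The difference between the two sampling schemes, and the majority vote that combines the models, can be sketched with NumPy alone. This is a toy illustration: the row count, the subset size of 7, and the `votes` array are made-up values, not taken from the lab data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 10
indices = np.arange(n_rows)

# Bagging: draw n_rows indices WITH replacement (duplicates are possible)
bagging_sample = rng.choice(indices, size=n_rows, replace=True)

# Pasting: draw a smaller subset WITHOUT replacement (no duplicates)
pasting_sample = rng.choice(indices, size=7, replace=False)

print("bagging sample:", np.sort(bagging_sample))
print("pasting sample:", np.sort(pasting_sample))

# Majority vote: three hypothetical models predict a label for four passengers
votes = np.array([
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
])
# A passenger is labeled 1 when more than half of the models vote 1
majority = (votes.sum(axis=0) > votes.shape[0] / 2).astype(int)
print("majority vote:", majority)  # [1 0 1 1]
```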

In [153]:
# Bagging model

bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # Base model: an unpruned decision tree
    n_estimators=1000,
    max_samples=1.0,  # Each bootstrap sample has as many rows as the training set
    bootstrap=True,   # Sample with replacement (Bagging)
    random_state=0
)
bagging_model.fit(X_train, y_train)
y_pred_bagging = bagging_model.predict(X_test)
bagging_accuracy = accuracy_score(y_test, y_pred_bagging)

print("Bagging Model Accuracy:", bagging_accuracy)


Bagging Model Accuracy: 0.784561049445005


In [154]:
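# Pasting model
# Note: with max_samples=1.0 and bootstrap=False, every tree is trained on the
# same full training set, so the trees differ only through their own randomness
# (such as tie-breaking between equally good splits).
# Set max_samples below 1.0 to give each tree a different subset.
pasting_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # Base model: an unpruned decision tree
    n_estimators=1000,
    max_samples=1.0,  # Fraction of training rows drawn for each tree
    bootstrap=False,  # Sample without replacement (Pasting)
    random_state=0
)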
pasting_model.fit(X_train, y_train)
y_pred_pasting = pasting_model.predict(X_test)
pasting_accuracy = accuracy_score(y_test, y_pred_pasting)

print("Pasting Model Accuracy:", pasting_accuracy)

Pasting Model Accuracy: 0.7532795156407669


### Random Forests

In [169]:
# Random Forests

from sklearn.ensemble import RandomForestClassifier

# Create the Random Forest model
random_forest_model = RandomForestClassifier(
    n_estimators=100,        # Number of trees in the forest
    max_depth=None,          # No limit on depth of each tree
    random_state=0          # Random state for reproducibility
)

# Train the model
random_forest_model.fit(X_train, y_train)

# Predict on test data
y_pred = random_forest_model.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Random Forest Model Accuracy:", accuracy)


Random Forest Model Accuracy: 0.7870837537840565


### Gradient Boosting

In [183]:
# Gradient Boosting

from sklearn.ensemble import GradientBoostingClassifier


# Create the Gradient Boosting model
gradient_boosting_model = GradientBoostingClassifier(
    n_estimators=100,      # Number of boosting stages
    learning_rate=0.1,     # Shrinks the contribution of each tree
    max_depth=3,           # Depth of each tree
    random_state=0        # For reproducibility
)

# Train the model
gradient_boosting_model.fit(X_train, y_train)

# Predict on the test set
y_pred = gradient_boosting_model.predict(X_test)

# Evaluate the model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Gradient Boosting Model Accuracy:", accuracy)


Gradient Boosting Model Accuracy: 0.7901109989909183


### Adaptive Boosting

In [191]:
# Adaptive Boosting

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Create the AdaBoost model with a Decision Tree as the base estimator
adaboost_model = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # Weak learner with max depth 1
    n_estimators=100,       # Number of boosting rounds
    learning_rate=0.5,      # Weight applied to each weak learner's contribution
    random_state=0         # For reproducibility
)

# Train the model
adaboost_model.fit(X_train, y_train)

# Predict on the test set
y_pred = adaboost_model.predict(X_test)

# Evaluate the model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print("AdaBoost Model Accuracy:", accuracy)




AdaBoost Model Accuracy: 0.786074672048436


### Which model is the best and why?

In [162]:
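# The best model, based on test accuracy, is Gradient Boosting (approximately 0.7901).
# Boosting fits its trees sequentially, each one correcting the errors of the previous
# ones, which can reduce bias as well as variance.
# Random Forest (0.7871), AdaBoost (0.7861) and Bagging (0.7846) are all within about
# 0.6 percentage points of it, so a single train/test split does not settle the ranking
# among these four. Pasting (0.7533) is the weakest by a clear margin.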
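The accuracies reported above come from a single test split of 1982 rows. A rough way to gauge whether the gaps between models are meaningful is the binomial standard error of an accuracy estimate, sqrt(p(1 - p) / n). The sketch below applies it to the reported scores. It treats test rows as independent and ignores the fact that all models were scored on the same rows, so it is only an approximation.

```python
import math

n_test = 1982  # test set size reported above

# Test accuracies reported in the cells above
accuracies = {
    "Gradient Boosting": 0.7901109989909183,
    "Random Forest": 0.7870837537840565,
    "AdaBoost": 0.786074672048436,
    "Bagging": 0.784561049445005,
    "Pasting": 0.7532795156407669,
}

half_widths = {}
for name, acc in accuracies.items():
    se = math.sqrt(acc * (1 - acc) / n_test)  # binomial standard error
    half_widths[name] = 1.96 * se             # approximate 95% half-width
    print(f"{name:18s} acc={acc:.4f} +/- {half_widths[name]:.4f}")
```

Each half-width comes out near 0.018, which is larger than the gaps between the four best models. Cross-validation would give a more reliable comparison.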