# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [39]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [40]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [41]:
#your code here

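# Data cleaning (no scaling is applied in this cell)
# Drop rows with missing values; .copy() avoids pandas' SettingWithCopyWarning on the assignment below
spaceship_clean = spaceship.dropna().copy()

# Keep only the first character of each Cabin value (e.g. 'B/0/P' -> 'B')
spaceship_clean.loc[:, 'Cabin'] = spaceship_clean['Cabin'].apply(lambda x: x[0])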

spaceship_clean['Cabin'].value_counts()

spaceship_clean = spaceship_clean.drop(columns=['PassengerId', 'Name'])

spaceship_clean.dtypes


HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Transported        bool
dtype: object

In [42]:
# Feature engineering: one-hot encode the categorical columns

spaceship_clean = pd.get_dummies(spaceship_clean, columns=['Destination', 'HomePlanet', 'Cabin'])

**Perform Train Test Split**

In [43]:
#your code here

# Keep only the int64/float64 columns. Note: this selects just the six float columns
# (Age and the five spending columns). The one-hot columns created above have a
# different dtype, so they are not part of the feature matrix.
numerical_features = spaceship_clean.drop(columns=['Transported']).select_dtypes(include=['int64', 'float64'])

target = spaceship_clean['Transported']

X = numerical_features  # Features (numeric columns only)
y = target  # Target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# test_size=0.2: sets aside 20% of the data for testing, so 80% is used for training.

# random_state=42: sets a seed for reproducibility, so you get the same split
# every time you run the script (any integer works; 42 is a common convention).

# Check the shape of the splits to confirm
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (5284, 6)
X_test shape: (1322, 6)
y_train shape: (5284,)
y_test shape: (1322,)


**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [44]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

In [45]:
#your code here

# Initialize a base estimator, e.g., a Decision Tree
estimator = DecisionTreeClassifier(random_state=42)

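# Bagging: sample the training set with replacement (bootstrap=True is the default)
bagging_clf = BaggingClassifier(estimator=estimator, n_estimators=100, bootstrap=True, random_state=42)

# Pasting: sample without replacement (bootstrap=False)
# Note: with the default max_samples=1.0, every tree is trained on the entire training set,
# so the trees see identical data; set max_samples below 1.0 to give each tree a distinct subset.
pasting_clf = BaggingClassifier(estimator=estimator, n_estimators=100, bootstrap=False, random_state=42)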

# Fit the models
bagging_clf.fit(X_train, y_train)
pasting_clf.fit(X_train, y_train)

# You can then predict and evaluate them
bagging_score = bagging_clf.score(X_test, y_test)
pasting_score = pasting_clf.score(X_test, y_test)

print(f'Bagging Test Score: {bagging_score}')
print(f'Pasting Test Score: {pasting_score}')

Bagging Test Score: 0.7927382753403933
Pasting Test Score: 0.7503782148260212


In [46]:
accuracies = {"Bagging": bagging_score, "Pasting": pasting_score}

- Random Forests

In [47]:
#your code here

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Initialize the Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)  # n_estimators is the number of trees in the forest

# n_estimators: The number of trees in the forest. More trees can lead to a better model, but 
# there's a trade-off with computational cost
# random_state: Ensures reproducibility by fixing the random seed.

# Fit the model to the training data
rf_clf.fit(X_train, y_train)

# Make predictions on the test data
y_pred = rf_clf.predict(X_test)

# Evaluate the model's accuracy
random_forest_accuracy = accuracy_score(y_test, y_pred)
print(f'Random Forest Test Accuracy: {random_forest_accuracy:.2f}')

Random Forest Test Accuracy: 0.79


In [48]:
accuracies['Random Forest'] = random_forest_accuracy

- Gradient Boosting

In [49]:
#your code here

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Initialize the Gradient Boosting Classifier
gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)

# Fit the model to the training data
gb_clf.fit(X_train, y_train)

# Predict on the test data
y_pred = gb_clf.predict(X_test)

# n_estimators: number of boosting stages (trees) to build. More trees can improve
# the model at the cost of computation time.
# learning_rate: scales the contribution of each tree. A smaller value typically
# requires more trees.

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Gradient Boosting Test Accuracy: {accuracy:.2f}')

Gradient Boosting Test Accuracy: 0.80
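To see how `n_estimators` interacts with the accuracy, one option is `staged_predict`, which yields the ensemble's predictions after each boosting stage. The sketch below uses synthetic data and a held-out validation split; the names `Xg_val`, `gb_demo`, `staged_acc`, and `best_n` are hypothetical. The stage count should be chosen on a validation set rather than on the test set, otherwise the reported test accuracy becomes optimistic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the lab's dataset)
X_g, y_g = make_classification(n_samples=500, n_features=6, random_state=42)
Xg_train, Xg_val, yg_train, yg_val = train_test_split(
    X_g, y_g, test_size=0.2, random_state=42
)

gb_demo = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gb_demo.fit(Xg_train, yg_train)

# Validation accuracy after 1, 2, ..., 100 trees
staged_acc = [accuracy_score(yg_val, pred) for pred in gb_demo.staged_predict(Xg_val)]

# First stage count that reaches the highest validation accuracy
best_n = staged_acc.index(max(staged_acc)) + 1
print(f"Best validation accuracy {max(staged_acc):.3f} reached with {best_n} trees")
```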


In [50]:
accuracies['Gradient Boosting Test Accuracy'] = accuracy

- Adaptive Boosting

In [51]:
#your code here

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Initialize the base estimator
estimator = DecisionTreeClassifier(max_depth=1, random_state=42)

# Initialize the AdaBoost Classifier
ada_clf = AdaBoostClassifier(estimator=estimator, n_estimators=100, learning_rate=0.1, random_state=42)

# Fit the model to the training data
ada_clf.fit(X_train, y_train)

# Predict on the test data
y_pred = ada_clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'AdaBoost Test Accuracy: {accuracy:.2f}')

AdaBoost Test Accuracy: 0.75


In [52]:
accuracies['AdaBoost Test Accuracy'] = accuracy

Which model is the best and why?

In [54]:
#comment here

accuracies = {k: f"{v*100:.2f}%" for k, v in accuracies.items()}

accuracies

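# Gradient Boosting scores highest (79.58%), but Random Forest (79.35%) and Bagging (79.27%)
# are within about 0.3 percentage points, so the lead is small on a 1,322-row test set.
# A plausible reason for its edge is that it builds trees sequentially, with each new tree
# fitted to correct the errors of the trees built before it.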

{'Bagging': '79.27%',
 'Pasting': '75.04%',
 'Random Forest': '79.35%',
 'Gradient Boosting Test Accuracy': '79.58%',
 'AdaBoost Test Accuracy': '74.66%'}
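Because the top three scores differ by a fraction of a percentage point on a single split, a cross-validated comparison gives a better sense of whether the ranking is stable. The following is a minimal sketch on synthetic data; the fold count `cv=5`, the reduced model list, and the names `X_cv`, `models`, and `cv_results` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: not the lab's dataset)
X_cv, y_cv = make_classification(n_samples=400, n_features=6, random_state=42)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
}

# Mean and standard deviation of accuracy over 5 folds for each model
cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_cv, y_cv, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the gap between two means is smaller than their standard deviations, the single-split ranking should be read with caution.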