# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

|   | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [5]:
spaceship.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB


In [7]:
# Preprocessing and metric imports
# (pandas, numpy and train_test_split are already imported in the first cell)
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, accuracy_score

# Drop every row that has at least one missing value

spaceship = spaceship.dropna()

X = spaceship.drop(columns=['PassengerId', 'Name', 'Cabin', 'Transported'])
y = spaceship['Transported']

# One-hot encode categorical variables
# (`sparse` was renamed to `sparse_output` in scikit-learn 1.2)
ohe = OneHotEncoder(sparse_output=False)
X_cat_trans = ohe.fit_transform(X[['HomePlanet', 'CryoSleep', 'Destination', 'VIP']])
X_cat_trans_df = pd.DataFrame(X_cat_trans, columns=ohe.get_feature_names_out(), index=X.index)

X = X.drop(columns=['HomePlanet', 'CryoSleep', 'Destination', 'VIP'])
X = pd.concat([X, X_cat_trans_df], axis=1)


# Feature scaling (MinMaxScaler is already imported above)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)



**Perform Train Test Split**

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=0)
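The scaling cell above fits `MinMaxScaler` on the full dataset before the split, so the test rows influence the learned ranges. As a hedged sketch of the alternative, the snippet below splits first and fits the scaler on the training rows only. It runs on synthetic data from `make_classification`, and the names `X_demo`, `demo_scaler`, `X_tr` and `X_te` are illustrative stand-ins, not part of the lab.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the encoded Spaceship Titanic features (assumption).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

# Split first, then learn the scaling ranges from the training rows only.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
demo_scaler = MinMaxScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
# Test values can fall slightly outside [0, 1] because the ranges come from X_tr.
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.min(), X_tr_scaled.max())
```

Keeping the lab's original order is still reasonable when the goal is a like-for-like comparison with the previous Lab; this sketch only shows what a leakage-free variant could look like.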



In [11]:

# K-Nearest Neighbors Regression
# (the boolean target is treated as 0/1; predictions are thresholded at 0.5 to compute accuracy)
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, accuracy_score

# Train the model
knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train, y_train)

# Predict on the test set
y_pred_knn = knn_reg.predict(X_test)
y_pred_knn_binary = (y_pred_knn >= 0.5)

# Calculate metrics
mae_knn = mean_absolute_error(y_test, y_pred_knn)
rmse_knn = np.sqrt(mean_squared_error(y_test, y_pred_knn))  # `squared=False` is deprecated since scikit-learn 1.4
r2_knn = r2_score(y_test, y_pred_knn)
accuracy_knn = accuracy_score(y_test, y_pred_knn_binary)


print(f"K-Nearest Neighbors Regression")
print(f"MAE: {mae_knn:.2f}")
print(f"RMSE: {rmse_knn:.2f}")
print(f"R2 score: {r2_knn:.2f}")
print(f"Accuracy: {accuracy_knn:.2f}")
print("\n")


K-Nearest Neighbors Regression
MAE: 0.29
RMSE: 0.41
R2 score: 0.32
Accuracy: 0.75
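Because `Transported` is a binary target, thresholding a KNN regressor at 0.5 amounts to a majority vote among the neighbours. The sketch below illustrates this on synthetic data: with an odd `n_neighbors` and uniform weights, the thresholded regressor and `KNeighborsClassifier` should produce the same labels. The data and variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Synthetic binary problem standing in for the lab data (assumption).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Regressor: the prediction is the mean of the 5 neighbours' 0/1 labels.
reg = KNeighborsRegressor(n_neighbors=5).fit(X_tr, y_tr)
labels_from_reg = (reg.predict(X_te) >= 0.5).astype(int)

# Classifier: the prediction is the majority label among the same 5 neighbours.
clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
labels_from_clf = clf.predict(X_te)

same_labels = np.array_equal(labels_from_reg, labels_from_clf)
print(same_labels)
```

The regressor form keeps MAE, RMSE and R2 meaningful as scores on the neighbour fraction, while the classifier form is the more direct fit when accuracy is the metric of interest.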




**Model Selection** - now apply different ensemble methods to try to get a better model

- Bagging and Pasting

In [15]:

from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor

# Train the model
# (the estimator is passed positionally: `base_estimator` was renamed to `estimator` in scikit-learn 1.2)
bagging_knn = BaggingRegressor(KNeighborsRegressor(), n_estimators=50, random_state=0)
bagging_knn.fit(X_train, y_train)

# Predict on the test set
y_pred_bag_knn = bagging_knn.predict(X_test)
y_pred_bag_knn_binary = (y_pred_bag_knn >= 0.5)

# Calculate metrics
mae_bag_knn = mean_absolute_error(y_test, y_pred_bag_knn)
rmse_bag_knn = np.sqrt(mean_squared_error(y_test, y_pred_bag_knn))  # `squared=False` is deprecated since scikit-learn 1.4
r2_bag_knn = r2_score(y_test, y_pred_bag_knn)
accuracy_bag_knn = accuracy_score(y_test, y_pred_bag_knn_binary)


print(f"Bagging with K-Nearest Neighbors Regression")
print(f"MAE: {mae_bag_knn:.2f}")
print(f"RMSE: {rmse_bag_knn:.2f}")
print(f"R2 score: {r2_bag_knn:.2f}")
print(f"Accuracy: {accuracy_bag_knn:.2f}")
print("\n")

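# Pasting with KNN: bootstrap=False draws samples without replacement.
# Note: with the default max_samples=1.0, every estimator is trained on the full
# training set, so this ensemble reproduces a single KNN model.
# Set max_samples below 1.0 to give each estimator a different subset.
pasting_knn = BaggingRegressor(KNeighborsRegressor(), n_estimators=50, bootstrap=False, random_state=0)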
pasting_knn.fit(X_train, y_train)

# Predict on the test set
y_pred_paste_knn = pasting_knn.predict(X_test)
y_pred_paste_knn_binary = (y_pred_paste_knn >= 0.5)

# Calculate metrics
mae_paste_knn = mean_absolute_error(y_test, y_pred_paste_knn)
rmse_paste_knn = np.sqrt(mean_squared_error(y_test, y_pred_paste_knn))  # `squared=False` is deprecated since scikit-learn 1.4
r2_paste_knn = r2_score(y_test, y_pred_paste_knn)
accuracy_paste_knn = accuracy_score(y_test, y_pred_paste_knn_binary)

print(f"Pasting with K-Nearest Neighbors Regression")
print(f"MAE: {mae_paste_knn:.2f}")
print(f"RMSE: {rmse_paste_knn:.2f}")
print(f"R2 score: {r2_paste_knn:.2f}")
print(f"Accuracy: {accuracy_paste_knn:.2f}")
print("\n")




Bagging with K-Nearest Neighbors Regression
MAE: 0.29
RMSE: 0.40
R2 score: 0.35
Accuracy: 0.77






Pasting with K-Nearest Neighbors Regression
MAE: 0.29
RMSE: 0.41
R2 score: 0.32
Accuracy: 0.75
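The pasting metrics above are identical to those of the single KNN model. A likely explanation is that `bootstrap=False` combined with the default `max_samples=1.0` hands every estimator the whole training set. The sketch below, on synthetic data with illustrative names, compares that configuration with a pasting ensemble whose estimators each see a 50% subset; the value `0.5` is an arbitrary choice for the example, not a tuned setting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the lab data (assumption).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

single_knn = KNeighborsRegressor().fit(X_tr, y_tr)

# bootstrap=False with the default max_samples=1.0: every estimator gets all rows.
paste_full = BaggingRegressor(
    KNeighborsRegressor(), n_estimators=10, bootstrap=False, random_state=0
).fit(X_tr, y_tr)

# Pasting proper: each estimator gets a different 50% subset, without replacement.
paste_half = BaggingRegressor(
    KNeighborsRegressor(), n_estimators=10, bootstrap=False,
    max_samples=0.5, random_state=0,
).fit(X_tr, y_tr)

full_matches_single = np.allclose(paste_full.predict(X_te), single_knn.predict(X_te))
half_matches_single = np.allclose(paste_half.predict(X_te), single_knn.predict(X_te))
print(full_matches_single, half_matches_single)
```

Whether a smaller `max_samples` actually improves the Spaceship Titanic scores would need to be checked by rerunning the lab cell with it.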




- Random Forests

In [17]:
# Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor

# Train the model
random_forest = RandomForestRegressor(n_estimators=100, random_state=0)
random_forest.fit(X_train, y_train)

# Predict on the test set
y_pred_rf = random_forest.predict(X_test)
y_pred_rf_binary = (y_pred_rf >= 0.5)

# Calculate metrics
mae_rf = mean_absolute_error(y_test, y_pred_rf)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))  # `squared=False` is deprecated since scikit-learn 1.4
r2_rf = r2_score(y_test, y_pred_rf)
accuracy_rf = accuracy_score(y_test, y_pred_rf_binary)


print(f"Random Forest Regressor")
print(f"MAE: {mae_rf:.2f}")
print(f"RMSE: {rmse_rf:.2f}")
print(f"R2 score: {r2_rf:.2f}")
print(f"Accuracy: {accuracy_rf:.2f}")
print("\n")


Random Forest Regressor
MAE: 0.28
RMSE: 0.39
R2 score: 0.39
Accuracy: 0.79




- Gradient Boosting

In [19]:
# Gradient Boosting Regressor
from sklearn.ensemble import GradientBoostingRegressor

# Train the model
gradient_boosting = GradientBoostingRegressor(n_estimators=100, random_state=0)
gradient_boosting.fit(X_train, y_train)

# Predict on the test set
y_pred_gb = gradient_boosting.predict(X_test)
y_pred_gb_binary = (y_pred_gb >= 0.5)

# Calculate metrics
mae_gb = mean_absolute_error(y_test, y_pred_gb)
rmse_gb = np.sqrt(mean_squared_error(y_test, y_pred_gb))  # `squared=False` is deprecated since scikit-learn 1.4
r2_gb = r2_score(y_test, y_pred_gb)
accuracy_gb = accuracy_score(y_test, y_pred_gb_binary)


print(f"Gradient Boosting Regressor")
print(f"MAE: {mae_gb:.2f}")
print(f"RMSE: {rmse_gb:.2f}")
print(f"R2 score: {r2_gb:.2f}")
print(f"Accuracy: {accuracy_gb:.2f}")
print("\n")



Gradient Boosting Regressor
MAE: 0.29
RMSE: 0.38
R2 score: 0.42
Accuracy: 0.78




- Adaptive Boosting

In [21]:
# Adaptive Boosting Regressor
from sklearn.ensemble import AdaBoostRegressor

# Train the model
adaptive_boosting = AdaBoostRegressor(n_estimators=100, random_state=0)
adaptive_boosting.fit(X_train, y_train)

# Predict on the test set
y_pred_ab = adaptive_boosting.predict(X_test)
y_pred_ab_binary = (y_pred_ab >= 0.5)

# Calculate metrics
mae_ab = mean_absolute_error(y_test, y_pred_ab)
rmse_ab = np.sqrt(mean_squared_error(y_test, y_pred_ab))  # `squared=False` is deprecated since scikit-learn 1.4
r2_ab = r2_score(y_test, y_pred_ab)
accuracy_ab = accuracy_score(y_test, y_pred_ab_binary)


print(f"Adaptive Boosting Regressor")
print(f"MAE: {mae_ab:.2f}")
print(f"RMSE: {rmse_ab:.2f}")
print(f"R2 score: {r2_ab:.2f}")
print(f"Accuracy: {accuracy_ab:.2f}")
print("\n")



Adaptive Boosting Regressor
MAE: 0.36
RMSE: 0.40
R2 score: 0.35
Accuracy: 0.78
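Each model cell repeats the same four metric calculations. One way to cut the repetition is a small helper like the one sketched below. `evaluate_regressor` is a hypothetical name introduced for illustration, and the toy scores in the usage line are made up, not results from the lab.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    mean_absolute_error,
    mean_squared_error,
    r2_score,
)


def evaluate_regressor(name, y_true, y_score, threshold=0.5):
    """Print and return the lab's four metrics for continuous scores on a 0/1 target."""
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)
    metrics = {
        "MAE": mean_absolute_error(y_true, y_score),
        "RMSE": np.sqrt(mean_squared_error(y_true, y_score)),
        "R2 score": r2_score(y_true, y_score),
        # Accuracy compares the true labels with the thresholded scores.
        "Accuracy": accuracy_score(y_true >= 0.5, y_score >= threshold),
    }
    print(name)
    for key, value in metrics.items():
        print(f"{key}: {value:.2f}")
    return metrics


# Toy usage with made-up scores (not results from the lab).
toy = evaluate_regressor("Toy example", [0, 1, 1, 0], [0.2, 0.8, 0.4, 0.1])
```

In the lab, a call such as `evaluate_regressor("Random Forest Regressor", y_test, y_pred_rf)` would stand in for the metric and print lines of the corresponding cell.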




Which model is the best and why?

In [None]:
# Random Forest Regressor and Gradient Boosting Regressor are the top performers, and the gap between them is small (0.01 to 0.03 on each metric).
# Random Forest has the better MAE (0.28 vs 0.29) and the highest accuracy (0.79 vs 0.78); Gradient Boosting has the better RMSE (0.38 vs 0.39) and the highest R2 score (0.42 vs 0.39).
# Because the task is to classify passengers as transported or not, accuracy is the most relevant metric, so Random Forest is the best model here, with Gradient Boosting a very close second.
# Both improve on the single KNN baseline (accuracy 0.75). Bagging KNN helps slightly (0.77), pasting as configured matches the baseline exactly, and AdaBoost reaches 0.78 accuracy but has the worst MAE (0.36).
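Differences of 0.01 on a single train/test split can easily come from the split itself. A hedged way to check the ranking is k-fold cross-validation, sketched below on synthetic data. Note that this sketch swaps in the classifier variants (`RandomForestClassifier`, `GradientBoostingClassifier`) rather than the regressors used in the lab, and that the names `candidates` and `cv_results` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the lab data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Mean and spread of accuracy over 5 folds, per model.
cv_results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the fold-to-fold standard deviation is larger than the gap between the two means, the models are effectively tied on that data.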






