# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous lab.

In [None]:
# Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import (BaggingClassifier, BaggingRegressor, RandomForestRegressor,
                              AdaBoostRegressor, GradientBoostingRegressor)
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import (accuracy_score, r2_score, mean_absolute_error,
                             mean_squared_error, root_mean_squared_error)

In [None]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Now perform the same as before:
- Feature Scaling
- Feature Selection


In [None]:
spaceship = spaceship.dropna()
spaceship.reset_index(drop=True, inplace=True)

In [None]:
# Keep only the first part of Cabin (the text before the first "/")
spaceship['Cabin'] = spaceship['Cabin'].apply(lambda x: x.split('/')[0] if isinstance(x, str) else np.nan)
# Drop the identifier columns
spaceship = spaceship.drop(columns=['Name', 'PassengerId'])

**Perform Train Test Split**

In [None]:
spaceship = pd.get_dummies(spaceship, columns=['HomePlanet','CryoSleep','Cabin','Destination','VIP'])

In [None]:
spaceship_numerical = spaceship.select_dtypes(include=['number'])

In [None]:
normalizer = MinMaxScaler()

In [None]:
spaceship_numerical_norm = normalizer.fit_transform(spaceship_numerical)

In [None]:
spaceship_numerical_norm_df = pd.DataFrame(spaceship_numerical_norm, columns=spaceship_numerical.columns)

In [None]:
numerical_cols = spaceship.select_dtypes(include=['number']).columns
spaceship = spaceship.drop(columns=numerical_cols)

In [None]:
spaceship_combined = pd.concat([spaceship, spaceship_numerical_norm_df], axis=1)

In [None]:
boolean_cols = spaceship_combined.select_dtypes(include=['bool']).columns
spaceship_combined[boolean_cols] = spaceship_combined[boolean_cols].astype(int)

In [None]:
spaceship_combined.info()

In [None]:
# X-y split; features = X, target = y
features = spaceship_combined.drop(columns = ["Transported"])
target = spaceship_combined["Transported"]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.20, random_state=1)
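
The cells above fit `MinMaxScaler` on the full dataset before the split, so the test rows influence the scaling. A minimal sketch of one way to avoid this is to wrap the scaler and the model in a pipeline, so the scaler is fit on the training rows only. It assumes a small synthetic DataFrame in place of the real data: the values, the toy target, and the `demo`/`Xd_*` names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for two numerical features (hypothetical values).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Age": rng.integers(0, 80, size=200),
    "RoomService": rng.exponential(200.0, size=200),
})
# Toy binary target, only so the example has something to predict.
demo_target = (demo["RoomService"] < 150).astype(int)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    demo, demo_target, test_size=0.20, random_state=1
)

# The scaler inside the pipeline is fit on the training rows only;
# the test rows are transformed with the training min and max.
scaled_tree = make_pipeline(
    MinMaxScaler(),
    DecisionTreeClassifier(max_depth=10, random_state=1),
)
scaled_tree.fit(Xd_train, yd_train)
demo_accuracy = scaled_tree.score(Xd_test, yd_test)
print(f"Accuracy: {demo_accuracy:.2f}")
```

The same pipeline object could be passed as the base model to any of the ensembles below, which keeps the preprocessing identical across models.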

**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Decision Trees

In [None]:
tree = DecisionTreeRegressor(max_depth=10)

In [None]:
tree.fit(X_train, y_train)

In [None]:
y_pred_test_dt = tree.predict(X_test)
print(f"MAE, {mean_absolute_error(y_pred_test_dt, y_test): .2f}")
print(f"MSE, {mean_squared_error(y_pred_test_dt, y_test): .2f}")
print(f"RMSE, {root_mean_squared_error(y_pred_test_dt, y_test): .2f}")
print(f"R2 score, {tree.score(X_test, y_test): .2f}")

- Bagging and Pasting

In [None]:
bagging_reg = BaggingClassifier(DecisionTreeClassifier(max_depth=20),
                               n_estimators=100, # number of models to use
                               max_samples = 1000)

In [None]:
bagging_reg.fit(X_train, y_train)

In [None]:
y_pred_test_bag = bagging_reg.predict(X_test)

print(f"MAE {mean_absolute_error(y_test, y_pred_test_bag): .2f}")
print(f"RMSE {root_mean_squared_error(y_test, y_pred_test_bag): .2f}")
print(f"MSE {mean_squared_error(y_test, y_pred_test_bag): .2f}")
# For a classifier, .score returns accuracy, not R2
print(f"Accuracy {bagging_reg.score(X_test, y_test): .2f}")

In [None]:
# For pasting, set bootstrap to False
pasting_reg = BaggingClassifier(DecisionTreeClassifier(max_depth=20),
                               n_estimators=100,
                               max_samples=1000,
                               bootstrap=False)

In [None]:
pasting_reg.fit(X_train, y_train)

In [None]:
y_pred_test_past = pasting_reg.predict(X_test)

print(f"MAE {mean_absolute_error(y_test, y_pred_test_past): .2f}")
print(f"RMSE {root_mean_squared_error(y_test, y_pred_test_past): .2f}")
print(f"MSE {mean_squared_error(y_test, y_pred_test_past): .2f}")
# For a classifier, .score returns accuracy, not R2
print(f"Accuracy {pasting_reg.score(X_test, y_test): .2f}")
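
The only difference between the two settings is how each estimator's training subset is drawn: bagging samples rows with replacement (`bootstrap=True`, the default), pasting samples without replacement (`bootstrap=False`). A small NumPy illustration of that idea follows, assuming a hypothetical training set of 5,000 rows and the same `max_samples=1000`. It mimics the sampling scheme and is not scikit-learn's internal code.

```python
import numpy as np

rng = np.random.default_rng(1)
n_rows = 5000       # hypothetical training-set size
max_samples = 1000  # rows drawn for each estimator, as in the cells above

# Bagging: rows are drawn with replacement, so duplicates can occur.
bagging_idx = rng.choice(n_rows, size=max_samples, replace=True)

# Pasting: rows are drawn without replacement, so every row is distinct.
pasting_idx = rng.choice(n_rows, size=max_samples, replace=False)

n_unique_bagging = np.unique(bagging_idx).size
n_unique_pasting = np.unique(pasting_idx).size
print(f"Unique rows with bagging: {n_unique_bagging}")
print(f"Unique rows with pasting: {n_unique_pasting}")
```

With pasting, each estimator always sees 1,000 distinct rows; with bagging, some rows repeat, so the number of distinct rows is smaller.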

- Random Forests

In [None]:
forest = RandomForestRegressor(n_estimators=100,
                             max_depth=20)

In [None]:
forest.fit(X_train, y_train)

In [None]:
y_pred_test_rf = forest.predict(X_test)
print(f"MAE {mean_absolute_error(y_pred_test_rf, y_test): .2f}")
print(f"MSE, {mean_squared_error(y_pred_test_rf, y_test): .2f}")
print(f"RMSE, {root_mean_squared_error(y_pred_test_rf, y_test): .2f}")
print(f"R2 score, {forest.score(X_test, y_test): .2f}")

- Gradient Boosting

In [None]:
gb_reg = GradientBoostingRegressor(max_depth=20,
                                   n_estimators=100)

In [None]:
gb_reg.fit(X_train, y_train)

In [None]:
y_pred_test_gb = gb_reg.predict(X_test)

print(f"MAE, {mean_absolute_error(y_pred_test_gb, y_test): .2f}")
print(f"MSE, {mean_squared_error(y_pred_test_gb, y_test): .2f}")
print(f"RMSE, {root_mean_squared_error(y_pred_test_gb, y_test): .2f}")
print(f"R2 score, {gb_reg.score(X_test, y_test): .2f}")

- Adaptive Boosting

In [None]:
ada_reg = AdaBoostRegressor(DecisionTreeRegressor(max_depth=20),
                            n_estimators=100)

In [None]:
ada_reg.fit(X_train, y_train)

In [None]:
y_pred_test_ada = ada_reg.predict(X_test)

print(f"MAE, {mean_absolute_error(y_pred_test_ada, y_test): .2f}")
print(f"MSE, {mean_squared_error(y_pred_test_ada, y_test): .2f}")
print(f"RMSE, {root_mean_squared_error(y_pred_test_ada, y_test): .2f}")
print(f"R2 score, {ada_reg.score(X_test, y_test): .2f}")

Which model is the best and why?

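Bagging and pasting perform best: both reach about 80% accuracy and have the lowest MAE, only 0.21. Note that `.score` returns accuracy for these two classifiers but R2 for the regressors, so the scores are not directly comparable across models.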
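
Because `Transported` is a binary target, one way to put all the models on the same footing is to use the classifier version of each ensemble and compare accuracy. Below is a sketch under that assumption. It uses synthetic data from `make_classification` instead of the lab's dataset, and smaller ensembles and hyperparameters chosen only to keep it quick, so the numbers it prints say nothing about the Spaceship Titanic results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for the lab's features).
Xs, ys = make_classification(n_samples=600, n_features=10, random_state=1)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    Xs, ys, test_size=0.20, random_state=1
)

# Classifier counterparts of the models tried above (illustrative settings).
candidates = {
    "Decision tree": DecisionTreeClassifier(max_depth=10, random_state=1),
    "Bagging": BaggingClassifier(
        DecisionTreeClassifier(max_depth=10),
        n_estimators=25, max_samples=200, random_state=1,
    ),
    "Pasting": BaggingClassifier(
        DecisionTreeClassifier(max_depth=10),
        n_estimators=25, max_samples=200, bootstrap=False, random_state=1,
    ),
    "Random forest": RandomForestClassifier(
        n_estimators=25, max_depth=10, random_state=1
    ),
    "Gradient boosting": GradientBoostingClassifier(
        n_estimators=25, max_depth=3, random_state=1
    ),
    "AdaBoost": AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=1), n_estimators=25, random_state=1
    ),
}

# Fit every model on the same split and score it with the same metric.
results = {}
for name, model in candidates.items():
    model.fit(Xs_train, ys_train)
    results[name] = accuracy_score(ys_test, model.predict(Xs_test))

# Print the models from highest to lowest test accuracy.
for name, acc in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {acc:.2f}")
```

Scoring every model with the same metric on the same split makes the ranking easier to justify than mixing accuracy with R2.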