# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [3]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| Unnamed: 0 | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [5]:
# Count missing values per column
missing_values = spaceship.isnull().sum()
missing_values
# Drop every row that has at least one missing value
spaceship_clean = spaceship.dropna()

In [9]:
numerical_features = spaceship_clean.select_dtypes(include=[np.number])

In [11]:
target_variable = spaceship_clean['Transported']

In [13]:
from sklearn.model_selection import train_test_split

X = numerical_features
y = target_variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [19]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)

X_test_scaled = scaler.transform(X_test)

from sklearn.neighbors import KNeighborsClassifier

knn_classifier = KNeighborsClassifier()

# KNN is distance-based, so fit it on the scaled features rather than the raw ones
knn_classifier.fit(X_train_scaled, y_train)

In [23]:
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=5)

X_train_selected = selector.fit_transform(X_train_scaled, y_train)

selected_features = X.columns[selector.get_support()]
selected_features

Index(['Age', 'RoomService', 'FoodCourt', 'Spa', 'VRDeck'], dtype='object')

**Perform Train Test Split**

In [25]:
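from sklearn.model_selection import train_test_split

# Note: this splits the already-selected training data a second time and overwrites
# y_train / y_test, so the original 20% hold-out (X_test_scaled) is not used below.
X_train_selected, X_test_selected, y_train, y_test = train_test_split(X_train_selected, y_train, test_size=0.2, random_state=42)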
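
The split above is carved out of the training portion. An alternative is to keep the first hold-out set and pass it through the already fitted scaler and selector with `transform`. The sketch below illustrates that pattern on synthetic data; `make_classification` and every `*_demo` / `X_tr` / `X_te` name are assumptions made for illustration, not part of the lab data.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the six numeric columns (assumption, not the real data)
X_demo, y_demo = make_classification(n_samples=500, n_features=6, n_informative=4,
                                     n_redundant=0, random_state=42)

# Split once, before any fitting
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Fit the scaler on the training rows only, then reuse it on the test rows
scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

# Same idea for feature selection: fit on train, transform test
selector_demo = SelectKBest(score_func=f_classif, k=5)
X_tr_sel = selector_demo.fit_transform(X_tr_scaled, y_tr)
X_te_sel = selector_demo.transform(X_te_scaled)

print(X_tr_sel.shape, X_te_sel.shape)
```

With this layout the test rows never influence the scaler statistics or the feature scores.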

**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [31]:
pip install --upgrade scikit-learn

Collecting scikit-learn
  Downloading scikit_learn-1.6.1-cp312-cp312-win_amd64.whl.metadata (15 kB)
Downloading scikit_learn-1.6.1-cp312-cp312-win_amd64.whl (11.1 MB)

In [41]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_absolute_error

# Regressors treat the boolean target Transported as 0/1, so their predictions
# are continuous scores rather than class labels.
# Bagging: 100 depth-20 trees, each trained on 1000 rows sampled with replacement
bagging_reg = BaggingRegressor(DecisionTreeRegressor(max_depth=20),
                               n_estimators=100,
                               max_samples=1000)

In [45]:
bagging_reg.fit(X_train_selected, y_train)
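
The heading mentions pasting, but the cell above only configures bagging. In scikit-learn the two differ in whether each estimator's sample is drawn with replacement (`bootstrap=True`, bagging, the default) or without replacement (`bootstrap=False`, pasting). Because `Transported` is a class label, the sketch below uses classifiers. The synthetic data and all variable names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data (assumption, stands in for the lab features)
X_demo, y_demo = make_classification(n_samples=1500, n_features=5, n_informative=3,
                                     n_redundant=0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

models = {
    # Bagging: each tree sees 500 rows drawn WITH replacement
    "bagging": BaggingClassifier(DecisionTreeClassifier(max_depth=20),
                                 n_estimators=100, max_samples=500,
                                 bootstrap=True, random_state=42),
    # Pasting: each tree sees 500 rows drawn WITHOUT replacement
    "pasting": BaggingClassifier(DecisionTreeClassifier(max_depth=20),
                                 n_estimators=100, max_samples=500,
                                 bootstrap=False, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)  # mean accuracy on the held-out rows
    print(name, scores[name])
```

Fixing `random_state` makes the comparison repeatable from run to run.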

- Random Forests

In [47]:
from sklearn.ensemble import RandomForestRegressor

random_forest_reg = RandomForestRegressor(n_estimators=100, max_depth=20, random_state=42)

random_forest_reg.fit(X_train_selected, y_train)

- Gradient Boosting

In [49]:
from sklearn.ensemble import GradientBoostingRegressor

gradient_boosting_reg = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)

gradient_boosting_reg.fit(X_train_selected, y_train)

- Adaptive Boosting

In [57]:
from sklearn.ensemble import AdaBoostRegressor

ada_reg = AdaBoostRegressor(DecisionTreeRegressor(max_depth=20),
                            n_estimators=100)

ada_reg.fit(X_train_selected, y_train)
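
Boosting is usually paired with weak learners such as very shallow trees, whereas the cell above boosts depth-20 trees. The sketch below sets the two choices side by side with `AdaBoostClassifier` on synthetic data. It is an illustration under assumed data and names, not a claim about which setting wins on the Spaceship Titanic features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data (assumption)
X_demo, y_demo = make_classification(n_samples=1500, n_features=5, n_informative=3,
                                     n_redundant=0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

base_learners = {
    "stumps": DecisionTreeClassifier(max_depth=1),       # classic weak learner
    "deep trees": DecisionTreeClassifier(max_depth=20),  # mirrors the depth used above
}

ada_scores = {}
ada_models = {}
for name, base in base_learners.items():
    model = AdaBoostClassifier(base, n_estimators=100, random_state=42)
    model.fit(X_tr, y_tr)
    ada_models[name] = model
    ada_scores[name] = model.score(X_te, y_te)
    # estimators_ can be shorter than n_estimators if boosting stops early
    print(name, "estimators:", len(model.estimators_), "accuracy:", ada_scores[name])
```

Comparing the number of fitted estimators and the test accuracy for each base learner is one way to check whether a deep base tree leaves boosting anything to correct.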

Which model is the best and why?

In [61]:
from sklearn.metrics import r2_score, mean_absolute_error

y_pred_bagging = bagging_reg.predict(X_test_selected)
y_pred_random_forest = random_forest_reg.predict(X_test_selected)
y_pred_gradient_boosting = gradient_boosting_reg.predict(X_test_selected)
y_pred_adaboost = ada_reg.predict(X_test_selected)

r2_bagging = r2_score(y_test, y_pred_bagging)
r2_random_forest = r2_score(y_test, y_pred_random_forest)
r2_gradient_boosting = r2_score(y_test, y_pred_gradient_boosting)
r2_adaboost = r2_score(y_test, y_pred_adaboost)

mae_bagging = mean_absolute_error(y_test, y_pred_bagging)
mae_random_forest = mean_absolute_error(y_test, y_pred_random_forest)
mae_gradient_boosting = mean_absolute_error(y_test, y_pred_gradient_boosting)
mae_adaboost = mean_absolute_error(y_test, y_pred_adaboost)

print("R-squared:")
print("Bagging:", r2_bagging)
print("Random Forest:", r2_random_forest)
print("Gradient Boosting:", r2_gradient_boosting)
print("AdaBoost:", r2_adaboost)
print("\nMean Absolute Error:")
print("Bagging:", mae_bagging)
print("Random Forest:", mae_random_forest)
print("Gradient Boosting:", mae_gradient_boosting)
print("AdaBoost:", mae_adaboost)

print("Bagging and Gradient Boosting have the highest R-squared values (about 0.35), while Random Forest has a slightly lower Mean Absolute Error (about 0.319). The three models perform similarly; AdaBoost is clearly the weakest on both metrics.")

R-squared:
Bagging: 0.3506220421634172
Random Forest: 0.3265502878363087
Gradient Boosting: 0.34881157883238
AdaBoost: 0.0016499993928987822

Mean Absolute Error:
Bagging: 0.32125344654903887
Random Forest: 0.318766395960906
Gradient Boosting: 0.3227201465246533
AdaBoost: 0.36332707399543374
Bagging and Gradient Boosting have the highest R-squared values (about 0.35), while Random Forest has a slightly lower Mean Absolute Error (about 0.319). The three models perform similarly; AdaBoost is clearly the weakest on both metrics.
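
Since `Transported` is a True/False label, R-squared and MAE are indirect measures of quality. Two ways to get a classification metric are to threshold the regressor's score at 0.5 or to fit the classifier variant directly. The sketch below shows both on synthetic data. The data, the 0.5 threshold and all variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 0/1 target standing in for Transported (assumption)
X_demo, y_demo = make_classification(n_samples=1000, n_features=5, n_informative=3,
                                     n_redundant=0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Option 1: keep the regressor and turn its continuous score into a label
reg = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                max_depth=3, random_state=42)
reg.fit(X_tr, y_tr)
labels_from_reg = (reg.predict(X_te) >= 0.5).astype(int)  # 0.5 cut-off is an assumption
acc_reg = accuracy_score(y_te, labels_from_reg)

# Option 2: use the classifier with the same hyperparameters
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=3, random_state=42)
clf.fit(X_tr, y_tr)
acc_clf = accuracy_score(y_te, clf.predict(X_te))

print("Thresholded regressor accuracy:", acc_reg)
print("Classifier accuracy:", acc_clf)
```

Accuracy is directly comparable with the scores from a classifier-based lab, which R-squared on a 0/1 target is not.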
