# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [1]:
#Libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor

from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import cross_val_score

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()
spaceship.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

In [3]:
spaceship.dropna(inplace = True)

Now perform the same as before:
- Feature Scaling
- Feature Selection
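
One possible way to bundle imputation, scaling and encoding into a single reusable object is a `ColumnTransformer`. The sketch below runs on a tiny made-up frame (`toy`, with hypothetical values and only three Spaceship-like columns), so it is an illustration of the idea rather than a definitive recipe for this Lab.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy data that mimics a few Spaceship Titanic columns
toy = pd.DataFrame({
    "Age": [39.0, 24.0, np.nan, 33.0],
    "RoomService": [0.0, 109.0, 43.0, 0.0],
    "HomePlanet": ["Europa", "Earth", "Europa", np.nan],
})

numeric_cols = ["Age", "RoomService"]
categorical_cols = ["HomePlanet"]

# Numeric columns: fill gaps with the median, then scale to [0, 1]
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", MinMaxScaler()),
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# sparse_threshold=0 forces a dense array so the result is easy to inspect
preprocess = ColumnTransformer(
    [("num", numeric_pipe, numeric_cols),
     ("cat", categorical_pipe, categorical_cols)],
    sparse_threshold=0,
)

toy_transformed = preprocess.fit_transform(toy)
print(toy_transformed.shape)
```

On real data the transformer would be fitted on the training split only and then applied to the test split, in the same way as the `MinMaxScaler` in this notebook.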


In [4]:
#your code here
X = spaceship.drop(columns=['Transported'])
y = spaceship['Transported']

**Perform Train Test Split**

In [5]:
numeric_features = X.drop(columns = ['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP','PassengerId','Name'])
categorical_features = X.drop(columns =['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck','PassengerId','Name'])

In [6]:
numeric_features

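|      | Age  | RoomService | FoodCourt | ShoppingMall | Spa    | VRDeck |
|------|------|-------------|-----------|--------------|--------|--------|
| 0    | 39.0 | 0.0         | 0.0       | 0.0          | 0.0    | 0.0    |
| 1    | 24.0 | 109.0       | 9.0       | 25.0         | 549.0  | 44.0   |
| 2    | 58.0 | 43.0        | 3576.0    | 0.0          | 6715.0 | 49.0   |
| 3    | 33.0 | 0.0         | 1283.0    | 371.0        | 3329.0 | 193.0  |
| 4    | 16.0 | 303.0       | 70.0      | 151.0        | 565.0  | 2.0    |
| ...  | ...  | ...         | ...       | ...          | ...    | ...    |
| 8688 | 41.0 | 0.0         | 6819.0    | 0.0          | 1643.0 | 74.0   |
| 8689 | 18.0 | 0.0         | 0.0       | 0.0          | 0.0    | 0.0    |
| 8690 | 26.0 | 0.0         | 0.0       | 1872.0       | 1.0    | 0.0    |
| 8691 | 32.0 | 0.0         | 1049.0    | 0.0          | 353.0  | 3235.0 |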


In [7]:
X1 = numeric_features

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X1, y, test_size=0.2, random_state=0)

In [9]:
normalizer = MinMaxScaler()

normalizer.fit(X_train)

In [10]:
X_train_norm = normalizer.transform(X_train)

X_test_norm = normalizer.transform(X_test)

In [11]:
X_train_norm = pd.DataFrame(X_train_norm, columns = X_train.columns)
X_train_norm.head()

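|   | Age      | RoomService | FoodCourt | ShoppingMall | Spa      | VRDeck  |
|---|----------|-------------|-----------|--------------|----------|---------|
| 0 | 0.405063 | 0.0         | 0.0       | 0.0          | 0.0      | 0.0     |
| 1 | 0.050633 | 0.0         | 0.0       | 0.0          | 0.0      | 0.0     |
| 2 | 0.379747 | 0.0         | 0.007916  | 0.0          | 0.051276 | 0.0     |
| 3 | 0.21519  | 0.00131     | 0.0       | 0.046111     | 0.016378 | 4.9e-05 |
| 4 | 0.329114 | 0.0         | 0.0       | 0.0          | 0.0      | 0.0     |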


In [12]:
X_test_norm = pd.DataFrame(X_test_norm, columns = X_test.columns)
X_test_norm.head()

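|   | Age      | RoomService | FoodCourt | ShoppingMall | Spa     | VRDeck |
|---|----------|-------------|-----------|--------------|---------|--------|
| 0 | 0.632911 | 0.0         | 0.0       | 0.0          | 0.0     | 0.0    |
| 1 | 0.227848 | 0.0         | 0.0       | 0.0          | 0.0     | 0.0    |
| 2 | 0.189873 | 0.0         | 0.0       | 0.0          | 0.0     | 0.0    |
| 3 | 0.658228 | 0.0         | 0.0       | 0.0          | 0.0     | 0.0    |
| 4 | 0.78481  | 0.0         | 0.054775  | 0.0          | 0.07774 | 0.0    |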


**Model Selection** - now you will apply different ensemble methods to try to get a better model

- Bagging and Pasting

In [13]:
#your code here
bagging_reg = BaggingRegressor(DecisionTreeRegressor(max_depth=20),
                               n_estimators=100,
                               max_samples = 1000)

In [14]:
bagging_reg.fit(X_train_norm, y_train)

In [16]:
#evaluate performance
pred = bagging_reg.predict(X_test_norm)

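print(f"MAE {mean_absolute_error(y_test, pred): .2f}")
# RMSE as the square root of the MSE; the `squared=False` argument is deprecated in recent scikit-learn releases
print(f"RMSE {np.sqrt(mean_squared_error(y_test, pred)): .2f}")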
print(f"R2 score {bagging_reg.score(X_test_norm, y_test): .2f}")

MAE  0.32
RMSE  0.40
R2 score  0.35
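
`Transported` is a binary label, so a classifier is a natural alternative to the regressor used here. The sketch below is a hedged illustration on synthetic data from `make_classification` (not the Spaceship Titanic data), and it also shows the difference between bagging (sampling with replacement, `bootstrap=True`) and pasting (sampling without replacement, `bootstrap=False`). The hyperparameters are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a binary target with six numeric features
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)


def fit_and_score(bootstrap):
    """Train a bagged tree ensemble and return its test accuracy."""
    model = BaggingClassifier(
        DecisionTreeClassifier(max_depth=5),
        n_estimators=50,
        max_samples=200,      # each tree sees 200 training rows
        bootstrap=bootstrap,  # True: bagging, False: pasting
        random_state=0,
    )
    model.fit(X_tr, y_tr)
    return model.score(X_te, y_te)  # accuracy for classifiers


bagging_acc = fit_and_score(bootstrap=True)
pasting_acc = fit_and_score(bootstrap=False)
print(f"Bagging accuracy {bagging_acc:.2f}")
print(f"Pasting accuracy {pasting_acc:.2f}")
```

With classifiers, `score` reports accuracy rather than R2, so the numbers are not directly comparable with the regression metrics above.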


- Random Forests

In [25]:
#your code here
forest = RandomForestRegressor(n_estimators=100,
                             max_depth=20)

In [26]:
#train the model
forest.fit(X_train_norm, y_train)

In [27]:
#Evaluate the model
pred = forest.predict(X_test_norm)

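print(f"MAE {mean_absolute_error(y_test, pred): .2f}")
# RMSE as the square root of the MSE; the `squared=False` argument is deprecated in recent scikit-learn releases
print(f"RMSE {np.sqrt(mean_squared_error(y_test, pred)):.2f}")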
print(f"R2 score {forest.score(X_test_norm, y_test): .2f}")

MAE  0.31
RMSE 0.41
R2 score  0.34
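
A random forest can also report an out-of-bag (OOB) estimate and per-feature importances, which may help when deciding which features to keep. The following is a minimal sketch on synthetic data, assuming a classifier is acceptable for a binary target; the feature names are placeholders, not the Lab's columns.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data with six numeric features
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, n_informative=4, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # placeholder names

# oob_score=True evaluates each tree on the rows left out of its bootstrap sample
rf = RandomForestClassifier(
    n_estimators=100, max_depth=10, oob_score=True, random_state=0
)
rf.fit(X_demo, y_demo)

print(f"OOB accuracy {rf.oob_score_:.2f}")

# Impurity-based importances are normalised to sum to 1
ranked = sorted(
    zip(feature_names, rf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```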


- Gradient Boosting

In [29]:
#your code here: initialize
gb_reg = GradientBoostingRegressor(max_depth=20,
                                   n_estimators=100)
gb_reg.fit(X_train_norm, y_train)  # train the model

In [30]:
# evaluate
pred = gb_reg.predict(X_test_norm)

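print(f"MAE {mean_absolute_error(y_test, pred): .2f}")
# RMSE as the square root of the MSE; the `squared=False` argument is deprecated in recent scikit-learn releases
print(f"RMSE {np.sqrt(mean_squared_error(y_test, pred)): .2f}")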
print(f"R2 score {gb_reg.score(X_test_norm, y_test): .2f}")

MAE  0.31
RMSE  0.48
R2 score  0.09
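
Boosting is usually run with shallow trees, so it can be worth checking how tree depth affects the gap between training and test scores. The sketch below compares `max_depth=3` with `max_depth=20` on synthetic data; it is an assumption-laden illustration and says nothing definite about how the Spaceship Titanic model would behave.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data with six numeric features
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Store (train accuracy, test accuracy) for each tree depth
depth_scores = {}
for depth in (3, 20):
    gb = GradientBoostingClassifier(
        max_depth=depth, n_estimators=100, random_state=0
    )
    gb.fit(X_tr, y_tr)
    depth_scores[depth] = (gb.score(X_tr, y_tr), gb.score(X_te, y_te))
    train_acc, test_acc = depth_scores[depth]
    print(f"max_depth={depth}: train {train_acc:.2f}, test {test_acc:.2f}")
```

A large gap between the training and test accuracy for one depth would suggest that setting overfits.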


- Adaptive Boosting

In [31]:
#your code here
ada_reg = AdaBoostRegressor(DecisionTreeRegressor(max_depth=20),
                            n_estimators=100)
#train
ada_reg.fit(X_train_norm, y_train)



In [32]:
#evaluate
pred = ada_reg.predict(X_test_norm)

print(f"MAE {mean_absolute_error(y_test, pred): .2f}")
# RMSE as the square root of the MSE; the `squared=False` argument is deprecated in recent scikit-learn releases
print(f"RMSE {np.sqrt(mean_squared_error(y_test, pred)): .2f}")
print(f"R2 score {ada_reg.score(X_test_norm, y_test):.2f}")

MAE  0.34
RMSE  0.48
R2 score 0.09
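
AdaBoost is commonly paired with very shallow trees (decision stumps) rather than deep ones. The sketch below shows that setup with a classifier on synthetic data; the choice of `max_depth=1` and 100 estimators is an assumption for illustration, not a tuned configuration for this Lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data with six numeric features
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Decision stumps as weak learners; each round reweights misclassified rows
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=100,
    random_state=0,
)
ada_clf.fit(X_tr, y_tr)

ada_acc = ada_clf.score(X_te, y_te)
print(f"AdaBoost accuracy {ada_acc:.2f}")
```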


Which model is the best and why?

In [None]:
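#comment here:
#Conclusion: the Bagging regressor performs best overall, with the highest R2 (0.35) and the lowest RMSE (0.40).
#The Random Forest is a close second (R2 0.34, RMSE 0.41) and has a slightly lower MAE (0.31 vs 0.32).
#Gradient Boosting and AdaBoost trail far behind (R2 0.09 each), possibly because boosted trees of depth 20 overfit.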
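
To make a comparison like this less dependent on a single train/test split, the candidate models can be scored with cross-validation and collected into one table. The sketch below does this with classifiers on synthetic data; the model settings and the name `summary` are assumptions for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data with six numeric features
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, n_informative=4, random_state=0
)

models = {
    "Bagging": BaggingClassifier(
        DecisionTreeClassifier(max_depth=5), n_estimators=50, random_state=0
    ),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=1), n_estimators=100, random_state=0
    ),
}

# 5-fold cross-validation with two metrics per model
rows = []
for name, model in models.items():
    cv = cross_validate(model, X_demo, y_demo, cv=5, scoring=["accuracy", "f1"])
    rows.append({
        "model": name,
        "accuracy": cv["test_accuracy"].mean(),
        "f1": cv["test_f1"].mean(),
    })

summary = (
    pd.DataFrame(rows)
    .set_index("model")
    .sort_values("accuracy", ascending=False)
)
print(summary.round(3))

best_model = summary.index[0]
print(f"Highest mean accuracy: {best_model}")
```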