# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see if you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [31]:
# Libraries
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostRegressor, BaggingRegressor, GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import classification_report, f1_score, precision_score, recall_score
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeRegressor

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

|   | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


In [11]:
spaceship.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

In [12]:
# drop every row that has at least one null value
spaceship = spaceship.dropna()

In [13]:
# check for missing values again to be sure
spaceship.isnull().sum()

PassengerId     0
HomePlanet      0
CryoSleep       0
Cabin           0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
Name            0
Transported     0
dtype: int64

Now perform the same steps as before:
- Feature Scaling
- Feature Selection


In [16]:
# keep only the deck letter, i.e. the first character of values such as "B/0/P"
spaceship['Cabin'] = spaceship['Cabin'].str[0]

In [17]:
spaceship = spaceship.drop(columns=['PassengerId', 'Name'])

- For non-numerical columns, create dummy variables.

In [18]:
# select non-numerical columns
df_categorical = spaceship.select_dtypes('object')

categorical_columns = ['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP']

In [20]:
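# one-hot encode the categorical columns
# NOTE: spaceship still contains 'Transported', so the target ends up inside X
# (target leakage); this is why the scores further down look almost perfect
X = pd.get_dummies(spaceship, columns=categorical_columns)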

**Perform Train Test Split**

In [21]:
target = spaceship['Transported']
X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.20, random_state=0)
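The feature matrix above is built from the full `spaceship` DataFrame, which still holds the `Transported` column. A minimal sketch of a leak-free alternative is shown here on a tiny made-up frame; the helper name `split_features_target` and the toy values are assumptions for illustration, not part of the original lab.

```python
import pandas as pd

# Hypothetical toy frame that mimics a few Spaceship Titanic columns
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Europa", "Mars", "Earth"],
    "CryoSleep": [False, True, False, True],
    "Age": [24.0, 58.0, 33.0, 16.0],
    "Transported": [True, False, False, True],
})

def split_features_target(df, target_col, categorical_cols):
    """Return (X, y) with the target removed from X before one-hot encoding."""
    y = df[target_col].astype(int)
    X = pd.get_dummies(df.drop(columns=[target_col]), columns=categorical_cols)
    return X, y

X_toy, y_toy = split_features_target(toy, "Transported", ["HomePlanet", "CryoSleep"])

# The target must not appear among the feature columns
print("Transported" in X_toy.columns)
print(list(X_toy.columns))
```

Applied to the real data, the same idea would amount to calling `pd.get_dummies` on `spaceship.drop(columns=['Transported'])` before the train/test split.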

**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

Normalization

In [24]:
normalizer = MinMaxScaler()

normalizer.fit(X_train)

In [25]:
X_train_norm = normalizer.transform(X_train)

X_test_norm = normalizer.transform(X_test)
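The scaler is fitted on the training set only and then reused for the test set, so no information from the test rows influences the scaling. The small sketch below, on made-up numbers, shows that `MinMaxScaler` applies `(x - min_train) / (max_train - min_train)` and that test values outside the training range can therefore fall outside `[0, 1]`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data for illustration
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[15.0], [40.0]])

scaler = MinMaxScaler().fit(train)       # learns min=10 and max=30 from train only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

# Same transformation written out by hand with the training statistics
manual = (test - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))

print(train_scaled.ravel())
print(test_scaled.ravel())
print(np.allclose(test_scaled, manual))
```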

- KNN

In [26]:
knn = KNeighborsRegressor(n_neighbors=10)

In [27]:
knn.fit(X_train_norm, y_train)

In [28]:
# for a regressor, score() returns the R^2 on the given data
knn.score(X_test_norm, y_test)

0.9826928895612708

In [29]:
# Predictions
y_pred = knn.predict(X_test_norm)

# Calculate MSE
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')

# Calculate MAE
mae = mean_absolute_error(y_test, y_pred)
print(f'Mean Absolute Error: {mae}')

Mean Squared Error: 0.0043267776096823
Mean Absolute Error: 0.008623298033282904


- Bagging and Pasting

In [48]:
bagging_reg = BaggingRegressor(DecisionTreeRegressor(max_depth=20),
                               n_estimators= 100,
                               max_samples = 1000)

Train the Bagging model on our normalized data

In [49]:
bagging_reg.fit(X_train_norm, y_train)

Evaluate model's performance

In [50]:
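pred = bagging_reg.predict(X_test_norm)

# metrics take (y_true, y_pred); RMSE is computed with np.sqrt because the
# squared=False argument is deprecated in recent scikit-learn releases
print("MAE", mean_absolute_error(y_test, pred))
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 score", bagging_reg.score(X_test_norm, y_test))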

MAE 0.0
RMSE 0.0
R2 score 1.0




In [51]:
# Initialize the Pasting model
pasting_reg = BaggingRegressor(
  estimator=DecisionTreeRegressor(max_depth=20),
  n_estimators=100,
  max_samples=1000,
  bootstrap=False  # sample without replacement (pasting); the default True gives bagging
)

# Train the model with normalized data
pasting_reg.fit(X_train_norm, y_train)

# Evaluate the model's performance
pred = pasting_reg.predict(X_test_norm)

# RMSE via np.sqrt, since squared=False is deprecated in recent scikit-learn releases
print("MAE:", mean_absolute_error(y_test, pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 score:", pasting_reg.score(X_test_norm, y_test))

MAE: 0.0
RMSE: 0.0
R2 score: 1.0
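The only difference between the two models above is how each estimator's training subset is drawn. The following sketch uses plain NumPy on ten made-up row indices to illustrate the two sampling schemes; the sample sizes are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
indices = np.arange(10)  # stand-in for the row indices of a training set

# Bagging: draw with replacement, so the same row can appear several times
bagging_sample = rng.choice(indices, size=10, replace=True)

# Pasting: draw without replacement, so every row appears at most once
pasting_sample = rng.choice(indices, size=6, replace=False)

print("bagging:", np.sort(bagging_sample))
print("pasting:", np.sort(pasting_sample))
print("unique rows in pasting sample:", len(set(pasting_sample.tolist())))
```

In `BaggingRegressor`, `bootstrap=True` corresponds to the first draw and `bootstrap=False` to the second, with `max_samples` playing the role of `size`.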




- Random Forests

Initialize a Random Forest

In [52]:
forest = RandomForestRegressor(n_estimators=100,
                             max_depth=20)

Training the model

In [53]:
forest.fit(X_train_norm, y_train)

Evaluate the model

In [54]:
pred = forest.predict(X_test_norm)

# RMSE via np.sqrt, since squared=False is deprecated in recent scikit-learn releases
print("MAE", mean_absolute_error(y_test, pred))
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 score", forest.score(X_test_norm, y_test))

MAE 0.0
RMSE 0.0
R2 score 1.0




- Gradient Boosting

Initialize a Gradient Boosting model

In [55]:
gb_reg = GradientBoostingRegressor(max_depth=20,
                                   n_estimators=100)

Training the model

In [56]:
gb_reg.fit(X_train_norm, y_train)

Evaluate the model

In [57]:
pred = gb_reg.predict(X_test_norm)

# RMSE via np.sqrt, since squared=False is deprecated in recent scikit-learn releases
print("MAE", mean_absolute_error(y_test, pred))
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 score", gb_reg.score(X_test_norm, y_test))

MAE 1.3280699443793777e-05
RMSE 1.3281247391168192e-05
R2 score 0.9999999992944338




- Adaptive Boosting

Initialize an Adaptive Boosting model

In [58]:
ada_reg = AdaBoostRegressor(DecisionTreeRegressor(max_depth=20),
                            n_estimators=100)

Training the model

In [59]:
ada_reg.fit(X_train_norm, y_train)

Evaluate the model

In [60]:
pred = ada_reg.predict(X_test_norm)

# RMSE via np.sqrt, since squared=False is deprecated in recent scikit-learn releases
print("MAE", mean_absolute_error(y_test, pred))
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 score", ada_reg.score(X_test_norm, y_test))

MAE 0.0
RMSE 0.0
R2 score 1.0




Which model is the best and why?

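On these numbers no model can be singled out as the best: the R2 score is about 0.98 for KNN and essentially 1.0 for every ensemble. Scores this close to perfect are a warning sign rather than evidence of a good model. `X` was built from the full `spaceship` DataFrame and still contains `Transported`, so every model can read the target directly from its inputs. A meaningful comparison requires dropping `Transported` from `X` and, because the target is binary, using classifiers scored with a metric such as accuracy or F1.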
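Since `Transported` is a binary label, one way to compare the ensemble methods is with their classifier counterparts and a classification metric. The sketch below does this on synthetic data from `make_classification`, used here as a stand-in for the real features (an assumption); the hyperparameters are illustrative and are not tuned for the Spaceship Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (assumption): 500 rows, binary target
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

models = {
    "bagging": BaggingClassifier(DecisionTreeClassifier(max_depth=5),
                                 n_estimators=50, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, max_depth=3,
                                                    random_state=0),
    "ada_boost": AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                                    n_estimators=50, random_state=0),
}

# Fit each model on the training split and score it on the held-out split
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = accuracy_score(y_te, model.predict(X_te))

for name, acc in results.items():
    print(f"{name}: accuracy = {acc:.3f}")
```

With the target kept out of the features, the scores should no longer be identical, which makes it possible to rank the models.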