# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this lab, try different ensemble methods to see if you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous lab.

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor


In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [7]:
spaceship.dropna(inplace=True)

In [8]:
spaceship['Cabin'] = spaceship['Cabin'].str.split('/').str[0]


In [9]:
spaceship.drop(columns=['PassengerId', 'Name'], inplace=True)   

- Feature Scaling

In [73]:
# Select the numeric columns
numeric_columns = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']


# extract the numerical variables
X = spaceship[numeric_columns]
var_target = spaceship['Transported']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, var_target, test_size=0.2, random_state=0)

# Create a scaler object
scaler = StandardScaler()

# Fit the scaler on the training data only, then transform the training data
X_train_scaled = scaler.fit_transform(X_train)

# Transform data in the test set
X_test_scaled = scaler.transform(X_test)


- Feature Selection

In [74]:

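# select the non-object columns
# NOTE: this also keeps the boolean target column 'Transported', so the feature
# matrix leaks the target; a model fitted on it can score perfectly for that reason alone
features = spaceship.select_dtypes(exclude='object')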

# define the target
target = spaceship["Transported"]


# Split the data into training and testing sets 
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.20, random_state=0)
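The selection above keeps every non-object column, and that includes the boolean target `Transported`. A minimal sketch of a leak-free selection follows. It runs on a tiny made-up frame (`toy` and its values are assumptions for illustration, not lab data) and selects numeric columns only after dropping the target.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny stand-in frame with the same kinds of columns as the lab data (values are made up).
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Europa", "Mars", "Earth", "Europa", "Mars", "Earth", "Europa"],
    "Age": [24.0, 39.0, 58.0, 33.0, 16.0, 44.0, 26.0, 28.0],
    "Spa": [549.0, 0.0, 6715.0, 3329.0, 565.0, 291.0, 0.0, 0.0],
    "Transported": [True, False, False, False, True, True, True, True],
})

target_name = "Transported"

# Drop the target first, then keep only numeric columns as features.
features = toy.drop(columns=[target_name]).select_dtypes(include="number")
target = toy[target_name]

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, random_state=0
)

print(list(features.columns))  # → ['Age', 'Spa']
```

Applied to the lab data, the same pattern would give the six columns listed in `numeric_columns`, so the target never reaches the model as an input.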

**Perform Train Test Split**

In [75]:
#create an instance of the normalizer

normalizer = MinMaxScaler()

normalizer.fit(X_train)

X_train_norm = normalizer.transform(X_train)

X_test_norm = normalizer.transform(X_test)

X_train_norm = pd.DataFrame(X_train_norm, columns = X_train.columns)
X_test_norm = pd.DataFrame(X_test_norm, columns = X_test.columns)
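To make the difference between the two scalers concrete, here is a small sketch on made-up numbers (the arrays `train` and `test` are assumptions for illustration). Both scalers are fitted on the training rows only, and the test row is transformed with the training statistics, so it may fall outside the training range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up data: three training rows and one test row, two features each.
train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 60.0]])
test = np.array([[20.0, 35.0]])

# Fit on the training data only.
minmax = MinMaxScaler().fit(train)
standard = StandardScaler().fit(train)

train_norm = minmax.transform(train)   # each training column mapped to [0, 1]
train_std = standard.transform(train)  # each training column: mean 0, unit variance
test_norm = minmax.transform(test)     # uses the training min and max

print(train_norm.min(axis=0), train_norm.max(axis=0))
print(train_std.mean(axis=0).round(6), train_std.std(axis=0).round(6))
print(test_norm)  # roughly [[2.0, 0.5]]: 20 lies beyond the training maximum of 10
```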

**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [63]:
# Initialize the bagging model(base)
bagging_reg = BaggingRegressor(DecisionTreeRegressor(max_depth=20),
                               n_estimators=100,
                               max_samples = 1000)

bagging_reg.fit(X_train, y_train)

pred = bagging_reg.predict(X_test)

print("MAE", mean_absolute_error(pred, y_test))
print("RMSE", mean_squared_error(pred, y_test, squared=False))
print("R2 score", bagging_reg.score(X_test, y_test))

MAE 0.0
RMSE 0.0
R2 score 1.0




In [76]:
# Initialize the bagging model(normalized)
bagging_reg = BaggingRegressor(DecisionTreeRegressor(max_depth=20),
                               n_estimators=100,
                               max_samples = 1000)

bagging_reg.fit(X_train_norm, y_train)

pred = bagging_reg.predict(X_test_norm)

print("MAE", mean_absolute_error(pred, y_test))
print("RMSE", mean_squared_error(pred, y_test, squared=False))
print("R2 score", bagging_reg.score(X_test_norm, y_test))

MAE 0.0
RMSE 0.0
R2 score 1.0




In [77]:
# Initialize the bagging model(standardized)
bagging_reg = BaggingRegressor(DecisionTreeRegressor(max_depth=20),
                               n_estimators=100,
                               max_samples = 1000)

bagging_reg.fit(X_train_scaled, y_train)

pred = bagging_reg.predict(X_test_scaled)

print("MAE", mean_absolute_error(pred, y_test))
print("RMSE", mean_squared_error(pred, y_test, squared=False))
print("R2 score", bagging_reg.score(X_test_scaled, y_test))

MAE 0.3171613389846439
RMSE 0.4024575144275858
R2 score 0.3521117963230782




In [65]:
# Initialize the Pasting model (base)
pasting_reg = BaggingRegressor(
  estimator=DecisionTreeRegressor(max_depth=20),
  n_estimators=100,
  max_samples=1000,
  bootstrap=False  # This ensures that no bootstrap is done, which is characteristic of Pasting
)

# Train the model with the unscaled (base) data
pasting_reg.fit(X_train, y_train)

# Evaluate the model's performance
pred = pasting_reg.predict(X_test)

print("MAE:", mean_absolute_error(pred, y_test))
print("RMSE:", mean_squared_error(pred, y_test, squared=False))
print("R2 score:", pasting_reg.score(X_test, y_test))

MAE: 0.0
RMSE: 0.0
R2 score: 1.0




In [66]:
# Initialize the Pasting model (normalized)
pasting_reg = BaggingRegressor(
  estimator=DecisionTreeRegressor(max_depth=20),
  n_estimators=100,
  max_samples=1000,
  bootstrap=False  # This ensures that no bootstrap is done, which is characteristic of Pasting
)

# Train the model with normalized data
pasting_reg.fit(X_train_norm, y_train)

# Evaluate the model's performance
pred = pasting_reg.predict(X_test_norm)

print("MAE:", mean_absolute_error(pred, y_test))
print("RMSE:", mean_squared_error(pred, y_test, squared=False))
print("R2 score:", pasting_reg.score(X_test_norm, y_test))

MAE: 0.0
RMSE: 0.0
R2 score: 1.0




In [78]:
# Initialize the Pasting model (standardized)
pasting_reg = BaggingRegressor(
  estimator=DecisionTreeRegressor(max_depth=20),
  n_estimators=100,
  max_samples=1000,
  bootstrap=False  # This ensures that no bootstrap is done, which is characteristic of Pasting
)

# Train the model with standardized data
pasting_reg.fit(X_train_scaled, y_train)

# Evaluate the model's performance
pred = pasting_reg.predict(X_test_scaled)

print("MAE:", mean_absolute_error(pred, y_test))
print("RMSE:", mean_squared_error(pred, y_test, squared=False))
print("R2 score:", pasting_reg.score(X_test_scaled, y_test))

MAE: 0.31549822294089797
RMSE: 0.40239715584196417
R2 score: 0.352306115881192




- Random Forests

In [61]:
#initialize a random forest (base)

forest = RandomForestRegressor(n_estimators=100,
                             max_depth=20)
# train the model
forest.fit(X_train, y_train)
#evaluate the model
pred = forest.predict(X_test)

print("MAE", mean_absolute_error(pred, y_test))
print("RMSE", mean_squared_error(pred, y_test, squared=False))
print("R2 score", forest.score(X_test, y_test))

MAE 0.0
RMSE 0.0
R2 score 1.0




In [60]:
#initialize a random forest(normalized)

forest = RandomForestRegressor(n_estimators=100,
                             max_depth=20)
# train the model
forest.fit(X_train_norm, y_train)
#evaluate the model
pred = forest.predict(X_test_norm)

print("MAE", mean_absolute_error(pred, y_test))
print("RMSE", mean_squared_error(pred, y_test, squared=False))
print("R2 score", forest.score(X_test_norm, y_test))

MAE 0.0
RMSE 0.0
R2 score 1.0




In [79]:
#initialize a random forest(standardized)

forest = RandomForestRegressor(n_estimators=100,
                             max_depth=20)
# train the model
forest.fit(X_train_scaled, y_train)

#evaluate the model
pred = forest.predict(X_test_scaled)

print("MAE", mean_absolute_error(pred, y_test))
print("RMSE", mean_squared_error(pred, y_test, squared=False))
print("R2 score", forest.score(X_test_scaled, y_test))

MAE 0.3130242087455354
RMSE 0.40764201366582004
R2 score 0.33531195477790154




- Gradient Boosting

In [59]:
#initialize the Gradient Boosting model (base)
gb_reg = GradientBoostingRegressor(max_depth=20,
                                   n_estimators=100)
#train the model
gb_reg.fit(X_train, y_train)

#evaluate the model
pred = gb_reg.predict(X_test)

print("MAE", mean_absolute_error(pred, y_test))
print("RMSE", mean_squared_error(pred, y_test, squared=False))
print("R2 score", gb_reg.score(X_test, y_test))

MAE 1.328069944376701e-05
RMSE 1.32812473911417e-05
R2 score 0.9999999992944338




In [57]:
#initialize the Gradient Boosting model (normalized)
gb_reg = GradientBoostingRegressor(max_depth=20,
                                   n_estimators=100)
#train the model
gb_reg.fit(X_train_norm, y_train)

#evaluate the model
pred = gb_reg.predict(X_test_norm)

print("MAE", mean_absolute_error(pred, y_test))
print("RMSE", mean_squared_error(pred, y_test, squared=False))
print("R2 score", gb_reg.score(X_test_norm, y_test))

MAE 1.328069944382936e-05
RMSE 1.3281247391203615e-05
R2 score 0.9999999992944338




In [81]:
#initialize the Gradient Boosting model (standardized)
gb_reg = GradientBoostingRegressor(max_depth=20,
                                   n_estimators=100)
#train the model
gb_reg.fit(X_train_scaled, y_train)

#evaluate the model
pred = gb_reg.predict(X_test_scaled)

print("MAE", mean_absolute_error(pred, y_test))
print("RMSE", mean_squared_error(pred, y_test, squared=False))
print("R2 score", gb_reg.score(X_test_scaled, y_test))

MAE 0.3110158012800247
RMSE 0.47269264514802634
R2 score 0.1062466528918482




- Adaptive Boosting

In [None]:
# Initialize the AdaBoost model (base)
ada_reg = AdaBoostRegressor(DecisionTreeRegressor(max_depth=20),
                            n_estimators=200)
# Training the model
ada_reg.fit(X_train, y_train)
# Evaluate the model
pred = ada_reg.predict(X_test) 

print("MAE", mean_absolute_error(pred, y_test))
print("RMSE", mean_squared_error(pred, y_test, squared=False))
print("R2 score", ada_reg.score(X_test, y_test))

In [55]:
# Initialize the AdaBoost model (normalized)
ada_reg = AdaBoostRegressor(DecisionTreeRegressor(max_depth=20),
                            n_estimators=200)
# Training the model
ada_reg.fit(X_train_norm, y_train)
# Evaluate the model
pred = ada_reg.predict(X_test_norm) 

print("MAE", mean_absolute_error(pred, y_test))
print("RMSE", mean_squared_error(pred, y_test, squared=False))
print("R2 score", ada_reg.score(X_test_norm, y_test))

MAE 0.0
RMSE 0.0
R2 score 1.0




In [82]:
# Initialize the AdaBoost model (standardized)
ada_reg = AdaBoostRegressor(DecisionTreeRegressor(max_depth=20),
                            n_estimators=200)
# Training the model
ada_reg.fit(X_train_scaled, y_train)
# Evaluate the model
pred = ada_reg.predict(X_test_scaled) 

print("MAE", mean_absolute_error(pred, y_test))
print("RMSE", mean_squared_error(pred, y_test, squared=False))
print("R2 score", ada_reg.score(X_test_scaled, y_test))

MAE 0.3436204173133435
RMSE 0.47800111308120524
R2 score 0.08605974357251545




- Linear Regression

In [84]:
# Create and train the linear regression model (base)
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Make predictions
pred = lin_reg.predict(X_test)

# Evaluate the model
print("MAE", mean_absolute_error(pred, y_test))
print("RMSE", mean_squared_error(pred, y_test, squared=False))
print("R2 score", lin_reg.score(X_test, y_test))

# Optionally, you can examine the coefficients to see feature importance
lin_reg_coef = {feature: coef for feature, coef in zip(X_train.columns, lin_reg.coef_)}
print("Feature coefficients:", lin_reg_coef)

MAE 1.9265808081853852e-14
RMSE 2.4742159933628118e-14
R2 score 1.0
Feature coefficients: {'Age': -1.7217624926803148e-15, 'RoomService': 4.869654778431443e-18, 'FoodCourt': 1.7131210677976042e-18, 'ShoppingMall': -4.682691739692561e-18, 'Spa': 2.270938739196473e-18, 'VRDeck': 1.0911215216118525e-18, 'Transported': 0.9999999999999986}




In [83]:
# Create and train the linear regression model (normalized)
lin_reg = LinearRegression()
lin_reg.fit(X_train_norm, y_train)

# Make predictions
pred = lin_reg.predict(X_test_norm)

# Evaluate the model
print("MAE", mean_absolute_error(pred, y_test))
print("RMSE", mean_squared_error(pred, y_test, squared=False))
print("R2 score", lin_reg.score(X_test_norm, y_test))

# Optionally, you can examine the coefficients to see feature importance
lin_reg_coef = {feature: coef for feature, coef in zip(X_train.columns, lin_reg.coef_)}
print("Feature coefficients:", lin_reg_coef)

MAE 1.382073678670398e-16
RMSE 2.6877439173008263e-16
R2 score 1.0
Feature coefficients: {'Age': -3.2211556319156853e-16, 'RoomService': -3.955096301892952e-15, 'FoodCourt': 4.504552355890548e-16, 'ShoppingMall': 1.0752455624138713e-16, 'Spa': -1.3771858801845555e-15, 'VRDeck': -2.9287575906476316e-16, 'Transported': 0.9999999999999998}




In [85]:
# Create and train the linear regression model (standardized)
lin_reg = LinearRegression()
lin_reg.fit(X_train_scaled, y_train)

# Make predictions
pred = lin_reg.predict(X_test_scaled)

# Evaluate the model
print("MAE", mean_absolute_error(pred, y_test))
print("RMSE", mean_squared_error(pred, y_test, squared=False))
print("R2 score", lin_reg.score(X_test_scaled, y_test))

# Optionally, you can examine the coefficients to see feature importance
# (X_train_scaled was built from numeric_columns, so pair the coefficients with those names)
lin_reg_coef = {feature: coef for feature, coef in zip(numeric_columns, lin_reg.coef_)}
print("Feature coefficients:", lin_reg_coef)

MAE 0.4431598170331094
RMSE 0.46007190861149444
R2 score 0.15333535562630674
Feature coefficients: {'Age': -0.02260406611062138, 'RoomService': -0.12452859017047915, 'FoodCourt': 0.07109304421255994, 'ShoppingMall': 0.020563743238231805, 'Spa': -0.10467721320361727, 'VRDeck': -0.10449890741147175}
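Every cell above repeats the same fit, predict and print steps. One possible way to condense them is a small helper that returns the three metrics for any model. In the sketch below, `evaluate` is a hypothetical helper (not part of scikit-learn), and synthetic data from `make_regression` stands in for the lab data. RMSE is computed with `np.sqrt`, so the code does not rely on the `squared` argument, which newer scikit-learn releases have deprecated.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit the model and return MAE, RMSE and R2 on the test set."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return {
        "MAE": mean_absolute_error(y_test, pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
        "R2": r2_score(y_test, pred),
    }


# Synthetic regression data used only to make the sketch runnable.
X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "bagging": BaggingRegressor(DecisionTreeRegressor(max_depth=5), n_estimators=50, random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=50, max_depth=5, random_state=0),
    "linear_regression": LinearRegression(),
}

results = {name: evaluate(m, X_train, y_train, X_test, y_test) for name, m in models.items()}
for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Fixing `random_state` on the ensembles, as done here, also makes repeated runs comparable, which the cells above do not guarantee.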




Which model is the best and why?

In [None]:
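# The "base" and "normalized" runs reach R2 = 1.0 only because their feature matrix
# still contains the target column 'Transported' (the linear regression assigns it a
# coefficient of about 1.0 and about 0 to everything else). Those scores reflect
# target leakage, not model quality, so they should not be used to rank the models.
#
# Among the leak-free runs (standardized numeric columns only):
#   Bagging / Pasting : R2 ~ 0.352, RMSE ~ 0.402  (best; the two are nearly identical)
#   Random Forest     : R2 ~ 0.335, RMSE ~ 0.408
#   Linear Regression : R2 ~ 0.153, RMSE ~ 0.460
#   Gradient Boosting : R2 ~ 0.106, RMSE ~ 0.473
#   AdaBoost          : R2 ~ 0.086, RMSE ~ 0.478
#
# Bagging and pasting of depth-20 trees therefore generalize best here. The boosting
# models use the same very deep trees (max_depth=20), which likely makes them overfit;
# shallower base trees would be the first thing to try.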