Tech Project 2 - Ensemble Methods

Diogo Pessoa


In [3]:
data_file = 'combined_data.csv'  # Set before %run so data_loader can locate the CSV on the filesystem.
%run data_loader.ipynb 

Data loaded:
- x_train_sc: Scaled training features.
- x_test_sc: Scaled testing features.
- x_train - Training features.
- x_test - Testing features.
- y_train - Training labels.
- y_test - Testing labels.
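
The contents of data_loader.ipynb are not shown here, so the following is only a minimal sketch of how splits like the ones listed above could be produced. It assumes a plain train/test split followed by standard scaling; the synthetic data and every `demo_` name are hypothetical and stand in for the real CSV.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data; the real notebook reads combined_data.csv instead.
rng = np.random.default_rng(42)
demo_X = rng.normal(size=(200, 5))
demo_y = demo_X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

demo_x_train, demo_x_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, then reuse it on the test split,
# so no information from the test set leaks into the preprocessing.
demo_scaler = StandardScaler().fit(demo_x_train)
demo_x_train_sc = demo_scaler.transform(demo_x_train)
demo_x_test_sc = demo_scaler.transform(demo_x_test)
```

The `demo_` prefix keeps these arrays from overwriting the variables that data_loader.ipynb actually defines.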


In [4]:
%run FeatureEngineering/PCA.ipynb

PCA applied to the training and testing features:
- x_train_pca_Trans_sc: Scaled training features.
- x_test_pca_Trans_sc: Scaled testing features.
- x_train_pca_Trans - Training features.
- x_test_pca_Trans - Testing features.
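
FeatureEngineering/PCA.ipynb is likewise not reproduced here. As a hedged sketch of the usual pattern, PCA is fitted on the training features and the same fitted projection is then applied to the test features. The number of components (3) and the `demo_` arrays are assumptions made for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in features; shapes are chosen only for the demo.
rng = np.random.default_rng(0)
demo_train = rng.normal(size=(100, 6))
demo_test = rng.normal(size=(25, 6))

# Scale first, since PCA is sensitive to feature magnitudes.
demo_scaler = StandardScaler().fit(demo_train)
demo_train_sc = demo_scaler.transform(demo_train)
demo_test_sc = demo_scaler.transform(demo_test)

# Fit PCA on the training data only and reuse the projection for the test data.
demo_pca = PCA(n_components=3).fit(demo_train_sc)
demo_train_pca = demo_pca.transform(demo_train_sc)
demo_test_pca = demo_pca.transform(demo_test_sc)

# Share of the total variance retained by the kept components.
retained = demo_pca.explained_variance_ratio_.sum()
```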


In [ ]:
# %run FeatureEngineering/Bayesian_optimization.ipynb

## Voting/Bagging Regressor

* [Voting Regressor](https://scikit-learn.org/stable/modules/ensemble.html#voting-regressor)
* [Bagging Regressor](https://scikit-learn.org/stable/modules/ensemble.html#bagging-regressor)

RandomForestRegressor is deliberately left out of this test, since it was already used in the initial notebook.

In [7]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import VotingRegressor
from sklearn.ensemble import BaggingRegressor

n_estimators = 140
rd_state = 42
# Base regressors to combine in the Voting Regressor (not fitted yet)
grad_boosting_regressor = GradientBoostingRegressor(random_state=rd_state, n_estimators=n_estimators)
bagging_regressor = BaggingRegressor(random_state=rd_state, n_estimators=n_estimators)

# The Voting Regressor averages the predictions of its base regressors
voting_reg = VotingRegressor(estimators=[('gb', grad_boosting_regressor), ('bagging_r', bagging_regressor)])
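
To make the averaging concrete, here is a small sketch on synthetic data (the `demo_` names, the dataset, and the reduced `n_estimators=20` are assumptions chosen to keep it fast). With no weights set, the ensemble prediction should equal the plain mean of the member predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor, VotingRegressor

# Hypothetical toy data, unrelated to combined_data.csv.
demo_X, demo_y = make_regression(n_samples=200, n_features=5, noise=0.5, random_state=42)

demo_voting = VotingRegressor(estimators=[
    ('gb', GradientBoostingRegressor(random_state=42, n_estimators=20)),
    ('bagging_r', BaggingRegressor(random_state=42, n_estimators=20)),
]).fit(demo_X, demo_y)

# estimators_ holds the fitted copies of the base regressors.
member_preds = np.column_stack([est.predict(demo_X) for est in demo_voting.estimators_])
manual_average = member_preds.mean(axis=1)

# True when the ensemble output matches the unweighted mean of its members.
matches = np.allclose(manual_average, demo_voting.predict(demo_X))
```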

In [11]:
# Train the Voting Regressor on the scaled features, with and without PCA
import time
from sklearn.base import clone

# Time the fit on the scaled features (no PCA)
start_time = time.time()
voting_reg_sc = voting_reg.fit(x_train_sc, y_train)
duration = time.time() - start_time

# fit() returns the estimator itself, so fit a clone on the PCA features;
# refitting voting_reg here would overwrite the model trained above
start_time = time.time()
voting_reg_sc_pca = clone(voting_reg).fit(x_train_pca_Trans_sc, y_train)
duration_pca = time.time() - start_time
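
One scikit-learn detail matters when the same estimator is fitted on two feature sets: `fit` returns the estimator itself, so two names assigned from two `fit` calls on one object both point at the last fit. The sketch below illustrates this with a `LinearRegression` on toy data (all names and data here are hypothetical) and shows `sklearn.base.clone` as one way to keep the fits independent.

```python
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Two hypothetical feature sets with different numbers of columns.
X_a, y_a = make_regression(n_samples=50, n_features=4, random_state=0)
X_b, y_b = make_regression(n_samples=50, n_features=2, random_state=1)

model = LinearRegression()
first = model.fit(X_a, y_a)
second = model.fit(X_b, y_b)  # refits the same object in place

same_object = first is second            # both names refer to one estimator
features_seen = first.n_features_in_     # reflects the second fit only

# clone() copies the hyperparameters into a fresh, unfitted estimator.
independent = clone(model).fit(X_a, y_a)
```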

In [10]:
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score

voting_regressor_pred = voting_reg.predict(x_test_sc)
voting_regressor_sc_pca_pred = voting_reg_sc_pca.predict(x_test_pca_Trans_sc)

# Scaled Data
mse = mean_squared_error(y_test, voting_regressor_pred)
r_two_score = r2_score(y_test, voting_regressor_pred)
ex_variance_score = explained_variance_score(y_test, voting_regressor_pred)

# Using PCA applied Set
mse_pca = mean_squared_error(y_test, voting_regressor_sc_pca_pred)
r_two_score_pca = r2_score(y_test, voting_regressor_sc_pca_pred)
ex_variance_score_pca = explained_variance_score(y_test, voting_regressor_sc_pca_pred)

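print(f"Voting Regressor Model training took {duration:.2f} seconds.")
print(f'Mean Squared Error on Test Set: {mse:.2f}')
print(f'r2 score: {r_two_score:.2f}')
print(f'explained variance score: {ex_variance_score:.2f}')

print(f"Voting Regressor Model with PCA took {duration_pca:.2f} seconds.")
print(f'Mean Squared Error on PCA Test Set: {mse_pca:.2f}')
print(f'r2 score PCA: {r_two_score_pca:.2f}')
print(f'explained variance score PCA: {ex_variance_score_pca:.2f}')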




Voting Regressor Model training took 176.30 seconds.
Mean Squared Error on Test Set: 0.25
r2 score: 0.92
explained variance score: 0.92
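
For reference, the three reported metrics can be reproduced by hand. The sketch below uses a tiny made-up example (the `y_true` and `y_pred` values are hypothetical, not taken from this dataset) and compares the manual formulas with the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import explained_variance_score, mean_squared_error, r2_score

# Hypothetical targets and predictions, for illustration only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
residuals = y_true - y_pred

# MSE: mean of the squared residuals.
mse_manual = np.mean(residuals ** 2)

# R^2: 1 - (residual sum of squares / total sum of squares around the mean).
r2_manual = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# Explained variance: 1 - Var(residuals) / Var(y_true).
ev_manual = 1 - np.var(residuals) / np.var(y_true)

agree = (
    np.isclose(mse_manual, mean_squared_error(y_true, y_pred))
    and np.isclose(r2_manual, r2_score(y_true, y_pred))
    and np.isclose(ev_manual, explained_variance_score(y_true, y_pred))
)
```

R² and explained variance differ only in that explained variance ignores the mean of the residuals, so the two coincide when the predictions are unbiased on average. Matching values such as the ones printed above are consistent with a mean residual close to zero.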


In [None]:
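# Train the Bagging Regressor on its own
# Note: x_train_Trans / x_test_Trans are not among the variables listed by the
# loader cells above; they are assumed to be defined elsewhere before this runs.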
import time
# Record the start time
start_time = time.time()
bagging_regressor.fit(x_train_Trans, y_train)
# Calculate the duration
end_time = time.time()
duration = end_time - start_time

In [ ]:
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score

bagging_regressor_pred = bagging_regressor.predict(x_test_Trans)

mse = mean_squared_error(y_test, bagging_regressor_pred)
r_two_score = r2_score(y_test, bagging_regressor_pred)
ex_variance_score = explained_variance_score(y_test, bagging_regressor_pred)

print(f"Bagging Regressor Model training took {duration:.2f} seconds.")
print(f'Mean Squared Error on Test Set: {mse:.2f}')
print(f'r2 score: {r_two_score:.2f}')
print(f'explained variance score: {ex_variance_score:.2f}')
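
As a rough illustration of what BaggingRegressor does internally, the sketch below builds a small bagged ensemble by hand: each tree is trained on a bootstrap sample drawn with replacement, and the predictions are averaged. DecisionTreeRegressor is the default base estimator of BaggingRegressor; the toy data, the 25 trees, and the `demo_` names are assumptions for the demo, and this is not a replacement for the library class.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data, unrelated to combined_data.csv.
demo_X, demo_y = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=42)
rng = np.random.default_rng(42)

demo_trees = []
for _ in range(25):
    # Bootstrap sample: draw n row indices with replacement.
    idx = rng.integers(0, len(demo_X), size=len(demo_X))
    demo_trees.append(DecisionTreeRegressor(random_state=0).fit(demo_X[idx], demo_y[idx]))

# Aggregate by averaging the per-tree predictions.
bagged_pred = np.mean([tree.predict(demo_X) for tree in demo_trees], axis=0)
```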