<a href="https://colab.research.google.com/github/convenience-tinashe-chibatamoto/Electricity-Consumption-Forecasting/blob/main/Hyperparameter_Optimization_on_the_5_Models_XGBoost%2C_Random_Forest_Regression%2C_Linear_Regression%2C_MLP%2C_and_Ridge_Regression_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

***
In the previous example, I used 5 different models to predict electricity consumption: XGBoost, Random Forest Regression, Linear Regression, MLP, and Ridge Regression.

In this example, I use the same 5 models again, but this time I run a hyperparameter grid search (with 5-fold cross-validation) for each one to try to improve its performance. The exception is plain Linear Regression, which has no hyperparameter grid here and is fitted as-is as a baseline.
***

In [None]:
# Importing the necessary modules
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

In [None]:
# Loading the dataset
df = pd.read_csv('/content/electricityConsumptionAndProductioction.csv')
df.head(8)

Unnamed: 0,DateTime,Consumption,Production,Nuclear,Wind,Hydroelectric,Oil and Gas,Coal,Solar,Biomass
0,2019-01-01 00:00:00,6352,6527,1395,79,1383,1896,1744,0,30
1,2019-01-01 01:00:00,6116,5701,1393,96,1112,1429,1641,0,30
2,2019-01-01 02:00:00,5873,5676,1393,142,1030,1465,1616,0,30
3,2019-01-01 03:00:00,5682,5603,1397,191,972,1455,1558,0,30
4,2019-01-01 04:00:00,5557,5454,1393,159,960,1454,1458,0,30
5,2019-01-01 05:00:00,5525,5385,1395,91,958,1455,1456,0,30
6,2019-01-01 06:00:00,5513,5349,1392,98,938,1451,1440,0,31
7,2019-01-01 07:00:00,5524,5547,1392,93,1187,1446,1394,0,34


In [None]:
# Selecting the features (total production and its per-source breakdown) and the target
X = df[['Production', 'Nuclear', 'Wind', 'Hydroelectric', 'Oil and Gas', 'Coal', 'Solar', 'Biomass']]
y = df['Consumption']

In [None]:
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# XGBoost Regression
xgb_params = {'max_depth': [3, 5, 7], 'learning_rate': [0.01, 0.1, 0.3], 'n_estimators': [100, 200, 300]}
xgb_model = GridSearchCV(XGBRegressor(random_state=42), xgb_params, cv=5)
xgb_model.fit(X_train, y_train)
xgb_y_pred = xgb_model.predict(X_test)
xgb_mse = mean_squared_error(y_test, xgb_y_pred)
xgb_r2 = r2_score(y_test, xgb_y_pred)

In [None]:
# Random Forest Regression
rf_params = {'n_estimators': [100, 200, 300], 'max_depth': [5, 10, 15], 'min_samples_split': [2, 5, 10]}
rf_model = GridSearchCV(RandomForestRegressor(random_state=42), rf_params, cv=5)
rf_model.fit(X_train, y_train)
rf_y_pred = rf_model.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_y_pred)
rf_r2 = r2_score(y_test, rf_y_pred)

In [None]:
# Linear Regression (no hyperparameter grid here, so it is fitted directly as a baseline)
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
lr_y_pred = lr_model.predict(X_test)
lr_mse = mean_squared_error(y_test, lr_y_pred)
lr_r2 = r2_score(y_test, lr_y_pred)

In [None]:
# MLP Regression
mlp_params = {'hidden_layer_sizes': [(64,), (128,), (64, 64)], 'alpha': [0.0001, 0.001, 0.01], 'max_iter': [500, 1000, 1500]}
mlp_model = GridSearchCV(MLPRegressor(random_state=42), mlp_params, cv=5)
mlp_model.fit(X_train, y_train)
mlp_y_pred = mlp_model.predict(X_test)
mlp_mse = mean_squared_error(y_test, mlp_y_pred)
mlp_r2 = r2_score(y_test, mlp_y_pred)
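
One thing worth trying with the MLP: `MLPRegressor` is sensitive to feature scale, and the raw features above range from tens to thousands. Below is a minimal sketch of the same kind of grid search with a `StandardScaler` in front of the network. It is an assumption-laden illustration, not what this notebook ran: the data is synthetic (`make_regression`), the names `X_demo`, `y_demo`, `scaled_mlp`, and `scaled_mlp_search` are made up for the example, and the grid and `cv=3` are shrunk so it finishes quickly.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the electricity data (assumption: 8 numeric features)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

# Scaling lives inside the pipeline, so it is re-fitted on each training fold only
scaled_mlp = make_pipeline(StandardScaler(), MLPRegressor(max_iter=500, random_state=42))

# Pipeline parameters are addressed as "<step name>__<parameter>"
scaled_mlp_params = {
    "mlpregressor__hidden_layer_sizes": [(64,), (64, 64)],
    "mlpregressor__alpha": [0.0001, 0.01],
}

scaled_mlp_search = GridSearchCV(scaled_mlp, scaled_mlp_params, cv=3)
scaled_mlp_search.fit(X_demo, y_demo)
print(scaled_mlp_search.best_params_)
```

On the real data, the same pattern would take `X_train` and `y_train` in place of the synthetic arrays. Whether it actually improves the MLP's score here is something I have not measured.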

In [None]:
# Ridge Regression
ridge_params = {'alpha': [0.1, 1, 10]}
ridge_model = GridSearchCV(Ridge(), ridge_params, cv=5)
ridge_model.fit(X_train, y_train)
ridge_y_pred = ridge_model.predict(X_test)
ridge_mse = mean_squared_error(y_test, ridge_y_pred)
ridge_r2 = r2_score(y_test, ridge_y_pred)

In [None]:
# Compare the performance of the models
models = ['XGBoost', 'Random Forest', 'Linear Regression', 'MLP', 'Ridge Regression']
mse_scores = [xgb_mse, rf_mse, lr_mse, mlp_mse, ridge_mse]
r2_scores = [xgb_r2, rf_r2, lr_r2, mlp_r2, ridge_r2]

In [None]:
# Printing Model Metrics
print("Model Performance:")
for model, mse, r2 in zip(models, mse_scores, r2_scores):
    print(f"{model}:")
    print(f"  MSE: {mse:.2f}")
    print(f"  R2: {r2:.2f}")

In [None]:
# Visualising results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

ax1.bar(models, mse_scores)
ax1.set_xlabel('Model')
ax1.set_ylabel('Mean Squared Error')
ax1.set_title('Electricity Load Forecasting Model Performance (MSE)')

ax2.bar(models, r2_scores)
ax2.set_xlabel('Model')
ax2.set_ylabel('R-squared Score')
ax2.set_title('Electricity Load Forecasting Model Performance (R2)')

plt.tight_layout()
plt.show()

***
I need more compute to fully test these models. With the hyperparameter grids specified above, they take far too long to train: the Random Forest Regressor alone took more than 1 hour, and that was on SageMaker Studio Lab, which is typically significantly faster for these models than Google Colab. Gulp.
If you happen to have access to more computational resources (e.g., a machine with many CPU cores, since the scikit-learn models here train on the CPU, or an RTX 40 series GPU or better for XGBoost), let me know what you find out.
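
To put a number on why this is so slow, here is a quick back-of-the-envelope count of how many model fits each grid search triggers. It copies the grids from the cells above and assumes `GridSearchCV`'s default `refit=True`, which adds one final fit on the full training set. The helper `count_fits` is just a name I made up for this sketch.

```python
from math import prod

CV_FOLDS = 5  # matches cv=5 in the GridSearchCV calls above

# Hyperparameter grids copied from the cells above
grids = {
    "XGBoost": {"max_depth": [3, 5, 7], "learning_rate": [0.01, 0.1, 0.3], "n_estimators": [100, 200, 300]},
    "Random Forest": {"n_estimators": [100, 200, 300], "max_depth": [5, 10, 15], "min_samples_split": [2, 5, 10]},
    "MLP": {"hidden_layer_sizes": [(64,), (128,), (64, 64)], "alpha": [0.0001, 0.001, 0.01], "max_iter": [500, 1000, 1500]},
    "Ridge Regression": {"alpha": [0.1, 1, 10]},
}

def count_fits(param_grid, cv_folds):
    """Fits for an exhaustive grid search: candidates x folds, plus one refit."""
    n_candidates = prod(len(values) for values in param_grid.values())
    return n_candidates * cv_folds + 1

fit_counts = {name: count_fits(grid, CV_FOLDS) for name, grid in grids.items()}
total_fits = sum(fit_counts.values())

for name, n_fits in fit_counts.items():
    print(f"{name}: {n_fits} fits")
print(f"Total: {total_fits} fits")
```

Each of the three larger grids has 27 candidates, so 27 x 5 + 1 = 136 fits per model. For the Random Forest, many of those fits build 200 or 300 trees on the full hourly dataset, which goes a long way toward explaining the hour-plus runtime.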

My guess is that tuning would improve performance across the board, with Random Forest and XGBoost likely leading the pack. That is only a guess, though, based on the previous experiments with less demanding training jobs.

I suppose I could shrink the hyperparameter grids so the models train faster, but I've already done that in the earlier example. My intention this time was to see the final results of a no-holds-barred approach to model training, even if it means longer training times.
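
For anyone who wants a middle ground, one option is to keep the full Random Forest grid but only sample a few combinations from it with `RandomizedSearchCV`, and to let the forest build its trees in parallel with `n_jobs=-1`. This is a sketch under assumptions rather than something I ran on the real dataset: the data is synthetic, `X_demo`, `y_demo`, and `rf_search` are illustrative names, and `n_iter=5` and `cv=3` are arbitrary small values picked so the example finishes quickly.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the electricity data (assumption: 8 numeric features)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

# Same search space as the Random Forest grid above
rf_params = {
    "n_estimators": [100, 200, 300],
    "max_depth": [5, 10, 15],
    "min_samples_split": [2, 5, 10],
}

rf_search = RandomizedSearchCV(
    RandomForestRegressor(n_jobs=-1, random_state=42),  # n_jobs=-1: use all CPU cores per forest
    param_distributions=rf_params,
    n_iter=5,         # try 5 of the 27 combinations instead of all of them
    cv=3,
    random_state=42,  # makes the sampled combinations reproducible
)
rf_search.fit(X_demo, y_demo)

print(rf_search.best_params_)
print(f"Best mean CV R2: {rf_search.best_score_:.3f}")
```

The search objects themselves (`GridSearchCV` and `RandomizedSearchCV`) also accept `n_jobs`, which runs the cross-validation fits in parallel and leaves the search space untouched. Sampling trades completeness for speed, so it is not the no-holds-barred run described above.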