This notebook loads the prepared hourly dataset (03 - hourly_features_dataset.csv) and trains forecasting models (ARIMA, Prophet, XGBoost), then compares them using MAE and RMSE.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_squared_error
from statsmodels.tsa.arima.model import ARIMA  # Import ARIMA model from statsmodels
from prophet import Prophet # Import Prophet model
from xgboost import XGBRegressor # Import XGBoost Regressor
import os

In [None]:
os.path.exists("03 - hourly_features_dataset.csv")

In [None]:
df = pd.read_csv("03 - hourly_features_dataset.csv", index_col=0, parse_dates=True)

# Time Series Train–Test Split 

In this step, the hourly feature-engineered dataset is split into training and testing sets chronologically: the earliest 80% of rows form the training set and the most recent 20% form the test set. Splitting by time rather than at random prevents data leakage and ensures that forecasting models are evaluated realistically, predicting future values from past information only.

In [None]:
df = df.asfreq('h')  # sets frequency on whole dataset

In [None]:
# defining target and features

X = df.drop(columns='Global_active_power')
Y = df['Global_active_power']

In [None]:
# Calculate the split point (80% training data)

split_index = int(len(df) * 0.8)
split_index

In [None]:
X_train = X.iloc[:split_index]  # training features: first 80% of rows
X_test = X.iloc[split_index:]   # testing features: last 20% of rows
Y_train = Y.iloc[:split_index]  # training target: first 80% of rows
Y_test = Y.iloc[split_index:]   # testing target: last 20% of rows

In [None]:
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape
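
As a sanity check on the chronological split, the sketch below wraps the same slicing logic in a small helper and verifies that every training timestamp precedes every test timestamp. The helper name `chronological_split` and the synthetic data are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def chronological_split(frame, train_fraction=0.8):
    """Split a time-indexed frame into an earlier (train) and a later (test) part."""
    if not frame.index.is_monotonic_increasing:
        raise ValueError("index must be sorted in time order before splitting")
    cut = int(len(frame) * train_fraction)
    return frame.iloc[:cut], frame.iloc[cut:]

# Synthetic hourly data standing in for the real dataset (assumption)
idx = pd.date_range("2024-01-01", periods=100, freq=pd.Timedelta(hours=1))
demo = pd.DataFrame({"value": np.arange(100.0)}, index=idx)

train, test = chronological_split(demo)
print(len(train), len(test))

# No overlap: the last training timestamp is strictly before the first test timestamp
print(train.index.max() < test.index.min())
```

The same check can be run on the notebook's own split with `X_train.index.max() < X_test.index.min()`.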

# ARIMA Model (Statistical Time Series Forecasting)

In this step, an ARIMA model is trained on historical hourly energy consumption data to forecast future values. ARIMA serves as a classical statistical baseline against which the more advanced models are compared later.

In [None]:
arima_order = (1, 1, 1)  # define ARIMA order (p, d, q)

In [None]:
arima_model = ARIMA(Y_train, order=arima_order)  # Initialize the ARIMA model using training data

In [None]:
arima_fitted = arima_model.fit() # Fit the ARIMA model

In [None]:
# Forecast the same number of steps as the test set length
arima_forecast = arima_fitted.forecast(steps=len(Y_test))

# Convert forecast to a pandas Series with the same index as Y_test
arima_forecast = pd.Series(arima_forecast, index=Y_test.index)
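
One subtlety in the conversion above: when `pd.Series` receives an existing Series together with a new `index`, it aligns on labels rather than relabelling by position. This is harmless when the forecast already carries the test timestamps, which should be the case here because the data was given an hourly frequency. If the labels differ, however, the result silently becomes `NaN`. The following is a minimal sketch of that behavior with made-up values and illustrative variable names.

```python
import pandas as pd

target_index = pd.date_range("2024-01-01", periods=3, freq=pd.Timedelta(hours=1))

# A forecast whose labels do NOT match the target index (integer labels here)
raw_forecast = pd.Series([1.0, 2.0, 3.0], index=[0, 1, 2])

# Passing a Series plus a new index aligns on labels, so no value carries over
reindexed = pd.Series(raw_forecast, index=target_index)

# Passing the underlying array relabels the values by position instead
relabelled = pd.Series(raw_forecast.to_numpy(), index=target_index)

print(reindexed.isna().all())
print(relabelled.tolist())
```

A quick `arima_forecast.isna().sum()` after the conversion would reveal whether any values were lost to misalignment.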

### ARIMA Model Evaluation & Visualization

In [None]:
arima_mae = mean_absolute_error(Y_test, arima_forecast) # calculate MAE
arima_rmse = np.sqrt(mean_squared_error(Y_test, arima_forecast)) # calculate RMSE

print(f'ARIMA MAE: {arima_mae:.2f}')
print(f'ARIMA RMSE: {arima_rmse:.2f}')

In [None]:
# Actual vs Forecast Plot

plt.figure(figsize=(22, 5))
plt.plot(Y_test, label="Actual", color="black")
plt.plot(arima_forecast, label="ARIMA Forecast", linestyle="--")
plt.title("ARIMA Forecast vs Actual Energy Consumption")
plt.xlabel("Time")
plt.ylabel("Global Active Power (kW)")

plt.legend()

plt.show()

# Prophet Model (Seasonality-Aware Forecasting)

In this step, the Prophet model is used to forecast hourly household energy consumption. Prophet is designed to automatically capture trends and seasonality in time series data, making it well suited to modeling recurring daily energy usage patterns.

In [None]:
# Prophet expects a DataFrame with columns 'ds' (timestamps) and 'y' (values), so reshape the data accordingly

prophet_train = Y_train.reset_index() # reset index to make datetime as a column

In [None]:
prophet_train.columns = ['ds', 'y']  # rename columns

In [None]:
prophet_test = Y_test.reset_index()  # Create Prophet testing df
prophet_test.columns = ['ds', 'y']

In [None]:
prophet_model = Prophet(daily_seasonality=True) # enable daily seasonality 

In [None]:
prophet_model.fit(prophet_train) # fit model on training data

In [None]:
# Create future dataframe for prediction

future = prophet_model.make_future_dataframe(periods=len(prophet_test), freq='h')   #  periods = number of hours in test set

In [None]:
prophet_forecast = prophet_model.predict(future) # generate forecast

In [None]:
prophet_predictions = prophet_forecast[['ds', 'yhat']].iloc[-len(prophet_test):] # extract only forecasted values

In [None]:
prophet_predictions.set_index('ds', inplace=True) # set datetime again as index
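
The reshaping steps above can be condensed into one helper. The sketch below runs without Prophet installed and uses synthetic values; the function name `to_prophet_frame` is a hypothetical choice for illustration.

```python
import pandas as pd

def to_prophet_frame(series):
    """Return a two-column frame with 'ds' (timestamps) and 'y' (values)."""
    return pd.DataFrame({"ds": series.index, "y": series.to_numpy()})

# Synthetic hourly series standing in for Y_train (assumption)
idx = pd.date_range("2024-01-01", periods=4, freq=pd.Timedelta(hours=1))
demo = pd.Series([0.5, 0.7, 0.6, 0.9], index=idx, name="Global_active_power")

frame = to_prophet_frame(demo)
print(list(frame.columns))
print(frame["ds"].dtype)
```

Building the frame from explicit columns avoids depending on the name of the index or of the target column, which `reset_index()` followed by renaming implicitly relies on.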

### Prophet Model Evaluation and Visualization

In [None]:
prophet_mae = mean_absolute_error(Y_test, prophet_predictions['yhat'])  # MAE
prophet_rmse = np.sqrt(mean_squared_error(Y_test, prophet_predictions['yhat'])) # RMSE

print(f"Prophet MAE: {prophet_mae:.2f}")
print(f"Prophet RMSE: {prophet_rmse:.2f}")

In [None]:
# Visualization -> Actual vs Forecast

plt.figure(figsize=(24, 5))

plt.plot(Y_test, label='Actual', color='black')
plt.plot(prophet_predictions['yhat'], label='Prophet Forecast', linestyle='--')
plt.title("Prophet Forecast vs Actual Energy Consumption")
plt.xlabel("Time")
plt.ylabel("Global Active Power (kW)")

plt.legend()

plt.show()

# XGBoost Model (Feature-Based Machine Learning Forecast)

In this step, an XGBoost regression model is trained on time-based engineered features to forecast hourly household energy consumption. Unlike the statistical models above, XGBoost uses explicit features such as hour of day and weekday to learn complex, non-linear patterns in the data.
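
For readers who have not seen the feature-engineering notebook, the sketch below shows how such calendar features could be derived from a `DatetimeIndex`. The column names `hour`, `weekday`, and `is_weekend` are assumptions for illustration and may not match the columns in the prepared dataset.

```python
import pandas as pd

def add_time_features(frame):
    """Return a copy with calendar features derived from the DatetimeIndex."""
    out = frame.copy()
    out["hour"] = out.index.hour                            # 0..23
    out["weekday"] = out.index.dayofweek                    # Monday=0, Sunday=6
    out["is_weekend"] = (out["weekday"] >= 5).astype(int)   # 1 on Saturday/Sunday
    return out

# Two synthetic days of hourly data; 2024-01-06 is a Saturday
idx = pd.date_range("2024-01-06", periods=48, freq=pd.Timedelta(hours=1))
demo = pd.DataFrame({"Global_active_power": 1.0}, index=idx)

features = add_time_features(demo)
print(features[["hour", "weekday", "is_weekend"]].head(3))
```

Features like these depend only on the timestamp, so they are known in advance for any future hour and do not leak information from the target.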

In [None]:
# Initialize the XGBoost regression model
xgb_model = XGBRegressor(n_estimators=200, learning_rate=0.05, max_depth=5, random_state=42)

In [None]:
xgb_model.fit(X_train, Y_train) # train model on training data

In [None]:
xgb_predictions = xgb_model.predict(X_test)  # Generate predictions on the test set
xgb_predictions = pd.Series(xgb_predictions, index=Y_test.index)  # convert predictions to a pandas Series

### XGBoost Model Evaluation and Visualization

In [None]:
xgb_mae = mean_absolute_error(Y_test, xgb_predictions)
xgb_rmse = np.sqrt(mean_squared_error(Y_test, xgb_predictions))

print(f"XGBoost MAE: {xgb_mae:.2f}")
print(f"XGBoost RMSE: {xgb_rmse:.2f}")

In [None]:
plt.figure(figsize=(24, 5))

plt.plot(Y_test, label="Actual", color="black")
plt.plot(xgb_predictions, label="XGBoost Forecast", linestyle="--")
plt.title("XGBoost Forecast vs Actual Energy Consumption")
plt.xlabel("Time")
plt.ylabel("Global Active Power (kW)")

plt.legend()

plt.show()

# Model Comparison Table

The table below summarizes the forecasting performance of all implemented models using Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). These metrics provide a quantitative basis for comparing how accurately each model predicts household energy consumption, with lower values indicating better performance.
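
For reference, the two metrics can be written as follows. The notation is an assumption introduced here: $y_t$ is the actual value at hour $t$, $\hat{y}_t$ is the corresponding forecast, and $n$ is the number of hours in the test set.

$$
\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n}\left|y_t - \hat{y}_t\right|,
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^2}
$$

Both are expressed in the units of the target (kW here), and RMSE is always at least as large as MAE for the same set of errors.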

In [None]:
results_df = pd.DataFrame({
    'Model': ['ARIMA', 'Prophet', 'XGBoost'],
    'MAE': [arima_mae, prophet_mae, xgb_mae],
    'RMSE': [arima_rmse, prophet_rmse, xgb_rmse]
})

results_df

# MAE & RMSE Subplot Visualization

This figure plots the MAE and RMSE values for all models side by side for a direct comparison of forecasting errors. MAE reflects the typical size of each model's errors, while RMSE, which squares errors before averaging, is more sensitive to large prediction errors.

In [None]:
# Create a figure with 1 row and 2 columns of subplots
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot MAE comparison on the first subplot
results_df.set_index('Model')['MAE'].plot(kind='bar', ax=axes[0], title='MAE Comparison Across Models', ylabel='MAE')

# Plot RMSE comparison on the second subplot
results_df.set_index('Model')['RMSE'].plot(kind='bar', ax=axes[1], title='RMSE Comparison Across Models', ylabel='RMSE')

plt.tight_layout()
plt.show()
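
To make the difference between the two metrics concrete, the sketch below compares two made-up error vectors with the same MAE but different RMSE. The helper names `mae` and `rmse` and the error values are illustrative assumptions, not results from this notebook.

```python
import numpy as np

def mae(errors):
    """Mean absolute error of a vector of residuals."""
    return float(np.mean(np.abs(errors)))

def rmse(errors):
    """Root mean squared error of a vector of residuals."""
    return float(np.sqrt(np.mean(np.square(errors))))

even_errors = np.array([1.0, 1.0, 1.0, 1.0])   # small error at every step
spiky_errors = np.array([0.0, 0.0, 0.0, 4.0])  # one large miss, otherwise exact

print(mae(even_errors), rmse(even_errors))    # 1.0 1.0
print(mae(spiky_errors), rmse(spiky_errors))  # 1.0 2.0
```

A model whose RMSE is much larger than its MAE is therefore likely making occasional large errors, for example by missing consumption spikes.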


# Side-by-Side Forecast Visualization

This visualization compares the actual household energy consumption with the forecasts generated by ARIMA, Prophet, and XGBoost models on the same time axis. Displaying all predictions together highlights the differences in how each model captures trends, seasonality, and sudden fluctuations in energy usage.

In [None]:
plt.figure(figsize=(24, 6))

plt.plot(Y_test, label='Actual', color='black', linewidth=2)
plt.plot(arima_forecast, label='ARIMA Forecast', linestyle='--') # Plot ARIMA forecast
plt.plot(prophet_predictions['yhat'], label='Prophet Forecast', linestyle='--') # Plot Prophet forecast
plt.plot(xgb_predictions, label='XGBoost Forecast', linestyle='--') # Plot XGBoost forecast

plt.title("Model Comparison: Actual vs ARIMA, Prophet, and XGBoost Forecasts")
plt.xlabel("Time")
plt.ylabel("Global Active Power (kW)")

plt.legend()
plt.show()

# Final Conclusion 

In this task, short-term household energy consumption was forecast using three approaches: ARIMA, Prophet, and XGBoost. ARIMA served as a statistical baseline but struggled to model the high volatility and daily usage patterns present in the data. Prophet improved forecasting performance by explicitly modeling seasonality, producing smoother and more realistic predictions. XGBoost achieved the best results by leveraging engineered time-based features, allowing it to capture complex and non-linear consumption behavior. Overall, the comparison suggests that, on this dataset, feature-based machine learning models can outperform traditional statistical methods for short-term energy forecasting.