# **UofT Building Energy Consumption Anomaly Detection using Machine Learning**

By Zichen Liu


## Sources:
1. Building Energy Consumption - Sustainability Office UofT

2. Temperature Data - https://toronto.weatherstats.ca/download.html



In [None]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import pandas as pd

# Upload the files

In [None]:
from google.colab import files
uploaded = files.upload()

Saving UTSG CED CHD FY2019-2023 (1).xlsx to UTSG CED CHD FY2019-2023 (1) (1).xlsx


In [None]:
weather_data = pd.read_csv('weatherstats_toronto_normal_monthly.csv')
energy_data = pd.read_excel('UTSG CED CHD FY2019-2023 (1).xlsx')

In [None]:
weather_data.head()

Unnamed: 0,date,max_dew_point_v,max_dew_point_s,max_dew_point_c,max_dew_point_d,max_relative_humidity_v,max_relative_humidity_s,max_relative_humidity_c,max_relative_humidity_d,max_temperature_v,...,snow_c,snow_d,snow_on_ground_v,snow_on_ground_s,snow_on_ground_c,snow_on_ground_d,solar_radiation_v,solar_radiation_s,solar_radiation_c,solar_radiation_d
0,6/1/2024,15.17,1.03,30.0,1994-06-01 2023-06-01,84.21,4.43,30.0,1994-06-01 2023-06-01,24.72,...,30,1994-06-01 2023-06-01,0.0,0.0,20.0,1994-06-01 2013-06-01,,,,
1,5/1/2024,9.59,1.86,30.0,1994-05-01 2023-05-01,81.91,4.97,30.0,1994-05-01 2023-05-01,19.2,...,30,1994-05-01 2023-05-01,0.0,0.0,21.0,1994-05-01 2020-05-01,,,,
2,4/1/2024,3.17,1.51,30.0,1994-04-01 2023-04-01,80.91,4.76,30.0,1994-04-01 2023-04-01,12.06,...,30,1994-04-01 2023-04-01,0.5,0.75,29.0,1994-04-01 2022-04-01,,,,
3,3/1/2024,-1.58,2.23,30.0,1994-03-01 2023-03-01,82.58,4.38,30.0,1994-03-01 2023-03-01,5.06,...,30,1994-03-01 2023-03-01,3.0,3.02,30.0,1994-03-01 2023-03-01,,,,
4,2/1/2024,-4.74,2.37,30.0,1994-02-01 2023-02-01,84.62,3.05,30.0,1994-02-01 2023-02-01,-0.13,...,30,1994-02-01 2023-02-01,6.5,5.37,30.0,1994-02-01 2023-02-01,,,,


In [None]:
energy_data.head()

Unnamed: 0,Archibus No.,Archibus Building Name,Month,Consumption kWh
0,1,University College,2018-04-01,126993.0
1,1,University College,2018-05-01,97771.065
2,1,University College,2018-06-01,119654.935
3,1,University College,2018-07-01,95081.0
4,1,University College,2018-08-01,137121.0


# Prepare and clean the data
We'll merge the datasets on month and create a new column indicating whether each building's consumption is above its baseline, defined here as that building's mean monthly consumption.
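To make the baseline definition concrete, here is a minimal sketch on toy data. The building names and consumption values are hypothetical; only the column names and the `groupby(...).transform('mean')` pattern match the cell below.

```python
import pandas as pd

# Hypothetical toy data: two buildings with made-up monthly consumption values
toy = pd.DataFrame({
    'Archibus Building Name': ['Building A'] * 3 + ['Building B'] * 2,
    'Consumption kWh': [100.0, 200.0, 300.0, 10.0, 30.0],
})

# transform('mean') returns one value per row: the mean of that row's building
baseline = toy.groupby('Archibus Building Name')['Consumption kWh'].transform('mean')

# A row is labelled True when it exceeds its own building's mean
toy['Above_Baseline'] = toy['Consumption kWh'] > baseline

print(toy)
```

Because the baseline is computed per building, a large building and a small one are each compared only against their own typical consumption.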

In [None]:
# Convert date columns to datetime
weather_data['date'] = pd.to_datetime(weather_data['date'])
energy_data['Month'] = pd.to_datetime(energy_data['Month'])

# Identify the date range of the energy data
start_date = energy_data['Month'].min()
end_date = energy_data['Month'].max()

# Filter the weather data to match the date range of the energy data.
# .copy() makes this an independent DataFrame, so adding a column below
# does not trigger pandas' SettingWithCopyWarning.
filtered_weather_data = weather_data[(weather_data['date'] >= start_date) & (weather_data['date'] <= end_date)].copy()

# Aggregate weather data by month
filtered_weather_data['month'] = filtered_weather_data['date'].dt.to_period('M')
monthly_weather_data = filtered_weather_data.groupby('month').agg({
    'max_temperature_v': 'mean',
    'min_temperature_v': 'mean'
}).reset_index()

# Convert 'month' back to datetime for merging
monthly_weather_data['month'] = monthly_weather_data['month'].dt.to_timestamp()

# Merge datasets on month
merged_data = pd.merge(monthly_weather_data, energy_data, left_on='month', right_on='Month', how='inner')

# Calculate baseline consumption for each building
baseline_consumption = merged_data.groupby('Archibus Building Name')['Consumption kWh'].transform('mean')

# Create a target variable based on whether consumption is above the baseline
merged_data['Above_Baseline'] = merged_data['Consumption kWh'] > baseline_consumption

# Add temporal feature (month)
merged_data['month_number'] = merged_data['month'].dt.month


# Training the Model
We'll train a logistic regression model and evaluate its performance.

In [None]:
# Select relevant columns for the model
features = merged_data[['max_temperature_v', 'min_temperature_v', 'month_number']]
target = merged_data['Above_Baseline']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Initialize and train the logistic regression model
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train, y_train)

# Make predictions on the test set
y_pred = log_reg.predict(X_test)

# Evaluate the model's performance
classification_report_result = classification_report(y_test, y_pred)
confusion_matrix_result = confusion_matrix(y_test, y_pred)

print(classification_report_result)
print(confusion_matrix_result)


              precision    recall  f1-score   support

       False       0.64      0.73      0.69       507
        True       0.51      0.41      0.45       345

    accuracy                           0.60       852
   macro avg       0.58      0.57      0.57       852
weighted avg       0.59      0.60      0.59       852

[[371 136]
 [205 140]]


# Interpretation
**Accuracy: The overall accuracy is 60%.**

Precision and Recall for False: The model has a precision of 0.64 and recall of 0.73 for the False class.

Precision and Recall for True: The model has a precision of 0.51 and recall of 0.41 for the True class.

Confusion Matrix: The model correctly predicted 371 False cases and 140 True cases, but it misclassified 136 False cases as True and 205 True cases as False.

The model performs better at predicting the False class than the True class: 205 of the 345 True cases (about 59%) are misclassified as False.
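The figures in the classification report can be re-derived from the confusion matrix alone. The short sketch below does that arithmetic in plain Python, assuming scikit-learn's convention that rows are actual classes and columns are predicted classes, ordered `[False, True]`.

```python
# Confusion matrix printed above: rows = actual class, columns = predicted class
cm = [[371, 136],
      [205, 140]]

tn, fp = cm[0]  # actual False: predicted False / predicted True
fn, tp = cm[1]  # actual True:  predicted False / predicted True
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_false = tn / (tn + fn)  # of all predicted False, how many were False
recall_false = tn / (tn + fp)     # of all actual False, how many were found
precision_true = tp / (tp + fp)   # of all predicted True, how many were True
recall_true = tp / (tp + fn)      # of all actual True, how many were found

print(f"accuracy        = {accuracy:.2f}")
print(f"False precision = {precision_false:.2f}, recall = {recall_false:.2f}")
print(f"True  precision = {precision_true:.2f}, recall = {recall_true:.2f}")
```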

# Second Iteration: Additional Features and Class Imbalance
Let's enhance the feature set with humidity, precipitation, rain, and snow, and use SMOTE to balance the classes.



In [None]:
from imblearn.over_sampling import SMOTE

# Add additional weather-related features
monthly_weather_data = filtered_weather_data.groupby('month').agg({
    'max_temperature_v': 'mean',
    'min_temperature_v': 'mean',
    'max_relative_humidity_v': 'mean',
    'min_relative_humidity_v': 'mean',
    'precipitation_v': 'mean',
    'rain_v': 'mean',
    'snow_v': 'mean'
}).reset_index()

# Convert 'month' back to datetime for merging
monthly_weather_data['month'] = monthly_weather_data['month'].dt.to_timestamp()

# Merge datasets on month
merged_data = pd.merge(monthly_weather_data, energy_data, left_on='month', right_on='Month', how='inner')

# Calculate baseline consumption for each building
baseline_consumption = merged_data.groupby('Archibus Building Name')['Consumption kWh'].transform('mean')

# Create a target variable based on whether consumption is above the baseline
merged_data['Above_Baseline'] = merged_data['Consumption kWh'] > baseline_consumption

# Add temporal feature (month)
merged_data['month_number'] = merged_data['month'].dt.month

# Select relevant columns for the model
features = merged_data[['max_temperature_v', 'min_temperature_v', 'max_relative_humidity_v', 'min_relative_humidity_v',
                        'precipitation_v', 'rain_v', 'snow_v', 'month_number']]
target = merged_data['Above_Baseline']

# Apply SMOTE to handle class imbalance
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(features, target)

# Split the resampled data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Initialize and train the logistic regression model
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train, y_train)

# Make predictions on the test set
y_pred = log_reg.predict(X_test)

# Evaluate the model's performance
classification_report_result = classification_report(y_test, y_pred)
confusion_matrix_result = confusion_matrix(y_test, y_pred)

print(classification_report_result)
print(confusion_matrix_result)


              precision    recall  f1-score   support

       False       0.58      0.65      0.61       484
        True       0.61      0.54      0.57       488

    accuracy                           0.59       972
   macro avg       0.59      0.59      0.59       972
weighted avg       0.60      0.59      0.59       972

[[314 170]
 [225 263]]


# Interpretation
**Accuracy: The overall accuracy is now 59%.**

Precision and Recall for False: Precision is 0.58 and recall is 0.65 for the False class.

Precision and Recall for True: Precision is 0.61 and recall is 0.54 for the True class.

Balanced Performance: Precision and recall are now more even across the two classes than in the first attempt. The resampled test set is also nearly balanced (484 False vs. 488 True).
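One caveat about the cell above: SMOTE runs before `train_test_split`, so the test set contains synthetic samples and the scores are not directly comparable with the first run. A minimal sketch of the alternative order (split first, then rebalance the training set only) follows. It is not the notebook's method: the data is synthetic (`make_classification`), and plain random oversampling via `sklearn.utils.resample` stands in for SMOTE so the example needs only scikit-learn. With imbalanced-learn installed, `SMOTE(random_state=42).fit_resample(X_train, y_train)` would take the same place.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Synthetic stand-in for the merged weather/energy table (roughly 60/40 classes)
X, y = make_classification(n_samples=1000, n_features=8,
                           weights=[0.6, 0.4], random_state=42)

# 1. Split first, so the test set holds only real (non-resampled) rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 2. Oversample the minority class (label 1 here) in the training set only
minority = y_train == 1
n_majority = int((~minority).sum())
X_min_up, y_min_up = resample(X_train[minority], y_train[minority],
                              replace=True, n_samples=n_majority,
                              random_state=42)
X_train_bal = np.vstack([X_train[~minority], X_min_up])
y_train_bal = np.concatenate([y_train[~minority], y_min_up])

# 3. Fit the scaler on training data, then apply it to the untouched test set
scaler = StandardScaler()
X_train_bal = scaler.fit_transform(X_train_bal)
X_test_scaled = scaler.transform(X_test)

clf = LogisticRegression(max_iter=1000, random_state=42)
clf.fit(X_train_bal, y_train_bal)
print(classification_report(y_test, clf.predict(X_test_scaled)))
```

With this order, the test-set class proportions stay the same as in the original data, so the reported metrics describe performance on real observations.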

# Next steps to fine-tune the model

Let's try a Random Forest classifier and tune its hyperparameters with Grid Search (5-fold cross-validation) to see if we can further improve the performance.
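Grid Search cost grows with the product of the grid sizes. As a quick sanity check, the sketch below enumerates the same parameter grid used in the next cell with `itertools.product` and counts the cross-validation fits. It only counts combinations and trains nothing.

```python
from itertools import product

# Same grid as in the Grid Search cell
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
}

# Every combination of one value per parameter is one candidate model
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)

cv_folds = 5
n_fits = n_candidates * cv_folds  # each candidate is trained once per fold

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

With scikit-learn's default `refit=True`, `GridSearchCV` then trains the winning candidate once more on the full training set.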

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Initialize the Random Forest model
rf_model = RandomForestClassifier(random_state=42)

# Define the parameter grid for hyperparameter tuning
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# Perform grid search for hyperparameter tuning
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)

# Best parameters from Grid Search
best_params = grid_search.best_params_
print(f"Best parameters: {best_params}")

# Retrieve the best Random Forest model. GridSearchCV has already refit it
# on the full training set (refit=True by default), so no extra fit is needed.
best_rf_model = grid_search.best_estimator_

# Make predictions on the test set
y_pred = best_rf_model.predict(X_test)

# Evaluate the model's performance
classification_report_result = classification_report(y_test, y_pred)
confusion_matrix_result = confusion_matrix(y_test, y_pred)

print(classification_report_result)
print(confusion_matrix_result)


Fitting 5 folds for each of 36 candidates, totalling 180 fits
Best parameters: {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 300}
              precision    recall  f1-score   support

       False       0.65      0.71      0.68       484
        True       0.68      0.62      0.65       488

    accuracy                           0.66       972
   macro avg       0.67      0.66      0.66       972
weighted avg       0.67      0.66      0.66       972

[[342 142]
 [184 304]]


# Interpretation
**Accuracy: The overall accuracy is 66%, which is an improvement from the logistic regression model.**

Precision and Recall for False: Precision is 0.65 and recall is 0.71 for the False class.

Precision and Recall for True: Precision is 0.68 and recall is 0.62 for the True class.

Balanced Performance: The model shows a balanced performance between the two classes.

# More improvement
Inspect the Random Forest feature importances, then combine Random Forest, Logistic Regression, and Gradient Boosting classifiers in a soft-voting ensemble.
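The ensemble in the next cell uses `voting='soft'`, which averages the class probabilities of the member models instead of counting their individual votes. The sketch below shows the difference for one sample in plain Python. The three probabilities are hypothetical values chosen to make the two rules disagree.

```python
# Hypothetical P(True) predicted by each model for a single sample
p_true = {'rf': 0.90, 'lr': 0.40, 'gb': 0.40}

# Soft voting: average the probabilities, then pick the more likely class
avg_p_true = sum(p_true.values()) / len(p_true)
soft_vote = avg_p_true > 0.5

# Hard voting: each model casts one vote based on its own prediction
votes_for_true = sum(p > 0.5 for p in p_true.values())
hard_vote = votes_for_true > len(p_true) / 2

print(f"average P(True) = {avg_p_true:.3f} -> soft vote: {soft_vote}")
print(f"votes for True  = {votes_for_true}/3   -> hard vote: {hard_vote}")
```

Here one confident model outweighs two mildly opposed ones under soft voting, while hard voting follows the majority.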

In [None]:
# Get feature importances from the Random Forest model
importances = best_rf_model.feature_importances_
feature_names = features.columns
feature_importances = pd.DataFrame({'Feature': feature_names, 'Importance': importances}).sort_values(by='Importance', ascending=False)

print(feature_importances)

from sklearn.ensemble import VotingClassifier

# Initialize the models
rf_model = RandomForestClassifier(n_estimators=300, max_depth=None, min_samples_split=2, random_state=42)
log_reg = LogisticRegression(random_state=42)
gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)

# Create an ensemble of models
ensemble_model = VotingClassifier(estimators=[
    ('rf', rf_model),
    ('lr', log_reg),
    ('gb', gb_model)
], voting='soft')

# Train the ensemble model
ensemble_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = ensemble_model.predict(X_test)

# Evaluate the model's performance
classification_report_result = classification_report(y_test, y_pred)
confusion_matrix_result = confusion_matrix(y_test, y_pred)

print(classification_report_result)
print(confusion_matrix_result)


                   Feature  Importance
0        max_temperature_v    0.168323
1        min_temperature_v    0.166715
2  max_relative_humidity_v    0.147733
3  min_relative_humidity_v    0.127357
4          precipitation_v    0.115219
5                   rain_v    0.114776
6                   snow_v    0.093263
7             month_number    0.066614
              precision    recall  f1-score   support

       False       0.67      0.69      0.68       484
        True       0.68      0.66      0.67       488

    accuracy                           0.67       972
   macro avg       0.67      0.67      0.67       972
weighted avg       0.67      0.67      0.67       972

[[332 152]
 [166 322]]


# Interpretation

**Accuracy: The overall accuracy is 67%.**

Precision and Recall for False: Precision is 0.67 and recall is 0.69 for the False class.

Precision and Recall for True: Precision is 0.68 and recall is 0.66 for the True class.

Feature Importance: The most important features are max_temperature_v, min_temperature_v, and max_relative_humidity_v.

# Further improvement

Adding Polynomial and Interaction Features
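A degree-2 expansion adds every squared term and every pairwise product to the original columns, so the feature count grows quickly. The sketch below counts the resulting columns for the 8 input features used here. The helper `n_poly_features` is a hypothetical name for illustration. It relies on the standard count of monomials of total degree at most `d` in `n` variables, `C(n + d, d)`, minus one when the bias column is excluded.

```python
from math import comb

def n_poly_features(n_features: int, degree: int, include_bias: bool = False) -> int:
    """Count monomials of total degree <= `degree` in `n_features` variables."""
    total = comb(n_features + degree, degree)
    return total if include_bias else total - 1

n_inputs = 8  # the eight weather and month features selected earlier
n_expanded = n_poly_features(n_inputs, degree=2)

# Breakdown for degree 2: original terms + squares + pairwise interactions
n_linear = n_inputs
n_squares = n_inputs
n_interactions = comb(n_inputs, 2)

print(f"{n_linear} linear + {n_squares} squared + {n_interactions} interaction "
      f"= {n_expanded} features")
```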

In [None]:
from sklearn.preprocessing import PolynomialFeatures

# Generate polynomial and interaction features
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
X_poly = poly.fit_transform(X_resampled)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_poly, y_resampled, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train the ensemble model with polynomial features
ensemble_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = ensemble_model.predict(X_test)

# Evaluate the model's performance
classification_report_result_poly = classification_report(y_test, y_pred)
confusion_matrix_result_poly = confusion_matrix(y_test, y_pred)

print(classification_report_result_poly)
print(confusion_matrix_result_poly)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


              precision    recall  f1-score   support

       False       0.65      0.70      0.68       484
        True       0.68      0.63      0.65       488

    accuracy                           0.67       972
   macro avg       0.67      0.67      0.67       972
weighted avg       0.67      0.67      0.67       972

[[340 144]
 [181 307]]


# Conclusions (For now)
Based on the limited data from 2018-2021, the best accuracy I could reach was **67%**, using the voting ensemble. Adding polynomial features did not change it. With more data and improved models, I expect this figure to improve.