# 2 Machine Learning Models

## 2.3 LARS Regression

This notebook runs the LARS (Least Angle Regression) linear regression model from the SciKit Learn Python library (https://scikit-learn.org/stable/user_guide.html).

The problem posed is to predict electricity consumption at a local (LSOA) level. The data is continuous, numerical and labelled, which lends itself to supervised machine learning (ML) and regression models.

It is hypothesised that there will be a correlation between mean house price sales (as an indicator of general prosperity and condition of a property) and electricity consumption. The data has previously been cleaned; here it is read in and split into training and test data. Results are printed in line and exported to csv for comparison against other regression models.

The feature variables include: 'Year', 'Mean price paid', and OS coordinate location data for the population weighted centroid of the LSOAs.

Initially the models were tested with 'Year' and 'Mean price paid', followed by a second model run with all the above feature variables included.

## 2.3.1 Import Model Libraries

Data handling and scientific libraries used include:

- numpy (scientific numerical package for Python that enables working with arrays)
- pandas (data analysis library)
- matplotlib (enables plotting and visualisation in Python)
- openpyxl / load_workbook (opens Excel xlsx files)

In [1]:
#Import Libraries

In [2]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from openpyxl import load_workbook

Imports the linear model from the SciKit Learn library.

In [3]:
#Import ML models

In [4]:
from sklearn import linear_model

#from sklearn.linear_model import LinearRegression
#from sklearn.linear_model import Lasso
#from sklearn.linear_model import Ridge
#from sklearn.linear_model import ElasticNet
#from sklearn.linear_model import HuberRegressor
from sklearn.linear_model import Lars
#from sklearn.linear_model import LassoLars
#from sklearn.linear_model import PassiveAggressiveRegressor
#from sklearn.linear_model import RANSACRegressor
#from sklearn.linear_model import SGDRegressor
#from sklearn.svm import SVR
#from sklearn.tree import DecisionTreeRegressor
#from sklearn.ensemble import RandomForestRegressor
#from sklearn.ensemble import AdaBoostRegressor
#from sklearn.neural_network import MLPRegressor
#from sklearn.neighbors import KNeighborsRegressor


Imports the standard error metric functions of Variance (explained variance), MAE, MSE and R2 from the SciKit Learn library.

Explained variance is an indicator of whether the model is accounting for the variance in the dataset.

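MAE - is the mean absolute error: the absolute difference between each predicted and actual value, averaged across all predictions. It is expressed in the same units as the target. Values closer to zero are better.

MSE - is the mean squared error: the mean of the squared errors (not the square of the mean error). Squaring penalises large errors more heavily, so it indicates the quality of the prediction. Values closer to zero are better.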

R squared or R2 'represents the proportion of variance (of y) that has been explained by the independent variables in the model. It provides an indication of goodness of fit and therefore a measure of how well unseen samples are likely to be predicted by the model, through the proportion of explained variance.'1 An R2 value closer to 1 indicates a good fit.

1 - https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics
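As a minimal sketch of what the four metrics above measure, the block below computes each one by hand with numpy and compares it against the SciKit Learn function. The values are invented purely for illustration and are not taken from the LSOA dataset.

```python
import numpy as np
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             mean_squared_error, r2_score)

# Toy values, invented for illustration only (not from the LSOA dataset)
y_true = np.array([3000.0, 3500.0, 4000.0, 4500.0])
y_pred = np.array([3100.0, 3600.0, 4200.0, 4500.0])

errors = y_true - y_pred  # [-100, -100, -200, 0]

# MAE: mean of the absolute errors
mae_manual = np.mean(np.abs(errors))
# MSE: mean of the squared errors
mse_manual = np.mean(errors ** 2)
# R2: 1 - (residual sum of squares / total sum of squares)
r2_manual = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)
# Explained variance: 1 - (variance of the errors / variance of y)
ev_manual = 1 - np.var(errors) / np.var(y_true)

print("MAE: %.2f (sklearn: %.2f)" % (mae_manual, mean_absolute_error(y_true, y_pred)))
print("MSE: %.2f (sklearn: %.2f)" % (mse_manual, mean_squared_error(y_true, y_pred)))
print("R2: %.3f (sklearn: %.3f)" % (r2_manual, r2_score(y_true, y_pred)))
print("Explained Variance: %.3f (sklearn: %.3f)" % (ev_manual, explained_variance_score(y_true, y_pred)))
```

In this toy case the predictions are biased (the mean error is not zero), so explained variance comes out higher than R2. When the mean error is close to zero the two scores coincide.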

In [5]:
#Import Error Metrics

In [6]:
from sklearn.metrics import explained_variance_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

In [7]:
#Import Test/Train split function

In [8]:
from sklearn.model_selection import train_test_split

## 2.3.2 Import Data & Split into Train/Test Data

The train_test_split function allows you to define test_size and train_size; if only one is specified, the other is set to the complementary value.

A training size of 0.75-0.8 is generally recommended. Sensitivity testing of the training size showed continuous improvement as the training size increased. Using more than 80% of the data for training may lead to overfitting and leaves less data for testing, hence the selection of 80%.

Defining the random_state with an integer makes the split reproducible across different Notebooks/model runs.
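The behaviour described above can be sketched on a small stand-in dataframe. The dataframe df_demo and its values are hypothetical and only serve to show the complementary split size and the effect of a fixed random_state.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the cleaned dataset: 10 rows of invented values
df_demo = pd.DataFrame({'Year': np.arange(2010, 2020),
                        'Mean_price_paid': np.linspace(150000, 240000, 10)})

# Only test_size is given, so train_size defaults to the complement (0.8)
train_a, test_a = train_test_split(df_demo, test_size=0.2, random_state=10)

# Repeating the call with the same random_state reproduces the same split
train_b, test_b = train_test_split(df_demo, test_size=0.2, random_state=10)

print(len(train_a), len(test_a))          # 8 2
print(test_a.index.equals(test_b.index))  # True
```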

In [9]:
#Read in datafile

In [10]:
df_LSOA_Location_Energy_Sales = pd.read_csv('1_DataCleaning/LSOA_Location_Energy_Sales.csv')

In [11]:
#Split df_LSOA_Location_Energy_Sales dataset 80:20

In [12]:
train, test = train_test_split(df_LSOA_Location_Energy_Sales, test_size=0.2, train_size=0.8, random_state=10)

## 2.3.3 Model Set-up & Training

The linear model is defined, as well as the training feature variables to be passed to the machine learning model.

In the first instance 'Year' and 'Mean Price Paid' are selected as feature variables to predict 'Mean_domestic_electricity_consumption_kWh_per_meter'.

The model is trained using .fit. 

This model is expected to give a relatively better fit than when either 'Year' or 'Mean Price Paid' was taken as the sole feature variable, as plotting the data shows a (weak) positive correlation.

From the previous simple linear model runs it is expected that 'Mean Price Paid' is the more dominant feature variable, but there is expected to be some degree of annual correlation that together could provide a better fit model.

However, as many factors affect electricity consumption it is expected that either a greater number of feature variables or a model better able to deal with variance/complexity in the data will provide a better fit.

Location as a feature variable is expected to improve the model but could lead to overfitting. This will be explored further and through alternative machine learning models.

In [13]:
#Define the model

In [14]:
regr = linear_model.Lars()

## 2.3.3.1 - Model Training (Feature Variables = Mean Price Paid, Year)

In [15]:
#First pass LARS regression model

In [16]:
#Set the training data

In [17]:
train_x = np.asanyarray(train[['Mean_price_paid','Year']])
train_y = np.asanyarray(train[['Mean_domestic_electricity_consumption_kWh_per_meter']])

In [18]:
#Train the model on the set training data

In [19]:
regr.fit(train_x, train_y)

Lars(copy_X=True, eps=2.220446049250313e-16, fit_intercept=True,
   fit_path=True, n_nonzero_coefs=500, normalize=True, positive=False,
   precompute='auto', verbose=False)

In [20]:
#Predict results using the trained model and the previously defined test data

In [21]:
#Build the test arrays first, then predict on the same array type the model was trained on
test_x = np.asanyarray(test[['Mean_price_paid','Year']])
test_y = np.asanyarray(test[['Mean_domestic_electricity_consumption_kWh_per_meter']])
y_hat = regr.predict(test_x)

In [22]:
#Recheck the shape of the data

In [23]:
y_hat.shape

(61982,)

In [24]:
test_x.shape

(61982, 2)

In [25]:
test_y.shape

(61982, 1)

In [26]:
test_y = np.squeeze(test_y)
test_y.shape

(61982,)
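The shape checks above are needed because selecting a column with double brackets returns a dataframe, which converts to a two-dimensional array, whereas the predictions are one-dimensional. A minimal sketch of the difference, using a hypothetical three-row dataframe with invented values:

```python
import numpy as np
import pandas as pd

# Invented values standing in for the target column
df_demo = pd.DataFrame({'consumption': [3200.0, 4100.0, 3900.0]})

y_2d = np.asanyarray(df_demo[['consumption']])  # double brackets -> DataFrame -> shape (3, 1)
y_1d = np.asanyarray(df_demo['consumption'])    # single brackets -> Series -> shape (3,)
y_squeezed = np.squeeze(y_2d)                   # drops the length-1 axis -> shape (3,)

print(y_2d.shape, y_1d.shape, y_squeezed.shape)  # (3, 1) (3,) (3,)
```

Selecting the target with single brackets would give the one-dimensional shape directly and make the later squeeze step unnecessary.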

## 2.3.3.2 - Model Evaluation (Feature Variables = Mean Price Paid, Year)

In [27]:
#Run evaluation metrics to check the model performance

In [28]:
print("Explained Variance Score: %.2f" % explained_variance_score(test_y, y_hat))
print("MAE: %.2f" % mean_absolute_error(test_y, y_hat))
print("MSE: %.2f" % mean_squared_error(test_y, y_hat))
print("R2-score: %.2f" % r2_score(test_y, y_hat))

Explained Variance Score: 0.21
MAE: 532.92
MSE: 604146.13
R2-score: 0.21


Explained Variance - is low, showing that the model accounts for only around 21% of the variance in the data.

MAE - the MAE is improved over the single-variable model runs but still high, and would not be considered accurate enough for industry use. In this instance the measures of fit and quality of prediction are also very low, so collectively these indicate the model is not reliable.

MSE - is very high, reflecting a high risk of a low quality prediction.

R2 - is low, demonstrating the model has poor 'fit' and explains little of the variance in the data.

The LARS linear model does not perform any better than the simple Linear_Regression model. This is not unexpected, but a range of linear models are checked to determine whether any perform relatively better.
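The near-identical performance to Linear_Regression noted above is what one would expect when LARS is allowed to run its full path: with only two features and the default n_nonzero_coefs=500, the path should end at the ordinary least squares solution. The sketch below illustrates this on synthetic data. The data, seed and true coefficients are invented for illustration and are not the LSOA dataset.

```python
import numpy as np
from sklearn.linear_model import Lars, LinearRegression

# Synthetic data (assumption for illustration): two features, linear signal plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=500)

# Fit both models with default settings
lars = Lars().fit(X, y)
ols = LinearRegression().fit(X, y)

# Largest absolute difference between the two sets of fitted coefficients
coef_gap = np.max(np.abs(lars.coef_ - ols.coef_))
intercept_gap = abs(lars.intercept_ - ols.intercept_)

print("LARS coefficients:", lars.coef_)
print("OLS coefficients: ", ols.coef_)
print("Largest coefficient difference:", coef_gap)
```

A difference would only be expected if the path were stopped early, for example by setting n_nonzero_coefs below the number of features.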

## 2.3.3.3 - Model Training (Feature Variables = Mean Price Paid, Year, Location)

In [29]:
#Second pass LARS regression model (location added)

In [30]:
#Set the training data

In [31]:
train_x2 = np.asanyarray(train[['Mean_price_paid','Year', 'X', 'Y']])
train_y2 = np.asanyarray(train[['Mean_domestic_electricity_consumption_kWh_per_meter']])

In [32]:
#Train the model on the set training data

In [33]:
regr.fit(train_x2, train_y2)

Lars(copy_X=True, eps=2.220446049250313e-16, fit_intercept=True,
   fit_path=True, n_nonzero_coefs=500, normalize=True, positive=False,
   precompute='auto', verbose=False)

In [34]:
#Predict results using the trained model and the previously defined test data

In [35]:
#Build the test arrays first, then predict on the same array type the model was trained on
test_x2 = np.asanyarray(test[['Mean_price_paid','Year', 'X', 'Y']])
test_y2 = np.asanyarray(test[['Mean_domestic_electricity_consumption_kWh_per_meter']])
y_hat2 = regr.predict(test_x2)

In [36]:
#Recheck the shape of the data

In [37]:
y_hat2.shape

(61982,)

In [38]:
test_x2.shape

(61982, 4)

In [39]:
test_y2.shape

(61982, 1)

In [40]:
test_y2 = np.squeeze(test_y2)
test_y2.shape

(61982,)

## 2.3.3.4 - Model Evaluation (Feature Variables = Mean Price Paid, Year, Location)

In [41]:
#Run evaluation metrics to check the model performance

In [42]:
print("Explained Variance Score: %.2f" % explained_variance_score(test_y2 , y_hat2))
print("MAE: %.2f" % mean_absolute_error(test_y2, y_hat2))
print("MSE: %.2f" % mean_squared_error(test_y2, y_hat2))
print("R2-score: %.2f" % r2_score(test_y2, y_hat2))

Explained Variance Score: 0.22
MAE: 531.93
MSE: 599471.67
R2-score: 0.22


Explained Variance - is still low at 0.22, showing that adding location only marginally increases the share of variance in the data that the model accounts for.

MAE - the MAE is marginally improved over the two-feature run (531.93 against 532.92) but still high, and would not be considered accurate enough for industry use. In this instance the measures of fit and quality of prediction are also very low, so collectively these indicate the model is not reliable.

MSE - is very high, reflecting a high risk of a low quality prediction.

R2 - is low (0.22 against 0.21 without location), demonstrating the model has poor 'fit' and explains little of the variance in the data.

The LARS linear model does not perform any better than the simple Linear_Regression model. This is not unexpected, but a range of linear models are checked to determine whether any perform relatively better.
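Because MSE is in squared units, it is hard to compare directly with the MAE. One option, not used in the original runs, is to take the square root (RMSE), which brings the error back to kWh per meter. A short sketch using the MSE values printed in the two evaluation cells:

```python
import math

# MSE values as printed by the two evaluation cells in this notebook
mse_year_price = 604146.13
mse_year_price_xy = 599471.67

# RMSE is the square root of MSE, in the same units as the target
rmse_year_price = math.sqrt(mse_year_price)
rmse_year_price_xy = math.sqrt(mse_year_price_xy)

print("RMSE (Year, Price): %.2f" % rmse_year_price)           # 777.27
print("RMSE (Year, Price, X, Y): %.2f" % rmse_year_price_xy)  # 774.26
```

An RMSE noticeably larger than the MAE (roughly 777 against 533) suggests that a minority of large errors contributes heavily to the MSE.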

## 2.3.4 Results Export

In [43]:
#Set up dataframe for results

In [44]:
df_Results = None

In [45]:
#Print results to the dataframe

In [46]:
df_Results = pd.DataFrame({'Explained Variance Score': [explained_variance_score(test_y , y_hat), explained_variance_score(test_y2 , y_hat2)]},
                  index=['LARS_Year_Price', 'LARS_Year_Price_X_Y'])

In [47]:
df_Results.insert(1,'MAE', [mean_absolute_error(test_y , y_hat), mean_absolute_error(test_y2 , y_hat2)])

In [48]:
df_Results.insert(2, 'MSE', [mean_squared_error(test_y , y_hat), mean_squared_error(test_y2 , y_hat2)])

In [49]:
df_Results.insert(3,'R2-score', [r2_score(test_y , y_hat), r2_score(test_y2 , y_hat2)])

In [50]:
df_Results.insert(0, 'Model', 'LARS')

In [51]:
df_Results.insert(1, 'Feature Variables', ['Year, Price', 'Year, Price, X, Y'])

In [52]:
#Check dataframe 

In [53]:
df_Results.head()

|  | Model | Feature Variables | Explained Variance Score | MAE | MSE | R2-score |
|---|---|---|---|---|---|---|
| LARS_Year_Price | LARS | Year, Price | 0.210802 | 532.916798 | 604146.130784 | 0.210797 |
| LARS_Year_Price_X_Y | LARS | Year, Price, X, Y | 0.216908 | 531.933287 | 599471.668518 | 0.216903 |


In [54]:
#Export results to csv

In [55]:
df_Results.to_csv('2_ModelResults/LARS_Results.csv')
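For the later comparison against other regression models, the exported files could be read back and stacked into one table. The sketch below is an assumption about how that might be done: the model names, file names and scores are placeholders written to a temporary folder, not real results.

```python
import os
import tempfile
import pandas as pd

# Placeholder results for two hypothetical models (values are not real results)
with tempfile.TemporaryDirectory() as results_dir:
    for name, r2 in [('ModelA', 0.10), ('ModelB', 0.30)]:
        df_tmp = pd.DataFrame({'Model': [name], 'R2-score': [r2]},
                              index=[name + '_Year_Price'])
        df_tmp.to_csv(os.path.join(results_dir, name + '_Results.csv'))

    # index_col=0 restores the row labels written by to_csv
    frames = [pd.read_csv(os.path.join(results_dir, f), index_col=0)
              for f in sorted(os.listdir(results_dir))]

# Stack all results and rank by R2-score, best first
df_all = pd.concat(frames).sort_values('R2-score', ascending=False)
print(df_all)
```

Without index_col=0, pandas would load the row labels as an ordinary column named 'Unnamed: 0'.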

In [56]:
#END