## Imports

We import the libraries needed for our program.

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statistics
from sklearn.linear_model import LinearRegression

# from google.colab import drive
# drive.mount('/content/drive')

## Loading Data

We load the training dataset and assign it to train. <br>
We load the testing dataset and assign it to test. <br>

In [0]:
train = pd.read_csv('.../data/training_dataset_500.csv')
test = pd.read_csv('.../data/test_dataset_500.csv')

## Exploratory Data Analysis

Before choosing a model, we will get to know our datasets through exploratory data analysis.

**First Five Observations from the Training Dataset**



In [3]:
train.head()

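| | ID | Label | House | Year | Month | Temperature | Daylight | EnergyProduction |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 2011 | 7 | 26.2 | 178.9 | 740 |
| 1 | 1 | 1 | 1 | 2011 | 8 | 25.8 | 169.7 | 731 |
| 2 | 2 | 2 | 1 | 2011 | 9 | 22.8 | 170.2 | 694 |
| 3 | 3 | 3 | 1 | 2011 | 10 | 16.4 | 169.1 | 688 |
| 4 | 4 | 4 | 1 | 2011 | 11 | 11.4 | 169.1 | 650 |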


**First Five Observations from the Testing Dataset**

In [4]:
test.head()

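| | ID | Label | House | Year | Month | Temperature | Daylight | EnergyProduction |
|---|---|---|---|---|---|---|---|---|
| 0 | 23 | 23 | 1 | 2013 | 6 | 22.0 | 125.5 | 778 |
| 1 | 47 | 23 | 2 | 2013 | 6 | 21.1 | 123.1 | 627 |
| 2 | 71 | 23 | 3 | 2013 | 6 | 21.9 | 126.8 | 735 |
| 3 | 95 | 23 | 4 | 2013 | 6 | 20.2 | 125.2 | 533 |
| 4 | 119 | 23 | 5 | 2013 | 6 | 20.2 | 125.2 | 533 |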


**Shape of the Training Dataset**

In [5]:
train.shape

(11500, 8)

The training dataset comprises 11500 observations (rows) and 8 characteristics (columns). 

**Shape of the Testing Dataset**

In [6]:
test.shape

(500, 8)

The testing dataset comprises 500 observations (rows) and 8 characteristics (columns). <br>


The training dataset holds monthly data on 500 households from July 2011 to May 2013. <br>
The testing dataset holds data on 500 households for the month of June 2013.
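
The row counts follow directly from these date ranges. As a quick sanity check (a sketch that is not part of the original notebook; `months_between` is a hypothetical helper), counting the months from July 2011 through May 2013 and multiplying by 500 households reproduces the 11500 training rows:

```python
def months_between(start_year, start_month, end_year, end_month):
    """Count calendar months from start to end, inclusive of both ends."""
    return (end_year - start_year) * 12 + (end_month - start_month) + 1

households = 500
training_months = months_between(2011, 7, 2013, 5)  # July 2011 .. May 2013
training_rows = households * training_months
testing_rows = households * months_between(2013, 6, 2013, 6)  # June 2013 only

print(training_months, training_rows, testing_rows)  # 23 11500 500
```

The 23 months also match the Label column, which runs from 0 to 22 in the training dataset and is 23 throughout the testing dataset.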

**Information on the Training Dataset**

In [7]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11500 entries, 0 to 11499
Data columns (total 8 columns):
ID                  11500 non-null int64
Label               11500 non-null int64
House               11500 non-null int64
Year                11500 non-null int64
Month               11500 non-null int64
Temperature         11500 non-null float64
Daylight            11500 non-null float64
EnergyProduction    11500 non-null int64
dtypes: float64(2), int64(6)
memory usage: 718.8 KB


The training dataset is made up of integer and float values. <br>
No column has null/missing values. <br>
The dependent variable, EnergyProduction, is numerical and of integer type.



**Information on the Testing Dataset**

In [8]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 8 columns):
ID                  500 non-null int64
Label               500 non-null int64
House               500 non-null int64
Year                500 non-null int64
Month               500 non-null int64
Temperature         500 non-null float64
Daylight            500 non-null float64
EnergyProduction    500 non-null int64
dtypes: float64(2), int64(6)
memory usage: 31.3 KB


Like the training dataset, the testing dataset is made up of integer and float values, and has no null/missing values. 

**Summary Statistics of the Training Dataset**

In [9]:
train.describe()

| | ID | Label | House | Year | Month | Temperature | Daylight | EnergyProduction |
|---|---|---|---|---|---|---|---|---|
| count | 11500.0 | 11500.0 | 11500.0 | 11500.0 | 11500.0 | 11500.0 | 11500.0 | 11500.0 |
| mean | 5999.0 | 11.0 | 250.5 | 2011.956522 | 6.521739 | 14.372348 | 189.12187 | 612.74887 |
| std | 3464.251661 | 6.633538 | 144.343555 | 0.690226 | 3.524843 | 8.490811 | 29.432125 | 142.006144 |
| min | 0.0 | 0.0 | 1.0 | 2011.0 | 1.0 | 0.8 | 133.7 | 254.0 |
| 25% | 2999.5 | 5.0 | 125.75 | 2011.0 | 3.0 | 5.3 | 169.1 | 509.0 |
| 50% | 5999.0 | 11.0 | 250.5 | 2012.0 | 7.0 | 13.2 | 181.8 | 592.0 |
| 75% | 8998.5 | 17.0 | 375.25 | 2012.0 | 10.0 | 22.8 | 205.2 | 698.0 |
| max | 11998.0 | 22.0 | 500.0 | 2013.0 | 12.0 | 29.0 | 271.3 | 1254.0 |


**Summary Statistics of the Testing Dataset**

In [10]:
test.describe()

| | ID | Label | House | Year | Month | Temperature | Daylight | EnergyProduction |
|---|---|---|---|---|---|---|---|---|
| count | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 |
| mean | 6011.0 | 23.0 | 250.5 | 2013.0 | 6.0 | 21.7054 | 125.1114 | 586.774 |
| std | 3467.563986 | 0.0 | 144.481833 | 0.0 | 0.0 | 0.86661 | 1.595726 | 100.292653 |
| min | 23.0 | 23.0 | 1.0 | 2013.0 | 6.0 | 19.3 | 121.8 | 451.0 |
| 25% | 3017.0 | 23.0 | 125.75 | 2013.0 | 6.0 | 21.1 | 123.9 | 518.0 |
| 50% | 6011.0 | 23.0 | 250.5 | 2013.0 | 6.0 | 21.9 | 125.2 | 565.0 |
| 75% | 9005.0 | 23.0 | 375.25 | 2013.0 | 6.0 | 22.5 | 126.0 | 668.0 |
| max | 11999.0 | 23.0 | 500.0 | 2013.0 | 6.0 | 22.8 | 129.1 | 886.0 |


**Correlation Matrix between the Variables**

In [11]:
variables = pd.DataFrame(data=train, columns=['House', 'Year', 'Month', 'Temperature', 'Daylight', 'EnergyProduction'])
corr = variables.corr()
# Styler.set_precision was removed in pandas 2.0; '{:.2g}' keeps two significant digits
corr.style.background_gradient(cmap='coolwarm').format('{:.2g}')

| | House | Year | Month | Temperature | Daylight | EnergyProduction |
|---|---|---|---|---|---|---|
| House | 1.0 | 0.0 | -1.8e-18 | 0.00088 | 0.0016 | -0.0083 |
| Year | 0.0 | 1.0 | -0.63 | -0.36 | 0.52 | 0.27 |
| Month | -1.8e-18 | -0.63 | 1.0 | 0.35 | -0.28 | -0.23 |
| Temperature | 0.00088 | -0.36 | 0.35 | 1.0 | -0.053 | 0.27 |
| Daylight | 0.0016 | 0.52 | -0.28 | -0.053 | 1.0 | 0.53 |
| EnergyProduction | -0.0083 | 0.27 | -0.23 | 0.27 | 0.53 | 1.0 |


EnergyProduction is most strongly correlated with Daylight, and the correlation is positive (0.53). <br>
It is equally and positively correlated with Year and Temperature (0.27 each). <br>
It is slightly less correlated with Month, and that correlation is negative (-0.23). <br>
Its correlation with House is negative and very weak (-0.0083). <br>
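
The ranking described above can be reproduced by sorting the EnergyProduction column of the matrix by absolute value. The following is a small sketch rather than a cell from the original notebook; the values are copied by hand from the matrix above and `corr_with_energy` is a hypothetical name.

```python
# Correlations with EnergyProduction, copied from the matrix above.
corr_with_energy = {
    "House": -0.0083,
    "Year": 0.27,
    "Month": -0.23,
    "Temperature": 0.27,
    "Daylight": 0.53,
}

# Sort by absolute value so negative correlations are ranked by strength too.
ranked = sorted(corr_with_energy.items(), key=lambda kv: abs(kv[1]), reverse=True)

for name, value in ranked:
    print(f"{name:<12} {value:+.4f}")
```

Sorting on the absolute value matters here: a plain sort would place Month and House last only because their correlations are negative, not because of how strong they are.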

## MAPE

First we create two variables, y_train and y_test. <br>
We assign them the values of the dependent variable, EnergyProduction, from the training and testing datasets respectively. 

In [0]:
y_train = train['EnergyProduction'].values.reshape(-1, 1)
y_test = test['EnergyProduction']

We create a function, mape, that takes the predicted values of EnergyProduction and returns the mean absolute percentage error against y_test, expressed as a percentage and rounded to one decimal place.

In [0]:
def mape(predictions):
  return round(statistics.mean(abs((y_test - predictions) / y_test)) * 100, 1)
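
The function above reads y_test from the notebook's global scope. As an illustration of the same formula, MAPE = 100 × mean(|actual − predicted| / |actual|), here is a minimal sketch that takes both arrays as arguments. `mape_explicit` is a hypothetical name and the numbers are toy values, not taken from the dataset.

```python
import numpy as np

def mape_explicit(actual, predicted):
    """Mean absolute percentage error, in percent, rounded to one decimal."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # Assumes no actual value is zero; MAPE is undefined in that case.
    return round(float(np.mean(np.abs((actual - predicted) / actual)) * 100), 1)

# Toy values: relative errors are 10%, 10% and 0%, so the mean is about 6.7%.
example = mape_explicit([100, 200, 400], [110, 180, 400])
print(example)  # 6.7
```

Passing the actual values explicitly makes the function reusable for other data, such as a validation split, at the cost of one extra argument.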

## Simple Linear Regression


We have input variables and an output variable, and our goal is to approximate the mapping between them so that we can predict the output for new input data. This calls for a supervised machine learning algorithm. <br>
Since the output variable is numerical and continuous, linear regression is an appropriate model.

During our exploratory data analysis, we found EnergyProduction to have the strongest correlation with Daylight. <br>
We will start by performing a simple linear regression with Daylight as the independent variable, computing the MAPE, and adjusting our model based on the results.

In [47]:
x_train = train['Daylight'].values.reshape(-1, 1)
x_test = test['Daylight'].values.reshape(-1, 1)

simple_linear_reg = LinearRegression()
simple_linear_reg.fit(x_train, y_train)

y_pred = simple_linear_reg.predict(x_test)
mape(y_pred.flatten())

21.5

MAPE is 21.5%, which is too high. <br>
To lower the MAPE, we need a more accurate regression. <br>
We can do this by adding independent variables to our model.
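
For reference, with a single independent variable, ordinary least squares (which is what scikit-learn's LinearRegression fits) has a closed form: the slope is the covariance of x and y divided by the variance of x. The sketch below illustrates this with NumPy on toy data that lies exactly on a line. It is not the notebook's model and the numbers are not from the dataset.

```python
import numpy as np

# Toy data lying exactly on y = 2x + 1 (not from the dataset).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Ordinary least squares for one predictor:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

prediction = slope * 5.0 + intercept  # predict for a new x value
print(slope, intercept, prediction)
```

A single slope and intercept can only capture the part of EnergyProduction that varies linearly with Daylight, which is why adding further variables can reduce the error.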

## Multiple Linear Regression

We will now add Temperature as an independent variable in our regression model, turning it into a multiple linear regression.

In [48]:
x_train = train[['Daylight', 'Temperature']]
x_test = test[['Daylight', 'Temperature']]

multiple_linear_reg = LinearRegression()
multiple_linear_reg.fit(x_train, y_train)

y_pred = multiple_linear_reg.predict(x_test)
mape(y_pred.flatten())

17.6

MAPE decreased by 3.9 points and is now 17.6%.
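
To make the step from one predictor to several concrete, the sketch below fits a two-predictor linear model by least squares with NumPy. The data is synthetic, generated from a known equation so the recovered coefficients can be checked, and the variable names are illustrative only.

```python
import numpy as np

# Toy data generated from y = 3*a + 2*b + 10 (not from the dataset).
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
y = 3 * a + 2 * b + 10

# Design matrix: one column per predictor plus a column of ones for the intercept.
X = np.column_stack([a, b, np.ones_like(a)])

# Least-squares solution; the coefficients come back in column order.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
coef_a, coef_b, intercept = coef
print(np.round(coef, 6))
```

Each coefficient describes the effect of its predictor with the other held fixed, which is why correlated predictors need the multicollinearity check that follows.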

We have to make sure that this model is valid by checking that no strong multicollinearity exists between the independent variables, Temperature and Daylight. <br>
To do this we will calculate a variance inflation factor (VIF). <br>
A VIF below 4 is commonly treated as acceptable.

In [26]:
r_square = multiple_linear_reg.score(x_train, y_train)
vif = 1 / (1 - r_square)
vif

1.596238194504262

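vif = 1.60 <br>
Note that this value is 1 / (1 - R²) for the model's fit to EnergyProduction, whereas the conventional VIF regresses each independent variable on the others. <br>
With only two predictors, the conventional VIF is 1 / (1 - r²), where r = -0.053 is the correlation between Daylight and Temperature found earlier, which gives about 1.00. <br>
Either way, no strong multicollinearity exists between the independent variables, so we accept the model.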
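
The check above uses the R² of the model against EnergyProduction. The conventional VIF instead regresses each independent variable on the remaining ones and uses that R². A minimal sketch of that calculation follows, using NumPy on toy data. `vif_per_predictor` is a hypothetical helper, not part of the notebook or of scikit-learn.

```python
import numpy as np

def vif_per_predictor(X):
    """Return the VIF of each column of X, regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([others, np.ones(n_rows)])  # add intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r_squared = 1.0 - residual.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return vifs

# Toy predictors (not from the dataset): the two columns are uncorrelated.
toy = np.array([[1.0, 1.0], [2.0, -1.0], [3.0, -1.0], [4.0, 1.0]])
toy_vifs = vif_per_predictor(toy)
print(toy_vifs)

# With exactly two predictors, both VIFs equal 1 / (1 - r**2).
r = -0.053  # Daylight vs. Temperature, from the correlation matrix above
two_predictor_vif = round(1 / (1 - r ** 2), 3)
print(two_predictor_vif)  # 1.003
```

Applied to the notebook's data, the helper would be called on the x_train columns only, so EnergyProduction never enters the calculation.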

Since EnergyProduction is equally correlated with Temperature and Year, we will add Year as an independent variable to the multiple linear regression and see whether it lowers the MAPE.

In [49]:
x_train = train[['Daylight', 'Temperature', 'Year']]
x_test = test[['Daylight', 'Temperature', 'Year']]

multiple_linear_reg.fit(x_train, y_train)

y_pred = multiple_linear_reg.predict(x_test)
mape(y_pred.flatten())

13.4

MAPE decreased by another 4.2 points and is now 13.4%.

We calculate the VIF again for the three-variable model.

In [18]:
r_square = multiple_linear_reg.score(x_train, y_train)
vif = 1 / (1 - r_square)
vif

1.6288443466134144

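vif = 1.63 <br>
As with the two-variable model, this value comes from the model's R² against EnergyProduction, so it is only a rough stand-in for the per-predictor VIFs. <br>
By this measure, no strong multicollinearity exists between the independent variables, so we accept the model.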

The model we choose to use is the multiple linear regression with Daylight, Temperature, and Year as independent variables.

## Output

We create a file called mape.txt that contains the MAPE value from our chosen model.

In [0]:
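# Use a separate name so the mape function is not overwritten on re-run
mape_value = mape(y_pred.flatten())

# The with block closes the file even if the write fails
with open('mape.txt', 'w') as mape_file:
    mape_file.write('MAPE: ' + str(mape_value) + "\n")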

We create a file called predicted_energy_production.csv that contains the predicted EnergyProduction for June 2013 of each house.

In [0]:
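predictions = y_pred.flatten()

# The with block closes the file, which also flushes the final rows to disk
with open('predicted_energy_production.csv', 'w') as prediction_file:
    prediction_file.write('House,EnergyProduction\n')
    # Houses are numbered from 1; this assumes the rows of test are ordered by House
    for i, prediction in enumerate(predictions, start=1):
        prediction_file.write(f"{i},{prediction}\n")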
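
As an alternative, the standard-library csv module can write the same two-column file and takes care of delimiters and line endings. The sketch below is an illustration rather than the notebook's code: `write_predictions` is a hypothetical helper, the predictions are toy values, and the file goes to a temporary directory so the read-back check does not touch the real output.

```python
import csv
import os
import tempfile

def write_predictions(path, predictions):
    """Write one row per house; houses are numbered from 1 in input order."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["House", "EnergyProduction"])
        for house, value in enumerate(predictions, start=1):
            writer.writerow([house, value])

# Toy predictions (not model output), written to a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "predicted_energy_production.csv")
write_predictions(out_path, [601.5, 587.25, 640.0])

# Read the file back to confirm the header and the row count.
with open(out_path, newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```

As in the cell above, the house numbers come from row position, so this assumes the predictions are in House order.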