# Assignment 9

In this assignment I explain the steps we went through in the _data-driven consumption prediction project_ for a building in Austin.

First, I'm importing all the modules I'll use later.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os 

With the following command I suppress the *warning* (`SettingWithCopyWarning`) that pandas raises when values are assigned to a dataframe that may be a view of another one.

In [2]:
pd.set_option('mode.chained_assignment', None)
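
As a side note, the warning can also be avoided without silencing it. Here is a minimal sketch, assuming a toy dataframe `df_demo` with made-up values (not taken from the assignment files): the subset is copied explicitly and the assignment goes through a single `.loc` call.

```python
import pandas as pd

# Hypothetical toy data, not taken from the assignment files
df_demo = pd.DataFrame({"gen": [0.5, -0.1, 1.2], "other": [1, 2, 3]})

# An explicit copy makes the subset an independent dataframe
df_subset = df_demo[["gen"]].copy()

# One .loc call selects the rows and the column in a single step
df_subset.loc[df_subset["gen"] < 0, "gen"] = 0

print(df_subset["gen"].tolist())
```

The original `df_demo` is left untouched, because the change is applied to the copy only.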

I'm now changing the directory to the one where all the files I need are.

In [3]:
ExternalFilesFolder =  r"C:\Users\Mirko\Desktop\Assignment9_MFerrari"
os.chdir(ExternalFilesFolder)

Now I'm going to name the files present in the folder.

In [4]:
ConsumptionFileName= "consumption_5545.csv"
TemperatureFileName= "Austin_weather_2014.csv"
IrradianceFileName= "irradiance_2014_gen.csv"

It is now easy to build the **paths** needed to reach these files.

In [5]:
path_consumptionFile = os.path.join(ExternalFilesFolder,ConsumptionFileName)
path_TemperatureFile = os.path.join(ExternalFilesFolder,TemperatureFileName)
path_IrradianceFile = os.path.join(ExternalFilesFolder,IrradianceFileName)

I can now create a dataframe which will include the data related to the yearly AC consumption.

In [6]:
DF_consumption = pd.read_csv(path_consumptionFile, sep=",", index_col=0)

With the next commands, I'm converting the index from type "object" to "datetime", because a _data-driven model_ needs real timestamps rather than plain strings.

In [7]:
PreviousIndex = DF_consumption.index
NewParsedIndex= pd.to_datetime(PreviousIndex)
DF_consumption.index =NewParsedIndex 
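
To see what this conversion changes, here is a small sketch on a two-row CSV that I made up for the purpose (the column name `timestamp` and the values are assumptions, not the content of the real file).

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for the consumption file
csv_text = (
    "timestamp,air conditioner_5545\n"
    "2014-01-01 00:00:00,0.10\n"
    "2014-01-01 01:00:00,0.20\n"
)

df_demo = pd.read_csv(io.StringIO(csv_text), index_col=0)
# Right after reading, the index holds plain strings
is_datetime_before = isinstance(df_demo.index, pd.DatetimeIndex)

df_demo.index = pd.to_datetime(df_demo.index)
is_datetime_after = isinstance(df_demo.index, pd.DatetimeIndex)

# With a DatetimeIndex, time attributes such as the hour become available
hours = df_demo.index.hour.tolist()
print(is_datetime_before, is_datetime_after, hours)
```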

I'm doing the same thing with the weather file.

In [8]:
DF_weather = pd.read_csv(path_TemperatureFile,sep=";",index_col=0)
previousIndex_weather=DF_weather.index
newIndex_weather=pd.to_datetime(previousIndex_weather)
DF_weather.index = newIndex_weather

This last dataframe includes a lot of information about the Austin weather during the year, but I'm mostly interested in temperatures, so I'm going to extract just them into a new dataframe.

In [9]:
DF_Temperature= DF_weather[["temperature"]]

I'm now creating a new dataframe which includes the irradiance data in Austin, collected from a building equipped with PV panels.

In [10]:
DF_irradianceSource = pd.read_csv(path_IrradianceFile,sep=";",index_col=1)

As with the weather one, this last dataframe includes more information than I need in this case. Since I'm only interested in irradiance, I'm creating a new dataframe which contains only this column.

In [11]:
DF_irradiance = DF_irradianceSource[["gen"]]

Since some of these values are **negative**, I'm setting them to zero.

In [12]:
# Select the rows with negative generation and set only the "gen" column to zero
DF_irradiance.loc[DF_irradiance["gen"] < 0, "gen"] = 0
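
The same result can be obtained by clipping the column at zero. This is a minimal sketch on invented values in a hypothetical `df_demo`, just to show the idea.

```python
import pandas as pd

# Hypothetical generation values, including small negative readings
df_demo = pd.DataFrame({"gen": [-0.02, 0.0, 0.75, -0.3]})

# clip(lower=0) replaces every value below zero with zero
df_demo["gen"] = df_demo["gen"].clip(lower=0)

print(df_demo["gen"].tolist())
```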

With the next command, I create a _new dataframe that joins the three I've just created_.

In [13]:
DF_joined = DF_consumption.join([DF_Temperature,DF_irradiance])
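
The join aligns the dataframes on their index and keeps the rows of the left one. Here is a small sketch with invented hourly data (all names and values below are assumptions), where one temperature reading is missing on purpose.

```python
import pandas as pd

# Hypothetical hourly index and values
idx = pd.date_range("2014-01-01", periods=3, freq="h")
consumption = pd.DataFrame({"AC_consumption": [0.1, 0.2, 0.3]}, index=idx)
temperature = pd.DataFrame({"temperature": [10.0, 11.0]}, index=idx[:2])
irradiance = pd.DataFrame({"gen": [0.0, 0.5, 0.9]}, index=idx)

# Rows are matched by timestamp; the left dataframe decides which rows are kept
joined = consumption.join([temperature, irradiance])

print(joined)
```

The third row has no temperature, so the join fills it with NaN instead of dropping the row.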

I now create a copy of this last dataframe, so that the original one is not modified.

In [15]:
DF_mod = DF_joined.copy()

With the next command I'm shifting the temperatures by five hours to account for the **time zone**, since the file reports them in a standard time reference rather than in Austin local time.

In [16]:
DF_mod["temperature"] = DF_mod["temperature"].shift(-5)

Since the shift leaves the last rows of the column without a value, NaNs appear in the dataframe, so I need to remove them.

In [17]:
DF_mod.dropna(inplace = True)
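
A small sketch of what the shift does, on eight invented hourly temperatures (the values are just 0 to 7, so the movement is easy to follow):

```python
import pandas as pd

# Hypothetical hourly temperatures 0.0, 1.0, ..., 7.0
idx = pd.date_range("2014-01-01", periods=8, freq="h")
df_demo = pd.DataFrame({"temperature": [float(t) for t in range(8)]}, index=idx)

# shift(-5) moves every value five rows up; the last five rows become NaN
shifted = df_demo["temperature"].shift(-5)
n_missing = int(shifted.isna().sum())

df_demo["temperature"] = shifted
df_demo = df_demo.dropna()

print(n_missing, df_demo["temperature"].tolist())
```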

I can now define a function which creates new columns of temperature "lagged" in time. This has to be done because the current consumption also depends on the temperature in the previous hours.

In [18]:
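def lag_feature(DF,column_name,lag_start,lag_end, lag_interval):
    # Add one column per lag, holding the value of column_name i rows (hours) earlier
    for i in range(lag_start,lag_end +1, lag_interval):
        new_column_name = column_name + "-" + str(i) + "hr"
        DF[new_column_name] = (DF[column_name]).shift(i)
        # The shift leaves NaN in the first i rows, which are dropped right away
        DF.dropna(inplace = True)
    return DF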
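
Because the function above drops rows at every iteration, each lag `i` costs `i` rows, so lags 1 to 3 remove 1 + 2 + 3 = 6 rows in total. A possible variant, which I sketch here under the hypothetical name `lag_feature_drop_once` (it is not the function used for the results in this assignment), builds all the lagged columns first and drops the NaN rows only once, losing just as many rows as the largest lag.

```python
import pandas as pd

def lag_feature_drop_once(df, column_name, lag_start, lag_end, lag_interval=1):
    """Hypothetical variant: add lagged columns, then drop NaN rows a single time."""
    out = df.copy()  # leave the input dataframe untouched
    for lag in range(lag_start, lag_end + 1, lag_interval):
        out[f"{column_name}-{lag}hr"] = out[column_name].shift(lag)
    return out.dropna()

# Ten invented hourly temperatures 0.0, 1.0, ..., 9.0
idx = pd.date_range("2014-01-01", periods=10, freq="h")
df_demo = pd.DataFrame({"temperature": [float(t) for t in range(10)]}, index=idx)

lagged = lag_feature_drop_once(df_demo, "temperature", 1, 3)
print(lagged.head())
```

With ten rows and lags 1 to 3, this variant keeps seven rows, while dropping inside the loop would keep four.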

Let's first apply this function to the temperature column.

In [19]:
DF_mod = lag_feature(DF_mod,"temperature",1,6,1)

Before doing the same thing to the other columns of *DF_mod*, I want to rename them.

In [21]:
DF_mod = DF_mod.rename(columns = {"air conditioner_5545":"AC_consumption","gen":"irradiance"})

I now apply the function to the irradiance column, for lags from 3 to 6 hours.

In [22]:
DF_mod = lag_feature(DF_mod,"irradiance",3,6,1)

For the AC consumption, I apply the function for the previous 24 hours.

In [23]:
DF_mod = lag_feature(DF_mod,"AC_consumption",1,24,1)

I now need to add the seasonality parameters (hour, day of the week, week of the year, month). These parameters are important because the consumption depends on the presence of people in the building, which in turn depends on the hour and day considered (working hours, holidays, Sundays, etc.).

In [25]:
DF_mod["hour"] = DF_mod.index.hour

In [26]:
DF_mod["day_of_week"] = DF_mod.index.dayofweek

In [28]:
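# DatetimeIndex.week was removed in pandas 2.0; isocalendar() gives the ISO week number
DF_mod["week_of_year"] = DF_mod.index.isocalendar().week.astype(int)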

In [27]:
DF_mod["month"] = DF_mod.index.month
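
One detail about the week of the year is worth a quick check. In recent pandas versions the week number can be read with `isocalendar()`, which follows the ISO 8601 calendar, so the last days of December may already belong to week 1 of the following year. A small sketch on four dates chosen by me:

```python
import pandas as pd

# Four hand-picked dates in 2014 (the last one is a Monday)
idx = pd.to_datetime(["2014-01-01", "2014-06-15", "2014-12-28", "2014-12-29"])

# ISO week numbers: a week starts on Monday and belongs to the year of its Thursday
weeks = idx.isocalendar().week.tolist()
months = idx.month.tolist()

print(weeks, months)
```

Here 29 December 2014 falls in ISO week 1, even though its month is still 12.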

The problem with the hour is that it describes a continuous, cyclic quantity with discontinuous values (it goes from 0 to 23, and then jumps back to 0).
To solve this problem we can map the hour through the sine and cosine functions, which are continuous and periodic over the 24 hours.

In [29]:
DF_mod["sin_hour"] = np.sin(DF_mod.index.hour*2*np.pi/24)
DF_mod["cos_hour"] = np.cos(DF_mod.index.hour*2*np.pi/24)
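
To check that this encoding really removes the jump between 23 and 0, here is a short sketch that measures the distance between two hours in the (sine, cosine) plane. The helper `cyclic_distance` is a name I introduce only for this illustration.

```python
import numpy as np

hours = np.arange(24)
sin_hour = np.sin(hours * 2 * np.pi / 24)
cos_hour = np.cos(hours * 2 * np.pi / 24)

def cyclic_distance(h1, h2):
    """Euclidean distance between two hours in the (sin, cos) plane."""
    return np.hypot(sin_hour[h1] - sin_hour[h2], cos_hour[h1] - cos_hour[h2])

d_23_0 = cyclic_distance(23, 0)   # across midnight
d_0_1 = cyclic_distance(0, 1)     # ordinary consecutive hours
d_0_12 = cyclic_distance(0, 12)   # opposite sides of the day

print(d_23_0, d_0_1, d_0_12)
```

Hours 23 and 0 end up as close as hours 0 and 1, while midnight and noon are the farthest apart. Both columns are needed, since a single sine (or cosine) value corresponds to two different hours.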

I can now define a function which separates weekends from working days.

In [30]:
def WeekendDetector(day):
    if (day == 5 or day == 6):
        weekendLabel = 1
    else:
        weekendLabel = 0
    return weekendLabel

I now create a new column in the dataframe, called **weekend**, which reports whether the day considered is a working day (0) or not (1).

In [31]:
DF_mod["weekend"] = DF_mod["day_of_week"].apply(WeekendDetector)
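
The same label can also be computed without `apply`, by comparing the whole column at once. A minimal sketch on four invented days (Friday to Monday):

```python
import pandas as pd

# 2014-01-03 is a Friday, so the four days are Fri, Sat, Sun, Mon
idx = pd.date_range("2014-01-03", periods=4, freq="D")
df_demo = pd.DataFrame(index=idx)
df_demo["day_of_week"] = df_demo.index.dayofweek  # Monday=0 ... Sunday=6

# Saturday (5) and Sunday (6) are flagged with 1, all other days with 0
df_demo["weekend"] = (df_demo["day_of_week"] >= 5).astype(int)

print(df_demo)
```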

I can now define a function which separates working hours from free time.

In [32]:
def DayDetector(hour):
    if (hour < 19 and hour >= 9):
        DayLabel = 1
    else:
        DayLabel = 0
    return DayLabel

I now create a new column in the dataframe, called **workingTime**, which states whether the hour considered is a working hour (1) or not (0).

In [33]:
DF_mod["workingTime"] = DF_mod["hour"].apply(DayDetector)
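
As a quick check of the boundaries, this sketch applies the same rule (from 9 included to 19 excluded) to all 24 hours in one step, using `between`, which includes both ends by default.

```python
import pandas as pd

hours = pd.Series(range(24), name="hour")

# For integer hours, 9 <= hour <= 18 is the same as hour >= 9 and hour < 19
working_time = hours.between(9, 18).astype(int)

print(working_time.tolist())
```

Ten hours per day are labelled as working time: hour 9 is included and hour 19 is not.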

I'll now show all the columns of my new dataframe, and all the _correlations_ between them.

In [35]:
DF_mod.columns

Index([u'AC_consumption', u'temperature', u'irradiance', u'temperature-1hr',
       u'temperature-2hr', u'temperature-3hr', u'temperature-4hr',
       u'temperature-5hr', u'temperature-6hr', u'irradiance-3hr',
       u'irradiance-4hr', u'irradiance-5hr', u'irradiance-6hr',
       u'AC_consumption-1hr', u'AC_consumption-2hr', u'AC_consumption-3hr',
       u'AC_consumption-4hr', u'AC_consumption-5hr', u'AC_consumption-6hr',
       u'AC_consumption-7hr', u'AC_consumption-8hr', u'AC_consumption-9hr',
       u'AC_consumption-10hr', u'AC_consumption-11hr', u'AC_consumption-12hr',
       u'AC_consumption-13hr', u'AC_consumption-14hr', u'AC_consumption-15hr',
       u'AC_consumption-16hr', u'AC_consumption-17hr', u'AC_consumption-18hr',
       u'AC_consumption-19hr', u'AC_consumption-20hr', u'AC_consumption-21hr',
       u'AC_consumption-22hr', u'AC_consumption-23hr', u'AC_consumption-24hr',
       u'hour', u'day_of_week', u'month', u'week_of_year', u'sin_hour',
       u'cos_hour', u'weekend', u'workingTime'],
      dtype='object')

In [36]:
DF_mod.corr()

| | AC_consumption | temperature | irradiance | temperature-1hr | temperature-2hr | temperature-3hr | temperature-4hr | temperature-5hr | temperature-6hr | irradiance-3hr | ... | AC_consumption-23hr | AC_consumption-24hr | hour | day_of_week | month | week_of_year | sin_hour | cos_hour | weekend | workingTime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AC_consumption | 1.0 | 0.568967 | -0.012695 | 0.608771 | 0.637029 | 0.650286 | 0.647997 | 0.630508 | 0.598963 | 0.364543 | ... | 0.849304 | 0.89985 | 0.36107 | -0.014515 | 0.128627 | 0.13976 | -0.438641 | 0.212579 | -0.005932 | -0.053224 |
| temperature | 0.568967 | 1.0 | 0.327736 | 0.990924 | 0.968215 | 0.935052 | 0.894855 | 0.850852 | 0.805907 | 0.450232 | ... | 0.578146 | 0.538885 | 0.243124 | 0.038664 | 0.162804 | 0.193112 | -0.332523 | -0.146285 | 0.037955 | 0.247145 |
| irradiance | -0.012695 | 0.327736 | 1.0 | 0.238627 | 0.141066 | 0.043916 | -0.044694 | -0.118168 | -0.172384 | 0.560763 | ... | 0.091944 | -0.025262 | 0.167131 | -0.029183 | -0.034465 | -0.026719 | -0.237118 | -0.740835 | -0.031255 | 0.766372 |
| temperature-1hr | 0.608771 | 0.990924 | 0.238627 | 1.0 | 0.990924 | 0.968215 | 0.93505 | 0.894848 | 0.850835 | 0.440053 | ... | 0.606575 | 0.578143 | 0.262693 | 0.037578 | 0.163573 | 0.193957 | -0.359332 | -0.054764 | 0.037937 | 0.167417 |
| temperature-2hr | 0.637029 | 0.968215 | 0.141066 | 0.990924 | 1.0 | 0.990923 | 0.968213 | 0.935045 | 0.894836 | 0.398205 | ... | 0.620548 | 0.606578 | 0.269813 | 0.03661 | 0.164362 | 0.194789 | -0.361499 | 0.040584 | 0.037825 | 0.079101 |
| temperature-3hr | 0.650286 | 0.935052 | 0.043916 | 0.968215 | 0.990923 | 1.0 | 0.990923 | 0.96821 | 0.935035 | 0.327602 | ... | 0.619483 | 0.620549 | 0.261097 | 0.03598 | 0.165203 | 0.195631 | -0.338919 | 0.133246 | 0.038007 | -0.010368 |
| temperature-4hr | 0.647997 | 0.894855 | -0.044694 | 0.93505 | 0.968213 | 0.990923 | 1.0 | 0.990922 | 0.968205 | 0.238492 | ... | 0.604101 | 0.619488 | 0.231785 | 0.035539 | 0.166071 | 0.196473 | -0.293132 | 0.216867 | 0.038456 | -0.09467 |
| temperature-5hr | 0.630508 | 0.850852 | -0.118168 | 0.894848 | 0.935045 | 0.96821 | 0.990922 | 1.0 | 0.99092 | 0.14092 | ... | 0.575197 | 0.604109 | 0.176851 | 0.035229 | 0.167 | 0.197279 | -0.227286 | 0.285755 | 0.038891 | -0.169806 |
| temperature-6hr | 0.598963 | 0.805907 | -0.172384 | 0.850835 | 0.894836 | 0.935035 | 0.968205 | 0.99092 | 1.0 | 0.043755 | ... | 0.535692 | 0.575207 | 0.102506 | 0.034955 | 0.16794 | 0.198039 | -0.145865 | 0.335197 | 0.039206 | -0.232576 |
| irradiance-3hr | 0.364543 | 0.450232 | 0.560763 | 0.440053 | 0.398205 | 0.327602 | 0.238492 | 0.14092 | 0.043755 | 1.0 | ... | 0.44252 | 0.342982 | 0.46534 | -0.02904 | -0.034603 | -0.026854 | -0.691844 | -0.355625 | -0.031306 | 0.536995 |


From this last table we can see that the current AC consumption is correlated mainly (>0.8) with the AC consumption of the previous two hours and with that of the same hour on the previous day. We can also see that the AC consumption is more strongly correlated with the temperatures of the previous hours than with the current one (0.65 at a 3-hour lag against 0.57 with no lag). The same holds for irradiance (0.36 at a 3-hour lag against -0.01 with no lag), because the effect of irradiance on the building takes time to show up.
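
When the correlation table is large, it can be easier to look only at the column of the target and sort it. Here is a minimal sketch on synthetic data that I generate for the purpose (the names `df_demo`, `random_feature` and all the numbers are assumptions, not the assignment data).

```python
import numpy as np
import pandas as pd

# Synthetic data: consumption follows temperature plus noise; the other feature is unrelated
rng = np.random.default_rng(0)
n = 200
temperature = rng.normal(25, 5, n)
df_demo = pd.DataFrame({
    "AC_consumption": 0.1 * temperature + rng.normal(0, 0.2, n),
    "temperature": temperature,
    "random_feature": rng.normal(0, 1, n),
})

# Correlation of every feature with the target, sorted by absolute value
ranking = (
    df_demo.corr()["AC_consumption"]
    .drop("AC_consumption")
    .abs()
    .sort_values(ascending=False)
)

print(ranking)
```

On the real dataframe the same lines, applied to `DF_mod`, would list the lagged features from the most to the least correlated with the AC consumption.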