# Dataset Creation

In previous notebooks we cleaned and prepared hourly energy load data, weather data, and holiday data. This notebook describes how the dataset is compiled using the functions created in those notebooks.

#### Objectives
1. Create a single dataset that contains the following features at hourly intervals:
    a. actual energy loads, and the TSO's predicted load
    b. weather features including temperature, pressure, humidity, rainfall, etc.
    c. weekday category and holiday information

#### Summary of data availability:
1. Energy load data hourly from 2012 onward.
2. Weather data starting from approximately 2013.
3. Weekday and holidays as needed

#### Steps in dataset creation
1. Load the weather data
2. Build an energy load dataset based on the min and max dates available in the weather data.
3. Add the relevant day categories and holidays to the dataset.

In [43]:
#load packages and utility functions
import pandas as pd
from make_holidays_data import get_holidays
from make_weather_data import get_weather_data
from clean_energy_loads import process_energy_data

In [44]:
#load the weather data and get min and max dates
weather_data = get_weather_data()

print('First date {}'.format(weather_data.index.min()))
print('Last date {}'.format(weather_data.index.max()))

First date 2013-10-01 02:00:00
Last date 2019-08-26 02:00:00


In [45]:
weather_data.head()

dt (index),dt,dt_iso,city_id,city_name,temp,temp_min,temp_max,pressure,humidity,wind_speed,wind_deg,rain_1h,rain_3h,snow_3h,clouds_all,weather_id,weather_main,weather_description,weather_icon
2013-10-01 02:00:00,2013-10-01 02:00:00,2013-10-01 00:00:00 +0000 UTC,2509954,Valencia,299.15,299.15,299.15,1008,61,5,290,0.0,0.0,0.0,20,801,clouds,few clouds,02n
2013-10-01 03:00:00,2013-10-01 03:00:00,2013-10-01 01:00:00 +0000 UTC,2509954,Valencia,298.15,298.15,298.15,1009,65,4,250,0.0,0.0,0.0,20,801,clouds,few clouds,02n
2013-10-01 04:00:00,2013-10-01 04:00:00,2013-10-01 02:00:00 +0000 UTC,2509954,Valencia,296.161,296.161,296.161,1009,71,4,269,0.0,0.0,0.0,10,800,clear,sky is clear,02
2013-10-01 05:00:00,2013-10-01 05:00:00,2013-10-01 03:00:00 +0000 UTC,2509954,Valencia,297.15,297.15,297.15,1008,69,1,250,0.0,0.0,0.0,20,801,clouds,few clouds,02n
2013-10-01 06:00:00,2013-10-01 06:00:00,2013-10-01 04:00:00 +0000 UTC,2509954,Valencia,294.031687,294.031687,294.031687,1009,78,4,288,0.0,0.0,0.0,0,800,clear,sky is clear,01


#### Load energy dataset 

In [46]:
file_list = ['Total Load - Day Ahead _ Actual_2015.csv',
            'Total Load - Day Ahead _ Actual_2016.csv',
            'Total Load - Day Ahead _ Actual_2017.csv',
            'Total Load - Day Ahead _ Actual_2018.csv',
            'Total Load - Day Ahead _ Actual_2019.csv']

energy_data = process_energy_data(files=file_list)

In [47]:
energy_data.head()

time (index),day_forecast,actual_load
2015-01-01 00:00:00,26118.0,25385.0
2015-01-01 01:00:00,24934.0,24382.0
2015-01-01 02:00:00,23515.0,22734.0
2015-01-01 03:00:00,22642.0,21286.0
2015-01-01 04:00:00,21785.0,20264.0


In [48]:
#check that everything loaded correctly
print('NaN values in dataset:', end='\n')
print(energy_data.isnull().sum())

print('First date {}'.format(energy_data.index.min()))
print('Last date {}'.format(energy_data.index.max()))

NaN values in dataset:
day_forecast    0
actual_load     0
dtype: int64
First date 2015-01-01 00:00:00
Last date 2019-12-31 23:00:00
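A NaN count and the first and last dates do not reveal hours that are missing from the index altogether. The sketch below shows one way to test for such gaps by comparing the index against a complete hourly range. The `demo` frame and its values are hypothetical stand-ins for `energy_data`, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for energy_data: six hourly stamps with one removed
one_hour = pd.Timedelta(hours=1)
idx = pd.date_range('2015-01-01 00:00', periods=6, freq=one_hour).delete(3)
demo = pd.DataFrame({'actual_load': range(len(idx))}, index=idx)

# Build the complete hourly range between the first and last timestamps
expected = pd.date_range(demo.index.min(), demo.index.max(), freq=one_hour)

# Timestamps present in the complete range but absent from the data
missing = expected.difference(demo.index)
print('Missing hours:', list(missing))
```

Applied to the real frame, an empty `missing` index would indicate that every hour between the first and last date is present.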


The last date in the weather dataset is 2019-08-26, while the energy data runs to 2019-12-31, so we have to remove some rows to align the weather and energy data.

#### Aligning the data

The strategy is to keep only the rows from 2015-01-01 through 2019-08-25, with both end dates included, and drop everything before and after that window.
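Slicing a `DatetimeIndex` with date strings includes both endpoints, and a date-only end label covers that whole day up to 23:00. The sketch below illustrates this on a small hypothetical frame (`demo` is not part of the original notebook). It uses `.loc`, which behaves the same as the bracket slicing used in this notebook.

```python
import pandas as pd

# Hypothetical hourly frame covering four days around the cut-off date
idx = pd.date_range('2019-08-24 00:00', '2019-08-27 23:00',
                    freq=pd.Timedelta(hours=1))
demo = pd.DataFrame({'actual_load': range(len(idx))}, index=idx)

# Label-based slice: both end dates are included in full
trimmed = demo.loc['2019-08-24':'2019-08-25']

print(len(trimmed))          # two full days of hourly rows
print(trimmed.index.max())   # last hour of the end date
```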

In [49]:
weather_data = weather_data['2015-01-01':'2019-08-25']
energy_data = energy_data['2015-01-01':'2019-08-25']

print('Weather data length {}'.format(len(weather_data)))
print('Energy data length {}'.format(len(energy_data)))

Weather data length 207269
Energy data length 40752
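The weather frame is about five times longer than the energy frame (207269 vs 40752 rows), so it must hold more than one row per hour. A plausible cause, which this notebook does not confirm, is one row per city per hour plus some repeated timestamps. The sketch below shows how that could be checked. The miniature frame, the second city name, and the temperatures are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature weather frame: two cities, one repeated row
idx = pd.DatetimeIndex(['2015-01-01 00:00', '2015-01-01 00:00',
                        '2015-01-01 01:00', '2015-01-01 01:00',
                        '2015-01-01 01:00'], name='dt')
demo_weather = pd.DataFrame(
    {'city_name': ['Valencia', 'Madrid', 'Valencia', 'Madrid', 'Madrid'],
     'temp': [299.15, 280.0, 298.15, 279.5, 279.5]},  # made-up values
    index=idx)

# How many rows does each city contribute?
rows_per_city = demo_weather['city_name'].value_counts()

# How many distinct timestamps are there overall?
n_timestamps = demo_weather.index.nunique()

# How many rows repeat an existing (timestamp, city) pair?
n_dupes = demo_weather.reset_index().duplicated(subset=['dt', 'city_name']).sum()

print(rows_per_city)
print('Distinct timestamps:', n_timestamps)
print('Repeated (timestamp, city) rows:', n_dupes)
```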


In [50]:
#export native versions
energy_data.to_csv('./data/cleaned_data/energy_loads_2015_2019.csv')

#### Create calendar day types for the data

In [34]:
#create a dataframe with the day of the week, a holiday True/False flag, and the holiday name
holidays_data = get_holidays(start='2015-01-01', stop='2019-08-26', country='ES')

#save the data
holidays_data.to_csv('./data/processed/holidays_data_daily.csv')

holidays_data.head()

(index),weekday_id,holiday_bool,holiday_name
2015-01-01,3,True,Año nuevo
2015-01-02,4,False,
2015-01-03,5,False,
2015-01-04,6,False,
2015-01-05,0,False,


The holidays data is daily, while the energy data is hourly. One option is to upsample the holidays to hourly frequency and concatenate them with the energy data.

In [35]:
#upsample to hourly frequency and forward-fill the NaNs with each day's original value
#(ffill replaces the deprecated Resampler.pad)
holiday_data = holidays_data.resample('H').ffill()

print('New length of holiday data {}'.format(len(holiday_data)))

New length of holiday data 40753
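The upsampled holiday data has 40753 rows, one more than the 40752 rows of energy data. Presumably the daily frame ends on 2019-08-26, so the hourly index ends with a single extra row at 2019-08-26 00:00. The sketch below reproduces the effect on a hypothetical three-day frame and trims the extra row with the same date slice used for the other datasets.

```python
import pandas as pd

# Hypothetical daily frame whose last day is one past the desired cut-off
days = pd.date_range('2019-08-24', '2019-08-26', freq='D')
demo_holidays = pd.DataFrame({'weekday_id': days.weekday,
                              'holiday_bool': False}, index=days)

# Upsample to hourly rows, carrying each day's values forward
hourly = demo_holidays.resample(pd.Timedelta(hours=1)).ffill()
print(len(hourly))      # two full days plus one trailing midnight row

# Trim the trailing row so the index ends at 23:00 on the cut-off date
hourly = hourly.loc[:'2019-08-25']
print(len(hourly), hourly.index.max())
```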


In [38]:
#check start and stop times of the aligned weather and energy data
print('Start and end times for the weather data:')
print(weather_data.index.min(),weather_data.index.max())
print('Start and end times for the energy data:')
print(energy_data.index.min(),energy_data.index.max())

Start and end times for the weather data:
2015-01-01 00:00:00 2019-08-25 23:00:00
Start and end times for the energy data:
2015-01-01 00:00:00 2019-08-25 23:00:00
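Once the three frames share the same hourly index, they can be combined column-wise into the single dataset described in the objectives. The sketch below is one possible approach under stated assumptions, not the notebook's actual method. All three `demo_*` frames are hypothetical, the temperatures are invented, and the weather frame is assumed to have been reduced to one row per hour beforehand.

```python
import pandas as pd

# Hypothetical frames sharing one hourly index
hours = pd.date_range('2015-01-01 00:00', periods=4,
                      freq=pd.Timedelta(hours=1))
demo_energy = pd.DataFrame(
    {'day_forecast': [26118.0, 24934.0, 23515.0, 22642.0],
     'actual_load': [25385.0, 24382.0, 22734.0, 21286.0]}, index=hours)
demo_weather = pd.DataFrame(
    {'temp': [280.0, 279.5, 279.0, 278.5]}, index=hours)  # made-up values
demo_holidays = pd.DataFrame(
    {'weekday_id': 3, 'holiday_bool': True}, index=hours)

# Column-wise concatenation; the inner join keeps only shared timestamps
combined = pd.concat([demo_energy, demo_weather, demo_holidays],
                     axis=1, join='inner')

print(combined.shape)
print(combined.columns.tolist())
```

An inner join silently drops hours missing from any one frame, so comparing `len(combined)` with the length of the energy frame is a cheap check that nothing was lost.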
