# Energy Price Prediction Project

## Previous Notebooks

- [Energy data import and cleaning](1.0-GME-Data.ipynb)
- [Weather data import and cleaning](1.1-Weather-Data.ipynb)
- [Energy price futures import and cleaning](1.2-Futures-Data.ipynb)
- [Gas price import and cleaning](1.3-Gas-Data.ipynb)
- [Merging data](1.5-Merge-Data.ipynb)
- [Exploratory data analysis](2.0-EDA.ipynb)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

As in the previous notebook, I'm dropping all rows for which weather data is missing and all features that have the same value across the whole dataset. I'm also dropping the rows for hour 25: these represent the extra hour on the day daylight saving time ends and are not very representative.
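The constant-column and hour-25 filters described above can be illustrated on a small toy frame. This is only a sketch: the column `constant_flow` and all values are made up for illustration and are not part of the real `energy` table.

```python
import pandas as pd

# Toy frame standing in for the real `energy` table (values are invented).
toy = pd.DataFrame({
    'hour': [1, 2, 25, 3],
    'pun': [50.0, 45.5, 44.0, 41.2],
    'constant_flow': [0.0, 0.0, 0.0, 0.0],  # hypothetical column, same value everywhere
})

# A column with a single distinct value (NaN counted as a value) carries no information.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]
toy = toy.drop(columns=constant_cols)

# Remove the extra 25th hour that appears when daylight saving time ends.
toy = toy.loc[toy['hour'] != 25].copy()
```

After these two steps the toy frame keeps only the `hour` and `pun` columns and the rows for hours 1, 2 and 3.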

In [2]:
energy = pd.read_pickle('../data/interim/energy.pkl')

In [3]:
market_ts = energy.groupby('date').mean().copy()
market_ts.drop(['hour', 'AUST-XAUS', 'BRNN-SUD', 'BSP-SLOV', 'CNOR-CORS',
                'CNOR-CSUD', 'CNOR-NORD', 'CORS-CNOR', 'CORS-SARD', 'CSUD-CNOR',
                'CSUD-SARD', 'FOGN-SUD', 'FRAN-XFRA', 'MALT-SICI', 'NORD-CNOR',
                'PRGP-SICI', 'ROSN-SICI', 'ROSN-SUD', 'SARD-CORS', 'SARD-CSUD',
                'SICI-MALT', 'SICI-ROSN', 'SLOV-BSP', 'SUD-CSUD', 'XAUS-AUST',
                'XFRA-FRAN'], axis=1, inplace=True)

In [4]:
# market_weather = market_weather[~np.isnan(market_weather['hdd_liml'])].copy()

In [4]:
unique_value_cols = [col for col in energy.columns if len(energy[col].unique()) == 1]

In [5]:
energy.drop(unique_value_cols, axis=1, inplace=True)

In [6]:
energy = energy.loc[energy['hour']!=25].copy()

Next, I'm adding the mean value of the PUN over the previous 30 days, the mean PUN for the same hour over the previous 7 days, and some dummy variables for:

- Holidays (since the PUN is lower on weekends, I expect it to be lower on holidays too)
- Months
- Day of week
- Hour
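To make the "mean of the previous days" idea concrete, here is a minimal sketch on an invented daily series, using a 3-day window instead of 30 so the numbers are easy to check by hand. The rolling mean is computed first and its index is then moved one day forward, so the value attached to a given date only uses earlier dates.

```python
import pandas as pd

# Invented daily prices; the real notebook uses the daily mean PUN.
dates = pd.date_range('2014-01-01', periods=5, freq='D')
daily = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], index=dates, name='pun')

# Rolling mean over 3 days, then shift the index forward by one calendar day.
# The value stored at date D is therefore the mean of D-3, D-2 and D-1.
lagged = daily.rolling(window=3).mean().shift(periods=1, freq='D')

# Mean of 2014-01-01..2014-01-03, attached to 2014-01-04.
value_on_jan4 = lagged.loc[pd.Timestamp('2014-01-04')]
```

Shifting by one day is what keeps the current day's price out of its own feature; without it the target would leak into the predictor.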

In [7]:
def is_holiday(date):
    '''
    Helper function: returns 1 for the fixed holidays and for Easter Sunday
    and Easter Monday of 2014, 2015, 2016, and 2017; returns 0 otherwise.
    '''
    # fixed holidays
    if date.month == 1 and date.day == 1:
        return 1
    if date.month == 1 and date.day == 6:
        return 1
    if date.month == 4 and date.day == 25:
        return 1
    if date.month == 5 and date.day == 1:
        return 1
    if date.month == 6 and date.day == 2:
        return 1
    if date.month == 8 and date.day == 15:
        return 1
    if date.month == 11 and date.day == 1:
        return 1
    if date.month == 12 and date.day == 8:
        return 1
    if date.month == 12 and date.day == 25:
        return 1
    if date.month == 12 and date.day == 26:
        return 1
    # easter
    if date.year == 2014 and date.month == 4 and date.day == 20:
        return 1
    if date.year == 2014 and date.month == 4 and date.day == 21:
        return 1
    if date.year == 2015 and date.month == 4 and date.day == 5:
        return 1
    if date.year == 2015 and date.month == 4 and date.day == 6:
        return 1
    if date.year == 2016 and date.month == 3 and date.day == 27:
        return 1
    if date.year == 2016 and date.month == 3 and date.day == 28:
        return 1
    if date.year == 2017 and date.month == 4 and date.day == 16:
        return 1
    if date.year == 2017 and date.month == 4 and date.day == 17:
        return 1
    return 0
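The hard-coded Easter dates only cover the years listed in the function. As a possible alternative, the sketch below computes Easter Sunday with the anonymous Gregorian (Meeus/Jones/Butcher) algorithm, so the same check works for any year. The names `FIXED_HOLIDAYS`, `easter_sunday` and `is_holiday_computed` are hypothetical and are not used elsewhere in this notebook.

```python
from datetime import date, timedelta

# Fixed-date holidays as (month, day) pairs, copied from is_holiday.
FIXED_HOLIDAYS = {(1, 1), (1, 6), (4, 25), (5, 1), (6, 2),
                  (8, 15), (11, 1), (12, 8), (12, 25), (12, 26)}


def easter_sunday(year):
    """Gregorian Easter Sunday (anonymous / Meeus-Jones-Butcher algorithm)."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


def is_holiday_computed(day):
    """Hypothetical variant of is_holiday that is not tied to specific years."""
    # Normalise to a plain date so pandas Timestamps are accepted as well.
    current = date(day.year, day.month, day.day)
    if (current.month, current.day) in FIXED_HOLIDAYS:
        return 1
    easter = easter_sunday(current.year)
    # Easter Sunday and Easter Monday
    if current in (easter, easter + timedelta(days=1)):
        return 1
    return 0
```

For 2014 to 2017 this reproduces the Easter dates hard-coded in `is_holiday`, so it could be passed to `energy['date'].apply(...)` in the same way.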

In [8]:
energy['holiday'] = energy['date'].apply(is_holiday)

In [9]:
energy = energy.merge(market_ts['pun'].rolling(window=30).mean()\
                                      .shift(periods=1, freq='d').reset_index()\
                                      .rename(columns={'pun':'pun_last30'}),
                                  left_on='date',
                                  right_on='date',
                                  how='left')

In [10]:
energy = energy.merge(pd.get_dummies(energy['date'].dt.weekday),
                      left_index=True,
                      right_index=True,
                      how='left')\
                .rename(columns={0:'mon',
                                 1:'tue',
                                 2:'wed',
                                 3:'thu',
                                 4:'fri',
                                 5:'sat',
                                 6:'sun'})

In [11]:
energy = energy.merge(pd.get_dummies(energy['date'].dt.month),
                     left_index=True,
                     right_index=True, how='left')\
                .rename(columns={1:'jan',
                                 2:'feb',
                                 3:'mar',
                                 4:'apr',
                                 5:'may',
                                 6:'jun',
                                 7:'jul',
                                 8:'aug',
                                 9:'sep',
                                 10:'oct',
                                 11:'nov',
                                 12:'dec'})

In [12]:
energy = energy.merge(pd.get_dummies(energy['hour']),
                     left_index=True,
                     right_index=True, how='left')\
                .rename(columns={1:'h1',
                                 2:'h2',
                                 3:'h3',
                                 4:'h4',
                                 5:'h5',
                                 6:'h6',
                                 7:'h7',
                                 8:'h8',
                                 9:'h9',
                                 10:'h10',
                                 11:'h11',
                                 12:'h12',
                                 13:'h13',
                                 14:'h14',
                                 15:'h15',
                                 16:'h16',
                                 17:'h17',
                                 18:'h18',
                                 19:'h19',
                                 20:'h20',
                                 21:'h21',
                                 22:'h22',
                                 23:'h23',
                                 24:'h24'})
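The three dummy-encoding cells above follow the same pattern: build the dummies, merge on the index, rename the columns. A more compact variant, sketched here on an invented three-row frame, maps the numeric codes to labels first so that `pd.get_dummies` produces the final column names directly. The mapping `weekday_names` is an assumption introduced for this example.

```python
import pandas as pd

# Invented three-day frame; 2014-01-17 is a Friday.
toy = pd.DataFrame({'date': pd.to_datetime(['2014-01-17', '2014-01-18', '2014-01-19'])})

# Hypothetical lookup from pandas weekday numbers (Monday=0) to short labels.
weekday_names = {0: 'mon', 1: 'tue', 2: 'wed', 3: 'thu', 4: 'fri', 5: 'sat', 6: 'sun'}

# Label first, encode second: the dummy columns are already named.
labels = toy['date'].dt.weekday.map(weekday_names)
dummies = pd.get_dummies(labels).astype(int)

# The dummies share the original index, so a column-wise concat lines them up.
toy = pd.concat([toy, dummies], axis=1)
```

Note that `get_dummies` only creates columns for values that actually occur, so this toy frame gets `fri`, `sat` and `sun` only; on the full dataset every weekday, month and hour is present.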

In [13]:
for hour in np.arange(1, 25):
    market_ts = energy.loc[energy['hour']==hour].groupby('date').mean().copy()
    market_ts.drop(['AUST-XAUS', 'BRNN-SUD', 'BSP-SLOV', 'CNOR-CORS',
                    'CNOR-CSUD', 'CNOR-NORD', 'CORS-CNOR', 'CORS-SARD', 'CSUD-CNOR',
                    'CSUD-SARD', 'FOGN-SUD', 'FRAN-XFRA', 'MALT-SICI', 'NORD-CNOR',
                    'PRGP-SICI', 'ROSN-SICI', 'ROSN-SUD', 'SARD-CORS', 'SARD-CSUD',
                    'SICI-MALT', 'SICI-ROSN', 'SLOV-BSP', 'SUD-CSUD', 'XAUS-AUST',
                    'XFRA-FRAN'], axis=1, inplace=True)
    energy = energy.merge(market_ts[['pun', 'hour']].rolling(window=7).mean()\
                                          .shift(periods=1, freq='d').reset_index()\
                                          .rename(columns={'pun':'pun_hour_last7_{}'.format(str(hour))}),
                                      left_on=['date', 'hour'],
                                      right_on=['date', 'hour'],
                                      how='left')
    energy['hour'] = pd.to_numeric(energy['hour'])

In [14]:
# Each row has a value in at most one of the 24 per-hour columns,
# so summing them (with NaN treated as 0) collapses them into one feature.
hour_cols = ['pun_hour_last7_{}'.format(hour) for hour in range(1, 25)]
energy['pun_hour_last7'] = energy[hour_cols].fillna(0).sum(axis=1)

energy.drop(hour_cols, axis=1, inplace=True)
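The per-hour rolling mean built above with a loop and 24 temporary columns could also be sketched with a single `groupby`. This is an alternative under stated assumptions, not the notebook's method: it shifts by rows rather than by calendar day, so it assumes every hour has one row per consecutive day, and it leaves the first days as NaN instead of 0. The data and the column name `pun_hour_last2` are invented, with a 2-day window to keep the arithmetic visible.

```python
import pandas as pd

# Invented data: 4 days, 2 hours per day.
dates = pd.date_range('2014-01-01', periods=4, freq='D')
toy = pd.DataFrame({
    'date': dates.repeat(2),
    'hour': [1, 2] * 4,
    'pun': [10.0, 100.0, 20.0, 200.0, 30.0, 300.0, 40.0, 400.0],
})

# Within each hour, order by date, drop the current day with shift(1),
# then average the previous 2 days (the notebook uses 7).
toy = toy.sort_values(['hour', 'date'])
toy['pun_hour_last2'] = (
    toy.groupby('hour')['pun']
       .transform(lambda s: s.shift(1).rolling(window=2).mean())
)

# Hour 2 on 2014-01-04 averages the hour-2 prices of 2014-01-02 and 2014-01-03.
row = toy[(toy['date'] == pd.Timestamp('2014-01-04')) & (toy['hour'] == 2)]
hour2_jan4 = row['pun_hour_last2'].iloc[0]
```

If some days were missing for a given hour, the row-based shift would silently average over a longer calendar span, which is the main reason to prefer the date-based shift when the series has gaps.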

In [15]:
energy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 33596 entries, 0 to 33595
Data columns (total 88 columns):
date              33596 non-null datetime64[ns]
hour              33596 non-null int64
pun               33596 non-null float64
italy             33596 non-null int64
cnorth            33596 non-null int64
csouth            33596 non-null int64
north             33596 non-null int64
sardinia          33596 non-null int64
sicily            33596 non-null int64
south             33596 non-null int64
AUST-XAUS         23853 non-null float64
BRNN-SUD          33596 non-null float64
BSP-SLOV          33596 non-null float64
CNOR-CORS         33596 non-null float64
CNOR-CSUD         33596 non-null float64
CNOR-NORD         33596 non-null float64
CORS-CNOR         33596 non-null float64
CORS-SARD         33596 non-null float64
CSUD-CNOR         33596 non-null float64
CSUD-SARD         33596 non-null float64
FOGN-SUD          33596 non-null float64
FRAN-XFRA         23853 non-null float6

In [16]:
energy.head()

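| | date | hour | pun | italy | cnorth | csouth | north | sardinia | sicily | south | ... | h16 | h17 | h18 | h19 | h20 | h21 | h22 | h23 | h24 | pun_hour_last7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-01-17 | 1 | 50.393484 | 28430 | 3174 | 4275 | 15963 | 818 | 1775 | 2425 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 55.826318 |
| 1 | 2014-01-17 | 2 | 45.7 | 26631 | 2966 | 3909 | 15145 | 756 | 1646 | 2209 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 50.332167 |
| 2 | 2014-01-17 | 3 | 41.973579 | 25711 | 2732 | 3727 | 14865 | 727 | 1570 | 2090 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 46.041461 |
| 3 | 2014-01-17 | 4 | 40.261427 | 25468 | 2688 | 3655 | 14830 | 719 | 1531 | 2045 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 40.197408 |
| 4 | 2014-01-17 | 5 | 40.103296 | 25725 | 2715 | 3660 | 15024 | 727 | 1518 | 2081 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 40.875501 |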


In [17]:
energy.to_pickle('../data/processed/energy_processed.pkl')

## Following Notebooks

- [More exploratory data analysis](4.0-EDA-Bis.ipynb)
- [Predictive model](5.0-Model.ipynb)