# Energy Price Prediction Project

## Previous Notebooks

- [Energy data import and cleaning](1.0-GME-Data.ipynb)
- [Weather data import and cleaning](1.1-Weather-Data.ipynb)

In [1]:
import numpy as np
import pandas as pd

In [2]:
from tabula import read_pdf
import os

## Futures Data

The prices of the energy futures traded on the stock exchange may be related to the actual energy price.

I downloaded the futures data from [this archive](http://www.borsaitaliana.it/borsaitaliana/statistiche/mercati/commodities/commodities.htm) and loaded it from the PDF files using the `tabula-py` module.

Helper function for reading the futures PDF files:

In [3]:
MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

def read_futures_pdf(file, page):
    '''
    Reads futures pdf file and outputs pandas dataframe
    with the columns date, futures_month and baseload
    '''
    pdf = read_pdf(file, pages=page)
    # newer tabula-py versions return a list of dataframes, one per table
    if isinstance(pdf, list):
        pdf = pdf[0]
    col = 'MONTHLY BASELOAD FUTURES'
    # keep only the rows of the monthly contracts, e.g. 'Jan 2014'
    monthly = pdf[col].str[0:3].isin(MONTHS)
    # if the dataframe read from the pdf has 5 or more columns then
    # the first is the date, the second contains the predicted month
    # and the third the price
    if pdf.shape[1] >= 5:
        pdf_filtered = pdf.loc[monthly, ['Unnamed: 0', col, 'Unnamed: 2']]\
                          .rename(columns={'Unnamed: 0': 'date',
                                           col: 'futures_month',
                                           'Unnamed: 2': 'baseload'})
    # if the dataframe read from the pdf has exactly 4 columns then
    # the first is the date and the second contains both the
    # predicted month and the price
    elif pdf.shape[1] == 4:
        pdf_filtered = pdf.loc[monthly, ['Unnamed: 0', col]]\
                          .rename(columns={'Unnamed: 0': 'date',
                                           col: 'futures_month'})
        # the price is the text after the last space
        pdf_filtered['baseload'] = pdf_filtered['futures_month'].apply(lambda x: x[x.rfind(' ')+1:])
    # in the other cases we have something else going on
    else:
        raise ValueError('Yet another format in {}: {} columns'.format(file, pdf.shape[1]))
    # the month label ('Mon YYYY') is the first eight characters
    pdf_filtered['futures_month'] = pdf_filtered['futures_month'].str[0:8]
    return pdf_filtered
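
To make the four-column case more concrete, here is a minimal sketch of the same string splitting on a toy series. The cell contents are made up for illustration; the real values come from the PDF tables and may look different.

```python
import pandas as pd

# Hypothetical cell contents where month and price share one column.
raw = pd.Series(['Jan 2014 61.50', 'Feb 2014 60.25'])

# The price is everything after the last space...
baseload = raw.apply(lambda x: x[x.rfind(' ') + 1:])
# ...and the contract month is the first eight characters ('Mon YYYY').
futures_month = raw.str[0:8]

parsed = pd.DataFrame({'futures_month': futures_month, 'baseload': baseload})
print(parsed)
```

Both columns are still strings at this point; the conversion to numbers and dates happens in the cleaning cells further down.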

Looping through all files in the directory and cleaning dates and price:

In [4]:
directory = '../data/raw'
frames = []
# sorted() makes the processing order deterministic
for filename in sorted(os.listdir(directory)):
    if filename.endswith('.pdf'):
        print(filename)
        frames.append(read_futures_pdf(os.path.join(directory, filename), page=2))
# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
futures = pd.concat(frames)

idexstat201310.pdf
idexstat201311.pdf
idexstat201312.pdf
idexstat201401.pdf
idexstat201402.pdf
idexstat201403.pdf
idexstat201404.pdf
idexstat201405.pdf
idexstat201406.pdf
idexstat201407.pdf
idexstat201408.pdf
idexstat201409.pdf
idexstat201410.pdf
idexstat201411.pdf
idexstat201412.pdf
idexstat201501.pdf
idexstat201502.pdf
idexstat201503.pdf
idexstat201504.pdf
idexstat201505.pdf
idexstat201506.pdf
idexstat201507.pdf
idexstat201508.pdf
idexstat201509.pdf
idexstat201510.pdf
idexstat201511.pdf
idexstat201512.pdf
idexstat201601.pdf
idexstat201602.pdf
idexstat201603.pdf
idexstat201604.pdf
idexstat201605.pdf
idexstat201606.pdf
idexstat201607.pdf
idexstat201608.pdf
idexstat201609.pdf
idexstat201610.pdf
idexstat201611.pdf
idexstat201612.pdf
idexstat201701.pdf
idexstat201702.pdf
idexstat201703.pdf
idexstat201704.pdf
idexstat201705.pdf
idexstat201706.pdf
idexstat201707.pdf
idexstat201708.pdf
idexstat201709.pdf


In [5]:
futures['baseload'] = pd.to_numeric(futures['baseload'].str.replace(' ',''))
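
As a small sketch of what this cell does, assuming some extracted prices contain stray spaces (the strings below are invented, not taken from the PDFs):

```python
import pandas as pd

# Hypothetical prices as they might come out of the PDF extraction.
prices = pd.Series(['61 .50', '60.25', ' 59.80'])

# Remove every space first, then convert the clean strings to floats.
clean = pd.to_numeric(prices.str.replace(' ', ''))
print(clean.dtype)
```

If a value is still not numeric after removing the spaces, `pd.to_numeric` raises an error, which is a useful signal that a PDF needs a closer look.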

In [6]:
# looking at the respective pdf file there is a 'c' instead of the actual date
futures.loc[futures['date']=='c', 'date'] = '01/03/2017'

In [7]:
futures['date'] = pd.to_datetime(futures['date'], format='%d/%m/%Y')
futures['futures_month'] = pd.to_datetime(futures['futures_month'], format='%b %Y')
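
The explicit formats matter here, so below is a minimal sketch on made-up rows. With `'%d/%m/%Y'` the string `'01/03/2017'` is read as the 1st of March and not the 3rd of January, and `'%b %Y'` maps a month label to the first day of that month. The month abbreviations are assumed to be English, as in the PDFs.

```python
import pandas as pd

# Hypothetical rows in the same string formats as the futures dataframe.
sample = pd.DataFrame({'date': ['01/03/2017', '02/03/2017'],
                       'futures_month': ['Apr 2017', 'May 2017']})

# Day-first dates: '01/03/2017' is the 1st of March 2017.
sample['date'] = pd.to_datetime(sample['date'], format='%d/%m/%Y')
# Month labels become the first day of the delivery month.
sample['futures_month'] = pd.to_datetime(sample['futures_month'], format='%b %Y')

print(sample)
```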

In [None]:
futures.reset_index(inplace=True, drop=True)

In [35]:
futures.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3042 entries, 1 to 63
Data columns (total 3 columns):
date             3042 non-null datetime64[ns]
futures_month    3042 non-null datetime64[ns]
baseload         3042 non-null float64
dtypes: datetime64[ns](2), float64(1)
memory usage: 175.1 KB


In [21]:
futures.to_pickle('../data/interim/futures.pkl')
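
A later notebook can load the file back with `pd.read_pickle`. The sketch below shows the round trip on made-up rows, with a temporary directory standing in for `../data/interim`:

```python
import os
import tempfile

import pandas as pd

# Made-up rows with the same three columns as the futures dataframe.
sample = pd.DataFrame({
    'date': pd.to_datetime(['2017-03-01', '2017-03-02']),
    'futures_month': pd.to_datetime(['2017-04-01', '2017-04-01']),
    'baseload': [45.0, 45.5],
})

# A temporary directory is used here instead of ../data/interim.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'futures.pkl')
    sample.to_pickle(path)
    restored = pd.read_pickle(path)

# Pickle keeps the dtypes, so the dates do not need to be parsed again.
print(restored.dtypes)
```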

## Following Notebooks

- [Gas price import and cleaning](1.3-Gas-Data.ipynb)
- [Merging data](1.5-Merge-Data.ipynb)
- [Exploratory data analysis](2.0-EDA.ipynb)
- [Feature engineering](3.0-Feature-Engineering.ipynb)
- [More exploratory data analysis](4.0-EDA-Bis.ipynb)
- [Predictive model](5.0-Model.ipynb)