# Energy Price Prediction Project

The [energy market in Italy](https://en.wikipedia.org/wiki/Italian_Power_Exchange) runs every day to determine the energy price (**PUN**, *Prezzo Unico Nazionale*) for the following day.

The PUN is determined as the balance between supply and demand, subject to constraints on the transit of energy between the different zones of Italy and between Italy and neighboring countries.

The aim of this project is to predict the energy price from the data published by the [energy market's managing institution (GME)](http://www.mercatoelettrico.org/It/default.aspx). Reliable price predictions are useful to energy producers for two reasons:

1. they allow producers to sell their energy at the best possible price, and
2. they reduce the risk of failing to sell, in which case producers may have to dissipate their energy and bear the related costs.

In [1]:
import numpy as np
import pandas as pd

In [2]:
import xml.etree.ElementTree
import zipfile
import os
import datetime

Some helper functions to read the XML files that hold the data: one parses a single XML file, one loops through the XML files inside a zip archive, and one loops through the zip archives in the data folder.

In [3]:
def import_xml_file(file, row_names, cols):
    '''
    Reads an xml file and returns its data as a list of rows
    Inputs:
        file -> file name or file object
        row_names -> list of tags identifying the row elements
        cols -> list of tags identifying the column elements
    '''
    data = xml.etree.ElementTree.parse(file).getroot()
    all_data = []
    for row_name in row_names:
        for row_data in data.findall(row_name):
            row = []
            for col in cols:
                col_data = row_data.find(col).text
                row.append(col_data)
            all_data.append(row)
    
    return all_data

def import_xml_zip_file(folder, zip_file, row_names, cols, out_cols):
    '''
    Loops through the xml files in a zip archive and returns a pandas dataframe with the data of all of them
    Uses import_xml_file
    Inputs:
        folder -> zip file folder
        zip_file -> zip file name
        row_names -> list of tags identifying the row elements
        cols -> list of tags identifying the column elements
        out_cols -> column names for the pandas dataframe
    '''
    all_data = []
    with zipfile.ZipFile(os.path.join(folder, zip_file)) as z:
        for file in z.namelist():
            # Close each archive member as soon as it has been parsed
            with z.open(file) as f:
                all_data.extend(import_xml_file(f, row_names, cols))
    
    xml_df = pd.DataFrame(all_data, columns=out_cols)
    return xml_df

def import_xml_from_folder(folder, file_name_path, row_names, cols, out_cols, int_cols, num_cols, date_cols):
    '''
    Loops through all the zip archives in a folder and returns a pandas dataframe with all the zipped data
    Uses import_xml_zip_file
    Inputs:
        folder -> zip file folder
        file_name_path -> prefix of the file names to search for
        row_names -> list of tags identifying the row elements
        cols -> list of tags identifying the column elements
        out_cols -> column names for the pandas dataframe
        int_cols -> list of columns to convert to int
        num_cols -> list of columns to convert to numeric
        date_cols -> list of columns to convert to date
    '''
    # DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate once
    frames = []
    for filename in os.listdir(folder):
        # Note: zipfile only reads zip archives; a true 7z archive would raise BadZipFile
        if filename.startswith(file_name_path) and filename.endswith(('.zip', '.7z')):
            frames.append(import_xml_zip_file(folder, filename, row_names, cols, out_cols))
    out_df = pd.concat(frames) if frames else pd.DataFrame(columns=out_cols)
    
    for col in int_cols:
        out_df[col] = pd.to_numeric(out_df[col])
    for col in num_cols:
        # The values use a decimal comma, which pd.to_numeric does not parse
        out_df[col] = out_df[col].str.replace(',', '.', regex=False)
        out_df[col] = pd.to_numeric(out_df[col])
    for col in date_cols:
        out_df[col] = pd.to_datetime(out_df[col])
    
    return out_df.reset_index(drop=True)
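
To make the expected input concrete, here is a minimal sketch of the parsing and conversion steps on an in-memory document. The XML layout (a `NewDataSet` root with flat `Prezzi` rows), the `YYYYMMDD` date format and the values are assumptions made up for illustration; the real GME files may differ in the details.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical sample: root tag, date format and values are illustrative only.
sample_xml = """<NewDataSet>
  <Prezzi><Data>20140117</Data><Ora>1</Ora><PUN>50,393484</PUN></Prezzi>
  <Prezzi><Data>20140117</Data><Ora>2</Ora><PUN>45,700000</PUN></Prezzi>
</NewDataSet>"""

# Same logic as import_xml_file, but reading from a string instead of a file
root = ET.fromstring(sample_xml)
rows = [[row.find(col).text for col in ['Data', 'Ora', 'PUN']]
        for row in root.findall('Prezzi')]

sample_df = pd.DataFrame(rows, columns=['date', 'hour', 'pun'])

# Every value is read as text, so each column needs an explicit conversion
sample_df['hour'] = pd.to_numeric(sample_df['hour'])
sample_df['pun'] = pd.to_numeric(sample_df['pun'].str.replace(',', '.', regex=False))
sample_df['date'] = pd.to_datetime(sample_df['date'], format='%Y%m%d')
```

All values come out of the XML as strings, which is why the folder-level helper takes separate `int_cols`, `num_cols` and `date_cols` lists.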

Import the data.

- Prices data: one row per day and hour, PUN is the national price of energy and the variable to predict.
- Demand data: estimates of the energy need for each zone in Italy and for the whole country; again there is one row per day and hour.
- Transit data: limits on the energy exchange between neighboring zones in Italy and between Italy and neighboring countries; these limits constrain the energy price. This dataset has one row per day, hour, zone of origin and zone of destination.

In [4]:
pun = import_xml_from_folder(folder='../data/raw',
                             file_name_path='MGP_Prezzi',
                             row_names=['Prezzi'],
                             cols=['Data', 'Ora', 'Mercato', 'PUN'],
                             out_cols=['date', 'hour', 'market', 'pun'],
                             int_cols=['hour'],
                             num_cols=['pun'],
                             date_cols=['date'])
demand = import_xml_from_folder(folder='../data/raw',
                                file_name_path='MGP_StimeFabbisogno',
                                row_names=['Fabbisogno', 'StimeFabbisogno', 'stimeFabbisogno', 'marketintervaldetail'],
                                cols=['Data', 'Ora', 'Mercato', 'Totale', 'CNOR', 'CSUD', 'NORD', 'SARD', 'SICI', 'SUD'],
                                out_cols=['date', 'hour', 'market', 'italy', 'cnorth', 'csouth', 'north', 'sardinia', 'sicily', 'south'],
                                int_cols=['hour'],
                                num_cols=['italy', 'cnorth', 'csouth', 'north', 'sardinia', 'sicily', 'south'],
                                date_cols=['date'])
transit_limit = import_xml_from_folder(folder='../data/raw',
                                       file_name_path='MGP_LimitiTransito',
                                       row_names=['LimitiTransito'],
                                       cols=['Data', 'Ora', 'Mercato', 'Da', 'A', 'Limite', 'Coefficiente'],
                                       out_cols=['date', 'hour', 'market', 'zone_from', 'zone_to', 'limit', 'coefficient'],
                                       int_cols=['hour'],
                                       num_cols=['limit', 'coefficient'],
                                       date_cols=['date'])

In [5]:
pun.head()

| | date | hour | market | pun |
|---|---|---|---|---|
| 0 | 2014-01-17 | 1 | MGP | 50.393484 |
| 1 | 2014-01-17 | 2 | MGP | 45.7 |
| 2 | 2014-01-17 | 3 | MGP | 41.973579 |
| 3 | 2014-01-17 | 4 | MGP | 40.261427 |
| 4 | 2014-01-17 | 5 | MGP | 40.103296 |


In [6]:
demand.head()

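| | date | hour | market | italy | cnorth | csouth | north | sardinia | sicily | south |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-01-17 | 1 | MGP | 28430 | 3174 | 4275 | 15963 | 818 | 1775 | 2425 |
| 1 | 2014-01-17 | 2 | MGP | 26631 | 2966 | 3909 | 15145 | 756 | 1646 | 2209 |
| 2 | 2014-01-17 | 3 | MGP | 25711 | 2732 | 3727 | 14865 | 727 | 1570 | 2090 |
| 3 | 2014-01-17 | 4 | MGP | 25468 | 2688 | 3655 | 14830 | 719 | 1531 | 2045 |
| 4 | 2014-01-17 | 5 | MGP | 25725 | 2715 | 3660 | 15024 | 727 | 1518 | 2081 |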


In [7]:
transit_limit.head()

| | date | hour | market | zone_from | zone_to | limit | coefficient |
|---|---|---|---|---|---|---|---|
| 0 | 2014-01-26 | 1 | MGP | AUST | NORD | 10000.0 | 1.0 |
| 1 | 2014-01-26 | 2 | MGP | AUST | NORD | 10000.0 | 1.0 |
| 2 | 2014-01-26 | 3 | MGP | AUST | NORD | 10000.0 | 1.0 |
| 3 | 2014-01-26 | 4 | MGP | AUST | NORD | 10000.0 | 1.0 |
| 4 | 2014-01-26 | 5 | MGP | AUST | NORD | 10000.0 | 1.0 |


I want the transit data to have one row per day and hour, like the other two datasets, so I turn each origin-destination pair of zones into its own feature column holding the corresponding limit.

In [8]:
transit_limit['from_to'] = transit_limit['zone_from'] + '-' + transit_limit['zone_to']

In [9]:
transit_limit = transit_limit.pivot_table(index=['date', 'hour'], columns='from_to', values='limit').reset_index()
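
The reshaping done in the two cells above can be illustrated on a toy example. The zone pairs and limits below are made up; only the column names follow the notebook.

```python
import pandas as pd

# Toy long-format data: one row per (date, hour, zone_from, zone_to).
# Zone pairs and limit values are illustrative only.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2014-01-26'] * 4),
    'hour': [1, 1, 2, 2],
    'zone_from': ['AUST', 'NORD', 'AUST', 'NORD'],
    'zone_to': ['NORD', 'CNOR', 'NORD', 'CNOR'],
    'limit': [10000.0, 3000.0, 10000.0, 3200.0],
})
toy['from_to'] = toy['zone_from'] + '-' + toy['zone_to']

# Wide format: one row per (date, hour), one column per zone pair
toy_wide = toy.pivot_table(index=['date', 'hour'], columns='from_to', values='limit').reset_index()
toy_wide.columns.name = None  # drop the leftover 'from_to' label on the column axis
```

Note that `pivot_table` aggregates with the mean by default, so if a (date, hour, zone pair) combination appeared more than once, its limits would be averaged silently rather than raising an error.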

In [10]:
print(pun.shape)
print(demand.shape)
print(transit_limit.shape)

(33600, 4)
(33600, 10)
(33600, 46)


In [11]:
print(pun['date'].min(), pun['date'].max())
print(demand['date'].min(), demand['date'].max())
print(transit_limit['date'].min(), transit_limit['date'].max())

2014-01-01 00:00:00 2017-10-31 00:00:00
2014-01-01 00:00:00 2017-10-31 00:00:00
2014-01-01 00:00:00 2017-10-31 00:00:00
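
Matching shapes and date ranges are a good sign, but they do not by themselves guarantee exactly one row per day and hour. A stricter check could look like the sketch below; `count_duplicate_keys` is a hypothetical helper, not part of the notebook, shown here on made-up data.

```python
import pandas as pd

def count_duplicate_keys(df, keys=('date', 'hour')):
    """Return the number of rows whose key combination already appeared earlier."""
    return int(df.duplicated(subset=list(keys)).sum())

# Toy data with one repeated (date, hour) pair; values are illustrative only
toy = pd.DataFrame({
    'date': pd.to_datetime(['2014-01-01', '2014-01-01', '2014-01-01']),
    'hour': [1, 2, 2],
    'pun': [50.0, 45.0, 45.0],
})
n_dupes = count_duplicate_keys(toy)
```

Applied to `pun`, `demand` and `transit_limit`, a result of zero for each would confirm that `['date', 'hour']` is a unique key.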


Finally, I merge all the datasets on the date and hour columns.

In [12]:
market = pd.merge(pun, demand, how='inner', on=['date', 'hour']).drop(['market_x', 'market_y'], axis=1)
market = pd.merge(market, transit_limit, how='inner', on=['date', 'hour'])
market.to_pickle('../data/interim/market.pkl')
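
An inner merge keeps only the (date, hour) pairs present in both tables, so rows can be dropped without any warning. The toy sketch below, with made-up values, shows that behavior and adds the `validate` argument of `pd.merge`, which the notebook does not use, as an optional guard against duplicated keys.

```python
import pandas as pd

# Toy frames sharing only hour 2; values are illustrative only
left = pd.DataFrame({'date': pd.to_datetime(['2014-01-01', '2014-01-01']),
                     'hour': [1, 2],
                     'pun': [50.0, 45.0]})
right = pd.DataFrame({'date': pd.to_datetime(['2014-01-01', '2014-01-01']),
                      'hour': [2, 3],
                      'italy': [26631, 25711]})

# validate='one_to_one' raises a MergeError if the keys repeat on either side
merged = pd.merge(left, right, how='inner', on=['date', 'hour'], validate='one_to_one')
```

Comparing `len(merged)` with the row counts of the inputs is a quick way to see whether the inner merge discarded anything.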

## Following Notebooks

- [Weather data import and cleaning](1.1-Weather-Data.ipynb)
- [Energy price futures import and cleaning](1.2-Futures-Data.ipynb)
- [Gas price import and cleaning](1.3-Gas-Data.ipynb)
- [Merging data](1.5-Merge-Data.ipynb)
- [Exploratory data analysis](2.0-EDA.ipynb)
- [Feature engineering](3.0-Feature-Engineering.ipynb)
- [More exploratory data analysis](4.0-EDA-Bis.ipynb)
- [Predictive model](5.0-Model.ipynb)