# Energy Price Prediction Project

## Previous Notebooks

- [Energy data import and cleaning](1.0-GME-Data.ipynb)
- [Weather data import and cleaning](1.1-Weather-Data.ipynb)
- [Energy price futures import and cleaning](1.2-Futures-Data.ipynb)

In [1]:
import numpy as np
import pandas as pd

In [2]:
import xml.etree.ElementTree
import zipfile
import os
import datetime

The data published by the [energy market's managing institution (GME)](http://www.mercatoelettrico.org/It/default.aspx) includes gas market data, which contains prices for the next few days. Because the aim is to predict the next day's price, and on any given day I only have the data from the day before, from all the daily prices in each file I will use the one for two days ahead.

First, I'm going to reuse the XML helper functions from the previous notebook. Since the files here don't all share the same column structure, I'm adding a check for missing columns:

In [3]:
def import_xml_file(file, row_names, cols):
    '''
    Reads an xml file and returns its data as a list of rows
    Missing column tags are returned as empty strings
    Inputs:
        file -> file name or file object
        row_names -> list of row tag identifiers
        cols -> list of column tag identifiers
    '''
    data = xml.etree.ElementTree.parse(file).getroot()
    all_data = []
    for row_name in row_names:
        for row_data in data.findall(row_name):
            row = []
            for col in cols:
                col_data = row_data.find(col)
                if col_data is None:
                    row.append('')
                else:
                    row.append(col_data.text)
            all_data.append(row)
    
    return all_data

def import_xml_zip_file(folder, zip_file, row_names, cols, out_cols):
    '''
    Loops through the xml files in a zip archive and returns a pandas dataframe with all their data
    Uses import_xml_file
    Inputs:
        folder -> zip file folder
        zip_file -> zip file name
        row_names -> list of row tag identifiers
        cols -> list of column tag identifiers
        out_cols -> column names for the pandas dataframe
    '''
    all_data = []
    with zipfile.ZipFile(os.path.join(folder, zip_file)) as z:
        for file in z.namelist():
            # Close each archive member as soon as it has been parsed
            with z.open(file) as f:
                all_data.extend(import_xml_file(f, row_names, cols))
    
    xml_df = pd.DataFrame(all_data, columns=out_cols)
    return xml_df

def import_xml_from_folder(folder, file_name_path, row_names, cols, out_cols, int_cols, num_cols, date_cols):
    '''
    Loops through all zip archives in a folder and returns a pandas dataframe with all the zipped data
    Uses import_xml_zip_file
    Inputs:
        folder -> zip file folder
        file_name_path -> prefix of the file names to search for
        row_names -> list of row tag identifiers
        cols -> list of column tag identifiers
        out_cols -> column names for the pandas dataframe
        int_cols -> list of columns to convert to int
        num_cols -> list of columns to convert to numeric
        date_cols -> list of columns to convert to date
    '''
    directory = os.fsencode(folder)
    frames = []
    for file in os.listdir(directory):
        filename = os.fsdecode(file)
        if filename.startswith(file_name_path) and filename.endswith(('.zip', '.7z')):
            frames.append(import_xml_zip_file(folder, filename, row_names, cols, out_cols))
    # DataFrame.append was removed in pandas 2.0, so concatenate the collected frames instead
    out_df = pd.concat(frames) if frames else pd.DataFrame(columns=out_cols)
    
    for col in int_cols:
        out_df[col] = pd.to_numeric(out_df[col])
    for col in num_cols:
        # Swap the decimal comma for a point; .str.replace also tolerates missing values
        out_df[col] = out_df[col].str.replace(',', '.', regex=False)
        out_df[col] = pd.to_numeric(out_df[col])
    for col in date_cols:
        out_df[col] = pd.to_datetime(out_df[col])
    
    return out_df.reset_index(drop=True)
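
To make the missing-column check concrete, here is a minimal sketch of the same idea on an in-memory document. The tag names (`riga`, `Data`, `Prezzo`), the sample values and the helper name `rows_from_xml` are made up for illustration and are not taken from the GME files.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical two-row document: the second row has no <Prezzo> tag.
SAMPLE_XML = """<root>
  <riga><Data>20140102</Data><Prezzo>27,574</Prezzo></riga>
  <riga><Data>20140103</Data></riga>
</root>"""

def rows_from_xml(source, row_names, cols):
    """Return one list per matching row; a missing column tag becomes ''."""
    root = ET.parse(source).getroot()
    rows = []
    for row_name in row_names:
        for row in root.findall(row_name):
            values = []
            for col in cols:
                cell = row.find(col)
                values.append('' if cell is None else cell.text)
            rows.append(values)
    return rows

# A bytes buffer stands in for the file object returned by ZipFile.open
parsed = rows_from_xml(io.BytesIO(SAMPLE_XML.encode('utf-8')), ['riga'], ['Data', 'Prezzo'])
print(parsed)  # [['20140102', '27,574'], ['20140103', '']]
```

Every row keeps the same length whatever tags it contains, which is what lets the rows be stacked into a dataframe with fixed column names.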

Import the data:

In [4]:
gas = import_xml_from_folder(folder='../data/raw',
                             file_name_path='MGPGAS_SintesiScambio',
                             row_names=['negoziazione_continua'],
                             cols=['DataSessione', 'NomeProdotto', 'PrimoPrezzo', 'UltimoPrezzo',
                                   'PrezzoMassimo', 'PrezzoMinimo', 'PrezzoMedio', 'PrezzoControllo',
                                   'Volumi', 'PosizioniAperte', 'Abbinamenti'],
                             out_cols=['date', 'market_date', 'opening_price', 'closing_price',
                                       'max_price', 'min_price', 'avg_price', 'control_price',
                                       'volumes', 'open_positions', 'pairings'],
                             int_cols=['pairings'],
                             num_cols=['opening_price', 'closing_price', 'max_price', 'min_price',
                                       'avg_price', 'control_price', 'volumes', 'open_positions'],
                             date_cols=['date'])
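
The columns passed as `num_cols` arrive as strings with a decimal comma, and tags that were missing arrive as empty strings. As a rough sketch of what the conversion does to such values (the sample strings below are invented for illustration):

```python
import pandas as pd

# Hypothetical raw values: decimal commas, plus an empty string for a missing tag
raw = pd.Series(['27,574', '', '30,1'])

# Swap the decimal comma for a point, then parse; the empty string becomes NaN
prices = pd.to_numeric(raw.str.replace(',', '.', regex=False))
print(prices)
```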

In [5]:
gas.head()

| | date | market_date | opening_price | closing_price | max_price | min_price | avg_price | control_price | volumes | open_positions | pairings |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-01-02 | MGP-2014-01-03 | | | | | | 27.574 | | 0.0 | 0 |
| 1 | 2014-01-02 | MGP-2014-01-04 | | | | | | 27.574 | | 0.0 | 0 |
| 2 | 2014-01-02 | MGP-2014-01-05 | | | | | | 27.574 | | 0.0 | 0 |
| 3 | 2014-01-21 | MGP-2014-01-22 | | | | | | 27.574 | | 0.0 | 0 |
| 4 | 2014-01-21 | MGP-2014-01-23 | | | | | | 27.574 | | 0.0 | 0 |


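Convert `market_date` to a real date and shift it back by two days, then keep only the rows where it matches the session `date`, so that each day is paired with the price quoted for two days ahead. I use `control_price` as the price column because it is the one with the fewest missing values.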

In [6]:
gas['market_date'] = pd.to_datetime(gas['market_date'].str.replace('MGP-',''))
gas['market_date'] = gas['market_date'] - datetime.timedelta(days=2)
gas_daily = gas.loc[gas['date']==gas['market_date'], ['market_date', 'control_price']]
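
As a small illustration of this selection, the sketch below applies the same steps to a single session that quotes three products. The prices are invented; only the row whose product is dated two days after the session survives the filter.

```python
import datetime
import pandas as pd

# Hypothetical session on 2014-01-02 quoting three delivery days (made-up prices)
toy = pd.DataFrame({
    'date': pd.to_datetime(['2014-01-02'] * 3),
    'market_date': ['MGP-2014-01-03', 'MGP-2014-01-04', 'MGP-2014-01-05'],
    'control_price': [27.1, 27.2, 27.3],
})

# Strip the product prefix, parse the date and shift it back by two days
delivery = pd.to_datetime(toy['market_date'].str.replace('MGP-', '', regex=False))
toy['market_date'] = delivery - datetime.timedelta(days=2)

# Keep the row where the shifted date equals the session date
selected = toy.loc[toy['date'] == toy['market_date'], ['market_date', 'control_price']]
print(selected)  # one row: market_date 2014-01-02, control_price 27.2
```

After the shift, `market_date` holds the session date on which the price was known, which is the date the later merge needs.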

Forward fill missing values.

In [8]:
gas_daily = gas_daily.sort_values(by='market_date').ffill()
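
A quick sketch of why the sort comes first, on made-up values: forward filling copies the last known price down the rows, so the rows have to be in date order, and a gap at the very start has nothing to copy from and stays missing.

```python
import numpy as np
import pandas as pd

# Hypothetical unsorted prices with gaps (values invented for illustration)
toy = pd.DataFrame({
    'market_date': pd.to_datetime(['2014-01-03', '2014-01-01', '2014-01-04', '2014-01-02']),
    'control_price': [np.nan, np.nan, 28.0, 27.5],
})

# Sort by date, then carry the last known price forward
filled = toy.sort_values(by='market_date').ffill()
print(filled)
```

Here 2014-01-03 takes the price from 2014-01-02, while 2014-01-01 remains NaN.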

In [9]:
gas_daily.head()

| | market_date | control_price |
|---|---|---|
| 91 | 2014-01-01 | 27.574 |
| 1 | 2014-01-02 | 27.574 |
| 70 | 2014-01-03 | 27.574 |
| 34 | 2014-01-04 | 27.574 |
| 31 | 2014-01-05 | 27.574 |


In [10]:
gas_daily.to_pickle('../data/interim/gas.pkl')

## Following Notebooks

- [Merging data](1.5-Merge-Data.ipynb)
- [Exploratory data analysis](2.0-EDA.ipynb)
- [Feature engineering](3.0-Feature-Engineering.ipynb)
- [More exploratory data analysis](4.0-EDA-Bis.ipynb)
- [Predictive model](5.0-Model.ipynb)