**Purpose of script**:

Transform the .xls tables of Elspot prices into a simple hourly time series. The script skips the leading metadata rows, merges the date and hour columns into a single timestamp, and keeps only the Kr.sand column, which holds the price relevant for this project.

Original layout (the last row shows the format of each column):

|          | Hours | Location1     | Location2     | .... | .... |
|----------|-------|---------------|---------------|------|------|
| 01-01-18 | 00-01 | 310,02        | 321,09        | .... | .... |
| 01-01-18 | 01-02 | 310,20        | 322,01        | .... | .... |
| ....     | ....  | ....          | ....          | .... | .... |
| %d-%m-%y | %H-%H | nok,øre(cent) | nok,øre(cent) | .... | .... |

Layout after the transformation (the last row shows the format of each column):

| timestamp           | Kr.sand       |
|---------------------|---------------|
| 2018-01-01 00:00:00 | 313,86        |
| 2018-01-01 01:00:00 | 313,83        |
| ....                | ....          |
| %Y-%m-%d %H:%M:%S   | nok,øre(cent) |
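
The reshaping described above can be tried on a couple of hand-written rows before running it on the real files. This is a minimal sketch, assuming the unnamed date column is read by pandas as `Unnamed: 0` and that dates follow `%d-%m-%y`; the toy frame `raw` and its values are illustrative only.

```python
import pandas as pd

# Toy rows in the original layout (illustrative values, not real data).
raw = pd.DataFrame({
    'Unnamed: 0': ['01-01-18', '01-01-18'],
    'Hours': ['00-01', '01-02'],
    'Kr.sand': ['313,86', '313,83'],
})

# Join the date with the starting hour of each interval, e.g. '01-01-18 00:00'.
stamp_text = raw['Unnamed: 0'] + ' ' + raw['Hours'].str[:2] + ':00'
raw['timestamp'] = pd.to_datetime(stamp_text, format='%d-%m-%y %H:%M')

# Keep only the price column, indexed by the new timestamp.
tidy = raw.set_index('timestamp')[['Kr.sand']]
print(tidy)
```

Passing an explicit `format` is a design choice for the sketch: it makes a day/month mix-up fail loudly instead of being parsed silently.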


In [1]:
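import os
from glob import glob

import pandas as pd

In [2]:
origin_folder = 'raw//elspot_prices'
destination_folder = 'sql_src//elspot_prices'

In [3]:
for file_path in glob(os.path.join(origin_folder, '*.xls')):

    # The files are parsed with read_html, so they are expected to hold HTML tables
    # despite the .xls extension. skiprows=2 drops the leading metadata rows.
    df = pd.read_html(file_path, skiprows=2, thousands='.', header=0)[0][['Unnamed: 0', 'Hours', 'Kr.sand']]

    # Combine the date with the starting hour of each interval, e.g. '01-01-18 00:00'.
    df['timestamp'] = pd.to_datetime(df['Unnamed: 0'] + ' ' + df['Hours'].str[:2] + ':00',
                                     dayfirst=True)

    # Keep the first row of any repeated timestamp, then index by timestamp.
    df = df.drop_duplicates('timestamp')
    df = df.set_index('timestamp').drop(columns=['Unnamed: 0', 'Hours'])

    # splitext removes only the extension; str.strip('.xls') would strip any of
    # the characters '.', 'x', 'l', 's' from both ends of the name.
    file_name = os.path.splitext(os.path.basename(file_path))[0]
    df.to_csv(os.path.join(destination_folder, f'{file_name}.csv'))
    print('Transformed', file_path)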

Transformed raw//elspot_prices\elspot-prices_2018_hourly_nok.xls
Transformed raw//elspot_prices\elspot-prices_2019_hourly_nok.xls
Transformed raw//elspot_prices\elspot-prices_2020_hourly_nok.xls
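
Deriving the output name from the input path is a small step that is easy to get subtly wrong: `str.strip` treats its argument as a set of characters to remove from both ends, not as a suffix. The sketch below contrasts it with `os.path.splitext`; the helper `csv_name` and the file name `sales.xls` are hypothetical and only serve the illustration.

```python
import os


def csv_name(file_path):
    """Return the base file name with its extension replaced by .csv."""
    stem, _ext = os.path.splitext(os.path.basename(file_path))
    return stem + '.csv'


# Works for the Elspot files used here.
print(csv_name('raw/elspot_prices/elspot-prices_2018_hourly_nok.xls'))

# Pitfall: strip removes the characters '.', 'x', 'l', 's' from both ends,
# so a name that starts or ends with one of them gets mangled.
print('sales.xls'.strip('.xls'))  # prints 'ale'
```

The Elspot file names happen to start with `e` and end with `k`, so `strip('.xls')` gives the right result for them, but that is a coincidence of the naming.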


In [4]:
df.head()

timestamp,Kr.sand
2020-01-01 00:00:00,31386
2020-01-01 01:00:00,31336
2020-01-01 02:00:00,31139
2020-01-01 03:00:00,30853
2020-01-01 04:00:00,30301
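
As a quick sanity check, an exported CSV can be read back with the timestamp column as a `DatetimeIndex` and scanned for gaps in the hourly series. This is a minimal sketch, assuming the CSV has a `timestamp` column followed by `Kr.sand`; it uses an in-memory string with the first rows of the preview instead of a real file.

```python
import io

import pandas as pd

# A few rows shaped like the exported CSV (values taken from the preview).
csv_text = (
    'timestamp,Kr.sand\n'
    '2020-01-01 00:00:00,31386\n'
    '2020-01-01 01:00:00,31336\n'
    '2020-01-01 02:00:00,31139\n'
)

prices = pd.read_csv(io.StringIO(csv_text), index_col='timestamp', parse_dates=True)

# Spacing between consecutive rows; anything other than one hour marks a gap.
steps = prices.index.to_series().diff().dropna()
gaps = steps[steps != pd.Timedelta(hours=1)]
print(len(gaps), 'gaps found')
```

On a real file, replace the `io.StringIO` object with the path to the CSV; any rows listed in `gaps` point to missing or dropped hours.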
