# README

This notebook reads in the raw smartmeter output files located in `//datc//opschaler//smartmeter_data//*.csv`.  
It cleans this data and prepares it for exploratory data analysis (EDA).  
Finally, it combines this data per house with the weather data located at `//datc//opschaler//weather_data//weather.csv`.  
The final dataframes are saved as tab-separated files in `//datc//opschaler//combined_dfs_gas_smart_weather//dwelling_id.csv`.

TODO: 
* Remove 0's from gasMeter column.
* Save the number of NaNs (including the lengths of runs of consecutive NaNs) per column in one Excel file for all houses.
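
The NaN bookkeeping from the second TODO item could look roughly like the sketch below. It is not part of the notebook: the function name `nan_summary`, the column name `ePower` and the toy values are assumptions made for illustration. Only `gasMeter` appears in the notebook itself.

```python
import numpy as np
import pandas as pd


def nan_summary(df):
    """Return the NaN count and the longest run of consecutive NaNs per column."""
    rows = {}
    for col in df.columns:
        is_nan = df[col].isna()
        # Every non-NaN value starts a new group; the NaNs that follow it fall
        # into the same group, so summing is_nan per group gives the run lengths.
        run_lengths = is_nan.groupby((~is_nan).cumsum()).sum()
        rows[col] = {
            'nan_count': int(is_nan.sum()),
            'longest_nan_run': int(run_lengths.max()) if len(run_lengths) else 0,
        }
    return pd.DataFrame.from_dict(rows, orient='index')


# Toy data, not taken from the real smartmeter files.
example = pd.DataFrame({
    'ePower': [1.0, np.nan, np.nan, 4.0, np.nan],
    'gasMeter': [np.nan, np.nan, np.nan, 2.0, 3.0],
})
summary = nan_summary(example)
print(summary)
```

One summary per house could then be collected and written to a single workbook with `DataFrame.to_excel`, which needs an Excel writer package such as openpyxl to be installed.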

## Step-by-step overview of the notebook
* Load in the weather dataframe.
* Get a file_paths list of all the smartmeter_data files.
* Iterate over this list with main(), doing the following for each file:
    * clean_datetime: Remove non-datetime rows from the datetime column.
    * clean_prepare_smart_gas: Split the smartmeter data into a smart (= electricity) and a gas dataframe, and prepare them for resampling.
    * resample_smart_gas: Resample smart and gas to have the same datetime index.
    * merge_smart_gas_weather: Merge the three dataframes together.
    * save_df: Save the final dataframe.
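
The resample and merge steps above can be sketched on toy data. This is a simplified illustration under stated assumptions, not the notebook's code: the column name `ePower` and all values are made up, the gas readings are assumed to be hourly already (so the hourly averaging step is skipped), and the weather merge is left out.

```python
import pandas as pd

# Toy electricity readings every 5 seconds (hypothetical column and values).
smart_index = pd.date_range('2018-01-01 00:00:00', periods=6, freq='5s')
smart = pd.DataFrame(
    {'ePower': [100.0, 110.0, 120.0, 130.0, 140.0, 150.0]},
    index=smart_index,
)

# Toy gas meter readings, one per hour (hypothetical values).
gas_index = pd.to_datetime(['2018-01-01 00:00:00', '2018-01-01 01:00:00'])
gas = pd.DataFrame({'gasMeter': [10.0, 10.36]}, index=gas_index)

# Electricity: average the readings that fall within each 10-second bin.
smart_resampled = smart.resample('10s').mean()

# Gas: upsample to 10 s and interpolate linearly in time between readings.
gas_resampled = gas.resample('10s').interpolate(method='time')
# Change in the meter reading per 10-second step.
gas_resampled['gasPower'] = gas_resampled['gasMeter'].diff()

# Inner join on the datetime index keeps only timestamps present in both.
combined = pd.merge(smart_resampled, gas_resampled, left_index=True, right_index=True)
print(combined)
```

Because the merge is an inner join, the combined dataframe only covers the period in which both sources have data. Here that is the three 10-second bins of the electricity readings.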

## Warning
Running main() with 61 raw smartmeter data files has been estimated to take 45-60 minutes.

# Basic imports

In [1]:
import pandas as pd
import numpy as np
import glob
import time

# Defining functions

In [2]:
def clean_datetime(df):
    """
    Input should be a df with a column called 'datetime'.
    This function checks whether each value in the df.datetime column can be parsed to a pandas datetime object,
    by trying pd.to_datetime() on it.
    If parsing fails, that value is replaced with np.nan.
    Finally this function returns the df with the NaN rows dropped.
    A row is only dropped if its datetime column contains a NaN.
    """
    # Work on a copy so a slice of the original dataframe is not modified
    df = df.copy()
    for i in range(len(df)):
        # Positional access, so this works regardless of the index labels
        value = df['datetime'].iloc[i]
        try:
            pd.to_datetime(value)
        except ValueError:
            print('-----')
            print('ValueError at index = %s' % i)
            print(value)
            df['datetime'] = df['datetime'].replace(value, np.nan)
    df = df.dropna(subset=['datetime'])
    return df


def clean_prepare_smart_gas(file_path):
    """
    Input is a dwelling_id.csv file.
    Output are cleaned & prepared dataframes (smart, gas).
    Return: smart, gas
    """
    df = pd.read_csv(file_path, delimiter=';', header=0)
    smart = df.iloc[:,:7]
    gas = df.iloc[:,7:]
    
    smart = smart.rename(index=str,columns={"Timestamp":"datetime"})
    gas = gas.rename(index=str,columns={"gasTimestamp":"datetime"})

    smart = clean_datetime(smart)
    gas = clean_datetime(gas)
    
    smart['datetime'] = pd.to_datetime(smart['datetime'])
    gas['datetime'] = pd.to_datetime(gas['datetime'])

    
    smart = smart.set_index(['datetime'])
    gas = gas.set_index(['datetime'])

    return smart, gas


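def resample_smart_gas(smart, gas):
    """
    Resamples the (smart, gas) dfs to a shared 10s datetime index.
    smart is averaged per 10s; gas is averaged per hour and then
    upsampled to 10s with time-based interpolation.
    Also calculates gasPower as the change in gasMeter per 10s step.
    Returns (smart_resampled, gas_resampled)
    """
    smart_resampled = smart.resample('10s').mean()
    
    gas_resampled = gas.resample('H').mean()
    # TODO: replace 0s in gasMeter with NaNs before interpolating (not implemented yet)
    gas_resampled = gas_resampled.resample('10s').interpolate(method='time')
    gas_resampled['gasPower'] = gas_resampled['gasMeter'].diff()
    
    return smart_resampled, gas_resampled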


def merge_smart_gas_weather(smart_resampled, gas_resampled, weather):
    """
    Merges the three dataframes on their datetime index and returns one df.
    pd.merge defaults to an inner join, so only timestamps present in
    all three dataframes are kept.
    """
    df = pd.merge(smart_resampled, gas_resampled, left_index=True, right_index=True)
    df = pd.merge(df, weather, left_index=True, right_index=True)
    
    return df


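def save_df(df, dwelling_id):
    """
    Saves the df as a tab-separated file named dwelling_id.csv.
    """
    # Named out_dir to avoid shadowing the built-in dir()
    out_dir = '//datc//opschaler//combined_dfs_gas_smart_weather//'
    df.to_csv(out_dir + dwelling_id + '.csv', sep='\t', index=True)
    print('Saved %s' % dwelling_id)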

    
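def main():
    """
    Processes every file in the global file_paths list and merges it with the global weather df.
    """
    for i, file_path in enumerate(file_paths):
        t1 = time.time()
        # The dwelling_id is the 11 characters in front of the '.csv' extension
        dwelling_id = file_path[-15:-4]
        print('Started iteration %s, processing dwelling_id: %s' % (i,dwelling_id))
        
        smart, gas = clean_prepare_smart_gas(file_path)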
        smart_resampled, gas_resampled = resample_smart_gas(smart, gas)
        df = merge_smart_gas_weather(smart_resampled, gas_resampled, weather)
        save_df(df, dwelling_id)
        t2 = time.time()
        print('Finished iteration %s in %.1f [s], Finished processing dwelling_id: %s, ' % (i,(t2-t1), dwelling_id))
        print('-----')
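
The row-by-row check in clean_datetime is one likely contributor to the long run time. A vectorized alternative is sketched below. It is not the notebook's method: the name `clean_datetime_vectorized` and the toy data are assumptions, and it relies on `pd.to_datetime(..., errors='coerce')`, which turns unparseable values into NaT instead of raising.

```python
import pandas as pd


def clean_datetime_vectorized(df):
    """Hypothetical alternative: drop rows whose 'datetime' value cannot be parsed."""
    parsed = pd.to_datetime(df['datetime'], errors='coerce')
    # Report values that were present but could not be parsed.
    unparseable = parsed.isna() & df['datetime'].notna()
    for position in unparseable.to_numpy().nonzero()[0]:
        print('Unparseable datetime at position %s: %r'
              % (position, df['datetime'].iloc[position]))
    # Keep only the rows with a valid datetime; the column itself is left unparsed.
    return df.loc[parsed.notna()]


# Toy data with one corrupt row and one missing value.
example = pd.DataFrame({
    'datetime': ['2018-01-01 00:00:00', '<br />', '2018-01-01 00:00:10', None],
    'value': [1, 2, 3, 4],
})
cleaned = clean_datetime_vectorized(example)
print(cleaned)
```

One caveat: recent pandas versions infer a single datetime format from the first values, so rows written in a different but valid format may also be coerced to NaT. Whether that matters depends on how uniform the raw timestamps are.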

# Importing weather plus getting the smartmeter file paths

In [3]:
weather = pd.read_csv('//datc//opschaler//weather_data//weather.csv', delimiter='\t', comment='#', parse_dates=['datetime'])
weather = weather.set_index(['datetime'])

In [4]:
path = '/datc/opschaler/smartmeter_data'
file_paths = glob.glob(path + "/*.csv")
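
main() derives the dwelling_id by slicing a fixed number of characters off the end of each path, which only works while every ID is exactly 11 characters long. A length-independent variant is sketched below. The helper name `dwelling_id_from_path` is an assumption and not part of the notebook.

```python
import os


def dwelling_id_from_path(file_path):
    """Return the file name without its directory and extension."""
    return os.path.splitext(os.path.basename(file_path))[0]


example_path = '/datc/opschaler/smartmeter_data/P01S02W6050.csv'
print(dwelling_id_from_path(example_path))  # prints P01S02W6050
```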

# Run main()
Warning!  
Running main() on 2 raw smartmeter files takes about 1.5 minutes.  
At that rate, processing 61 raw smartmeter files takes roughly 45-60 minutes.
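
The lower end of that estimate follows from simple scaling, assuming the run time grows linearly with the number of files. The variable names below are illustrative only.

```python
# Measured: 2 files took about 1.5 minutes.
minutes_for_sample = 1.5
files_in_sample = 2
total_files = 61

minutes_per_file = minutes_for_sample / files_in_sample
estimated_minutes = minutes_per_file * total_files
print('Estimated run time: %.2f minutes' % estimated_minutes)  # 45.75 minutes
```

Larger files take longer than the average, which is why the estimate is given as a range rather than a single number.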

In [5]:
print('Detected %s raw smartmeter files.' % len(file_paths))

main()

print('-----FINISHED-----')

Detected 23 raw smartmeter files.
Started iteration 0, processing dwelling_id: P01S02W6050
Saved P01S02W6050
Finished iteration 0 in 31.5 [s], Finished processing dwelling_id: P01S02W6050, 
-----
Started iteration 1, processing dwelling_id: P02S01W0998
Saved P02S01W0998
Finished iteration 1 in 18.6 [s], Finished processing dwelling_id: P02S01W0998, 
-----
Started iteration 2, processing dwelling_id: P01S02W7251
-----
ValueError at index = 170616
<br />
-----
ValueError at index = 170617
<b>Fatal error</b>:  Allowed memory size of 134217728 bytes exhausted (tried to allocate 32 bytes) in <b>D:\wamp\www\opschaler\downloaddata.php</b> on line <b>20</b><br />
Saved P01S02W7251
Finished iteration 2 in 32.7 [s], Finished processing dwelling_id: P01S02W7251, 
-----
Started iteration 3, processing dwelling_id: P01S01W8239
-----
ValueError at index = 180725
<br />
-----
ValueError at index = 180726
<b>Fatal error</b>:  Allowed memory size of 134217728 bytes exhausted (tried to allocate 32 bytes

KeyboardInterrupt: 