# Companion data preprocessing

This notebook is a recipe for taking HDF inputs and transforming them into a format suitable for the ML module. Use it as a lab before writing dedicated functions for that task.

The following code styles the pandas DataFrame output for better readability.

In [1]:
from IPython.display import HTML  # public import path for HTML
# Read both stylesheets and inject them into the notebook page.
css = open('style-table.css').read() + open('style-notebook.css').read()
HTML('<style>{}</style>'.format(css))

## Playground

#### The imports

In [2]:
import h5py
import pandas as pd
import geopandas as gpd
import pathlib2 as pl

#### Pandas options

In [3]:
pd.set_option('display.max_colwidth', 100)
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 6)
pd.set_option('display.width', 1000)

#### The input and output paths

In [4]:
# INPUT_PATH = pl.Path("../hdf_data/")
INPUT_PATH = pl.Path("/Volumes/CompanionEx/Data/hdf/")
OUTPUT_PATH = pl.Path("/Volumes/CompanionEx/Data/dfs/")
# OUTPUT_PATH = pl.Path("../dfs_data/")

INPUT_PATH = INPUT_PATH.absolute()
OUTPUT_PATH = OUTPUT_PATH.absolute()

#### Selecting files

In [5]:
files = INPUT_PATH.glob('*.hdf')
filepath = next(files)

In [6]:
print(filepath)

/Volumes/CompanionEx/Data/hdf/TS_2016-01-01-06_2016-01-01-13.hdf
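
Since `glob` returns a lazy iterator in arbitrary order, `next(files)` picks whichever file the file system lists first. A minimal sketch of a more reproducible selection, assuming Python 3's standard `pathlib` instead of `pathlib2`; the helper name `list_hdf_files` and the file names are hypothetical:

```python
import tempfile
from pathlib import Path


def list_hdf_files(input_path):
    """Return the *.hdf files in input_path as a sorted list of paths."""
    return sorted(Path(input_path).glob("*.hdf"))


# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("TS_b.hdf", "TS_a.hdf", "notes.txt"):
        (Path(tmp) / name).touch()
    found = [p.name for p in list_hdf_files(tmp)]

print(found)  # ['TS_a.hdf', 'TS_b.hdf']
```

Sorting makes the "first" file the same on every run, which helps when comparing outputs between sessions.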


#### Inspecting a file

In [7]:
f = h5py.File(str(filepath), "r")

In [9]:
list(f.attrs.items())

[('generation_datetime', b'2016-05-26-18'),
 ('start_datetime', b'2016-01-01-06'),
 ('end_datetime', b'2016-01-01-13')]

In [15]:
measurement_sites = iter(f.items())

In [17]:
_, site_group = next(measurement_sites)
site_group

<HDF5 group "/rws01_monibas_0010vwa0056ra" (5 members)>

In [20]:
site_measurements = iter(site_group.items())

In [21]:
_, measurement = next(site_measurements)
measurement

<HDF5 dataset "precipitation": shape (8, 3), type "<f8">

In [22]:
measurement[:,:]

array([[  1.45162440e+09,   1.45162800e+09,   0.00000000e+00],
       [  1.45162800e+09,   1.45163160e+09,   0.00000000e+00],
       [  1.45163160e+09,   1.45163520e+09,   0.00000000e+00],
       [  1.45163520e+09,   1.45163880e+09,   0.00000000e+00],
       [  1.45163880e+09,   1.45164240e+09,   0.00000000e+00],
       [  1.45164240e+09,   1.45164600e+09,   0.00000000e+00],
       [  1.45164600e+09,   1.45164960e+09,   0.00000000e+00],
       [  1.45164960e+09,   1.45165320e+09,   0.00000000e+00]])
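
The cells above step through one site and one measurement by hand. As a rough sketch, the same traversal can be written as two nested loops. The example builds a tiny in-memory file so that it runs without the real data; the site name, dataset values, and file name are made up:

```python
import h5py
import numpy as np

# Build a tiny in-memory file; nothing is written to disk.
with h5py.File("sketch.hdf", "w", driver="core", backing_store=False) as f:
    f.attrs["start_datetime"] = "2016-01-01-06"
    site = f.create_group("site_a")
    data = np.array([[0.0, 3600.0, 0.5], [3600.0, 7200.0, 1.5]])
    ds = site.create_dataset("precipitation", data=data)
    ds.attrs["units"] = "timestamp_start, timestamp_end, mm/h"

    # Visit every dataset of every site group and record its shape.
    shapes = {}
    for site_name, site_group in f.items():
        for measurement_name, measurement in site_group.items():
            # measurement[()] reads the whole dataset into a NumPy array.
            shapes[(site_name, measurement_name)] = measurement[()].shape

print(shapes)  # {('site_a', 'precipitation'): (2, 3)}
```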

In [24]:
list(measurement.attrs.keys())

['units']

In [28]:
units = measurement.attrs['units'].decode().split(", ")
units

['timestamp_start', 'timestamp_end', 'mm/h']
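
Depending on the h5py version and on how the attribute was stored, the `units` value may arrive as `bytes` (as in the output above) or as `str`, in which case `.decode()` fails. A small sketch of a parser that tolerates both; `parse_units` is a hypothetical helper, not part of the project:

```python
def parse_units(raw):
    """Split a comma-separated units attribute into column names.

    Accepts bytes or str and strips whitespace around each name.
    """
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    return [part.strip() for part in raw.split(",")]


columns = parse_units(b"timestamp_start, timestamp_end, mm/h")
print(columns)  # ['timestamp_start', 'timestamp_end', 'mm/h']
```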

In [29]:
measurement[()]  # Dataset.value was removed in h5py 3.0; [()] reads the whole dataset

array([[  1.45162440e+09,   1.45162800e+09,   0.00000000e+00],
       [  1.45162800e+09,   1.45163160e+09,   0.00000000e+00],
       [  1.45163160e+09,   1.45163520e+09,   0.00000000e+00],
       [  1.45163520e+09,   1.45163880e+09,   0.00000000e+00],
       [  1.45163880e+09,   1.45164240e+09,   0.00000000e+00],
       [  1.45164240e+09,   1.45164600e+09,   0.00000000e+00],
       [  1.45164600e+09,   1.45164960e+09,   0.00000000e+00],
       [  1.45164960e+09,   1.45165320e+09,   0.00000000e+00]])

#### Converting to a pandas DataFrame

In [30]:
# measurement[()] replaces the .value attribute, which was removed in h5py 3.0
df = pd.DataFrame(data=measurement[()], columns=units).drop(['timestamp_end'], axis=1)

In [31]:
df.head()

| | timestamp_start | mm/h |
|---|---|---|
| 0 | 1451624000.0 | 0.0 |
| 1 | 1451628000.0 | 0.0 |
| 2 | 1451632000.0 | 0.0 |
| 3 | 1451635000.0 | 0.0 |
| 4 | 1451639000.0 | 0.0 |


In [32]:
df['timestamp_start'] = df['timestamp_start'].astype('int64')
df.head()

| | timestamp_start | mm/h |
|---|---|---|
| 0 | 1451624400 | 0.0 |
| 1 | 1451628000 | 0.0 |
| 2 | 1451631600 | 0.0 |
| 3 | 1451635200 | 0.0 |
| 4 | 1451638800 | 0.0 |
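
The two cells above (build the frame, then cast the timestamps) can be combined into one chained expression. This is a sketch on a synthetic array standing in for the dataset; the second precipitation value is made up:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one dataset: start, end, value per row.
raw = np.array([
    [1451624400.0, 1451628000.0, 0.0],
    [1451628000.0, 1451631600.0, 0.2],
])
columns = ["timestamp_start", "timestamp_end", "mm/h"]

df = (
    pd.DataFrame(raw, columns=columns)
    .drop(columns=["timestamp_end"])          # the interval end is not needed
    .astype({"timestamp_start": "int64"})     # exact whole seconds, no float display rounding
)
print(df.dtypes)
```

Casting to `int64` is safe here because the timestamps are whole seconds well within the range that `float64` represents exactly.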


#### Convert timestamps to datetime indexes

The conversion below follows the approach described at http://stackoverflow.com/questions/12251483/idiomatic-way-to-parse-posix-timestamps-in-pandas.

In [33]:
df['datetime_start'] = df['timestamp_start'].astype('M8[s]')
df.set_index(['datetime_start'], inplace=True)
df.head()

| datetime_start | timestamp_start | mm/h |
|---|---|---|
| 2016-01-01 05:00:00 | 1451624400 | 0.0 |
| 2016-01-01 06:00:00 | 1451628000 | 0.0 |
| 2016-01-01 07:00:00 | 1451631600 | 0.0 |
| 2016-01-01 08:00:00 | 1451635200 | 0.0 |
| 2016-01-01 09:00:00 | 1451638800 | 0.0 |
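
An alternative to the `astype('M8[s]')` cast is `pd.to_datetime` with `unit='s'`, which states the epoch unit explicitly and can attach a time zone. The file name mentions hour 06 while the first index value is 05:00, which would be consistent with UTC timestamps and local-time file names; that is an inference, and the `Europe/Amsterdam` zone below is an assumption:

```python
import pandas as pd

df = pd.DataFrame({"timestamp_start": [1451624400, 1451628000], "mm/h": [0.0, 0.2]})

# Interpret the integers as seconds since the Unix epoch, in UTC.
df["datetime_start"] = pd.to_datetime(df["timestamp_start"], unit="s", utc=True)
df = df.set_index("datetime_start")

# Optional: view the same instants in local time (the zone is an assumption).
local = df.tz_convert("Europe/Amsterdam")
print(df.index[0], local.index[0])
```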


#### Add measurement site as categorical

In [34]:
df['site'] = site_group.name[1:]  # strip the leading "/" from the group name
df['site'] = df['site'].astype('category')
df.head()

| datetime_start | timestamp_start | mm/h | site |
|---|---|---|---|
| 2016-01-01 05:00:00 | 1451624400 | 0.0 | rws01_monibas_0010vwa0056ra |
| 2016-01-01 06:00:00 | 1451628000 | 0.0 | rws01_monibas_0010vwa0056ra |
| 2016-01-01 07:00:00 | 1451631600 | 0.0 | rws01_monibas_0010vwa0056ra |
| 2016-01-01 08:00:00 | 1451635200 | 0.0 | rws01_monibas_0010vwa0056ra |
| 2016-01-01 09:00:00 | 1451638800 | 0.0 | rws01_monibas_0010vwa0056ra |


Look at the memory usage reported by `info()`!

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 8 entries, 2016-01-01 05:00:00 to 2016-01-01 12:00:00
Data columns (total 3 columns):
timestamp_start    8 non-null int64
mm/h               8 non-null float64
site               8 non-null category
dtypes: category(1), float64(1), int64(1)
memory usage: 208.0 bytes
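
With only 8 rows the saving from the categorical column is hard to see. The sketch below compares the footprint of the same site label stored as plain Python strings and as a category over a larger, made-up row count:

```python
import pandas as pd

site = "rws01_monibas_0010vwa0056ra"
as_object = pd.Series([site] * 10_000, dtype="object")
as_category = as_object.astype("category")

# deep=True also counts the memory held by the string objects themselves.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)
```

A categorical column stores each distinct label once plus a small integer code per row, so the gap widens as the number of rows per site grows.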


## Testing out a functional implementation

#### Import the implementation

First we change to the directory with the packages.

In [1]:
%cd '../src/'

/Users/eltdassen/Programming/python/companion-predictor/src


Now import `preprocessing_generator` from the preprocessor module `pp`.

In [2]:
from predictor.pp import preprocessing_generator

The main advantage of using the preprocessor as a Python module is that the
`preprocessing_generator` function is also available. Check out its docstring below. Note that processing a single `DataFrame` may take quite some time.

In [4]:
print(preprocessing_generator.__doc__)


    Creates a generator that return each preprocessed file as a DataFrame one at a time.

    :param input: Input path where HDFs are found.
    :param files: Iterable object with list of file names to process in the given input path.
    :return: A dataset generator (each is a tuple (index, features, target_flow, target_speed)
    where all but the first is a numpy array and index is a pandas dataframe).
    


This returns a *generator* object, which we can iterate over so that files are processed lazily, one at a time.

In [39]:
INPUT_PATH

PosixPath('/Volumes/CompanionEx/Data/hdf')

In [40]:
dfs_gen = preprocessing_generator(input=INPUT_PATH)

In [None]:
index, features, target_flow, target_speed = next(dfs_gen)
index.head()
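
To illustrate why a generator helps here, the following is a minimal sketch of the lazy pattern. It is not the project's `preprocessing_generator`; `lazy_preprocess` and `fake_load` are hypothetical stand-ins:

```python
def lazy_preprocess(paths, load):
    """Yield load(path) for each path, one at a time."""
    for path in paths:
        yield load(path)


calls = []


def fake_load(path):
    # Stand-in for the expensive HDF-to-DataFrame step.
    calls.append(path)
    return path.upper()


gen = lazy_preprocess(["a.hdf", "b.hdf"], fake_load)
first = next(gen)  # only now is the first file processed
print(first, calls)  # A.HDF ['a.hdf']
```

Only the file requested by `next` is loaded, so at most one processed result needs to be held in memory at a time.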