# Companion data preprocessing

This notebook is a recipe for taking HDF inputs and transforming them into a format suitable for the ML module. Use it as a lab before writing dedicated functions for the same steps.

The following code styles pandas DataFrames for better viewing.

In [None]:
from IPython.display import HTML
css = ''
for name in ('style-table.css', 'style-notebook.css'):
    with open(name) as fh:  # close each stylesheet after reading it
        css += fh.read()
HTML('<style>{}</style>'.format(css))

## Playground

#### The imports

In [None]:
import h5py
import pandas as pd
import geopandas as gpd
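try:
    import pathlib as pl  # standard library on Python 3
except ImportError:
    import pathlib2 as pl  # backport for Python 2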

#### Pandas options

In [None]:
pd.set_option('display.max_colwidth', 100)
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 6)
pd.set_option('display.width', 1000)

#### The input and output paths

In [None]:
# INPUT_PATH = pl.Path("../hdf_data/")
INPUT_PATH = pl.Path("/Volumes/CompanionEx/Data/hdf/")
OUTPUT_PATH = pl.Path("/Volumes/CompanionEx/Data/dfs/")
# OUTPUT_PATH = pl.Path("../dfs_data/")

INPUT_PATH = INPUT_PATH.absolute()
OUTPUT_PATH = OUTPUT_PATH.absolute()

#### Selecting files

In [None]:
files = INPUT_PATH.glob('*.hdf')
filepath = next(files)

In [None]:
print(filepath)
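
`glob` yields paths lazily and in no guaranteed order, so `next(files)` may pick a different file from one machine to the next. A minimal sketch of a deterministic selection follows, assuming Python 3's standard `pathlib`; the temporary directory and the file names are made up for illustration.

```python
import pathlib
import tempfile

# Build a throwaway directory with two HDF files and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    for name in ('b.hdf', 'a.hdf', 'notes.txt'):
        (root / name).touch()

    # sorted() consumes the glob generator and fixes the order by path.
    files = sorted(root.glob('*.hdf'))
    names = [p.name for p in files]

print(names)
```

With a sorted list, `files[0]` always refers to the same file, which makes the rest of the notebook easier to reproduce.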

#### Inspecting a file

In [None]:
f = h5py.File(str(filepath), "r")

In [None]:
f.attrs.items()

In [None]:
measurement_sites = iter(f.items())  # iteritems() only exists on Python 2

In [None]:
_, site_group = next(measurement_sites)
site_group

In [None]:
site_measurements = iter(site_group.items())  # iteritems() only exists on Python 2

In [None]:
_, measurement = next(site_measurements)
measurement

In [None]:
measurement[:,:]

In [None]:
measurement.attrs.keys()

In [None]:
units = measurement.attrs['units'].split(", ")
units

In [None]:
measurement[()]  # read the whole dataset; .value was removed in h5py 3.0
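
The cells above step into a single site and a single measurement with `next`. As a rough sketch, the same traversal can be written as a loop over every site and measurement. It assumes the two-level layout seen above (site groups that contain datasets). The helper name `walk_measurements`, the in-memory demo file, and the `value` unit are made up for illustration.

```python
import h5py
import numpy as np


def walk_measurements(h5file):
    """Yield (site_name, measurement_name, dataset) for each dataset in each site group."""
    for site_name, site_group in h5file.items():
        for measurement_name, dataset in site_group.items():
            yield site_name, measurement_name, dataset


# In-memory demo file (never written to disk) that mimics the assumed layout.
with h5py.File('demo.hdf', 'w', driver='core', backing_store=False) as f:
    group = f.create_group('site_a')
    dataset = group.create_dataset('m1', data=np.zeros((2, 3)))
    dataset.attrs['units'] = 'timestamp_start, timestamp_end, value'

    found = [
        (site, name, ds.shape, ds.attrs['units'].split(', '))
        for site, name, ds in walk_measurements(f)
    ]

print(found)
```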

#### Converting to a pandas DataFrame

In [None]:
df = pd.DataFrame(data=measurement[()], columns=units).drop(['timestamp_end'], axis=1)

In [None]:
df.head()

In [None]:
df['timestamp_start'] = df['timestamp_start'].astype('int64')
df.head()
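
To try the conversion without an HDF file at hand, here is a small sketch on a synthetic array. It assumes the dataset is stored as one floating-point array, which would explain why the timestamps need a cast to integers. The `value` column and the numbers are invented.

```python
import numpy as np
import pandas as pd

# Column names as they would come from the 'units' attribute ('value' is made up).
units = ['timestamp_start', 'timestamp_end', 'value']

# Two synthetic rows; every column is float because the array has a single dtype.
raw = np.array([[0.0, 60.0, 1.5],
                [60.0, 120.0, 2.5]])

df = pd.DataFrame(data=raw, columns=units).drop(columns=['timestamp_end'])
df['timestamp_start'] = df['timestamp_start'].astype('int64')

print(df.dtypes)
```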

#### Convert timestamps to datetime indexes

See http://stackoverflow.com/questions/12251483/idiomatic-way-to-parse-posix-timestamps-in-pandas for this hint.

In [None]:
df['datetime_start'] = df['timestamp_start'].astype('M8[s]')
df.set_index(['datetime_start'], inplace=True)
df.head()
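
An alternative to the `astype('M8[s]')` idiom is `pd.to_datetime` with `unit='s'`, which states the unit explicitly. The sketch below uses invented timestamps and an invented `value` column.

```python
import pandas as pd

# Synthetic POSIX timestamps in seconds (illustrative values only).
df = pd.DataFrame({'timestamp_start': [0, 60, 3600],
                   'value': [1.0, 2.0, 3.0]})

# Interpret the integers as seconds since the Unix epoch.
df['datetime_start'] = pd.to_datetime(df['timestamp_start'], unit='s')
df = df.set_index('datetime_start')

print(df.index)
```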

#### Add measurement site as categorical

In [None]:
df['site'] = site_group.name[1:]  # strip the leading "/" from the group name
df['site'] = df['site'].astype('category')
df.head()

Look at the size!

In [None]:
df.info()
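
To put a number on the saving from the categorical column, one option is to compare `memory_usage(deep=True)` before and after the conversion. This is a sketch on synthetic data; the site name and the row count are made up.

```python
import pandas as pd

n = 10000  # illustrative row count

# The same site name repeated on every row, first as plain strings.
plain = pd.Series(['site_a'] * n)
categorical = plain.astype('category')

# deep=True also counts the memory held by the string values themselves.
plain_bytes = plain.memory_usage(deep=True)
cat_bytes = categorical.memory_usage(deep=True)

print(plain_bytes, cat_bytes)
```

A categorical column stores each distinct label once plus a small integer code per row, so the gap grows with the number of rows.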

## Testing out a functional implementation

#### Import the implementation

First we change to the directory with the packages.

In [None]:
%cd '../src/'

Now import `preprocessing_generator` from the preprocessor module `pp`.

In [None]:
from predictor.pp import preprocessing_generator

The main advantage of using the preprocessor as a Python module is that the `preprocessing_generator` function is also available. Check it out, but note that processing a `DataFrame` can take quite some time.

In [None]:
preprocessing_generator?

This returns a *generator* object, which we can iterate over lazily instead of building every `DataFrame` up front.

In [None]:
INPUT_PATH

In [None]:
dfs_gen = preprocessing_generator(input=INPUT_PATH)

In [None]:
df = next(dfs_gen)
df.head()
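
For readers unfamiliar with the pattern, the sketch below shows how a generator can hand out one `DataFrame` at a time. It is not the implementation of `preprocessing_generator`. The function `frames_from_arrays`, its parameters, and the sample data are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd


def frames_from_arrays(named_arrays, units):
    """Hypothetical stand-in: lazily yield one DataFrame per (site, array) pair."""
    for site, raw in named_arrays:
        df = pd.DataFrame(raw, columns=units)
        df['site'] = site
        yield df  # execution pauses here until the caller asks for the next frame


units = ['timestamp_start', 'value']  # made-up column names
arrays = [('site_a', np.array([[0.0, 1.5]])),
          ('site_b', np.array([[60.0, 2.5]]))]

gen = frames_from_arrays(arrays, units)
first = next(gen)  # only the first frame has been built so far
rest = list(gen)   # consume the remaining frames

print(first)
```

Because each frame is built only when requested, at most one `DataFrame` has to be held in memory while looping over the generator.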