# Companion data preprocessing

This notebook is a recipe for taking HDF inputs and transforming them into a format suitable for the ML module. Use it as a lab before writing dedicated functions for that transformation.

The following code styles pandas DataFrames for better viewing.

In [1]:
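from IPython.display import HTML  # public import path for the display classes

# Concatenate the two stylesheets and inject them into the notebook.
css = open('style-table.css').read() + open('style-notebook.css').read()
HTML('<style>{}</style>'.format(css))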

## Playground

#### The imports

In [2]:
import h5py
import pandas as pd
import geopandas as gpd
import pathlib2 as pl

#### Pandas options

In [3]:
pd.set_option('display.max_colwidth', 100)
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 6)
pd.set_option('display.width', 1000)

#### The input and output paths

In [4]:
# INPUT_PATH = pl.Path("../hdf_data/")
INPUT_PATH = pl.Path("/Volumes/CompanionEx/Data/hdf/")
OUTPUT_PATH = pl.Path("../dfs_data/")

INPUT_PATH = INPUT_PATH.absolute()
OUTPUT_PATH = OUTPUT_PATH.absolute()

#### Selecting files

In [8]:
files = INPUT_PATH.glob('*.hdf')
filepath = next(files)

In [9]:
print(filepath)

/Users/eltdassen/Programming/python/companion-predictor/nb/../hdf_data/TS_2016-01-12-06_2016-01-12-13.hdf
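
The printed path still contains `..` because `absolute()` only prepends the current working directory and does not normalise the result. A minimal sketch of the difference, using the standard-library `pathlib` (of which `pathlib2` is a backport) and the relative path from the commented-out line above:

```python
from pathlib import Path

relative = Path("../hdf_data")

# absolute() only prepends the current working directory; ".." is kept.
absolute = relative.absolute()

# resolve() also collapses ".." segments and follows symlinks.
resolved = relative.resolve()

print(absolute)
print(resolved)
```

Either form works for globbing; `resolve()` simply gives a cleaner path to print or log.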


#### Inspecting a file

In [8]:
f = h5py.File(str(filepath), "r")

In [9]:
f.attrs.items()

[(u'generation_datetime', '2016-05-19-16'),
 (u'start_datetime', '2016-01-12-06'),
 (u'end_datetime', '2016-01-12-13')]
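
The attribute values look like `year-month-day-hour` strings. Assuming that format holds for every file (it is inferred from this one sample only), they can be parsed into `datetime` objects roughly like this:

```python
from datetime import datetime

# Attribute values copied from the output above.
attrs = {
    "generation_datetime": "2016-05-19-16",
    "start_datetime": "2016-01-12-06",
    "end_datetime": "2016-01-12-13",
}

# Assumption: year-month-day-hour, inferred from the sample values.
ATTR_FORMAT = "%Y-%m-%d-%H"

parsed = {key: datetime.strptime(value, ATTR_FORMAT) for key, value in attrs.items()}
duration = parsed["end_datetime"] - parsed["start_datetime"]
print(duration)
```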

In [10]:
measurement_sites = f.iteritems()

In [11]:
_, site_group = next(measurement_sites)
site_group

<HDF5 group "/RWS01_MONIBAS_0131hrl0035ra" (5 members)>

In [12]:
site_measurements = site_group.iteritems()

In [13]:
_, measurement = next(site_measurements)
measurement

<HDF5 dataset "precipitation": shape (8, 3), type "<f8">

In [14]:
measurement[:,:]

array([[  1.45257480e+09,   1.45257840e+09,   5.00000000e-01],
       [  1.45257840e+09,   1.45258200e+09,   0.00000000e+00],
       [  1.45258200e+09,   1.45258560e+09,   2.00000000e-01],
       [  1.45258560e+09,   1.45258920e+09,   0.00000000e+00],
       [  1.45258920e+09,   1.45259280e+09,   0.00000000e+00],
       [  1.45259280e+09,   1.45259640e+09,   2.00000000e-01],
       [  1.45259640e+09,   1.45260000e+09,   0.00000000e+00],
       [  1.45260000e+09,   1.45260360e+09,   0.00000000e+00]])

In [15]:
measurement.attrs.keys()

[u'units']

In [16]:
units = measurement.attrs['units'].split(", ")
units

['timestamp_start', 'timestamp_end', 'mm/h']

In [17]:
measurement.value

array([[  1.45257480e+09,   1.45257840e+09,   5.00000000e-01],
       [  1.45257840e+09,   1.45258200e+09,   0.00000000e+00],
       [  1.45258200e+09,   1.45258560e+09,   2.00000000e-01],
       [  1.45258560e+09,   1.45258920e+09,   0.00000000e+00],
       [  1.45258920e+09,   1.45259280e+09,   0.00000000e+00],
       [  1.45259280e+09,   1.45259640e+09,   2.00000000e-01],
       [  1.45259640e+09,   1.45260000e+09,   0.00000000e+00],
       [  1.45260000e+09,   1.45260360e+09,   0.00000000e+00]])
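
The cells above step through one site and one dataset by hand, using Python 2 idioms. On Python 3, `iteritems()` becomes `items()`, and since `Dataset.value` was removed in h5py 3.0, `dataset[()]` is the way to read a whole dataset. The sketch below walks every site and dataset in a small in-memory file that mimics the layout seen here; the file name `demo.hdf` is made up, while the site name and the first two rows are copied from the output above.

```python
import h5py
import numpy as np

# In-memory HDF5 file (nothing is written to disk) with one site group
# holding one measurement dataset, mirroring the structure inspected above.
with h5py.File("demo.hdf", "w", driver="core", backing_store=False) as f:
    site = f.create_group("RWS01_MONIBAS_0131hrl0035ra")
    rain = site.create_dataset(
        "precipitation",
        data=np.array([[1452574800.0, 1452578400.0, 0.5],
                       [1452578400.0, 1452582000.0, 0.0]]),
    )
    rain.attrs["units"] = "timestamp_start, timestamp_end, mm/h"

    shapes = {}
    for site_name, site_group in f.items():
        for name, dataset in site_group.items():
            data = dataset[()]  # read the full dataset into a NumPy array
            units = dataset.attrs["units"].split(", ")
            shapes[(site_name, name)] = (data.shape, units)

print(shapes)
```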

#### Converting to a pandas DataFrame

In [18]:
df = pd.DataFrame(data=measurement.value, columns=units).drop(['timestamp_end'], axis=1)

In [19]:
df.head()

| | timestamp_start | mm/h |
|---|---|---|
| 0 | 1452575000.0 | 0.5 |
| 1 | 1452578000.0 | 0.0 |
| 2 | 1452582000.0 | 0.2 |
| 3 | 1452586000.0 | 0.0 |
| 4 | 1452589000.0 | 0.0 |


In [20]:
df['timestamp_start'] = df['timestamp_start'].astype('int64')
df.head()

| | timestamp_start | mm/h |
|---|---|---|
| 0 | 1452574800 | 0.5 |
| 1 | 1452578400 | 0.0 |
| 2 | 1452582000 | 0.2 |
| 3 | 1452585600 | 0.0 |
| 4 | 1452589200 | 0.0 |


#### Convert timestamps to datetime indexes

See http://stackoverflow.com/questions/12251483/idiomatic-way-to-parse-posix-timestamps-in-pandas for this hint.

In [21]:
df['datetime_start'] = df['timestamp_start'].astype('M8[s]')
df.set_index(['datetime_start'], inplace=True)
df.head()

| datetime_start | timestamp_start | mm/h |
|---|---|---|
| 2016-01-12 05:00:00 | 1452574800 | 0.5 |
| 2016-01-12 06:00:00 | 1452578400 | 0.0 |
| 2016-01-12 07:00:00 | 1452582000 | 0.2 |
| 2016-01-12 08:00:00 | 1452585600 | 0.0 |
| 2016-01-12 09:00:00 | 1452589200 | 0.0 |
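
An alternative to the `astype('M8[s]')` cast is `pd.to_datetime` with `unit="s"`, which states explicitly that the integers are seconds since the Unix epoch. The sketch below rebuilds the same frame from the first three rows of the precipitation dataset shown earlier; it is one possible way to write the conversion, not a change to the cells above.

```python
import numpy as np
import pandas as pd

# First three rows of the precipitation dataset shown earlier.
data = np.array([[1452574800.0, 1452578400.0, 0.5],
                 [1452578400.0, 1452582000.0, 0.0],
                 [1452582000.0, 1452585600.0, 0.2]])
units = ["timestamp_start", "timestamp_end", "mm/h"]

df = pd.DataFrame(data, columns=units).drop(columns=["timestamp_end"])
df["timestamp_start"] = df["timestamp_start"].astype("int64")

# unit="s": interpret the integers as seconds since the Unix epoch (UTC).
datetime_start = pd.to_datetime(df["timestamp_start"], unit="s").rename("datetime_start")
df = df.set_index(datetime_start)
print(df)
```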


#### Add measurement site as categorical

In [22]:
df['site'] = site_group.name[1:]  # strip the leading "/" from the group name
df['site'] = df['site'].astype('category')
df.head()

| datetime_start | timestamp_start | mm/h | site |
|---|---|---|---|
| 2016-01-12 05:00:00 | 1452574800 | 0.5 | RWS01_MONIBAS_0131hrl0035ra |
| 2016-01-12 06:00:00 | 1452578400 | 0.0 | RWS01_MONIBAS_0131hrl0035ra |
| 2016-01-12 07:00:00 | 1452582000 | 0.2 | RWS01_MONIBAS_0131hrl0035ra |
| 2016-01-12 08:00:00 | 1452585600 | 0.0 | RWS01_MONIBAS_0131hrl0035ra |
| 2016-01-12 09:00:00 | 1452589200 | 0.0 | RWS01_MONIBAS_0131hrl0035ra |


Look at the size (the `memory usage` line at the bottom of the output)!

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 8 entries, 2016-01-12 05:00:00 to 2016-01-12 12:00:00
Data columns (total 3 columns):
timestamp_start    8 non-null int64
mm/h               8 non-null float64
site               8 non-null category
dtypes: category(1), float64(1), int64(1)
memory usage: 208.0 bytes
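
With only 8 rows the saving from the categorical column is small, but it grows with the number of rows because each repeated site name is stored once and referenced by a small integer code. A rough illustration, with an arbitrary row count of 10000:

```python
import pandas as pd

site = "RWS01_MONIBAS_0131hrl0035ra"

# The same site name repeated many times, first as plain text...
as_text = pd.Series([site] * 10000)
# ...then as a categorical: one stored string plus one small code per row.
as_category = as_text.astype("category")

text_bytes = as_text.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(text_bytes, category_bytes)
```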


## Testing out a functional implementation

#### Import the implementation

First we change to the directory with the packages.

In [5]:
%cd '../src/'

/Users/eltdassen/Programming/python/companion-predictor/src


Now import `preprocessing_generator` from the preprocessor module `predictor.pp`.

In [6]:
from predictor.pp import preprocessing_generator

The main advantage of using the preprocessor as a Python module is that the `preprocessing_generator` function is also available. Check it out, but note that processing a `DataFrame` can take quite some time.

In [12]:
preprocessing_generator?

This returns a *generator* object, which we can iterate over lazily instead of building every `DataFrame` up front.
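
To illustrate the generator pattern in general terms, here is a minimal sketch. The function `frames_from_sources` and its `(name, rows)` input are hypothetical stand-ins for the HDF files; this is not the implementation or the signature of `preprocessing_generator`.

```python
import pandas as pd


def frames_from_sources(sources):
    """Yield one DataFrame per source, building each only when requested.

    `sources` is a hypothetical iterable of (name, rows) pairs standing in
    for the HDF files.
    """
    for name, rows in sources:
        frame = pd.DataFrame(rows, columns=["timestamp_start", "mm/h"])
        frame["site"] = name
        yield frame


sources = [
    ("site_a", [(1452574800, 0.5), (1452578400, 0.0)]),
    ("site_b", [(1452574800, 0.2)]),
]

frames = frames_from_sources(sources)
first = next(frames)  # only the first frame has been built at this point
print(first)
```

Because each frame is produced on demand, memory use stays bounded by the largest single frame rather than by the whole data set.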

In [7]:
INPUT_PATH

PosixPath('/Users/eltdassen/Programming/python/companion-predictor/nb/../hdf_data')

In [8]:
dfs_gen = preprocessing_generator(input=INPUT_PATH)

In [9]:
df = next(dfs_gen)
df.head()

| site | datetime_start | precipitation mm/h | temperature C | timestamp_start | trafficflow counts/h | trafficspeed km/h | windspeed m/s |
|---|---|---|---|---|---|---|---|
| RWS01_MONIBAS_0131hrl0035ra | 2016-01-12 04:58:00 | 0.5 | 6.8 | 1452574680 | 600.0 | 100.0 | 8.0 |
| RWS01_MONIBAS_0131hrl0035ra | 2016-01-12 04:59:00 | 0.5 | 6.8 | 1452574740 | 540.0 | 101.0 | 8.0 |
| RWS01_MONIBAS_0131hrl0035ra | 2016-01-12 05:00:00 | 0.5 | 6.8 | 1452574800 | 720.0 | 103.333333 | 8.0 |
| RWS01_MONIBAS_0131hrl0035ra | 2016-01-12 05:01:00 | 0.5 | 6.8 | 1452574860 | 480.0 | 98.666667 | 8.0 |
| RWS01_MONIBAS_0131hrl0035ra | 2016-01-12 05:02:00 | 0.5 | 6.8 | 1452574920 | 420.0 | 108.0 | 8.0 |
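
In this output the hourly precipitation value is repeated on every one-minute row, including the rows just before 05:00, which suggests that the coarser series are filled onto the finer time grid before the columns are joined. How `predictor.pp` does this is not shown here; the sketch below is one way to get the same shape with plain pandas, using a forward fill followed by a backward fill. The three traffic values and the first precipitation value are copied from the table above; the second hourly value comes from the precipitation dataset inspected earlier.

```python
import pandas as pd

# Hourly precipitation, indexed by the start of each interval.
hourly = pd.Series(
    [0.5, 0.0],
    index=pd.to_datetime(["2016-01-12 05:00", "2016-01-12 06:00"]),
    name="precipitation mm/h",
)

# Per-minute traffic flow on a finer time grid.
minutely = pd.Series(
    [600.0, 540.0, 720.0],
    index=pd.date_range("2016-01-12 04:58", periods=3, freq="min"),
    name="trafficflow counts/h",
)

# Put the hourly values on the minute grid: carry the last known value
# forward, then back-fill the rows that precede the first hourly timestamp.
aligned = hourly.reindex(minutely.index, method="ffill").bfill()

combined = pd.concat([aligned, minutely], axis=1)
combined.index.name = "datetime_start"
print(combined)
```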
