# Purpose
Compare storage techniques for time series data.

## Techniques
### Serialized Pandas DataFrames
Data is stored in the database as binary data in the form of pickled pandas DataFrames.  This allows heterogeneous storage: timestamps, numeric data, and textual data are all kept in the same data structure.  A client that wants to run its own analysis can pull the data directly, but slicing the data frame server-side requires unpickling, slicing, and then repickling.  This technique is better suited to growth into Big Data techniques than the other techniques.  It still has a drawback: pickle is a Python-specific binary format, so it covers fewer situations than a general binary format such as HDF5, which supports similar data storage techniques (pandas can be serialized to HDF5 through native functions).
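The unpickle, slice, repickle cycle can be sketched as follows. This is a minimal illustration rather than the notebook's server-side code: the helper name `slice_pickled_frame` and the sample frame are hypothetical, and the sketch assumes the frame has a sorted `DatetimeIndex`.

```python
import pickle

import pandas as pd


def slice_pickled_frame(blob, start, end):
    """Unpickle a DataFrame, keep rows with start <= index <= end, repickle."""
    df = pickle.loads(blob)
    # Label-based slicing on a sorted DatetimeIndex includes both endpoints.
    sliced = df.loc[start:end]
    return pickle.dumps(sliced, protocol=pickle.HIGHEST_PROTOCOL)


# Hypothetical sample data: one reading every 10 seconds.
index = pd.date_range("2012-06-27 00:00:00", periods=6,
                      freq=pd.Timedelta(seconds=10))
frame = pd.DataFrame(
    {"speed": [0, 55, 85, 0, 22, 0], "heading": [0, 180, 315, 0, 135, 270]},
    index=index,
)
blob = pickle.dumps(frame, protocol=pickle.HIGHEST_PROTOCOL)

sliced_blob = slice_pickled_frame(
    blob,
    pd.Timestamp("2012-06-27 00:00:10"),
    pd.Timestamp("2012-06-27 00:00:30"),
)
result = pickle.loads(sliced_blob)
```

Every slice pays for a full deserialization of the stored frame, which is the cost the paragraph above describes.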

### Arrays of records
Data is stored using the Postgres array type as an array of records.  Since data types can't be mixed within an array, timestamps must be stored separately from the values, and different tables are required for floating point, integer, and string data.  Because the data is stored in a native Postgres format, server-side querying and analysis should be easier.  Postgres functions can be written to create DataFrames to be sent to a client for client-side analysis.  This has a potential benefit when multiple values are used together: all values sampled at the same time are stored as one sub-array, which reduces the number of passes required through the data and potentially makes database queries faster.

### Arrays of series
Each parameter is stored as its own array (a series).  This structure better matches the inputs used to create data frames than the "Arrays of records" approach discussed above, which should make creating pandas data frames more efficient.  However, analyzing multiple values together requires an additional step to correlate the series.
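The difference between the two array layouts can be illustrated in plain Python. This is only a stand-in for the Postgres arrays: the sample values, the helper names, and the filter threshold are all hypothetical.

```python
from datetime import datetime, timedelta

# Timestamps are kept separately from the values in both layouts.
start = datetime(2012, 6, 27, 0, 1, 39)
timestamps = [start + timedelta(seconds=10 * i) for i in range(4)]

# "Arrays of records": one sub-array per timestamp, [speed, heading].
records = [[55, 180], [85, 315], [0, 270], [22, 135]]

# "Arrays of series": one array per parameter.
series = {"speed": [55, 85, 0, 22], "heading": [180, 315, 270, 135]}


def records_to_series(records, names):
    """Transpose per-timestamp records into one list per parameter."""
    return {name: list(column) for name, column in zip(names, zip(*records))}


def series_to_records(series, names):
    """Correlate per-parameter lists back into per-timestamp records."""
    return [list(row) for row in zip(*(series[name] for name in names))]


# With records, a single pass sees both values for each timestamp.
fast_samples = [
    t for t, (speed, heading) in zip(timestamps, records)
    if speed > 50 and heading >= 180
]
```

The series layout is already shaped like a mapping of column name to values, while the records layout needs a transpose first. The same filter on the series layout has to line the two arrays up (the extra correlation step) before it can test both values.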

## Tasks
* Task 1.1. Query all values of one parameter with time.
* Task 1.2. Query a slice of all values of one parameter by time.
* Task 2.1. Query multiple values correlated with time.
* Task 2.2. Query a slice of multiple values by time.
* Task 3.1. Combine multiple values at offset timestamps using interpolation techniques.
* Task X. Perform roadmatching with latitude and longitude.
* Task X. Store non-timestamped data.
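Task 3.1 can be sketched with simple linear interpolation. The example below is one possible approach under stated assumptions, not the method this notebook settles on: timestamps are plain numbers (for example, epoch seconds), each series is sorted by time, and the names `interpolate_at`, `speed_times`, and `latitude_times` are hypothetical.

```python
from bisect import bisect_left


def interpolate_at(times, values, t):
    """Linearly interpolate `values` at time `t`; None outside the sampled range."""
    if not times or t < times[0] or t > times[-1]:
        return None
    i = bisect_left(times, t)
    if times[i] == t:
        return values[i]  # exact sample, no interpolation needed
    t0, t1 = times[i - 1], times[i]
    v0, v1 = values[i - 1], values[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


# Two parameters sampled at offset timestamps (hypothetical values).
speed_times = [0, 10, 20]
speed = [0.0, 50.0, 30.0]
latitude_times = [5, 15]
latitude = [22.54, 22.56]

# Resample speed onto the latitude timestamps, then combine the two.
aligned_speed = [interpolate_at(speed_times, speed, t) for t in latitude_times]
combined = list(zip(latitude_times, aligned_speed, latitude))
```

Linear interpolation is not appropriate for every parameter. A circular quantity such as heading, or a categorical one such as passenger, would need a different rule.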

Imports to assist with reading data files and querying the database.

In [18]:
from taxidb import execute, format_results  # Functions reused among database query notebooks
from entity.loader.taxi import Shenzhen

Collect statistics on the input database.

In [5]:
q = """
SELECT
    COUNT(geometry) AS "num_trips",
    SUM(ST_NPoints(geometry)) AS "num_samples"
FROM entity_trip
"""
format_results(execute(q))

num_trips, num_samples = 17385125, 288284766

In [4]:
q = """
    SELECT
        sum((measures->'speed'->>'count')::int)
    FROM streetcube_streettaxicell
"""
format_results(execute(q))

sum = 2139681480

In [6]:
loader = Shenzhen()

In [7]:
loader.organization

<Organization: Shenzhen>

In [9]:
import os
shenzhen_data = '/home/dingbat/data/taxi/shenzhen/2012-Shenzhen'
df = loader.resource_to_dataframe(os.path.join(shenzhen_data, '2012-06-27.good.sample'))
df[:10]

| common_id | timestamp | passenger | speed | heading | latitude | longitude |
|---|---|---|---|---|---|---|
| B40P00 | 2012-06-27 00:01:46+08:00 | 1 | 0 | 0 | 22.541918 | 114.110046 |
| SCC661 | 2012-06-27 00:01:39+08:00 | 0 | 55 | 180 | 22.649248 | 113.824486 |
| SKS991 | 2012-06-27 00:01:39+08:00 | 0 | 85 | 315 | 23.087866 | 113.673447 |
| SBZ910 | 2012-06-27 00:01:39+08:00 | 0 | 0 | 0 | 22.858015 | 113.843796 |
| SBS623 | 2012-06-27 00:01:39+08:00 | 0 | 0 | 270 | 22.98815 | 113.701981 |
| SBR001 | 2012-06-27 00:01:39+08:00 | 0 | 0 | 270 | 23.034866 | 113.7612 |
| SLP610 | 2012-06-27 00:01:39+08:00 | 1 | 0 | 0 | 22.90605 | 114.062347 |
| SBZ205 | 2012-06-27 00:01:39+08:00 | 0 | 22 | 135 | 23.0182 | 114.092865 |
| SBG776 | 2012-06-27 00:01:39+08:00 | 0 | 0 | 0 | 22.982033 | 113.998901 |
| SKZ403 | 2012-06-27 00:01:39+08:00 | 0 | 0 | 0 | 23.040434 | 113.773163 |


In [22]:
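import uuid
import pandas as pd  # HDFStore requires PyTables (the `tables` package)

def to_h5(df):
    """Serialize a DataFrame to an in-memory HDF5 file image (bytes)."""
    # H5FD_CORE with no backing store keeps the file in memory only;
    # the uuid is just a unique name, nothing is written to disk.
    h5 = pd.HDFStore(
        uuid.uuid1().hex,
        mode='w',
        driver="H5FD_CORE",
        driver_core_backing_store=0
    )
    try:
        df.to_hdf(h5, key='df')
        return h5._handle.get_file_image()
    finally:
        h5.close()

def from_h5(image):
    """Rebuild a DataFrame from a file image produced by to_h5."""
    h5 = pd.HDFStore(
        uuid.uuid1().hex,
        mode='r',
        driver="H5FD_CORE",
        driver_core_image=image,
        driver_core_backing_store=0
    )
    try:
        return h5['df']
    finally:
        h5.close()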

In [23]:
%timeit to_h5(df)

10 loops, best of 3: 123 ms per loop


In [27]:
import pickle

def to_pickle(df):
    return pickle.dumps(df, protocol=pickle.HIGHEST_PROTOCOL)

In [28]:
%timeit to_pickle(df)

10 loops, best of 3: 26.9 ms per loop
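Outside a notebook, a comparable "best of N" measurement can be approximated with the standard `timeit` module. This is a rough sketch, not a replacement for the `%timeit` figures above: the helper `best_time_ms` is hypothetical, and the small list payload only stands in for the DataFrame.

```python
import pickle
import timeit


def best_time_ms(func, arg, number=10, repeat=3):
    """Return the best per-call time in milliseconds over `repeat` runs."""
    totals = timeit.repeat(lambda: func(arg), number=number, repeat=repeat)
    return min(totals) / number * 1000.0


def to_pickle(obj):
    return pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)


# Stand-in payload; in the notebook this would be the DataFrame `df`.
payload = [(i, float(i)) for i in range(1000)]
elapsed_ms = best_time_ms(to_pickle, payload)
print(f"best of 3: {elapsed_ms:.3f} ms per call")
```

Passing `to_h5` and the DataFrame instead would time the HDF5 path with the same harness, which keeps the two measurements comparable.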
