In [1]:
%matplotlib inline
%pylab inline

import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')

figsize(15, 5)

data_dir = '/home/dingbat/data/aviation/nasa/corrected'

Populating the interactive namespace from numpy and matplotlib


## Read in the NASA MATLAB File

The scipy package can load data files from a variety of sources.  Newer MATLAB files (the v7.3 MAT-file format) are HDF5-based, since HDF5 serves well for general series data.  However, the NASA data files use the earlier MATLAB 5.0 format, which scipy reads without any trouble.  The following cells show how to load the data and what you get afterward.
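
As a rough illustration of what `loadmat` returns for a struct, the sketch below writes a tiny MAT-file to an in-memory buffer and reads it back with the same options used in this notebook. The `fake_param` contents are made up for the example; only the field names echo the ones that appear in the NASA files. Note that `scipy.io.loadmat` does not read the HDF5-based v7.3 format, so those files would need an HDF5 library instead.

```python
import io

import numpy as np
import scipy.io as sio

# Stand-in for one NASA parameter: samples plus metadata (values invented).
fake_param = {
    'data': np.array([2003, 2003, 2003]),
    'Rate': 0.25,
    'Units': 'Year',
}

# Write a version 5 MAT-file to an in-memory buffer, then rewind it.
buffer = io.BytesIO()
sio.savemat(buffer, {'DATE_YEAR': fake_param})
buffer.seek(0)

loaded = sio.loadmat(
    buffer,
    squeeze_me=True,          # collapse 1x1 and 1xN arrays to scalars and vectors
    struct_as_record=False,   # expose struct fields as attributes
)
year = loaded['DATE_YEAR']
print(year.Units, float(year.Rate), year.data.tolist())
```

With `struct_as_record=False` each MATLAB struct comes back as an object whose fields are plain attributes, which is why the cells below can write `mat['DATE_YEAR'].Rate`.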

In [2]:
import os
filename = os.path.join(data_dir, '652/652200305311132.mat')
filename

'/home/dingbat/data/aviation/nasa/corrected/652/652200305311132.mat'

In [3]:
import re
RE_FROM_FILENAME = re.compile(
    r'(\d{3})(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})'  # raw string avoids invalid-escape warnings
)
id_idx = 1
year_idx = 2
mon_idx = 3
day_idx = 4
hour_idx = 5
min_idx = 6
filename_data = re.match(
    RE_FROM_FILENAME,
    os.path.splitext(os.path.split(filename)[1])[0]
)
filename_data.group(id_idx)

'652'
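
The index constants above can also be expressed as named groups, which keeps the pattern and the field names in one place. This is only a sketch: the layout (3-digit aircraft id followed by YYYYMMDDHHMM) is inferred from the regular expression above, the function name `parse_flight_filename` is hypothetical, and treating the filename time as UTC is an assumption.

```python
import re
from datetime import datetime, timezone

# Assumed layout: 3-digit aircraft id, then YYYYMMDDHHMM.
FILENAME_PATTERN = re.compile(
    r'(?P<acid>\d{3})(?P<year>\d{4})(?P<month>\d{2})'
    r'(?P<day>\d{2})(?P<hour>\d{2})(?P<minute>\d{2})'
)


def parse_flight_filename(stem):
    """Return (aircraft_id, start_time) parsed from a file stem."""
    match = FILENAME_PATTERN.fullmatch(stem)
    if match is None:
        raise ValueError('unexpected filename: {!r}'.format(stem))
    # Convert every group except the id to an integer.
    fields = {k: int(v) for k, v in match.groupdict().items() if k != 'acid'}
    start = datetime(
        fields['year'], fields['month'], fields['day'],
        fields['hour'], fields['minute'],
        tzinfo=timezone.utc,  # assumption: filename times are UTC
    )
    return match.group('acid'), start


acid, start = parse_flight_filename('652200305311132')
print(acid, start.isoformat())
```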

In [4]:
import scipy.io as sio
mat = sio.loadmat(
    filename,
    squeeze_me=True,
    struct_as_record=False
)
len(mat)

189

In [5]:
type(mat)

dict

In [6]:
'StartTimeVec' in mat

False

In [7]:
type(mat['DATE_YEAR'])

scipy.io.matlab.mio5_params.mat_struct

In [8]:
type(mat['DATE_YEAR'].data)

numpy.ndarray

In [11]:
mat['DATE_YEAR'].data[0]

2003

In [10]:
len(mat['DATE_YEAR'].data)

701

In [12]:
vars(mat['DATE_YEAR'])

{'Units': 'Year',
 'Alpha': 'DATE.YEAR',
 '_fieldnames': ['data', 'Rate', 'Units', 'Description', 'Alpha'],
 'Description': 'Date (Year)',
 'data': array([2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003,
        2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003,
        2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003,
        2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003,
        2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003,
        2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003,
        2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003,
        2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003,
        2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003,
        2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003,
        2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003,
        2003, 2003, 2003, 2003, 2003, 2

In [13]:
mat['DATE_YEAR'].Rate

0.25

In [14]:
mat['GMT_MINUTE'].Rate

2

In [15]:
# Zero-pad and delimit the time fields so single-digit values stay unambiguous
start_date = '{}/{}/{} {:02d}:{:02d}:{:02d}'.format(
    mat['DATE_MONTH'].data[0],
    mat['DATE_DAY'].data[0],
    mat['DATE_YEAR'].data[0],
    int(mat['GMT_HOUR'].data[0]),
    int(mat['GMT_MINUTE'].data[0]),
    int(mat['GMT_SEC'].data[0])
)
start_date

'5/31/2003 11:31:18'

In [16]:
import pandas as pd
from datetime import timezone

param = mat['DATE_YEAR']
timestamps = pd.date_range(
    start_date,
    periods=len(param.data),
    freq='{}ms'.format(int(1000*(1.0/param.Rate))),  # 'ms' replaces the deprecated 'L' alias
    tz=timezone.utc,
)
timestamps[0], timestamps[1]

(Timestamp('2003-05-31 11:31:18+0000', tz='UTC+00:00', offset='4000L'),
 Timestamp('2003-05-31 11:31:22+0000', tz='UTC+00:00', offset='4000L'))
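
The index construction above relies on `Rate` being a sample rate in samples per second, so the spacing between samples is `1 / Rate` seconds. That reading is an assumption, but it matches the 4-second step shown for a rate of 0.25. A minimal standard-library sketch of the same arithmetic, with a hypothetical helper `sample_times`:

```python
from datetime import datetime, timedelta, timezone


def sample_times(start, rate_hz, count):
    """Return `count` timestamps spaced 1/rate_hz seconds apart."""
    period = timedelta(seconds=1.0 / rate_hz)
    return [start + i * period for i in range(count)]


start = datetime(2003, 5, 31, 11, 31, 18, tzinfo=timezone.utc)
slow = sample_times(start, 0.25, 3)   # one sample every 4 s
fast = sample_times(start, 16, 3)     # one sample every 62.5 ms
print(slow[1] - slow[0], fast[1] - fast[0])
```

A 16 Hz parameter has a period of 62.5 ms, which does not survive truncation to whole milliseconds. That is presumably why the grouping cell later in this notebook builds its frequency string in microseconds.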

## How to Create the Time-Series Data

First we need to group the parameters by sample rate, because series recorded at different rates have different time indexes and must be created individually.

In [17]:
d = {}  # Time series data
m = {}  # Meta data
for p in mat:
    param = mat[p]
    # print(p, type(param))
    if isinstance(param, sio.matlab.mio5_params.mat_struct):
        if param.Rate not in d:
            d[param.Rate] = {
                'p': {},
                't': pd.date_range(
                    start_date,
                    periods=len(param.data),
                    freq='{}us'.format(int(1000000*(1.0/param.Rate))),  # 'us' replaces the deprecated 'U' alias
                    tz=timezone.utc,
                    name='timestamp'
                ),
                'm': {}
            }
        d[param.Rate]['p'][p] = param.data
        d[param.Rate]['m'][p] = {
            'rate': param.Rate,
            'units': param.Units,
            'alpha': param.Alpha,
            'description': param.Description
        }
    else:
        m[p] = param

The data is organized by rates and then a dataframe is created per rate.

In [18]:
params = {}
for k, v in d.items():
    rate_params = v
    params[k] = pd.DataFrame(rate_params['p'], index=rate_params['t'])
[p.iloc[:2, :3] for p in params.values()]  # Show the first 2 rows and 3 columns of each frame

[                           ACID  DATE_DAY  DATE_MONTH
 timestamp                                            
 2003-05-31 11:31:18+00:00   652        31           5
 2003-05-31 11:31:22+00:00   652        31           5,
                                  ABRK  ACMT      AIL_1
 timestamp                                             
 2003-05-31 11:31:18+00:00  119.983559    66  87.766838
 2003-05-31 11:31:19+00:00  119.983559    66  87.766838,
                                   APUF  CCPC  CCPF
 timestamp                                         
 2003-05-31 11:31:18+00:00            0    20  1930
 2003-05-31 11:31:18.500000+00:00     0    20  1930,
                                   ALT  ALTR      AOA1
 timestamp                                            
 2003-05-31 11:31:18+00:00         675   -32 -4.042937
 2003-05-31 11:31:18.250000+00:00  675   -16 -4.042937,
                                       BLAC  CTAC  FPAC
 timestamp                                             
 2003-05-31 

In [19]:
m

{'__version__': '1.0',
 '__header__': b'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Tue Jan 28 10:53:16 2014',
 '__globals__': []}

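**Option 2**: Create a single data frame covering all rate groups.  This ends up being the outer join of all the current data frames into a single frame, indexed at the fastest sample rate, with missing values wherever a slower parameter has no sample.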

In [20]:
params_combined = pd.DataFrame()
params_combined = params_combined.join([v for v in params.values()], how='outer')
# One from each rate group. 65 rows starts and ends on complete samples.
# .ix was removed from pandas; .iloc selects the same 65 rows by position
rate_sample = params_combined.iloc[128:193][['ACID', 'ABRK', 'APUF', 'ALT', 'PTCH', 'BLAC']]
rate_sample

| timestamp | ACID | ABRK | APUF | ALT | PTCH | BLAC |
|---|---|---|---|---|---|---|
| 2003-05-31 11:31:26+00:00 | 652 | 119.983559 | 0 | 675 | 0.153804 | 0.002931 |
| 2003-05-31 11:31:26.062500+00:00 | NaN | NaN | NaN | NaN | NaN | 0.001954 |
| 2003-05-31 11:31:26.125000+00:00 | NaN | NaN | NaN | NaN | 0.153804 | 0.002931 |
| 2003-05-31 11:31:26.187500+00:00 | NaN | NaN | NaN | NaN | NaN | 0.002931 |
| 2003-05-31 11:31:26.250000+00:00 | NaN | NaN | NaN | 676 | 0.142818 | 0.003908 |
| 2003-05-31 11:31:26.312500+00:00 | NaN | NaN | NaN | NaN | NaN | 0.003908 |
| 2003-05-31 11:31:26.375000+00:00 | NaN | NaN | NaN | NaN | 0.142818 | 0.003908 |
| 2003-05-31 11:31:26.437500+00:00 | NaN | NaN | NaN | NaN | NaN | 0.003908 |
| 2003-05-31 11:31:26.500000+00:00 | NaN | NaN | 0 | 675 | 0.142818 | 0.004885 |
| 2003-05-31 11:31:26.562500+00:00 | NaN | NaN | NaN | NaN | NaN | 0.004885 |
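
To see why the combined frame is mostly empty, it may help to join two small frames sampled at different rates. The sketch below uses invented values and only two of the column names; it is meant to show the shape of the result, not to reproduce the flight data.

```python
import pandas as pd

start = '2003-05-31 11:31:26'

# Two hypothetical parameters: one sampled at 1 Hz, one at 4 Hz.
slow = pd.DataFrame(
    {'ABRK': [119.98, 119.98]},
    index=pd.date_range(start, periods=2, freq='1000ms', tz='UTC'),
)
fast = pd.DataFrame(
    {'ALT': [675, 676, 675, 675, 676, 676, 675, 675]},
    index=pd.date_range(start, periods=8, freq='250ms', tz='UTC'),
)

# The outer join keeps every timestamp from both frames; rows where the
# slower parameter has no sample are filled with NaN.
combined = slow.join(fast, how='outer')
print(combined)
print(combined.isna().sum())
```

Every slow column ends up padded to the length of the fastest one, which is a plausible explanation for the combined files being so much larger than the separate ones in the storage comparisons that follow.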


In [21]:
test_series = pd.Series([10,10,10,3,3,3,3])
test_series.value_counts().index[0]

3

In [22]:
obj_id = None
for k, v in params.items():
    if 'ACID' in v.columns:
        obj_id = v['ACID']
        break
obj_id.value_counts().index[0]

652

## Interpolation

In [23]:
interpolated = rate_sample.interpolate(method='linear')
interpolated

| timestamp | ACID | ABRK | APUF | ALT | PTCH | BLAC |
|---|---|---|---|---|---|---|
| 2003-05-31 11:31:26+00:00 | 652 | 119.983559 | 0 | 675.00 | 0.153804 | 0.002931 |
| 2003-05-31 11:31:26.062500+00:00 | 652 | 119.983559 | 0 | 675.25 | 0.153804 | 0.001954 |
| 2003-05-31 11:31:26.125000+00:00 | 652 | 119.983559 | 0 | 675.50 | 0.153804 | 0.002931 |
| 2003-05-31 11:31:26.187500+00:00 | 652 | 119.983559 | 0 | 675.75 | 0.148311 | 0.002931 |
| 2003-05-31 11:31:26.250000+00:00 | 652 | 119.983559 | 0 | 676.00 | 0.142818 | 0.003908 |
| 2003-05-31 11:31:26.312500+00:00 | 652 | 119.983559 | 0 | 675.75 | 0.142818 | 0.003908 |
| 2003-05-31 11:31:26.375000+00:00 | 652 | 119.983559 | 0 | 675.50 | 0.142818 | 0.003908 |
| 2003-05-31 11:31:26.437500+00:00 | 652 | 119.983559 | 0 | 675.25 | 0.142818 | 0.003908 |
| 2003-05-31 11:31:26.500000+00:00 | 652 | 119.983559 | 0 | 675.00 | 0.142818 | 0.004885 |
| 2003-05-31 11:31:26.562500+00:00 | 652 | 119.983559 | 0 | 675.25 | 0.137325 | 0.004885 |
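
Linear interpolation suits continuous signals, but for discrete parameters such as ids or on/off flags it can produce values that never occurred. One possible alternative, sketched here with hypothetical column names and invented values, is to interpolate continuous columns and carry the last observation forward for discrete ones.

```python
import numpy as np
import pandas as pd

index = pd.date_range('2003-05-31 11:31:26', periods=5, freq='250ms', tz='UTC')
frame = pd.DataFrame(
    {
        'altitude': [675.0, np.nan, np.nan, np.nan, 679.0],  # continuous signal
        'flag': [0.0, np.nan, 1.0, np.nan, np.nan],          # discrete on/off value
    },
    index=index,
)

filled = frame.copy()
# Straight line between known samples for the continuous column.
filled['altitude'] = frame['altitude'].interpolate(method='linear')
# Hold the last observed value for the discrete column.
filled['flag'] = frame['flag'].ffill()
print(filled)
```

Which columns count as discrete is a judgment about the data and is not something the MAT-file metadata shown in this notebook states directly.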


## Persisting the DataFrame for later use

### DataFrame to SQL

This creates one table per data frame.  For a repository of hundreds of thousands of flights, that is going to get heavy quickly.  However, it is SQLite, which was meant to be used at a smaller scale, so one database per flight with one table per data frame would be tolerable.

In [24]:
import os
from sqlalchemy import create_engine

#### Combined DataFrame

In [None]:
com_db_file = 'combined.db'
def combined_sql_store():
    combined_engine = create_engine('sqlite:///{}'.format(com_db_file))
    params_combined.to_sql('combined', combined_engine, if_exists='replace')
%timeit -n1 combined_sql_store()

In [50]:
'{:,}K'.format(os.path.getsize(com_db_file)//1024)

'35,774K'

#### Separate DataFrames

In [52]:
sep_db_file = 'separate.db'
def separate_sql_store():
    separate_engine = create_engine('sqlite:///{}'.format(sep_db_file))
    for k, v in params.items():
        v.to_sql('Table{}'.format(int(k*1000)), separate_engine, if_exists='replace')
%timeit -n1 separate_sql_store()

1 loops, best of 3: 16.6 s per loop


In [53]:
'{:,}K'.format(os.path.getsize(sep_db_file)//1024)

'22,893K'

#### Getting Data into Memory

SQLite allows placing the database in memory.  However, research would be needed into getting it back out of memory in order to persist it.  Given that saving to the database takes at least 16.6 seconds and occupies nearly 23 MB (22,893K), the database approach is not viable if one of the other formats supports serialization, so no further study of the SQLite database is expected.
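
For reference, on Python 3.7 or later the standard `sqlite3` module offers `Connection.backup`, which copies one database into another and is one possible way to move an in-memory database to disk. The sketch below uses an invented two-row table and a temporary file; it has not been measured against the flight data, so it says nothing about the write time that rules SQLite out here.

```python
import os
import sqlite3
import tempfile

# Build a small table in an in-memory database (rows are invented).
memory_db = sqlite3.connect(':memory:')
memory_db.execute('CREATE TABLE samples (ts TEXT, alt REAL)')
memory_db.executemany(
    'INSERT INTO samples VALUES (?, ?)',
    [('2003-05-31T11:31:18Z', 675.0), ('2003-05-31T11:31:18.25Z', 675.0)],
)
memory_db.commit()

# Copy the whole in-memory database to a file on disk.
path = os.path.join(tempfile.mkdtemp(), 'flight.db')
disk_db = sqlite3.connect(path)
memory_db.backup(disk_db)   # Connection.backup requires Python 3.7+
disk_db.close()
memory_db.close()

# Reopen the file to confirm the rows survived the copy.
check = sqlite3.connect(path)
row_count = check.execute('SELECT COUNT(*) FROM samples').fetchone()[0]
check.close()
print(row_count)
```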

### DataFrame to HDF5

H5 (HDF5) is a popular scientific data format.  It offers an interface for accessing data, stores the data in binary, preserves data types, supports many data types, and can save data in hierarchies.  It is used as the output data format for satellites and even by MATLAB.  HDF5 version 1.8.9 introduced the ability to hold H5 data files in memory.

In [54]:
import tables
'PyTables was built against HDF-5 version {}'.format(tables.hdf5_version)  # We want at least 1.8.9 for memory backed H5

'PyTables was built against HDF-5 version 1.8.14'

#### Combined DataFrame

In [87]:
com_h5_file = 'combined.h5'
def combined_h5_store():
    # The context manager closes the store so the data is flushed to disk
    with pd.HDFStore(com_h5_file) as combined_store:
        combined_store['Combined'] = params_combined
%timeit -n1 -r1 combined_h5_store()

1 loops, best of 1: 215 ms per loop


In [88]:
with pd.HDFStore(com_h5_file) as combined_store:
    print(combined_store)

<class 'pandas.io.pytables.HDFStore'>
File path: combined.h5
/Combined            frame        (shape->[69440,186])


In [89]:
'{:,}K'.format(os.path.getsize(com_h5_file)//1024)

'101,051K'

#### Separate DataFrame

In [90]:
sep_h5_file = 'separate.h5'
def separate_h5_store():
    # The context manager closes the store so the data is flushed to disk
    with pd.HDFStore(sep_h5_file) as separate_store:
        for k, v in params.items():
            separate_store['Table{}'.format(int(k*1000))] = v
%timeit -n1 -r1 separate_h5_store()

1 loops, best of 1: 88 ms per loop


In [91]:
with pd.HDFStore(sep_h5_file) as separate_store:
    print(separate_store)

<class 'pandas.io.pytables.HDFStore'>
File path: separate.h5
/Table1000             frame        (shape->[4340,88]) 
/Table16000            frame        (shape->[69440,4]) 
/Table2000             frame        (shape->[8680,18]) 
/Table250              frame        (shape->[1085,23]) 
/Table4000             frame        (shape->[17360,49])
/Table8000             frame        (shape->[34720,4]) 


In [92]:
'{:,}K'.format(os.path.getsize(sep_h5_file)//1024)

'11,322K'

#### Reading in from H5

In [48]:
params2 = {}
# keys() replaces the removed iteritems(); the with block closes the store
with pd.HDFStore(sep_h5_file) as separate_store:
    for key in separate_store.keys():
        params2[key] = separate_store[key]
[p.iloc[:1, :3] for p in params2.values()]  # Show the first row and 3 columns of each frame

[                               BLAC  CTAC  FPAC
 timestamp                                      
 2001-04-11 14:40:24+00:00 -0.008793     0     0,
                            ALT  ALTR      AOA1
 timestamp                                     
 2001-04-11 14:40:24+00:00  987   -16 -4.306609,
                                PTCH  RALT      ROLL
 timestamp                                          
 2001-04-11 14:40:24+00:00 -0.505356  2.25 -0.285638,
                                  ABRK  ACMT      AIL_1
 timestamp                                             
 2001-04-11 14:40:24+00:00  119.983559    60  87.275848,
                            APUF  CCPC  CCPF
 timestamp                                  
 2001-04-11 14:40:24+00:00     0  1957  1876,
                            ACID  DATE_DAY  DATE_MONTH
 timestamp                                            
 2001-04-11 14:40:24+00:00   687        11           4]

### DataFrame to Pickle

Pickle is Python's built-in object serialization interface.  Combined with BytesIO, it can serialize data to and from memory, which makes it one of the most straightforward ways to persist data to a database and load it back as needed.  However, pickle is specific to Python, which brings a few limitations.  First, the data always needs to be translated to another format before external tools can use it.  At the moment that is not expected to be a factor, since even tools that read H5 would need to understand the pandas-specific layout.  Second, there is the continued compatibility of the pickle format.  It has gone through several protocol versions that are not compatible with one another, although newer Python releases continue to read the earlier ones.  Third, unpickling can execute arbitrary Python code.  As long as the pickled representations are protected and not exposed this is acceptable, but pickle should never be used on untrusted data.  The other formats avoid this problem because they deal with data only.
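
A minimal sketch of the in-memory round trip, using a plain dictionary as a stand-in for the data frames. Pinning the protocol number is one way to address the compatibility concern, since the bytes then stay readable by any Python version that supports that protocol; the choice of protocol 4 here is an assumption, not something this notebook tested.

```python
import pickle
from io import BytesIO

# Stand-in for the per-rate dictionary of data frames (values invented).
payload = {0.25: {'DATE_YEAR': [2003, 2003]}, 1.0: {'ABRK': [119.98, 119.98]}}

# Serialize into an in-memory buffer with an explicit protocol version.
buffer = BytesIO()
pickle.dump(payload, buffer, protocol=4)

# Rewind before reading; without this the buffer reads as empty.
buffer.seek(0)
restored = pickle.load(buffer)
print(restored == payload)
```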

In [104]:
import pickle

#### Combined DataFrame

In [95]:
com_pkl_file = 'combined.pkl'
def combined_pkl_store():
    with open(com_pkl_file, 'wb') as cp:
        pickle.dump(params_combined, cp)
%timeit -n1 combined_pkl_store()

1 loops, best of 3: 141 ms per loop


In [96]:
'{:,}K'.format(os.path.getsize(com_pkl_file)//1024)

'101,056K'

#### Separate DataFrame

In [98]:
sep_pkl_file = 'separate.pkl'
def separate_pkl_store():
    with open(sep_pkl_file, 'wb') as sp:
        pickle.dump(params, sp)
%timeit -n1 separate_pkl_store()

1 loops, best of 3: 18.9 ms per loop


In [99]:
'{:,}K'.format(os.path.getsize(sep_pkl_file)//1024)

'11,290K'

#### Pickle to/from Memory

In [109]:
from io import BytesIO
separate_io = BytesIO()
combined_io = BytesIO()
combined_io.read()

b''

In [113]:
params_combined[:2]

| timestamp | ACID | DATE_DAY | DATE_MONTH | DATE_YEAR | DVER_1 | DVER_2 | ECYC_1 | ECYC_2 | ECYC_3 | ECYC_4 | ... | WD | WS | BLAC | CTAC | FPAC | IVV | PTCH | RALT | ROLL | VRTG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2001-04-11 14:40:24+00:00 | 687.0 | 11.0 | 4.0 | 2001.0 | 126.0 | 50.0 | 6740.0 | 6740.0 | 6740.0 | 8223.0 | ... | 0.0 | 0.0 | -0.008793 | 0 | 0 | 0 | -0.505356 | 2.25 | -0.285638 | 0.992412 |
| 2001-04-11 14:40:24.062500+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | -0.008793 | 0 | 0 | 0 | NaN | NaN | NaN | NaN |


In [116]:
# dumps returns a bytes object directly, replacing the BytesIO buffers created above
separate_io = pickle.dumps(params)
combined_io = pickle.dumps(params_combined)

In [120]:
params2 = pickle.loads(separate_io)
[p.iloc[:1, :3] for p in params2.values()]

[                           ACID  DATE_DAY  DATE_MONTH
 timestamp                                            
 2001-04-11 14:40:24+00:00   687        11           4,
                                  ABRK  ACMT      AIL_1
 timestamp                                             
 2001-04-11 14:40:24+00:00  119.983559    60  87.275848,
                            APUF  CCPC  CCPF
 timestamp                                  
 2001-04-11 14:40:24+00:00     0  1957  1876,
                            ALT  ALTR      AOA1
 timestamp                                     
 2001-04-11 14:40:24+00:00  987   -16 -4.306609,
                                BLAC  CTAC  FPAC
 timestamp                                      
 2001-04-11 14:40:24+00:00 -0.008793     0     0,
                                PTCH  RALT      ROLL
 timestamp                                          
 2001-04-11 14:40:24+00:00 -0.505356  2.25 -0.285638]

### Conclusion

1. SQLite
   a. It is by far the slowest to write (16.6 s for the separate form)
   a. In the separate form it uses the most storage (22,893K versus about 11,300K for H5 and pickle); in the combined form it is actually the smallest (35,774K versus about 101,000K)
   a. It has the most portable representation for the data
   a. The separate data frame form uses about 2/3 the storage of combined
1. H5
   a. It's much faster than SQLite but slower than pickle (215 ms vs. 141 ms combined, 88 ms vs. 18.9 ms separate)
   a. It is equivalent to pickle on storage space
   a. The representation is portable, but the index and data are stored separately and large tables (lots of samples) are split into multiple H5 data tables
   a. The separate data frame form uses about 1/9th the storage of combined
   a. Writing to memory is theoretically possible but will require some experimentation
1. Pickle
   a. It is the fastest of all forms
   a. Its storage is tied with H5 for lowest in the separate form
   a. Only Python can easily read the representation (and it may not be compatible across different Python implementations)
   a. The separate data frame form uses about 1/9th the storage of combined
   a. The pickle package provides the dumps and loads functions to serialize to and from a bytes object
   
**Conclusion**: Use pickle to store pandas data to the database
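
One way this conclusion could look in practice is a table with one row per flight and the pickled bytes stored in a BLOB column. The sketch below uses an in-memory SQLite database and a hypothetical `flights` schema purely for illustration; the target database, schema, and key format are assumptions, and the payload is a plain dictionary standing in for the data frames.

```python
import pickle
import sqlite3

# Hypothetical schema: one row per flight, payload stored as a BLOB.
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE flights (flight_id TEXT PRIMARY KEY, payload BLOB)')

params = {0.25: {'DATE_YEAR': [2003, 2003]}}   # stand-in for the data frames
db.execute(
    'INSERT INTO flights VALUES (?, ?)',
    ('652200305311132', pickle.dumps(params, protocol=4)),
)
db.commit()

blob = db.execute(
    'SELECT payload FROM flights WHERE flight_id = ?', ('652200305311132',)
).fetchone()[0]
# Only unpickle data from a trusted source; loading can execute code.
restored = pickle.loads(blob)
db.close()
print(restored == params)
```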

In [60]:
type(np.array)

builtin_function_or_method