# Testing some Pandas I/O options with a fairly large dataset (~12 million rows)
## Using the May 2016 CSV file from the [NYC TLC Open Data Portal](http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml)

In [3]:
import pandas as pd

import dask.dataframe as dd
#from dask.multiprocessing import get

import fastparquet

In [5]:
! ls -lh data/

total 3629936
-rw-r--r--  1 bob  staff   1.7G Aug 11  2016 yellow_tripdata_2016-05.csv


## Using Pandas' [read_csv](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)

In [7]:
%time pd_frame = pd.read_csv('data/yellow_tripdata_2016-05.csv')
print('{:,} rows'.format(len(pd_frame)))

CPU times: user 36.3 s, sys: 2.92 s, total: 39.2 s
Wall time: 39.4 s
11,836,853 rows
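
`%time` is an IPython magic, so the timings in this notebook can only be reproduced as written inside Jupyter or an IPython shell. As a minimal sketch, assuming a plain Python script instead, a similar wall-clock and CPU measurement could be taken with the standard `time` module. The helper name `timed` is hypothetical and not part of pandas or IPython, and it reports one combined CPU figure rather than the separate user and sys values that `%time` prints.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # monotonic wall clock
    cpu_start = time.process_time()    # user + sys CPU time of this process
    result = func(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu


# Stand-in workload; in this notebook it would be e.g. pd.read_csv(path)
result, wall, cpu = timed(sum, range(1_000_000))
print('Wall time: {:.3f} s, CPU time: {:.3f} s'.format(wall, cpu))
```

Like `%time`, this measures a single run, so the numbers will vary with disk caching and whatever else the machine is doing.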


## Using dask's [read_csv](http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.read_csv)

In [8]:
%time dd_to_pd_frame = dd.read_csv('data/yellow_tripdata_2016-05.csv').compute()

CPU times: user 54 s, sys: 7 s, total: 1min
Wall time: 16.5 s
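
dask reports more total CPU time than pandas (about 1 min versus 39.2 s) yet finishes in less than half the wall time, which is consistent with the parsing work being spread over several cores. A rough way to see this is the ratio of total CPU time to wall time, sketched below using only the figures printed above. The dask total is taken as 60 s because `%time` rounds it to `1min`, so that ratio is approximate; the variable names are illustrative.

```python
# (total CPU seconds, wall seconds) as reported by %time above
timings = {
    'pandas read_csv': (39.2, 39.4),
    'dask read_csv': (60.0, 16.5),  # "1min" total, so 60 s is approximate
}

# A ratio near 1 means roughly one core was busy for the whole run
ratios = {name: cpu / wall for name, (cpu, wall) in timings.items()}
for name, ratio in ratios.items():
    print('{}: CPU/wall = {:.2f}'.format(name, ratio))
```

A ratio close to 1 for pandas suggests a single busy core, while a ratio between 3 and 4 for dask suggests that several cores were in use on average.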


## Using [fastparquet](https://fastparquet.readthedocs.io/en/latest/index.html)'s [write](https://fastparquet.readthedocs.io/en/latest/quickstart.html#writing)

In [9]:
%time fastparquet.write('data/yellow_tripdata_2016-05.parq', pd_frame, compression='SNAPPY')

CPU times: user 23.7 s, sys: 1.62 s, total: 25.3 s
Wall time: 25.5 s


In [10]:
! ls -lhtr data/

total 4604176
-rw-r--r--  1 bob  staff   1.7G Aug 11  2016 yellow_tripdata_2016-05.csv
-rw-r--r--  1 bob  staff   476M Jul 21 13:25 yellow_tripdata_2016-05.parq


## Using fastparquet's [reading](https://fastparquet.readthedocs.io/en/latest/quickstart.html#reading)

In [13]:
%time parq_to_pd = fastparquet.ParquetFile('data/yellow_tripdata_2016-05.parq').to_pandas()

CPU times: user 12 s, sys: 2.82 s, total: 14.8 s
Wall time: 15.8 s


## Using Pandas' [to_hdf](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_hdf.html)

In [12]:
%time pd_frame.to_hdf('data/yellow_tripdata_2016-05.h5', key='tripdata')

CPU times: user 6.72 s, sys: 3.18 s, total: 9.89 s
Wall time: 11.1 s


In [14]:
! ls -lhtr data/

total 8214616
-rw-r--r--  1 bob  staff   1.7G Aug 11  2016 yellow_tripdata_2016-05.csv
-rw-r--r--  1 bob  staff   476M Jul 21 13:25 yellow_tripdata_2016-05.parq
-rw-r--r--  1 bob  staff   1.7G Jul 21 13:27 yellow_tripdata_2016-05.h5


## Using Pandas' [read_hdf](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_hdf.html)

In [15]:
%time hdf_to_pd_frame = pd.read_hdf('data/yellow_tripdata_2016-05.h5', key='tripdata')

CPU times: user 2.89 s, sys: 2.77 s, total: 5.66 s
Wall time: 6.91 s
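
To compare the read paths side by side, the wall times and file sizes reported in the cells above can be collected in one place. This is a small sketch that uses only those printed figures; sizes are taken from the `ls -lh` output, with 1.7G converted to megabytes on the assumption that 1G = 1024M.

```python
# Wall-clock read times (seconds) copied from the %time output above
read_wall = {
    'pandas read_csv': 39.4,
    'dask read_csv + compute': 16.5,
    'fastparquet to_pandas': 15.8,
    'pandas read_hdf': 6.91,
}

# On-disk sizes in MB as shown by `ls -lh` (assuming 1G = 1024M)
size_mb = {'csv': 1.7 * 1024, 'parq': 476.0, 'h5': 1.7 * 1024}

# Speedup of each read path relative to plain pandas read_csv
baseline = read_wall['pandas read_csv']
speedup = {name: baseline / secs for name, secs in read_wall.items()}
for name, factor in speedup.items():
    print('{:<26} {:>6.2f} s  {:.1f}x'.format(name, read_wall[name], factor))

# Size of the Snappy-compressed Parquet file relative to the CSV
parq_ratio = size_mb['parq'] / size_mb['csv']
print('Parquet file is {:.0%} of the CSV size'.format(parq_ratio))
```

On these single-run numbers, `read_hdf` is the fastest way back into a DataFrame but the HDF5 file takes as much disk space as the CSV, while the Snappy-compressed Parquet file is roughly a quarter of the size and reads back in about the same time as dask takes to parse the CSV.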
