In [None]:
# default_exp storage

# Storage

> Storage alternatives. Read, write and convert files.

In [None]:
# export
import json
import gzip
import shutil

import pandas as pd
import fastparquet as fp

from ig_format import tools

# CSV

Human readable, universally supported, row oriented storage format.

## Issues
- The newly introduced nullable string dtype (pd.StringDtype, "string") substantially increases read time compared to the old "object" dtype.

## Possible optimizations
- Use smallest possible dtype (e.g. float32 instead of float64)
- Use categoricals for string columns
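
A rough illustration of both optimizations on a toy dataframe. The column names and values are made up for the example and are not taken from the data set; actual savings depend on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one float column and one low-cardinality string column.
n = 100_000
df = pd.DataFrame({
    'value': np.linspace(0.0, 1.0, n),             # float64 by default
    'state': ['WI', 'MN', 'IL', 'IA'] * (n // 4),  # stored as Python strings
})
before = int(df.memory_usage(deep=True).sum())

# Downcast the float column and turn the string column into a categorical.
small = df.astype({'value': 'float32', 'state': 'category'})
after = int(small.memory_usage(deep=True).sum())

print(f'{before:,} bytes -> {after:,} bytes')
```

Note that float32 keeps only about 7 significant decimal digits, so the downcast is only appropriate when that precision is enough.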

In [None]:
# export
def dtypes_from_schema(schema_path, ext_dtypes=False):
    """Read JSON table schema and convert it to a dictionary
    (col_name -> col_dtype) that can be passed to pandas CSV reader.
    """
    with open(schema_path) as f:
        schema = json.load(f)
    if ext_dtypes:
        # nullable pandas extension types
        dt_int = 'Int64'
        dt_str = 'string'
    else:
        # float64 can hold NaN for missing values, which numpy integer dtypes cannot
        dt_int = 'float64'
        dt_str = 'str'
    map_ = {
        'integer': dt_int,
        'number': 'float64',
        'string': dt_str,
        'year': dt_int
    }
    dtypes = {field['name']: map_[field['type']] for field in schema['fields']}
    return dtypes

def read_csv(year, ext_dtypes=False, full=False, gzip=False):
    """Read single year of data from CSV file into pandas dataframe.

    :par ext_dtypes: use pandas experimental extension types (nullable int and str).
    :par full: read all records or just the first 100k.
    :par gzip: read the gzip-compressed file (`{year}.csv.gz`) instead of the plain CSV.
    """
    # note: the `gzip` argument shadows the `gzip` module inside this function
    path = f'./data/csv/{year}.csv'
    if gzip:
        path += '.gz'
    dt = dtypes_from_schema(f'./data/csv/{year}_schema.json', ext_dtypes)
    nr = None if full else 100000
    return pd.read_csv(path, dtype=dt, nrows=nr)

Extension types make reading slower.
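
For context, the two dtype mappings being timed differ only in how integers and strings are read. The sketch below applies the same mapping to a small in-memory schema; the field names are hypothetical, and `dtypes_for` is an illustrative stand-in that takes a dict rather than a file path.

```python
import json

# Hypothetical Table Schema with one field of each type handled by the mapping.
schema = json.loads('''
{"fields": [
  {"name": "year", "type": "year"},
  {"name": "emp", "type": "integer"},
  {"name": "sales", "type": "number"},
  {"name": "name", "type": "string"}
]}
''')


def dtypes_for(schema, ext_dtypes=False):
    """Map Table Schema field types to pandas dtype names."""
    dt_int, dt_str = ('Int64', 'string') if ext_dtypes else ('float64', 'str')
    type_map = {'integer': dt_int, 'number': 'float64',
                'string': dt_str, 'year': dt_int}
    return {field['name']: type_map[field['type']] for field in schema['fields']}


classic = dtypes_for(schema)
extended = dtypes_for(schema, ext_dtypes=True)
print(classic)
print(extended)
```

Either dictionary can be passed as the `dtype` argument of `pd.read_csv`.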

In [None]:
%%time
df = read_csv(2000, False)

In [None]:
%%time
df = read_csv(2000, True)

## csv.gz

Pandas can read compressed CSV files.
Compression reduces size on disk and may improve read time by speeding up disk I/O at the expense of CPU time.

Compression takes about the same time with the Python "gzip" module and the system "gzip" utility.
The compressed file size is also the same.

Compression level can be set between 1 (fastest, worst compression) and 9 (slowest, best compression).
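
A minimal sketch of the effect of the compression level, using a small in-memory sample instead of the real data files. The sample bytes are made up for the example; sizes and timings on real data will differ.

```python
import gzip

# Hypothetical sample: repetitive CSV-like bytes standing in for a real data file.
raw = b''.join(f'{i},{i % 7},WI\n'.encode() for i in range(50_000))

# Compress the same bytes at the three levels compared in this notebook.
sizes = {level: len(gzip.compress(raw, compresslevel=level)) for level in (1, 6, 9)}

for level, size in sizes.items():
    print(f'level {level}: {size:,} bytes ({size / len(raw):.1%} of original)')

# Decompression returns the original bytes regardless of the level used.
restored = gzip.decompress(gzip.compress(raw, compresslevel=1))
```

The level only affects how hard the compressor works; any gzip reader can decompress the result without knowing which level was used.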


In [None]:
%%time
# compression level 1: fastest
with open('./data/csv/2000.csv', 'rb') as f_in:
    with gzip.open('./data/csv/2000.csv.py1.gz', 'wb', 1) as f_out:
        shutil.copyfileobj(f_in, f_out)

In [None]:
%%time
# compression level 6: same as the default of the system "gzip" utility
with open('./data/csv/2000.csv', 'rb') as f_in:
    with gzip.open('./data/csv/2000.csv.py6.gz', 'wb', 6) as f_out:
        shutil.copyfileobj(f_in, f_out)

In [None]:
%%time
# default compression level 9: highest
with open('./data/csv/2000.csv', 'rb') as f_in:
    with gzip.open('./data/csv/2000.csv.py9.gz', 'wb', 9) as f_out:
        shutil.copyfileobj(f_in, f_out)

In [None]:
# export
def compress_files(*glob_paths):
    """Compress given files with GZIP."""

In [None]:
tools.size_on_disk('./data/csv/2000.csv*')

In [None]:
%time df = read_csv(2000, full=True)

In [None]:
!cp ./data/csv/2000.csv.py1.gz ./data/csv/2000.csv.gz
%time df = read_csv(2000, full=True, gzip=True)

In [None]:
!cp ./data/csv/2000.csv.py6.gz ./data/csv/2000.csv.gz
%time df = read_csv(2000, full=True, gzip=True)

In [None]:
!cp ./data/csv/2000.csv.py9.gz ./data/csv/2000.csv.gz
%time df = read_csv(2000, full=True, gzip=True)

# parquet

## Apache Parquet

- [Website](https://parquet.apache.org/)
- [Wikipedia](https://en.wikipedia.org/wiki/Apache_Parquet)
- [Docs](https://parquet.apache.org/documentation/latest/)

File format for data storage. Provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. 

> Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.

Features:

- columnar storage, only read the data of interest
- efficient binary packing
- choice of compression algorithms and encoding
- split data into files, allowing for parallel processing
- range of logical types
- statistics stored in metadata allow for skipping unneeded chunks
- data partitioning using the directory structure

Hierarchically, a file consists of one or more row groups. A row group contains exactly one column chunk per column. Column chunks contain one or more pages.
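
The hierarchy and the statistics-based skipping can be illustrated with a toy in-memory model. This is not the real Parquet layout: pages, encoding and compression are left out, and all names (`ColumnChunk`, `to_row_groups`, `scan`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ColumnChunk:
    """Values of one column within one row group, plus min/max statistics."""
    values: list
    lo: object
    hi: object


def to_row_groups(rows, columns, group_size):
    """Split row-oriented records into row groups, one column chunk per column."""
    groups = []
    for start in range(0, len(rows), group_size):
        part = rows[start:start + group_size]
        group = {}
        for i, name in enumerate(columns):
            vals = [row[i] for row in part]
            group[name] = ColumnChunk(vals, min(vals), max(vals))
        groups.append(group)
    return groups


def scan(groups, column, lo, hi):
    """Return values of `column` within [lo, hi], skipping row groups by statistics."""
    out, touched = [], 0
    for group in groups:
        chunk = group[column]
        if chunk.hi < lo or chunk.lo > hi:
            continue  # statistics prove that no value in this chunk can match
        touched += 1
        out.extend(v for v in chunk.values if lo <= v <= hi)
    return out, touched


rows = [(year, year * 10) for year in range(1997, 2016)]
groups = to_row_groups(rows, ['year', 'emp'], group_size=5)
values, touched = scan(groups, 'year', 2003, 2005)
print(values, f'(read {touched} of {len(groups)} row groups)')
```

Only the `year` chunks are ever inspected, and only one row group has a min/max range that overlaps the query, which is the kind of saving that columnar storage with metadata statistics makes possible.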

[Parquet vs CSV: AWS processing costs](https://blog.openbridge.com/how-to-be-a-hero-with-powerful-parquet-google-and-amazon-f2ae0f35ee04)


## fastparquet

- [GitHub](https://github.com/dask/fastparquet)
- [Docs](https://fastparquet.readthedocs.io/en/latest/)

Python implementation of the *Apache Parquet* format.
Part of *dask* ecosystem, designed to work well with dask for parallel execution.

Latest release at the time of writing: 0.3.3

Not all parts of the Parquet format have been implemented or tested yet.
Not all output options are compatible with every other Parquet framework, since each implements only a subset of the standard.
This creates a trade-off when writing: files that are compatible with other Parquet implementations versus better performance when the data will be read back with fastparquet.

Compression: uncompressed, gzip, snappy (install `python-snappy` from conda-forge separately)

### Issues

- As of 0.3.3, pandas extension dtypes like "Int64" are not supported, but there is a [PR](https://github.com/dask/fastparquet/pull/483)


## Apache Arrow

> Apache Arrow is a cross-language development platform for **in-memory** data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication.

- [Website](https://arrow.apache.org/)
- [Docs](https://arrow.apache.org/docs/index.html)

The Arrow Python bindings (also named “PyArrow”) have first-class integration with NumPy, pandas, and built-in Python objects. They are based on the C++ implementation of Arrow.

[Data types, Schema and Table](https://arrow.apache.org/docs/python/data.html)

[Read and write parquet](https://arrow.apache.org/docs/python/parquet.html)

> fastparquet has a much smaller footprint, weighing in at a modest 180 kB compared to pyarrow's hulking 48 MB. It also has a much simpler and pandas centric read/write signature which can be nice for users that are more comfortable working with a DataFrame mindset. On the other hand, pyarrow offers greater coverage of the Parquet file format, more voluminous documentation, and more explicit column typing which turned out to be important later on. After much wavering we decided to go forward using pyarrow. [parquet business use case](https://medium.com/when-i-work-data/por-que-parquet-2a3ec42141c6)

Arrow's purpose is to move data between components (pandas, parquet, CSV, Spark, R, ...) more efficiently. If we only use pandas, fastparquet might be sufficient. As of now, Arrow does not support SAS or Stata.


### Feather

[Docs](https://arrow.apache.org/docs/python/ipc.html#feather-format)

Lightweight file storage format for dataframes understood by both pandas and R. It does not look very active, so it is probably better to use parquet.

In [None]:
%%time
# convert CSV to parquet
for year in range(1997, 2016):
    df = read_csv(year, full=True)
    for compres, label in zip([None, 'gzip', 'snappy'], ['', '.gz', '.snap']):
        fp.write(f'./data/parquet/{year}{label}.pq', df, compression=compres, write_index=False)

In [None]:
df = read_csv(2000)
fp.write('/tmp/df.snappy.pq', df, compression='snappy', write_index=False)

In [None]:
!ls -sh data/parquet/????.pq

In [None]:
# convert CSV to parquet
data_years = range(1997, 2016)
for year in data_years:
    df = read_csv(year)
    fp.write(f'./data/parquet/{year}.gz.pq', df, compression='gzip', write_index=False)

In [None]:
tools.size_on_disk('./data/parquet/????.gz.pq')

In [None]:
# convert CSV to parquet
data_years = range(1997, 2016)
for year in data_years:
    df = read_csv(year)
    fp.write(f'./data/parquet/{year}.snappy.pq', df, compression='snappy', write_index=False)

In [None]:
!ls -sh data/parquet/????.snappy.pq

In [None]:
def read_pq(year, cols=None):
    """Read single year of data from parquet file into pandas dataframe.
    
    :par cols: list of columns to read (default all).
    """
    pf = fp.ParquetFile(f'./data/parquet/{year}.pq')
    return pf.to_pandas(cols)

In [None]:
dfc = read_csv(2013)
dfp = read_pq(2013)

In [None]:
(dfc.dtypes == dfp.dtypes).all()

# omnischema

This is an idea for a Python package that would provide conversion between different table schema formats.

What I need now: conversion between pandas and parquet.

Schemas that could be supported:

- pandas
  - classic [numpy dtypes](https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html)
  - new [extension types](https://pandas.pydata.org/docs/getting_started/basics.html#basics-dtypes)
- [parquet](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md)
- [Frictionless Data Table Schema](https://frictionlessdata.io/specs/table-schema/)
- SQL (flavors)
- Apache Arrow
- R dataframe
- Stata
- SAS
- BigQuery
- xlsx, ods
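
A minimal sketch of what such a conversion could look like, assuming a design based on chained lookup tables. The names `TABLE_SCHEMA_TO_PANDAS`, `PANDAS_TO_PARQUET` and `convert` are hypothetical, and only a handful of types are covered. Parquet types are given as (physical type, logical type) pairs.

```python
# Hypothetical lookup tables covering only a few types.
TABLE_SCHEMA_TO_PANDAS = {
    'integer': 'Int64',
    'number': 'float64',
    'string': 'string',
    'boolean': 'boolean',
}
PANDAS_TO_PARQUET = {
    'Int64': ('INT64', None),
    'float64': ('DOUBLE', None),
    'string': ('BYTE_ARRAY', 'STRING'),
    'boolean': ('BOOLEAN', None),
}


def convert(schema, *steps):
    """Apply lookup tables in order to every column type in `schema`."""
    out = dict(schema)
    for step in steps:
        out = {name: step[type_] for name, type_ in out.items()}
    return out


# Column name -> Table Schema type (made-up columns).
table_schema = {'year': 'integer', 'sales': 'number', 'name': 'string'}

pandas_schema = convert(table_schema, TABLE_SCHEMA_TO_PANDAS)
parquet_schema = convert(table_schema, TABLE_SCHEMA_TO_PANDAS, PANDAS_TO_PARQUET)
print(pandas_schema)
print(parquet_schema)
```

Chaining pairwise tables keeps the number of mappings small, but every hop can lose information, for example when a schema carries constraints or formats that the next one cannot express.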

## Existing tools

- [Stat/Transfer](https://stattransfer.com/) - commercial software for format conversion. It even supports Feather.
- [pyreadstat](https://github.com/Roche/pyreadstat)
  - convert SAS, Stata and SPSS formats <-> pandas dataframes
  - wrapped C library, faster than sas7bdat and pd.read_sas
  - available from conda-forge
  - read meta, read column subset, read row chunks
- [odo](http://odo.pydata.org/en/latest/) - efficiently migrates data from the source to the target through a network of conversions.


## Additional resources

- [DCAT](https://www.w3.org/TR/vocab-dcat/) - W3C spec for dataset metadata
- [Schema.org Dataset](https://schema.org/Dataset) - founded by major search engines, used by the Google Dataset Search tool.
- [Data Documentation Initiative](https://ddialliance.org/)