In [None]:
%load_ext autoreload
%autoreload 2

# parquet

> Binary storage formats.

## Apache Parquet

A file format for data storage. It provides efficient compression and encoding schemes and performs well when handling complex data in bulk.

> Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.

[Website](https://parquet.apache.org/)

[Wikipedia](https://en.wikipedia.org/wiki/Apache_Parquet)

[Parquet vs CSV: AWS processing costs](https://blog.openbridge.com/how-to-be-a-hero-with-powerful-parquet-google-and-amazon-f2ae0f35ee04)

[Docs](https://parquet.apache.org/documentation/latest/)

Features:

- columnar storage, only read the data of interest
- efficient binary packing
- choice of compression algorithms and encoding
- split data into files, allowing for parallel processing
- range of logical types
- statistics stored in metadata allow for skipping unneeded chunks
- data partitioning using the directory structure

Hierarchically, a file consists of one or more row groups. A row group contains exactly one column chunk per column. Column chunks contain one or more pages. 

## fastparquet

Python implementation of the *Apache Parquet* format.
Part of the *dask* ecosystem and designed to work well with dask for parallel execution.

Latest release at the time of writing: 0.3.3

Not all parts of the Parquet format have been implemented or tested yet.
Not all output options are compatible with every other Parquet framework, each of which implements only a subset of the standard.
This creates a trade-off when writing: producing files that other Parquet implementations can read, versus getting the best performance when the data is read back with fastparquet.

[GitHub page](https://github.com/dask/fastparquet)

[Docs](https://fastparquet.readthedocs.io/en/latest/)

### Variations

- compression: uncompressed, gzip, or snappy (install `python-snappy` from conda-forge separately)

### Issues

- As of 0.3.3, pandas extension dtypes such as "Int64" are not supported, but support is a [work in progress](https://github.com/dask/fastparquet/pull/483).
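
One possible workaround, also used in the cells below, is to cast nullable integer columns to `float64` before writing. A minimal sketch on a toy frame (the column names are made up) shows what the cast does to missing values:

```python
import pandas as pd

# Toy frame; column names are hypothetical.
toy = pd.DataFrame({
    'employees': pd.array([10, None, 25], dtype='Int64'),
    'state': ['WI', 'IL', 'MN'],
})

# Cast every nullable Int64 column to float64; pd.NA becomes NaN.
for col in toy.columns:
    if isinstance(toy[col].dtype, pd.Int64Dtype):
        toy[col] = toy[col].astype('float64')

print(toy.dtypes)
```

Integers above 2**53 cannot be represented exactly as `float64`, so this cast is only safe for smaller values.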

In [None]:
import pandas as pd
import fastparquet as fp

import ig_format
from ig_format import pandas as igpd

In [None]:
data_dir = './out/extracts/100k'
schema_path = './out/schema.json'
data_years = range(1997, 2000)

dt = igpd.dtypes_from_schema(schema_path)
df = pd.read_csv(data_dir + '/2000.csv', dtype=dt)

In [None]:
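# fastparquet 0.3.3 cannot write pandas nullable integers ("Int64"),
# so cast them to float64 (missing values become NaN).
# This could instead be done when the schema is created.
for c in df:
    if isinstance(df[c].dtype, pd.Int64Dtype):
        df[c] = df[c].astype('float64')

# write a single gzip-compressed Parquet file
fp.write('./tmp/2000.parquet', df, compression='gzip')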

In [None]:
# open the file and read back only the columns of interest
pf = fp.ParquetFile('./tmp/2000.parquet')
dfp = pf.to_pandas(['company', 'abi', 'state', 'naics', 'employees'])
dfp.head()

## Apache Arrow

> Apache Arrow is a cross-language development platform for **in-memory** data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication.

[Website](https://arrow.apache.org/)

[Docs](https://arrow.apache.org/docs/index.html)

The Arrow Python bindings (also named “PyArrow”) have first-class integration with NumPy, pandas, and built-in Python objects. They are based on the C++ implementation of Arrow.

[Data types, Schema and Table](https://arrow.apache.org/docs/python/data.html)

[Read and write parquet](https://arrow.apache.org/docs/python/parquet.html)

> fastparquet has a much smaller footprint, weighing in at a modest 180 kB compared to pyarrow's hulking 48 MB. It also has a much simpler and pandas centric read/write signature which can be nice for users that are more comfortable working with a DataFrame mindset. On the other hand, pyarrow offers greater coverage of the Parquet file format, more voluminous documentation, and more explicit column typing which turned out to be important later on. After much wavering we decided to go forward using pyarrow. [parquet business use case](https://medium.com/when-i-work-data/por-que-parquet-2a3ec42141c6)

Arrow's purpose is to move data between components (pandas, parquet, CSV, Spark, R, ...) more efficiently. If we only use pandas, fastparquet might be sufficient. For now, Arrow does not support SAS or Stata.


### Feather

[docs](https://arrow.apache.org/docs/python/ipc.html#feather-format)

Lightweight file storage format for data frames, understood by both pandas and R. The project does not look very active, so Parquet is probably the better choice.

## Conversion from SAS

### pyreadstat

[GitHub page](https://github.com/Roche/pyreadstat)

- converts between SAS, Stata and SPSS files and pandas dataframes
- wraps the ReadStat C library; faster than sas7bdat and pd.read_sas
- available from conda-forge
- can read metadata only, a subset of columns, or rows in chunks