# PyArrow VS Fastparquet

This is a quick comparison between pyarrow and fastparquet.

Not sure what Parquet and Arrow are? Watch this talk 👉 https://www.youtube.com/watch?v=wdmf1msbtVs

## More info

- Parquet: https://parquet.apache.org/
- Arrow: https://arrow.apache.org/

In [1]:
from pathlib import Path
import warnings

import numpy as np
import pandas as pd

In [2]:
warnings.filterwarnings('ignore')

## PyArrow

pyarrow: https://arrow.apache.org/docs/python/index.html

In [3]:
import pyarrow.parquet as pq
import pyarrow as pa

Let's create a new dataframe

In [4]:
df = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})

In [5]:
df

|   | one | two | three |
|---|-----|-----|-------|
| 0 | -1.0 | foo | True |
| 1 | NaN | bar | False |
| 2 | 2.5 | baz | True |


We can convert it into an Arrow table

In [6]:
table = pa.Table.from_pandas(df)

In [7]:
table

pyarrow.Table
one: double
two: string
three: bool
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
            b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
            b' [{"name": "one", "field_name": "one", "pandas_type": "float64",'
            b' "numpy_type": "float64", "metadata": null}, {"name": "two", "fi'
            b'eld_name": "two", "pandas_type": "unicode", "numpy_type": "objec'
            b't", "metadata": null}, {"name": "three", "field_name": "three", '
            b'"pandas_type": "bool", "numpy_type": "bool", "metadata": null}, '
            b'{"name": null, "field_name": "__index_level_0__", "pandas_type":'
            b' "int64", "numpy_type": "int64", "metadata": null}], "pandas_ver'
            b'sion": "0.23.4"}'}

We can now write the table with `write_table`, passing the name of the Parquet file.

In [8]:
pq.write_table(table, 'example.parquet')

And we can read our Parquet file back

In [9]:
table2 = pq.read_table('example.parquet')

In [10]:
type(table2)

pyarrow.lib.Table

And transform it to a Pandas DataFrame

In [11]:
df = table2.to_pandas()

In [12]:
type(df)

pandas.core.frame.DataFrame

In [13]:
df

|   | one | two | three |
|---|-----|-----|-------|
| 0 | -1.0 | foo | True |
| 1 | NaN | bar | False |
| 2 | 2.5 | baz | True |


We can also specify which columns we want to read

In [14]:
pq.read_table('example.parquet', columns=['one', 'three']).to_pandas()

|   | one | three |
|---|-----|-------|
| 0 | -1.0 | True |
| 1 | NaN | False |
| 2 | 2.5 | True |


And we can read multiple parquet files together

In [15]:
pq.write_table(table, 'example2.parquet')

In [16]:
!ls

README.md                 example.parquet           partitioning.ipynb
env.yml                   example2.parquet          pyarrow_fastparquet.ipynb


In [17]:
my_parquet_files = ["example.parquet",  "example2.parquet"]

In [18]:
files = pq.ParquetDataset(my_parquet_files)

In [19]:
type(files)

pyarrow.parquet.ParquetDataset

In [20]:
files.read().to_pandas()

|   | one | two | three |
|---|-----|-----|-------|
| 0 | -1.0 | foo | True |
| 1 | NaN | bar | False |
| 2 | 2.5 | baz | True |
| 0 | -1.0 | foo | True |
| 1 | NaN | bar | False |
| 2 | 2.5 | baz | True |


PyArrow validates the schema when reading several files together, so every file has to have the same schema. Here is what happens when one of them doesn't

In [21]:
df = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz']})

In [22]:
table = pa.Table.from_pandas(df)

In [23]:
pq.write_table(table, 'example3.parquet')

In [24]:
my_parquet_files = ["example.parquet",  "example2.parquet", "example3.parquet"]

In [25]:
files = pq.ParquetDataset(my_parquet_files)

ValueError: Schema in example3.parquet was different. 
one: double
two: string
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
            b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
            b' [{"name": "one", "field_name": "one", "pandas_type": "float64",'
            b' "numpy_type": "float64", "metadata": null}, {"name": "two", "fi'
            b'eld_name": "two", "pandas_type": "unicode", "numpy_type": "objec'
            b't", "metadata": null}, {"name": null, "field_name": "__index_lev'
            b'el_0__", "pandas_type": "int64", "numpy_type": "int64", "metadat'
            b'a": null}], "pandas_version": "0.23.4"}'}

vs

one: double
two: string
three: bool
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
            b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
            b' [{"name": "one", "field_name": "one", "pandas_type": "float64",'
            b' "numpy_type": "float64", "metadata": null}, {"name": "two", "fi'
            b'eld_name": "two", "pandas_type": "unicode", "numpy_type": "objec'
            b't", "metadata": null}, {"name": "three", "field_name": "three", '
            b'"pandas_type": "bool", "numpy_type": "bool", "metadata": null}, '
            b'{"name": null, "field_name": "__index_level_0__", "pandas_type":'
            b' "int64", "numpy_type": "int64", "metadata": null}], "pandas_ver'
            b'sion": "0.23.4"}'}

## Using fastparquet

`fastparquet`: https://github.com/dask/fastparquet

In [26]:
from fastparquet import ParquetFile, write

In [27]:
df

|   | one | two |
|---|-----|-----|
| 0 | -1.0 | foo |
| 1 | NaN | bar |
| 2 | 2.5 | baz |


In [28]:
write('example.fastparq', df)

We can write a Parquet file by passing our pandas DataFrame directly. `write` also supports many other options, such as `row_group_offsets` and `file_scheme`.

In [29]:
write('example2.fastparq', df, row_group_offsets=1, compression='GZIP', file_scheme='hive')

This doesn't generate a single file but a new folder, with 3 separate files (one for each row) plus the metadata.

In [30]:
ParquetFile('example2.fastparq').to_pandas()

|   | one | two |
|---|-----|-----|
| 0 | -1.0 | foo |
| 1 | NaN | bar |
| 2 | 2.5 | baz |


Reading works exactly as before, even though the data is now split across several files

We can also read a Parquet file created with PyArrow, such as the `example3.parquet` we wrote earlier

In [31]:
ParquetFile('example3.parquet').to_pandas()

|   | one | two |
|---|-----|-----|
| 0 | -1.0 | foo |
| 1 | NaN | bar |
| 2 | 2.5 | baz |
