# PyArrow 🆚 fastparquet

This is a quick comparison between pyarrow and fastparquet.

New to Parquet and Arrow? Check out this talk 👉 https://www.youtube.com/watch?v=wdmf1msbtVs

- [PyArrow](#PyArrow)
- [fastparquet](#Fastparquet)
- [Partitioning with PyArrow](#Partitioning-with-PyArrow)
- [Partitioning with fastparquet](#Partitioning-with-fastparquet)
- [Compatibility](#Compatibility)

### More info

- Parquet: https://parquet.apache.org/
- Arrow: https://arrow.apache.org/

In [1]:
from glob import glob
from pathlib import Path
import warnings

import numpy as np
import pandas as pd

In [2]:
warnings.filterwarnings('ignore')

## PyArrow

pyarrow: https://arrow.apache.org/docs/python/index.html

In [3]:
import pyarrow.parquet as pq
import pyarrow as pa

Let's create a new dataframe

In [4]:
df = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})

In [5]:
df

|   | one  | two | three |
|---|------|-----|-------|
| 0 | -1.0 | foo | True  |
| 1 | NaN  | bar | False |
| 2 | 2.5  | baz | True  |


We can convert it into an Arrow table

In [6]:
table = pa.Table.from_pandas(df)

In [7]:
type(table)

pyarrow.lib.Table

In [8]:
table

pyarrow.Table
one: double
two: string
three: bool
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
            b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
            b' [{"name": "one", "field_name": "one", "pandas_type": "float64",'
            b' "numpy_type": "float64", "metadata": null}, {"name": "two", "fi'
            b'eld_name": "two", "pandas_type": "unicode", "numpy_type": "objec'
            b't", "metadata": null}, {"name": "three", "field_name": "three", '
            b'"pandas_type": "bool", "numpy_type": "bool", "metadata": null}, '
            b'{"name": null, "field_name": "__index_level_0__", "pandas_type":'
            b' "int64", "numpy_type": "int64", "metadata": null}], "pandas_ver'
            b'sion": "0.23.4"}'}

We can now write the table to disk with `write_table`, passing the name of the Parquet file

In [9]:
pq.write_table(table, 'pyarrow.0.parquet')

and we can read it back

In [10]:
table = pq.read_table('pyarrow.0.parquet')

and transform it to a Pandas DataFrame

In [11]:
df = table.to_pandas()

In [12]:
type(df)

pandas.core.frame.DataFrame

In [13]:
df

|   | one  | two | three |
|---|------|-----|-------|
| 0 | -1.0 | foo | True  |
| 1 | NaN  | bar | False |
| 2 | 2.5  | baz | True  |


We can also specify which columns we want to read

In [14]:
pq.read_table('pyarrow.0.parquet', columns=['one', 'three']).to_pandas()

|   | one  | three |
|---|------|-------|
| 0 | -1.0 | True  |
| 1 | NaN  | False |
| 2 | 2.5  | True  |


And we can read multiple parquet files together

In [15]:
pq.write_table(table, 'pyarrow.1.parquet')

In [16]:
!ls

README.md                 pyarrow.0.parquet         pyarrow_fastparquet.ipynb
env.yml                   pyarrow.1.parquet


In [17]:
parquet_files = glob("*.parquet")

In [18]:
files = pq.ParquetDataset(parquet_files)

In [19]:
files.read().to_pandas()

|   | one  | two | three |
|---|------|-----|-------|
| 0 | -1.0 | foo | True  |
| 1 | NaN  | bar | False |
| 2 | 2.5  | baz | True  |
| 0 | -1.0 | foo | True  |
| 1 | NaN  | bar | False |
| 2 | 2.5  | baz | True  |


Parquet files have a schema. When you read multiple files together they must all share the same schema, otherwise you will get an error.

This new dataframe has 2 columns instead of 3

In [20]:
df = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz']})

In [21]:
table = pa.Table.from_pandas(df)

In [22]:
pq.write_table(table, 'pyarrow.2.parquet')

In [23]:
parquet_files = glob("*.parquet")

In [24]:
files = pq.ParquetDataset(parquet_files)

ValueError: Schema in pyarrow.2.parquet was different. 
one: double
two: string
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
            b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
            b' [{"name": "one", "field_name": "one", "pandas_type": "float64",'
            b' "numpy_type": "float64", "metadata": null}, {"name": "two", "fi'
            b'eld_name": "two", "pandas_type": "unicode", "numpy_type": "objec'
            b't", "metadata": null}, {"name": null, "field_name": "__index_lev'
            b'el_0__", "pandas_type": "int64", "numpy_type": "int64", "metadat'
            b'a": null}], "pandas_version": "0.23.4"}'}

vs

one: double
two: string
three: bool
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
            b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
            b' [{"name": "one", "field_name": "one", "pandas_type": "float64",'
            b' "numpy_type": "float64", "metadata": null}, {"name": "two", "fi'
            b'eld_name": "two", "pandas_type": "unicode", "numpy_type": "objec'
            b't", "metadata": null}, {"name": "three", "field_name": "three", '
            b'"pandas_type": "bool", "numpy_type": "bool", "metadata": null}, '
            b'{"name": null, "field_name": "__index_level_0__", "pandas_type":'
            b' "int64", "numpy_type": "int64", "metadata": null}], "pandas_ver'
            b'sion": "0.23.4"}'}

## Fastparquet

`fastparquet`: https://github.com/dask/fastparquet

In [25]:
from fastparquet import ParquetFile, write

In [26]:
df = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})

In [27]:
df

|   | one  | two | three |
|---|------|-----|-------|
| 0 | -1.0 | foo | True  |
| 1 | NaN  | bar | False |
| 2 | 2.5  | baz | True  |


In [28]:
write('fastparq.0.parquet', df)

We can write a Parquet file by passing our pandas DataFrame directly. `write` also supports many other options, such as `row_group_offsets` and `file_scheme`

In [29]:
write('fastparq.parquet', df, row_group_offsets=1, file_scheme='hive')

This generates a folder rather than a single file: it contains 3 separate files (one for each row) plus the metadata

In [49]:
!ls fastparq.parquet/

_common_metadata part.0.parquet   part.2.parquet
_metadata        part.1.parquet


In [30]:
ParquetFile('fastparq.parquet').to_pandas()

|   | one  | two | three |
|---|------|-----|-------|
| 0 | -1.0 | foo | True  |
| 1 | NaN  | bar | False |
| 2 | 2.5  | baz | True  |


Reading works exactly as before, even with multiple files

fastparquet can also read the Parquet file created with PyArrow

In [31]:
ParquetFile('pyarrow.0.parquet').to_pandas()

|   | one  | two | three |
|---|------|-----|-------|
| 0 | -1.0 | foo | True  |
| 1 | NaN  | bar | False |
| 2 | 2.5  | baz | True  |


## Partitioning with PyArrow

In [32]:
df = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})

In [33]:
df

|   | one  | two | three |
|---|------|-----|-------|
| 0 | -1.0 | foo | True  |
| 1 | NaN  | bar | False |
| 2 | 2.5  | baz | True  |


In [34]:
table = pa.Table.from_pandas(df)

We can write it to disk partitioned by the column `three`. This creates multiple folders, one for each unique value in the `three` column

In [35]:
pq.write_to_dataset(table, 'pyarrow', partition_cols=['three'])

In [36]:
!ls pyarrow/

[1m[36mthree=False[m[m [1m[36mthree=True[m[m


The `three=False` and `three=True` folders correspond to the unique values in the `three` column

We can read it back just by using the folder name

In [37]:
pq.read_table('pyarrow').to_pandas()

|   | one  | two | three |
|---|------|-----|-------|
| 1 | NaN  | bar | False |
| 0 | -1.0 | foo | True  |
| 2 | 2.5  | baz | True  |


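Inside the `three=False` folder we have our Parquet file, which we can read directly. Note that the `three` column is not stored in the file itself, since its value is encoded in the folder name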

In [38]:
parquet_file = glob("pyarrow/three=False/*.parquet")[0]

In [39]:
parquet_file

'pyarrow/three=False/ff02b95d8b2e47cbb757fe72d5ccb2a6.parquet'

In [40]:
pq.read_table(parquet_file).to_pandas()

|   | one | two |
|---|-----|-----|
| 1 | NaN | bar |
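
The file read directly has no `three` column because that value lives only in the folder name. As a rough illustration of how such a `key=value` layout can be decoded, here is a sketch using only the standard library. `partition_values` is a hypothetical helper, not a pyarrow function, and it returns the values as plain strings without inferring types.

```python
from pathlib import PurePosixPath


def partition_values(path):
    """Extract hive-style key=value pairs from the folders of a file path."""
    values = {}
    for part in PurePosixPath(path).parent.parts:
        key, sep, value = part.partition("=")
        if sep:  # keep only folders that look like key=value
            values[key] = value
    return values


example = "pyarrow/three=False/ff02b95d8b2e47cbb757fe72d5ccb2a6.parquet"
print(partition_values(example))
```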


## Partitioning with fastparquet

In [41]:
df = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})

In [42]:
df

|   | one  | two | three |
|---|------|-----|-------|
| 0 | -1.0 | foo | True  |
| 1 | NaN  | bar | False |
| 2 | 2.5  | baz | True  |


To partition a dataframe with fastparquet we need to pass `hive` as `file_scheme`, otherwise `partition_on` will be ignored

In [43]:
write('fastparquet', df, row_group_offsets=1, partition_on=['three'], file_scheme='hive')

The folder structure differs from the one pyarrow creates: fastparquet also writes the `_metadata` and `_common_metadata` files.

In [44]:
!ls fastparquet/

_common_metadata _metadata        three=False      three=True


Reading works exactly as before

In [45]:
ParquetFile('fastparquet').to_pandas()

|   | one  | two | three |
|---|------|-----|-------|
| 0 | -1.0 | foo | True  |
| 1 | NaN  | bar | False |
| 2 | 2.5  | baz | True  |


## Compatibility

Both pyarrow and fastparquet can read a single Parquet file, but fastparquet cannot read partitions created with pyarrow

In [46]:
ParquetFile('pyarrow').to_pandas()

IsADirectoryError: [Errno 21] Is a directory: 'pyarrow'

More info here: https://github.com/dask/fastparquet/issues/364

A trick is to use `glob`

In [47]:
ParquetFile(glob("pyarrow/**/*.parquet", recursive=True)).to_pandas()

|   | one  | two | three |
|---|------|-----|-------|
| 0 | -1.0 | foo | True  |
| 2 | 2.5  | baz | True  |
| 1 | NaN  | bar | False |
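
To make clearer what the recursive pattern passed to `glob` picks up, the sketch below recreates a folder layout like the one pyarrow produced, using empty placeholder files in a temporary directory (the name `data.parquet` is made up for the example), and lists the matches.

```python
import tempfile
from glob import glob
from pathlib import Path

with tempfile.TemporaryDirectory() as root:
    # Recreate the partition folders with empty placeholder files.
    for folder in ("three=False", "three=True"):
        target = Path(root, "pyarrow", folder)
        target.mkdir(parents=True)
        (target / "data.parquet").touch()

    # "**" together with recursive=True descends into every partition folder.
    pattern = str(Path(root, "pyarrow", "**", "*.parquet"))
    matches = sorted(
        Path(p).relative_to(root).as_posix()
        for p in glob(pattern, recursive=True)
    )

print(matches)
```

Without `recursive=True`, `**` behaves like a single `*`, so deeper folder nesting would not be traversed.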


On the other hand, pyarrow can read partitions created using fastparquet

In [48]:
pq.read_table("fastparquet").to_pandas()

|   | one  | two | three |
|---|------|-----|-------|
| 0 | NaN  | bar | False |
| 1 | -1.0 | foo | True  |
| 2 | 2.5  | baz | True  |
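
Partitioned reads return the same rows but not necessarily in the original order or with the original index, as the outputs above show. One possible way to check that two results hold the same data is to sort both before comparing. This is a sketch under assumptions: `same_rows` is a hypothetical helper, it needs a column with unique values to sort by, and it also requires matching dtypes, so a partition column that comes back with a different dtype (for example categorical) would have to be cast first.

```python
import numpy as np
import pandas as pd


def same_rows(left, right, by):
    """Compare two frames while ignoring row order and index labels."""
    columns = sorted(left.columns)
    if sorted(right.columns) != columns:
        return False
    a = left[columns].sort_values(by).reset_index(drop=True)
    b = right[columns].sort_values(by).reset_index(drop=True)
    # DataFrame.equals treats NaN values in the same position as equal.
    return a.equals(b)


original = pd.DataFrame({'one': [-1, np.nan, 2.5],
                         'two': ['foo', 'bar', 'baz'],
                         'three': [True, False, True]})

# Simulate a read that returned the rows in a different order.
reordered = original.iloc[[1, 0, 2]].reset_index(drop=True)

print(same_rows(original, reordered, by='two'))
```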
