# Partitioning with PyArrow and Fastparquet

- [PyArrow](#PyArrow)
- [fastparquet](#fastparquet)
- [Compatibility](#Compatibility)

In [154]:
from glob import glob

import numpy as np
import pandas as pd

## PyArrow

In [156]:
import pyarrow.parquet as pq
import pyarrow as pa

We will use the same dataframe as before

In [157]:
df = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})

In [158]:
df

Unnamed: 0,one,two,three
0,-1.0,foo,True
1,,bar,False
2,2.5,baz,True


In [163]:
table = pa.Table.from_pandas(df)

We can write the table to disk partitioned by the `three` column. This creates one folder for each unique value in that column.

In [164]:
pq.write_to_dataset(table, 'example_pyarrow', partition_cols=['three'], compression='gzip')

In [138]:
!ls example_pyarrow/

three=False  three=True


The folders `three=False` and `three=True` correspond to the unique values in the `three` column.
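The `column=value` folder naming can be decoded with plain string handling. The sketch below is only an illustration of the layout, not how PyArrow parses it internally; the helper `partition_values` and the file name `part-0.parquet` are hypothetical.

```python
from pathlib import PurePosixPath


def partition_values(path):
    """Return {column: value} parsed from hive-style `column=value` path segments."""
    values = {}
    for segment in PurePosixPath(path).parts:
        if "=" in segment:
            # Split on the first "=" only, so values may themselves contain "=".
            column, _, value = segment.partition("=")
            values[column] = value
    return values


# "part-0.parquet" is a made-up file name used only for this example.
example = partition_values("example_pyarrow/three=False/part-0.parquet")
```

Note that the decoded values are plain strings; it is up to the reader library to decide which type to give the partition column.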

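Inside the `three=False` folder we have the Parquet file for that partition, which we can read directly.

We can read the whole dataset back by passing just the folder name. The repeated rows in the output below most likely come from the write cell having been run more than once: each `write_to_dataset` call adds new files to the existing partition folders rather than replacing them.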

In [167]:
pq.read_table('example_pyarrow').to_pandas()

Unnamed: 0,one,two,three
1,,bar,False
1,,bar,False
1,,bar,False
1,,bar,False
0,-1.0,foo,True
2,2.5,baz,True
0,-1.0,foo,True
2,2.5,baz,True
0,-1.0,foo,True
2,2.5,baz,True


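We can also read a single Parquet file directly. Note that the `three` column is missing from the result: its value is encoded in the folder name, not stored inside the file.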

In [168]:
pq.read_table(glob("example_pyarrow/three=False/*.parquet")[0]).to_pandas()

Unnamed: 0,one,two
1,,bar


## fastparquet

In [169]:
from fastparquet import ParquetFile, write

In [170]:
df

Unnamed: 0,one,two,three
0,-1.0,foo,True
1,,bar,False
2,2.5,baz,True


We need to pass `hive` as `file_scheme`, otherwise `partition_on` is ignored.

In [144]:
write('example_fastparq', df, row_group_offsets=1, partition_on=['three'], file_scheme='hive')

The folder structure differs from PyArrow's: fastparquet also writes `_common_metadata` and `_metadata` files at the top level.

In [172]:
!ls example_fastparq/

_common_metadata  _metadata  three=False  three=True


But reading works exactly as before.

In [171]:
ParquetFile('example_fastparq').to_pandas()

Unnamed: 0,one,two,three
0,-1.0,foo,True
1,,bar,False
2,2.5,baz,True


We can also read a single Parquet file created with fastparquet directly. As with PyArrow, the `three` column is not part of the file itself.

In [146]:
ParquetFile('example_fastparq/three=False/part.1.parquet').to_pandas()

Unnamed: 0,one,two
0,,bar


## Compatibility

It seems fastparquet cannot read partitions created with PyArrow when given only the folder name:

In [147]:
ParquetFile('example_pyarrow').to_pandas()

IsADirectoryError: [Errno 21] Is a directory: 'example_pyarrow'

See the related issue: https://github.com/dask/fastparquet/issues/364

A workaround is to use `glob` and pass the list of individual Parquet files instead of the folder.

In [174]:
ParquetFile(glob("example_pyarrow/**/*.parquet", recursive=True)).to_pandas()

Unnamed: 0,one,two,three
0,-1.0,foo,True
2,2.5,baz,True
0,-1.0,foo,True
2,2.5,baz,True
0,-1.0,foo,True
2,2.5,baz,True
0,-1.0,foo,True
2,2.5,baz,True
1,,bar,False
1,,bar,False
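To see what the `**` pattern with `recursive=True` actually collects, here is a small standard-library sketch. It builds a throwaway folder tree with empty placeholder files (the name `part.parquet` is hypothetical, and the files are not valid Parquet) and lists the matches.

```python
import os
import tempfile
from glob import glob

with tempfile.TemporaryDirectory() as root:
    # Mimic the partition layout with empty placeholder files.
    for folder in ("three=False", "three=True"):
        os.makedirs(os.path.join(root, folder))
        open(os.path.join(root, folder, "part.parquet"), "w").close()

    # "**" matches any depth of sub-folders when recursive=True.
    files = sorted(glob(os.path.join(root, "**", "*.parquet"), recursive=True))
    relative = [os.path.relpath(f, root) for f in files]
```

Each matched path still contains its `three=...` folder, which is how the partition value remains recoverable when a list of files is passed instead of the folder.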


PyArrow, however, can read partitions created with fastparquet.

In [175]:
pq.read_table("example_fastparq").to_pandas()

Unnamed: 0,one,two,three
0,,bar,False
1,-1.0,foo,True
2,2.5,baz,True
