
[C++][python] performance of read_table using filters on a partitioned parquet file #18743

Open
asfimport opened this issue Jul 17, 2021 · 0 comments

Reading a single partition of a partitioned Parquet dataset via filters is significantly slower than reading that partition's directory directly.

import pandas as pd
import pyarrow.parquet

size = 100_000
df = pd.DataFrame({'a': [1, 2, 3] * size, 'b': [4, 5, 6] * size})
df.to_parquet('test.parquet', partition_cols=['a'])
%timeit pyarrow.parquet.read_table('test.parquet/a=1')
%timeit pyarrow.parquet.read_table('test.parquet', filters=[('a', '=', 1)])

gives the timings

2.57 ms ± 41.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.18 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Likewise, changing size to 1_000_000 in the above code gives

16.3 ms ± 269 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
32.7 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
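For reference, roughly the same comparison can be reproduced outside IPython using timeit instead of %timeit. This is a minimal sketch under the same setup as above; absolute numbers will of course vary by machine:

import timeit

import pandas as pd
import pyarrow.parquet as pq

size = 100_000
df = pd.DataFrame({'a': [1, 2, 3] * size, 'b': [4, 5, 6] * size})
# Note: re-running this appends new files into the existing partitions.
df.to_parquet('test.parquet', partition_cols=['a'])

# Time reading the partition directory directly.
direct = timeit.timeit(
    lambda: pq.read_table('test.parquet/a=1'), number=100)
# Time reading the whole dataset with a filter on the partition column.
filtered = timeit.timeit(
    lambda: pq.read_table('test.parquet', filters=[('a', '=', 1)]), number=100)

print(f"direct:   {direct / 100 * 1e3:.2f} ms per call")
print(f"filtered: {filtered / 100 * 1e3:.2f} ms per call")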

Part of the docs for read_table states:

 Partition keys embedded in a nested directory structure will be exploited to avoid loading files at all if they contain no matching rows.

Based on this, I expected the filtered read to perform roughly the same as reading the partition directory directly.
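One way to check whether the filter really is pruning files, as the docs suggest, is to list the fragments selected for the same predicate with the pyarrow.dataset API. This is a sketch assuming the hive-partitioned dataset written by the snippet above:

import pyarrow.dataset as ds

# Discover the hive-partitioned dataset written above.
dataset = ds.dataset('test.parquet', format='parquet', partitioning='hive')

# Only fragments whose partition expression can satisfy a=1 are returned,
# so the other partition directories should never be read.
fragments = list(dataset.get_fragments(filter=ds.field('a') == 1))
print([frag.path for frag in fragments])

# The same predicate can be pushed down when materializing the table.
table = dataset.to_table(filter=ds.field('a') == 1)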

Reporter: Richard Shadrach

Note: This issue was originally created as ARROW-13369. Please see the migration documentation for further details.
