Reading a single partition of a partitioned parquet dataset via `filters` is significantly slower than reading the partition directly.
```python
import pandas as pd
import pyarrow.parquet

size = 100_000
df = pd.DataFrame({'a': [1, 2, 3] * size, 'b': [4, 5, 6] * size})
df.to_parquet('test.parquet', partition_cols=['a'])

# Read the a=1 partition directly, then via a filter on the partition key.
%timeit pyarrow.parquet.read_table('test.parquet/a=1')
%timeit pyarrow.parquet.read_table('test.parquet', filters=[('a', '=', 1)])
```
gives the timings (direct read first, filtered read second):

```
2.57 ms ± 41.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.18 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
Likewise, changing `size` to `1_000_000` in the above code gives

```
16.3 ms ± 269 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
32.7 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
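One plausible contributor to the gap (my own guess; I have not profiled it) is dataset discovery: if the filtered `read_table` call is backed by the datasets machinery, it has to enumerate the whole partitioned directory and parse the partition keys before any pruning can happen, whereas the direct call only ever touches the `a=1` directory. A minimal sketch to time the discovery cost alone, assuming the `test.parquet` dataset created above:

```python
import pyarrow.dataset as ds

# Discovery of the full partitioned dataset (directory listing, schema
# inference, partition-key parsing) happens before any pruning can occur.
%timeit ds.dataset('test.parquet', partitioning='hive')

# Discovery of the single partition directory, which is all the direct
# read needs to pay for.
%timeit ds.dataset('test.parquet/a=1')
```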
Part of the docs for `read_table` states:

> Partition keys embedded in a nested directory structure will be exploited to avoid loading files at all if they contain no matching rows.
From this, I expected the performance to be roughly the same.
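For what it's worth, the pruning the docs describe is observable with the `pyarrow.dataset` API. This is a minimal sketch, assuming the `test.parquet` dataset created above, showing that the filter does restrict the files considered to the `a=1` directory:

```python
import pyarrow.dataset as ds

# Discover the hive-partitioned dataset written above.
dataset = ds.dataset('test.parquet', partitioning='hive')

# Only fragments whose partition key matches the filter survive,
# so only files under test.parquet/a=1 should be listed here.
fragments = list(dataset.get_fragments(filter=ds.field('a') == 1))
print([f.path for f in fragments])

# Reading with the same filter loads just those fragments.
table = dataset.to_table(filter=ds.field('a') == 1)
```

So non-matching files do appear to be skipped as documented; the extra time presumably goes to discovery and filter evaluation rather than to reading rows that don't match.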
Reporter: Richard Shadrach
Note: This issue was originally created as ARROW-13369. Please see the migration documentation for further details.