ARROW-17483: [Python] Support Expression filters in non-legacy ParquetDataset/read_table #14011

milesgranger · 2022-08-31T09:58:18Z

No description provided.

github-actions · 2022-08-31T10:09:47Z

https://issues.apache.org/jira/browse/ARROW-17483

AlenkaF

Looks good!

Since this PR adds support for expressions, I assume expressions on nested fields work as well? Could we cover that in the tests, with something like:

integer_keys = [0, 1, 2, 3, 4]
df = pd.DataFrame({
    'index': np.arange(len(integer_keys)),
    'integers': np.array(integer_keys, dtype='i4'),
    'nested': np.array([{'a': j % 3, 'b': str(j % 3)} for j in range(5)])
}, columns=['index', 'integers', 'nested'])

and add pc.field("nested", "b") == 1 to the fixture?

I haven't tested it, but am curious if this works.

python/pyarrow/tests/parquet/test_dataset.py

milesgranger · 2022-08-31T12:08:41Z

Thanks @AlenkaF!
Extended as suggested (it works) with some slight modifications, just to keep the filter comparison operation the same throughout the fixtures if that's acceptable.

@jorisvandenbossche

This PR tries to redo the work from #9799. It will unblock:
- https://issues.apache.org/jira/browse/ARROW-13798
- https://issues.apache.org/jira/browse/ARROW-14596

cc @jorisvandenbossche @pitrou

Closes #12863 from AlenkaF/ARROW-11259

Lead-authored-by: Alenka Frim <frim.alenka@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

python/pyarrow/parquet/core.py

milesgranger · 2022-09-01T07:44:18Z

@jorisvandenbossche, as you're the one who pointed me to this issue, would you like to take a gander?

jorisvandenbossche

Thanks! Some minor comments

python/pyarrow/parquet/core.py

jorisvandenbossche

Thanks!

ursabot · 2022-09-13T19:15:11Z

Benchmark runs are scheduled for baseline = ef8cb09 and contender = 8ceb3b8. 8ceb3b8 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.37% ⬆️0.0%] test-mac-arm
[Failed ⬇️4.51% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.46% ⬆️0.11%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 8ceb3b8a ec2-t3-xlarge-us-east-2
[Finished] 8ceb3b8a test-mac-arm
[Failed] 8ceb3b8a ursa-i9-9960x
[Finished] 8ceb3b8a ursa-thinkcentre-m75q
[Finished] ef8cb099 ec2-t3-xlarge-us-east-2
[Finished] ef8cb099 test-mac-arm
[Failed] ef8cb099 ursa-i9-9960x
[Finished] ef8cb099 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot · 2022-09-13T19:15:33Z

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

…tDataset/read_table (apache#14011) Authored-by: Miles Granger <miles59923@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

github-actions bot added the Component: Python label Aug 31, 2022

AlenkaF approved these changes Aug 31, 2022

python/pyarrow/tests/parquet/test_dataset.py (outdated)

milesgranger commented Aug 31, 2022

python/pyarrow/parquet/core.py (outdated)

jorisvandenbossche reviewed Sep 5, 2022

python/pyarrow/parquet/core.py (outdated)

python/pyarrow/parquet/core.py (outdated)

milesgranger added 6 commits September 12, 2022 09:31

Support Expression filters in _ParquetDatasetV2

3b8776d

Extend test with nested checks

666cbdf

Remove Expression type check and related raising

cf0ddfc

Catch len error to report Malformed filters

ee7cfc5

Remove redundant check for filters type

ed34177

Update Expression error when legacy dataset

e126425

jorisvandenbossche approved these changes Sep 13, 2022

jorisvandenbossche changed the title from "ARROW-17483: [Python] Support Expression filters in _ParquetDatasetV2" to "ARROW-17483: [Python] Support Expression filters in non-legacy ParquetDataset/read_table" Sep 13, 2022

jorisvandenbossche merged commit 8ceb3b8 into apache:master Sep 13, 2022

milesgranger deleted the ARROW-17483_expressions-in-read_table-filter branch September 13, 2022 08:04
