ARROW-17483: [Python] Support Expression filters in non-legacy ParquetDataset/read_table #14011

Merged
merged 6 commits into from
Sep 13, 2022

Conversation

milesgranger
Contributor

@milesgranger milesgranger commented Aug 31, 2022

No description provided.

Member

@AlenkaF AlenkaF left a comment

Looks good!

Since this PR adds support for expressions, I assume expressions on nested fields also work? Could we add that to the tests, something like:

import numpy as np
import pandas as pd

integer_keys = [0, 1, 2, 3, 4]
df = pd.DataFrame({
    'index': np.arange(len(integer_keys)),
    'integers': np.array(integer_keys, dtype='i4'),
    # Object column of dicts; converts to a struct column with fields 'a' and 'b'
    'nested': np.array([{'a': j % 3, 'b': str(j % 3)} for j in range(5)])
}, columns=['index', 'integers', 'nested'])

and add pc.field("nested", "b") == 1 to the fixture?

I haven't tested it, but am curious if this works.

python/pyarrow/tests/parquet/test_dataset.py (outdated, resolved)
@milesgranger
Contributor Author

milesgranger commented Aug 31, 2022

Thanks @AlenkaF!
Extended the tests as suggested (it works), with slight modifications to keep the filter comparison operation the same across the fixtures, if that's acceptable.

AlenkaF referenced this pull request Aug 31, 2022
This PR tries to redo the work from #9799.

It will unblock:
- https://issues.apache.org/jira/browse/ARROW-13798
- https://issues.apache.org/jira/browse/ARROW-14596

cc @jorisvandenbossche @pitrou

Closes #12863 from AlenkaF/ARROW-11259

Lead-authored-by: Alenka Frim <frim.alenka@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@milesgranger
Contributor Author

@jorisvandenbossche, as you're the one who pointed me to this issue, would you like to take a gander?

Member

@jorisvandenbossche jorisvandenbossche left a comment

Thanks! Some minor comments

python/pyarrow/parquet/core.py (outdated, resolved)
python/pyarrow/parquet/core.py (outdated, resolved)
Member

@jorisvandenbossche jorisvandenbossche left a comment

Thanks!

@jorisvandenbossche jorisvandenbossche changed the title from "ARROW-17483: [Python] Support Expression filters in _ParquetDatasetV2" to "ARROW-17483: [Python] Support Expression filters in non-legacy ParquetDataset/read_table" on Sep 13, 2022
@jorisvandenbossche jorisvandenbossche merged commit 8ceb3b8 into apache:master Sep 13, 2022
@milesgranger milesgranger deleted the ARROW-17483_expressions-in-read_table-filter branch September 13, 2022 08:04
@ursabot

ursabot commented Sep 13, 2022

Benchmark runs are scheduled for baseline = ef8cb09 and contender = 8ceb3b8. 8ceb3b8 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.37% ⬆️0.0%] test-mac-arm
[Failed ⬇️4.51% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.46% ⬆️0.11%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 8ceb3b8a ec2-t3-xlarge-us-east-2
[Finished] 8ceb3b8a test-mac-arm
[Failed] 8ceb3b8a ursa-i9-9960x
[Finished] 8ceb3b8a ursa-thinkcentre-m75q
[Finished] ef8cb099 ec2-t3-xlarge-us-east-2
[Finished] ef8cb099 test-mac-arm
[Failed] ef8cb099 ursa-i9-9960x
[Finished] ef8cb099 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot

ursabot commented Sep 13, 2022

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

zagto pushed a commit to zagto/arrow that referenced this pull request Oct 7, 2022
…tDataset/read_table (apache#14011)

Authored-by: Miles Granger <miles59923@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
fatemehp pushed a commit to fatemehp/arrow that referenced this pull request Oct 17, 2022
…tDataset/read_table (apache#14011)

Authored-by: Miles Granger <miles59923@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>