Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-33976: [Python] Remove usage of TableGroupBy helper in favor of pyarrow.acero.Declaration #34769

Merged

Conversation

jorisvandenbossche
Copy link
Member

Rationale for this change

Now we have the pyarrow.acero building blocks (GH-33976), we can easily construct the Declaration, which arrow::compute::TableGroupBy creates under the hood, ourselves in pyarrow.

Are these changes tested?

Existing tests are passing.

Are there any user-facing changes?

No

@jorisvandenbossche
Copy link
Member Author

@westonpace I don't see the C++ TableGroupBy being used anywhere else (except in tests). Now we don't need this anymore for pyarrow, do we still want to keep this? (as convenience for C++ users) Or do I remove this on the C++ side as well?

BatchGroupBy is used in a benchmark, but we can easily move the 4-line function to construct the declaration to that file.

@westonpace
Copy link
Member

@westonpace I don't see the C++ TableGroupBy being used anywhere else (except in tests). Now we don't need this anymore for pyarrow, do we still want to keep this? (as convenience for C++ users) Or do I remove this on the C++ side as well?

I'm kind of +0 but yes, let's go ahead and remove it. Having a single interface to Acero is probably easier to maintain long-term.

@jorisvandenbossche
Copy link
Member Author

To be clear I am fine with also keeping it if it would be useful on the C++ side (although also there, it's only a few lines of code through the declaration interface)

@github-actions github-actions bot added the awaiting review Awaiting review label Apr 4, 2023
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The python code looks good. I reviewed some of the group by code again. I thought it was doing a bit more to patch up the output from the exec plan. However, I believe I am remembering an older state of the code. So yes, let's go ahead and remove the C++ side too if you don't mind.

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting review Awaiting review labels Apr 10, 2023
@jorisvandenbossche
Copy link
Member Author

On the C++ side, those are publicly exposed in Acero I assume (use of ARROW_ACERO_EXPORT). Do we first want to deprecate those, or it is OK to directly remove them given the early stage of Acero?

@jorisvandenbossche
Copy link
Member Author

Another question: groupby_test.cc, is that just testing the helper interface (and the actual hash aggregation tests are living elsewhere), or is that useful to keep?
I was first planning to just move the helper defining/executing the Declaration to groupby_test.cc to keep it running without exposing the helper function. But if that's not needed, I can just remove the file altogether (I see there are quite some tests involving grouped aggregations in plan_test.cc)

@westonpace
Copy link
Member

@jorisvandenbossche

If you don't mind, keeping the standalone groupby_test would be helpful. I want to encourage splitting plan_test up into tests-per-node (see order_by_node_test and fetch_node_test which are similar in size to groupby_test). Maybe you could rename it to aggregate_node_test?

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. Thanks a lot!

@jorisvandenbossche jorisvandenbossche merged commit 0bb2d83 into apache:main Jun 1, 2023
33 of 34 checks passed
@jorisvandenbossche jorisvandenbossche deleted the gh-33976-refactor-groupby branch June 1, 2023 14:02
@ursabot
Copy link

ursabot commented Jun 2, 2023

Benchmark runs are scheduled for baseline = c1359c5 and contender = 0bb2d83. 0bb2d83 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.41% ⬆️0.09%] test-mac-arm
[Finished ⬇️9.48% ⬆️0.65%] ursa-i9-9960x
[Finished ⬇️0.81% ⬆️0.36%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 0bb2d83a ec2-t3-xlarge-us-east-2
[Finished] 0bb2d83a test-mac-arm
[Finished] 0bb2d83a ursa-i9-9960x
[Finished] 0bb2d83a ursa-thinkcentre-m75q
[Finished] c1359c5f ec2-t3-xlarge-us-east-2
[Finished] c1359c5f test-mac-arm
[Finished] c1359c5f ursa-i9-9960x
[Finished] c1359c5f ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot
Copy link

ursabot commented Jun 2, 2023

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants