GH-33976: [Python] Remove usage of TableGroupBy helper in favor of pyarrow.acero.Declaration #34769

jorisvandenbossche · 2023-03-29T15:24:39Z

Rationale for this change

Now that we have the pyarrow.acero building blocks (GH-33976), we can easily construct in pyarrow itself the Declaration that arrow::compute::TableGroupBy creates under the hood.

Are these changes tested?

Existing tests are passing.

Are there any user-facing changes?

No

jorisvandenbossche · 2023-03-29T15:27:13Z

@westonpace I don't see the C++ TableGroupBy being used anywhere else (except in tests). Now that we no longer need it for pyarrow, do we still want to keep it (as a convenience for C++ users)? Or should I remove it on the C++ side as well?

BatchGroupBy is used in a benchmark, but we can easily move the 4-line function that constructs the declaration into that file.

westonpace · 2023-03-31T18:03:26Z

> @westonpace I don't see the C++ TableGroupBy being used anywhere else (except in tests). Now that we no longer need it for pyarrow, do we still want to keep it (as a convenience for C++ users)? Or should I remove it on the C++ side as well?

I'm kind of +0 but yes, let's go ahead and remove it. Having a single interface to Acero is probably easier to maintain long-term.

jorisvandenbossche · 2023-04-04T11:29:16Z

To be clear, I am fine with also keeping it if it would be useful on the C++ side (although there too, it's only a few lines of code through the Declaration interface).

westonpace

The Python code looks good. I reviewed some of the group-by code again. I thought it was doing a bit more to patch up the output from the exec plan, but I believe I was remembering an older state of the code. So yes, let's go ahead and remove the C++ side too, if you don't mind.

jorisvandenbossche · 2023-05-02T08:49:21Z

On the C++ side, those are publicly exposed in Acero, I assume (they use ARROW_ACERO_EXPORT). Do we first want to deprecate them, or is it OK to remove them directly given the early stage of Acero?

jorisvandenbossche · 2023-05-02T08:54:20Z

Another question: is groupby_test.cc just testing the helper interface (with the actual hash aggregation tests living elsewhere), or is it useful to keep?
I was first planning to just move the helper that defines and executes the Declaration into groupby_test.cc, to keep the tests running without exposing the helper function. But if that's not needed, I can remove the file altogether (I see there are quite a few tests involving grouped aggregations in plan_test.cc).

westonpace · 2023-05-08T16:58:20Z

@jorisvandenbossche

If you don't mind, keeping the standalone groupby_test would be helpful. I want to encourage splitting plan_test up into tests-per-node (see order_by_node_test and fetch_node_test which are similar in size to groupby_test). Maybe you could rename it to aggregate_node_test?

westonpace

This looks great. Thanks a lot!

ursabot · 2023-06-02T06:47:47Z

Benchmark runs are scheduled for baseline = c1359c5 and contender = 0bb2d83. 0bb2d83 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.41% ⬆️0.09%] test-mac-arm
[Finished ⬇️9.48% ⬆️0.65%] ursa-i9-9960x
[Finished ⬇️0.81% ⬆️0.36%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 0bb2d83a ec2-t3-xlarge-us-east-2
[Finished] 0bb2d83a test-mac-arm
[Finished] 0bb2d83a ursa-i9-9960x
[Finished] 0bb2d83a ursa-thinkcentre-m75q
[Finished] c1359c5f ec2-t3-xlarge-us-east-2
[Finished] c1359c5f test-mac-arm
[Finished] c1359c5f ursa-i9-9960x
[Finished] c1359c5f ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot · 2023-06-02T06:50:10Z

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

GH-33976: [Python] Remove usage of TableGroupby helper in favor of pyarrow.acero.Declaration

34aa3c4

jorisvandenbossche requested a review from AlenkaF as a code owner March 29, 2023 15:24

jorisvandenbossche added 3 commits March 29, 2023 17:32

fixup

7168fb8

Merge remote-tracking branch 'upstream/main' into gh-33976-refactor-groupby

792ad77

fix import

34cb33c

jorisvandenbossche added the Component: Python label Mar 29, 2023

Merge remote-tracking branch 'upstream/main' into gh-33976-refactor-groupby

ce0ea7a

github-actions bot added the awaiting review label Apr 4, 2023

move CAggregate declaration to libarrow

3a71e6a

westonpace approved these changes Apr 10, 2023

github-actions bot added the awaiting merge label and removed the awaiting review label Apr 10, 2023

Merge remote-tracking branch 'upstream/main' into gh-33976-refactor-groupby

75b1817

jorisvandenbossche added 2 commits May 23, 2023 10:28

Merge remote-tracking branch 'upstream/main' into gh-33976-refactor-groupby

0e047bb

remove groupby.cc, rename test file

e6eefac

github-actions bot added the Component: C++ label May 23, 2023

jorisvandenbossche requested a review from westonpace May 24, 2023 08:52

westonpace approved these changes Jun 1, 2023

jorisvandenbossche merged commit 0bb2d83 into apache:main Jun 1, 2023
33 of 34 checks passed

jorisvandenbossche deleted the gh-33976-refactor-groupby branch June 1, 2023 14:02

jorisvandenbossche mentioned this pull request Jul 19, 2023

[Python] Guarantee that group_by has stable ordering. #36709

Closed
