Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-35979: [C++] Refactor Acero scalar and hash aggregation into separate files #35980

Merged
merged 4 commits into from
Jun 13, 2023

Conversation

ildipo
Copy link
Contributor

@ildipo ildipo commented Jun 7, 2023

Rationale for this change

Some refactoring to simplify relations development and pave the way for implementing window aggregation.

What changes are included in this PR?

Existing Acero aggregation (scalar and group-by) sources have been refactored into separate files with no changes.

…sting aggregation code

implement window_scalar_aggregation
implement window_groupby_aggregation
add window_*_aggregation tests
@ildipo ildipo requested a review from westonpace as a code owner June 7, 2023 18:00
@github-actions
Copy link

github-actions bot commented Jun 7, 2023

⚠️ GitHub issue #35979 has been automatically assigned in GitHub to PR creator.

@@ -29,7 +29,11 @@ endmacro()

set(ARROW_ACERO_SRCS
accumulation_queue.cc
aggregate_node.cc
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aggregate_node.cc was split into scalar_aggregate_node.cc, groupby_aggregate_node.cc and aggregate_internal.cc - the existing code was moved as-is

@icexelloss
Copy link
Contributor

icexelloss commented Jun 7, 2023

Thanks @ildipo!

@westonpace Since this is a fairly large PR I think it might be beneficial if reviewers get an understanding of high level design and get on the same page (maybe a meeting? Or some shared doc?). WDYT?

Also @ianmcook IIRC you also mentioned interested in reviewing windowing design/functionality

@@ -0,0 +1,276 @@
// Licensed to the Apache Software Foundation (ASF) under one
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of this was inside aggregate_node.cc

@@ -0,0 +1,447 @@
// Licensed to the Apache Software Foundation (ASF) under one
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of this was inside aggregate_node.cc

@@ -0,0 +1,322 @@
// Licensed to the Apache Software Foundation (ASF) under one
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of this was inside aggregate_node.cc

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jun 7, 2023
@ildipo ildipo force-pushed the window-aggregation branch 2 times, most recently from 230124c to 4b4723b Compare June 7, 2023 18:20
@westonpace
Copy link
Member

@icexelloss happy to meet in person and go over the high level design. Just from a glance it looks like the aggregate node was split into the scalar and group-by cases with common parts put into aggregate_internal (I'm guessing to help share code between the aggregate cases and the window aggregates). Then two new window nodes were added. It seems that both of these nodes rely on the InputReceived being called in a sorted order in the same way that the asof join node does?

@ildipo
Copy link
Contributor Author

ildipo commented Jun 8, 2023

@westonpace that is correct

@ildipo
Copy link
Contributor Author

ildipo commented Jun 12, 2023

I decided to split this into 2 to simplify review.
This now contains only refactoring and the additional options in the relations but not the window implementation.

int total_output_batches_ = 0;
};

class WindowScalarNode : public ExecNode, public TracedNode {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this?

@@ -297,6 +297,26 @@ class ARROW_ACERO_EXPORT ProjectNodeOptions : public ExecNodeOptions {
std::vector<std::string> names;
};

/// windows proceed left to right
class ARROW_ACERO_EXPORT WindowAggregateArgs {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revert this?

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Jun 12, 2023
@icexelloss
Copy link
Contributor

@ildipo I am in favor of splitting up aggregate code to make it more reusable. +1 for the refactoring. It looks like there are still some windowing related changes in this PR though.

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jun 12, 2023
@icexelloss icexelloss changed the title GH-35979: [C++] Acero - Add basic scalar and group-by window aggregation GH-35979: [C++] Refactor Acero scalar and hash aggregation into separate files Jun 12, 2023
@@ -69,3 +69,4 @@ message SegmentedAggregateRel {
// A list of one or more aggregate expressions along with an optional filter.
repeated substrait.AggregateRel.Measure measures = 3;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: revert this change? (Since we don't touch this file at all probably cleaner if we don't leave a new line change)

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Jun 12, 2023
@icexelloss
Copy link
Contributor

@ildipo LGTM. Minor comments on the substrait file (since we don't change it probably better not to leave untouched)

Lint issue probably need:

archery lint --cmake-format

Do you mind also updating the title and description to better reflect the reduced change set?

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jun 12, 2023
@icexelloss
Copy link
Contributor

LGTM. cc @westonpace in case you want to take a look. (Should be pretty straight forward)

Copy link
Contributor

@icexelloss icexelloss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Jun 12, 2023
@icexelloss
Copy link
Contributor

@westonpace I plan to merge this later Today if you don't have objections.

@icexelloss icexelloss merged commit dadf665 into apache:main Jun 13, 2023
34 of 35 checks passed
@icexelloss
Copy link
Contributor

@westonpace Since I didn't hear from you I went ahead and merged this. (No changes to external headers so we should be fine).

But let me know in case you have some more comments we can do it as a follow up.

@conbench-apache-arrow
Copy link

Conbench analyzed the 7 benchmark runs on commit dadf6659.

There were 4 benchmark results indicating a performance regression:

The full Conbench report has more details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++] Refactor Acero scalar and hash aggregation into separate files
3 participants