Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-15609: [C++][Compute] Support hash_aggregate with only keys #12369

Closed
wants to merge 1 commit into from

Conversation

Crystrix
Copy link
Contributor

@Crystrix Crystrix commented Feb 8, 2022

Currently, hash_aggregate requires keys, arguments, and aggregation functions, sometimes we want to execute group_by without any aggregation functions.

for (auto& state : states) {
ARROW_ASSIGN_OR_RAISE(state, InitKernels(kernels, ctx, aggregates, argument_descrs));
}
if (!arguments.empty()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Init arguments related variables only if arguments exists

@@ -2586,7 +2592,8 @@ Result<Datum> GroupBy(const std::vector<Datum>& arguments, const std::vector<Dat

// start "streaming" execution
ExecBatch key_batch, argument_batch;
while (argument_batch_iterator->Next(&argument_batch) &&
while ((argument_batch_iterator == NULLPTR ||
argument_batch_iterator->Next(&argument_batch)) &&
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

argument_batch_iterator == NULLPTR means there are no arguments, the reading and check of argument_batch can be skipped

@github-actions
Copy link

github-actions bot commented Feb 8, 2022

@Crystrix
Copy link
Contributor Author

@pitrou, @lidavidm Could you please help to review?

@lidavidm lidavidm self-requested a review February 16, 2022 13:23
Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So basically, this is equivalent to selecting all unique tuples of the input keys? I suppose we have similar vector kernels already, but being vector kernels they're not usable in the query engine, and this has the added benefit of being incrementally computed.

And it looks like this already works with ExecPlan, so we only needed to change the helper GroupBy function.

@Crystrix
Copy link
Contributor Author

So basically, this is equivalent to selecting all unique tuples of the input keys?

Yes, the result is the same.

and this has the added benefit of being incrementally computed

That's right, and then such SQL query can be supported
SELECT foo FROM table GROUP BY foo

@lidavidm lidavidm closed this in e4e866f Feb 16, 2022
@lidavidm
Copy link
Member

👍 thank you!

@ursabot
Copy link

ursabot commented Feb 16, 2022

Benchmark runs are scheduled for baseline = 3a8e409 and contender = e4e866f. e4e866f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Failed] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Failed] ursa-i9-9960x
[Failed] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants