Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Hash aggregate function that returns value from first row in group #29593

Closed
asfimport opened this issue Sep 14, 2021 · 12 comments
Closed

Comments

@asfimport
Copy link

asfimport commented Sep 14, 2021

It would be nice to have a hash aggregate function that returns the first value of a column within each hash group.

If row order within groups is non-deterministic, then effectively this would return one arbitrary value. This is a very computationally cheap operation.

This can be quite useful when querying a non-normalized table. For example if you have a table with a country column and also a country_abbr column and you want to group by either/both of those columns but return the values from both columns, you could do

SELECT country, country_abbr FROM table GROUP BY country, country_abbr

but it would be more efficient to do

SELECT country, first(country_abbr) FROM table GROUP BY country

because then the engine does not need to scan all the values of the country_abbr column.

Reporter: Ian Cook / @ianmcook
Assignee: Dhruv Vats / @dhruv9vats

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-13993. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Ian Cook / @ianmcook:
A more general solution would be to implement a hash_take hash aggregate function that takes a scalar integer argument n and returns the nth row from each hash group.

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
Since the result would be non-deterministic, I'm not sure I understand the point of a hash_take function compared to the hash_first proposal.

@asfimport
Copy link
Author

Ian Cook / @ianmcook:
@pitrou  I agree, there is probably no point; just a hash_first kernel would suffice for all the uses I can imagine

@asfimport
Copy link
Author

Weston Pace / @westonpace:
So it sounds like we can implement this JIRA as "pick one column value from the group" and not "pick the first column value from the group". The latter we can create a new JIRA for (there is some desire for this: see ARROW-15474) and tackle later once we have an idea of how we deal with ordering mid-plan.

Do we want to consider a name other than hash_first? Maybe hash_one or hash_single?

@asfimport
Copy link
Author

Dhruv Vats / @dhruv9vats:
Just so I understand this correctly (as I don't have a very formal CS background), when we do:

SELECT country, SUM(customerID) FROM db_table GROUP BY country

from a supposed sales table db_table that has fields country and {}customerID{}, we get number of customers per country/group.

So here instead sum of all tuples in a group, we just want to return a single tuple from the different groups/country? And, it seems which tuple (like either the first or a specific one) to return is yet to be finalised, right?

Also is there a PR or an existing kernel that has a similar boilerplate code to what this will have? (That'll save a disproportionate time going through all the abstractions).

@asfimport
Copy link
Author

David Li / @lidavidm:
Yes, we just want a single row per group. Any row will do; the point above is that we can't implement anything else (because the query engine currently lacks support for ordering, beyond sorting outputs at the very end).

All hash_ kernels ("hash aggregate kernels") are in hash_aggregate.cc and it will be very similar to the CountDistinct/Distinct implementation there.

@asfimport
Copy link
Author

David Li / @lidavidm:
Issue resolved by pull request 12368
#12368

@asfimport
Copy link
Author

Ian Cook / @ianmcook:
@dhruv9vats  Thanks for doing this! I think we need a follow-up to add hash_one to the table of hash aggregate functions in compute.rst. Could you create an issue for that please?

@asfimport
Copy link
Author

David Li / @lidavidm:
D'oh. Sorry @dhruv9vats I forgot to note this in the review. Thanks @ianmcook for catching this.

@asfimport
Copy link
Author

David Li / @lidavidm:
See ARROW-15717.

@asfimport
Copy link
Author

Ian Cook / @ianmcook:
Thanks!

@asfimport
Copy link
Author

Ian Cook / @ianmcook:
I recently learned that this function is called any_value in some SQL dialects (for example in Snowflake: https://docs.snowflake.com/en/sql-reference/functions/any_value.html)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant