[C++] Hash aggregate function that returns value from first row in group #29593

asfimport · 2021-09-14T13:53:10Z

It would be nice to have a hash aggregate function that returns the first value of a column within each hash group.

If row order within groups is non-deterministic, then effectively this would return one arbitrary value. This is a very computationally cheap operation.

This can be quite useful when querying a non-normalized table. For example, if a table has both a country column and a country_abbr column, and you want to group by either or both of them while returning the values from both, you could do

SELECT country, country_abbr FROM table GROUP BY country, country_abbr

but it would be more efficient to do

SELECT country, first(country_abbr) FROM table GROUP BY country

because then the engine does not need to scan all the values of the country_abbr column.
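To make the intended semantics concrete, here is a minimal sketch in plain C++ of "keep one value per group". It does not use Arrow's kernel machinery; the helper name OnePerGroup and the use of std::string columns are assumptions made purely for illustration. Once a group key has been seen, the value column for later rows in that group is never inspected.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical helper (not an Arrow API): returns one (key, value) pair per
// distinct key, taking the value from the first row encountered for that key.
std::vector<std::pair<std::string, std::string>> OnePerGroup(
    const std::vector<std::string>& keys,
    const std::vector<std::string>& values) {
  assert(keys.size() == values.size());
  std::unordered_set<std::string> seen;                      // keys already emitted
  std::vector<std::pair<std::string, std::string>> result;   // groups in first-seen order
  for (std::size_t i = 0; i < keys.size(); ++i) {
    // insert() reports whether the key is new; the value is only read if it is.
    if (seen.insert(keys[i]).second) {
      result.emplace_back(keys[i], values[i]);
    }
  }
  return result;
}
```

With keys {"France", "Japan", "France"} and values {"FR", "JP", "FR"}, this yields one pair per country. If the input row order is not deterministic, which row supplies the value is arbitrary.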

Reporter: Ian Cook / @ianmcook
Assignee: Dhruv Vats / @dhruv9vats

Related issues:

[R] Support for .keep_all = TRUE with distinct() (blocks)
[Docs] Add hash_one to the documentation (is related to)

PRs and other links:

GitHub Pull Request #12368

Note: This issue was originally created as ARROW-13993. Please see the migration documentation for further details.

asfimport · 2021-09-14T14:10:53Z

Ian Cook / @ianmcook:
A more general solution would be to implement a hash_take hash aggregate function that takes a scalar integer argument n and returns the nth row from each hash group.
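As a rough sketch of what "nth row per group" could mean, the following plain C++ counts rows per key and records the value when the count reaches n. The name NthPerGroup, the zero-based n, and the choice to omit groups with too few rows are all assumptions for illustration, not a proposed Arrow signature.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not an Arrow API): for each key, returns the value of
// the n-th row (zero-based) seen for that key. Groups with n or fewer rows
// have no entry in the result.
std::unordered_map<std::string, std::string> NthPerGroup(
    const std::vector<std::string>& keys,
    const std::vector<std::string>& values, std::size_t n) {
  assert(keys.size() == values.size());
  std::unordered_map<std::string, std::size_t> rows_seen;  // rows counted per key
  std::unordered_map<std::string, std::string> result;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    std::size_t& count = rows_seen[keys[i]];  // starts at 0 for a new key
    if (count == n) {
      result.emplace(keys[i], values[i]);
    }
    ++count;
  }
  return result;
}
```

The result depends entirely on the order in which rows arrive, so it is only meaningful when row order within groups is well defined.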

asfimport · 2021-12-16T17:51:33Z

Antoine Pitrou / @pitrou:
Since the result would be non-deterministic, I'm not sure I understand the point of a hash_take function compared to the hash_first proposal.

asfimport · 2021-12-20T14:28:49Z

Ian Cook / @ianmcook:
@pitrou I agree, there is probably no point; just a hash_first kernel would suffice for all the uses I can imagine

asfimport · 2022-02-04T19:05:48Z

Weston Pace / @westonpace:
So it sounds like we can implement this JIRA as "pick one column value from the group" and not "pick the first column value from the group". The latter we can create a new JIRA for (there is some desire for this: see ARROW-15474) and tackle later once we have an idea of how we deal with ordering mid-plan.

Do we want to consider a name other than hash_first? Maybe hash_one or hash_single?

asfimport · 2022-02-07T13:12:04Z

Dhruv Vats / @dhruv9vats:
Just so I understand this correctly (as I don't have a very formal CS background), when we do:

SELECT country, SUM(customerID) FROM db_table GROUP BY country

from a supposed sales table db_table that has fields country and customerID, we get the number of customers per country/group.

So here, instead of the sum of all tuples in a group, we just want to return a single tuple from each of the different groups/countries? And it seems that which tuple to return (either the first or a specific one) is yet to be finalised, right?

Also, is there a PR or an existing kernel that has boilerplate code similar to what this will need? (That would save a disproportionate amount of time going through all the abstractions.)

asfimport · 2022-02-07T13:16:24Z

David Li / @lidavidm:
Yes, we just want a single row per group. Any row will do; the point above is that we can't implement anything else (because the query engine currently lacks support for ordering, beyond sorting outputs at the very end).

All hash_ kernels ("hash aggregate kernels") are in hash_aggregate.cc and it will be very similar to the CountDistinct/Distinct implementation there.

asfimport · 2022-02-17T13:13:31Z

David Li / @lidavidm:
Issue resolved by pull request #12368

asfimport · 2022-02-17T15:36:41Z

Ian Cook / @ianmcook:
@dhruv9vats Thanks for doing this! I think we need a follow-up to add hash_one to the table of hash aggregate functions in compute.rst. Could you create an issue for that please?

asfimport · 2022-02-17T15:38:30Z

David Li / @lidavidm:
D'oh. Sorry @dhruv9vats I forgot to note this in the review. Thanks @ianmcook for catching this.

asfimport · 2022-02-17T15:39:01Z

David Li / @lidavidm:
See ARROW-15717.

asfimport · 2022-02-17T15:48:20Z

Ian Cook / @ianmcook:
Thanks!

asfimport · 2022-06-30T13:46:47Z

Ian Cook / @ianmcook:
I recently learned that this function is called any_value in some SQL dialects (for example in Snowflake: https://docs.snowflake.com/en/sql-reference/functions/any_value.html)

asfimport closed this as completed Feb 17, 2022

This was referenced Jan 11, 2023:

[R] Support for .keep_all = TRUE with distinct() #29642 (Open)
[Docs] Add hash_one to the documentation #31168 (Closed)
