[C++] Implement Hash Aggregation query execution node #21501

asfimport · 2019-03-25T02:05:29Z

Dear all,

I wonder what the best way forward is for implementing GroupBy kernels. Initially this was part of

https://issues.apache.org/jira/browse/ARROW-4124

but is not contained in the current implementation as far as I can tell.

It seems that the part of group by that just returns indices could be conveniently implemented with the HashKernel. That seems useful in any case. Is that indeed the best way forward, and should this be done?

GroupBy + Aggregate could then be implemented either with that plus the Take kernel plus aggregation, which involves more memory copies than necessary, or as part of the aggregate kernel. The latter is probably preferred; any thoughts on that?
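
To make the two options concrete, here is a minimal sketch of the second one: a hash pass that emits one dense group index per row, followed by an aggregation that scatters directly into per-group accumulators, so no Take and no regrouped copy of the values is needed. Everything in it (`GroupIds`, `ComputeGroupIds`, `GroupedSum`, and the use of `std::vector` in place of Arrow arrays) is a hypothetical stand-in for illustration, not the HashKernel or any existing Arrow API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical result type: one dense group id per input row, plus the
// distinct keys in first-seen order (keys[g] is the key of group g).
struct GroupIds {
  std::vector<int32_t> ids;
  std::vector<int64_t> keys;
};

// Step 1: assign each distinct key a dense id using a hash table.
GroupIds ComputeGroupIds(const std::vector<int64_t>& input) {
  GroupIds out;
  out.ids.reserve(input.size());
  std::unordered_map<int64_t, int32_t> table;
  for (int64_t key : input) {
    // emplace only inserts when the key is new; the candidate id is the
    // number of distinct keys seen so far.
    auto result = table.emplace(key, static_cast<int32_t>(out.keys.size()));
    if (result.second) out.keys.push_back(key);
    out.ids.push_back(result.first->second);
  }
  return out;
}

// Step 2: aggregate by scattering each value into its group's accumulator.
std::vector<int64_t> GroupedSum(const std::vector<int64_t>& values,
                                const std::vector<int32_t>& ids,
                                std::size_t num_groups) {
  assert(values.size() == ids.size());
  std::vector<int64_t> sums(num_groups, 0);
  for (std::size_t i = 0; i < values.size(); ++i) {
    sums[static_cast<std::size_t>(ids[i])] += values[i];
  }
  return sums;
}
```

Calling `ComputeGroupIds` on the key column and passing its `ids` and `keys.size()` to `GroupedSum` yields one sum per distinct key, in first-seen key order.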

Am I missing any other JIRAs related to this?

Best, Philipp.

Reporter: Philipp Moritz / @pcmoritz

Related issues:

[C++] Parallelize execution of ScalarAggregateFunction (blocks)
cpp implementation for hash aggregation and binding in python (is duplicated by)
[C++][Compute] Wrap grouped aggregation in an ExecNode (depends upon)

Note: This issue was originally created as ARROW-5002. Please see the migration documentation for further details.

asfimport · 2019-03-25T12:50:36Z

Francois Saint-Jacques / @fsaintjacques:
The Take kernel should not be part of the solution; as you mentioned, we want to minimize memory copies. My intent is to extend the AggregateFunction class with ConsumeWithFilter and ConsumeWithGroups as follows:

class AggregateFunction {
 public:
  /// \brief Consume an array into a state.
  /// `SELECT AGG(x) FROM T;`
  virtual Status Consume(const Array& input, void* state) const = 0;

  /// \brief Consume an array with a mask into a state.
  /// `SELECT AGG(x) FROM T WHERE pred;`
  virtual Status ConsumeWithFilter(const Array& input, const Array& mask, void* state) const = 0;

  /// \brief Consume an array with a scatter group into states.
  /// `SELECT k, AGG(x) FROM T GROUP BY k;`
  /// `SELECT k, AGG(x) FROM T WHERE pred GROUP BY k;`
  virtual Status ConsumeWithGroups(const Array& input, const Array& groups, IndexableState* states) const = 0;
};
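
For illustration, this is roughly how a sum aggregate might fill in those three entry points. It is only a sketch under simplifying assumptions: `Int64Column`, `BoolColumn`, `GroupColumn`, `SumState` and `SumSketch` are made-up stand-ins, `std::vector` replaces `Array`, a typed state replaces `void*` and `IndexableState`, and error handling via `Status` is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-ins for illustration only; these are not Arrow types.
using Int64Column = std::vector<int64_t>;
using BoolColumn = std::vector<bool>;
using GroupColumn = std::vector<int32_t>;

struct SumState {
  int64_t sum = 0;
};

class SumSketch {
 public:
  // SELECT SUM(x) FROM T;
  void Consume(const Int64Column& input, SumState* state) const {
    for (int64_t v : input) state->sum += v;
  }

  // SELECT SUM(x) FROM T WHERE pred;  (mask[i] is the predicate for row i)
  void ConsumeWithFilter(const Int64Column& input, const BoolColumn& mask,
                         SumState* state) const {
    assert(input.size() == mask.size());
    for (std::size_t i = 0; i < input.size(); ++i) {
      if (mask[i]) state->sum += input[i];
    }
  }

  // SELECT k, SUM(x) FROM T GROUP BY k;  (groups[i] is the group id of row i)
  void ConsumeWithGroups(const Int64Column& input, const GroupColumn& groups,
                         std::vector<SumState>* states) const {
    assert(input.size() == groups.size());
    for (std::size_t i = 0; i < input.size(); ++i) {
      assert(groups[i] >= 0);
      const std::size_t g = static_cast<std::size_t>(groups[i]);
      // Grow the state vector on demand so unseen group ids get a zero state.
      if (g >= states->size()) states->resize(g + 1);
      (*states)[g].sum += input[i];
    }
  }
};
```

The point of the grouped variant is that values are read in place and scattered into per-group states, so no intermediate regrouped array is materialized.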

The GroupBy kernel would emit an array of indices (it also needs to provide the hash table of original keys). One desirable property of the GroupBy kernel is that it is not a full barrier to running the aggregates: GroupBy and the aggregates can run in parallel, provided there is a final consolidation phase (which is the barrier). See section 4.4 of "Morsel-driven parallelism: a NUMA-aware query evaluation framework for the many-core age".
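
A rough sketch of that partial-then-merge shape, assuming a grouped sum over `int64` keys: each chunk of rows is aggregated into its own private table, and a final merge step acts as the only barrier. The names (`PartialSums`, `AggregateMorsel`, `MergePartials`, `TwoPhaseGroupedSum`) are hypothetical, and threads are deliberately omitted to keep it short; since the per-chunk tables share no state, each `AggregateMorsel` call could run on its own worker.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical per-worker table: key -> partial sum.
using PartialSums = std::unordered_map<int64_t, int64_t>;

// Phase 1: aggregate rows [begin, end) into a private table. Calls on
// disjoint ranges share no state, so they could run concurrently.
PartialSums AggregateMorsel(const std::vector<int64_t>& keys,
                            const std::vector<int64_t>& values,
                            std::size_t begin, std::size_t end) {
  assert(keys.size() == values.size());
  assert(begin <= end && end <= keys.size());
  PartialSums local;
  for (std::size_t i = begin; i < end; ++i) local[keys[i]] += values[i];
  return local;
}

// Phase 2 (the barrier): consolidate all partial tables into one result.
PartialSums MergePartials(const std::vector<PartialSums>& partials) {
  PartialSums merged;
  for (const PartialSums& partial : partials) {
    for (const auto& entry : partial) merged[entry.first] += entry.second;
  }
  return merged;
}

// Driver: split the input into fixed-size chunks, aggregate each, then merge.
PartialSums TwoPhaseGroupedSum(const std::vector<int64_t>& keys,
                               const std::vector<int64_t>& values,
                               std::size_t morsel_size) {
  assert(morsel_size > 0);
  std::vector<PartialSums> partials;
  for (std::size_t begin = 0; begin < keys.size(); begin += morsel_size) {
    const std::size_t end = std::min(begin + morsel_size, keys.size());
    partials.push_back(AggregateMorsel(keys, values, begin, end));
  }
  return MergePartials(partials);
}
```

This only works directly for aggregates whose partial states can be combined (sum, count, min, max); something like a mean would need to carry sum and count separately until the merge.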

As a side note, we should specialize the GroupBy on the group types, e.g. for any primitive type of width <= 16 bits (assuming a single-column expression), we can use a fixed-size array and no hashing is required.
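
As a sketch of that specialization, assuming a single `uint8` key column and a sum aggregate: the key itself indexes a 256-slot array, so there is no hash function and no probing. `DenseSums` and `GroupedSumUInt8` are hypothetical names; a 16-bit key would work the same way with 65536 slots.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical dense state: one slot per possible uint8 key value.
struct DenseSums {
  std::array<int64_t, 256> sums{};  // value-initialized to 0
  std::array<bool, 256> seen{};     // which keys actually occurred
};

// Grouped sum where the key is used directly as the array index.
DenseSums GroupedSumUInt8(const std::vector<uint8_t>& keys,
                          const std::vector<int64_t>& values) {
  assert(keys.size() == values.size());
  DenseSums out;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    out.sums[keys[i]] += values[i];
    out.seen[keys[i]] = true;
  }
  return out;
}
```

The `seen` flags are there so that a group with sum 0 can be told apart from a key that never appeared.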

asfimport · 2019-03-28T22:25:29Z

Wes McKinney / @wesm:
I think what we are discussing here is called "hash aggregation" in the database literature and is a type of execution node. This is related to ARROW-3978, where we need to be able to compute group ordinal indexes via hashing for multiple keys.

I'm not sure about the "ConsumeWithGroups" API. What do other analytic database engines (Impala, Clickhouse, etc.) do?

asfimport · 2020-05-25T15:09:39Z

Wes McKinney / @wesm:
I renamed the issue. I need to be able to execute hash aggregations in the next few months, so I will be working to implement the appropriate machinery for this under arrow/compute (since hash aggregations need to compose with array/kernel expressions).

asfimport closed this as completed Aug 3, 2021

asfimport added this to the 6.0.0 milestone Jan 11, 2023

This was referenced Jan 11, 2023

[C++] Parallelize execution of ScalarAggregateFunction #19473 (Open)

cpp implementation for hash aggregation and binding in python #23074 (Closed)

[C++][Compute] Wrap grouped aggregation in an ExecNode #28501 (Closed)
