[C++] Implement Hash Aggregation query execution node #21501
Comments
Francois Saint-Jacques / @fsaintjacques:
class AggregateFunction {
public:
/// \brief Consume an array into a state.
/// `SELECT AGG(x) FROM T;`
virtual Status Consume(const Array& input, void* state) const = 0;
/// \brief Consume an array with a mask into a state.
/// `SELECT AGG(x) FROM T WHERE pred;`
virtual Status ConsumeWithFilter(const Array& input, const Array& mask, void* state) const = 0;
/// \brief Consume an array with a scatter group into states.
/// `SELECT k, AGG(x) FROM T GROUP BY k;`
/// `SELECT k, AGG(x) FROM T WHERE pred GROUP BY k;`
virtual Status ConsumeWithGroups(const Array& input, const Array& groups, IndexableState* states) const = 0;
};

The GroupBy kernel would emit an array of group indices (it also needs to provide the hash table of original keys). One desirable property of the GroupBy kernel is that it is not a full barrier for running the aggregates: GroupBy and the aggregates can run in parallel, provided there is a final consolidating phase, which is the barrier. See Section 4.4 of "Morsel-driven parallelism: a NUMA-aware query evaluation framework for the many-core age".

As a side note, we should specialize the GroupBy on the group types. For example, for any primitive type of width <=16 (assuming a single-column expression), we can use a fixed-size array, with no hashing required.
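To make the proposed split concrete, here is a minimal sketch using only the C++ standard library rather than Arrow types. The names `Grouping`, `AssignGroupIds`, `SumWithGroups`, and `SumByUInt8Key` are hypothetical and are not part of Arrow. The sketch assumes a single int64 key column and a SUM aggregate, and only illustrates the idea of "GroupBy emits group indices, the aggregate scatters into indexable states", plus the fixed-size-array variant for narrow keys.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the GroupBy kernel output: one dense group id per
// input row, plus the original keys in order of first appearance.
struct Grouping {
  std::vector<int32_t> group_ids;
  std::vector<int64_t> keys;
};

// Assign dense ids 0..n-1 to the distinct keys of a column using a hash table.
inline Grouping AssignGroupIds(const std::vector<int64_t>& column) {
  Grouping out;
  std::unordered_map<int64_t, int32_t> table;
  out.group_ids.reserve(column.size());
  for (int64_t key : column) {
    const int32_t next_id = static_cast<int32_t>(out.keys.size());
    auto result = table.emplace(key, next_id);      // no-op if key is known
    if (result.second) out.keys.push_back(key);     // first time we see key
    out.group_ids.push_back(result.first->second);  // id stored in the table
  }
  return out;
}

// Analogue of ConsumeWithGroups for SUM: scatter each value into the state
// slot selected by its group id, growing the state vector on demand.
inline void SumWithGroups(const std::vector<double>& values,
                          const std::vector<int32_t>& group_ids,
                          std::vector<double>* states) {
  assert(values.size() == group_ids.size());
  for (std::size_t i = 0; i < values.size(); ++i) {
    const std::size_t g = static_cast<std::size_t>(group_ids[i]);
    if (g >= states->size()) states->resize(g + 1, 0.0);
    (*states)[g] += values[i];
  }
}

// Narrow-key specialization: with an 8-bit key the key itself indexes a
// fixed-size state array, so no hash table is needed.
inline std::array<double, 256> SumByUInt8Key(const std::vector<uint8_t>& keys,
                                             const std::vector<double>& values) {
  assert(keys.size() == values.size());
  std::array<double, 256> states{};  // value-initialized to zeros
  for (std::size_t i = 0; i < keys.size(); ++i) states[keys[i]] += values[i];
  return states;
}
```

In this sketch, calling `AssignGroupIds` on the keys `{7, 3, 7, 9, 3}` yields group ids `{0, 1, 0, 2, 1}` and unique keys `{7, 3, 9}`; feeding those ids with values `{1, 2, 3, 4, 5}` to `SumWithGroups` gives per-group sums `{4, 7, 4}`. A parallel version would keep one such table and state vector per worker and merge them by key in the final consolidating phase, which is not shown here.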
Wes McKinney / @wesm: I'm not sure about the "ConsumeWithGroups" API. What do other analytic database engines (Impala, ClickHouse, etc.) do?
Dear all,
I wonder what the best way forward is for implementing GroupBy kernels. Initially this was part of
https://issues.apache.org/jira/browse/ARROW-4124
but is not contained in the current implementation as far as I can tell.
It seems that the part of group by that just returns indices could be conveniently implemented with the HashKernel. That seems useful in any case. Is that indeed the best way forward, and should it be done?
GroupBy + Aggregate could then be implemented in one of two ways: either by combining those indices with the Take kernel and a separate aggregation, which involves more memory copies than necessary, or as part of the aggregate kernel itself. The latter is probably preferable. Any thoughts on that?
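The trade-off between the two options can be sketched as follows. This is an illustration with plain `std::vector` data, not Arrow kernels: `SumViaTake` and `SumFused` are hypothetical names, the group ids are assumed to be already computed and dense in `[0, num_groups)`, and SUM stands in for an arbitrary aggregate.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Option 1: indices + take + aggregate. For every group, collect its row
// positions, gather ("take") the matching values into a temporary buffer,
// then reduce that buffer. The temporary buffers are the extra copies.
inline std::vector<double> SumViaTake(const std::vector<double>& values,
                                      const std::vector<int32_t>& group_ids,
                                      std::size_t num_groups) {
  assert(values.size() == group_ids.size());
  std::vector<std::vector<std::size_t>> rows(num_groups);
  for (std::size_t i = 0; i < group_ids.size(); ++i) {
    rows[static_cast<std::size_t>(group_ids[i])].push_back(i);
  }
  std::vector<double> sums(num_groups, 0.0);
  for (std::size_t g = 0; g < num_groups; ++g) {
    std::vector<double> taken;  // materialized copy of this group's values
    taken.reserve(rows[g].size());
    for (std::size_t r : rows[g]) taken.push_back(values[r]);
    for (double v : taken) sums[g] += v;
  }
  return sums;
}

// Option 2: aggregation fused with grouping. A single pass scatters each
// value into its group's state, with no intermediate copies.
inline std::vector<double> SumFused(const std::vector<double>& values,
                                    const std::vector<int32_t>& group_ids,
                                    std::size_t num_groups) {
  assert(values.size() == group_ids.size());
  std::vector<double> sums(num_groups, 0.0);
  for (std::size_t i = 0; i < values.size(); ++i) {
    sums[static_cast<std::size_t>(group_ids[i])] += values[i];
  }
  return sums;
}
```

Both functions return the same per-group sums; the difference is that the first materializes one gathered buffer per group, while the second touches each input value exactly once.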
Am I missing any other JIRAs related to this?
Best, Philipp.
Reporter: Philipp Moritz / @pcmoritz
Related issues:
Note: This issue was originally created as ARROW-5002. Please see the migration documentation for further details.