Support for single phase and two-phase distributed hash aggregation #4372

@siddharthteotia

I am not entirely sure to what extent this is already supported, so apologies if this is already done or is not applicable.

Idea:

GROUP BY processing using hash-based aggregation in a distributed query engine can be done in two different ways, and which one performs better depends on the cardinality of the group-by keys.

(1) Two-phase hash aggregation: In the first phase, each node does a local group by over the data (segments) on that node, essentially computing a partial result set from single-threaded query execution for a given segment on a given node.

The physical (execution) plan will have a shuffle operator above the local group-by aggregation operator. The shuffle operator redistributes the partial aggregates by group-by key, so that each node ends up with all the partial aggregates for a given set of group-by key(s).

The second phase of hash aggregation then works on the partial aggregates and computes the final result set for a given set of key(s) on each node. Each node can then send its result set to the broker. The broker only needs to union the results, as opposed to doing any aggregation-related processing.
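
To make the flow above concrete, here is a minimal single-process sketch in Python. It is not Pinot code: the function names, the use of SUM as the aggregate, and the `hash(key) % num_nodes` routing are all assumptions made for illustration, and each "node" is just a list of `(key, value)` rows.

```python
from collections import defaultdict


def local_aggregate(rows):
    """Phase 1: hash-aggregate one node's rows into partial sums per key."""
    partial = defaultdict(int)
    for key, value in rows:
        partial[key] += value
    return dict(partial)


def shuffle(partials, num_nodes):
    """Route every (key, partial sum) pair to the node that owns the key."""
    inboxes = [[] for _ in range(num_nodes)]
    for partial in partials:
        for key, value in partial.items():
            # Hypothetical partitioning scheme: hash of the key modulo node count.
            inboxes[hash(key) % num_nodes].append((key, value))
    return inboxes


def final_aggregate(inbox):
    """Phase 2: merge the partial sums received for the keys this node owns."""
    final = defaultdict(int)
    for key, value in inbox:
        final[key] += value
    return dict(final)


def two_phase_group_by_sum(node_rows):
    num_nodes = len(node_rows)
    partials = [local_aggregate(rows) for rows in node_rows]
    inboxes = shuffle(partials, num_nodes)
    per_node_results = [final_aggregate(inbox) for inbox in inboxes]
    # Broker: each key lives on exactly one node after the shuffle,
    # so combining the per-node results is a plain union.
    result = {}
    for node_result in per_node_results:
        result.update(node_result)
    return result


node_rows = [
    [("a", 1), ("b", 2), ("a", 3)],  # rows held by node 0
    [("b", 4), ("c", 5)],            # rows held by node 1
]
print(two_phase_group_by_sum(node_rows))
```

Because every key is routed to exactly one node, the per-node results have disjoint key sets, which is why the broker step needs no further aggregation.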

(2) Single-phase hash aggregation: The two-phase method described above is useful when the cardinality of the group-by key columns is low, since doing the local hash aggregation first achieves a higher degree of reduction and thus reduces the amount of data sent around. However, if the cardinality is very high, it is better to do single-phase aggregation. In this case, the shuffle operator sits below the aggregation operator, so rows are redistributed by group-by key before any aggregation happens. Again, after the hash aggregation, each node can send its final result set to the broker, which just needs to union them.
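
The cardinality trade-off can be illustrated with a rough Python sketch. Again, this is not Pinot code: the function names, the SUM aggregate, the `hash(key) % num_nodes` routing, and the synthetic data are assumptions. It counts every routed record as "shuffled", including records that happen to stay on the same node.

```python
from collections import defaultdict


def single_phase_group_by_sum(node_rows):
    """Shuffle raw rows by key first, then hash-aggregate once per node."""
    num_nodes = len(node_rows)
    inboxes = [[] for _ in range(num_nodes)]
    shuffled = 0
    for rows in node_rows:
        for key, value in rows:
            # Hypothetical partitioning scheme: hash of the key modulo node count.
            inboxes[hash(key) % num_nodes].append((key, value))
            shuffled += 1
    result = {}
    for inbox in inboxes:
        final = defaultdict(int)
        for key, value in inbox:
            final[key] += value
        # Broker: key sets are disjoint across nodes, so this is a plain union.
        result.update(final)
    return result, shuffled


def two_phase_shuffle_volume(node_rows):
    """Records a two-phase plan would shuffle: one per distinct key per node."""
    return sum(len({key for key, _ in rows}) for rows in node_rows)


# Low cardinality: 2 nodes x 1000 rows, only 4 distinct keys.
low = [[(i % 4, 1) for i in range(1000)] for _ in range(2)]
# High cardinality: 2 nodes x 1000 rows, every key unique.
high = [[(n * 1000 + i, 1) for i in range(1000)] for n in range(2)]

for name, data in (("low", low), ("high", high)):
    _, single = single_phase_group_by_sum(data)
    print(name, "single-phase:", single, "two-phase:", two_phase_shuffle_volume(data))
```

With few distinct keys, local pre-aggregation shrinks the shuffle from thousands of records to a handful. With unique keys, both plans shuffle the same number of records, so the local aggregation pass only adds hashing work.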
