Implement hash-partitioned hash aggregate #27

Closed
Dandandan opened this issue Apr 21, 2021 · 3 comments · Fixed by #320

Comments

@Dandandan
Contributor

This is faster (up to 4x in my earlier tests) than the current implementation, which collects all partitions into one "full" aggregation, for cases with very high cardinality in the aggregate keys (think deduplication code). However, not hash partitioning is faster for very "simple" aggregates, as less work needs to be done.

We probably need some fast way to get a rough estimate of the number of distinct values in the aggregate keys, maybe dynamically based on the first batch(es).

Also, this work creates a building block for Ballista to distribute the aggregation across workers, parallelizing it, avoiding collecting everything onto a single worker, and making it scale to bigger datasets.
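A minimal sketch of the kind of dynamic estimate mentioned above, assuming we can look at the group-by key values of the first sampled batch(es); the function name and the 0.5 ratio threshold are assumptions for illustration, not part of DataFusion:

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Decide whether hash partitioning is likely to pay off, based on a rough
/// distinct-count over the group-by keys seen in the first batch(es).
/// `sampled_keys` and the 0.5 ratio threshold are illustrative assumptions.
fn prefer_hash_partitioning<K: Eq + Hash>(sampled_keys: &[K]) -> bool {
    if sampled_keys.is_empty() {
        return false;
    }
    let distinct: HashSet<&K> = sampled_keys.iter().collect();
    let ratio = distinct.len() as f64 / sampled_keys.len() as f64;
    // Mostly-unique keys (e.g. deduplication workloads) favor the
    // hash-partitioned plan; mostly-repeated keys favor the single
    // "full" aggregate, since less work needs to be done.
    ratio > 0.5
}
```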

Dandandan changed the title from "Implement hash-partitioned hash aggregat" to "Implement hash-partitioned hash aggregate" on Apr 21, 2021
@returnString
Contributor

> We probably need some fast way to get a rough estimate of the number of distinct values in the aggregate keys, maybe dynamically based on the first batch(es).

Does your TableProvider column stats work provide any useful base for this in situations where we're running aggregations on original table columns (as opposed to computed exprs) or is that too coarse?

@Dandandan
Contributor Author

> We probably need some fast way to get a rough estimate of the number of distinct values in the aggregate keys, maybe dynamically based on the first batch(es).
>
> Does your TableProvider column stats work provide any useful base for this in situations where we're running aggregations on original table columns (as opposed to computed exprs) or is that too coarse?

I think yes: for cases where the table is queried directly and distinct-value statistics are available, we could apply heuristics to group-by expressions based on those cardinality statistics (say, if you group by columns a and b, distinct(a) * distinct(b) might be an OK heuristic).
We also need the same per-column distinct-value support to generalize the join order optimization rule to more complicated joins and to expressions beyond those directly on tables.
Support for collecting those statistics (e.g. ANALYZE TABLE) would also need to be added.
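A rough sketch of that heuristic, assuming per-column distinct-value statistics are available from the table provider; the product is capped by the row count, since correlated columns make it an over-estimate (the function and parameter names are hypothetical, not an existing DataFusion API):

```rust
/// Estimate the output cardinality of GROUP BY c1, ..., cn from per-column
/// distinct-value statistics, e.g. distinct(a) * distinct(b) for GROUP BY a, b.
/// Capped by the table row count, since the product over-counts when columns
/// are correlated. Purely illustrative.
fn estimate_group_by_cardinality(distinct_per_column: &[u64], num_rows: u64) -> u64 {
    distinct_per_column
        .iter()
        .fold(1u64, |acc, &d| acc.saturating_mul(d.max(1)))
        .min(num_rows)
}
```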

To support the more general case, we probably need a way to estimate the cardinality of the intermediate results, based on sampling one or a couple of batches with the particular group-by expression.

I added this requirement in this design doc:
https://docs.google.com/document/d/17DCBe_HygkjsoMzC4Znw-t8i1okLGlBkf0kp8DfLBhk/edit?usp=drivesdk

@Dandandan
Contributor Author

Dandandan commented May 11, 2021

I also realize I may have implemented this in an inefficient way. Looking at Spark, it first does a partial aggregate, followed by a hash repartition, followed by another aggregation. I think this will not slow down low-cardinality queries much (the result set will already be small after the first aggregate) while being fast - maybe even faster - for high-cardinality aggregates.
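A minimal sketch of that two-phase shape for something like `SELECT k, count(*) ... GROUP BY k`, written against plain std types rather than the actual DataFusion/Spark operators; the function names, string key type, and partition count are assumptions for illustration only:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Phase 1: pre-aggregate within each input partition (partial aggregate).
fn partial_aggregate(partition: &[String]) -> HashMap<String, u64> {
    let mut counts = HashMap::new();
    for k in partition {
        *counts.entry(k.clone()).or_insert(0) += 1;
    }
    counts
}

/// Phase 2: hash-repartition the (already small) partial results on the group
/// key, so every occurrence of a key lands in the same output partition.
fn repartition(partials: Vec<HashMap<String, u64>>, n: usize) -> Vec<Vec<(String, u64)>> {
    let mut out = vec![Vec::new(); n];
    for map in partials {
        for (k, c) in map {
            let mut h = DefaultHasher::new();
            k.hash(&mut h);
            out[h.finish() as usize % n].push((k, c));
        }
    }
    out
}

/// Phase 3: final aggregate merges the partial counts per key.
fn final_aggregate(shard: &[(String, u64)]) -> HashMap<String, u64> {
    let mut counts = HashMap::new();
    for (k, c) in shard {
        *counts.entry(k.clone()).or_insert(0) += c;
    }
    counts
}
```

For a low-cardinality query, phase 1 shrinks each partition to a handful of rows, so the repartition is cheap; for a high-cardinality query, the repartition spreads the heavy final aggregation across partitions.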

andygrove added a commit that referenced this issue Jan 12, 2023
* Initial commit

* initial commit

* failing test

* table scan projection

* closer

* test passes, with some hacks

* use DataFrame (#2)

* update README

* update dependency

* code cleanup (#3)

* Add support for Filter operator and BinaryOp expressions (#4)

* GitHub action (#5)

* Split code into producer and consumer modules (#6)

* Support more functions and scalar types (#7)

* Use substrait 0.1 and datafusion 8.0 (#8)

* use substrait 0.1

* use datafusion 8.0

* update datafusion to 10.0 and substrait to 0.2 (#11)

* Add basic join support (#12)

* Added fetch support (#23)

Added fetch to consumer

Added limit to producer

Added unit tests for limit

Added roundtrip_fill_none() for testing when None input can be converted to 0

Update src/consumer.rs

Co-authored-by: Andy Grove <andygrove73@gmail.com>

Co-authored-by: Andy Grove <andygrove73@gmail.com>

* Upgrade to DataFusion 13.0.0 (#25)

* Add sort consumer and producer (#24)

Add consumer

Add producer and test

Modified error string

* Add serializer/deserializer (#26)

* Add plan and function extension support (#27)

* Add plan and function extension support

* Removed unwraps

* Implement GROUP BY (#28)

* Add consumer, producer and tests for aggregate relation

Change function extension registration from absolute to relative anchor
(reference)

Remove operator to/from reference

* Fixed function registration bug

* Add test

* Addressed PR comments

* Changed field reference from mask to direct reference (#29)

* Changed field reference from masked reference to direct reference

* Handle unsupported case (struct with child)

* Handle SubqueryAlias (#30)

Fixed aggregate function register bug

* Add support for SELECT DISTINCT (#31)

Add test case

* Implement BETWEEN (#32)

* Add case (#33)

* Implement CASE WHEN

* Add more case to test

* Addressed comments

* feat: support explicit catalog/schema names in ReadRel (#34)

* feat: support explicit catalog/schema names in ReadRel

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

* fix: use re-exported expr crate

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

* move files to subfolder

* RAT

* remove rust.yaml

* revert .gitignore changes

* tomlfmt

* tomlfmt

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Co-authored-by: Daniël Heres <danielheres@gmail.com>
Co-authored-by: JanKaul <jankaul@mailbox.org>
Co-authored-by: nseekhao <37189615+nseekhao@users.noreply.github.com>
Co-authored-by: Ruihang Xia <waynestxia@gmail.com>