Implement hash-partitioned hash aggregate #27

Closed
Dandandan opened this issue Apr 21, 2021 · 3 comments · Fixed by #320

Comments

@Dandandan
Contributor

This is faster (up to 4x in my earlier tests) than the current implementation, which collects all partitions into one "full" aggregation, for cases with very high cardinality in the aggregate keys (think deduplication code). However, not hash partitioning is faster for very "simple" aggregates, as less work needs to be done.

We probably need some fast way to get a rough estimate of the number of distinct values in the aggregate keys, maybe dynamically based on the first batch(es).

Also, this work creates a building block for Ballista to distribute the aggregation across workers, parallelizing it, avoiding collecting everything onto a single worker, and making it scale to bigger datasets.
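A minimal sketch of the kind of dynamic estimate mentioned above, assuming we can look at the group-by key values of the first sampled batch(es); the function name and the 0.5 ratio threshold are assumptions for illustration, not part of DataFusion:

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Decide whether hash partitioning is likely to pay off, based on a rough
/// distinct-count over the group-by keys seen in the first batch(es).
/// `sampled_keys` and the 0.5 ratio threshold are illustrative assumptions.
fn prefer_hash_partitioning<K: Eq + Hash>(sampled_keys: &[K]) -> bool {
    if sampled_keys.is_empty() {
        return false;
    }
    let distinct: HashSet<&K> = sampled_keys.iter().collect();
    let ratio = distinct.len() as f64 / sampled_keys.len() as f64;
    // Mostly-unique keys (e.g. deduplication workloads) favor the
    // hash-partitioned plan; mostly-repeated keys favor the single
    // "full" aggregate, since less work needs to be done.
    ratio > 0.5
}
```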

Dandandan changed the title from "Implement hash-partitioned hash aggregat" to "Implement hash-partitioned hash aggregate" on Apr 21, 2021
@returnString
Contributor

> We probably need some fast way to get a rough estimate of the number of distinct values in the aggregate keys, maybe dynamically based on the first batch(es).

Does your TableProvider column stats work provide any useful base for this in situations where we're running aggregations on original table columns (as opposed to computed exprs) or is that too coarse?

@Dandandan
Contributor Author

> We probably need some fast way to get a rough estimate of the number of distinct values in the aggregate keys, maybe dynamically based on the first batch(es).
>
> Does your TableProvider column stats work provide any useful base for this in situations where we're running aggregations on original table columns (as opposed to computed exprs) or is that too coarse?

I think yes: for cases where the table is queried directly and distinct-value statistics are available, we could apply heuristics to group-by expressions based on those cardinality statistics (say, if you group by columns a and b, distinct(a) * distinct(b) might be an OK heuristic).
We also need the same per-column distinct-value support to generalize the join order optimization rule to more complicated joins and to expressions beyond those directly on tables.
Support for collecting those statistics (e.g. ANALYZE TABLE) would also need to be added.
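A rough sketch of that heuristic, assuming per-column distinct-value statistics are available from the table provider; the product is capped by the row count, since correlated columns make it an over-estimate (the function and parameter names are hypothetical, not an existing DataFusion API):

```rust
/// Estimate the output cardinality of GROUP BY c1, ..., cn from per-column
/// distinct-value statistics, e.g. distinct(a) * distinct(b) for GROUP BY a, b.
/// Capped by the table row count, since the product over-counts when columns
/// are correlated. Purely illustrative.
fn estimate_group_by_cardinality(distinct_per_column: &[u64], num_rows: u64) -> u64 {
    distinct_per_column
        .iter()
        .fold(1u64, |acc, &d| acc.saturating_mul(d.max(1)))
        .min(num_rows)
}
```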

To support the more general case, we probably need a way to estimate the cardinality of the intermediate results, based on sampling one or a couple of batches with the particular group-by expression.

I added this requirement in this design doc:
https://docs.google.com/document/d/17DCBe_HygkjsoMzC4Znw-t8i1okLGlBkf0kp8DfLBhk/edit?usp=drivesdk

@Dandandan
Contributor Author

Dandandan commented May 11, 2021

I also realize I may have implemented this in an inefficient way. Looking at Spark, it first does a partial aggregate, followed by a hash repartition, followed by another aggregation. I think this will not slow down low-cardinality queries much (the result set will already be small after the first aggregate) while being fast - maybe even faster - for high-cardinality aggregates.
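A minimal sketch of that two-phase shape for something like `SELECT k, count(*) ... GROUP BY k`, written against plain std types rather than the actual DataFusion/Spark operators; the function names, string key type, and partition count are assumptions for illustration only:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Phase 1: pre-aggregate within each input partition (partial aggregate).
fn partial_aggregate(partition: &[String]) -> HashMap<String, u64> {
    let mut counts = HashMap::new();
    for k in partition {
        *counts.entry(k.clone()).or_insert(0) += 1;
    }
    counts
}

/// Phase 2: hash-repartition the (already small) partial results on the group
/// key, so every occurrence of a key lands in the same output partition.
fn repartition(partials: Vec<HashMap<String, u64>>, n: usize) -> Vec<Vec<(String, u64)>> {
    let mut out = vec![Vec::new(); n];
    for map in partials {
        for (k, c) in map {
            let mut h = DefaultHasher::new();
            k.hash(&mut h);
            out[h.finish() as usize % n].push((k, c));
        }
    }
    out
}

/// Phase 3: final aggregate merges the partial counts per key.
fn final_aggregate(shard: &[(String, u64)]) -> HashMap<String, u64> {
    let mut counts = HashMap::new();
    for (k, c) in shard {
        *counts.entry(k.clone()).or_insert(0) += c;
    }
    counts
}
```

For a low-cardinality query, phase 1 shrinks each partition to a handful of rows, so the repartition is cheap; for a high-cardinality query, the repartition spreads the heavy final aggregation across partitions.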

andygrove added a commit that referenced this issue Jan 12, 2023
* Initial commit

* initial commit

* failing test

* table scan projection

* closer

* test passes, with some hacks

* use DataFrame (#2)

* update README

* update dependency

* code cleanup (#3)

* Add support for Filter operator and BinaryOp expressions (#4)

* GitHub action (#5)

* Split code into producer and consumer modules (#6)

* Support more functions and scalar types (#7)

* Use substrait 0.1 and datafusion 8.0 (#8)

* use substrait 0.1

* use datafusion 8.0

* update datafusion to 10.0 and substrait to 0.2 (#11)

* Add basic join support (#12)

* Added fetch support (#23)

Added fetch to consumer

Added limit to producer

Added unit tests for limit

Added roundtrip_fill_none() for testing when None input can be converted to 0

Update src/consumer.rs

Co-authored-by: Andy Grove <andygrove73@gmail.com>

Co-authored-by: Andy Grove <andygrove73@gmail.com>

* Upgrade to DataFusion 13.0.0 (#25)

* Add sort consumer and producer (#24)

Add consumer

Add producer and test

Modified error string

* Add serializer/deserializer (#26)

* Add plan and function extension support (#27)

* Add plan and function extension support

* Removed unwraps

* Implement GROUP BY (#28)

* Add consumer, producer and tests for aggregate relation

Change function extension registration from absolute to relative anchor
(reference)

Remove operator to/from reference

* Fixed function registration bug

* Add test

* Addressed PR comments

* Changed field reference from mask to direct reference (#29)

* Changed field reference from masked reference to direct reference

* Handle unsupported case (struct with child)

* Handle SubqueryAlias (#30)

Fixed aggregate function register bug

* Add support for SELECT DISTINCT (#31)

Add test case

* Implement BETWEEN (#32)

* Add case (#33)

* Implement CASE WHEN

* Add more case to test

* Addressed comments

* feat: support explicit catalog/schema names in ReadRel (#34)

* feat: support explicit catalog/schema names in ReadRel

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

* fix: use re-exported expr crate

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

* move files to subfolder

* RAT

* remove rust.yaml

* revert .gitignore changes

* tomlfmt

* tomlfmt

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Co-authored-by: Daniël Heres <danielheres@gmail.com>
Co-authored-by: JanKaul <jankaul@mailbox.org>
Co-authored-by: nseekhao <37189615+nseekhao@users.noreply.github.com>
Co-authored-by: Ruihang Xia <waynestxia@gmail.com>