Commit
[SPARK-37001][SQL] Disable two level of map for final hash aggregation by default

### What changes were proposed in this pull request?

This PR disables two level of maps for final hash aggregation by default. The feature was introduced in #32242, and we found it can lead to query performance regression when the final aggregation gets rows with a lot of distinct keys. In that case the 1st level hash map is full, so a lot of rows waste the 1st hash map lookup and are then inserted into the 2nd hash map. The feature still benefits queries with not so many distinct keys, so this PR introduces a config, `spark.sql.codegen.aggregate.final.map.twolevel.enabled`, to allow a query to enable the feature when it sees a benefit.

### Why are the changes needed?

Fix query regression.

### Does this PR introduce _any_ user-facing change?

Yes, the introduced `spark.sql.codegen.aggregate.final.map.twolevel.enabled` config.

### How was this patch tested?

Existing unit test in `AggregationQuerySuite.scala`. Also verified the generated code for an example query in the file:

```
spark.sql(
  """
    |SELECT key, avg(value)
    |FROM agg1
    |GROUP BY key
  """.stripMargin)
```

Verified that the generated code for final hash aggregation does not have two level maps by default: https://gist.github.com/c21/d4ce87ef28a22d1ce839e0cda000ce14 .

Verified that the generated code for final hash aggregation has two level maps when the config is enabled: https://gist.github.com/c21/4b59752c1f3f98303b60ccff66b5db69 .

Closes #34270 from c21/agg-fix.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 3354a21)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
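
The regression described above can be illustrated with a toy model. The sketch below is not Spark's generated code: `TwoLevelCounter`, `TwoLevelDemo`, the `capacity` parameter, and the value 1024 are all hypothetical names and numbers chosen for illustration. It assumes a bounded first-level map that stops accepting new keys once full, and counts how many rows pay for a first-level lookup that misses before falling through to the second-level map.

```scala
import scala.collection.mutable

// Toy model of a two-level aggregation map (NOT Spark's generated code).
// `capacity` is a hypothetical size limit for the fast first-level map.
final class TwoLevelCounter(capacity: Int) {
  private val fast = mutable.HashMap.empty[Int, Long]     // 1st level, bounded
  private val fallback = mutable.HashMap.empty[Int, Long] // 2nd level, unbounded
  var wastedProbes: Long = 0L // 1st-level lookups that missed and fell through

  def add(key: Int, value: Long): Unit = {
    fast.get(key) match {
      case Some(sum) => fast.update(key, sum + value)               // hit in 1st level
      case None if fast.size < capacity => fast.update(key, value)  // room for a new key
      case None =>
        // 1st level is full: the lookup above was wasted work.
        wastedProbes += 1
        fallback.update(key, fallback.getOrElse(key, 0L) + value)
    }
  }

  def get(key: Int): Option[Long] = fast.get(key).orElse(fallback.get(key))
}

object TwoLevelDemo {
  // Feed `rows` rows cycling over `distinctKeys` keys; return the wasted probes.
  def wasted(distinctKeys: Int, rows: Int, capacity: Int): Long = {
    val m = new TwoLevelCounter(capacity)
    (0 until rows).foreach(i => m.add(i % distinctKeys, 1L))
    m.wastedProbes
  }

  def main(args: Array[String]): Unit = {
    // Few distinct keys: everything fits in the 1st level.
    println(wasted(distinctKeys = 100, rows = 10000, capacity = 1024))
    // Many distinct keys: most rows miss the 1st level first.
    println(wasted(distinctKeys = 100000, rows = 100000, capacity = 1024))
  }
}
```

In this model, few distinct keys produce no wasted probes, while many distinct keys make nearly every row pay for a missed first-level lookup. That is the trade-off the new config exposes. In a Spark session the flag would presumably be toggled with `spark.conf.set("spark.sql.codegen.aggregate.final.map.twolevel.enabled", "true")` before running the query. The key is copied verbatim from the description above, so check the `SQLConf` of your Spark version for the exact name.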