-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-37001][SQL] Disable two level of map for final hash aggregation by default #34270
Conversation
cc @cloud-fan could you help take a look when you have time? Thanks! |
Kubernetes integration test starting |
.version("2.3.0") | ||
.booleanConf | ||
.createWithDefault(true) | ||
|
||
val ENABLE_TWOLEVEL_FINAL_AGG_MAP = | ||
buildConf("spark.sql.codegen.aggregate.final.map.twolevel.enabled") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about park.sql.codegen.aggregate.map.twolevel.partialOnly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan - sure, updated. So given the new meaning of config, changed the default config value to true
as well.
.doc("Enable two-level aggregate hash map for final aggregate as well. Disable by default " + | ||
"because final aggregate might get more distinct keys compared to partial aggregate. " + | ||
"Overhead of looking up 1st-level map might dominate when having a lot of distinct keys.") | ||
.version("3.2.0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3.2.1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan - yes, updated.
Test build #144190 has finished for PR 34270 at commit
|
Kubernetes integration test status failure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @c21 . Thank you for making a PR.
However, SPARK-35141 is released already via Apache Spark 3.2.0.
You cannot make a follow-up because this PR will be released as Apache Spark 3.2.1.
Please file a new JIRA issue and use it for this kind of PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed all comments from @cloud-fan and @dongjoon-hyun, thanks.
.version("2.3.0") | ||
.booleanConf | ||
.createWithDefault(true) | ||
|
||
val ENABLE_TWOLEVEL_FINAL_AGG_MAP = | ||
buildConf("spark.sql.codegen.aggregate.final.map.twolevel.enabled") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan - sure, updated. So given the new meaning of config, changed the default config value to true
as well.
.doc("Enable two-level aggregate hash map for final aggregate as well. Disable by default " + | ||
"because final aggregate might get more distinct keys compared to partial aggregate. " + | ||
"Overhead of looking up 1st-level map might dominate when having a lot of distinct keys.") | ||
.version("3.2.0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan - yes, updated.
Kubernetes integration test starting |
Kubernetes integration test status failure |
@@ -3865,6 +3875,8 @@ class SQLConf extends Serializable with Logging { | |||
|
|||
def enableTwoLevelAggMap: Boolean = getConf(ENABLE_TWOLEVEL_AGG_MAP) | |||
|
|||
def enableTwoLevelAggMapPartialOnly: Boolean = getConf(ENABLE_TWOLEVEL_AGG_MAP_PARTIAL_ONLY) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we don't need to add a corresponding conf method if it's only called once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan - updated.
Test build #144217 has finished for PR 34270 at commit
|
Alice 1 2 165.0 | ||
NULL 3 7 172.5 | ||
Bob 0 5 180.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is needed for passing unit test, which reverts the change in https://github.com/apache/spark/pull/32242/files .
Kubernetes integration test starting |
Kubernetes integration test status failure |
thanks, merging to master/3.2! |
…n by default ### What changes were proposed in this pull request? This PR is to disable two level of maps for final hash aggregation by default. The feature was introduced in #32242 and we found it can lead to query performance regression when the final aggregation gets rows with a lot of distinct keys. The 1st level hash map is full so a lot of rows will waste the 1st hash map lookup and inserted into 2nd hash map. This feature still benefits query with not so many distinct keys though, so introducing a config here `spark.sql.codegen.aggregate.final.map.twolevel.enabled`, to allow query to enable the feature when seeing benefit. ### Why are the changes needed? Fix query regression. ### Does this PR introduce _any_ user-facing change? Yes, the introduced `spark.sql.codegen.aggregate.final.map.twolevel.enabled` config. ### How was this patch tested? Existing unit test in `AggregationQuerySuite.scala`. Also verified generated code for an example query in the file: ``` spark.sql( """ |SELECT key, avg(value) |FROM agg1 |GROUP BY key """.stripMargin) ``` Verified the generated code for final hash aggregation not have two level maps by default: https://gist.github.com/c21/d4ce87ef28a22d1ce839e0cda000ce14 . Verified the generated code for final hash aggregation have two level maps if enabling the config: https://gist.github.com/c21/4b59752c1f3f98303b60ccff66b5db69 . Closes #34270 from c21/agg-fix. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 3354a21) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Test build #144245 has finished for PR 34270 at commit
|
…n by default ### What changes were proposed in this pull request? This PR is to disable two level of maps for final hash aggregation by default. The feature was introduced in apache#32242 and we found it can lead to query performance regression when the final aggregation gets rows with a lot of distinct keys. The 1st level hash map is full so a lot of rows will waste the 1st hash map lookup and inserted into 2nd hash map. This feature still benefits query with not so many distinct keys though, so introducing a config here `spark.sql.codegen.aggregate.final.map.twolevel.enabled`, to allow query to enable the feature when seeing benefit. ### Why are the changes needed? Fix query regression. ### Does this PR introduce _any_ user-facing change? Yes, the introduced `spark.sql.codegen.aggregate.final.map.twolevel.enabled` config. ### How was this patch tested? Existing unit test in `AggregationQuerySuite.scala`. Also verified generated code for an example query in the file: ``` spark.sql( """ |SELECT key, avg(value) |FROM agg1 |GROUP BY key """.stripMargin) ``` Verified the generated code for final hash aggregation not have two level maps by default: https://gist.github.com/c21/d4ce87ef28a22d1ce839e0cda000ce14 . Verified the generated code for final hash aggregation have two level maps if enabling the config: https://gist.github.com/c21/4b59752c1f3f98303b60ccff66b5db69 . Closes apache#34270 from c21/agg-fix. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 3354a21) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…n by default ### What changes were proposed in this pull request? This PR is to disable two level of maps for final hash aggregation by default. The feature was introduced in apache#32242 and we found it can lead to query performance regression when the final aggregation gets rows with a lot of distinct keys. The 1st level hash map is full so a lot of rows will waste the 1st hash map lookup and inserted into 2nd hash map. This feature still benefits query with not so many distinct keys though, so introducing a config here `spark.sql.codegen.aggregate.final.map.twolevel.enabled`, to allow query to enable the feature when seeing benefit. ### Why are the changes needed? Fix query regression. ### Does this PR introduce _any_ user-facing change? Yes, the introduced `spark.sql.codegen.aggregate.final.map.twolevel.enabled` config. ### How was this patch tested? Existing unit test in `AggregationQuerySuite.scala`. Also verified generated code for an example query in the file: ``` spark.sql( """ |SELECT key, avg(value) |FROM agg1 |GROUP BY key """.stripMargin) ``` Verified the generated code for final hash aggregation not have two level maps by default: https://gist.github.com/c21/d4ce87ef28a22d1ce839e0cda000ce14 . Verified the generated code for final hash aggregation have two level maps if enabling the config: https://gist.github.com/c21/4b59752c1f3f98303b60ccff66b5db69 . Closes apache#34270 from c21/agg-fix. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 3354a21) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…n by default ### What changes were proposed in this pull request? This PR is to disable two level of maps for final hash aggregation by default. The feature was introduced in apache#32242 and we found it can lead to query performance regression when the final aggregation gets rows with a lot of distinct keys. The 1st level hash map is full so a lot of rows will waste the 1st hash map lookup and inserted into 2nd hash map. This feature still benefits query with not so many distinct keys though, so introducing a config here `spark.sql.codegen.aggregate.final.map.twolevel.enabled`, to allow query to enable the feature when seeing benefit. ### Why are the changes needed? Fix query regression. ### Does this PR introduce _any_ user-facing change? Yes, the introduced `spark.sql.codegen.aggregate.final.map.twolevel.enabled` config. ### How was this patch tested? Existing unit test in `AggregationQuerySuite.scala`. Also verified generated code for an example query in the file: ``` spark.sql( """ |SELECT key, avg(value) |FROM agg1 |GROUP BY key """.stripMargin) ``` Verified the generated code for final hash aggregation not have two level maps by default: https://gist.github.com/c21/d4ce87ef28a22d1ce839e0cda000ce14 . Verified the generated code for final hash aggregation have two level maps if enabling the config: https://gist.github.com/c21/4b59752c1f3f98303b60ccff66b5db69 . Closes apache#34270 from c21/agg-fix. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 3354a21) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
This PR is to disable two level of maps for final hash aggregation by default. The feature was introduced in #32242 and we found it can lead to query performance regression when the final aggregation gets rows with a lot of distinct keys. The 1st level hash map is full so a lot of rows will waste the 1st hash map lookup and inserted into 2nd hash map. This feature still benefits query with not so many distinct keys though, so introducing a config here
spark.sql.codegen.aggregate.final.map.twolevel.enabled
, to allow query to enable the feature when seeing benefit.Why are the changes needed?
Fix query regression.
Does this PR introduce any user-facing change?
Yes, the introduced
spark.sql.codegen.aggregate.final.map.twolevel.enabled
config.How was this patch tested?
Existing unit test in
AggregationQuerySuite.scala
.Also verified generated code for an example query in the file:
Verified the generated code for final hash aggregation not have two level maps by default:
https://gist.github.com/c21/d4ce87ef28a22d1ce839e0cda000ce14 .
Verified the generated code for final hash aggregation have two level maps if enabling the config:
https://gist.github.com/c21/4b59752c1f3f98303b60ccff66b5db69 .