
[SPARK-37001][SQL] Disable two level of map for final hash aggregation by default #34270

Closed
wants to merge 3 commits into from

Conversation

Contributor

@c21 c21 commented Oct 13, 2021

What changes were proposed in this pull request?

This PR disables the two-level map for final hash aggregation by default. The feature was introduced in #32242, and we found it can cause a query performance regression when the final aggregation receives rows with many distinct keys: the 1st-level hash map fills up, so many rows waste a 1st-level lookup before being inserted into the 2nd-level map. The feature still benefits queries with fewer distinct keys, so this PR introduces a config, spark.sql.codegen.aggregate.final.map.twolevel.enabled, to let a query enable the feature when it sees a benefit.
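
For illustration, a minimal sketch (not part of this PR's diff; the session setup is illustrative, the config key is the one proposed here) of how the feature can be toggled per session:

```
import org.apache.spark.sql.SparkSession

// Illustrative local session; any existing SparkSession works the same way.
val spark = SparkSession.builder()
  .appName("final-agg-two-level-map-demo")
  .master("local[*]")
  .getOrCreate()

// After this PR, final hash aggregation uses only the single-level map by default.
// Queries whose final aggregate has few distinct keys can opt back in:
spark.conf.set("spark.sql.codegen.aggregate.final.map.twolevel.enabled", "true")
```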

Why are the changes needed?

Fix query regression.

Does this PR introduce any user-facing change?

Yes, this PR introduces the spark.sql.codegen.aggregate.final.map.twolevel.enabled config.

How was this patch tested?

Existing unit test in AggregationQuerySuite.scala.

Also verified the generated code for an example query:

```
spark.sql(
    """
      |SELECT key, avg(value)
      |FROM agg1
      |GROUP BY key
    """.stripMargin)
```

Verified that the generated code for final hash aggregation does not have two-level maps by default:
https://gist.github.com/c21/d4ce87ef28a22d1ce839e0cda000ce14 .

Verified that the generated code for final hash aggregation has two-level maps when the config is enabled:
https://gist.github.com/c21/4b59752c1f3f98303b60ccff66b5db69 .
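
For completeness, one way to inspect the generated code for the example query above (the agg1 data below is hypothetical; debugCodegen comes from org.apache.spark.sql.execution.debug):

```
import org.apache.spark.sql.execution.debug._
import spark.implicits._

// Hypothetical agg1 input with (key, value) columns.
Seq((1, 10.0), (1, 20.0), (2, 30.0)).toDF("key", "value")
  .createOrReplaceTempView("agg1")

val df = spark.sql(
  """
    |SELECT key, avg(value)
    |FROM agg1
    |GROUP BY key
  """.stripMargin)

// Prints the whole-stage-codegen Java source; the final aggregate's fast
// (two-level) hash map code should appear only when the config is enabled.
df.debugCodegen()
```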

@github-actions github-actions bot added the SQL label Oct 13, 2021
Contributor Author

c21 commented Oct 13, 2021

cc @cloud-fan could you help take a look when you have time? Thanks!

@SparkQA

SparkQA commented Oct 13, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48669/

.version("2.3.0")
.booleanConf
.createWithDefault(true)

val ENABLE_TWOLEVEL_FINAL_AGG_MAP =
buildConf("spark.sql.codegen.aggregate.final.map.twolevel.enabled")
Contributor

how about spark.sql.codegen.aggregate.map.twolevel.partialOnly

Contributor Author

@cloud-fan - sure, updated. Given the new meaning of the config, I changed its default value to true as well.

.doc("Enable two-level aggregate hash map for final aggregate as well. Disable by default " +
"because final aggregate might get more distinct keys compared to partial aggregate. " +
"Overhead of looking up 1st-level map might dominate when having a lot of distinct keys.")
.version("3.2.0")
Contributor

3.2.1

Contributor Author

@cloud-fan - yes, updated.

@SparkQA

SparkQA commented Oct 13, 2021

Test build #144190 has finished for PR 34270 at commit 255daac.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 13, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48669/

Member

@dongjoon-hyun dongjoon-hyun left a comment

Hi, @c21. Thank you for making a PR.
However, SPARK-35141 was already released in Apache Spark 3.2.0.
You cannot make this a follow-up, because this PR will be released in Apache Spark 3.2.1.
Please file a new JIRA issue and use it for this kind of PR.

@c21 c21 changed the title [SPARK-35141][SQL][FOLLOWUP] Disable two level of map for final hash aggregation by default [SPARK-37001][SQL] Disable two level of map for final hash aggregation by default Oct 13, 2021
Contributor Author

@c21 c21 left a comment

Addressed all comments from @cloud-fan and @dongjoon-hyun, thanks.

.version("2.3.0")
.booleanConf
.createWithDefault(true)

val ENABLE_TWOLEVEL_FINAL_AGG_MAP =
buildConf("spark.sql.codegen.aggregate.final.map.twolevel.enabled")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan - sure, updated. So given the new meaning of config, changed the default config value to true as well.

.doc("Enable two-level aggregate hash map for final aggregate as well. Disable by default " +
"because final aggregate might get more distinct keys compared to partial aggregate. " +
"Overhead of looking up 1st-level map might dominate when having a lot of distinct keys.")
.version("3.2.0")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan - yes, updated.

@SparkQA

SparkQA commented Oct 14, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48696/

@SparkQA

SparkQA commented Oct 14, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48696/

@@ -3865,6 +3875,8 @@ class SQLConf extends Serializable with Logging {

def enableTwoLevelAggMap: Boolean = getConf(ENABLE_TWOLEVEL_AGG_MAP)

def enableTwoLevelAggMapPartialOnly: Boolean = getConf(ENABLE_TWOLEVEL_AGG_MAP_PARTIAL_ONLY)
Contributor

nit: we don't need to add a corresponding conf method if it's only called once.

Contributor Author

@cloud-fan - updated.
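
As a rough sketch of the reviewer's point (an illustrative fragment, not the exact diff; it assumes a SQLConf instance named conf is in scope at the single call site):

```
// Instead of adding a one-off accessor on SQLConf such as
//   def enableTwoLevelAggMapPartialOnly: Boolean = getConf(ENABLE_TWOLEVEL_AGG_MAP_PARTIAL_ONLY)
// the entry can be read directly where it is needed:
val partialOnlyTwoLevelMap = conf.getConf(SQLConf.ENABLE_TWOLEVEL_AGG_MAP_PARTIAL_ONLY)
```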

@SparkQA

SparkQA commented Oct 14, 2021

Test build #144217 has finished for PR 34270 at commit d02f108.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Alice 1 2 165.0
NULL 3 7 172.5
Bob 0 5 180.0
Contributor Author

This change is needed to pass the unit test; it reverts the change made in https://github.com/apache/spark/pull/32242/files.

@SparkQA

SparkQA commented Oct 14, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48725/

@SparkQA

SparkQA commented Oct 14, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48725/

@cloud-fan
Contributor

thanks, merging to master/3.2!

@cloud-fan cloud-fan closed this in 3354a21 Oct 14, 2021
cloud-fan pushed a commit that referenced this pull request Oct 14, 2021
…n by default


Closes #34270 from c21/agg-fix.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 3354a21)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@SparkQA

SparkQA commented Oct 14, 2021

Test build #144245 has finished for PR 34270 at commit 889adbe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

sunchao pushed a commit to sunchao/spark that referenced this pull request Dec 8, 2021
catalinii pushed a commit to lyft/spark that referenced this pull request Feb 22, 2022
catalinii pushed a commit to lyft/spark that referenced this pull request Mar 4, 2022