
[SPARK-37001][SQL] Disable two level of map for final hash aggregation by default #34270

Closed
wants to merge 3 commits into from

Conversation

Contributor

@c21 c21 commented Oct 13, 2021

What changes were proposed in this pull request?

This PR disables the two-level map for final hash aggregation by default. The feature was introduced in #32242, and we found it can cause a query performance regression when the final aggregation receives rows with many distinct keys: the 1st-level hash map fills up, so many rows waste a 1st-level lookup before being inserted into the 2nd-level map. The feature still benefits queries with fewer distinct keys, so this PR introduces a config, spark.sql.codegen.aggregate.final.map.twolevel.enabled, to let a query enable the feature when it sees a benefit.
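
For illustration, a minimal sketch (not part of this PR's diff; the session setup is illustrative, the config key is the one proposed here) of how the feature can be toggled per session:

```
import org.apache.spark.sql.SparkSession

// Illustrative local session; any existing SparkSession works the same way.
val spark = SparkSession.builder()
  .appName("final-agg-two-level-map-demo")
  .master("local[*]")
  .getOrCreate()

// After this PR, final hash aggregation uses only the single-level map by default.
// Queries whose final aggregate has few distinct keys can opt back in:
spark.conf.set("spark.sql.codegen.aggregate.final.map.twolevel.enabled", "true")
```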

Why are the changes needed?

Fix query regression.

Does this PR introduce any user-facing change?

Yes, this PR introduces the spark.sql.codegen.aggregate.final.map.twolevel.enabled config.

How was this patch tested?

Existing unit test in AggregationQuerySuite.scala.

Also verified the generated code for an example query:

```
spark.sql(
    """
      |SELECT key, avg(value)
      |FROM agg1
      |GROUP BY key
    """.stripMargin)
```

Verified that the generated code for final hash aggregation does not have two-level maps by default:
https://gist.github.com/c21/d4ce87ef28a22d1ce839e0cda000ce14 .

Verified that the generated code for final hash aggregation has two-level maps when the config is enabled:
https://gist.github.com/c21/4b59752c1f3f98303b60ccff66b5db69 .
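
For completeness, one way to inspect the generated code for the example query above (the agg1 data below is hypothetical; debugCodegen comes from org.apache.spark.sql.execution.debug):

```
import org.apache.spark.sql.execution.debug._
import spark.implicits._

// Hypothetical agg1 input with (key, value) columns.
Seq((1, 10.0), (1, 20.0), (2, 30.0)).toDF("key", "value")
  .createOrReplaceTempView("agg1")

val df = spark.sql(
  """
    |SELECT key, avg(value)
    |FROM agg1
    |GROUP BY key
  """.stripMargin)

// Prints the whole-stage-codegen Java source; the final aggregate's fast
// (two-level) hash map code should appear only when the config is enabled.
df.debugCodegen()
```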

@github-actions github-actions bot added the SQL label Oct 13, 2021
Contributor Author

c21 commented Oct 13, 2021

cc @cloud-fan could you help take a look when you have time? Thanks!

@SparkQA

SparkQA commented Oct 13, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48669/

.version("2.3.0")
.booleanConf
.createWithDefault(true)

val ENABLE_TWOLEVEL_FINAL_AGG_MAP =
buildConf("spark.sql.codegen.aggregate.final.map.twolevel.enabled")
Contributor

how about spark.sql.codegen.aggregate.map.twolevel.partialOnly

Contributor Author

@cloud-fan - sure, updated. Given the new meaning of the config, I changed its default value to true as well.

.doc("Enable two-level aggregate hash map for final aggregate as well. Disable by default " +
"because final aggregate might get more distinct keys compared to partial aggregate. " +
"Overhead of looking up 1st-level map might dominate when having a lot of distinct keys.")
.version("3.2.0")
Contributor

3.2.1

Contributor Author

@cloud-fan - yes, updated.

@SparkQA

SparkQA commented Oct 13, 2021

Test build #144190 has finished for PR 34270 at commit 255daac.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 13, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48669/

Member

@dongjoon-hyun dongjoon-hyun left a comment

Hi, @c21. Thank you for making a PR.
However, SPARK-35141 was already released in Apache Spark 3.2.0.
You cannot make this a follow-up, because this PR will be released in Apache Spark 3.2.1.
Please file a new JIRA issue and use it for this kind of PR.

@c21 c21 changed the title [SPARK-35141][SQL][FOLLOWUP] Disable two level of map for final hash aggregation by default [SPARK-37001][SQL] Disable two level of map for final hash aggregation by default Oct 13, 2021
Contributor Author

@c21 c21 left a comment

Addressed all comments from @cloud-fan and @dongjoon-hyun, thanks.

.version("2.3.0")
.booleanConf
.createWithDefault(true)

val ENABLE_TWOLEVEL_FINAL_AGG_MAP =
buildConf("spark.sql.codegen.aggregate.final.map.twolevel.enabled")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan - sure, updated. So given the new meaning of config, changed the default config value to true as well.

.doc("Enable two-level aggregate hash map for final aggregate as well. Disable by default " +
"because final aggregate might get more distinct keys compared to partial aggregate. " +
"Overhead of looking up 1st-level map might dominate when having a lot of distinct keys.")
.version("3.2.0")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan - yes, updated.

@SparkQA

SparkQA commented Oct 14, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48696/

@SparkQA

SparkQA commented Oct 14, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48696/

@@ -3865,6 +3875,8 @@ class SQLConf extends Serializable with Logging {

def enableTwoLevelAggMap: Boolean = getConf(ENABLE_TWOLEVEL_AGG_MAP)

def enableTwoLevelAggMapPartialOnly: Boolean = getConf(ENABLE_TWOLEVEL_AGG_MAP_PARTIAL_ONLY)
Contributor

nit: we don't need to add a corresponding conf method if it's only called once.

Contributor Author

@cloud-fan - updated.
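
As a rough sketch of the reviewer's point (an illustrative fragment, not the exact diff; it assumes a SQLConf instance named conf is in scope at the single call site):

```
// Instead of adding a one-off accessor on SQLConf such as
//   def enableTwoLevelAggMapPartialOnly: Boolean = getConf(ENABLE_TWOLEVEL_AGG_MAP_PARTIAL_ONLY)
// the entry can be read directly where it is needed:
val partialOnlyTwoLevelMap = conf.getConf(SQLConf.ENABLE_TWOLEVEL_AGG_MAP_PARTIAL_ONLY)
```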

@SparkQA

SparkQA commented Oct 14, 2021

Test build #144217 has finished for PR 34270 at commit d02f108.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Alice 1 2 165.0
NULL 3 7 172.5
Bob 0 5 180.0
Contributor Author

This change is needed to pass the unit test; it reverts the change made in https://github.com/apache/spark/pull/32242/files.

@SparkQA

SparkQA commented Oct 14, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48725/

@SparkQA

SparkQA commented Oct 14, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48725/

@cloud-fan
Contributor

thanks, merging to master/3.2!

@cloud-fan cloud-fan closed this in 3354a21 Oct 14, 2021
cloud-fan pushed a commit that referenced this pull request Oct 14, 2021
…n by default


Closes #34270 from c21/agg-fix.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 3354a21)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@SparkQA

SparkQA commented Oct 14, 2021

Test build #144245 has finished for PR 34270 at commit 889adbe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

sunchao pushed a commit to sunchao/spark that referenced this pull request Dec 8, 2021
catalinii pushed a commit to lyft/spark that referenced this pull request Feb 22, 2022
catalinii pushed a commit to lyft/spark that referenced this pull request Mar 4, 2022