[SPARK-31952][SQL][3.0] Fix incorrect memory spill metric when doing Aggregate #31140

Ngone51 · 2021-01-12T03:07:45Z

What changes were proposed in this pull request?

This PR takes over #28780.

1. Count the spilled memory size when creating the `UnsafeExternalSorter` with the existing `InMemorySorter`.
2. Accumulate the `totalSpillBytes` when merging two `UnsafeExternalSorter`s.

Why are the changes needed?

As mentioned in #28780:

It happens when hash aggregate falls back to sort-based aggregate.
`UnsafeExternalSorter.createWithExistingInMemorySorter` calls `spill` on the `InMemorySorter` immediately, but the memory that the `InMemorySorter` points to was acquired by the outside `BytesToBytesMap`, not tracked in the `allocatedPages` of the `UnsafeExternalSorter`. As a result, the memory-bytes-spilled metric is always 0, while the disk-bytes-spilled metric is correct.
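The accounting gap described above can be modeled with a small, self-contained Java sketch. All names here (`SpillAccountingSketch`, `TaskMetrics`, `createWithExistingBefore`, `createWithExistingAfter`, `existingMemoryConsumption`) are hypothetical simplifications, not Spark's actual API. The sketch assumes, as the description states, that the memory-spilled figure is derived only from pages the sorter allocated itself, so memory owned by an outside map is invisible to it unless the caller reports it explicitly.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, heavily simplified model of the spill accounting described above.
// None of these names are Spark's real API; they only mirror the idea.
public class SpillAccountingSketch {

    // Stand-in for task-level metrics such as "memory bytes spilled".
    static final class TaskMetrics {
        long memoryBytesSpilled;
        long diskBytesSpilled;
    }

    static final class Sorter {
        final TaskMetrics metrics;
        // Pages this sorter allocated itself; memory owned by an outside map is NOT listed here.
        final List<Long> allocatedPages = new ArrayList<>();
        long totalSpillBytes;

        Sorter(TaskMetrics metrics) {
            this.metrics = metrics;
        }

        // Memory spilled is computed from the sorter's own pages only;
        // disk bytes are whatever was written out.
        long spill(long bytesWrittenToDisk) {
            long freed = 0;
            for (long page : allocatedPages) {
                freed += page;
            }
            allocatedPages.clear();
            metrics.memoryBytesSpilled += freed;
            metrics.diskBytesSpilled += bytesWrittenToDisk;
            totalSpillBytes += freed;
            return freed;
        }

        // Before the fix: the records live in memory owned by an outside map,
        // so spill() frees nothing from allocatedPages and memory spilled stays 0.
        static Sorter createWithExistingBefore(TaskMetrics m, long bytesWrittenToDisk) {
            Sorter s = new Sorter(m);
            s.spill(bytesWrittenToDisk);
            return s;
        }

        // After the fix (sketch): the caller passes the outside map's memory
        // consumption, which is added to the memory-spilled figures explicitly.
        static Sorter createWithExistingAfter(TaskMetrics m, long bytesWrittenToDisk,
                                              long existingMemoryConsumption) {
            Sorter s = new Sorter(m);
            s.spill(bytesWrittenToDisk);
            m.memoryBytesSpilled += existingMemoryConsumption;
            s.totalSpillBytes += existingMemoryConsumption;
            return s;
        }
    }

    public static void main(String[] args) {
        TaskMetrics before = new TaskMetrics();
        Sorter.createWithExistingBefore(before, 4096);
        System.out.println("before: memory=" + before.memoryBytesSpilled
                + " disk=" + before.diskBytesSpilled);

        TaskMetrics after = new TaskMetrics();
        Sorter.createWithExistingAfter(after, 4096, 65536);
        System.out.println("after:  memory=" + after.memoryBytesSpilled
                + " disk=" + after.diskBytesSpilled);
    }
}
```

In this model the disk figure is identical in both cases; only the memory figure changes. That matches the symptom reported above, where the disk spill metric was already right.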

In addition, this PR fixes `UnsafeExternalSorter.merge` so that it accumulates the `totalSpillBytes` of both sorters. This lets `HashAggregateExec.finishAggregate` report the correct spill size.
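The merge fix amounts to carrying the other sorter's spill total over into the merged sorter. A minimal Java sketch of that idea follows. The class `MergeSpillSketch`, its `Sorter`, and the `recordSpill` helper are hypothetical simplifications, not Spark's `UnsafeExternalSorter`. The sketch assumes a merged sorter takes over the other sorter's spill files and should therefore also take over its spill byte count.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of merging two external sorters' spill state.
// Not Spark's real API; it only illustrates accumulating totalSpillBytes on merge.
public class MergeSpillSketch {

    static final class Sorter {
        final List<String> spillFiles = new ArrayList<>();
        long totalSpillBytes;

        // Record one spill: remember its file and add its size to the running total.
        void recordSpill(String file, long bytes) {
            spillFiles.add(file);
            totalSpillBytes += bytes;
        }

        // Take over the other sorter's spill files. The totalSpillBytes line is the
        // part the fix describes: without it, bytes spilled by `other` would be
        // missing from the merged sorter's total.
        void merge(Sorter other) {
            spillFiles.addAll(other.spillFiles);
            totalSpillBytes += other.totalSpillBytes;
        }
    }

    public static void main(String[] args) {
        Sorter a = new Sorter();
        a.recordSpill("spill-a-0", 1000);
        Sorter b = new Sorter();
        b.recordSpill("spill-b-0", 2500);
        a.merge(b);
        // The merged sorter now reports what both sorters spilled.
        System.out.println(a.totalSpillBytes); // prints 3500
    }
}
```

Whoever reads the total from the merged sorter afterwards (in this PR, `HashAggregateExec.finishAggregate`) would otherwise under-report the spill size by whatever the other sorter had spilled.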

The issue can be reproduced by running the following and then checking the SQL metrics in the UI:

bin/spark-shell --driver-memory 512m --executor-memory 512m --executor-cores 1 --conf "spark.default.parallelism=1"
scala> sql("select id, count(1) from range(10000000) group by id").write.csv("/tmp/result.json")

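Before: (screenshot of the SQL metrics in the UI; image not preserved)

After: (screenshot of the SQL metrics in the UI; image not preserved)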

Does this PR introduce any user-facing change?

Yes, users can see the correct spill metrics after this PR.

How was this patch tested?

Tested manually and added UTs.

Commit (message duplicates the PR description above): Closes apache#31035 from Ngone51/SPARK-31952.
Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: wangguangxin.cn <wangguangxin.cn@bytedance.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Ngone51 · 2021-01-12T03:08:50Z

cc @cloud-fan

cloud-fan · 2021-01-12T03:22:38Z

ok to test

SparkQA · 2021-01-12T04:22:11Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38540/

SparkQA · 2021-01-12T04:43:00Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38540/

cloud-fan · 2021-01-12T05:27:59Z

thanks, merging to 3.0!

Commit (message duplicates the PR description above): Closes #31140 from Ngone51/cp-spark-31952.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

dongjoon-hyun · 2021-01-12T05:28:07Z

Merged to branch-3.0.

SparkQA · 2021-01-12T06:14:26Z

Test build #133953 has finished for PR 31140 at commit 7381539.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Ngone51 mentioned this pull request Jan 12, 2021

[SPARK-31952][SQL] Fix incorrect memory spill metric when doing Aggregate #31035 (Closed)

cloud-fan approved these changes Jan 12, 2021


dongjoon-hyun approved these changes Jan 12, 2021


dongjoon-hyun closed this Jan 12, 2021
