
[SPARK-31253][SQL][FOLLOW-UP] Improve the partition data size metrics in CustomShuffleReaderExec #28175

Closed
wants to merge 11 commits

Conversation

JkSelf (Contributor) commented Apr 10, 2020

### What changes were proposed in this pull request?

Currently the partition data size metrics show up as three separate entries (min/max/avg) in the Spark UI, which is not user friendly. This PR combines min/max/avg into a single metrics entry by calling SQLMetrics.postDriverMetricUpdates multiple times, once per partition.
Before this PR, the Spark UI shows:
[screenshot: partition data size as three separate metric entries]

After this PR, the Spark UI shows:
[screenshot: partition data size as a single entry with min/med/max]
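The mechanism, roughly: a SIZE metric in the SQL UI is rendered as a single "total (min, med, max)" line, and each posted update contributes one data point. A minimal sketch of the idea (partitionSizes is an assumed name, not from the diff):

```scala
// Hedged sketch, not the exact PR diff: post the "partitionDataSize"
// SIZE metric once per partition; the SQL UI aggregates the posted
// values into one "total (min, med, max)" entry.
val partitionMetrics = metrics("partitionDataSize")
partitionSizes.foreach { dataSize =>
  partitionMetrics.set(dataSize)
  SQLMetrics.postDriverMetricUpdates(sparkContext, executionId, Seq(partitionMetrics))
}
```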

### Why are the changes needed?

Improves the UI.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing unit tests.

JkSelf (Contributor Author) commented Apr 10, 2020

@cloud-fan @maryannxue

@HyukjinKwon HyukjinKwon changed the title [SPARK-31253][SQL][followup] Improve the partition data size metrics in CustomShuffleReaderExec [SPARK-31253][SQL][FOLLOW-UO] Improve the partition data size metrics in CustomShuffleReaderExec Apr 10, 2020
@HyukjinKwon HyukjinKwon changed the title [SPARK-31253][SQL][FOLLOW-UO] Improve the partition data size metrics in CustomShuffleReaderExec [SPARK-31253][SQL][FOLLOW-UP] Improve the partition data size metrics in CustomShuffleReaderExec Apr 10, 2020
metrics("partitionDataSize").set(dataSize)
SQLMetrics.postDriverMetricUpdates(
sparkContext, executionId,
metrics.filter(_._1 == "partitionDataSize").values.toSeq)
Review comment (Contributor):
can we look up the partitionDataSize SQLMetric at the beginning of this method? then here we can simply write Seq(metric).
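A minimal sketch of that suggestion:

```scala
// Resolve the SQLMetric once at the top of the method...
val partitionMetrics = metrics("partitionDataSize")
// ...so every post site can simply be:
SQLMetrics.postDriverMetricUpdates(sparkContext, executionId, Seq(partitionMetrics))
```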

@cloud-fan (Contributor):
can you put the before/after screenshots?


SparkQA commented Apr 10, 2020

Test build #121054 has finished for PR 28175 at commit 806a143.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

private def sendPartitionDataSizeMetrics(
    executionId: String,
    partitionMetrics: SQLMetric): Unit = {
  val mapStats = shuffleStage.get.mapStats.get.bytesByPartitionId
Review comment (Contributor):
Let's follow the previous code: https://github.com/apache/spark/pull/28175/files#diff-a42cafdbb5870e28c4e03df50ffc44f6L111

If shuffleStage.get.mapStats.isEmpty, we send the metric value as 0 only once.


SparkQA commented Apr 10, 2020

Test build #121076 has finished for PR 28175 at commit 4aabfaa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Apr 10, 2020

Test build #121071 has finished for PR 28175 at commit bbd6324.

  • This patch fails PySpark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

val mapStats = shuffleStage.get.mapStats
if (mapStats.isEmpty) {
  metrics("partitionDataSize").set(0)
  SQLMetrics.postDriverMetricUpdates(sparkContext, executionId, Seq{partitionMetrics})
Review comment (Contributor):
Seq{partitionMetrics} ?

should be Seq(partitionMetrics)

val dataSize = startReducerIndex.until(endReducerIndex).map(
  mapStats.get.bytesByPartitionId(_)).sum
metrics("partitionDataSize").set(dataSize)
SQLMetrics.postDriverMetricUpdates(sparkContext, executionId, Seq{partitionMetrics})
Review comment (Contributor):
ditto

    partitionMetrics: SQLMetric): Unit = {
  val mapStats = shuffleStage.get.mapStats
  if (mapStats.isEmpty) {
    metrics("partitionDataSize").set(0)
Review comment (Contributor):
partitionMetrics.set(0)

    sum += dataSize
  case p: PartialReducerPartitionSpec =>
    metrics("partitionDataSize").set(p.dataSize)
    SQLMetrics.postDriverMetricUpdates(sparkContext, executionId, Seq{partitionMetrics})
Review comment (Contributor):
ditto

  case CoalescedPartitionSpec(startReducerIndex, endReducerIndex) =>
    val dataSize = startReducerIndex.until(endReducerIndex).map(
      mapStats.get.bytesByPartitionId(_)).sum
    metrics("partitionDataSize").set(dataSize)
Review comment (Contributor):
ditto

  case p => throw new IllegalStateException("unexpected " + p)
}
// Set sum value to "partitionDataSize" metric.
metrics("partitionDataSize").set(sum)
Review comment (Contributor):
ditto

  metrics.filter(_._1 != "partitionDataSize").values.toSeq)

if (!isLocalReader && shuffleStage.get.mapStats.isDefined) {
  sendPartitionDataSizeMetrics(executionId, metrics.get("partitionDataSize").get)
Review comment (Contributor):
why not do val partitionMetrics = metrics("partitionDataSize") inside the method instead of passing as a parameter?
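A sketch of the suggested shape (the PR later adopts this signature, as the diff below shows):

```scala
private def sendPartitionDataSizeMetrics(executionId: String): Unit = {
  // Look the metric up inside the method instead of passing it in.
  val partitionMetrics = metrics("partitionDataSize")
  // ... compute per-partition sizes and post them via partitionMetrics ...
}
```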


SparkQA commented Apr 13, 2020

Test build #121168 has finished for PR 28175 at commit e6c50df.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Apr 13, 2020

Test build #121173 has finished for PR 28175 at commit ae1824a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -128,6 +104,34 @@ case class CustomShuffleReaderExec private(
Map("numSkewedPartitions" -> metrics)
}

private def sendPartitionDataSizeMetrics(
    executionId: String): Unit = {
Review comment (Contributor):
we can merge this to the previous line now.


SparkQA commented Apr 13, 2020

Test build #121184 has finished for PR 28175 at commit fda6846.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

JkSelf (Contributor Author) commented Apr 13, 2020

retest this please


SparkQA commented Apr 13, 2020

Test build #121196 has finished for PR 28175 at commit fda6846.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

retest this please


SparkQA commented Apr 13, 2020

Test build #121214 has finished for PR 28175 at commit fda6846.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maryannxue (Contributor):

After offline discussions with @cloud-fan, we agree that it'd be more efficient to add a new metric type rather than posting metrics for all the partitions. With the new type, we can have max, min, avg, median (or anything you want).

JkSelf (Contributor Author) commented Apr 14, 2020

@maryannxue Instead of creating a new metric type, we can add a new method postDriverMetricsUpdatedByValue that passes all the partition data sizes at once, reducing the overhead.
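A sketch of what such a helper might look like; the signature matches the diff below, while the body is an assumption about how driver-side accumulator updates are posted:

```scala
def postDriverMetricsUpdatedByValue(
    sc: SparkContext,
    executionId: String,
    accumUpdates: Seq[(Long, Long)]): Unit = {  // (accumulator id, value) pairs
  if (executionId != null) {
    // A single listener event carries every (id, value) pair, instead of
    // mutating and re-posting a SQLMetric once per partition.
    sc.listenerBus.post(
      SparkListenerDriverAccumUpdates(executionId.toLong, accumUpdates))
  }
}
```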

@@ -222,6 +222,15 @@ object SQLMetrics {
}
}

def postDriverMetricsUpdatedByValue(
    sc: SparkContext, executionId: String,
Review comment (Contributor):
nit: one parameter per line


SparkQA commented Apr 14, 2020

Test build #121264 has finished for PR 28175 at commit efdd9b7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Apr 14, 2020

Test build #121270 has finished for PR 28175 at commit 3ee19f8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


val id = partitionMetrics.id
val accumUpdates = sizes.map(value => (id, value))
SQLMetrics.postDriverMetricsUpdatedByValue(sparkContext, executionId, accumUpdates)
Review comment (Contributor):
Why can't we send all metrics together?

Reply (Contributor Author):
Makes sense, and already updated.
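The batched form might look like the following sketch, where driverMetrics is an assumed name for the other driver-side metrics being posted alongside the partition sizes:

```scala
// Combine per-partition sizes with the remaining driver metrics and post
// them together in one call rather than one call per metric.
val accumUpdates = sizes.map(size => (partitionMetrics.id, size)) ++
  driverMetrics.map(m => (m.id, m.value))
SQLMetrics.postDriverMetricsUpdatedByValue(sparkContext, executionId, accumUpdates)
```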


SparkQA commented Apr 16, 2020

Test build #121347 has finished for PR 28175 at commit 72d46eb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Apr 16, 2020

Test build #121345 has finished for PR 28175 at commit 6c70108.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Apr 16, 2020

Test build #121351 has finished for PR 28175 at commit 2e2dfb8.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

retest this please


SparkQA commented Apr 16, 2020

Test build #121362 has finished for PR 28175 at commit 2e2dfb8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

JkSelf (Contributor Author) commented Apr 17, 2020

retest this please


SparkQA commented Apr 17, 2020

Test build #121385 has finished for PR 28175 at commit 2e2dfb8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

thanks, merging to master!

@cloud-fan cloud-fan closed this in d136b72 Apr 17, 2020
jiangxb1987 pushed a commit that referenced this pull request Apr 17, 2020
…etrics of AQE shuffle

### What changes were proposed in this pull request?

A followup of #28175:
1. use mutable collection to store the driver metrics
2. don't send size metrics if there is no map stats, as UI will display size as 0 if there is no data
3. calculate partition data size separately, to make the code easier to read.

### Why are the changes needed?

code simplification

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

Closes #28240 from cloud-fan/refactor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>