
[SPARK-31270][CORE] Expose executor metrics in task detail table in StageLevel #28061

Closed
wants to merge 5 commits into from

Conversation

@AngersZhuuuu (Contributor) commented Mar 28, 2020

What changes were proposed in this pull request?

Expose executor metrics in the task detail table at the stage level.
[screenshots]

Why are the changes needed?

Helps developers check the application's running status.

Does this PR introduce any user-facing change?

Yes. Users can see the peak memory usage of the executor that a task ran on by checking the Peak JVM On/Off Heap Memory checkboxes under Additional Metrics.

How was this patch tested?

Ran manually with spark-sql:

./bin/spark-sql --conf spark.eventLog.enabled=true --conf spark.eventLog.logStageExecutorMetrics.enabled=true --conf spark.eventLog.dir=/Users/angerszhu/tmp/spark-events/ --conf spark.executor.metrics.pollingInterval=100

Then ran the SQL:
spark-sql> select count(1) from test_dy_data group by id;

@dongjoon-hyun (Member)

ok to test

@@ -572,7 +600,7 @@ $(document).ready(function () {
     var row1 = createRowMetadataForColumn(
       columnKey, taskMetricsResponse[columnKey], 4);
     var row2 = createRowMetadataForColumn(
-      "shuffleWriteTime", taskMetricsResponse[columnKey], 21);
+      "shuffleWriteTime", taskMetricsResponse[columnKey], 23);
Member:

Hi, @AngersZhuuuu. If possible, could you try to avoid this change?

Contributor Author:

If possible, could you try to avoid this change?

Since this PR adds new columns to Additional Metrics, the column indices change, so this update is needed.

To be honest, I spent a lot of time making sure the columns change correctly when a new column is added, since the Spark web UI code has changed a lot and no earlier PR to stagepage.js has added a column like this.

If it's OK, this PR can serve as a demo for future development of the Spark web UI.

Member:

That's the exact reason why I asked that. I've seen several UI patches like this. :)

Since this PR adds new columns to Additional Metrics, the column indices change, so this update is needed.

There already exist lots of similar commits for Spark web UI development. The commits that have already been reviewed and merged are always the best demo for developers.

If it's OK, this PR can serve as a demo for future development of the Spark web UI.

Contributor Author:

There already exist lots of similar commits for Spark web UI development. The commits that have already been reviewed and merged are always the best demo for developers.

Each page's JS script is different, and I found that for the stage page there is no merged PR that adds a new additional column.

Member:

@AngersZhuuuu I found the following issues.

  1. When a job performs a shuffle, the StagePage for the shuffle write stage has a problem.
    Even if you visit the StagePage where the corresponding stage performs shuffle write and check Shuffle Write Time,
    the metric is not shown.

[screenshot]

  2. The Web UI hangs.
    When you visit a StagePage where the corresponding stage performs shuffle read and reload the page, the UI hangs.

[screenshot]

Those issues are caused by the column indices not being changed properly.
To avoid such issues, please keep it simple. I believe the column indices do not need to be changed.

@dongjoon-hyun (Member) left a comment

Maybe,
peakJvmHeapMemory -> peakJvmOnHeapMemory?
Peak JVM Heap Memory -> Peak JVM On Heap Memory?

@dongjoon-hyun (Member)

cc @sarutak

@SparkQA

SparkQA commented Mar 29, 2020

Test build #120542 has finished for PR 28061 at commit 7807901.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu (Contributor Author)

Maybe,
peakJvmHeapMemory -> peakJvmOnHeapMemory?
Peak JVM Heap Memory -> Peak JVM On Heap Memory?

Sure, that makes it clearer.

@sarutak (Member)

sarutak commented Mar 29, 2020

I'll review within a week.

@SparkQA

SparkQA commented Mar 29, 2020

Test build #120555 has finished for PR 28061 at commit 48a355d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 29, 2020

Test build #120556 has finished for PR 28061 at commit bbea4f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

Thank you, @sarutak !

@sarutak (Member) left a comment

I'm inspecting the change now, but I have one fundamental question.
Peak JVM On/Off Heap Memory are executor metrics, and they don't indicate the peak values for each task.
So I wonder whether it's helpful to display those metrics for each task.

To explain concretely, I tested this change with the following code in spark-shell.

$ bin/spark-shell --master local[1] --conf spark.executor.heartbeatInterval=1s
sc.parallelize(List(1, 2, 3), 3).map { e =>
  val context = org.apache.spark.TaskContext.get
  if (context.partitionId == 0) {
    val dummyData = List.tabulate(Int.MaxValue / 250){i => "Index " + i}
  }
  Thread.sleep(3000)
  e
}.collect

With this code, all the tasks run on the same executor, and task0 is memory-consuming while task1 and task2 are not.
Although task1 and task2 should not consume as much memory as task0, their Peak JVM On/Off Heap Memory values are greater than those of task0.
This is because those metrics are accumulated within the same JVM.

[screenshot]
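To make the point above concrete, here is a minimal, self-contained Scala sketch (illustrative names and numbers only, not Spark's actual ExecutorMetrics code) of why a JVM-wide peak metric leaks across tasks that run on the same executor.

// Minimal sketch: one peak tracker per executor JVM, shared by all tasks it runs.
object PeakMetricsSketch {
  private var peakHeapBytes = 0L

  // Called on every metrics poll: keep the maximum value ever observed.
  def compareAndUpdatePeak(currentHeapBytes: Long): Unit = {
    if (currentHeapBytes > peakHeapBytes) peakHeapBytes = currentHeapBytes
  }

  def main(args: Array[String]): Unit = {
    compareAndUpdatePeak(800L * 1024 * 1024) // snapshot while task0 allocates a lot
    compareAndUpdatePeak(50L * 1024 * 1024)  // snapshot while task1 allocates little
    // A snapshot taken while task1 runs still reports task0's peak,
    // because the tracker belongs to the JVM, not to the task.
    println(s"peak reported while task1 runs = $peakHeapBytes bytes")
  }
}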

executorMetrics: Option[ExecutorMetrics] = None): v1.TaskMetrics = {
executorMetrics.foreach {
this.executorMetrics.compareAndUpdatePeakValues
}
Member:

How about executorMetrics.foreach(this.executorMetrics.compareAndUpdatePeakValues)?
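For readers unfamiliar with the shorthand, a small standalone sketch (a plain function stands in for compareAndUpdatePeakValues, whose exact signature is not shown here) of why the block form in the diff and the suggested point-free form behave identically: the method reference is simply eta-expanded to a function.

object ForeachFormsSketch {
  // Stand-in for this.executorMetrics.compareAndUpdatePeakValues.
  def updatePeak(metrics: Long): Unit = println(s"updating peak with $metrics")

  def main(args: Array[String]): Unit = {
    val executorMetrics: Option[Long] = Some(42L)
    executorMetrics.foreach { updatePeak }   // block form, as in the diff
    executorMetrics.foreach(updatePeak)      // point-free form, as suggested
  }
}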

@sarutak (Member)

sarutak commented Apr 5, 2020

Additional comments:

  • Could you explain more about the manual test in the description so that we can understand what and how you tested?
  • With this change, the appearance of the UI changes, so this change should be marked as user-facing.
  • Please update the screenshots when you update the code.

@AngersZhuuuu (Contributor Author)

So I wonder whether it's helpful to display those metrics for each task.

It's like the Stage Executor Metrics: they show each executor's peak memory during a stage, but there is still a problem there too, since tasks from other stages can run on the same executor during that stage when the executor has more than one core.

We just show each task's JVM status. As an SRE, we need to know as much about a task's running status as possible, and any one of these metrics can help the user find problems.

@sarutak (Member)

sarutak commented Apr 8, 2020

OK, I understand the motivation.
I confirmed that those peak metrics are calculated for each task, so the fundamental idea might be fine, even though overlapping or consecutive tasks affect each other, just like GC Time.
What do you think, @dongjoon-hyun, @gengliangwang?

@@ -327,6 +336,13 @@ $(document).ready(function () {
"should be approximately the sum of the peak sizes across all such data structures created " +
"in this task. For SQL jobs, this only tracks all unsafe operators, broadcast joins, and " +
"external sort.");
$('#peak_jvm_on_heap_memory').attr("data-toggle", "tooltip")
.attr("data-placement", "top")
.attr("title", "Peak Executor JVM on-heap memory usage during this task running");
Member:

"tasks" might be more appropriate than "this task".

.attr("title", "Peak Executor JVM on-heap memory usage during this task running");
$('#peak_jvm_off_heap_memory').attr("data-toggle", "tooltip")
.attr("data-placement", "top")
.attr("title", "Peak Executor JVM off-heap memory usage during this task running");
Member:

Same comment as here.

@gengliangwang (Member)

gengliangwang commented Apr 8, 2020

@AngersZhuuuu Sorry for the late reply.
Actually, I am slightly -1 on the changes since they are not straightforward enough. Even @sarutak had to think about the meaning of the metrics. The existing task metrics are the sum of values from executors, while the new one is the max of values from executors.

How about just displaying the max on/off heap memory in the executor page?
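A tiny standalone sketch (illustrative numbers only) of the distinction being drawn here: the existing task-table metrics aggregate by summing per-task values, while a per-executor peak aggregates by taking a maximum, so every task on that executor would show the same JVM-wide value.

object SumVsMaxSketch {
  def main(args: Array[String]): Unit = {
    // Illustrative per-task shuffle write bytes and per-poll executor heap snapshots.
    val shuffleWriteBytesPerTask = Seq(100L, 200L, 300L)
    val executorHeapSnapshots = Seq(800L, 50L, 60L)

    // Existing task metrics: summed across tasks.
    println(s"total shuffle write bytes = ${shuffleWriteBytesPerTask.sum}")
    // Peak executor metric: the maximum snapshot, shared by all tasks on that executor.
    println(s"peak JVM heap bytes = ${executorHeapSnapshots.max}")
  }
}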

@AngersZhuuuu (Contributor Author)

@AngersZhuuuu Sorry for the late reply.
Actually, I am slightly -1 on the changes since they are not straightforward enough. Even @sarutak had to think about the meaning of the metrics.

How about just displaying the max on/off heap memory in the executor page?

How about this? #28036
#28036 (comment)

@gengliangwang (Member)

How about this? #28036
#28036 (comment)

That's way better!

@AngersZhuuuu (Contributor Author)

That's way better!

OK, I will focus on that PR and follow your advice.

@gengliangwang (Member)

@sarutak I prefer the approach in #28036 .
What do you think? If so, shall we close this one?

@sarutak (Member)

sarutak commented Apr 8, 2020

I also prefer that.

@AngersZhuuuu (Contributor Author)

cc @sarutak @gengliangwang Closing it.
