
[SPARK-31270][CORE] Expose executor metrics in task detail table in StageLevel #28061

Closed
wants to merge 5 commits into from

Conversation

@AngersZhuuuu (Contributor) commented Mar 28, 2020

What changes were proposed in this pull request?

Expose executor metrics in the task detail table at the stage level.
[screenshots]

Why are the changes needed?

Helps developers check the application's running status.

Does this PR introduce any user-facing change?

Yes. Users can see the peak memory usage of the executor that a task ran on by checking the Peak JVM On/Off Heap Memory checkboxes under Additional Metrics.

How was this patch tested?

Ran manually with spark-sql:

./bin/spark-sql --conf spark.eventLog.enabled=true --conf spark.eventLog.logStageExecutorMetrics.enabled=true --conf spark.eventLog.dir=/Users/angerszhu/tmp/spark-events/ --conf spark.executor.metrics.pollingInterval=100

Then ran the SQL:
spark-sql> select count(1) from test_dy_data group by id;

@dongjoon-hyun (Member)

ok to test

@@ -572,7 +600,7 @@ $(document).ready(function () {
     var row1 = createRowMetadataForColumn(
       columnKey, taskMetricsResponse[columnKey], 4);
     var row2 = createRowMetadataForColumn(
-      "shuffleWriteTime", taskMetricsResponse[columnKey], 21);
+      "shuffleWriteTime", taskMetricsResponse[columnKey], 23);
Member:

Hi, @AngersZhuuuu. If possible, could you try to avoid this change?

Contributor Author:

If possible, could you try to avoid this change?

Since this PR adds new columns to Additional Metrics, the column indices change, so this update is needed.

To be honest, I spent a lot of time making sure the columns change correctly when a new column is added, since the Spark web UI code has changed a lot and no earlier PR to stagepage.js has added a column like this.

If it's OK, this PR can serve as a demo for future development of the Spark web UI.

Member:

That's the exact reason why I asked that. I've seen several UI patches like this. :)

Since this PR adds new columns to Additional Metrics, the column indices change, so this update is needed.

There already exist lots of similar commits for Spark web UI development. The commits that have already been reviewed and merged are always the best demo for developers.

If it's OK, this PR can serve as a demo for future development of the Spark web UI.

Contributor Author:

There already exist lots of similar commits for Spark web UI development. The commits that have already been reviewed and merged are always the best demo for developers.

Each page's JS script is different, and I found that for the stage page there is no merged PR that adds a new additional column.

Member:

@AngersZhuuuu I found the following issues.

  1. When a job performs a shuffle, the StagePage for the shuffle write stage has a problem.
    Even if you visit the StagePage where the corresponding stage performs shuffle write and check Shuffle Write Time,
    the metric is not shown.

[screenshot]

  2. The Web UI hangs.
    When you visit a StagePage where the corresponding stage performs shuffle read and reload the page, the UI hangs.

[screenshot]

Those issues are caused by the column indices not being changed properly.
To avoid such issues, please keep it simple. I believe the column indices do not need to be changed.

@dongjoon-hyun (Member) left a comment

Maybe,
peakJvmHeapMemory -> peakJvmOnHeapMemory?
Peak JVM Heap Memory -> Peak JVM On Heap Memory?

@dongjoon-hyun (Member)

cc @sarutak

@SparkQA

SparkQA commented Mar 29, 2020

Test build #120542 has finished for PR 28061 at commit 7807901.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu (Contributor Author)

Maybe,
peakJvmHeapMemory -> peakJvmOnHeapMemory?
Peak JVM Heap Memory -> Peak JVM On Heap Memory?

Sure, that makes it clearer.

@sarutak (Member)

sarutak commented Mar 29, 2020

I'll review within a week.

@SparkQA

SparkQA commented Mar 29, 2020

Test build #120555 has finished for PR 28061 at commit 48a355d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 29, 2020

Test build #120556 has finished for PR 28061 at commit bbea4f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

Thank you, @sarutak !

@sarutak (Member) left a comment

I'm inspecting the change now, but I have one fundamental question.
Peak JVM On/Off Heap Memory are executor metrics, and they don't indicate the peak values for each task.
So I wonder whether it's helpful to display those metrics for each task.

To explain concretely, I tested this change with the following code in spark-shell.

$ bin/spark-shell --master local[1] --conf spark.executor.heartbeatInterval=1s
sc.parallelize(List(1, 2, 3), 3).map { e =>
  val context = org.apache.spark.TaskContext.get
  if (context.partitionId == 0) {
    val dummyData = List.tabulate(Int.MaxValue / 250){i => "Index " + i}
  }
  Thread.sleep(3000)
  e
}.collect

With this code, all the tasks run on the same executor, and task0 is memory-consuming while task1 and task2 are not.
Although task1 and task2 should not consume as much memory as task0, their Peak JVM On/Off Heap Memory values are greater than those of task0.
This is because those metrics are accumulated within the same JVM.

[screenshot]
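To make the point above concrete, here is a minimal, self-contained Scala sketch (illustrative names and numbers only, not Spark's actual ExecutorMetrics code) of why a JVM-wide peak metric leaks across tasks that run on the same executor.

// Minimal sketch: one peak tracker per executor JVM, shared by all tasks it runs.
object PeakMetricsSketch {
  private var peakHeapBytes = 0L

  // Called on every metrics poll: keep the maximum value ever observed.
  def compareAndUpdatePeak(currentHeapBytes: Long): Unit = {
    if (currentHeapBytes > peakHeapBytes) peakHeapBytes = currentHeapBytes
  }

  def main(args: Array[String]): Unit = {
    compareAndUpdatePeak(800L * 1024 * 1024) // snapshot while task0 allocates a lot
    compareAndUpdatePeak(50L * 1024 * 1024)  // snapshot while task1 allocates little
    // A snapshot taken while task1 runs still reports task0's peak,
    // because the tracker belongs to the JVM, not to the task.
    println(s"peak reported while task1 runs = $peakHeapBytes bytes")
  }
}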

executorMetrics: Option[ExecutorMetrics] = None): v1.TaskMetrics = {
executorMetrics.foreach {
this.executorMetrics.compareAndUpdatePeakValues
}
Member:

How about executorMetrics.foreach(this.executorMetrics.compareAndUpdatePeakValues)?
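For readers unfamiliar with the shorthand, a small standalone sketch (a plain function stands in for compareAndUpdatePeakValues, whose exact signature is not shown here) of why the block form in the diff and the suggested point-free form behave identically: the method reference is simply eta-expanded to a function.

object ForeachFormsSketch {
  // Stand-in for this.executorMetrics.compareAndUpdatePeakValues.
  def updatePeak(metrics: Long): Unit = println(s"updating peak with $metrics")

  def main(args: Array[String]): Unit = {
    val executorMetrics: Option[Long] = Some(42L)
    executorMetrics.foreach { updatePeak }   // block form, as in the diff
    executorMetrics.foreach(updatePeak)      // point-free form, as suggested
  }
}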

@sarutak (Member)

sarutak commented Apr 5, 2020

Additional comments:

  • Could you explain more about the manual test in the description so that we can understand what and how you tested?
  • With this change, the appearance of the UI changes, so this change should be marked as user-facing.
  • Please update the screenshots when you update the code.

@AngersZhuuuu (Contributor Author)

So I wonder whether it's helpful to display those metrics for each task.

It's like the Stage Executor Metrics: they show each executor's peak memory during a stage, but there is still a problem there too, since tasks from other stages can run on the same executor during that stage when the executor has more than one core.

We just show each task's JVM status. As an SRE, we need to know as much about a task's running status as possible, and any one of these metrics can help the user find problems.

@sarutak (Member)

sarutak commented Apr 8, 2020

OK, I understand the motivation.
I confirmed that those peak metrics are calculated for each task, so the fundamental idea might be fine, even though overlapping or consecutive tasks affect each other, just like GC Time.
What do you think, @dongjoon-hyun, @gengliangwang?

@@ -327,6 +336,13 @@ $(document).ready(function () {
"should be approximately the sum of the peak sizes across all such data structures created " +
"in this task. For SQL jobs, this only tracks all unsafe operators, broadcast joins, and " +
"external sort.");
$('#peak_jvm_on_heap_memory').attr("data-toggle", "tooltip")
.attr("data-placement", "top")
.attr("title", "Peak Executor JVM on-heap memory usage during this task running");
Member:

"tasks" might be more appropriate than "this task".

.attr("title", "Peak Executor JVM on-heap memory usage during this task running");
$('#peak_jvm_off_heap_memory').attr("data-toggle", "tooltip")
.attr("data-placement", "top")
.attr("title", "Peak Executor JVM off-heap memory usage during this task running");
Member:

Same comment as here.

@gengliangwang (Member)

gengliangwang commented Apr 8, 2020

@AngersZhuuuu Sorry for the late reply.
Actually, I am slightly -1 on the changes since they are not straightforward enough. Even @sarutak had to think about the meaning of the metrics. The existing task metrics are the sum of values from executors, while the new one is the max of values from executors.

How about just displaying the max on/off heap memory in the executor page?
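A tiny standalone sketch (illustrative numbers only) of the distinction being drawn here: the existing task-table metrics aggregate by summing per-task values, while a per-executor peak aggregates by taking a maximum, so every task on that executor would show the same JVM-wide value.

object SumVsMaxSketch {
  def main(args: Array[String]): Unit = {
    // Illustrative per-task shuffle write bytes and per-poll executor heap snapshots.
    val shuffleWriteBytesPerTask = Seq(100L, 200L, 300L)
    val executorHeapSnapshots = Seq(800L, 50L, 60L)

    // Existing task metrics: summed across tasks.
    println(s"total shuffle write bytes = ${shuffleWriteBytesPerTask.sum}")
    // Peak executor metric: the maximum snapshot, shared by all tasks on that executor.
    println(s"peak JVM heap bytes = ${executorHeapSnapshots.max}")
  }
}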

@AngersZhuuuu (Contributor Author)

@AngersZhuuuu Sorry for the late reply.
Actually, I am slightly -1 on the changes since they are not straightforward enough. Even @sarutak had to think about the meaning of the metrics.

How about just displaying the max on/off heap memory in the executor page?

How about this? #28036
#28036 (comment)

@gengliangwang (Member)

How about this? #28036
#28036 (comment)

That's way better!

@AngersZhuuuu (Contributor Author)

That's way better!

OK, I will focus on that PR and follow your advice.

@gengliangwang (Member)

@sarutak I prefer the approach in #28036 .
What do you think? If so, shall we close this one?

@sarutak (Member)

sarutak commented Apr 8, 2020

I also prefer that.

@AngersZhuuuu (Contributor Author)

cc @sarutak @gengliangwang Closing it.
