
[SPARK-33033][WEBUI] display time series view for task metrics in history server #29908

Conversation

@zhli1142015 (Contributor) commented Sep 30, 2020

What changes were proposed in this pull request?

Why are the changes needed?

The event log contains all tasks' metrics data, which is useful for performance debugging. Currently the Spark UI only displays final aggregation results, so much information is hidden. If the Spark UI could provide a time-series view of this data, it would be more helpful for debugging performance problems. We would like to build an application statistics page in the history server, based on task metrics, to provide more straightforward insight into a Spark application.
Below are views in application statistics page:
Execution Throughput: sum of completed tasks, stages, and jobs per minute (associated stage IDs can be viewed in the tooltip message).
[screenshot: execution throughput]

IO (Shuffle and Read) Data: sum of total shuffle read bytes, shuffle write bytes, and read bytes per minute.
[screenshot: IO data]

Task Duration Time: application-level 50th and 90th percentiles of task duration per minute.
[screenshot: task duration]

Task Duration Time Component %: percentage of task duration spent in scheduler delay, computing time, shuffle read, task deserialization, result serialization, shuffle write, and getting the result, per minute.
[screenshot: task duration components]

Task JVM GC Time %: percentage of task duration spent in JVM GC per minute (a small calculation sketch for the percentile and percentage views follows this list).
[screenshot: task GC time percentage]
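As a rough illustration of how the percentile and percentage views above could be computed for one per-minute bucket, here is a minimal Scala sketch (hypothetical helper functions using a simple nearest-rank percentile, not the PR's actual implementation):

```scala
// Illustrative only: summarize the tasks whose finish time fell into one per-minute bucket.
// `durations` and `gcTimes` hold milliseconds for those tasks.
def nearestRankPercentile(sortedValues: IndexedSeq[Long], p: Double): Long = {
  require(sortedValues.nonEmpty, "bucket must contain at least one task")
  val idx = math.max(0, math.ceil(p * sortedValues.length).toInt - 1)
  sortedValues(math.min(idx, sortedValues.length - 1))
}

def bucketSummary(durations: Seq[Long], gcTimes: Seq[Long]): (Long, Long, Double) = {
  val sorted = durations.sorted.toIndexedSeq
  val p50 = nearestRankPercentile(sorted, 0.50)    // "Task Duration Time" view, 50th percentile
  val p90 = nearestRankPercentile(sorted, 0.90)    // and 90th percentile
  val gcPercent =                                   // "Task JVM GC Time %" view
    100.0 * gcTimes.sum.toDouble / math.max(1L, durations.sum)
  (p50, p90, gcPercent)
}
```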

The application statistics page is only available in the Spark history server. Aggregated data is generated while parsing the event log file and is stored in the KVStore. Metrics data is aggregated into one data instance per minute (based on task finish time). For example, if task A's finish time falls in (t1 - 1 minute, t1], A's data is added to data instance t1. This follows the same approach as the executor metrics.
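A minimal sketch of that per-minute bucketing, assuming a hypothetical aggregator and sample type (the PR's real classes and KVStore entities are not shown in this conversation):

```scala
import scala.collection.mutable

// Hypothetical task sample; the PR aggregates the real TaskMetrics fields.
case class TaskSample(finishTime: Long, durationMs: Long, jvmGcTimeMs: Long,
                      shuffleReadBytes: Long, shuffleWriteBytes: Long)

class MinuteAggregator(intervalMs: Long = 60 * 1000L) {
  // Upper bound of each interval -> samples whose finish time fell in (bound - interval, bound].
  private val buckets = mutable.Map.empty[Long, mutable.ArrayBuffer[TaskSample]]

  // Round the finish time up to the enclosing interval boundary.
  private def bucketFor(finishTime: Long): Long =
    ((finishTime + intervalMs - 1) / intervalMs) * intervalMs

  def onTaskEnd(sample: TaskSample): Unit =
    buckets.getOrElseUpdate(bucketFor(sample.finishTime), mutable.ArrayBuffer.empty) += sample

  // One aggregated data instance per interval, e.g. what would be persisted to the KVStore.
  def taskCountsPerInterval: Map[Long, Int] =
    buckets.map { case (bound, samples) => bound -> samples.size }.toMap
}
```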
From my test there is not much increase in KVStore size or replay time. Here is my local test result. The impact on replay time may differ a little between applications, but it should not be too big.
[screenshot: local test results]

Does this PR introduce any user-facing change?

User-facing change compared to master: adds an application statistics page under the Jobs tab and a link to the new page on the Jobs page.
Entry point:
[screenshot: entry point on the Jobs page]

Application statistics page:
[screenshot: full application statistics page]

How was this patch tested?

  1. Manual test
  2. Added new unit tests

@AmplabJenkins commented:

Can one of the admins verify this patch?

@tgravescs (Contributor) commented:

Thanks for the changes; more UI changes to help debugging are great. I've been wanting something with more of a timeline view, to be able to see things like active executors, as well as things like cached data throughout the application's lifetime, since that information goes away as well.

The proposal here is that the stats are still per application, correct? Not combining applications.
Assuming it's within an application..
A few high-level comments just from looking at the screenshots. "Cluster Throughput" might be confusing if it's inside an application; perhaps something more like execution activity or execution throughput. Your description says "sum of completed tasks, stages, and jobs per minute", so if tasks start and finish within a minute, are they still shown here? This view covers hours; I would be curious to see what it looks like for shorter applications. I would also rather see these behind a twisty like the event timeline on the page, not as the first thing the user sees.

@zhli1142015 (Contributor, Author) commented Oct 1, 2020

@tgravescs, thanks for your comments. The views here are all application level. You are right, "Cluster Throughput" is confusing; I used it because in our case we run one Spark application per cluster, but it's not suitable for other users. I think "Execution Throughput" is better here.

If tasks start and finish within a minute, they are still shown here. For example, if task A started and finished within (t1 - 1 min, t1], its value is recorded in data point t1.
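For instance, a tiny illustrative snippet of that rule (hypothetical helper, not the PR's real code):

```scala
// With a 1-minute interval, any finish time in (t1 - 60s, t1] maps to data point t1.
val intervalMs = 60 * 1000L
def dataPointFor(finishTimeMs: Long): Long =
  ((finishTimeMs + intervalMs - 1) / intervalMs) * intervalMs

// Task A starts at 00:00:10 and finishes at 00:00:45, so it lands on the 00:01:00 data point.
assert(dataPointFor(45 * 1000L) == 60 * 1000L)
```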

For a long application the default time range may be hours, but all the graphs support zooming in, so we can select a smaller time range for details (the minimum interval is not changed, though). For a shorter application we can use a smaller aggregation period (this is a config that applies to all applications rendered in the history server); I also added a config for this, for example "5 seconds" as below.
[screenshot: charts rendered with a 5-second aggregation period]
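For context, a history server option like this would typically be read from SparkConf; a hedged sketch follows (the config key name below is purely illustrative, not necessarily the one added by this PR):

```scala
import org.apache.spark.SparkConf

// Illustrative key name only; the real key and default are defined in the PR itself.
val conf = new SparkConf()
val aggregationIntervalMs =
  conf.getTimeAsMs("spark.history.taskMetrics.aggregationInterval", "60s") // e.g. "5s" for short apps
```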

The original data comes from task metrics in the event log. The views above are based on my experience using Spark, and any suggestions are welcome. If you have any other valuable view, aggregated data, or dimension in mind, we can discuss whether it's easy to implement.

@github-actions (bot) commented:

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jan 10, 2021
@github-actions github-actions bot closed this Jan 11, 2021