
[SPARK-33033][WEBUI] display time series view for task metrics in history server #29908

Conversation

@zhli1142015 (Contributor) commented Sep 30, 2020

What changes were proposed in this pull request?

Why are the changes needed?

The event log contains all tasks' metrics data, which is useful for performance debugging. Currently the Spark UI only displays final aggregation results, so much information is hidden. If the Spark UI could provide a time-series view of this data, it would be more helpful for debugging performance problems. We would like to build an application statistics page in the history server, based on task metrics, to provide more straightforward insight into a Spark application.
Below are views in application statistics page:
Execution Throughput: sum of completed tasks, stages, and jobs per minute (associated stage IDs can be viewed in the tooltip message).
[screenshot: execution throughput]

IO (Shuffle and Read) Data: sum of total shuffle read bytes, shuffle write bytes, and read bytes per minute.
[screenshot: IO data]

Task Duration Time: application-level 50th and 90th percentiles of task duration per minute.
[screenshot: task duration]

Task Duration Time Component %: percentage of task duration spent in scheduler delay, computing time, shuffle read, task deserialization, result serialization, shuffle write, and getting the result, per minute.
[screenshot: task duration components]

Task JVM GC Time %: percentage of task duration spent in JVM GC per minute (a small calculation sketch for the percentile and percentage views follows this list).
[screenshot: task GC time percentage]
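As a rough illustration of how the percentile and percentage views above could be computed for one per-minute bucket, here is a minimal Scala sketch (hypothetical helper functions using a simple nearest-rank percentile, not the PR's actual implementation):

```scala
// Illustrative only: summarize the tasks whose finish time fell into one per-minute bucket.
// `durations` and `gcTimes` hold milliseconds for those tasks.
def nearestRankPercentile(sortedValues: IndexedSeq[Long], p: Double): Long = {
  require(sortedValues.nonEmpty, "bucket must contain at least one task")
  val idx = math.max(0, math.ceil(p * sortedValues.length).toInt - 1)
  sortedValues(math.min(idx, sortedValues.length - 1))
}

def bucketSummary(durations: Seq[Long], gcTimes: Seq[Long]): (Long, Long, Double) = {
  val sorted = durations.sorted.toIndexedSeq
  val p50 = nearestRankPercentile(sorted, 0.50)    // "Task Duration Time" view, 50th percentile
  val p90 = nearestRankPercentile(sorted, 0.90)    // and 90th percentile
  val gcPercent =                                   // "Task JVM GC Time %" view
    100.0 * gcTimes.sum.toDouble / math.max(1L, durations.sum)
  (p50, p90, gcPercent)
}
```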

The application statistics page is only available in the Spark history server. Aggregated data is generated while parsing the event log file and is stored in the KVStore. Metrics data is aggregated into one data instance per minute (based on task finish time). For example, if task A's finish time falls in (t1 - 1 minute, t1], A's data is added to data instance t1. This follows the same approach as the executor metrics.
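A minimal sketch of that per-minute bucketing, assuming a hypothetical aggregator and sample type (the PR's real classes and KVStore entities are not shown in this conversation):

```scala
import scala.collection.mutable

// Hypothetical task sample; the PR aggregates the real TaskMetrics fields.
case class TaskSample(finishTime: Long, durationMs: Long, jvmGcTimeMs: Long,
                      shuffleReadBytes: Long, shuffleWriteBytes: Long)

class MinuteAggregator(intervalMs: Long = 60 * 1000L) {
  // Upper bound of each interval -> samples whose finish time fell in (bound - interval, bound].
  private val buckets = mutable.Map.empty[Long, mutable.ArrayBuffer[TaskSample]]

  // Round the finish time up to the enclosing interval boundary.
  private def bucketFor(finishTime: Long): Long =
    ((finishTime + intervalMs - 1) / intervalMs) * intervalMs

  def onTaskEnd(sample: TaskSample): Unit =
    buckets.getOrElseUpdate(bucketFor(sample.finishTime), mutable.ArrayBuffer.empty) += sample

  // One aggregated data instance per interval, e.g. what would be persisted to the KVStore.
  def taskCountsPerInterval: Map[Long, Int] =
    buckets.map { case (bound, samples) => bound -> samples.size }.toMap
}
```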
From my test there is not much increase in KVStore size or replay time. Here is my local test result. The impact on replay time may differ a little between applications, but it should not be too big.
[screenshot: local test results]

Does this PR introduce any user-facing change?

User-facing change compared to master: adds an application statistics page under the Jobs tab and a link to the new page on the Jobs page.
Entry point:
[screenshot: entry point on the Jobs page]

Application statistics page:
[screenshot: full application statistics page]

How was this patch tested?

  1. Manual test
  2. Added new unit tests

@AmplabJenkins commented:

Can one of the admins verify this patch?

@tgravescs (Contributor) commented:

Thanks for the changes; more UI changes to help debugging are great. I've been wanting something with more of a timeline view, to be able to see things like active executors, as well as things like cached data throughout the application's lifetime, since that information goes away as well.

The proposal here is that the stats are still per application, correct? Not combining applications.
Assuming it's within an application..
A few high-level comments just from looking at the screenshots. "Cluster Throughput" might be confusing if it's inside an application; perhaps something more like execution activity or execution throughput. Your description says "sum of completed tasks, stages, and jobs per minute", so if tasks start and finish within a minute, are they still shown here? This view covers hours; I would be curious to see what it looks like for shorter applications. I would also rather see these behind a twisty like the event timeline on the page, not as the first thing the user sees.

@zhli1142015 (Contributor, Author) commented Oct 1, 2020

@tgravescs, thanks for your comments. The views here are all application level. You are right, "Cluster Throughput" is confusing; I used it because in our case we run one Spark application per cluster, but it's not suitable for other users. I think "Execution Throughput" is better here.

If tasks start and finish within a minute, they are still shown here. For example, if task A started and finished within (t1 - 1 min, t1], its value is recorded in data point t1.
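For instance, a tiny illustrative snippet of that rule (hypothetical helper, not the PR's real code):

```scala
// With a 1-minute interval, any finish time in (t1 - 60s, t1] maps to data point t1.
val intervalMs = 60 * 1000L
def dataPointFor(finishTimeMs: Long): Long =
  ((finishTimeMs + intervalMs - 1) / intervalMs) * intervalMs

// Task A starts at 00:00:10 and finishes at 00:00:45, so it lands on the 00:01:00 data point.
assert(dataPointFor(45 * 1000L) == 60 * 1000L)
```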

For a long application the default time range may be hours, but all the graphs support zooming in, so we can select a smaller time range for details (the minimum interval is not changed, though). For a shorter application we can use a smaller aggregation period (this is a config that applies to all applications rendered in the history server); I also added a config for this, for example "5 seconds" as below.
[screenshot: charts rendered with a 5-second aggregation period]
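For context, a history server option like this would typically be read from SparkConf; a hedged sketch follows (the config key name below is purely illustrative, not necessarily the one added by this PR):

```scala
import org.apache.spark.SparkConf

// Illustrative key name only; the real key and default are defined in the PR itself.
val conf = new SparkConf()
val aggregationIntervalMs =
  conf.getTimeAsMs("spark.history.taskMetrics.aggregationInterval", "60s") // e.g. "5s" for short apps
```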

The original data comes from task metrics in the event log. The views above are based on my experience using Spark, and any suggestions are welcome. If you have any other valuable view, aggregated data, or dimension in mind, we can discuss whether it's easy to implement.

@github-actions (bot) commented:

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jan 10, 2021
@github-actions github-actions bot closed this Jan 11, 2021