[SPARK-8100][UI]Make able to refer lost executor info in Spark UI #6644
Conversation
Test build #34184 has finished for PR 6644 at commit
@suyanNone for changes like these could you attach a screenshot of what it looks like before and after? Also a concern I have is that for long-running applications, if we kill executors and bring them back we'll end up having many removed executors, in which case there will be a lot of state piling up on the driver.
@@ -85,7 +88,7 @@ private[ui] class ExecutorsPage(
          </span>
        </th>
        {if (logsExist) <th class="sorttable_nosort">Logs</th> else Seq.empty}
        {if (threadDumpEnabled) <th class="sorttable_nosort">Thread Dump</th> else Seq.empty}
why remove this?
@andrewor14 I didn't actually remove it. A lost executor can't produce a thread dump, so I changed that column header to "Status": a live executor shows the thread dump link, and a lost executor shows a "Removed" status.
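The described "Status" column behavior can be sketched roughly like this (a hypothetical helper, not the actual patch; the real ExecutorsPage renders Scala XML, simplified here to strings, and the `ExecutorInfo` case class is an assumption):

```scala
// Hypothetical sketch of the "Status" column logic described above.
// A live executor's cell links to the thread dump page; a removed
// executor's cell just shows the text "Removed".
case class ExecutorInfo(id: String, isRemoved: Boolean)

def statusCell(e: ExecutorInfo, threadDumpEnabled: Boolean): String = {
  if (e.isRemoved) "Removed"
  else if (threadDumpEnabled) s"""<a href="threadDump/?executorId=${e.id}">Thread Dump</a>"""
  else ""
}
```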
Test build #34405 has finished for PR 6644 at commit
retest this please
Test build #35381 has finished for PR 6644 at commit
@@ -60,7 +60,8 @@ class ExecutorSummary private[spark](
     val totalShuffleRead: Long,
     val totalShuffleWrite: Long,
     val maxMemory: Long,
-    val executorLogs: Map[String, String])
+    val executorLogs: Map[String, String],
+    val isRemoved: Boolean)
the tests are failing b/c this is not binary compatible, but I think this is OK. In particular, from http://spark.apache.org/docs/latest/monitoring.html
we say:
New fields may be added to existing endpoints
So we should add a mima exclusion here. The logs of the PR builder tell you exactly what to add:
ProblemFilters.exclude[MissingMethodProblem]("org.apache.spark.status.api.v1.ExecutorSummary.this")
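In Spark's build, such exclusions live in project/MimaExcludes.scala; the entry would look roughly like this (a sketch only, since the exact surrounding structure varies by branch, but the filter itself is the one quoted from the PR builder log above):

```scala
// Sketch of a Mima exclusion entry as it would appear in project/MimaExcludes.scala.
// Excludes the ExecutorSummary constructor change from binary-compatibility checks,
// since the monitoring docs allow adding new fields to existing endpoints.
ProblemFilters.exclude[MissingMethodProblem](
  "org.apache.spark.status.api.v1.ExecutorSummary.this")
```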
Mind adding a // comment next to isRemoved to say something like "Added in 1.5.0" or "since 1.5.0"?
OK, I will refine that. Thanks for telling me about MimaExcludes.
I was trying to think of a good way to test this, but it doesn't seem like we have any good way to kill executors with the local-cluster tests. I think it probably makes the most sense to instead take an eventLog file with a removed executor, and make sure it shows up correctly in the HistoryServer (both in the UI and also via the api). I can try to put one of those files together if it would be helpful.
retest this please
Test build #132 has finished for PR 6644 at commit
Jenkins, retest this please |
  private var _blockManagerRemoved = false

  def isRemoved: Boolean = _blockManagerRemoved
  def markAsRemoved(): Unit = _blockManagerRemoved = true
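The reviewer's suggestion of also tracking the removal time could look roughly like this (a sketch, not the actual patch; the class name, the `_removeTime` field, and the default clock source are all assumptions):

```scala
// Sketch: record when the executor was removed, alongside the removed flag.
// `ExecutorTaskSummary`, `_removeTime`, and System.currentTimeMillis() as the
// clock are assumptions for illustration, not from the PR itself.
class ExecutorTaskSummary {
  private var _blockManagerRemoved = false
  private var _removeTime: Option[Long] = None

  def isRemoved: Boolean = _blockManagerRemoved
  def removeTime: Option[Long] = _removeTime

  // Allow injecting the timestamp, which also makes this easy to test.
  def markAsRemoved(now: Long = System.currentTimeMillis()): Unit = {
    _blockManagerRemoved = true
    _removeTime = Some(now)
  }
}
```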
what do you think about tracking the time it was removed as well?
Hi @suyanNone any updates here? This is a really nice addition to help users understand when things go wrong, but needs a little followup. As Andrew requested, we'd like to see a screenshot of the modified UI. Also we should limit how many removed executors we store for really long apps -- I think you could add that in
Test build #213 has finished for PR 6644 at commit
I think this makes sense. As @andrewor14 mentioned above, for long-running apps where many executors have been added and removed, this could cause a lot of state to accumulate on the driver. We should perhaps restrict the time limit for which old executor state is maintained and remove it when the time limit expires.
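One way to cap that state, in the spirit of the suggestion above, is a bounded store of removed executors that evicts the oldest entries first (a sketch only; the class, the size limit, and the eviction policy are assumptions, not what the PR implements):

```scala
import scala.collection.mutable

// Sketch: keep at most `maxRemoved` removed-executor records on the driver,
// evicting the oldest entry first. LinkedHashMap preserves insertion order,
// so its head is always the oldest record.
class RemovedExecutorStore(maxRemoved: Int) {
  private val removed = mutable.LinkedHashMap[String, Long]() // executorId -> removeTime

  def add(executorId: String, removeTime: Long): Unit = {
    removed(executorId) = removeTime
    while (removed.size > maxRemoved) {
      removed.remove(removed.head._1) // drop the oldest entry
    }
  }

  def size: Int = removed.size
  def contains(executorId: String): Boolean = removed.contains(executorId)
}
```

A time-based variant, as @harishreedharan suggests, would instead drop entries whose removeTime is older than a configured retention window.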
Test build #39743 has finished for PR 6644 at commit
@andrewor14 @harishreedharan @squito I missed @andrewor14's comment about the "long running app" concern again... As @harishreedharan says, expiring state after a time limit is worth considering.
It looks like this PR and #6263 duplicate / overlap with each other.
@suyanNone do you mind closing this PR and collaborating on #6263? Or the other way around; it's up to you two which one you want to move forward with.
While a Spark application is still running, lost executor info disappears from the Spark UI, so it is not easy to see why that executor was lost or to look up other related information.