
[SPARK-8100][UI]Make able to refer lost executor info in Spark UI #6644

Closed
wants to merge 3 commits into from

Conversation

suyanNone
Contributor

While a Spark application is still running, info about lost executors disappears from the Spark UI, so it is not easy to see why an executor was lost or to look up its other details.

@suyanNone suyanNone changed the title [SPARK-8100][UI]Make able [SPARK-8100][UI]Make able to refer lost executor info in Spark UI Jun 4, 2015
@SparkQA

SparkQA commented Jun 4, 2015

Test build #34184 has finished for PR 6644 at commit c0d30a2.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

@suyanNone for changes like these could you attach a screenshot of what it looks like before and after? Also a concern I have is that for long-running applications, if we kill executors and bring them back we'll end up having many removed executors, in which case there will be a lot of state piling up on the driver.

@@ -85,7 +88,7 @@ private[ui] class ExecutorsPage(
</span>
</th>
{if (logsExist) <th class="sorttable_nosort">Logs</th> else Seq.empty}
{if (threadDumpEnabled) <th class="sorttable_nosort">Thread Dump</th> else Seq.empty}
Contributor

why remove this?

Contributor Author

@andrewor14 I didn't actually remove it. A lost executor can't produce a thread dump, so I changed that column header to "Status": live executors show the thread dump link, and lost executors show a "Removed" status.

@SparkQA

SparkQA commented Jun 8, 2015

Test build #34405 has finished for PR 6644 at commit 2ed0614.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@suyanNone
Contributor Author

retest this please

@SparkQA

SparkQA commented Jun 21, 2015

Test build #35381 has finished for PR 6644 at commit 2ed0614.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -60,7 +60,8 @@ class ExecutorSummary private[spark](
    val totalShuffleRead: Long,
    val totalShuffleWrite: Long,
    val maxMemory: Long,
-   val executorLogs: Map[String, String])
+   val executorLogs: Map[String, String],
+   val isRemoved: Boolean)
Contributor

the tests are failing b/c this is not binary compatible, but I think this is OK. In particular, from http://spark.apache.org/docs/latest/monitoring.html
we say:

New fields may be added to existing endpoints

So we should add a mima exclusion here. The logs of the PR builder tell you exactly what to add:
ProblemFilters.exclude[MissingMethodProblem]("org.apache.spark.status.api.v1.ExecutorSummary.this")
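For context, such exclusions live in `project/MimaExcludes.scala`, keyed by release version. A sketch of where the entry might go (the surrounding version block and neighboring entries are assumptions, not the actual patch):

```scala
// In project/MimaExcludes.scala -- hypothetical placement; the exact
// version block depends on the release being targeted (assumed 1.5 here).
case v if v.startsWith("1.5") =>
  Seq(
    // SPARK-8100: the new isRemoved field changes the
    // ExecutorSummary constructor signature, which is allowed
    // per the monitoring docs ("New fields may be added").
    ProblemFilters.exclude[MissingMethodProblem](
      "org.apache.spark.status.api.v1.ExecutorSummary.this")
  )
```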

Contributor

Mind adding a // comment next to isRemoved to say something like "Added in 1.5.0" or "since 1.5.0"?

Contributor Author

OK, I will refine that. Thanks for telling me about MimaExcludes.

@squito
Contributor

squito commented Jul 24, 2015

I was trying to think of a good way to test this, but it doesn't seem like we have any good way to kill executors with the local-cluster tests. I think it probably makes the most sense to instead take an eventLog file with a removed executor and make sure it shows up correctly in the HistoryServer (both in the UI and via the API). I can try to put one of those files together if it would be helpful.

@suyanNone
Contributor Author

retest this please.

@SparkQA

SparkQA commented Jul 28, 2015

Test build #132 has finished for PR 6644 at commit d94a716.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito
Contributor

squito commented Aug 4, 2015

Jenkins, retest this please

private var _blockManagerRemoved = false

def isRemoved: Boolean = _blockManagerRemoved
def markAsRemoved(): Unit = _blockManagerRemoved = true
Contributor

what do you think about tracking the time it was removed as well?
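One way the snippet above could be extended to record the removal time as well as the flag — a minimal sketch, not the actual patch; the class and method names here are illustrative:

```scala
// Sketch of the reviewer's suggestion: store *when* the block manager
// was removed instead of just a boolean. Names are hypothetical.
class ExecutorRemovalState {
  private var _removedTime: Option[Long] = None

  def isRemoved: Boolean = _removedTime.isDefined
  def removedTime: Option[Long] = _removedTime

  // The caller would pass the timestamp carried by the
  // executor-removed event; only the first removal is recorded.
  def markAsRemoved(time: Long): Unit =
    if (_removedTime.isEmpty) _removedTime = Some(time)
}
```
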

@squito
Contributor

squito commented Aug 4, 2015

Hi @suyanNone any updates here? This is a really nice addition to help users understand when things go wrong, but needs a little followup. As Andrew requested, we'd like to see a screenshot of the modified UI. Also we should limit how many removed executors we store for really long apps -- I think you could add that in StorageStatusListener. Finally we need some minimal test of the behavior, but I think that will be easy to add to HistoryServerSuite and / or UISeleniumSuite. I can help out with parts of this as well.
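The count limit suggested above could be sketched roughly like this, mirroring how Spark bounds retained stages and jobs with its `spark.ui.retained*` settings — a sketch under assumed names, not Spark's actual API:

```scala
import scala.collection.mutable

// Sketch: cap how many removed executors the listener retains,
// evicting the oldest first. The class name and the limit's source
// (e.g. a hypothetical "spark.ui.retainedDeadExecutors" setting)
// are assumptions.
class RemovedExecutorTracker(retainedDeadExecutors: Int) {
  // Insertion-ordered map of executor id -> removal time.
  private val removed = mutable.LinkedHashMap[String, Long]()

  def executorRemoved(execId: String, time: Long): Unit = {
    removed(execId) = time
    if (removed.size > retainedDeadExecutors) {
      removed.remove(removed.head._1) // evict the oldest entry
    }
  }

  def removedExecutors: Seq[String] = removed.keys.toSeq
}
```
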

@SparkQA

SparkQA commented Aug 4, 2015

Test build #213 has finished for PR 6644 at commit d94a716.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@harishreedharan
Contributor

I think this makes sense. As @andrewor14 mentioned above, for long-running apps where many executors have been added and removed, this could cause a lot of state to accumulate on the driver. We should perhaps restrict the time for which old executor state is maintained and remove it when that limit expires.
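The time-limit expiry described here could be sketched as follows — all names are hypothetical, and the TTL would presumably come from a configuration setting:

```scala
import scala.collection.mutable

// Sketch: forget removed-executor entries once they are older than a
// TTL, so driver state stays bounded for long-running applications.
class ExpiringRemovedExecutors(ttlMs: Long) {
  private val removed = mutable.LinkedHashMap[String, Long]() // execId -> removal time

  def executorRemoved(execId: String, now: Long): Unit = {
    removed(execId) = now
    prune(now)
  }

  // Drop every entry whose removal time is older than the TTL.
  def prune(now: Long): Unit = {
    val expired = removed.collect { case (id, t) if now - t > ttlMs => id }
    expired.foreach(removed.remove)
  }

  def removedExecutors: Seq[String] = removed.keys.toSeq
}
```
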

@SparkQA

SparkQA commented Aug 4, 2015

Test build #39743 has finished for PR 6644 at commit d94a716.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@suyanNone
Contributor Author

@andrewor14 @harishreedharan @squito I missed @andrewor14's comment about long-running apps again... As @harishreedharan says, expiring state after a time limit is worth considering.

@JoshRosen
Contributor

It looks like this PR and #6263 duplicate / overlap with each other.

@srowen
Member

srowen commented Nov 4, 2015

@suyanNone do you mind closing this PR and collaborating on #6263? Or the other way around; it's up to you two which one to move forward with.

@suyanNone suyanNone closed this Nov 5, 2015
7 participants