Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SPARK-5217 Spark UI should report pending stages during job execution on AllStagesPage. #4043

Closed

Conversation

ScrapCodes
Copy link
Member

screenshot from 2015-01-16 13 43 25

This is a first step towards having time remaining estimates for queued and running jobs. See SPARK-5216

@SparkQA
Copy link

SparkQA commented Jan 14, 2015

Test build #25544 has started for PR 4043 at commit d335f8f.

  • This patch merges cleanly.

@ScrapCodes ScrapCodes force-pushed the SPARK-5216/5217-show-waiting-stages branch from d335f8f to f6dc584 Compare January 14, 2015 11:34
@SparkQA
Copy link

SparkQA commented Jan 14, 2015

Test build #25545 has started for PR 4043 at commit f6dc584.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 14, 2015

Test build #25544 has finished for PR 4043 at commit d335f8f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25544/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Jan 14, 2015

Test build #25545 has finished for PR 4043 at commit f6dc584.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25545/
Test PASSed.

@squito
Copy link
Contributor

squito commented Jan 14, 2015

lgtm.

I was going to suggest that pending stages should be sorted with oldest submission time first, not reversed ... but I guess we want the completed stages sorted with oldest last, and probably makes sense to keep those tables consistent with each other.

@ScrapCodes
Copy link
Member Author

Hi Imran, Thanks for taking a look. @pwendell please take a look ?

@pwendell
Copy link
Contributor

I'm not sure this can be merged as-is. The state clean-up here is based on the assumption that every stage that is pending will at some later time be submitted. Is that definitely true? What happens if a stage is aborted or failed, won't its dependent stages remain pending indefinitely for the entire history of the application? I think at a minimum you need to make sure you remove all associated stages with a given job if the job ends. There may also be other corner cases I'm not thinking of.

A second assumption this makes is that the job start event will always occur before the stage is submitted. Is that definitely true? It would be good to dig through the reporting API and make sure that is a safe assumption. I think the guarantees of the listener around event ordering are pretty minimal.

Finally, what about putting pending stages after active stages in the display page? My concern is you may have dozens or more pending stages in production jobs, I think people will be frustrated if they open the UI and they have to scroll way down every time they need to e.g. refresh the page. It's a big odd to go (active -> pending -> completed), but I think better for usability.

@ScrapCodes
Copy link
Member Author

Thanks Patrick !
Have two questions inline.

I'm not sure this can be merged as-is. The state clean-up here is based on the assumption that every stage that is pending will at some later time be submitted. Is that definitely true? What happens if a stage is aborted or failed, won't its dependent stages remain pending indefinitely for the entire history of the application? I think at a minimum you need to make sure you remove all associated stages with a given job if the job ends. There may also be other corner cases I'm not thinking of.

If I am understanding this correctly, "All stages for a particular job Id on jobEnd event - should be cleaned up". I am certainly missing something, is this not already achieved by this ?

A second assumption this makes is that the job start event will always occur before the stage is submitted. Is that definitely true? It would be good to dig through the reporting API and make sure that is a safe assumption. I think the guarantees of the listener around event ordering are pretty minimal.

This can lead to an anomaly that a stage will appear in progress in both active stages section and pending section. But it will still be obvious that - that stage is in progress. Is this safe to ignore, or is it a concern that should be addressed at the cost of minute complexity ?

Finally, what about putting pending stages after active stages in the display page? My concern is you may have dozens or more pending stages in production jobs, I think people will be frustrated if they open the UI and they have to scroll way down every time they need to e.g. refresh the page. It's a big odd to go (active -> pending -> completed), but I think better for usability.

This is thoughtful, I am going to incorporate this asap.

[EDIT]
One more thing to look at is ordering of stages since they are stored in a HashMap. Let me see if ordering can make sense somehow.

@pwendell
Copy link
Contributor

Ah I see - so on the first point, the issue may be covered by the code you referenced. For some reason the diff originally rendered in a way where I didn't notice that.

For the second issue. My concern is that this line could throw an exception:
https://github.com/apache/spark/pull/4043/files#diff-1f32bcb61f51133bd0959a4177a066a5R263

I.e. if you had a stage submitted, but you never received the job started event. It could be that you can assume it, but it would be good to just note the ordering assumptions around the listener events on that line, so that if it gets changed later and someone gets a no-such-elemnt-exception. It's possible to debug it.

@ScrapCodes
Copy link
Member Author

Good point, in map in scala remove returns an Option and does not throw exception. However in lists what you said holds. See https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/mutable/LinkedHashMap.scala#L76

@SparkQA
Copy link

SparkQA commented Jan 16, 2015

Test build #25644 has started for PR 4043 at commit c7b3332.

  • This patch merges cleanly.

@pwendell
Copy link
Contributor

@ScrapCodes Rather than change this to +LinkedHashMap+ can you just check if it contains it before removing it? It might not be obvious to developers that +remove+ has that specific behavior. I think it's better to just be explicit.

@ScrapCodes
Copy link
Member Author

LinkedHashMap was introduced to maintain the insertion order of stagesIds(Hasmap will lead to arbitrary order), all map(s) in scala has return an Option on remove behavior. Even in java calling remove on inexistent key returns null. Do you still want me to change it? Also may be changing the name pendingStages to pendingStagesMap will avoid that confusion ? and should the same apply to - already existing field activeStages ?

@pwendell
Copy link
Contributor

ah I see - if the existing remove call is safe, then I think it's fine.

val now = System.currentTimeMillis

val activeStagesTable =
new StageTableBase(activeStages.sortBy(_.submissionTime).reverse,
parent.basePath, parent.listener, isFairScheduler = parent.isFairScheduler,
killEnabled = parent.killEnabled)
val pendingStagesTable =
new StageTableBase(pendingStages.sortBy(_.submissionTime),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pwendell Can I use Sorting.stableSort here ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just keep the sorting the same as it was? I think the submission time is unlikely to be tied in most cases. It would be good to just make it consistent with the existing ones.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about usability, like the most interesting stage which is next to be executed will appear at last. Anyway will change it.

@SparkQA
Copy link

SparkQA commented Jan 16, 2015

Test build #25644 has finished for PR 4043 at commit c7b3332.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25644/
Test PASSed.

@@ -37,12 +37,18 @@ private[ui] class AllStagesPage(parent: StagesTab) extends WebUIPage("") {
val numCompletedStages = listener.numCompletedStages
val failedStages = listener.failedStages.reverse.toSeq
val numFailedStages = listener.numFailedStages
val pendingStages = listener.pendingStages.values.toSeq
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move this up with activeStages to make the declarations grouped properly?

@SparkQA
Copy link

SparkQA commented Jan 16, 2015

Test build #25656 has started for PR 4043 at commit 15cdda4.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 16, 2015

Test build #25656 has finished for PR 4043 at commit 15cdda4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25656/
Test PASSed.

@pwendell
Copy link
Contributor

@ScrapCodes mind bringing up to date? The current form LGTM

@ScrapCodes ScrapCodes force-pushed the SPARK-5216/5217-show-waiting-stages branch from 15cdda4 to 3b11803 Compare January 19, 2015 07:06
@SparkQA
Copy link

SparkQA commented Jan 19, 2015

Test build #25748 has started for PR 4043 at commit 3b11803.

  • This patch merges cleanly.

@ScrapCodes
Copy link
Member Author

@pwendell - patch updated to latest master.

@SparkQA
Copy link

SparkQA commented Jan 19, 2015

Test build #25748 has finished for PR 4043 at commit 3b11803.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25748/
Test PASSed.

@asfgit asfgit closed this in 851b6a9 Jan 19, 2015
@ScrapCodes ScrapCodes deleted the SPARK-5216/5217-show-waiting-stages branch June 3, 2015 06:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants