SPARK-2035: Store call stack for stages, display it on the UI.#981
darabos wants to merge 14 commits into apache:master from darabos:darabos-call-stack
|
Can one of the admins verify this patch? |
|
This is fairly cool. However, instead of "show details", can we rename that to "call stack"? Because it is not always the code line there (it is changeable by other things). |
|
Cool, +1! I wrote essentially the same thing today before noticing this PR, except that I wanted to show the RDD creation site for cached RDDs instead of stages. |
|
What is the impact of this change on the memory footprint? And if the impact is non-trivial (the master is already at the limits of its memory), how can the user go about reducing the memory footprint of this change? Possibly disabling it? Reducing the amount of info stored? Something better/different? Since we have support for jobgroup, most of the time the actual stack trace is fairly easy to infer: hence the query. |
|
Wow, 40k stages? 😮 To estimate the memory use, I guess a large stack trace could be ~10 kB, so it would be 400 MB total. Would that be noticeable compared to the base memory use from the 40k stages? How about I limit the stack trace to 1 kB? That would still be 20 lines, if there is an average of 50 characters per line. I expect it would be enough in most cases, and it would limit the memory use to 40 MB in your giant job. (I could make the cutoff a system property. Should I?) |
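The back-of-the-envelope estimate above can be checked with a few lines of Python. This is only a sketch of the arithmetic in the comment: the figures (40k stages, ~10 kB for a large trace, a 1 kB cutoff, ~50 characters per line) are the rough guesses from the discussion, not measurements. It also assumes decimal units (1 kB = 1000 bytes) and one byte per character, which the comment does implicitly.

```python
# Rough memory estimate for storing one stack trace per stage.
# All figures are the guesses from the comment above, not measurements.
# Assumes decimal units (1 kB = 1000 bytes) and one byte per character.

NUM_STAGES = 40_000
LARGE_TRACE_BYTES = 10_000   # ~10 kB for a large stack trace
CAPPED_TRACE_BYTES = 1_000   # proposed 1 kB cutoff
AVG_LINE_CHARS = 50          # assumed average characters per stack line


def total_megabytes(num_stages: int, bytes_per_trace: int) -> float:
    """Total bytes across all stages, expressed in MB."""
    return num_stages * bytes_per_trace / 1_000_000


uncapped_mb = total_megabytes(NUM_STAGES, LARGE_TRACE_BYTES)
capped_mb = total_megabytes(NUM_STAGES, CAPPED_TRACE_BYTES)
lines_within_cap = CAPPED_TRACE_BYTES // AVG_LINE_CHARS

print(f"uncapped: {uncapped_mb:.0f} MB")
print(f"capped: {capped_mb:.0f} MB")
print(f"lines within cap: {lines_within_cap}")
```

With these inputs the totals match the numbers quoted in the comment: 400 MB uncapped, 40 MB with the 1 kB cutoff, and 20 lines of trace within the cutoff.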
|
I don't have solutions, particularly since I have not looked much into the UI code in Spark. |
…l in memory-constrained situations with large numbers of stages.
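A minimal sketch of what a depth limit on a stored call stack could look like, written in Python for illustration rather than as the patch's Scala code. The names `truncate_call_stack`, `current_call_stack`, and `max_depth` are hypothetical; the actual setting name and default used by the patch are not shown in this thread.

```python
import traceback
from typing import List


def truncate_call_stack(frames: List[str], max_depth: int) -> str:
    """Keep at most `max_depth` frames and join them with newlines.

    `frames` is expected innermost caller first. Both parameter names are
    illustrative and are not part of Spark's API.
    """
    if max_depth < 0:
        raise ValueError("max_depth must be non-negative")
    return "\n".join(frames[:max_depth])


def current_call_stack() -> List[str]:
    """Return the current Python call stack as 'file:line in function'
    strings, innermost caller first, excluding this helper itself."""
    stack = traceback.extract_stack()[:-1]  # drop the frame for this function
    return [f"{f.filename}:{f.lineno} in {f.name}" for f in reversed(stack)]


# Store at most 20 frames per stage (20 is an arbitrary example value).
details = truncate_call_stack(current_call_stack(), max_depth=20)
```

Capping by frame count rather than by bytes keeps every stored line intact, at the cost of a less precise bound on memory per stage.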
|
@ankurdave: Cool! For some reason I didn't wire up RDDs, only stages. Your change should complement this nicely. @rxin: I went with "details" instead of "call stack" exactly because I can imagine situations where it is something other than a call stack. Probably that's the kind of situation where the stage name is not a code location. But I'm happy to change it if you prefer. |
|
ok to test |
|
Build triggered. |
|
Build started. |
|
Build finished. |
|
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15514/ |
|
I forgot |
|
Build triggered. |
|
Build started. |
|
Build finished. |
|
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15517/ |
|
I have some fixing to do. |
…esting, this parameter always originates in SparkContext.scala, and will never be null.
|
Build triggered. |
|
Build started. |
|
Build finished. |
|
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15519/ |
|
Okay, now it is a binary incompatibility: These are all private methods/classes, but they can still affect binary compatibility. What is the recommended way for solving this? Thanks! |
|
@darabos the compatibility issues are false positives. You can add excludes for them in |
…re private methods/classes, so we ought to be safe.
|
Build triggered. |
|
Build started. |
|
Build finished. |
|
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15761/ |
|
Build triggered. |
|
Build started. |
|
Thanks for the feedback! I've added JSON (de)serialization code for the new field. Patched in your change (thanks!). And added one more line to the top of the stack trace. |
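One common way to keep older serialized data readable after adding a field is to treat the field as optional when parsing and fall back to a default. The sketch below shows that idea in Python; the key names ("Stage ID", "Stage Name", "Details"), the `StageInfo` shape, and the helper names are assumptions made for illustration and are not taken from the patch.

```python
import json
from dataclasses import dataclass


@dataclass
class StageInfo:
    stage_id: int
    name: str
    details: str  # the new field; holds the call stack text


def stage_info_to_json(info: StageInfo) -> str:
    """Serialize a StageInfo, always writing the new field."""
    return json.dumps(
        {"Stage ID": info.stage_id, "Stage Name": info.name, "Details": info.details}
    )


def stage_info_from_json(text: str) -> StageInfo:
    """Deserialize a StageInfo; JSON written before the field existed
    has no "Details" key, so fall back to an empty string."""
    obj = json.loads(text)
    return StageInfo(
        stage_id=obj["Stage ID"],
        name=obj["Stage Name"],
        details=obj.get("Details", ""),
    )


# JSON in the old shape, without the new field, still parses.
old_json = '{"Stage ID": 3, "Stage Name": "count at Example.scala:12"}'
restored = stage_info_from_json(old_json)
```

Defaulting to an empty string keeps the field non-optional in memory while still accepting data that predates it.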
|
Build finished. |
|
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15762/ |
|
It's a failure in |
|
@darabos hey that's a flaky unit test we fixed a while back. The issue is that your patch doesn't merge cleanly with master, so it's not getting the "fix". Do you mind updating your patch so it merges cleanly? BTW - I'm working on something that better explains this situation in our QA harness (where a patch isn't merging), but it's not ready yet! |
Conflicts: core/src/main/scala/org/apache/spark/rdd/RDD.scala
|
Merged build triggered. |
|
Merged build started. |
|
Merged build finished. |
|
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15815/ |
|
Merged build triggered. |
|
Merged build started. |
|
Merged build finished. All automated tests passed. |
|
All automated tests passed. |
|
LGTM - thanks for the work on this. I'm going to go ahead and merge it. |
|
Thanks Patrick! |
I'm not sure about the test -- I get a lot of unrelated failures for some reason. I'll try to sort it out. But hopefully the automation will test this for me if I send a pull request :).

I'll attach a demo HTML in [Jira](https://issues.apache.org/jira/browse/SPARK-2035).

Author: Daniel Darabos <darabos.daniel@gmail.com>
Author: Patrick Wendell <pwendell@gmail.com>

Closes apache#981 from darabos/darabos-call-stack and squashes the following commits:

f7c6bfa [Daniel Darabos] Fix bad merge. I undid 83c226d by Doris.
3d0a48d [Daniel Darabos] Merge remote-tracking branch 'upstream/master' into darabos-call-stack
b857849 [Daniel Darabos] Style: Break long line.
ecb5690 [Daniel Darabos] Include the last Spark method in the full stack trace. Otherwise it is not visible if the stage name is overridden.
d00a85b [Patrick Wendell] Make call sites for stages non-optional and well defined
b9eba24 [Daniel Darabos] Make StageInfo.details non-optional. Add JSON serialization code for the new field. Verify JSON backward compatibility.
4312828 [Daniel Darabos] Remove Mima excludes for CallSite. They should be unnecessary now, with SPARK-2070 fixed.
0920750 [Daniel Darabos] Merge remote-tracking branch 'upstream/master' into darabos-call-stack
a4b1faf [Daniel Darabos] Add Mima exclusions for the CallSite changes it has picked up. They are private methods/classes, so we ought to be safe.
932f810 [Daniel Darabos] Use empty CallSite instead of null in DAGSchedulerSuite. Outside of testing, this parameter always originates in SparkContext.scala, and will never be null.
ccd89d1 [Daniel Darabos] Fix long lines.
ac173e4 [Daniel Darabos] Hide "show details" if there are no details to show.
6182da6 [Daniel Darabos] Set a configurable limit on maximum call stack depth. It can be useful in memory-constrained situations with large numbers of stages.
8fe2e34 [Daniel Darabos] Store call stack for stages, display it on the UI.