SPARK-2035: Store call stack for stages, display it on the UI. #981

Closed
darabos wants to merge 14 commits into apache:master from darabos:darabos-call-stack

Conversation

@darabos (Contributor) commented Jun 5, 2014

I'm not sure about the test -- I get a lot of unrelated failures for some reason. I'll try to sort it out. But hopefully the automation will test this for me if I send a pull request :).

I'll attach a demo HTML in Jira.

@AmplabJenkins

Can one of the admins verify this patch?

@rxin (Contributor) commented Jun 6, 2014

This is fairly cool. However, instead of "show details", can we rename that to "call stack"? It is not always a code line that appears there (it can be changed by other things).

@ankurdave (Contributor)

Cool, +1! I wrote essentially the same thing today before noticing this PR, except that I wanted to show the RDD creation site for cached RDDs instead of stages.

@mridulm (Contributor) commented Jun 6, 2014

What is the impact on memory footprint for this change?
We have jobs which result in 40k stages, and we have jobs which have stages with 50k tasks.
I would assume the former would be impacted, but by how much?

And if the impact is non-trivial (the master is already at the limits of its memory), how can the user go about reducing the memory footprint of this change? Possibly by disabling it? Reducing the amount of info stored? Something better/different?

Since we have support for job groups, most of the time the actual stack trace is fairly easy to infer: hence the query.

@darabos (Contributor, Author) commented Jun 6, 2014

Wow, 40k stages? 😮 To estimate the memory use, I guess a large stack trace could be ~10 kB, so it would be 400 MB total. Would that be noticeable compared to the base memory use from the 40k stages?

How about I limit the stack trace to 1 kB? That would still be 20 lines, if there is an average of 50 characters per line. I expect it would be enough in most cases, and it would limit the memory use to 40 MB in your giant job.

(I could make the cutoff a system property. Should I?)
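For illustration, a minimal Scala sketch of what such a configurable cutoff could look like. The property name spark.callstack.depth, its default of 20, and the helper object are assumptions made for this sketch, not necessarily what the patch does:

// Hypothetical sketch: cap the stored call stack at a configurable depth.
object CallStackLimit {
  def truncatedStackTrace(trace: Array[StackTraceElement]): String = {
    // Property name and default are illustrative assumptions.
    val maxDepth = sys.props.getOrElse("spark.callstack.depth", "20").toInt
    // Keep only the top of the stack to bound per-stage memory use.
    trace.take(maxDepth).map(_.toString).mkString("\n")
  }

  def main(args: Array[String]): Unit =
    println(truncatedStackTrace(Thread.currentThread.getStackTrace))
}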

@mridulm (Contributor) commented Jun 6, 2014

I don't have solutions, particularly since I have not looked much into the UI code in Spark.
Having hit memory issues from the UI recently (and since it is too important to disable for our jobs), I wanted to find out what sort of impact UI changes have from a memory point of view, and whether useful features can be added without too high a cost.

@darabos (Contributor, Author) commented Jun 6, 2014

@ankurdave: Cool! For some reason I didn't wire up RDDs, only stages. Your change should complement this nicely.

@rxin: I went with "details" instead of "call stack" exactly because I can imagine situations where it is something other than a call stack. Probably that's the kind of situation where the stage name is not a code location. But I'm happy to change it if you prefer.

@ankurdave (Contributor)

ok to test

@AmplabJenkins

Build triggered.

@AmplabJenkins

Build started.

@AmplabJenkins

Build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15514/

@darabos (Contributor, Author) commented Jun 6, 2014

I forgot scalastyle, sorry :(. Try again?

@AmplabJenkins

Build triggered.

@AmplabJenkins

Build started.

@AmplabJenkins

Build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15517/

@darabos (Contributor, Author) commented Jun 6, 2014

I have some fixing to do.

Use empty CallSite instead of null in DAGSchedulerSuite. Outside of testing, this parameter always originates in SparkContext.scala, and will never be null.
@AmplabJenkins

Build triggered.

@AmplabJenkins

Build started.

@AmplabJenkins

Build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15519/

@darabos (Contributor, Author) commented Jun 7, 2014

Okay, now it is a binary incompatibility:

[error]  * method getCallSite()java.lang.String in class org.apache.spark.SparkContext has now a different result type; was: java.lang.String, is now: org.apache.spark.util.CallSite
[error]  * method creationSiteInfo()org.apache.spark.util.Utils#CallSiteInfo in class org.apache.spark.rdd.RDD does not have a correspondent in new version
[error]  * class org.apache.spark.util.Utils#CallSiteInfo does not have a correspondent in new version

These are all private methods/classes, but they can still affect binary compatibility. What is the recommended way of solving this? Thanks!

@pwendell (Contributor) commented Jun 8, 2014

@darabos the compatibility issues are false positives. You can add excludes for them in project/MimaExcludes.scala to work around it. We've just turned on binary checking for Spark core and we are learning about some corner cases in the checking tools (https://issues.apache.org/jira/browse/SPARK-2069). For now it's fine to just add an exclude for it manually.
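For reference, a sketch of what manual excludes for these three reports might look like in project/MimaExcludes.scala. The specific problem types are assumptions based on MiMa's standard ProblemFilters API, matched to the errors above, not the PR's actual excludes:

// Sketch of manual MiMa excludes for the three false positives above.
import com.typesafe.tools.mima.core._
import com.typesafe.tools.mima.core.ProblemFilters._

val callSiteExcludes = Seq(
  // getCallSite changed its result type from String to CallSite.
  exclude[IncompatibleResultTypeProblem]("org.apache.spark.SparkContext.getCallSite"),
  // RDD.creationSiteInfo and Utils.CallSiteInfo no longer exist.
  exclude[MissingMethodProblem]("org.apache.spark.rdd.RDD.creationSiteInfo"),
  exclude[MissingClassProblem]("org.apache.spark.util.Utils$CallSiteInfo")
)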

Add Mima exclusions for the CallSite changes it has picked up. They are private methods/classes, so we ought to be safe.
@AmplabJenkins

Build triggered.

@AmplabJenkins

Build started.

@AmplabJenkins

Build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15761/

@AmplabJenkins

Build triggered.

@AmplabJenkins

Build started.

@darabos (Contributor, Author) commented Jun 13, 2014

Thanks for the feedback! I've added JSON (de)serialization code for the new field. Patched in your change (thanks!). And added one more line to the top of the stack trace.
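As an aside, a minimal json4s sketch of how (de)serialization of such a field can stay backward compatible with event logs written before the field existed. All names here are illustrative assumptions, not the patch's actual JsonProtocol code:

import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

object DetailsJson {
  implicit val formats: Formats = DefaultFormats

  // Serialize the new field; the key "Details" is an assumption.
  def detailsToJson(details: String): JValue =
    ("Details" -> details)

  // Treat a missing field as empty so that old event logs still parse.
  def detailsFromJson(json: JValue): String =
    (json \ "Details").extractOpt[String].getOrElse("")

  def main(args: Array[String]): Unit = {
    val json = detailsToJson("org.apache.spark.rdd.RDD.count(RDD.scala:123)")
    println(compact(render(json)))        // {"Details":"org.apache..."}
    println(detailsFromJson(parse("{}"))) // empty string for old logs
  }
}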

@AmplabJenkins

Build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15762/

@darabos (Contributor, Author) commented Jun 13, 2014

It's a failure in pyspark/sql.py, but I can't reproduce it locally either in my branch or in upstream master. How did Jenkins do it?

File "pyspark/sql.py", line 120, in __main__.SQLContext.parquetFile
Failed example:
    srdd.collect() == srdd2.collect()
Expected:
    True
Got:
    False

@pwendell (Contributor)

@darabos hey that's a flaky unit test we fixed a while back. The issue is that your patch doesn't merge cleanly with master, so it's not getting the "fix". Do you mind updating your patch so it merges cleanly? BTW - I'm working on something that better explains this situation in our QA harness (where a patch isn't merging), but it's not ready yet!

Merge remote-tracking branch 'upstream/master' into darabos-call-stack
Conflicts:
	core/src/main/scala/org/apache/spark/rdd/RDD.scala
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15815/

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15816/

@pwendell (Contributor)

LGTM - thanks for the work on this. I'm going to go ahead and merge it.

@asfgit closed this in 23a12ce Jun 17, 2014
@darabos (Contributor, Author) commented Jun 17, 2014

Thanks Patrick!
@ankurdave: Do you want to add this to the storage UI? I can probably do it too if you're busy.

pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
I'm not sure about the test -- I get a lot of unrelated failures for some reason. I'll try to sort it out. But hopefully the automation will test this for me if I send a pull request :).

I'll attach a demo HTML in [Jira](https://issues.apache.org/jira/browse/SPARK-2035).

Author: Daniel Darabos <darabos.daniel@gmail.com>
Author: Patrick Wendell <pwendell@gmail.com>

Closes apache#981 from darabos/darabos-call-stack and squashes the following commits:

f7c6bfa [Daniel Darabos] Fix bad merge. I undid 83c226d by Doris.
3d0a48d [Daniel Darabos] Merge remote-tracking branch 'upstream/master' into darabos-call-stack
b857849 [Daniel Darabos] Style: Break long line.
ecb5690 [Daniel Darabos] Include the last Spark method in the full stack trace. Otherwise it is not visible if the stage name is overridden.
d00a85b [Patrick Wendell] Make call sites for stages non-optional and well defined
b9eba24 [Daniel Darabos] Make StageInfo.details non-optional. Add JSON serialization code for the new field. Verify JSON backward compatibility.
4312828 [Daniel Darabos] Remove Mima excludes for CallSite. They should be unnecessary now, with SPARK-2070 fixed.
0920750 [Daniel Darabos] Merge remote-tracking branch 'upstream/master' into darabos-call-stack
a4b1faf [Daniel Darabos] Add Mima exclusions for the CallSite changes it has picked up. They are private methods/classes, so we ought to be safe.
932f810 [Daniel Darabos] Use empty CallSite instead of null in DAGSchedulerSuite. Outside of testing, this parameter always originates in SparkContext.scala, and will never be null.
ccd89d1 [Daniel Darabos] Fix long lines.
ac173e4 [Daniel Darabos] Hide "show details" if there are no details to show.
6182da6 [Daniel Darabos] Set a configurable limit on maximum call stack depth. It can be useful in memory-constrained situations with large numbers of stages.
8fe2e34 [Daniel Darabos] Store call stack for stages, display it on the UI.
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
