SPARK-2035: Store call stack for stages, display it on the UI. #981

Closed
darabos wants to merge 14 commits into apache:master from darabos:darabos-call-stack

Conversation

@darabos (Contributor) commented Jun 5, 2014

I'm not sure about the test -- I get a lot of unrelated failures for some reason. I'll try to sort it out. But hopefully the automation will test this for me if I send a pull request :).

I'll attach a demo HTML in Jira.

@AmplabJenkins

Can one of the admins verify this patch?

@rxin (Contributor) commented Jun 6, 2014

This is fairly cool. However, instead of "show details", can we rename that to "call stack"? It is not always a code line that appears there (it can be changed by other things).

@ankurdave (Contributor)

Cool, +1! I wrote essentially the same thing today before noticing this PR, except that I wanted to show the RDD creation site for cached RDDs instead of stages.

@mridulm (Contributor) commented Jun 6, 2014

What is the impact on memory footprint for this change?
We have jobs which result in 40k stages, and we have jobs which have stages with 50k tasks.
I would assume the former would be impacted, but by how much?

And if the impact is non-trivial (the master is already at the limits of its memory), how can the user go about reducing the memory footprint of this change? Possibly by disabling it? Reducing the amount of info stored? Something better/different?

Since we have support for job groups, most of the time the actual stack trace is fairly easy to infer: hence the query.

@darabos (Contributor, Author) commented Jun 6, 2014

Wow, 40k stages? 😮 To estimate the memory use, I guess a large stack trace could be ~10 kB, so it would be 400 MB total. Would that be noticeable compared to the base memory use from the 40k stages?

How about I limit the stack trace to 1 kB? That would still be 20 lines, if there is an average of 50 characters per line. I expect it would be enough in most cases, and it would limit the memory use to 40 MB in your giant job.

(I could make the cutoff a system property. Should I?)
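For illustration, a minimal Scala sketch of what such a configurable cutoff could look like. The property name spark.callstack.depth, its default of 20, and the helper object are assumptions made for this sketch, not necessarily what the patch does:

// Hypothetical sketch: cap the stored call stack at a configurable depth.
object CallStackLimit {
  def truncatedStackTrace(trace: Array[StackTraceElement]): String = {
    // Property name and default are illustrative assumptions.
    val maxDepth = sys.props.getOrElse("spark.callstack.depth", "20").toInt
    // Keep only the top of the stack to bound per-stage memory use.
    trace.take(maxDepth).map(_.toString).mkString("\n")
  }

  def main(args: Array[String]): Unit =
    println(truncatedStackTrace(Thread.currentThread.getStackTrace))
}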

@mridulm (Contributor) commented Jun 6, 2014

I don't have solutions, particularly since I have not looked much into the UI code in Spark.
Having hit memory issues from the UI recently (and since it is too important to disable for our jobs), I wanted to find out what sort of impact UI changes have from a memory point of view, and whether useful features can be added without too high a cost.

@darabos (Contributor, Author) commented Jun 6, 2014

@ankurdave: Cool! For some reason I didn't wire up RDDs, only stages. Your change should complement this nicely.

@rxin: I went with "details" instead of "call stack" exactly because I can imagine situations where it is something other than a call stack. Probably that's the kind of situation where the stage name is not a code location. But I'm happy to change it if you prefer.

@ankurdave (Contributor)

ok to test

@AmplabJenkins

Build triggered.

@AmplabJenkins

Build started.

@AmplabJenkins

Build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15514/

@darabos (Contributor, Author) commented Jun 6, 2014

I forgot scalastyle, sorry :(. Try again?

@AmplabJenkins

Build triggered.

@AmplabJenkins

Build started.

@AmplabJenkins

Build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15517/

@darabos (Contributor, Author) commented Jun 6, 2014

I have some fixing to do.

Use empty CallSite instead of null in DAGSchedulerSuite. Outside of testing, this parameter always originates in SparkContext.scala, and will never be null.
@AmplabJenkins

Build triggered.

@AmplabJenkins

Build started.

@AmplabJenkins

Build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15519/

@darabos (Contributor, Author) commented Jun 7, 2014

Okay, now it is a binary incompatibility:

[error]  * method getCallSite()java.lang.String in class org.apache.spark.SparkContext has now a different result type; was: java.lang.String, is now: org.apache.spark.util.CallSite
[error]  * method creationSiteInfo()org.apache.spark.util.Utils#CallSiteInfo in class org.apache.spark.rdd.RDD does not have a correspondent in new version
[error]  * class org.apache.spark.util.Utils#CallSiteInfo does not have a correspondent in new version

These are all private methods/classes, but they can still affect binary compatibility. What is the recommended way of solving this? Thanks!

@pwendell (Contributor) commented Jun 8, 2014

@darabos the compatibility issues are false positives. You can add excludes for them in project/MimaExcludes.scala to work around it. We've just turned on binary checking for Spark core and we are learning about some corner cases in the checking tools (https://issues.apache.org/jira/browse/SPARK-2069). For now it's fine to just add an exclude for it manually.
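For reference, a sketch of what manual excludes for these three reports might look like in project/MimaExcludes.scala. The specific problem types are assumptions based on MiMa's standard ProblemFilters API, matched to the errors above, not the PR's actual excludes:

// Sketch of manual MiMa excludes for the three false positives above.
import com.typesafe.tools.mima.core._
import com.typesafe.tools.mima.core.ProblemFilters._

val callSiteExcludes = Seq(
  // getCallSite changed its result type from String to CallSite.
  exclude[IncompatibleResultTypeProblem]("org.apache.spark.SparkContext.getCallSite"),
  // RDD.creationSiteInfo and Utils.CallSiteInfo no longer exist.
  exclude[MissingMethodProblem]("org.apache.spark.rdd.RDD.creationSiteInfo"),
  exclude[MissingClassProblem]("org.apache.spark.util.Utils$CallSiteInfo")
)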

Add Mima exclusions for the CallSite changes it has picked up. They are private methods/classes, so we ought to be safe.
@AmplabJenkins

Build triggered.

@AmplabJenkins

Build started.

@AmplabJenkins

Build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15761/

@AmplabJenkins

Build triggered.

@AmplabJenkins

Build started.

@darabos (Contributor, Author) commented Jun 13, 2014

Thanks for the feedback! I've added JSON (de)serialization code for the new field. Patched in your change (thanks!). And added one more line to the top of the stack trace.
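As an aside, a minimal json4s sketch of how (de)serialization of such a field can stay backward compatible with event logs written before the field existed. All names here are illustrative assumptions, not the patch's actual JsonProtocol code:

import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

object DetailsJson {
  implicit val formats: Formats = DefaultFormats

  // Serialize the new field; the key "Details" is an assumption.
  def detailsToJson(details: String): JValue =
    ("Details" -> details)

  // Treat a missing field as empty so that old event logs still parse.
  def detailsFromJson(json: JValue): String =
    (json \ "Details").extractOpt[String].getOrElse("")

  def main(args: Array[String]): Unit = {
    val json = detailsToJson("org.apache.spark.rdd.RDD.count(RDD.scala:123)")
    println(compact(render(json)))        // {"Details":"org.apache..."}
    println(detailsFromJson(parse("{}"))) // empty string for old logs
  }
}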

@AmplabJenkins

Build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15762/

@darabos (Contributor, Author) commented Jun 13, 2014

It's a failure in pyspark/sql.py, but I can't reproduce it locally either in my branch or in upstream master. How did Jenkins do it?

File "pyspark/sql.py", line 120, in __main__.SQLContext.parquetFile
Failed example:
    srdd.collect() == srdd2.collect()
Expected:
    True
Got:
    False

@pwendell (Contributor)

@darabos hey that's a flaky unit test we fixed a while back. The issue is that your patch doesn't merge cleanly with master, so it's not getting the "fix". Do you mind updating your patch so it merges cleanly? BTW - I'm working on something that better explains this situation in our QA harness (where a patch isn't merging), but it's not ready yet!

Merge remote-tracking branch 'upstream/master' into darabos-call-stack
Conflicts:
	core/src/main/scala/org/apache/spark/rdd/RDD.scala
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15815/

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15816/

@pwendell (Contributor)

LGTM - thanks for the work on this. I'm going to go ahead and merge it.

@asfgit closed this in 23a12ce Jun 17, 2014
@darabos (Contributor, Author) commented Jun 17, 2014

Thanks Patrick!
@ankurdave: Do you want to add this to the storage UI? I can probably do it too if you're busy.

pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
I'm not sure about the test -- I get a lot of unrelated failures for some reason. I'll try to sort it out. But hopefully the automation will test this for me if I send a pull request :).

I'll attach a demo HTML in [Jira](https://issues.apache.org/jira/browse/SPARK-2035).

Author: Daniel Darabos <darabos.daniel@gmail.com>
Author: Patrick Wendell <pwendell@gmail.com>

Closes apache#981 from darabos/darabos-call-stack and squashes the following commits:

f7c6bfa [Daniel Darabos] Fix bad merge. I undid 83c226d by Doris.
3d0a48d [Daniel Darabos] Merge remote-tracking branch 'upstream/master' into darabos-call-stack
b857849 [Daniel Darabos] Style: Break long line.
ecb5690 [Daniel Darabos] Include the last Spark method in the full stack trace. Otherwise it is not visible if the stage name is overridden.
d00a85b [Patrick Wendell] Make call sites for stages non-optional and well defined
b9eba24 [Daniel Darabos] Make StageInfo.details non-optional. Add JSON serialization code for the new field. Verify JSON backward compatibility.
4312828 [Daniel Darabos] Remove Mima excludes for CallSite. They should be unnecessary now, with SPARK-2070 fixed.
0920750 [Daniel Darabos] Merge remote-tracking branch 'upstream/master' into darabos-call-stack
a4b1faf [Daniel Darabos] Add Mima exclusions for the CallSite changes it has picked up. They are private methods/classes, so we ought to be safe.
932f810 [Daniel Darabos] Use empty CallSite instead of null in DAGSchedulerSuite. Outside of testing, this parameter always originates in SparkContext.scala, and will never be null.
ccd89d1 [Daniel Darabos] Fix long lines.
ac173e4 [Daniel Darabos] Hide "show details" if there are no details to show.
6182da6 [Daniel Darabos] Set a configurable limit on maximum call stack depth. It can be useful in memory-constrained situations with large numbers of stages.
8fe2e34 [Daniel Darabos] Store call stack for stages, display it on the UI.
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
