Add caching information to rdd.toDebugString#1535
Add caching information to rdd.toDebugString#1535nkronenfeld wants to merge 9 commits intoapache:masterfrom nkronenfeld:feature/debug-caching2
Conversation
|
Hey, do you mind putting an example of what the output looks like in the PR description? |
…es more clear
Changes RDD.toDebugString() to show hierarchy and shuffle transformations more clearly
New output:
```
(3) FlatMappedValuesRDD[325] at apply at Transformer.scala:22
| MappedValuesRDD[324] at apply at Transformer.scala:22
| CoGroupedRDD[323] at apply at Transformer.scala:22
+-(5) MappedRDD[320] at apply at Transformer.scala:22
| | MappedRDD[319] at apply at Transformer.scala:22
| | MappedValuesRDD[318] at apply at Transformer.scala:22
| | MapPartitionsRDD[317] at apply at Transformer.scala:22
| | ShuffledRDD[316] at apply at Transformer.scala:22
| +-(10) MappedRDD[315] at apply at Transformer.scala:22
| | ParallelCollectionRDD[314] at apply at Transformer.scala:22
+-(100) MappedRDD[322] at apply at Transformer.scala:22
| ParallelCollectionRDD[321] at apply at Transformer.scala:22
```
Author: Gregory Owen <greowen@gmail.com>
Closes #1364 from GregOwen/to-debug-string and squashes the following commits:
08f5c78 [Gregory Owen] toDebugString: prettier debug printing to show shuffles and joins more clearly
1603f7b [Gregory Owen] toDebugString: prettier debug printing to show shuffles and joins more clearly
|
Done, and I also left a comment on Greg Owen's PR from yesterday asking him for formatting comments |
|
Sorry, forgot to move one small formatting issue over from the old branch, I'll check that in as soon as I test it. [DONE] |
|
QA tests have started for PR 1535. This patch merges cleanly. |
|
QA results for PR 1535: |
|
@gowen mind taking a look? |
There was a problem hiding this comment.
s"$partitionStr $desc"
s"$nextPrefix $desc"
There was a problem hiding this comment.
And elsewhere in this PR, avoid string concatenation with + when string interpolation would be equally clear or clearer.
|
thanks mark, I had no idea that existed. |
|
QA tests have started for PR 1535. This patch merges cleanly. |
|
QA results for PR 1535: |
|
QA tests have started for PR 1535. This patch merges cleanly. |
|
QA results for PR 1535: |
|
I'm not sure what to do about this test failure; all I've changed is toDebugString, and this is in a spark streaming test which never calls that, so I'm pretty sure it's nothing to do with me. |
|
Jenkins, test this please |
|
QA tests have started for PR 1535. This patch merges cleanly. |
|
QA results for PR 1535: |
There was a problem hiding this comment.
Hey on this one, this is actually an extremely operation... I wonder if maybe for now it's better to not put this in there and only put the storage level.
There was a problem hiding this comment.
I'm not sure what you mean - do you mean "an extremely costly operation"?
Assuming that to be the case, two comments::
- I though about attaching flags to the function so one could specify the type of debug information desired; I think that makes the function too complex, but I'm hardly firm in that idea.
- This whole function is specifically to help a developer with debugging. I don't think having it be costly is all that bad.
There was a problem hiding this comment.
Ah sorry, yeah I mean this very costly. I'd rather not do this in a debug function - because people will do things like print debug statements inside of loops. In that case the debugging will significantly alter the performance of their application. There is a separate JIRA to make this function faster (it's a function also used in the UI), but until that's fixed I'd rather not call it here:
There was a problem hiding this comment.
BTW - we can create a JIRA to add this back once SPARK-2316 is fixed if you'd like.
|
Looks good to me. |
…e shown or not. Default is for it not to be shown.
|
I just parameterized the memory so one can display it or not as desired (with not displaying it the default) - is that sufficient? I forgot to put in the note about the JIRA into the code, I'll definitely add that too, or I can back out the optional nature and just leave in the code comment about the JIRA Also, while I was at it, I marked this method as DeveloperAPI - it seems to me an oversight that it isn't, but if I'm wrong, or if that should be in a separate PR, let me know, it's trivial to put back, of course. Let me know which you want, please. Thanks, |
|
QA tests have started for PR 1535. This patch merges cleanly. |
|
QA results for PR 1535: |
|
QA tests have started for PR 1535. This patch merges cleanly. |
|
QA results for PR 1535: |
…ack out if necessary)
|
QA tests have started for PR 1535. This patch DID NOT merge cleanly! |
|
QA results for PR 1535: |
|
If I'm reading that correctly, that test failure is from an MLLib change that's nothing to do with what I've done? Perhaps I'll just try it again, maybe it's a bad sync with master: Jenkins, please test this |
|
QA tests have started for PR 1535. This patch merges cleanly. |
|
QA results for PR 1535: |
|
Hey @nkronenfeld - I traced through the exact function call more closely and I actually think it's fine. The issue I pointed out in the JIRA is orthogonal. So I'm fine to just revert this back to always showing the status. However, we should not mark this as a developer API. This is a stable API we are happy to support forever. Still, this will cause a significant amount of object allocation due to the way other internal function calls happen (it is basically O(all blocks)) for an application. It might be nice to add a note to the docs that the operation might be expensive and should not be called inside of a critical code path. Though we could likely optimize those things down the road. |
|
Thanks, @pwendel. I can revert it back if you want - is that preferable to the way it is now, with the option to include the memory info or not? I'll start with taking out the DeveloperAPI and adjusting the docs; I'll leave off taking out the optional memory parameter until I hear from you again. |
|
yeah to keep it simple let's just always have it show memory. I'd rather not add a new public API for this |
|
QA tests have started for PR 1535. This patch merges cleanly. |
|
QA results for PR 1535: |
|
OK, @pwendel, I think it's set now. Let me know if there are merge problems, I can resubmit on a clean branch if necessary. |
|
Hey this looks good. Merging it now into mater. Sorry about the delay. |
I find it useful to see where in an RDD's DAG data is cached, so I figured others might too.
I've added both the caching level, and the actual memory state of the RDD.
Some of this is redundant with the web UI (notably the actual memory state), but (a) that is temporary, and (b) putting it in the DAG tree shows some context that can help a lot.
For example:
```
(4) ShuffledRDD[3] at reduceByKey at <console>:14
+-(4) MappedRDD[2] at map at <console>:14
| MapPartitionsRDD[1] at mapPartitions at <console>:12
| ParallelCollectionRDD[0] at parallelize at <console>:12
```
should change to
```
(4) ShuffledRDD[3] at reduceByKey at <console>:14 [Memory Deserialized 1x Replicated]
| CachedPartitions: 4; MemorySize: 50.8 MB; TachyonSize: 0.0 B; DiskSize: 0.0 B
+-(4) MappedRDD[2] at map at <console>:14 [Memory Deserialized 1x Replicated]
| MapPartitionsRDD[1] at mapPartitions at <console>:12 [Memory Deserialized 1x Replicated]
| CachedPartitions: 4; MemorySize: 109.1 MB; TachyonSize: 0.0 B; DiskSize: 0.0 B
| ParallelCollectionRDD[0] at parallelize at <console>:12 [Memory Deserialized 1x Replicated]
```
Author: Nathan Kronenfeld <nkronenfeld@oculusinfo.com>
Closes apache#1535 from nkronenfeld/feature/debug-caching2 and squashes the following commits:
40490bc [Nathan Kronenfeld] Back out DeveloperAPI and arguments to RDD.toDebugString, reinstate memory output
794e6a3 [Nathan Kronenfeld] Attempt to merge mima changes from master
6fe9e80 [Nathan Kronenfeld] Add exclusions to allow for signature change in toDebugString (will back out if necessary)
31d6769 [Nathan Kronenfeld] Attempt to get rid of style errors. Add comments for the new memory usage parameter.
a0f6f76 [Nathan Kronenfeld] Add parameter to RDD.toDebugString to allow detailed memory info to be shown or not. Default is for it not to be shown.
f8f565a [Nathan Kronenfeld] Fix code style error
8f54287 [Nathan Kronenfeld] Changed string addition to string interpolation as per PR comments
2a0cd4d [Nathan Kronenfeld] Fixed a small formatting issue I forgot to copy over from the old branch
8fbecb6 [Nathan Kronenfeld] Add caching information to rdd.toDebugString
I find it useful to see where in an RDD's DAG data is cached, so I figured others might too.
I've added both the caching level, and the actual memory state of the RDD.
Some of this is redundant with the web UI (notably the actual memory state), but (a) that is temporary, and (b) putting it in the DAG tree shows some context that can help a lot.
For example:
should change to