Add caching information to rdd.toDebugString by nkronenfeld · Pull Request #1535 · apache/spark

nkronenfeld · 2014-07-22T18:45:40Z

I find it useful to see where in an RDD's DAG data is cached, so I figured others might too.

I've added both the caching level, and the actual memory state of the RDD.

Some of this is redundant with the web UI (notably the actual memory state), but (a) that is temporary, and (b) putting it in the DAG tree shows some context that can help a lot.

For example:

(4) ShuffledRDD[3] at reduceByKey at <console>:14
 +-(4) MappedRDD[2] at map at <console>:14
    |  MapPartitionsRDD[1] at mapPartitions at <console>:12
    |  ParallelCollectionRDD[0] at parallelize at <console>:12

should change to

(4) ShuffledRDD[3] at reduceByKey at <console>:14 [Memory Deserialized 1x Replicated]
 |       CachedPartitions: 4; MemorySize: 50.8 MB; TachyonSize: 0.0 B; DiskSize: 0.0 B
 +-(4) MappedRDD[2] at map at <console>:14 [Memory Deserialized 1x Replicated]
    |  MapPartitionsRDD[1] at mapPartitions at <console>:12 [Memory Deserialized 1x Replicated]
    |      CachedPartitions: 4; MemorySize: 109.1 MB; TachyonSize: 0.0 B; DiskSize: 0.0 B
    |  ParallelCollectionRDD[0] at parallelize at <console>:12 [Memory Deserialized 1x Replicated]
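
The annotated output above can be approximated with a small, self-contained sketch. This is only a toy model, not the code in this patch: `Node`, `cacheInfo`, and `debugLines` are hypothetical names, and the lineage is reduced to a single parent chain. It shows the idea of appending the storage level to every line and adding a cache-summary line only where cached data exists.

```scala
// Toy model of an RDD lineage; Node and its fields are hypothetical, not Spark classes.
object LineageSketch {
  final case class Node(
      desc: String,                // e.g. "MappedRDD[2] at map at <console>:14"
      storageLevel: String,        // e.g. "Memory Deserialized 1x Replicated"
      cacheInfo: Option[String],   // e.g. Some("CachedPartitions: 4; MemorySize: 50.8 MB")
      parent: Option[Node] = None) // single dependency, to keep the sketch short

  // First line: description plus storage level in brackets.
  // Optional second line: the cache summary, indented under the node.
  def describe(node: Node): List[String] =
    s"${node.desc} [${node.storageLevel}]" :: node.cacheInfo.map(info => s"    $info").toList

  // Walk from the given node back to the start of the chain.
  def debugLines(node: Node): List[String] =
    describe(node) ++ node.parent.map(debugLines).getOrElse(Nil)
}
```

Calling `LineageSketch.debugLines(node).mkString("\n")` would give one block of text per lineage, with the cache line present only for nodes whose `cacheInfo` is set.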

pwendell · 2014-07-22T18:46:43Z

Hey, do you mind putting an example of what the output looks like in the PR description?

…es more clear

Changes RDD.toDebugString() to show hierarchy and shuffle transformations more clearly

New output:

```
(3) FlatMappedValuesRDD[325] at apply at Transformer.scala:22
 |  MappedValuesRDD[324] at apply at Transformer.scala:22
 |  CoGroupedRDD[323] at apply at Transformer.scala:22
 +-(5) MappedRDD[320] at apply at Transformer.scala:22
 |  |  MappedRDD[319] at apply at Transformer.scala:22
 |  |  MappedValuesRDD[318] at apply at Transformer.scala:22
 |  |  MapPartitionsRDD[317] at apply at Transformer.scala:22
 |  |  ShuffledRDD[316] at apply at Transformer.scala:22
 |  +-(10) MappedRDD[315] at apply at Transformer.scala:22
 |     |  ParallelCollectionRDD[314] at apply at Transformer.scala:22
 +-(100) MappedRDD[322] at apply at Transformer.scala:22
    |  ParallelCollectionRDD[321] at apply at Transformer.scala:22
```

Author: Gregory Owen <greowen@gmail.com>

Closes #1364 from GregOwen/to-debug-string and squashes the following commits:

08f5c78 [Gregory Owen] toDebugString: prettier debug printing to show shuffles and joins more clearly
1603f7b [Gregory Owen] toDebugString: prettier debug printing to show shuffles and joins more clearly

nkronenfeld · 2014-07-22T18:52:35Z

Done, and I also left a comment on Greg Owen's PR from yesterday asking him for formatting comments

nkronenfeld · 2014-07-22T20:39:37Z

Sorry, forgot to move one small formatting issue over from the old branch, I'll check that in as soon as I test it. [DONE]

SparkQA · 2014-07-22T20:48:25Z

QA tests have started for PR 1535. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16987/consoleFull

SparkQA · 2014-07-22T20:49:13Z

QA results for PR 1535:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16987/consoleFull

pwendell · 2014-07-22T20:54:22Z

@gowen mind taking a look?

markhamstra · 2014-07-22T21:55:22Z

core/src/main/scala/org/apache/spark/rdd/RDD.scala

s"$partitionStr $desc"
s"$nextPrefix $desc"

And elsewhere in this PR, avoid string concatenation with + when string interpolation would be equally clear or clearer.
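
For readers unfamiliar with the suggestion, here is a minimal sketch of the two styles side by side. The object and method names are made up for illustration; only the `s"..."` interpolator itself is standard Scala.

```scala
object InterpolationSketch {
  // Concatenation with +, the style the review asks to avoid.
  def concatenated(prefix: String, desc: String): String =
    prefix + " " + desc

  // The same result with the s-interpolator: $name splices a value into the string.
  def interpolated(prefix: String, desc: String): String =
    s"$prefix $desc"
}
```

Both methods return the same string; the interpolated form keeps the template readable as the number of pieces grows.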

nkronenfeld · 2014-07-23T02:10:50Z

thanks mark, I had no idea that existed.

SparkQA · 2014-07-23T02:13:33Z

QA tests have started for PR 1535. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17005/consoleFull

SparkQA · 2014-07-23T02:14:21Z

QA results for PR 1535:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17005/consoleFull

SparkQA · 2014-07-23T02:48:30Z

QA tests have started for PR 1535. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17008/consoleFull

SparkQA · 2014-07-23T04:23:57Z

QA results for PR 1535:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17008/consoleFull

nkronenfeld · 2014-07-23T13:10:44Z

I'm not sure what to do about this test failure; all I've changed is toDebugString, and this is in a spark streaming test which never calls that, so I'm pretty sure it's nothing to do with me.

markhamstra · 2014-07-23T13:43:49Z

Jenkins, test this please

SparkQA · 2014-07-23T13:48:20Z

QA tests have started for PR 1535. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17035/consoleFull

SparkQA · 2014-07-23T15:29:06Z

QA results for PR 1535:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17035/consoleFull

pwendell · 2014-07-23T18:49:20Z

core/src/main/scala/org/apache/spark/rdd/RDD.scala

Hey on this one, this is actually an extremely operation... I wonder if maybe for now it's better to not put this in there and only put the storage level.

nkronenfeld · 2014-07-23

I'm not sure what you mean - do you mean "an extremely costly operation"?

Assuming that to be the case, two comments:

I thought about attaching flags to the function so one could specify the type of debug information desired; I think that makes the function too complex, but I'm hardly firm in that idea.

This whole function is specifically to help a developer with debugging. I don't think having it be costly is all that bad.

pwendell · 2014-07-24

Ah sorry, yeah, I mean this is very costly. I'd rather not do this in a debug function, because people will do things like print debug statements inside of loops. In that case the debugging will significantly alter the performance of their application. There is a separate JIRA to make this function faster (it's a function also used in the UI), but until that's fixed I'd rather not call it here:

https://issues.apache.org/jira/browse/SPARK-2316

BTW - we can create a JIRA to add this back once SPARK-2316 is fixed if you'd like.
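
The concern about debug statements inside loops can be illustrated with a small sketch. It is not part of this patch: `expensiveDebugString` is a stand-in for a costly call such as `rdd.toDebugString`, and `debug` is a hypothetical helper. It shows one common way a caller can avoid paying for the string when debugging is off, using a by-name parameter.

```scala
object LazyDebugSketch {
  // Counts how often the expensive string is actually built.
  var buildCount = 0

  // Stand-in for an expensive call; the returned text is a placeholder.
  def expensiveDebugString(): String = {
    buildCount += 1
    "(4) ShuffledRDD[3] ..."
  }

  // msg is a by-name parameter: it is evaluated only when enabled is true.
  def debug(enabled: Boolean)(msg: => String): Unit =
    if (enabled) println(msg)

  // A loop that logs on every iteration, as in the scenario described above.
  def runLoop(iterations: Int, debugEnabled: Boolean): Unit =
    for (_ <- 1 to iterations) debug(debugEnabled)(expensiveDebugString())
}
```

With debugging disabled the expensive call never runs; with it enabled the cost is paid once per iteration, which is the situation the comment above warns about.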

GregOwen · 2014-07-23T19:58:34Z

Looks good to me.

nkronenfeld · 2014-07-28T19:40:40Z

I just parameterized the memory so one can display it or not as desired (with not displaying it the default) - is that sufficient?

I forgot to put in the note about the JIRA into the code, I'll definitely add that too, or I can back out the optional nature and just leave in the code comment about the JIRA

Also, while I was at it, I marked this method as DeveloperAPI - it seems to me an oversight that it isn't, but if I'm wrong, or if that should be in a separate PR, let me know, it's trivial to put back, of course.

Let me know which you want, please.

Thanks,
-Nathan
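
A minimal sketch of what such a parameterization could look like, assuming a boolean flag with a default value. The signature below is illustrative only: it takes the description and memory line as plain strings, which the real method does not, and `showMemory` is used here as an assumed parameter name.

```scala
object DefaultParamSketch {
  // Hypothetical stand-in for a debug-string method with an optional memory section.
  // showMemory defaults to false, so existing call sites keep their old output.
  def toDebugString(desc: String, memoryLine: String, showMemory: Boolean = false): String =
    if (showMemory) s"$desc\n$memoryLine" else desc
}
```

Callers that want the extra detail would pass `showMemory = true`. Note that in Scala, adding a parameter (even one with a default) changes the compiled method signature, which is consistent with the later commit in this PR that adds exclusions for the signature change.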

SparkQA · 2014-07-28T19:43:51Z

QA tests have started for PR 1535. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17302/consoleFull

SparkQA · 2014-07-28T19:44:35Z

QA results for PR 1535:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17302/consoleFull

SparkQA · 2014-07-28T21:08:48Z

QA tests have started for PR 1535. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17304/consoleFull

SparkQA · 2014-07-28T21:54:08Z

QA results for PR 1535:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17304/consoleFull

SparkQA · 2014-07-28T22:24:08Z

QA tests have started for PR 1535. This patch DID NOT merge cleanly!
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17310/consoleFull

SparkQA · 2014-07-29T00:11:12Z

QA results for PR 1535:
- This patch FAILED unit tests.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17310/consoleFull

nkronenfeld · 2014-07-29T14:57:00Z

If I'm reading that correctly, that test failure is from an MLlib change that has nothing to do with what I've done? Perhaps I'll just try it again, maybe it's a bad sync with master:

Jenkins, please test this

SparkQA · 2014-07-29T16:28:53Z

QA tests have started for PR 1535. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17363/consoleFull

SparkQA · 2014-07-29T17:16:51Z

QA results for PR 1535:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17363/consoleFull

pwendell · 2014-07-30T06:47:42Z

Hey @nkronenfeld - I traced through the exact function call more closely and I actually think it's fine. The issue I pointed out in the JIRA is orthogonal. So I'm fine to just revert this back to always showing the status. However, we should not mark this as a developer API. This is a stable API we are happy to support forever.

Still, this will cause a significant amount of object allocation due to the way other internal function calls happen (it is basically O(all blocks)) for an application. It might be nice to add a note to the docs that the operation might be expensive and should not be called inside of a critical code path. Though we could likely optimize those things down the road.
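
To make the "O(all blocks)" remark concrete, here is a rough sketch under stated assumptions. `Block` and `Summary` are hypothetical types, not Spark's internal ones, and the sketch assumes one block per cached partition. It shows why summarizing a single RDD can cost time proportional to every block the application holds.

```scala
object BlockScanSketch {
  // Hypothetical record of one cached block; not Spark's actual block status type.
  final case class Block(rddId: Int, memSize: Long, diskSize: Long)

  // Per-RDD totals, loosely mirroring the CachedPartitions / MemorySize / DiskSize line.
  final case class Summary(cachedPartitions: Int, memSize: Long, diskSize: Long)

  // Scans every block of the application to summarise one RDD, so the cost grows
  // with the total number of blocks, not with the size of the RDD being described.
  def summarise(rddId: Int, allBlocks: Seq[Block]): Summary = {
    val mine = allBlocks.filter(_.rddId == rddId) // assumes one block per cached partition
    Summary(mine.size, mine.map(_.memSize).sum, mine.map(_.diskSize).sum)
  }
}
```

Calling `summarise` once per RDD in a lineage repeats the full scan each time, which is why it is worth keeping such a call out of critical code paths.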

nkronenfeld · 2014-07-30T14:26:06Z

Thanks, @pwendell. I can revert it back if you want - is that preferable to the way it is now, with the option to include the memory info or not?

I'll start with taking out the DeveloperAPI and adjusting the docs; I'll leave off taking out the optional memory parameter until I hear from you again.

pwendell · 2014-07-30T18:40:55Z

yeah to keep it simple let's just always have it show memory. I'd rather not add a new public API for this showMemory thing at the moment.

SparkQA · 2014-07-31T22:39:15Z

QA tests have started for PR 1535. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17612/consoleFull

SparkQA · 2014-07-31T23:31:25Z

QA results for PR 1535:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17612/consoleFull

nkronenfeld · 2014-08-01T00:30:35Z

OK, @pwendell, I think it's set now. Let me know if there are merge problems, I can resubmit on a clean branch if necessary.

pwendell · 2014-08-15T05:15:18Z

Hey this looks good. Merging it now into master. Sorry about the delay.

I find it useful to see where in an RDD's DAG data is cached, so I figured others might too.

I've added both the caching level, and the actual memory state of the RDD.

Some of this is redundant with the web UI (notably the actual memory state), but (a) that is temporary, and (b) putting it in the DAG tree shows some context that can help a lot.

For example:

```
(4) ShuffledRDD[3] at reduceByKey at <console>:14
 +-(4) MappedRDD[2] at map at <console>:14
    |  MapPartitionsRDD[1] at mapPartitions at <console>:12
    |  ParallelCollectionRDD[0] at parallelize at <console>:12
```

should change to

```
(4) ShuffledRDD[3] at reduceByKey at <console>:14 [Memory Deserialized 1x Replicated]
 |       CachedPartitions: 4; MemorySize: 50.8 MB; TachyonSize: 0.0 B; DiskSize: 0.0 B
 +-(4) MappedRDD[2] at map at <console>:14 [Memory Deserialized 1x Replicated]
    |  MapPartitionsRDD[1] at mapPartitions at <console>:12 [Memory Deserialized 1x Replicated]
    |      CachedPartitions: 4; MemorySize: 109.1 MB; TachyonSize: 0.0 B; DiskSize: 0.0 B
    |  ParallelCollectionRDD[0] at parallelize at <console>:12 [Memory Deserialized 1x Replicated]
```

Author: Nathan Kronenfeld <nkronenfeld@oculusinfo.com>

Closes apache#1535 from nkronenfeld/feature/debug-caching2 and squashes the following commits:

40490bc [Nathan Kronenfeld] Back out DeveloperAPI and arguments to RDD.toDebugString, reinstate memory output
794e6a3 [Nathan Kronenfeld] Attempt to merge mima changes from master
6fe9e80 [Nathan Kronenfeld] Add exclusions to allow for signature change in toDebugString (will back out if necessary)
31d6769 [Nathan Kronenfeld] Attempt to get rid of style errors. Add comments for the new memory usage parameter.
a0f6f76 [Nathan Kronenfeld] Add parameter to RDD.toDebugString to allow detailed memory info to be shown or not. Default is for it not to be shown.
f8f565a [Nathan Kronenfeld] Fix code style error
8f54287 [Nathan Kronenfeld] Changed string addition to string interpolation as per PR comments
2a0cd4d [Nathan Kronenfeld] Fixed a small formatting issue I forgot to copy over from the old branch
8fbecb6 [Nathan Kronenfeld] Add caching information to rdd.toDebugString

Add caching information to rdd.toDebugString

8fbecb6

Fixed a small formatting issue I forgot to copy over from the old branch

2a0cd4d

markhamstra reviewed Jul 22, 2014

Changed string addition to string interpolation as per PR comments

8f54287

Fix code style error

f8f565a

pwendell reviewed Jul 23, 2014

Add parameter to RDD.toDebugString to allow detailed memory info to be shown or not. Default is for it not to be shown.

a0f6f76

Attempt to get rid of style errors. Add comments for the new memory usage parameter.

31d6769

Add exclusions to allow for signature change in toDebugString (will back out if necessary)

6fe9e80

Attempt to merge mima changes from master

794e6a3

Back out DeveloperAPI and arguments to RDD.toDebugString, reinstate memory output

40490bc

asfgit closed this in fba8ec3 Aug 15, 2014
