
[SPARK-3694] RDD and Task serialization debugging output #3518

Closed
wants to merge 25 commits

Conversation

ilganeli

Hi all - in addition to what was explicitly requested in the original JIRA, I also added the ability to trace the serialization of RDDs so that you can see which specific dependency is unserializable. For debugging task serialization, I added a debug log output that shows the file and jar dependencies, but I am unsure whether I can add more functionality there. For an RDD, it is possible to attempt to serialize each dependency in turn, which is why I can identify which component fails; for task debugging, I did not see a straightforward way to do the same thing. If anyone can suggest an approach here, I would be happy to implement it.
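The dependency-by-dependency approach described above can be sketched with plain Java serialization. This is an illustrative sketch, not the patch's actual `SerializationHelper`; `SerializationProbe` and `trySerialize` are made-up names:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import scala.util.control.NonFatal

// Illustrative sketch (not the patch's SerializationHelper): attempt to
// serialize a single reference with plain Java serialization, reporting
// either the serialized size or the exception that caused the failure.
object SerializationProbe {
  def trySerialize(ref: AnyRef): Either[Throwable, Int] =
    try {
      val bytes = new ByteArrayOutputStream()
      val out = new ObjectOutputStream(bytes)
      out.writeObject(ref)
      out.close()
      Right(bytes.size())
    } catch {
      case NonFatal(e) => Left(e)
    }
}
```

Applying something like `trySerialize` to each entry of an RDD's `dependencies` sequence in turn is what lets a trace point at the specific dependency that fails, rather than only reporting that the whole closure is unserializable.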

@AmplabJenkins

Can one of the admins verify this patch?

@JoshRosen
Contributor

I'm going to let Jenkins test this, but my hunch is that the first run is going to fail due to Scalastyle warnings / errors. I'll comment on a couple of these style points inline.

Jenkins, this is ok to test.

* @return - An output string qualifying success or failure.
*/
private def isSerializable(rdd: RDD[_]): String = {
SerializationHelper.isSerializable(closureSerializer,rdd)
Contributor

Put spaces between arguments: closureSerializer, rdd (see https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide)

@SparkQA

SparkQA commented Dec 18, 2014

Test build #24598 has finished for PR 3518 at commit 8e5f710.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class RDDTrace (rdd : RDD[_], depth : Int, result : SerializationHelper.SerializedRef)

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24598/

@ilganeli
Author

Hi @JoshRosen - just checking in to make sure things are moving on #3638, since it's a blocker for this patch. Please let me know how that's going; it looks to be almost complete. Thanks!

@SparkQA

SparkQA commented Jan 12, 2015

Test build #25420 has started for PR 3518 at commit a32f0ac.

  • This patch merges cleanly.

@ilganeli
Author

Hi @JoshRosen, #3638 has now been merged and I've resolved the minor merge conflicts and pushed the updates. If you could please review this at your convenience, I'd love to have it merged in as well. Thanks!

@SparkQA

SparkQA commented Jan 12, 2015

Test build #25420 has finished for PR 3518 at commit a32f0ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class RDDTrace (rdd : RDD[_], depth : Int, result : SerializationHelper.SerializedRef)

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25420/

@@ -827,9 +868,21 @@ class DAGScheduler(
// might modify state of objects referenced in their closures. This is necessary in Hadoop
// where the JobConf/Configuration object is not thread-safe.
var taskBinary: Broadcast[Array[Byte]] = null

// Check if RDD serialization debugging is enabled
val debugSerialization: Boolean = sc.getConf.getBoolean("spark.serializer.debug", false)
Contributor

Rather than having a config option for this, why not just always run this debugging output after we've seen a serialization failure? The performance overhead won't matter much if we do it after a failure only.

Contributor

Ah, I see - this does that already. So I'd just remove the config option and always print the debugging output when serialization fails in the TaskSetManager. We usually try not to add config options unless there is a really compelling reason not to have the feature enabled.
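The pattern being suggested here - no config flag, with the trace produced only on the failure path - can be sketched as follows. `DebugOnFailure`, `serializeWithDebug`, and the parameters are hypothetical names for illustration, not the actual TaskSetManager code:

```scala
import java.io.NotSerializableException

// Hypothetical sketch of the suggested pattern: instead of gating the
// debug trace behind a config option, only compute and print the
// (potentially expensive) trace after serialization has already failed.
object DebugOnFailure {
  def serializeWithDebug[T](serialize: T => Array[Byte],
                            task: T,
                            debugTrace: T => String): Array[Byte] =
    try {
      serialize(task)
    } catch {
      case e: NotSerializableException =>
        // The cost of building the trace is only paid on failure,
        // so the happy path carries no overhead.
        System.err.println(debugTrace(task))
        throw e
    }
}
```

Since the trace runs only after an exception, there is no steady-state performance cost and hence no need for an opt-in flag.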

@pwendell
Contributor

Hey, just took a quick pass with some code style suggestions (more coming) and usability suggestions. One thing: would it be possible to track the names of the fields you are traversing? This would make the debugging output more useful. Also, is there a good reason to print the hash code? How would users use that?

@ilganeli
Author

Hi Patrick - thanks for the feedback. I would love to print out the names of the fields but I wasn't able to figure out a way to do that - could you suggest how?

I wasn't sure whether printing the hash code was useful; Josh included it in his original example of a traversal, so I figured I'd leave it in. I didn't know whether there would be a way to look it up after the fact.

@SparkQA

SparkQA commented Jan 16, 2015

Test build #25620 has started for PR 3518 at commit 1d2d563.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 16, 2015

Test build #25620 has finished for PR 3518 at commit 1d2d563.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25620/

@SparkQA

SparkQA commented Jan 16, 2015

Test build #25624 has started for PR 3518 at commit 5b93dc1.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 16, 2015

Test build #25624 has finished for PR 3518 at commit 5b93dc1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25624/

@pwendell
Contributor

Hey @ilganeli - I took a slightly deeper look this time. I still don't totally follow how this all hooks together, but I wonder if it's possible to write a single utility function that is much simpler. It would just do the following:

/**
 * Given an object reference, recursively traverses all fields of the reference,
 * fields of objects within those fields, and so on. If any of those references
 * are neither Serializable nor Externalizable, prints the path from the root object
 * to the reference.
 */
def findNonSerializableReferences(root: AnyRef): String = {
  // ...
}

And it would do something like:

  1. Start with the root reference.
  2. Given a reference, check if all of the fields are themselves serializable (meaning for their class c, Serializable.class.isAssignableFrom(c) or Externalizable.class.isAssignableFrom(c)).
  3. Traverse the graph of all referred-to objects, maintaining path information. Path information means both the sequence of parent pointers and the field name.

This wouldn't work for custom serializers; it would only work for the Java serializer. However, that's all we support for closures anyway. You can get the name and type of each field using reflection. There doesn't need to be any specific handling of the fact that the objects you are inspecting are RDDs. The ideal output is something like this - it would print the path along with the field name, type, and maybe the toString of each object encountered.

Found non-serializable class com.user.Foo referenced from root object:

root: com.user.SomeUserRDD
| deps: Seq 
 | XXX (not sure what the internal pointers will look like here in a Seq)
  | el0: com.user.SomeOtherRDD
    | foo: com.user.Foo (NOT SERIALIZABLE)

In this example, an RDD in the lineage has a field called foo of a non-serializable type.
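A minimal sketch of the traversal described above, assuming plain reflection and skipping synthetic fields such as Scala's `$outer` pointers. `RefWalker` is an illustrative name, not Spark code, and this sketch handles neither array elements nor custom `writeObject` logic:

```scala
import java.io.Externalizable
import java.lang.reflect.Modifier

// Illustrative sketch, not Spark code: walk an object graph via reflection,
// carrying the path of field names from the root, and collect every
// reference whose class is neither Serializable nor Externalizable.
object RefWalker {
  def findNonSerializableReferences(root: AnyRef): Seq[String] = {
    val seen = new java.util.IdentityHashMap[AnyRef, AnyRef]()
    val found = scala.collection.mutable.Buffer[String]()

    def serializable(c: Class[_]): Boolean =
      classOf[java.io.Serializable].isAssignableFrom(c) ||
        classOf[Externalizable].isAssignableFrom(c)

    def walk(obj: AnyRef, path: String): Unit = {
      if (obj == null || seen.containsKey(obj)) return
      seen.put(obj, obj)
      val c = obj.getClass
      if (!serializable(c)) found += s"$path: ${c.getName} (NOT SERIALIZABLE)"
      // Do not reflect into JDK/Scala internals; their fields may be
      // inaccessible and are not what the user needs to see.
      if (c.getName.startsWith("java.") || c.getName.startsWith("scala.")) return
      var cls: Class[_] = c
      while (cls != null) {
        for (f <- cls.getDeclaredFields
             if !Modifier.isStatic(f.getModifiers)
             if !f.isSynthetic                 // skips compiler-generated fields
             if !f.getName.startsWith("$")     // e.g. Scala's $outer pointer
             if !f.getType.isPrimitive) {
          f.setAccessible(true)
          walk(f.get(obj), s"$path.${f.getName}")
        }
        cls = cls.getSuperclass
      }
    }

    walk(root, "root")
    found.toSeq
  }
}
```

Each entry in the result is a dot-separated field path from the root, which is exactly the "path along with field name" output format sketched in the example above.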

@pwendell
Contributor

Hey so it looks like while I was reviewing this patch @rxin actually ran into this and just wrote a fix himself (#4093). That fix is actually even simpler than what I was proposing and almost strictly better (the only downside is you don't get field names). So I'm guessing we'll go with that one, but thanks for taking a whack at this.

@pwendell
Contributor

BTW - my apologies for marking this as a starter task, it turned out to be more complicated. We can credit you for having worked on the feature as well.

@ilganeli
Author

Hey @pwendell - not a problem. The solutions are similar but Reynold's has fewer moving parts. I appreciate the recognition. Thanks!

@ilganeli closed this Jan 24, 2015