
[SPARK-3694] RDD and Task serialization debugging output #3518

Closed
wants to merge 25 commits

Conversation

ilganeli

Hi all - in addition to what was explicitly requested in the original JIRA, I also added the ability to trace the serialization of RDDs so that you can see which specific dependency is unserializable. For debugging task serialization, I added a debug log output that shows the file and jar dependencies, but I am unsure whether I can add more functionality there. For an RDD, it is possible to attempt to serialize each dependency in turn, which is why I can identify which component fails; for task debugging, I did not see a straightforward way to do the same thing. If anyone can suggest an approach here, I would be happy to implement it.
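The dependency-by-dependency approach described above can be sketched with plain Java serialization. This is an illustrative sketch, not the patch's actual `SerializationHelper`; `SerializationProbe` and `trySerialize` are made-up names:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import scala.util.control.NonFatal

// Illustrative sketch (not the patch's SerializationHelper): attempt to
// serialize a single reference with plain Java serialization, reporting
// either the serialized size or the exception that caused the failure.
object SerializationProbe {
  def trySerialize(ref: AnyRef): Either[Throwable, Int] =
    try {
      val bytes = new ByteArrayOutputStream()
      val out = new ObjectOutputStream(bytes)
      out.writeObject(ref)
      out.close()
      Right(bytes.size())
    } catch {
      case NonFatal(e) => Left(e)
    }
}
```

Applying something like `trySerialize` to each entry of an RDD's `dependencies` sequence in turn is what lets a trace point at the specific dependency that fails, rather than only reporting that the whole closure is unserializable.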

@AmplabJenkins

Can one of the admins verify this patch?

@JoshRosen
Contributor

I'm going to let Jenkins test this, but my hunch is that the first run is going to fail due to Scalastyle warnings / errors. I'll comment on a couple of these style points inline.

Jenkins, this is ok to test.

* @return - An output string qualifying success or failure.
*/
private def isSerializable(rdd: RDD[_]): String = {
SerializationHelper.isSerializable(closureSerializer,rdd)
Contributor

Put spaces between arguments: closureSerializer, rdd (see https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide)

@SparkQA

SparkQA commented Dec 18, 2014

Test build #24598 has finished for PR 3518 at commit 8e5f710.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class RDDTrace (rdd : RDD[_], depth : Int, result : SerializationHelper.SerializedRef)

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24598/

@ilganeli
Author

Hi @JoshRosen - just checking in to make sure things are moving on #3638, since it's a blocker for this patch. Please let me know how that's going; it looks to be almost complete. Thanks!

@SparkQA

SparkQA commented Jan 12, 2015

Test build #25420 has started for PR 3518 at commit a32f0ac.

  • This patch merges cleanly.

@ilganeli
Author

Hi @JoshRosen, #3638 has now been merged and I've resolved the minor merge conflicts and pushed the updates. If you could please review this at your convenience, I'd love to have it merged in as well. Thanks!

@SparkQA

SparkQA commented Jan 12, 2015

Test build #25420 has finished for PR 3518 at commit a32f0ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class RDDTrace (rdd : RDD[_], depth : Int, result : SerializationHelper.SerializedRef)

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25420/

@@ -827,9 +868,21 @@ class DAGScheduler(
// might modify state of objects referenced in their closures. This is necessary in Hadoop
// where the JobConf/Configuration object is not thread-safe.
var taskBinary: Broadcast[Array[Byte]] = null

// Check if RDD serialization debugging is enabled
val debugSerialization: Boolean = sc.getConf.getBoolean("spark.serializer.debug", false)
Contributor

Rather than having a config option for this, why not just always run this debugging output after we've seen a serialization failure? The performance overhead won't matter much if we do it after a failure only.

Contributor

Ah, I see - this does that already. So I'd just remove the config option and always print the debugging output when serialization fails in the TaskSetManager. We usually try not to add config options unless there is a really compelling reason not to have the feature enabled.
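The pattern being suggested here - no config flag, with the trace produced only on the failure path - can be sketched as follows. `DebugOnFailure`, `serializeWithDebug`, and the parameters are hypothetical names for illustration, not the actual TaskSetManager code:

```scala
import java.io.NotSerializableException

// Hypothetical sketch of the suggested pattern: instead of gating the
// debug trace behind a config option, only compute and print the
// (potentially expensive) trace after serialization has already failed.
object DebugOnFailure {
  def serializeWithDebug[T](serialize: T => Array[Byte],
                            task: T,
                            debugTrace: T => String): Array[Byte] =
    try {
      serialize(task)
    } catch {
      case e: NotSerializableException =>
        // The cost of building the trace is only paid on failure,
        // so the happy path carries no overhead.
        System.err.println(debugTrace(task))
        throw e
    }
}
```

Since the trace runs only after an exception, there is no steady-state performance cost and hence no need for an opt-in flag.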

@pwendell
Contributor

Hey, just took a quick pass with some code style suggestions (more coming) and usability suggestions. One thing: would it be possible to track the names of the fields you are traversing? This would make the debugging output more useful. Also, is there a good reason to print the hash code? How would users use that?

@ilganeli
Author

Hi Patrick - thanks for the feedback. I would love to print out the names of the fields but I wasn't able to figure out a way to do that - could you suggest how?

I wasn't sure whether printing the hash code was useful; Josh included it in his original example of a traversal, so I figured I'd leave it in. I didn't know whether there would be a way to look it up after the fact.

@SparkQA

SparkQA commented Jan 16, 2015

Test build #25620 has started for PR 3518 at commit 1d2d563.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 16, 2015

Test build #25620 has finished for PR 3518 at commit 1d2d563.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25620/

@SparkQA

SparkQA commented Jan 16, 2015

Test build #25624 has started for PR 3518 at commit 5b93dc1.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 16, 2015

Test build #25624 has finished for PR 3518 at commit 5b93dc1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25624/

@pwendell
Contributor

Hey @ilganeli - I took a slightly deeper look this time. I still don't totally follow how this all hooks together, but I wonder if it's possible to write a single utility function that is much simpler. It would just do the following:

/**
 * Given an object reference, recursively traverses all fields of the reference,
 * fields of objects within those fields, and so on. If any of those references
 * are neither Serializable nor Externalizable, prints the path from the root object
 * to the reference.
 */
def findNonSerializableReferences(root: AnyRef): String = {
  // ...
}

And it would do something like:

  1. Start with the root reference.
  2. Given a reference, check if all of the fields are themselves serializable (meaning for their class c, Serializable.class.isAssignableFrom(c) or Externalizable.class.isAssignableFrom(c)).
  3. Traverse the graph of all referred-to objects, maintaining path information. Path information means both the sequence of parent pointers and the field name.

This wouldn't work for custom serializers; it would only work for the Java serializer. However, that's all we support for closures anyway. You can get the name and type of each field using reflection. There doesn't need to be any specific handling of the fact that the objects you are inspecting are RDDs. The ideal output is something like this - it would print the path along with the field name, type, and maybe the toString of each object encountered.

Found non-serializable class com.user.Foo referenced from root object:

root: com.user.SomeUserRDD
| deps: Seq 
 | XXX (not sure what the internal pointers will look like here in a Seq)
  | el0: com.user.SomeOtherRDD
    | foo: com.user.Foo (NOT SERIALIZABLE)

In this example, an RDD in the lineage has a field called foo of a non-serializable type.
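A minimal sketch of the traversal described above, assuming plain reflection and skipping synthetic fields such as Scala's `$outer` pointers. `RefWalker` is an illustrative name, not Spark code, and this sketch handles neither array elements nor custom `writeObject` logic:

```scala
import java.io.Externalizable
import java.lang.reflect.Modifier

// Illustrative sketch, not Spark code: walk an object graph via reflection,
// carrying the path of field names from the root, and collect every
// reference whose class is neither Serializable nor Externalizable.
object RefWalker {
  def findNonSerializableReferences(root: AnyRef): Seq[String] = {
    val seen = new java.util.IdentityHashMap[AnyRef, AnyRef]()
    val found = scala.collection.mutable.Buffer[String]()

    def serializable(c: Class[_]): Boolean =
      classOf[java.io.Serializable].isAssignableFrom(c) ||
        classOf[Externalizable].isAssignableFrom(c)

    def walk(obj: AnyRef, path: String): Unit = {
      if (obj == null || seen.containsKey(obj)) return
      seen.put(obj, obj)
      val c = obj.getClass
      if (!serializable(c)) found += s"$path: ${c.getName} (NOT SERIALIZABLE)"
      // Do not reflect into JDK/Scala internals; their fields may be
      // inaccessible and are not what the user needs to see.
      if (c.getName.startsWith("java.") || c.getName.startsWith("scala.")) return
      var cls: Class[_] = c
      while (cls != null) {
        for (f <- cls.getDeclaredFields
             if !Modifier.isStatic(f.getModifiers)
             if !f.isSynthetic                 // skips compiler-generated fields
             if !f.getName.startsWith("$")     // e.g. Scala's $outer pointer
             if !f.getType.isPrimitive) {
          f.setAccessible(true)
          walk(f.get(obj), s"$path.${f.getName}")
        }
        cls = cls.getSuperclass
      }
    }

    walk(root, "root")
    found.toSeq
  }
}
```

Each entry in the result is a dot-separated field path from the root, which is exactly the "path along with field name" output format sketched in the example above.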

@pwendell
Contributor

Hey so it looks like while I was reviewing this patch @rxin actually ran into this and just wrote a fix himself (#4093). That fix is actually even simpler than what I was proposing and almost strictly better (the only downside is you don't get field names). So I'm guessing we'll go with that one, but thanks for taking a whack at this.

@pwendell
Contributor

BTW - my apologies for marking this as a starter task, it turned out to be more complicated. We can credit you for having worked on the feature as well.

@ilganeli
Author

Hey @pwendell - not a problem. The solutions are similar but Reynold's has fewer moving parts. I appreciate the recognition. Thanks!

@ilganeli closed this Jan 24, 2015