SPARK-6414: Spark driver failed with NPE on job cancelation #5124
Conversation
Can one of the admins verify this patch?
@@ -805,7 +806,7 @@ class DAGScheduler(
     }
     val properties = if (jobIdToActiveJob.contains(jobId)) {
-      jobIdToActiveJob(stage.jobId).properties
+      jobIdToActiveJob(stage.jobId).properties.orNull
@pwendell, is there a good reason that SparkListenerStageSubmitted and SparkListenerJobStart (both @DeveloperApi) are using properties: Properties = null? Otherwise, I would like to update these to Option as well.
I don't know if there's a good reason for this, but I don't think we can change it at this point without breaking binary compatibility. We could use annotations / comments to make those fields' nullability more apparent, though.
Got it, thanks for the answer.
Jenkins, this is ok to test.
Test build #29009 has started for PR 5124 at commit
It looks like this NPE bug has been around for a while, but it seems pretty hard to hit (which is probably why it hasn't been reported before). I think we should be able to trigger / reproduce this by creating a new SparkContext, ensuring that the thread-local properties are null, launching a long-running job, then attempting to cancel all jobs in some non-existent job group. Can we add a regression test for this? It shouldn't be too hard if my hunch is right.

It looks like we don't directly expose the Properties object to users, so if we wanted to we could go even further and convert all of the upstream nullable fields as well. Therefore, here's my suggestion:
Your patch looks good overall, but I think we should just fix the underlying messiness if we can.
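The regression scenario described above might look roughly like the following sketch. This is an assumption-laden reproduction, not the test that was actually merged: local-mode setup, the group id, and the timeout are all made up for illustration, and only public SparkContext methods are used.

```scala
import java.util.concurrent.TimeUnit

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

import org.apache.spark.{SparkConf, SparkContext}

object Spark6414Repro {
  def main(args: Array[String]): Unit = {
    // Fresh context: no thread-local properties have been set yet,
    // which is the state that used to trigger the NPE.
    val sc = new SparkContext(
      new SparkConf().setAppName("SPARK-6414-repro").setMaster("local[2]"))
    try {
      // A long-running job launched in the background.
      val job = Future {
        sc.parallelize(1 to 10000).map { i => Thread.sleep(1); i }.count()
      }
      // Cancelling a job group that no job belongs to used to NPE the driver.
      sc.cancelJobGroup("nonExistGroupId")
      // With the fix, the unrelated job should still complete normally.
      Await.result(job, Duration(60, TimeUnit.SECONDS))
    } finally {
      sc.stop()
    }
  }
}
```

Running this against a pre-fix build should surface the NullPointerException in the driver; with the fix, the count completes and the cancellation is a no-op.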
(Sorry, that should have been
Test build #29009 has finished for PR 5124 at commit
Test PASSed.
Test build #29266 has started for PR 5124 at commit
@JoshRosen, thanks for the help. I updated this PR with
Test build #29266 has finished for PR 5124 at commit
Test PASSed.
-    if (localProperties.get() == null) {
-      localProperties.set(new Properties())
+    if (localProperties.get().isEmpty) {
+      localProperties.set(Some(new Properties()))
Instead of having localProperties be an Option[Properties], why not just have override def initialValue(): Properties = new Properties()?
Basically, this behavior of having localProperties be null or None until we set at least one property is confusing to me; I think it's simpler to just eagerly perform this initialization before we set any properties.
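To make that suggestion concrete, here is a minimal, self-contained sketch of the pattern using a plain ThreadLocal (no Spark types; the property key is just an example, not a claim about Spark internals):

```scala
import java.util.Properties

object EagerInitSketch {
  // Override initialValue so get() never returns null on any thread:
  // the first get() on a thread materializes an empty Properties object.
  private val localProperties = new ThreadLocal[Properties] {
    override def initialValue(): Properties = new Properties()
  }

  def main(args: Array[String]): Unit = {
    // No set() has happened on this thread, yet get() is already non-null.
    val props = localProperties.get()
    assert(props != null)
    assert(props.isEmpty)

    props.setProperty("spark.jobGroup.id", "group-1")
    println(localProperties.get().getProperty("spark.jobGroup.id")) // prints group-1
  }
}
```

With this shape, callers never need a null check: an "unset" state is simply an empty Properties object.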
@JoshRosen the idea is to use Option so the compiler can help us prevent NullPointerException, and by eagerly setting localProperties to None, localProperties won't be null. As a contributor, when I see that the type is Option[Properties], I know I'm safe from NPE; I might not be aware of the eager initialization to an empty Properties somewhere in the code, and would end up putting if (localProperties.get != null) checks everywhere. If we can change the DeveloperApi, I would like to change all the properties = null defaults in SparkListenerStageSubmitted and SparkListenerJobStart to Option.
That's just my 2 cents, I'm more than happy to discuss and listen to your thoughts.
Option helps you avoid NPEs by convention; it's not guaranteed to prevent NPE, since you can still do things like this:
scala> var x: Option[Int] = null
x: Option[Int] = null
scala> x
res0: Option[Int] = null
scala> x.get
java.lang.NullPointerException
... 33 elided
scala> x match { case Some(_) => (); case None => () }
scala.MatchError: null
... 33 elided
There's actually a small risk of this issue occurring when you're not methodical in refactoring existing null-passing code to pass options. If you have something like def foo(x: String) and you refactor it to def foo(x: Option[String]), you'll still run into trouble if you had old code that called foo(null), since that code will still compile correctly after the refactoring. This means that you have to search for all callers of foo() when performing the refactoring, rather than just changing the type and fixing whatever compiler errors occur. A similar problem can occur if you call Some(x) where x == null, which can also happen when refactoring large legacy codebases, since this causes you to wind up with a Some(null).
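Both pitfalls can be demonstrated in a self-contained snippet (foo and the variable names here are hypothetical, mirroring the example above):

```scala
object OptionPitfalls {
  // Refactored from def foo(x: String): old call sites that passed null
  // still compile, because null is a valid value for any reference type,
  // including Option[String].
  def foo(x: Option[String]): Int = x.map(_.length).getOrElse(0)

  def main(args: Array[String]): Unit = {
    // Legacy caller: compiles fine, blows up at runtime.
    try {
      foo(null)
      assert(false, "expected an NPE")
    } catch {
      case _: NullPointerException => () // the compiler did not save us
    }

    // Some(x) where x == null yields Some(null), not None:
    val legacy: String = null
    assert(Some(legacy).isDefined)  // Some(null) counts as "defined"
    assert(Option(legacy).isEmpty)  // Option(...) is the null-safe constructor
  }
}
```

The last two assertions show why Option(x), rather than Some(x), is the safe way to wrap a possibly-null value when migrating legacy code.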
To address the actual issue at hand, though, I'm suggesting that it's clearer to change where we perform our initialization so that localProperties.get() always returns a valid Properties object, even if it's an empty one. This will prevent localProperties.get() from returning null, so we can then begin to look at all of the code which checks whether (localProperties.get() == null) and refactor it to assume that localProperties is non-null, and so on, until we've removed all of the nullability and options from this code path.
If you look at DAGScheduler, I think that properties flow into it via handleJobSubmitted. The properties that flow to this location come from either SparkContext.runJob or SparkContext.runApproximateJob, both of which pass localProperties.get: https://github.com/hunglin/spark/blob/baea4fd9f6df0af466e51a8f19380f194ec502ae/core/src/main/scala/org/apache/spark/SparkContext.scala#L1493

If you continue to apply this sort of reasoning to all of the places where these properties flow, I think we'll find that the properties won't be null unless they're null in the SparkContext run*Job methods, which will be prevented by overriding initialValue to return an empty properties object.
If we can change DeveloperApi, I would like to change all properties = null in SparkListenerStageSubmitted and SparkListenerJobStart to Option.
A nice advantage of returning an empty properties object rather than using Option is that we don't need to break binary compatibility in these events. It might be nice to add a note to these events' docstrings to clarify that the field used to be nullable in earlier versions, since this issue might bite users who write code against the newest version of Spark's listener API and get NPEs when they run on older versions.

One gotcha is that the UI's JSONProtocol will continue to deserialize old event streams such that these events can still contain null properties fields. It wouldn't be too hard to modify propertiesFromJson and propertiesToJson to also return empty Properties objects instead of nulls.
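A sketch of the deserialization side of that idea follows. The helper here is hypothetical: Spark's real JsonProtocol operates on json4s values, which are replaced below by a plain Map stand-in so the snippet stays self-contained.

```scala
import java.util.Properties

object PropertiesJsonSketch {
  // Stand-in for propertiesFromJson: when the field is absent from an old
  // event stream, return an empty Properties object instead of null.
  def propertiesFromJson(field: Option[Map[String, String]]): Properties = {
    val props = new Properties()
    field.getOrElse(Map.empty).foreach { case (k, v) => props.setProperty(k, v) }
    props
  }

  def main(args: Array[String]): Unit = {
    // Old event with no properties field: callers no longer see null.
    assert(propertiesFromJson(None).isEmpty)

    val p = propertiesFromJson(Some(Map("spark.jobGroup.id" -> "g1")))
    assert(p.getProperty("spark.jobGroup.id") == "g1")
  }
}
```

The same default-to-empty convention on the serialization side would keep round-trips of old and new event streams consistent.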
By the way, I think this change would significantly shrink the size of this patch, too, since none of the internal methods' type signatures would need to change.
I see, I'll update the PR.
Test build #29334 has started for PR 5124 at commit
@JoshRosen I updated the PR, please review. Thanks.
Test build #29335 has started for PR 5124 at commit
Test build #29334 has finished for PR 5124 at commit
Test PASSed.
Test build #29335 has finished for PR 5124 at commit
Test PASSed.
Await.ready(future, Duration(2, TimeUnit.SECONDS))

/**
 * In SPARK-6414, sc.cancelJobGroup will cause NullPointerException and cause
This should use the // single-line comment style, not Scaladoc style.
Thanks for the reminder; I've updated the comment style.
Aside from two minor comments that I just left, this looks good to me.
Test build #29439 has started for PR 5124 at commit
Test build #29439 has finished for PR 5124 at commit
Test PASSed.
Hi @JoshRosen, do you think this PR is ready to be merged? Please let me know if you have more comments; I'm happy to address them.
LGTM; looks like there's a merge conflict, but I'll fix it myself.
Looks like the Apache Git server is down / inaccessible, so I'll have to commit this in a little bit.
Author: Hung Lin <hunglin@gmail.com>
@JoshRosen, thanks for the help. I just rebased this PR against master. It should also fix the conflict.
Test build #29620 has started for PR 5124 at commit
Test build #29620 has finished for PR 5124 at commit
Test PASSed.
Use Option for ActiveJob.properties to avoid NPE bug

Author: Hung Lin <hung.lin@gmail.com>

Closes #5124 from hunglin/SPARK-6414 and squashes the following commits:

2290b6b [Hung Lin] [SPARK-6414][core] Fix NPE in SparkContext.cancelJobGroup()

(cherry picked from commit e3202aa)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>

Conflicts:
    core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
    core/src/test/scala/org/apache/spark/SparkContextSuite.scala
I've merged this into
@JoshRosen, no problem. I'm happy to contribute.