
SPARK-6414: Spark driver failed with NPE on job cancelation #5124

Closed

Conversation

@hunglin commented Mar 22, 2015

Use Option for ActiveJob.properties to avoid NPE bug

@AmplabJenkins

Can one of the admins verify this patch?

@@ -805,7 +806,7 @@ class DAGScheduler(
     }

     val properties = if (jobIdToActiveJob.contains(jobId)) {
-      jobIdToActiveJob(stage.jobId).properties
+      jobIdToActiveJob(stage.jobId).properties.orNull
@hunglin (Author):

@pwendell, is there a good reason that SparkListenerStageSubmitted and SparkListenerJobStart (both @DeveloperApi) default to the following?

properties: Properties = null

If there isn't, I would like to update these to Option as well.

Contributor:

I don't know if there's a good reason for this, but I don't think we can change it at this point without breaking binary compatibility. We could use annotations / comments to make those fields' nullability more apparent, though.

@hunglin (Author):

Got it, thanks for the answer.

@JoshRosen (Contributor):

Jenkins, this is ok to test.

@SparkQA commented Mar 23, 2015

Test build #29009 has started for PR 5124 at commit 687434c.

  • This patch merges cleanly.

@JoshRosen (Contributor):

It looks like this NPE bug has been around for a while, but it seems pretty hard to hit (which is probably why it hasn't been reported before). I think that we should be able to trigger / reproduce this by creating a new SparkContext, ensuring that the thread-local properties are null, launching a long-running job, then attempting to cancel all jobs in some non-existent job group. Can we add a regression test for this? Shouldn't be too hard if my hunch is right.
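
A minimal sketch of such a reproduction (a standalone illustration, not the patch's actual test; the object name, group id, and timeouts are assumptions, and it assumes local mode and the 1.x async RDD actions):

```scala
import java.util.concurrent.TimeUnit

import scala.concurrent.Await
import scala.concurrent.duration.Duration

import org.apache.spark.{SparkConf, SparkContext}

object Spark6414Repro {
  def main(args: Array[String]): Unit = {
    // Fresh context; no thread-local properties have ever been set on this thread.
    val sc = new SparkContext(new SparkConf().setAppName("SPARK-6414").setMaster("local"))
    try {
      // Kick off a long-running job asynchronously so it is still active below.
      val future = sc.parallelize(1 to 100, 2).foreachAsync(_ => Thread.sleep(10L))
      // Cancelling a job group that was never created should be a no-op;
      // before the fix this could NPE in the DAGScheduler and kill the driver.
      sc.cancelJobGroup("nonExistGroupId")
      // The original job should still finish normally.
      Await.ready(future, Duration(10, TimeUnit.SECONDS))
    } finally {
      sc.stop()
    }
  }
}
```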

It looks like we don't directly expose the Properties object to users, so if we wanted to we could go even further and convert all of the upstream nullable Properties into Option[Properties] as well. If you look at the call chain leading to this use of properties, it looks like it can only be null if no local properties have ever been set on the job-submitting thread, its parent thread, or any of its other ancestor threads. Therefore, maybe we can eliminate the whole null / Option business entirely by ensuring that the thread-local has an initialValue, instead of having it be null in some circumstances and not others.

Therefore, here's my suggestion:

  • Add a regression test and confirm that it reproduces the original bug.
  • Override SparkContext.localProperties.initialValue to return a new, empty Properties object (since this is how we lazily initialize the properties in the existing code). Update the other parts of SparkContext that set this thread-local to account for the change.
  • Add a few assert(properties != null) checks so that we catch errors up-front. I'd add these checks at the entry points of the DAGScheduler, e.g. the private[spark] submitJob methods that are called from SparkContext.

Your patch looks good overall, but if we can I think we should just fix the underlying messiness.
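
For reference, a minimal standalone sketch of the eager-initialization idea (field names and visibility here are illustrative, not the actual SparkContext code):

```scala
import java.util.Properties

object EagerLocalProperties {
  // Eagerly give every thread (and child threads, via inheritance) an empty
  // Properties object, so get() can never return null.
  val localProperties = new InheritableThreadLocal[Properties] {
    override protected def initialValue(): Properties = new Properties()
  }

  def main(args: Array[String]): Unit = {
    val props = localProperties.get()
    // The kind of up-front check suggested for the DAGScheduler entry points:
    assert(props != null, "local properties should never be null")
    props.setProperty("spark.jobGroup.id", "my-group")
    println(localProperties.get())  // {spark.jobGroup.id=my-group}
  }
}
```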

@JoshRosen (Contributor):

(Sorry, that should have been SparkContext.localProperties.initialValue above; I've revised my comment)

@SparkQA commented Mar 23, 2015

Test build #29009 has finished for PR 5124 at commit 687434c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29009/

@SparkQA commented Mar 27, 2015

Test build #29266 has started for PR 5124 at commit baea4fd.

  • This patch merges cleanly.

@hunglin (Author) commented Mar 27, 2015

@JoshRosen, thanks for the help. I've updated this PR to:

  1. add a regression test
  2. use Option[Properties] for the value of SparkContext.localProperties
  3. add an initialValue to SparkContext.localProperties

@SparkQA commented Mar 27, 2015

Test build #29266 has finished for PR 5124 at commit baea4fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29266/

-    if (localProperties.get() == null) {
-      localProperties.set(new Properties())
+    if (localProperties.get().isEmpty) {
+      localProperties.set(Some(new Properties()))
Contributor:

Instead of having localProperties be an Option[Properties], why not just have override def initialValue(): Properties = new Properties()?

Contributor:

Basically, this behavior of having localProperties be null or None until we set at least one property is confusing to me; I think it's simpler to just eagerly perform this initialization before we set any properties.

@hunglin (Author):

@JoshRosen the idea is to use Option so the compiler can help us prevent NullPointerExceptions, and by eagerly setting localProperties to None, localProperties won't be null. As a contributor, when I see that the type is Option[Properties], I know I'm safe from NPEs; otherwise, I might not be aware that the thread-local is eagerly initialized to an empty Properties somewhere in the code, and I'd end up putting if (localProperties.get != null) checks everywhere. If we can change the DeveloperApi, I would like to change all the properties = null defaults in SparkListenerStageSubmitted and SparkListenerJobStart to Option as well.

That's just my 2 cents, I'm more than happy to discuss and listen to your thoughts.

Contributor:

Option helps you avoid NPEs by convention; it's not guaranteed to prevent them, since you can still do things like this:

scala> var x: Option[Int] = null
x: Option[Int] = null

scala> x
res0: Option[Int] = null

scala> x.get
java.lang.NullPointerException
  ... 33 elided

scala> x match { case Some(_) => (); case None => () }
scala.MatchError: null
  ... 33 elided

There's actually a small risk of this issue occurring if you're not methodical when refactoring existing null-passing code to pass Options. If you have something like def foo(x: String) and you refactor it to def foo(x: Option[String]), you'll still run into trouble if old code called foo(null), since that code will still compile after the refactoring. This means that you have to search for all callers of foo() when performing the refactoring, rather than just changing the type and fixing whatever compiler errors occur. A similar problem can occur if you call Some(x) where x == null, which can also happen when refactoring large legacy codebases, since it leaves you with a Some(null).
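
To make that pitfall concrete, a small hypothetical sketch (every line compiles; the comments note what happens at runtime):

```scala
// Before refactoring: a null-tolerant signature.
def greet(name: String): String =
  if (name == null) "hello, stranger" else s"hello, $name"

// After a naive refactoring to Option, stale call sites still compile:
def greetOpt(name: Option[String]): String = name match {
  case Some(n) => s"hello, $n"
  case None    => "hello, stranger"
}

greetOpt(Some("Ann"))  // ok: "hello, Ann"
greetOpt(null)         // compiles, but throws scala.MatchError at runtime
greetOpt(Some(null))   // compiles, but yields "hello, null" -- a smuggled null
```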

Contributor:

To address the actual issue at hand, though, I'm suggesting that it's clearer to change where we perform our initialization so that localProperties.get() always returns a valid Properties object, even if it's an empty properties object. This will prevent localProperties.get() from returning null, so we can then begin to look at all of the code which checks whether (localProperties.get() == null) and refactor that to assume that localProperties is non-null, and so on, until we've removed all of the nullability and options from this code path.
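
In other words, the goal is a before/after like this hypothetical sketch:

```scala
import java.util.Properties

object NullCheckBeforeAfter {
  // Without an initialValue, get() returns null until someone calls set(),
  // so every use site needs a defensive branch:
  val legacy = new InheritableThreadLocal[Properties]

  // With an initialValue, get() is always non-null (possibly empty),
  // and the null checks can be deleted everywhere downstream:
  val eager = new InheritableThreadLocal[Properties] {
    override protected def initialValue(): Properties = new Properties()
  }

  def main(args: Array[String]): Unit = {
    // Before: the defensive pattern this refactoring removes.
    if (legacy.get() == null) {
      legacy.set(new Properties())
    }
    legacy.get().setProperty("k", "v")

    // After: straight-line code, no null check needed.
    eager.get().setProperty("k", "v")
    println(legacy.get())
    println(eager.get())
  }
}
```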

If you look at DAGScheduler, I think that properties flow into it via handleJobSubmitted. The properties that flow to this location come from either SparkContext.runJob or SparkContext.runApproximateJob, both of which pass localProperties.get: https://github.com/hunglin/spark/blob/baea4fd9f6df0af466e51a8f19380f194ec502ae/core/src/main/scala/org/apache/spark/SparkContext.scala#L1493

If you continue to apply this sort of reasoning to all of the places where these properties flow, I think that we'll find that the properties won't be null unless they're null in the SparkContext run*Job methods, which will be prevented by overriding initialValue to return an empty properties object.

Contributor:

> If we can change DeveloperApi, I would like to change all properties = null in SparkListenerStageSubmitted and SparkListenerJobStart to Option.

A nice advantage of returning an empty properties object rather than using Option is that we don't need to break binary compatibility in these events. It might be nice to add a note to these events' docstrings to clarify that the field used to be nullable in earlier versions, since this issue might bite users who write code against the newest version of Spark's listener API and get NPEs when they run on older versions.

One gotcha is that the UI's JsonProtocol will continue to deserialize old event streams such that these events can still contain null properties fields. It wouldn't be too hard to modify propertiesFromJson and propertiesToJson to also return empty Properties objects instead of nulls.
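
A hedged sketch of what the deserialization side could look like (assumes Json4s, which Spark's JsonProtocol uses; the exact field handling here is illustrative, not the real method body):

```scala
import java.util.Properties

import org.json4s._

// Return an empty Properties rather than null when an old event stream
// omits or nulls the field, so listener code never sees a null.
def propertiesFromJson(json: JValue): Properties = {
  val properties = new Properties()
  json match {
    case JObject(fields) =>
      fields.foreach {
        case (k, JString(v)) => properties.setProperty(k, v)
        case _               => // ignore non-string values
      }
    case _ => // JNothing / JNull / anything else: leave the object empty
  }
  properties
}
```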

Contributor:

By the way, I think this change would significantly shrink the size of this patch, too, since none of the internal methods' type signatures would need to change.

@hunglin (Author):

I see, I'll update the PR.

@SparkQA commented Mar 28, 2015

Test build #29334 has started for PR 5124 at commit 6cfe888.

  • This patch merges cleanly.

@hunglin (Author) commented Mar 28, 2015

@JoshRosen I've updated the PR; please review. Thanks.

@SparkQA commented Mar 28, 2015

Test build #29335 has started for PR 5124 at commit aef7ac6.

  • This patch merges cleanly.

@SparkQA commented Mar 28, 2015

Test build #29334 has finished for PR 5124 at commit 6cfe888.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29334/

@SparkQA commented Mar 28, 2015

Test build #29335 has finished for PR 5124 at commit aef7ac6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29335/

Await.ready(future, Duration(2, TimeUnit.SECONDS))

/**
 * In SPARK-6414, sc.cancelJobGroup will cause NullPointerException and cause
Contributor:

This should use the // single-line comment style, not Scaladoc style.
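
i.e. the same comment in single-line style (text copied from the snippet above, truncation preserved):

```scala
// In SPARK-6414, sc.cancelJobGroup will cause NullPointerException and cause
```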

@hunglin (Author):

Thanks for the reminder; I've updated the comment style.

@JoshRosen (Contributor):

Aside from two minor comments that I just left, this looks good to me.

@SparkQA commented Mar 31, 2015

Test build #29439 has started for PR 5124 at commit 3d64166.

@SparkQA commented Mar 31, 2015

Test build #29439 has finished for PR 5124 at commit 3d64166.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29439/

@hunglin (Author) commented Apr 2, 2015

Hi @JoshRosen, do you think this PR is ready to be merged? Please let me know if you have more comments. I'm happy to address them.

@JoshRosen (Contributor):

LGTM; looks like there's a merge conflict, but I'll fix it myself.

@JoshRosen (Contributor):

Looks like the Apache Git server is down / inaccessible, so I'll have to commit this in a little bit.

@hunglin (Author) commented Apr 2, 2015

@JoshRosen, thanks for the help. I just rebased this PR against master. It should also fix the conflict.

@SparkQA commented Apr 2, 2015

Test build #29620 has started for PR 5124 at commit 2290b6b.

@SparkQA commented Apr 2, 2015

Test build #29620 has finished for PR 5124 at commit 2290b6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29620/

@asfgit closed this in e3202aa on Apr 2, 2015
asfgit pushed a commit that referenced this pull request Apr 2, 2015
Use Option for ActiveJob.properties to avoid NPE bug

Author: Hung Lin <hung.lin@gmail.com>

Closes #5124 from hunglin/SPARK-6414 and squashes the following commits:

2290b6b [Hung Lin] [SPARK-6414][core] Fix NPE in SparkContext.cancelJobGroup()

(cherry picked from commit e3202aa)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>

Conflicts:
	core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
asfgit pushed a commit that referenced this pull request Apr 2, 2015
Use Option for ActiveJob.properties to avoid NPE bug

Author: Hung Lin <hung.lin@gmail.com>

Closes #5124 from hunglin/SPARK-6414 and squashes the following commits:

2290b6b [Hung Lin] [SPARK-6414][core] Fix NPE in SparkContext.cancelJobGroup()

(cherry picked from commit e3202aa)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>

Conflicts:
	core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala

Conflicts:
	core/src/test/scala/org/apache/spark/SparkContextSuite.scala
@JoshRosen (Contributor):

I've merged this into master (1.4.0), branch-1.3 (1.3.1), and branch-1.2 (1.2.2). Thanks!

@hunglin (Author) commented Apr 3, 2015

@JoshRosen, no problem. I'm happy to contribute.

markhamstra pushed a commit to markhamstra/spark that referenced this pull request Apr 15, 2015
Use Option for ActiveJob.properties to avoid NPE bug

Author: Hung Lin <hung.lin@gmail.com>

Closes apache#5124 from hunglin/SPARK-6414 and squashes the following commits:

2290b6b [Hung Lin] [SPARK-6414][core] Fix NPE in SparkContext.cancelJobGroup()

(cherry picked from commit e3202aa)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>

Conflicts:
	core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala

Conflicts:
	core/src/test/scala/org/apache/spark/SparkContextSuite.scala