SPARK-6414: Spark driver failed with NPE on job cancelation #5124
Conversation
Can one of the admins verify this patch?
@@ -805,7 +806,7 @@ class DAGScheduler(
     }
     val properties = if (jobIdToActiveJob.contains(jobId)) {
-      jobIdToActiveJob(stage.jobId).properties
+      jobIdToActiveJob(stage.jobId).properties.orNull
@pwendell, is there a good reason that SparkListenerStageSubmitted and SparkListenerJobStart (both @DeveloperApi) are using properties: Properties = null? Otherwise, I would like to update these to Option as well.
I don't know if there's a good reason for this, but I don't think we can change it at this point without breaking binary compatibility. We could use annotations / comments to make those fields' nullability more apparent, though.
Got it, thanks for the answer.
Jenkins, this is ok to test.
Test build #29009 has started for PR 5124 at commit
It looks like this NPE bug has been around for a while, but it seems pretty hard to hit (which is probably why it hasn't been reported before). I think we should be able to trigger / reproduce this by creating a new SparkContext, ensuring that the thread-local properties are null, launching a long-running job, then attempting to cancel all jobs in some non-existent job group. Can we add a regression test for this? It shouldn't be too hard if my hunch is right.

It looks like we don't directly expose the Properties object to users, so if we wanted to we could go even further and convert all of the upstream nullable fields as well. Therefore, here's my suggestion:
Your patch looks good overall, but I think we should just fix the underlying messiness if we can.
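The regression scenario described above might look roughly like the following sketch. This is an assumption-laden reproduction, not the test that was actually merged: local-mode setup, the group id, and the timeout are all made up for illustration, and only public SparkContext methods are used.

```scala
import java.util.concurrent.TimeUnit

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

import org.apache.spark.{SparkConf, SparkContext}

object Spark6414Repro {
  def main(args: Array[String]): Unit = {
    // Fresh context: no thread-local properties have been set yet,
    // which is the state that used to trigger the NPE.
    val sc = new SparkContext(
      new SparkConf().setAppName("SPARK-6414-repro").setMaster("local[2]"))
    try {
      // A long-running job launched in the background.
      val job = Future {
        sc.parallelize(1 to 10000).map { i => Thread.sleep(1); i }.count()
      }
      // Cancelling a job group that no job belongs to used to NPE the driver.
      sc.cancelJobGroup("nonExistGroupId")
      // With the fix, the unrelated job should still complete normally.
      Await.result(job, Duration(60, TimeUnit.SECONDS))
    } finally {
      sc.stop()
    }
  }
}
```

Running this against a pre-fix build should surface the NullPointerException in the driver; with the fix, the count completes and the cancellation is a no-op.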
(Sorry, that should have been
Test build #29009 has finished for PR 5124 at commit
Test PASSed.
Test build #29266 has started for PR 5124 at commit
@JoshRosen, thanks for the help. I updated this PR with
Test build #29266 has finished for PR 5124 at commit
Test PASSed.
-    if (localProperties.get() == null) {
-      localProperties.set(new Properties())
+    if (localProperties.get().isEmpty) {
+      localProperties.set(Some(new Properties()))
Instead of having localProperties be an Option[Properties], why not just have override def initialValue(): Properties = new Properties()?
Basically, this behavior of having localProperties be null or None until we set at least one property is confusing to me; I think it's simpler to just eagerly perform this initialization before we set any properties.
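To make that suggestion concrete, here is a minimal, self-contained sketch of the pattern using a plain ThreadLocal (no Spark types; the property key is just an example, not a claim about Spark internals):

```scala
import java.util.Properties

object EagerInitSketch {
  // Override initialValue so get() never returns null on any thread:
  // the first get() on a thread materializes an empty Properties object.
  private val localProperties = new ThreadLocal[Properties] {
    override def initialValue(): Properties = new Properties()
  }

  def main(args: Array[String]): Unit = {
    // No set() has happened on this thread, yet get() is already non-null.
    val props = localProperties.get()
    assert(props != null)
    assert(props.isEmpty)

    props.setProperty("spark.jobGroup.id", "group-1")
    println(localProperties.get().getProperty("spark.jobGroup.id")) // prints group-1
  }
}
```

With this shape, callers never need a null check: an "unset" state is simply an empty Properties object.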
@JoshRosen the idea is to use Option so the compiler can help us prevent NullPointerException, and by eagerly setting localProperties to None, localProperties won't be null. As a contributor, when I see that the type is Option[Properties], I know I'm safe from NPE; I might not be aware of the eager initialization to an empty Properties somewhere in the code, and would end up putting if (localProperties.get != null) checks everywhere. If we can change the DeveloperApi, I would like to change all the properties = null defaults in SparkListenerStageSubmitted and SparkListenerJobStart to Option.
That's just my 2 cents, I'm more than happy to discuss and listen to your thoughts.
Option helps you avoid NPEs by convention; it's not guaranteed to prevent NPE, since you can still do things like this:
scala> var x: Option[Int] = null
x: Option[Int] = null
scala> x
res0: Option[Int] = null
scala> x.get
java.lang.NullPointerException
... 33 elided
scala> x match { case Some(_) => (); case None => () }
scala.MatchError: null
... 33 elided
There's actually a small risk of this issue occurring when you're not methodical in refactoring existing null-passing code to pass options. If you have something like def foo(x: String) and you refactor it to def foo(x: Option[String]), you'll still run into trouble if you had old code that called foo(null), since that code will still compile correctly after the refactoring. This means that you have to search for all callers of foo() when performing the refactoring, rather than just changing the type and fixing whatever compiler errors occur. A similar problem can occur if you call Some(x) where x == null, which can also happen when refactoring large legacy codebases, since this causes you to wind up with a Some(null).
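Both pitfalls can be demonstrated in a self-contained snippet (foo and the variable names here are hypothetical, mirroring the example above):

```scala
object OptionPitfalls {
  // Refactored from def foo(x: String): old call sites that passed null
  // still compile, because null is a valid value for any reference type,
  // including Option[String].
  def foo(x: Option[String]): Int = x.map(_.length).getOrElse(0)

  def main(args: Array[String]): Unit = {
    // Legacy caller: compiles fine, blows up at runtime.
    try {
      foo(null)
      assert(false, "expected an NPE")
    } catch {
      case _: NullPointerException => () // the compiler did not save us
    }

    // Some(x) where x == null yields Some(null), not None:
    val legacy: String = null
    assert(Some(legacy).isDefined)  // Some(null) counts as "defined"
    assert(Option(legacy).isEmpty)  // Option(...) is the null-safe constructor
  }
}
```

The last two assertions show why Option(x), rather than Some(x), is the safe way to wrap a possibly-null value when migrating legacy code.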
To address the actual issue at hand, though, I'm suggesting that it's clearer to change where we perform our initialization so that localProperties.get() always returns a valid Properties object, even if it's an empty one. This will prevent localProperties.get() from returning null, so we can then begin to look at all of the code which checks whether (localProperties.get() == null) and refactor it to assume that localProperties is non-null, and so on, until we've removed all of the nullability and options from this code path.
If you look at DAGScheduler, I think that properties flow into it via handleJobSubmitted. The properties that flow to this location come from either SparkContext.runJob or SparkContext.runApproximateJob, both of which pass localProperties.get: https://github.com/hunglin/spark/blob/baea4fd9f6df0af466e51a8f19380f194ec502ae/core/src/main/scala/org/apache/spark/SparkContext.scala#L1493

If you continue to apply this sort of reasoning to all of the places where these properties flow, I think we'll find that the properties won't be null unless they're null in the SparkContext run*Job methods, which will be prevented by overriding initialValue to return an empty properties object.
If we can change DeveloperApi, I would like to change all properties = null in SparkListenerStageSubmitted and SparkListenerJobStart to Option.
A nice advantage of returning an empty properties object rather than using Option is that we don't need to break binary compatibility in these events. It might be nice to add a note to these events' docstrings to clarify that the field used to be nullable in earlier versions, since this issue might bite users who write code against the newest version of Spark's listener API and get NPEs when they run on older versions.

One gotcha is that the UI's JSONProtocol will continue to deserialize old event streams such that these events can still contain null properties fields. It wouldn't be too hard to modify propertiesFromJson and propertiesToJson to also return empty Properties objects instead of nulls.
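A sketch of the deserialization side of that idea follows. The helper here is hypothetical: Spark's real JsonProtocol operates on json4s values, which are replaced below by a plain Map stand-in so the snippet stays self-contained.

```scala
import java.util.Properties

object PropertiesJsonSketch {
  // Stand-in for propertiesFromJson: when the field is absent from an old
  // event stream, return an empty Properties object instead of null.
  def propertiesFromJson(field: Option[Map[String, String]]): Properties = {
    val props = new Properties()
    field.getOrElse(Map.empty).foreach { case (k, v) => props.setProperty(k, v) }
    props
  }

  def main(args: Array[String]): Unit = {
    // Old event with no properties field: callers no longer see null.
    assert(propertiesFromJson(None).isEmpty)

    val p = propertiesFromJson(Some(Map("spark.jobGroup.id" -> "g1")))
    assert(p.getProperty("spark.jobGroup.id") == "g1")
  }
}
```

The same default-to-empty convention on the serialization side would keep round-trips of old and new event streams consistent.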
By the way, I think this change would significantly shrink the size of this patch, too, since none of the internal methods' type signatures would need to change.
I see, I'll update the PR.
Test build #29334 has started for PR 5124 at commit
@JoshRosen I updated the PR, please review. Thanks.
Test build #29335 has started for PR 5124 at commit
Test build #29334 has finished for PR 5124 at commit
Test PASSed.
Test build #29335 has finished for PR 5124 at commit
Test PASSed.
Await.ready(future, Duration(2, TimeUnit.SECONDS))

/**
 * In SPARK-6414, sc.cancelJobGroup will cause NullPointerException and cause
This should use the // single-line comment style, not Scaladoc style.
Thanks for the reminder; I've updated the comment style.
Aside from two minor comments that I just left, this looks good to me.
Test build #29439 has started for PR 5124 at commit
Test build #29439 has finished for PR 5124 at commit
Test PASSed.
Hi @JoshRosen, do you think this PR is ready to be merged? Please let me know if you have more comments; I'm happy to address them.
LGTM; looks like there's a merge conflict, but I'll fix it myself.
Looks like the Apache Git server is down / inaccessible, so I'll have to commit this in a little bit.
Author: Hung Lin <hunglin@gmail.com>
@JoshRosen, thanks for the help. I just rebased this PR against master. It should also fix the conflict.
Test build #29620 has started for PR 5124 at commit
Test build #29620 has finished for PR 5124 at commit
Test PASSed.
Use Option for ActiveJob.properties to avoid NPE bug

Author: Hung Lin <hung.lin@gmail.com>

Closes #5124 from hunglin/SPARK-6414 and squashes the following commits:

2290b6b [Hung Lin] [SPARK-6414][core] Fix NPE in SparkContext.cancelJobGroup()

(cherry picked from commit e3202aa)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>

Conflicts:
    core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
    core/src/test/scala/org/apache/spark/SparkContextSuite.scala
I've merged this into
@JoshRosen, no problem. I'm happy to contribute.