-
Notifications
You must be signed in to change notification settings - Fork 28k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SPARK-5548: Fixed a race condition in AkkaUtilsSuite #4343
SPARK-5548: Fixed a race condition in AkkaUtilsSuite #4343
Conversation
Once approved, I'll create another PR for master. |
Test build #26675 has started for PR 4343 at commit
|
@@ -371,7 +371,7 @@ class AkkaUtilsSuite extends FunSuite with LocalSparkContext with ResetSystemPro | |||
AkkaUtils.address(AkkaUtils.protocol(slaveSystem), "spark", "localhost", boundPort, "MapOutputTracker")) | |||
val timeout = AkkaUtils.lookupTimeout(conf) | |||
intercept[TimeoutException] { | |||
slaveTracker.trackerActor = Await.result(selection.resolveOne(timeout), timeout) | |||
slaveTracker.trackerActor = Await.result(selection.resolveOne(timeout * 2), timeout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain the the PR's comment (= the commit message) how increasing a timeout fixes a race condition, instead of just making it less likely?
@jacek-lewandowski Since master and branch-1.3 are pretty much in sync at this point, feel free to open your PR directly against master; I can handle backports myself as part of the merge process, since our merge script makes it really easy to do cherry-picks. It's still worth opening separate backport PRs if the branches have diverged significantly, but if they haven't it's generally easier to open PRs against master and have the committer handle the ports. If you'd like a patch to be merged into multiple branches, just leave a comment in the PR description so the committer knows where to merge it. |
Test build #26675 has finished for PR 4343 at commit
|
Test PASSed. |
@jacek-lewandowski how about just catching either exception? |
@pwendell I think the problem with catching either exception is that the |
I see, so you are worried about false negatives. For now, let's increase the timeout then. |
`Await.result` and `selection.resolveOne` runs the same timeout simultaneously. When `Await.result` timeout is reached first, then `TimeoutException` is thrown. On the other hand, when `selection.resolveOne` timeout is reached first, `ActorNotFoundException` is thrown. This is an obvious race condition and the easiest way to fix it is to increase the timeout of one method to make sure the code fails on the other method first. Author: Jacek Lewandowski <lewandowski.jacek@gmail.com> Closes #4343 from jacek-lewandowski/SPARK-5548-1.3 and squashes the following commits: b9ba47e [Jacek Lewandowski] SPARK-5548: Fixed a race condition in AkkaUtilsSuite
@pwendell so is this pr enough (according to what @JoshRosen said?) or should i create another one for master? |
This one is fine. I can take care of getting it in master. |
Actually can you close this PR? I already merged but it doesn't close correclty when someone merges into a topic branch. |
The race condition seems still there. @jacek-lewandowski I'm not 100% sure, but it doesn't seem related to the timeout this time. The stack trace indicates that the |
Yep, looks like it's still here: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.3-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/93/testReport/junit/org.apache.spark.util/AkkaUtilsSuite/remote_fetch_ssl_on___untrusted_server/ The
|
Here is the new PR #4653 |
Await.result
andselection.resolveOne
runs the same timeout simultaneously. WhenAwait.result
timeout is reached first, thenTimeoutException
is thrown. On the other hand, whenselection.resolveOne
timeout is reached first,ActorNotFoundException
is thrown. This is an obvious race condition and the easiest way to fix it is to increase the timeout of one method to make sure the code fails on the other method first.