
[SPARK-23020][core] Fix races in launcher code, test. #20223

Closed
vanzin wants to merge 4 commits into master from vanzin:SPARK-23020

Conversation

Contributor

@vanzin vanzin commented Jan 10, 2018

The race in the code is that the handle might move to the wrong state if the connection-handling thread is still processing incoming data, so the handle needs to wait for the connection to finish up before checking the final state.

The race in the test is that, when waiting for a handle to reach a final state, the waitFor() method needs to wait until all handle state has been updated (which also includes waiting for the connection thread above to finish). Otherwise, waitFor() may return too early, which causes a number of different races (such as the listener not yet being notified of the state change, or being in the middle of being notified, or the handle not being properly disposed, which makes postChecks() assert).

On top of that I found, by code inspection, a couple of potential races that could leave a handle in the wrong state when it is killed.

Tested by running the existing unit tests many times (and no longer seeing the errors I was seeing before).
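
A minimal Java sketch of the waiting pattern described above (illustrative only; this is not the actual AbstractAppHandle/LauncherConnection code and the names are made up). The point is that waitFor() only returns once the handle is in a final state and the connection-handling thread has finished all disposal work:

class HandleSketch {
  enum State { RUNNING, FINISHED, FAILED, KILLED }

  private State state = State.RUNNING;
  private boolean disposed = false;

  synchronized void setState(State s) {
    state = s;
    notifyAll();
  }

  // Called by the connection thread once it has finished processing incoming
  // data and has run all disposal tasks.
  synchronized void markDisposed() {
    disposed = true;
    notifyAll();
  }

  // Wait until the state is final AND disposal has completed, so listeners
  // have been notified and post-checks can run safely.
  synchronized void waitFor(long timeoutMs) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (!(isFinal(state) && disposed)) {
      long remaining = deadline - System.currentTimeMillis();
      if (remaining <= 0) {
        throw new IllegalStateException("Handle did not reach a final, disposed state in time.");
      }
      wait(remaining);
    }
  }

  private static boolean isFinal(State s) {
    return s != State.RUNNING;
  }
}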

Marcelo Vanzin added 2 commits January 10, 2018 12:52

SparkQA commented Jan 11, 2018

Test build #85929 has finished for PR 20223 at commit 5139f60.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jan 11, 2018

Test build #85937 has finished for PR 20223 at commit 2018e29.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -91,10 +92,15 @@ LauncherConnection getConnection() {
     return connection;
   }
 
-  boolean isDisposed() {
+  synchronized boolean isDisposed() {
     return disposed;
Contributor

can we simply mark disposed as transient?

Contributor Author

The synchronized is not protecting the variable, but all the actions that happen around the code that sets the variable.

So this method returning guarantees that either all the disposal tasks have run or none have.

That being said, ServerConnection.close is not really holding the handle lock as it should, so I might have to make some changes here.

These are not really contended locks; in the worst case there will be two threads looking at them. So trying to come up with a finer-grained locking scheme here sounds like more trouble than it's worth.
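
To illustrate (a hedged sketch with hypothetical names, not the actual ServerConnection code): the lock ties the flag to the disposal work, so a caller that sees isDisposed() return true knows every disposal task has already run, whereas a volatile flag would only guarantee visibility of the flag itself.

class DisposableConnectionSketch {
  private boolean disposed = false;

  synchronized boolean isDisposed() {
    return disposed;
  }

  synchronized void dispose() {
    if (!disposed) {
      closeSocketQuietly();    // hypothetical disposal task
      unregisterFromServer();  // hypothetical disposal task
      disposed = true;         // only flips after all the tasks above have run
    }
  }

  private void closeSocketQuietly() { /* ... */ }
  private void unregisterFromServer() { /* ... */ }
}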

if (!closed) {
synchronized (this) {
Contributor

what's wrong with this? It's a classic "double-checked locking" in Java.

Contributor Author

The sub-class needs coarser-grained locking so this wasn't really doing anything useful.
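
For reference, a hedged sketch of the idiom under discussion next to the coarser-grained alternative (illustrative only, not the actual diff):

class CloseSketch {
  private volatile boolean closed = false;

  // Classic double-checked locking: cheap unlocked check, then lock and re-check.
  void closeWithDoubleCheck() {
    if (!closed) {
      synchronized (this) {
        if (!closed) {
          closed = true;
          // ... cleanup ...
        }
      }
    }
  }

  // Coarser-grained version: if callers need to hold the lock anyway (as the
  // subclass here does), the extra unlocked check buys nothing.
  synchronized void closeUnderLock() {
    if (!closed) {
      closed = true;
      // ... cleanup ...
    }
  }
}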

@@ -137,7 +139,9 @@ public void testInProcessLauncher() throws Exception {
// Here DAGScheduler is stopped, while SparkContext.clearActiveContext may not be called yet.
// Wait for a reasonable amount of time to avoid creating two active SparkContext in JVM.
Contributor

why is 500 ms not a reasonable waiting time anymore?

Member

Here Vanzin waits for the active SparkContext to be cleared, with a 5-second timeout. It is a better solution.

Contributor Author

It's not that 500ms is unreasonable; it's that this code returns more quickly when the context shuts down in under 500ms, allows a longer grace period in case it ever takes more than 500ms, and fails the test here if the context never shuts down.
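
A hedged sketch of that shape, with a hypothetical condition standing in for the real "active SparkContext has been cleared" check (this is not the actual test code):

import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

final class WaitUtilSketch {
  // Poll a condition with a deadline: return as soon as it holds, allow a longer
  // grace period than a fixed sleep, and fail loudly if it never holds.
  static void waitFor(BooleanSupplier condition, long timeoutMs) throws InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() > deadline) {
        throw new AssertionError("Condition not met within " + timeoutMs + " ms");
      }
      Thread.sleep(10);
    }
  }
}

It would be called as, for example, WaitUtilSketch.waitFor(() -> noActiveSparkContext(), 5000), where noActiveSparkContext() is a placeholder for however the test checks that the context has been cleared.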

Member

@gengliangwang gengliangwang left a comment

There are a lot of synchronized blocks, and some of them seem unnecessary.

return disposed;
}

synchronized void markDisposed() {
Member

why do we need synchronized here?

@@ -91,10 +92,15 @@ LauncherConnection getConnection() {
     return connection;
   }
 
-  boolean isDisposed() {
+  synchronized boolean isDisposed() {
Member

why do we need synchronized here?

@@ -71,15 +71,16 @@ public void stop() {
@Override
public synchronized void disconnect() {
if (!disposed) {
Member

if(!isDisposed()) ?


SparkQA commented Jan 11, 2018

Test build #85984 has finished for PR 20223 at commit 9c7d2d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jan 11, 2018

Test build #85986 has finished for PR 20223 at commit 8d52338.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (childProc.isAlive()) {
childProc.destroyForcibly();
if (!isDisposed()) {
setState(State.KILLED);
Member

Why put setState first? It should be OK, but in the previous code some of the setState calls were made at the end.

Contributor

I have the same question, but it should be OK, as we always synchronize when touching the state.

Contributor Author

I changed this because the old disconnect code, at least, might change the handle's state. It's easier to put this call first and not have to reason about whether that will happen.
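
A hedged sketch of that ordering argument (names and structure are illustrative, not the exact patch; it assumes setState() ignores further updates once the handle is in a final state):

synchronized void kill() {
  if (!isDisposed()) {
    setState(State.KILLED);          // pin the final state first
    disconnect();                    // the old disconnect path might also set state;
                                     // with KILLED already recorded, it cannot change the outcome
    if (childProc != null && childProc.isAlive()) {
      childProc.destroyForcibly();   // finally, force-kill the child process
    }
  }
}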

@dongjoon-hyun
Member

Retest this please.


SparkQA commented Jan 16, 2018

Test build #86142 has finished for PR 20223 at commit 8d52338.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM

@sameeragarwal
Member

merging to master/2.3. Thanks!

asfgit pushed a commit that referenced this pull request Jan 16, 2018

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #20223 from vanzin/SPARK-23020.

(cherry picked from commit 66217da)
Signed-off-by: Sameer Agarwal <sameerag@apache.org>
@asfgit asfgit closed this in 66217da Jan 16, 2018
@vanzin vanzin deleted the SPARK-23020 branch January 16, 2018 19:48
@sameeragarwal
Member

This patch unfortunately broke the YarnClusterSuite test "timeout to get SparkContext in cluster mode triggers failure", and SparkLauncherSuite.testInProcessLauncher is still flaky. I'm going to revert this patch and temporarily disable the test in SparkLauncherSuite to fix the build.

@cloud-fan
Contributor

cloud-fan commented Jan 17, 2018

We didn't run the YARN tests with this PR due to https://issues.apache.org/jira/browse/SPARK-10300

This suggests it may be a bad idea to skip the YARN tests when we don't change YARN code: in this case, we touched code in Spark core and caused a failure in the YARN tests.

@vanzin
Contributor Author

vanzin commented Jan 17, 2018

At this point in time it might be a good idea to revert SPARK-10300, since the YARN tests don't really add that much to the test run time compared to the others anymore...
