
[SPARK-23020][core] Fix races in launcher code, test. #20223

Closed
vanzin wants to merge 4 commits into master from vanzin:SPARK-23020

Conversation

Contributor

@vanzin vanzin commented Jan 10, 2018

The race in the code is that the handle might move to the wrong state if the connection-handling thread is still processing incoming data, so the handle needs to wait for the connection to finish up before checking the final state.

The race in the test is that, when waiting for a handle to reach a final state, the waitFor() method needs to wait until all handle state has been updated (which also includes waiting for the connection thread above to finish). Otherwise, waitFor() may return too early, which causes a number of different races (such as the listener not yet being notified of the state change, or being in the middle of being notified, or the handle not being properly disposed, which makes postChecks() assert).

On top of that I found, by code inspection, a couple of potential races that could leave a handle in the wrong state when it is killed.

Tested by running the existing unit tests many times (and no longer seeing the errors I was seeing before).
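
A minimal Java sketch of the waiting pattern described above (illustrative only; this is not the actual AbstractAppHandle/LauncherConnection code and the names are made up). The point is that waitFor() only returns once the handle is in a final state and the connection-handling thread has finished all disposal work:

class HandleSketch {
  enum State { RUNNING, FINISHED, FAILED, KILLED }

  private State state = State.RUNNING;
  private boolean disposed = false;

  synchronized void setState(State s) {
    state = s;
    notifyAll();
  }

  // Called by the connection thread once it has finished processing incoming
  // data and has run all disposal tasks.
  synchronized void markDisposed() {
    disposed = true;
    notifyAll();
  }

  // Wait until the state is final AND disposal has completed, so listeners
  // have been notified and post-checks can run safely.
  synchronized void waitFor(long timeoutMs) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (!(isFinal(state) && disposed)) {
      long remaining = deadline - System.currentTimeMillis();
      if (remaining <= 0) {
        throw new IllegalStateException("Handle did not reach a final, disposed state in time.");
      }
      wait(remaining);
    }
  }

  private static boolean isFinal(State s) {
    return s != State.RUNNING;
  }
}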

Marcelo Vanzin added 2 commits January 10, 2018 12:52

SparkQA commented Jan 11, 2018

Test build #85929 has finished for PR 20223 at commit 5139f60.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jan 11, 2018

Test build #85937 has finished for PR 20223 at commit 2018e29.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -91,10 +92,15 @@ LauncherConnection getConnection() {
     return connection;
   }
 
-  boolean isDisposed() {
+  synchronized boolean isDisposed() {
     return disposed;
Contributor

can we simply mark disposed as transient?

Contributor Author

The synchronized is not protecting the variable, but all the actions that happen around the code that sets the variable.

So this method returning guarantees that either all the disposal tasks have run or none have.

That being said, ServerConnection.close is not really holding the handle lock as it should, so I might have to make some changes here.

These are not really contended locks; in the worst case there will be two threads looking at them. So trying to come up with a finer-grained locking scheme here sounds like more trouble than it's worth.
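
To illustrate (a hedged sketch with hypothetical names, not the actual ServerConnection code): the lock ties the flag to the disposal work, so a caller that sees isDisposed() return true knows every disposal task has already run, whereas a volatile flag would only guarantee visibility of the flag itself.

class DisposableConnectionSketch {
  private boolean disposed = false;

  synchronized boolean isDisposed() {
    return disposed;
  }

  synchronized void dispose() {
    if (!disposed) {
      closeSocketQuietly();    // hypothetical disposal task
      unregisterFromServer();  // hypothetical disposal task
      disposed = true;         // only flips after all the tasks above have run
    }
  }

  private void closeSocketQuietly() { /* ... */ }
  private void unregisterFromServer() { /* ... */ }
}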

if (!closed) {
synchronized (this) {
Contributor

what's wrong with this? It's a classic "double-checked locking" in Java.

Contributor Author

The sub-class needs coarser-grained locking so this wasn't really doing anything useful.
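
For reference, a hedged sketch of the idiom under discussion next to the coarser-grained alternative (illustrative only, not the actual diff):

class CloseSketch {
  private volatile boolean closed = false;

  // Classic double-checked locking: cheap unlocked check, then lock and re-check.
  void closeWithDoubleCheck() {
    if (!closed) {
      synchronized (this) {
        if (!closed) {
          closed = true;
          // ... cleanup ...
        }
      }
    }
  }

  // Coarser-grained version: if callers need to hold the lock anyway (as the
  // subclass here does), the extra unlocked check buys nothing.
  synchronized void closeUnderLock() {
    if (!closed) {
      closed = true;
      // ... cleanup ...
    }
  }
}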

@@ -137,7 +139,9 @@ public void testInProcessLauncher() throws Exception {
// Here DAGScheduler is stopped, while SparkContext.clearActiveContext may not be called yet.
// Wait for a reasonable amount of time to avoid creating two active SparkContext in JVM.
Contributor

why is 500 ms not a reasonable waiting time anymore?

Member

Here Vanzin waits for the active SparkContext to be cleared, with a 5-second timeout. It is a better solution.

Contributor Author

It's not that 500ms is unreasonable; it's that this code returns more quickly when the context shuts down in under 500ms, allows a longer grace period in case it ever takes more than 500ms, and fails the test here if the context never shuts down.
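
A hedged sketch of that shape, with a hypothetical condition standing in for the real "active SparkContext has been cleared" check (this is not the actual test code):

import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

final class WaitUtilSketch {
  // Poll a condition with a deadline: return as soon as it holds, allow a longer
  // grace period than a fixed sleep, and fail loudly if it never holds.
  static void waitFor(BooleanSupplier condition, long timeoutMs) throws InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() > deadline) {
        throw new AssertionError("Condition not met within " + timeoutMs + " ms");
      }
      Thread.sleep(10);
    }
  }
}

It would be called as, for example, WaitUtilSketch.waitFor(() -> noActiveSparkContext(), 5000), where noActiveSparkContext() is a placeholder for however the test checks that the context has been cleared.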

Member

@gengliangwang gengliangwang left a comment

There are a lot of synchronized blocks, and some of them seem unnecessary.

return disposed;
}

synchronized void markDisposed() {
Member

why do we need synchronized here?

@@ -91,10 +92,15 @@ LauncherConnection getConnection() {
     return connection;
   }
 
-  boolean isDisposed() {
+  synchronized boolean isDisposed() {
Member

why do we need synchronized here?

@@ -71,15 +71,16 @@ public void stop() {
@Override
public synchronized void disconnect() {
if (!disposed) {
Member

if(!isDisposed()) ?


SparkQA commented Jan 11, 2018

Test build #85984 has finished for PR 20223 at commit 9c7d2d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jan 11, 2018

Test build #85986 has finished for PR 20223 at commit 8d52338.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (childProc.isAlive()) {
childProc.destroyForcibly();
if (!isDisposed()) {
setState(State.KILLED);
Member

Why put setState first? It should be OK, but in the previous code some of the setState calls were made at the end.

Contributor

I have the same question, but it should be OK, as we always synchronize when touching the state.

Contributor Author

I changed this because the old disconnect code, at least, might change the handle's state. It's easier to put this call first and not have to reason about whether that will happen.
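
A hedged sketch of that ordering argument (names and structure are illustrative, not the exact patch; it assumes setState() ignores further updates once the handle is in a final state):

synchronized void kill() {
  if (!isDisposed()) {
    setState(State.KILLED);          // pin the final state first
    disconnect();                    // the old disconnect path might also set state;
                                     // with KILLED already recorded, it cannot change the outcome
    if (childProc != null && childProc.isAlive()) {
      childProc.destroyForcibly();   // finally, force-kill the child process
    }
  }
}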

@dongjoon-hyun
Member

Retest this please.


SparkQA commented Jan 16, 2018

Test build #86142 has finished for PR 20223 at commit 8d52338.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM

@sameeragarwal
Member

merging to master/2.3. Thanks!

asfgit pushed a commit that referenced this pull request Jan 16, 2018

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #20223 from vanzin/SPARK-23020.

(cherry picked from commit 66217da)
Signed-off-by: Sameer Agarwal <sameerag@apache.org>
@asfgit asfgit closed this in 66217da Jan 16, 2018
@vanzin vanzin deleted the SPARK-23020 branch January 16, 2018 19:48
@sameeragarwal
Member

This patch unfortunately broke the YarnClusterSuite test "timeout to get SparkContext in cluster mode triggers failure", and SparkLauncherSuite.testInProcessLauncher is still flaky. I'm going to revert this patch and temporarily disable the test in SparkLauncherSuite to fix the build.

@cloud-fan
Contributor

cloud-fan commented Jan 17, 2018

We didn't run the YARN tests with this PR due to https://issues.apache.org/jira/browse/SPARK-10300

This suggests it may be a bad idea to skip the YARN tests when we don't change YARN code: in this case, we touched code in Spark core and caused a failure in the YARN tests.

@vanzin
Contributor Author

vanzin commented Jan 17, 2018

At this point in time it might be a good idea to revert SPARK-10300, since the YARN tests don't really add that much to the test run time compared to the others anymore...
