[SPARK-3106] Fix the race condition issue about Connection and ConnectionManager #2019

sarutak · 2014-08-18T19:38:59Z

No description provided.

SparkQA · 2014-08-18T19:45:18Z

QA tests have started for PR 2019 at commit 698a47e.

This patch merges cleanly.

SparkQA · 2014-08-18T19:45:57Z

QA tests have finished for PR 2019 at commit 698a47e.

This patch fails unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2014-08-18T20:05:19Z

QA tests have started for PR 2019 at commit 48ae3c6.

This patch merges cleanly.

SparkQA · 2014-08-18T20:56:19Z

QA tests have finished for PR 2019 at commit 48ae3c6.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

sarutak · 2014-08-19T11:39:55Z

This change should resolve the ClosedChannelException and CancelledKeyException being thrown, as well as the warning messages "Corresponding SendingConnectionManagerId not found" and "All connections not cleaned up" that we have been seeing recently.

SparkQA · 2014-08-19T11:40:19Z

QA tests have started for PR 2019 at commit e1b580e.

This patch merges cleanly.

SparkQA · 2014-08-19T12:35:44Z

QA tests have finished for PR 2019 at commit e1b580e.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2014-08-19T12:50:21Z

QA tests have started for PR 2019 at commit ce07ae5.

This patch merges cleanly.

SparkQA · 2014-08-19T13:44:27Z

QA tests have finished for PR 2019 at commit ce07ae5.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

arahuja · 2014-08-19T16:59:16Z

I tested this patch on branch-1.0 and still see those Exceptions in the logs, curious to know if you expected this to work there as well, or on YARN?

Exceptions:

14/08/19 12:39:42 WARN SendingConnection: Error writing in connection to ConnectionManagerId(demeter-csmaz11-4.demeter.hpc.mssm.edu,35328)
java.nio.channels.AsynchronousCloseException
at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:205)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:496)
at org.apache.spark.network.SendingConnection.write(Connection.scala:380)
at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:142)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

14/08/19 12:37:25 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(demeter-csmau08-20.demeter.hpc.mssm.edu,53302)
14/08/19 12:37:25 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@24644e7f
java.nio.channels.CancelledKeyException
at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:287)
at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116)

sarutak · 2014-08-19T17:06:40Z

Hi @arahuja, I tested on Hadoop 2.x with YARN.
It seemed that exceptions like the ones you mentioned calmed down.

Before my change, I saw those exceptions on drivers. Where did you see them?

sarutak · 2014-08-19T17:18:11Z

@arahuja I found a code path that leads to the situation you describe. I'll fix it soon.

sarutak · 2014-08-21T08:47:10Z

@arahuja I've updated the patch. Can you test with the updated PR?

SparkQA · 2014-08-21T08:50:33Z

QA tests have started for PR 2019 at commit 5f91c8d.

This patch merges cleanly.

SparkQA · 2014-08-21T09:05:33Z

QA tests have started for PR 2019 at commit 2d7f444.

This patch merges cleanly.

SparkQA · 2014-08-21T09:37:36Z

QA tests have finished for PR 2019 at commit 5f91c8d.

This patch fails unit tests.
This patch merges cleanly.
This patch adds no public classes.

sarutak · 2014-08-21T09:39:25Z

Jenkins, retest this please.

SparkQA · 2014-08-21T09:45:33Z

QA tests have started for PR 2019 at commit 2d7f444.

This patch merges cleanly.

SparkQA · 2014-08-21T10:00:09Z

QA tests have finished for PR 2019 at commit 2d7f444.

This patch passes unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- In multiclass classification, all$2^`
- public final class JavaDecisionTree

SparkQA · 2014-08-21T10:41:21Z

QA tests have finished for PR 2019 at commit 2d7f444.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2014-08-21T15:05:37Z

QA tests have started for PR 2019 at commit 22bae6f.

This patch merges cleanly.

SparkQA · 2014-08-21T15:06:33Z

QA tests have finished for PR 2019 at commit 22bae6f.

This patch fails unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2014-08-23T08:50:43Z

QA tests have started for PR 2019 at commit 814692c.

This patch merges cleanly.

SparkQA · 2014-08-23T09:45:55Z

QA tests have finished for PR 2019 at commit 814692c.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

mridulm · 2014-08-23T10:50:58Z

core/src/main/scala/org/apache/spark/network/Connection.scala

+        }
+        channel.close()
+        closed = true
+      }


This is an incorrect change.
Any of those methods can throw an exception, leaving Connection.closed as false.

What is the point of the synchronized, by the way? None of the other methods are protected by this lock.

SendingConnection#close is called from 3 threads on the same instance.
For example, the 1st thread of handle-read-write-executor calls ReceivingConnection#close -> SendingConnection#close, the 2nd thread of handle-read-write-executor calls SendingConnection#close, and the 3rd thread, connection-manager-thread, calls ConnectionManager#run -> SendingConnection#close.

I think that if any method in close() throws an exception, the connection should not be marked as closed, because one of the other threads is expected to close the resources even if one thread fails to.

The synchronized block is there to protect SendingConnection#close against being called from those 3 threads.
Without it, the following sequence can happen.
(1) One thread of handle-read-write-executor evaluates key.cancel in SendingConnection#close.
(2) Then, connection-manager-thread calls removeConnection via callOnCloseCallback and evaluates "connectionsByKey -= connection.key". This fails to remove the entry because connection.key is null at this time.

After (2), connection-manager-thread checks connectionsByKey.size != 0 in ConnectionManager#stop. Because the entry was never removed, the size is not 0 and we get the log message "All connections not cleaned up".

The way to handle this is to make closed an AtomicBoolean and do a getAndSet.
If the result of getAndSet is false, which means closed was false on invocation, only then do the actual close logic from earlier: it is a bug that all invocations of close were trying to do the same thing.

Essentially :
a) Change
var closed = false
to
var closed = new AtomicBoolean(false)

b) Change close() to

def close() {
  val prev = closed.getAndSet(true)
  if (!prev) {
    closeImpl()
  }
}

Where closeImpl is a private method containing the logic from earlier close (except for the closed variable update).

This will ensure that failures in closeImpl still result in the connection being marked as closed, and that repeated invocations do not execute the same code again and surface other failures (like a missing id in the map, etc).
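For illustration, the getAndSet pattern described above could look roughly like the following. This is only a sketch: SketchConnection, failOnClose and cleanupRuns are hypothetical names that do not exist in Spark, and closeImpl is a placeholder for the earlier body of close().

```scala
import java.util.concurrent.atomic.{AtomicBoolean, AtomicInteger}

// Hypothetical stand-in for Connection; only the close() pattern matters here.
class SketchConnection(failOnClose: Boolean = false) {
  private val closed = new AtomicBoolean(false)
  // Counts how often the real cleanup ran (for demonstration only).
  val cleanupRuns = new AtomicInteger(0)

  def isClosed: Boolean = closed.get()

  def close(): Unit = {
    // getAndSet returns the previous value, so only the first caller sees false.
    val prev = closed.getAndSet(true)
    if (!prev) {
      closeImpl()
    }
  }

  // Placeholder for the original body of close() (key cancel, channel close, ...).
  private def closeImpl(): Unit = {
    cleanupRuns.incrementAndGet()
    if (failOnClose) throw new java.io.IOException("simulated failure during close")
  }
}

object IdempotentCloseDemo {
  def main(args: Array[String]): Unit = {
    // Three threads race to close the same instance; cleanup should run once.
    val conn = new SketchConnection()
    val threads = (1 to 3).map(_ => new Thread(new Runnable {
      def run(): Unit = conn.close()
    }))
    threads.foreach(_.start())
    threads.foreach(_.join())
    println(s"closed=${conn.isClosed} cleanupRuns=${conn.cleanupRuns.get()}")

    // A failure inside closeImpl still leaves the connection marked as closed.
    val failing = new SketchConnection(failOnClose = true)
    try failing.close() catch { case _: java.io.IOException => () }
    println(s"closed after failure=${failing.isClosed}")
  }
}
```

Note that the flag is flipped before the cleanup runs, so a second caller never re-enters closeImpl even while the first caller is still inside it.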

If we set closed to true and one of the methods in close() throws an exception, the connection is left in an inconsistent state, and a call to SendingConnection#close on the same instance from another thread cannot recover because closed is already set to true.

I think you are misunderstanding the intent of what close is supposed to do for Connection classes. It is supposed to mirror the normal expectation of close on streams, barring the bug I mentioned above.

In a nutshell, it is supposed to mark connection as closed (so the repeated invocations of the method are idempotent), and cleanup if required. Take a look at how close is implemented in general in various jdk IO classes.

O.K. So Connection#close is just for marking the connection as closed, and a failure during closing does not need to be recovered from, right?
If so, using AtomicBoolean is reasonable.

mridulm · 2014-08-23T10:56:35Z

Handling TCP/IP events is by definition async, particularly when state changes can happen orthogonally to the state held in Java variables.
So there is only so much you can do to reduce the exceptions you see in the logs. The important point is not to prevent issues (which is not possible if you want to write performant, robust code), but to detect them and ensure they are handled properly.

Given that, the changes here look fragile: we can revisit this PR when they are addressed, since I think there is value in some of these.
(For example, make closed an atomic boolean, do a getAndSet, and do the expensive close only if the previous value was false; and so on)

sarutak · 2014-08-24T13:05:56Z

One of the issues I'd like to resolve in this PR is the misdetection that happens when a SendingConnection is closed by its corresponding ReceivingConnection in removeConnection.
If a SendingConnection closes itself and invokes removeConnection, the corresponding ReceivingConnection then fails to close that SendingConnection because sendingConnectionOpt.isDefined is false.

      if (!sendingConnectionOpt.isDefined) {
        logError(s"Corresponding SendingConnection to ${remoteConnectionManagerId} not found")
        return
      }

Actually, this situation is not an error.

Can we remove the logic for closing the SendingConnection? It's expected to be closed by itself or by ConnectionManager#stop, right?
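To make the point concrete, here is a minimal sketch of treating a missing entry as "already removed by another thread" rather than as an error. The registry below is hypothetical (string ids and string "connections"); the real ConnectionManager uses its own key and connection types.

```scala
import scala.collection.mutable

object RemoveSendingConnectionSketch {
  // Hypothetical registry keyed by a remote id string.
  private val connectionsById = mutable.HashMap[String, String]()

  def register(id: String, conn: String): Unit = connectionsById.synchronized {
    connectionsById(id) = conn
  }

  // Returns Some(connection) only for the caller that actually removed the entry.
  def removeIfPresent(id: String): Option[String] = connectionsById.synchronized {
    connectionsById.remove(id)
  }

  // A missing entry is reported as a benign condition, not as an error.
  def describeRemoval(id: String): String = removeIfPresent(id) match {
    case Some(conn) => s"removed $conn"
    case None => s"no SendingConnection for $id; assuming it was already closed"
  }

  def main(args: Array[String]): Unit = {
    register("host-a:1234", "sending-conn-1")
    println(describeRemoval("host-a:1234")) // first caller removes the entry
    println(describeRemoval("host-a:1234")) // second caller finds nothing
  }
}
```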

SparkQA · 2014-08-26T07:21:18Z

QA tests have started for PR 2019 at commit 855c207.

This patch merges cleanly.

sarutak · 2014-08-26T07:47:49Z

/CC @JoshRosen

In this PR, I want to resolve following issues.

(1) Race condition between a thread invoking ConnectionManager#stop and threads invoking Connection#close

In this case, if the thread invoking ConnectionManager#stop evaluates "connectionsByKey -= connection.key" in ConnectionManager#removeConnection() after a thread invoking Connection#close has evaluated k.cancel or channel.close in Connection#close(), the warning message "All connections not cleaned up" appears, because the key is already null when "connectionsByKey -= connection.key" is evaluated.

(2) Race condition between a thread invoking SendingConnection#close and a thread invoking SendingConnection#close after invoking ReceivingConnection#close

In this case, if the thread invoking ReceivingConnection#close evaluates "!sendingConnectionOpt.isDefined" in ConnectionManager#removeConnection after the thread invoking SendingConnection#close has evaluated "connectionsById -= sendingConnectionManagerId" in ConnectionManager#removeConnection, "!sendingConnectionOpt.isDefined" is true and the error message "Corresponding SendingConnection to ${remoteConnectionManagerId} not found" appears.

(3) Race condition between a thread invoking ConnectionManager#run and threads invoking Connection#close

In this case, if the thread invoking ConnectionManager#run evaluates "! key.invalid" after a thread invoking Connection#close has evaluated key.cancel, "! key.invalid" is true and an error message related to CancelledKeyException appears.
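As a small illustration of case (3), the sketch below shows how a cancelled SelectionKey behaves in plain java.nio, outside of Spark. CancelledKeySketch and safeInterestOps are hypothetical names; the only assumption about the real code is that it reads or updates interest ops on keys that another thread may cancel.

```scala
import java.nio.channels.{CancelledKeyException, SelectionKey, Selector, ServerSocketChannel}

object CancelledKeySketch {
  // Reads the interest ops defensively: None means the key was cancelled.
  def safeInterestOps(key: SelectionKey): Option[Int] = {
    if (!key.isValid) {
      None
    } else {
      try {
        Some(key.interestOps())
      } catch {
        // The key can still be cancelled between the isValid check and this call.
        case _: CancelledKeyException => None
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val selector = Selector.open()
    val server = ServerSocketChannel.open()
    try {
      // Registration requires non-blocking mode; the channel need not be bound.
      server.configureBlocking(false)
      val key = server.register(selector, SelectionKey.OP_ACCEPT)
      println(safeInterestOps(key)) // key is valid here
      key.cancel()                  // what Connection#close does from another thread
      println(safeInterestOps(key)) // key is no longer valid
    } finally {
      server.close()
      selector.close()
    }
  }
}
```

The validity check alone narrows the window but cannot close it, which is why the catch clause is still needed.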

SparkQA · 2014-08-26T08:16:57Z

QA tests have finished for PR 2019 at commit 855c207.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2014-08-26T11:10:57Z

QA tests have started for PR 2019 at commit 4eee6c9.

This patch merges cleanly.

SparkQA · 2014-08-26T12:05:42Z

QA tests have finished for PR 2019 at commit 4eee6c9.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2014-08-26T17:05:10Z

core/src/main/scala/org/apache/spark/network/ConnectionManager.scala

@@ -280,42 +280,46 @@ private[spark] class ConnectionManager(
        }

        while(!keyInterestChangeRequests.isEmpty) {
+          // Expect key interested in OP_ACCEPT is not change its interest


Not sure I understand what the comment is trying to say.

If the key for OP_ACCEPT enters this loop, connectionsByKey.getOrElse(key, null) returns null, so this logic ignores OP_ACCEPT. I'll refine the comment.
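A rough sketch of that behaviour, with hypothetical simplified types (string keys and a FakeConnection that only records its ops) standing in for the real SelectionKey and Connection classes:

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import scala.collection.mutable

object InterestChangeSketch {
  // Hypothetical simplified connection: it only records the requested ops.
  final class FakeConnection { @volatile var ops: Int = 0 }

  val connectionsByKey = mutable.HashMap[String, FakeConnection]()
  val keyInterestChangeRequests = new ConcurrentLinkedQueue[(String, Int)]()

  // Applies queued requests; returns how many were skipped for lack of a connection.
  def drainRequests(): Int = {
    var skipped = 0
    var req = keyInterestChangeRequests.poll() // poll returns null once the queue is empty
    while (req != null) {
      val (key, ops) = req
      connectionsByKey.get(key) match {
        case Some(conn) => conn.ops = ops
        // No connection is registered for this key, e.g. the OP_ACCEPT key
        // or a connection that has already been removed, so it is ignored.
        case None => skipped += 1
      }
      req = keyInterestChangeRequests.poll()
    }
    skipped
  }
}
```

Using get with a pattern match expresses the same "ignore unknown keys" rule as getOrElse(key, null) followed by a null check.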

sarutak changed the title ~~[SPARK-3106] Suppress unwilling CancelledKeyException, ClosedChannelException and error messages caused by SendingConnection~~ (WIP)[SPARK-3106] Suppress unwilling CancelledKeyException, ClosedChannelException and error messages caused by SendingConnection Aug 18, 2014

sarutak changed the title ~~(WIP)[SPARK-3106] Suppress unwilling CancelledKeyException, ClosedChannelException and error messages caused by SendingConnection~~ [SPARK-3106] Suppress unwilling CancelledKeyException, ClosedChannelException and error messages caused by SendingConnection Aug 19, 2014

sarutak changed the title ~~[SPARK-3106] Suppress unwilling CancelledKeyException, ClosedChannelException and error messages caused by SendingConnection~~ [SPARK-3106] *Race Condition Issue* Fix the order of closing resources when Connection is closed Aug 19, 2014

Improve error message when facing CancelledKeyException

f974132

sarutak force-pushed the SPARK-3106 branch from ce07ae5 to 5f91c8d August 21, 2014 08:45

sarutak force-pushed the SPARK-3106 branch from 5f91c8d to 2d7f444 August 21, 2014 09:00

sarutak added 2 commits August 21, 2014 23:29

Removed unreachable code

c3a4bc3

Modified to avoid throwing NPE during logging error message

4dc6050

mridulm reviewed Aug 23, 2014

sarutak added 7 commits August 25, 2014 00:39

Added mkAddressInfoStringByKey, getRemoteSocketAddressByKey and getLo…

078a908

…calSocketAddressByKey Modified logging message when CancelledKeyException is thrown

Merge branch 'SPARK-3171' of github.com:sarutak/spark into SPARK-3106

9fe5e47

Modified closing order of selector in ConnectionManager#stop to preve…

9c854cb

…nt unwilling CancelledKeyException

Modified resource closing order of resources in Connection#close to a…

02ae5a7

…void unwilling CancelledKeyException and key cancellation before "connectionsByKey -= connection.key" logic in ConnectionManager#removeConnection

Removed the logic which close SendingConnection when corresponding Re…

82a27ab

…ceivingConnection is closed

Modified Connection#close to reduce the effect of multiple closing by…

f279274

… multiple threads

Modified ConnectionManager to avoid race conditions

855c207

sarutak force-pushed the SPARK-3106 branch from 814692c to 855c207 August 26, 2014 07:16

Add synchronization logic to Connection.scala and ConnectionManager.s…

4eee6c9

…cala to avoid race condition

vanzin reviewed Aug 26, 2014

sarutak closed this Nov 10, 2014

sarutak deleted the SPARK-3106 branch April 11, 2015 05:21
