CompletableFuture might complete in unexpected thread - Copycat gets unreliable #75
Comments
Just for a quick illustration of the problem: Basically this is what happens:
Output is:
That means that the Function we pass in `thenCompose` is executed on the main thread, but copycat expects it to be executed on the same thread as the one writing "10: Complete!".
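The snippet and its output from this comment are not preserved here. As a rough stand-in (not the original code), the following Java sketch shows the same effect with plain JDK classes: the function passed to `thenCompose` runs on the completing thread if it was registered before completion, and on the registering thread if the future was already complete. The class name `HandlerThreadDemo` and the thread name "worker" are made up for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class HandlerThreadDemo {
    /** Returns the name of the thread that ran the function passed to thenCompose. */
    static String handlerThread(boolean completeBeforeRegistering) {
        // Hypothetical stand-in for the single server thread.
        ExecutorService worker = Executors.newSingleThreadExecutor(r -> new Thread(r, "worker"));
        try {
            CompletableFuture<String> source = new CompletableFuture<>();
            if (completeBeforeRegistering) {
                // Complete on the worker thread and wait until that has happened.
                CompletableFuture.runAsync(() -> source.complete("done"), worker).join();
            }
            // Register the dependent stage from the calling thread.
            CompletableFuture<String> result = source.thenCompose(
                    v -> CompletableFuture.completedFuture(Thread.currentThread().getName()));
            if (!completeBeforeRegistering) {
                // Complete on the worker thread after the stage was registered.
                CompletableFuture.runAsync(() -> source.complete("done"), worker).join();
            }
            return result.join();
        } finally {
            worker.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("registered first: " + handlerThread(false));
        System.out.println("completed first:  " + handlerThread(true));
    }
}
```

Run from `main`, the first line should report the "worker" thread and the second the "main" thread, which is the situation described in this comment.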
The Copycat client internals have been completely rewritten, but I think we still do need to do some review of that code before a full release to make sure this issue is not present. I've seen some non-intuitive behavior in @bgloeckle correct me if I'm wrong, but essentially what you're seeing is if the main thread gets to the A while back, we actually had similar issues with I think I may just want to do the same thing and replace I will go through and check all the uses of this method. If this pattern isn't being used in the new client I'll close this.
Yes, that's what I was seeing (I think that was on the "beta5" release). I have not yet checked all the code again for the newer releases (I'm just in the process of switching diqube to copycat rc2), but essentially all the methods that create the "pipeline" that will be executed as soon as a CompletableFuture is completed should themselves take care of being executed on the correct thread. I think it's a good idea to have a single utility class for that in copycat.
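A single utility class of the kind suggested here could look roughly like the sketch below. It is only an illustration: `ContextFutures` and `composeOn` are hypothetical names, not part of Copycat or Catalyst, and a plain single-thread executor named "context" stands in for the server thread. The sketch relies on the JDK's `thenComposeAsync(fn, executor)`, which always runs the function on the given executor regardless of which thread completed the future.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical helper; not part of Copycat or Catalyst.
final class ContextFutures {
    private ContextFutures() {}

    /** Like thenCompose, but always runs fn on the given executor. */
    static <T, U> CompletableFuture<U> composeOn(
            CompletableFuture<T> future,
            Function<? super T, ? extends CompletionStage<U>> fn,
            Executor executor) {
        return future.thenComposeAsync(fn, executor);
    }

    /** Composes onto an already completed future and reports where fn ran. */
    static String demo() {
        // Stand-in for the single server thread (assumption for this sketch).
        ExecutorService context = Executors.newSingleThreadExecutor(r -> new Thread(r, "context"));
        try {
            CompletableFuture<String> alreadyDone = CompletableFuture.completedFuture("opened");
            return composeOn(
                    alreadyDone,
                    v -> CompletableFuture.completedFuture(Thread.currentThread().getName()),
                    context).join();
        } finally {
            context.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Note that this only pins the composed function itself to the executor; any stages chained onto the returned future would need the same treatment.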
Hi!
Copycat uses the class `CompletableFuture` heavily and expects the handlers along those "pipelines" to be called on the same thread on which the `complete` function was called. For example, the server classes (e.g. ServerState, *State, ServerContext, ...) expect to be called only on the single thread that is created in a `SingleThreadContext` in the constructor of `ServerContext`. With this assumption, Copycat has much simpler code throughout these classes, because it does not need to take any care of multithreading. Jordan and I already discussed this in a thread in the Google Group: https://groups.google.com/d/msg/copycat/p9j8I0SRw3M/xR7fwplvCwAJ.

Exactly the same example that I talk about in the mail thread has now come up again in my implementation in diqube (diqube internally uses copycat to have a reliable way to distribute some internal data across the cluster).
In the logs of some of diqube's tests, the following lines started to show up sometimes, and in those cases the copycat server did not start up correctly:
The first line is logged by diqube code that tries to start up the copycat server. The server then starts up without any other nodes in the cluster (single-node setup in those tests). Note that in the last line, copycat's "ServerState" logs that it identified a single-member cluster, but after that there are no more logs from the server at all (in the whole log of that test). That last log entry is made on a different thread, though: "main" instead of the copycat server thread. This should never happen, as copycat expects the `ServerState#join()` method (which logs that line) to be called on the copycat server thread; in this case it is not.

It is slightly hard to debug this, as most of the time it works just fine (and `ServerState#join()` is executed not on the main thread but on the copycat server thread). Therefore I can only guess that the `ServerState#transition()` method calls `#checkThread()`, which in turn throws an `IllegalStateException`. I cannot see that exception in the log, though; diqube continues executing the startup of the test on the "main" thread. That exception is swallowed, and I guess it is swallowed in `CompletableFuture`
somewhere.

So, I dug into why that `ServerState#join()` method is sometimes called on the wrong thread. First of all: diqube's implementation of the catalyst server completes a call to the `#listen()` method very quickly, and the future returned by that method is completed quickly as well. Now, I'm pretty sure that this is what happens:

1. `#open` method on CopycatServer
2. `openFuture = context.open().thenCompose(completionFunction);` (see here)
3. `context.open()` is executed first, therefore `ServerContext#open()` is executed (see here)
4. `#listen()` on the catalyst server. The future that is returned by `#open()` is completed as soon as #listen replies.
5. `ServerContext#open()` - this means it installs the `whenComplete` handler on the future.
6. `ServerContext#open()` completes its future, too (after creating the new ServerState, line 85)
7. `whenComplete` of the result future of `ServerContext#open()` is executed and logs "Server started successfully" (this is logged in the correct thread above!)
8. `ServerContext#open()` finishes (because everything is done)
9. `ServerContext#open()` returns the (already completed) future
10. `CopycatServer` receives the result of `context.open()` and then executes `thenCompose(completionFunction);`
11. `completionFunction` can only be called on the current thread (= "main") by the CompletableFuture, which it does -> the completionFunction is executed on thread "main".

This shows that copycat cannot rely on the handlers that are registered on a CompletableFuture (like `thenCompose`, `whenComplete`, ...) to be called on the same thread that called the `complete` method on the CompletableFuture. This might work "most of the time", but that is most probably not good enough.

I have an immediate solution for my concrete problem: I can simply delay the completion of the future that is returned by my server's #listen call.
But I think before any release candidates or final releases of copycat can be created, the usage of all CompletableFutures throughout the whole codebase (copycat and catalyst) has to be inspected and all such problems have to be fixed, so that copycat does not rely on the timing of when CompletableFutures are completed. (EDIT: I understood the Google Group thread to mean that copycat usually assumes that the handlers are called on the same thread. As this is obviously not true, I think all occurrences have to be inspected. If copycat does not usually assume this, then this issue might just be a single bug, not a general implementation pattern issue.)
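The guess above that the `IllegalStateException` disappears inside `CompletableFuture` can be checked with a small JDK-only sketch. It assumes a thread check that throws when called on the wrong thread; the `checkThread` method here is a hypothetical stand-in, not Copycat's implementation, and "copycat-server" is an invented thread name. An exception thrown inside the function passed to `thenCompose` is captured in the returned future, so the caller sees nothing unless it inspects that future.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

class SwallowedExceptionDemo {
    // Hypothetical stand-in for a thread check; throws if called on another thread.
    static void checkThread(String expected) {
        String actual = Thread.currentThread().getName();
        if (!actual.equals(expected)) {
            throw new IllegalStateException("expected " + expected + " but was " + actual);
        }
    }

    static CompletableFuture<Void> open() {
        // Already completed, as when the listen future finishes very quickly.
        CompletableFuture<Void> contextOpened = CompletableFuture.completedFuture(null);
        return contextOpened.thenCompose(v -> {
            checkThread("copycat-server"); // throws here, on the calling thread
            return CompletableFuture.<Void>completedFuture(null);
        });
    }

    /** Returns "ok" or the simple class name of the failure cause. */
    static String describe(CompletableFuture<?> future) {
        try {
            future.join();
            return "ok";
        } catch (CompletionException e) {
            return e.getCause().getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        CompletableFuture<Void> future = open(); // returns normally, nothing is thrown or logged
        System.out.println(future.isCompletedExceptionally());
        System.out.println(describe(future));
    }
}
```

The call to `open()` returns without any visible error; the failure only shows up when the returned future is joined or given an exception handler.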
It could be a solution to do `context.executor().execute(() -> ...)` in each result handler again, to simply ensure that everything is executed on the correct thread. Or all those methods that rely on being executed on a specific thread (that's at least all the methods in *State, ServerContext etc.) do that themselves.
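The first option could be sketched as follows. This is an illustration only: a plain single-thread `ExecutorService` stands in for `context.executor()`, and `RedispatchDemo`, `joinOnContext` and the thread name "copycat-server" are invented for the example. The handler itself may run on any thread, but it does nothing except hand the real work to the executor.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class RedispatchDemo {
    /** Registers a handler that hops onto the executor before doing any work. */
    static CompletableFuture<String> joinOnContext(CompletableFuture<Void> opened,
                                                   Executor contextExecutor) {
        CompletableFuture<String> joined = new CompletableFuture<>();
        opened.whenComplete((value, error) -> {
            // This lambda may run on any thread; the work is re-dispatched.
            contextExecutor.execute(() -> {
                if (error != null) {
                    joined.completeExceptionally(error);
                } else {
                    // Here the thread-confined server logic would run.
                    joined.complete(Thread.currentThread().getName());
                }
            });
        });
        return joined;
    }

    static String demo() {
        // Stand-in for context.executor() (assumption for this sketch).
        ExecutorService contextExecutor =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "copycat-server"));
        try {
            // Already completed, so whenComplete fires on the calling thread.
            CompletableFuture<Void> opened = CompletableFuture.completedFuture(null);
            return joinOnContext(opened, contextExecutor).join();
        } finally {
            contextExecutor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Even though the future is already complete when the handler is registered, the work still ends up on the executor's thread, at the cost of one extra task submission per handler.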