Semi-reliable deadlocking when combining spawn_async (or status_async) with buffered or buffer_unordered #42
Comments
Ok, so I've done some digging. The underlying issue is either in tokio-signal or PollEvented, not sure. Basically, there's a self-pipe which gets filled by a message handler, and sometimes a successful write happens but no one gets notified:
For comparison, a successful run looks like this:
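For context on the mechanism being described above: tokio-signal's handler writes into a self-pipe, and the event loop is supposed to wake up when the read end becomes readable. Here is a hypothetical, simplified sketch of the general self-pipe trick (this is not tokio-signal's actual code):

```rust
// Hypothetical sketch of the self-pipe trick in general: the signal handler
// does the only async-signal-safe thing it can -- write(2) one byte into a
// pipe -- and the "event loop" side learns about the signal when the read
// end becomes readable.
extern crate libc;

use std::sync::atomic::{AtomicIsize, Ordering};

// Write end of the self-pipe, stashed where the signal handler can reach it.
static WRITE_FD: AtomicIsize = AtomicIsize::new(-1);

extern "C" fn on_signal(_signum: libc::c_int) {
    let fd = WRITE_FD.load(Ordering::SeqCst) as libc::c_int;
    // "Fill the self-pipe": write(2) is async-signal-safe.
    unsafe { libc::write(fd, b"x".as_ptr() as *const libc::c_void, 1) };
}

fn main() {
    // Create the pipe and remember the write end for the handler.
    let mut fds = [0 as libc::c_int; 2];
    assert_eq!(unsafe { libc::pipe(fds.as_mut_ptr()) }, 0);
    let (read_fd, write_fd) = (fds[0], fds[1]);
    WRITE_FD.store(write_fd as isize, Ordering::SeqCst);

    // Install the handler and deliver a signal to ourselves.
    let handler: extern "C" fn(libc::c_int) = on_signal;
    unsafe {
        libc::signal(libc::SIGUSR1, handler as libc::sighandler_t);
        libc::raise(libc::SIGUSR1);
    }

    // The event-loop side: wait for the read end to become readable, then
    // drain it. In tokio-signal this is the part handled by mio/PollEvented,
    // and it is this registration (not the write) that goes wrong in this issue.
    let mut pfd = libc::pollfd {
        fd: read_fd,
        events: libc::POLLIN,
        revents: 0,
    };
    assert_eq!(unsafe { libc::poll(&mut pfd, 1, -1) }, 1);
    let mut buf = [0u8; 1];
    unsafe { libc::read(read_fd, buf.as_mut_ptr() as *mut libc::c_void, 1) };
    println!("signal observed via the self-pipe");
}
```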
Hey @christophebiocca, thanks for the report and for the detailed logs and repro! I tried to play around with your example and I'm definitely getting consistent deadlocks too. I even tried spawning two commands directly into I also tried putting a Unfortunately, I'm not very familiar with the
cc @alexcrichton, do you have any insight as to what could be going on here?
Hm this does indeed look suspicious! I can't seem to reproduce locally though :( @christophebiocca should
Additionally some
@alexcrichton I was able to repro the issue pretty consistently on macOS. It also appears reproducible on a linux VM, though it took me a considerable number of attempts before it occurred. Strace logs of when the issue occurs: https://gist.github.com/ipetkov/93158a6d2ce8c256a4688b0f12e16bb6
Hm curious! The hang looks like...
where I believe that means the file descriptor that the signal handler is writing to was deregistered from the event loop before we wrote data to it. As to why that's happening I'm not entirely sure; it may be related to the reactor changes in Tokio over the past few months (which I haven't been keeping up with). IIRC tokio-signal has some restrictions where the main event loop has to be kept alive, and something may be getting mixed up. An epoll set may be getting destroyed if an event loop is going away when it's not expected to go away.
Thanks for pointing me in the right direction @alexcrichton! I don't think the event loop is to blame here, but rather how the file descriptor gets registered with it. In the failed strace log we can see
Clearly the same fd is being registered twice with the event loop and then being dropped once which causes the starvation. Looking at a successful run we can see the same fd being added, removed, added, and removed again from the event loop
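To make that failure mode concrete, here is an illustrative sketch (assuming mio 0.6 on Linux; this is not code from the thread) of what registering one fd twice looks like at the mio level:

```rust
// Registering the same raw file descriptor with a mio Poll twice: on Linux
// the second EPOLL_CTL_ADD fails with EEXIST, which surfaces as
// io::ErrorKind::AlreadyExists. If a caller swallows that error, the second
// handle is never actually wired up and will never see readiness events.
extern crate mio;

use mio::unix::EventedFd;
use mio::{Poll, PollOpt, Ready, Token};
use std::net::UdpSocket;
use std::os::unix::io::AsRawFd;

fn main() -> std::io::Result<()> {
    let socket = UdpSocket::bind("127.0.0.1:0")?;
    let fd = socket.as_raw_fd();
    let poll = Poll::new()?;

    // First registration of the fd succeeds.
    poll.register(&EventedFd(&fd), Token(0), Ready::readable(), PollOpt::edge())?;

    // Second registration of the *same* fd is rejected by epoll.
    match poll.register(&EventedFd(&fd), Token(1), Ready::readable(), PollOpt::edge()) {
        Ok(()) => println!("unexpectedly registered the same fd twice"),
        Err(e) => println!("second registration failed as expected: {:?}", e),
    }
    Ok(())
}
```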
Basically I've opened an issue upstream.
Awesome diagnosis @ipetkov, thanks for tracking this down!
Here's an update on this issue: alexcrichton/tokio-signal#37 and alexcrichton/tokio-signal#41 appear to massively alleviate the starvation here; however, with enough persistence in testing, I'm still able to observe deadlocks happening on a linux VM (I was not able to observe them on macOS). I'm currently exploring refactoring tokio-signal.
So a new release of tokio-signal is out. @christophebiocca can you see if the issue has gone away for you? I was exploring an architectural change in alexcrichton/tokio-signal#43 which may have improved things, but it turned out to be a dead-end based on some
I've upgraded all dependencies, making sure to specify the latest versions for tokio-signal and tokio-process. I'm still seeing the deadlock reliably. I've updated the MVCE repo to match what I'm using locally. I've tried this on two different machines to rule out hardware issues (the other machine has the same issue, but with a much lower frequency). Given that you're having issues recreating the issue, is there anything I can do to help debug it from my end?
Here are two
Thanks for the update @christophebiocca! I just realized I had made some local changes to your example when testing for compatibility with the My hunch is that there is some lack of synchronization happening when running outside of
Hi @christophebiocca I think I have a fix for this issue! Can you add the following to your Cargo.toml?

```toml
[replace]
"tokio-signal:0.2.4" = { git = "https://github.com/ipetkov/tokio-signal", rev = "fresh-pipe" }
```
* Now that we have a regression test for alexcrichton#42, whose fix landed in tokio-signal 0.2.5, we should make sure we don't try to build with an earlier version than that
Alright, so the fix has landed in tokio-signal 0.2.5.
Testing on my side indicates this is resolved. Thank you.
* Originally reported in alexcrichton/tokio-process#42
* The root cause appears to be due to two different PollEvented instances trying to consume readiness events from the same file descriptor.
* Previously we would simply swallow any `AlreadyExists` errors when attempting to register the pipe receiver with the event loop. I'm not sure if this means the PollEvented wrapper wasn't fully registered to receive events, or maybe there is a potential race condition with how PollEvented consumes mio readiness events. Using a fresh/duplicate file descriptor appears to mitigate the issue, however (see the sketch below).
* I was also not able to reproduce the issue as an isolated test case, so there is no regression test available within this crate (but we can add one in tokio-process)
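A rough sketch of the general idea behind that mitigation, assuming the pipe's read end is a plain Unix fd (this is not the actual tokio-signal patch):

```rust
// Instead of letting two consumers fight over readiness events for one
// descriptor, hand each consumer its own dup(2) of the pipe's read end so
// every PollEvented-style wrapper registers a distinct fd with the event loop.
extern crate libc;

use std::fs::File;
use std::io;
use std::os::unix::io::{FromRawFd, RawFd};

fn duplicated_read_end(fd: RawFd) -> io::Result<File> {
    // dup(2) yields a new descriptor referring to the same underlying pipe,
    // but it gets its own entry when registered with epoll.
    let new_fd = unsafe { libc::dup(fd) };
    if new_fd < 0 {
        return Err(io::Error::last_os_error());
    }
    // Safety: we own the freshly duplicated descriptor.
    Ok(unsafe { File::from_raw_fd(new_fd) })
}

fn main() -> io::Result<()> {
    // Create a pipe and make a second, independent handle to its read end.
    let mut fds = [0 as libc::c_int; 2];
    if unsafe { libc::pipe(fds.as_mut_ptr()) } != 0 {
        return Err(io::Error::last_os_error());
    }
    let fresh = duplicated_read_end(fds[0])?;
    println!("original read fd: {}, duplicated handle: {:?}", fds[0], fresh);
    Ok(())
}
```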
There is a deadlock fixed in 0.2.5 which is hopefully the deadlock I have seen in sccache. The last thing printed in the logs is executing a tokio-process so it seems plausible. alexcrichton/tokio-process#42
Basically if I try to run multiple processes in parallel using `buffered` or `buffer_unordered`, the last process to run never completes. This seems independent of io piping or which particular command is being run. It seems to happen around 2/3rds of the time. There's a nonzero chance that this is actually a bug in futures itself, but so far I haven't been able to recreate it without involving tokio-process.
MVCE: https://github.com/christophebiocca/tokio-process-deadlock-example
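The actual MVCE lives in the repository linked above; purely as an illustration of the shape being described, and assuming tokio 0.1 / futures 0.1 and tokio-process 0.2's `CommandExt::spawn_async` returning `io::Result<Child>` (treat the exact signatures as approximate), the pattern is roughly:

```rust
// A minimal sketch of the reproduction shape, NOT the actual MVCE: spawn a
// batch of children and drive a few of them at a time with buffer_unordered.
// With the affected versions, the last child's future sometimes never
// completes.
extern crate futures;
extern crate tokio;
extern crate tokio_process;

use futures::{future, stream, Future, Stream};
use std::process::Command;
use tokio_process::CommandExt;

fn main() {
    tokio::run(future::lazy(|| {
        stream::iter_ok::<_, std::io::Error>(0..10)
            // Spawn each child as a future resolving to its exit status.
            .map(|_i| Command::new("true").spawn_async().expect("failed to spawn"))
            // Run up to four children concurrently.
            .buffer_unordered(4)
            .collect()
            .map(|statuses| println!("exit statuses: {:?}", statuses))
            .map_err(|e| panic!("child error: {}", e))
    }));
}
```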
I haven't had the chance to test this on more than one machine yet, but I will soon.
Arch Linux
rustc 1.26.1 (827013a31 2018-05-25)
cargo 1.26.0 (0e7c5a931 2018-04-06)
Specific versions of all packages are in `Cargo.lock` in the example repository.