
Semi-reliable deadlocking when combining spawn_async (or status_async) with buffered or buffer_unordered #42

Closed
christophebiocca opened this issue May 30, 2018 · 17 comments


@christophebiocca

christophebiocca commented May 30, 2018

Basically, if I try to run multiple processes in parallel using buffered or buffer_unordered, the last process to run never completes. This seems independent of I/O piping or of which particular command is being run. It happens around two thirds of the time.

There's a nonzero chance that this is actually a bug in futures itself, but so far I haven't been able to recreate it without involving tokio-process.

MVCE: https://github.com/christophebiocca/tokio-process-deadlock-example
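
For readers without the MVCE handy, the shape of the program is roughly the following. This is a hypothetical sketch, not the exact repro: it assumes the futures 0.1 / tokio 0.1 era API where spawn_async returns an io::Result<Child> and Child is a future resolving to the child's ExitStatus, and it drives the stream with tokio::run even though the actual MVCE may drive it differently.

extern crate futures;
extern crate tokio;
extern crate tokio_process;

use std::process::Command;

use futures::{stream, Future, Stream};
use tokio_process::CommandExt;

fn main() {
    // Build a stream of child-exit futures.
    let children = stream::iter_ok::<_, std::io::Error>(0..4).map(|i| {
        Command::new("sh")
            .arg("-c")
            .arg(format!("echo 'I am spawned process #{}'", i))
            .spawn_async()
            .expect("failed to spawn child")
    });

    // buffer_unordered(2) keeps two children in flight at a time; on affected
    // versions, the exit of the last child is sometimes never observed and the
    // program hangs here.
    let all_done = children.buffer_unordered(2).for_each(|status| {
        println!("child exited with {:?}", status);
        Ok(())
    });

    tokio::run(all_done.map_err(|e| panic!("error waiting on children: {}", e)));
}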

I haven't had the chance to test this on more than one machine yet, but I will soon.

Arch Linux
rustc 1.26.1 (827013a31 2018-05-25)
cargo 1.26.0 (0e7c5a931 2018-04-06)

Specific versions of all packages are in Cargo.lock in the example repository.

@christophebiocca christophebiocca changed the title Semi-reliable deadlocking when combining spawn_async with buffered or buffer_unordered Semi-reliable deadlocking when combining spawn_async (or status_async) with buffered or buffer_unordered May 30, 2018
@christophebiocca
Author

christophebiocca commented May 30, 2018

Ok, so I've done some println-driven debugging to try and understand this.

The underlying issue is in either tokio-signal or PollEvented; I'm not sure which. Basically, there's a self-pipe which gets written to by the signal handler, and sometimes a successful write happens but nobody gets notified (a simplified sketch of this self-pipe mechanism follows the logs below):

[tokio-process:src/lib.rs:CommandExt::spawn_async_with_handle]: GOING TO CREATE A CHILD TO RUN
[tokio-process:src/lib.rs:CommandExt::spawn_async_with_handle]: SUCCESSFULLY CREATED A CHILD TO RUN
[tokio-process:src/lib.rs:CommandExt::spawn_async_with_handle]: GOING TO CREATE A CHILD TO RUN
[tokio-process:src/lib.rs:CommandExt::spawn_async_with_handle]: SUCCESSFULLY CREATED A CHILD TO RUN
[tokio-process:src/lib.rs:Child/Future::poll]: POLLING Child
[tokio-process:src/unix.rs:Child::poll_exit]: 0'th time through poll_exit loop on 13283
[tokio-signal:src/unix.rs:EventedReceiver/Evented::register]: REGISTERING
[tokio-signal:src/unix.rs:Driver/Future::poll]: Going to check for signals
[tokio-signal:src/unix.rs:Driver/Future::poll]: Looks like there's no signal yet
[tokio-process:src/unix.rs:Child::poll_exit]: We polled sigchld for 13283
[tokio-process:src/unix.rs:try_wait_process]: LIBC says we should wait more for 13283
[tokio-process:src/lib.rs:Child/Future::poll]: POLLING_STATUS for Child: Ok(NotReady)
[tokio-process:src/lib.rs:Child/Future::poll]: POLLING Child
I am spawned process #0
[tokio-process:src/unix.rs:Child::poll_exit]: 0'th time through poll_exit loop on 13284
[tokio-signal:src/unix.rs:EventedReceiver/Evented::register]: REGISTERING
[tokio-signal:src/unix.rs:Driver/Future::poll]: Going to check for signals
[tokio-signal:src/unix.rs:Driver/Future::poll]: Looks like there's no signal yet
[tokio-process:src/unix.rs:Child::poll_exit]: We polled sigchld for 13284
[tokio-process:src/unix.rs:try_wait_process]: LIBC says we should wait more for 13284
[tokio-process:src/lib.rs:Child/Future::poll]: POLLING_STATUS for Child: Ok(NotReady)
[tokio-signal:src/unix.rs:handler]: PIPE SIGNALLING RESULT: Ok(1)
[tokio-process:src/lib.rs:Child/Future::poll]: POLLING Child
[tokio-process:src/unix.rs:Child::poll_exit]: 0'th time through poll_exit loop on 13283
[tokio-signal:src/unix.rs:Driver/Future::poll]: Going to check for signals
[tokio-signal:src/unix.rs:Driver/Future::poll]: Found a signal
[tokio-process:src/unix.rs:Child::poll_exit]: We polled sigchld for 13283
[tokio-process:src/unix.rs:Child::poll_exit]: We got a result from 13283
[tokio-process:src/lib.rs:Child/Future::poll]: POLLING_STATUS for Child: Ok(Ready(ExitStatus(ExitStatus(0))))
Result #0: ExitStatus(ExitStatus(0))
[tokio-signal:src/unix.rs:EventedReceiver/Evented::register]: UN-REGISTERING
[tokio-process:src/lib.rs:Child/Future::poll]: POLLING Child
[tokio-process:src/unix.rs:Child::poll_exit]: 0'th time through poll_exit loop on 13284
[tokio-signal:src/unix.rs:Driver/Future::poll]: Going to check for signals
[tokio-signal:src/unix.rs:Driver/Future::poll]: Looks like there's no signal yet
[tokio-process:src/unix.rs:Child::poll_exit]: We polled sigchld for 13284
[tokio-process:src/unix.rs:try_wait_process]: LIBC says we should wait more for 13284
[tokio-process:src/unix.rs:Child::poll_exit]: We got something else than NotReady from our sigchld poll for 13284, trying again (we got Ready(Some(17)))
[tokio-process:src/unix.rs:Child::poll_exit]: 1'th time through poll_exit loop on 13284
[tokio-signal:src/unix.rs:Driver/Future::poll]: Going to check for signals
[tokio-signal:src/unix.rs:Driver/Future::poll]: Looks like there's no signal yet
[tokio-process:src/unix.rs:Child::poll_exit]: We polled sigchld for 13284
[tokio-process:src/unix.rs:try_wait_process]: LIBC says we should wait more for 13284
[tokio-process:src/lib.rs:Child/Future::poll]: POLLING_STATUS for Child: Ok(NotReady)
I am spawned process #1
[tokio-signal:src/unix.rs:handler]: PIPE SIGNALLING RESULT: Ok(1)

For comparison, a successful run looks like this:

[tokio-process:src/lib.rs:CommandExt::spawn_async_with_handle]: GOING TO CREATE A CHILD TO RUN
[tokio-process:src/lib.rs:CommandExt::spawn_async_with_handle]: SUCCESSFULLY CREATED A CHILD TO RUN
[tokio-process:src/lib.rs:CommandExt::spawn_async_with_handle]: GOING TO CREATE A CHILD TO RUN
[tokio-process:src/lib.rs:CommandExt::spawn_async_with_handle]: SUCCESSFULLY CREATED A CHILD TO RUN
[tokio-process:src/lib.rs:Child/Future::poll]: POLLING Child
[tokio-process:src/unix.rs:Child::poll_exit]: 0'th time through poll_exit loop on 13304
I am spawned process #0
[tokio-signal:src/unix.rs:EventedReceiver/Evented::register]: REGISTERING
[tokio-signal:src/unix.rs:Driver/Future::poll]: Going to check for signals
[tokio-signal:src/unix.rs:Driver/Future::poll]: Looks like there's no signal yet
[tokio-process:src/unix.rs:Child::poll_exit]: We polled sigchld for [tokio-signal:src/unix.rs:handler]: PIPE SIGNALLING RESULT: Ok(1)
13304
[tokio-process:src/unix.rs:Child::poll_exit]: We got a result from 13304
[tokio-process:src/lib.rs:Child/Future::poll]: POLLING_STATUS for Child: Ok(Ready(ExitStatus(ExitStatus(0))))
Result #0: ExitStatus(ExitStatus(0))
[tokio-signal:src/unix.rs:EventedReceiver/Evented::register]: UN-REGISTERING
[tokio-process:src/lib.rs:Child/Future::poll]: POLLING Child
[tokio-process:src/unix.rs:Child::poll_exit]: 0'th time through poll_exit loop on 13305
[tokio-signal:src/unix.rs:EventedReceiver/Evented::register]: REGISTERING
[tokio-signal:src/unix.rs:Driver/Future::poll]: Going to check for signals
[tokio-signal:src/unix.rs:Driver/Future::poll]: Found a signal
[tokio-process:src/unix.rs:Child::poll_exit]: We polled sigchld for 13305
I am spawned process #1
[tokio-process:src/unix.rs:try_wait_process]: LIBC says we should wait more for 13305
[tokio-process:src/unix.rs:Child::poll_exit]: We got something else than NotReady from our sigchld poll for 13305, trying again (we got Ready(Some(17)))
[tokio-process:src/unix.rs:Child::poll_exit]: 1'th time through poll_exit loop on 13305
[tokio-signal:src/unix.rs:Driver/Future::poll]: Going to check for signals
[tokio-signal:src/unix.rs:Driver/Future::poll]: Looks like there's no signal yet
[tokio-process:src/unix.rs:Child::poll_exit]: We polled sigchld for 13305
[tokio-process:src/unix.rs:try_wait_process]: LIBC says we should wait more for 13305
[tokio-process:src/lib.rs:Child/Future::poll]: POLLING_STATUS for Child: Ok(NotReady)
[tokio-signal:src/unix.rs:handler]: PIPE SIGNALLING RESULT: Ok(1)
[tokio-process:src/lib.rs:Child/Future::poll]: POLLING Child
[tokio-process:src/unix.rs:Child::poll_exit]: 0'th time through poll_exit loop on 13305
[tokio-signal:src/unix.rs:Driver/Future::poll]: Going to check for signals
[tokio-signal:src/unix.rs:Driver/Future::poll]: Found a signal
[tokio-process:src/unix.rs:Child::poll_exit]: We polled sigchld for 13305
[tokio-process:src/unix.rs:Child::poll_exit]: We got a result from 13305
[tokio-process:src/lib.rs:Child/Future::poll]: POLLING_STATUS for Child: Ok(Ready(ExitStatus(ExitStatus(0))))
Result #1: ExitStatus(ExitStatus(0))
[tokio-signal:src/unix.rs:EventedReceiver/Evented::register]: UN-REGISTERING
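
For readers unfamiliar with the mechanism the handler lines above refer to, the self-pipe trick looks roughly like this in isolation. This is a simplified, hypothetical sketch, not tokio-signal's actual implementation:

extern crate libc;

use std::io::Read;
use std::os::unix::io::FromRawFd;
use std::sync::atomic::{AtomicI32, Ordering};

// Write end of the self-pipe, stashed where the signal handler can reach it.
static PIPE_WRITE_FD: AtomicI32 = AtomicI32::new(-1);

extern "C" fn on_sigchld(_signum: libc::c_int) {
    // Only async-signal-safe work here: write a single byte and return.
    let fd = PIPE_WRITE_FD.load(Ordering::Relaxed);
    unsafe {
        libc::write(fd, b"\x01".as_ptr() as *const libc::c_void, 1);
    }
}

fn main() {
    unsafe {
        let mut fds = [0 as libc::c_int; 2];
        assert_eq!(libc::pipe(fds.as_mut_ptr()), 0);
        PIPE_WRITE_FD.store(fds[1], Ordering::Relaxed);

        // Install the SIGCHLD handler.
        libc::signal(libc::SIGCHLD, on_sigchld as libc::sighandler_t);

        // When the child exits, SIGCHLD is delivered and the handler fills the pipe.
        std::process::Command::new("true").spawn().unwrap();

        // A reactor would register fds[0] with epoll/kqueue and wake up when it
        // becomes readable; here we just block on a read to keep things simple.
        let mut read_end = std::fs::File::from_raw_fd(fds[0]);
        let mut buf = [0u8; 1];
        read_end.read_exact(&mut buf).unwrap();
        println!("the signal handler wrote {:?} to the self-pipe", buf);
    }
}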

@ipetkov
Collaborator

ipetkov commented Jun 1, 2018

Hey @christophebiocca, thanks for the report and for the detailed logs and repro!

I tried to play around with your example and I'm definitely getting consistent deadlocks too. I even tried spawning two commands directly into tokio without using buffered and I saw the issue happen as well, so I think this might be a bug with tokio-signal and not the Buffered future itself.

I also tried putting a ::std::thread::sleep_ms(500) statement before spawning the command and the issue went away, so there's probably some race condition being triggered which prevents us from successfully getting woken up once the second signal arrives.

Unfortunately, I'm not very familiar with the tokio-signal internals, so I don't have much of an idea of what could be wrong there. I'd recommend opening an issue with tokio-signal and getting some better insights there first!

@ipetkov
Collaborator

ipetkov commented Jun 10, 2018

cc @alexcrichton, do you have any insight as to what could be going on here?

@alexcrichton
Owner

Hm this does indeed look suspicious! I can't seem to reproduce locally though :(

@christophebiocca should cargo run just periodically deadlock in the repo above? Does it still do so for you today?

@alexcrichton
Owner

Additionally some strace logs for both a successful and failing run may be helpful as well!

@ipetkov
Collaborator

ipetkov commented Jun 28, 2018

@alexcrichton I was able to repro the issue pretty consistently on macOS. It also appears reproducible on a Linux VM, though it took a considerable number of attempts before it occurred.

Strace logs of when the issue occurs: https://gist.github.com/ipetkov/93158a6d2ce8c256a4688b0f12e16bb6
Strace logs during a normal run for comparison: https://gist.github.com/ipetkov/8686ddee006e2fc4e6c24d37e1e166a0

@alexcrichton
Owner

Hm curious! The hang looks like...

epoll_ctl(3, EPOLL_CTL_DEL, 6, {0, {u32=0, u64=0}}) = 0
...
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=17964, si_status=0, si_utime=0, si_stime=0} ---
write(7, "\1", 1)                       = 1

where I believe that means the file descriptor that the signal handler is writing to was deregistered from the event loop before we wrote data to it. As to why that's happening, I'm not entirely sure; it may be related to the reactor changes in Tokio over the past few months (which I haven't been keeping up with).

IIRC tokio-signal has some restrictions around the main event loop having to be kept alive, and something may be getting mixed up there. An epoll set may be getting destroyed because an event loop is going away when it isn't expected to.

@ipetkov
Collaborator

ipetkov commented Jul 14, 2018

Thanks for pointing me in the right direction @alexcrichton! I don't think the event loop itself is to blame here, but rather how tokio-signal interacts with it.

In the failed strace log we can see

epoll_ctl(3, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=0, u64=0}}) = 0
epoll_ctl(3, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=4194305, u64=4194305}}) = -1 EEXIST (File exists)
epoll_ctl(3, EPOLL_CTL_DEL, 6, {0, {u32=0, u64=0}}) = 0

Clearly the same fd is being registered twice with the event loop and then deregistered only once, which causes the starvation.

Looking at a successful run, we can see the same fd being added, removed, added, and removed again from the event loop:

epoll_ctl(3, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=0, u64=0}}) = 0
epoll_ctl(3, EPOLL_CTL_DEL, 6, {0, {u32=0, u64=0}}) = 0
epoll_ctl(3, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=4194304, u64=4194304}}) = 0
epoll_ctl(3, EPOLL_CTL_DEL, 6, {0, {u32=0, u64=0}}) = 0

Basically, tokio-signal unconditionally and prematurely deregisters the event pipe from the event loop whenever a signal stream goes away. Thus, if the application happens to drop its signal stream before opening another one, everything works fine; if the futures happen to run in an overlapping manner, we see the deadlock appear.

I've opened an issue in the tokio-signal repo along with a PR attempting to fix the problem.
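
To make the overlapping case concrete, here is a hypothetical illustration (not code from the issue or the MVCE), assuming the tokio-signal 0.2 API where Signal::new(signum) yields a future that resolves to a stream of delivered signals:

extern crate futures;
extern crate libc;
extern crate tokio;
extern crate tokio_signal;

use futures::{Future, Stream};
use tokio_signal::unix::Signal;

fn main() {
    let overlapping = Signal::new(libc::SIGCHLD)
        .join(Signal::new(libc::SIGCHLD))
        .and_then(|(first, second)| {
            // Both streams are registered at this point; one exiting child
            // delivers a SIGCHLD that tokio-signal broadcasts to each of them.
            std::process::Command::new("true")
                .spawn()
                .expect("failed to spawn child");

            // Each side waits for one signal. Whichever side finishes first can
            // have its Signal stream dropped while the other is still waiting;
            // on affected versions that drop deregistered the shared self-pipe
            // from the reactor, so the surviving stream could miss its wakeup.
            first
                .take(1)
                .collect()
                .join(second.take(1).collect())
        })
        .map(|_| println!("both overlapping streams observed a SIGCHLD"))
        .map_err(|e| panic!("signal error: {}", e));

    tokio::run(overlapping);
}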

@alexcrichton
Owner

Awesome diagnosis @ipetkov, thanks for tracking this down!

@ipetkov
Collaborator

ipetkov commented Jul 28, 2018

Here's an update on this issue: alexcrichton/tokio-signal#37 and alexcrichton/tokio-signal#41 appear to massively alleviate the starvation here; however, with enough persistence in testing I'm still able to observe deadlocks on a Linux VM (I was not able to observe them on macOS).

I'm currently exploring refactoring tokio-signal to use a single global Driver instead of having each signal attempt to read from the readiness pipe (we used to do this before, but it was prone to starvation if the first event loop went away; thanks to the new tokio APIs there may be a way around this). Initial prototypes appear to make this deadlock go away, so I'll keep working on a PR to tokio-signal.

@ipetkov
Collaborator

ipetkov commented Aug 19, 2018

So a new release of tokio-signal has been cut which includes the fixes mentioned above. I tried to reproduce the issue again, but I wasn't able to on either macOS or a Linux VM (FWIW there have been some tokio updates since then, in case that has alleviated anything).

@christophebiocca can you see if the issue has gone away for you?

I was exploring an architectural change in alexcrichton/tokio-signal#43 which may have improved things, but it turned out to be a dead end due to some tokio limitations. I have one last idea, which is to clone the pipe receiver handles for each signal, but it would degrade the current performance, so I want to double-check whether this is still reproducible!

@christophebiocca
Author

I've upgraded all dependencies, making sure to specify the latest versions of tokio-signal and tokio-process, and I'm still seeing the deadlock reliably. I've updated the MVCE repo to match what I'm using locally. I've tried this on two different machines to rule out hardware issues (the other machine has the same issue, but at a much lower frequency).

Given that you're having trouble recreating the issue, is there anything I can do to help debug it from my end?

@christophebiocca
Author

Here are two strace runs, one successful and the other not:

failed-trace.txt
successful-trace.txt

@ipetkov
Collaborator

ipetkov commented Aug 23, 2018

Thanks for the update @christophebiocca! I just realized I had made some local changes to your example when testing for compatibility with the tokio runtime, but after reverting them I'm able to repro!

My hunch is that there is some lack of synchronization when running outside of tokio (or rather, some extra synchronization that tokio does before waking, which alleviates the issue). I'll explore duplicating the wake pipe handle in tokio-signal and comment back here with the findings!
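
For reference, "duplicating the wake pipe handle" could look roughly like the following. This is a hypothetical sketch of the general idea, not the eventual tokio-signal patch: each consumer registers its own duplicate of the pipe's read end, so no two PollEvented instances ever share a file descriptor.

extern crate libc;

use std::io;
use std::os::unix::io::RawFd;

// Duplicate the read end of the wake pipe so that each registration with the
// reactor owns a distinct file descriptor referring to the same pipe.
fn duplicate_read_end(fd: RawFd) -> io::Result<RawFd> {
    let dup = unsafe { libc::fcntl(fd, libc::F_DUPFD_CLOEXEC, 0) };
    if dup < 0 {
        Err(io::Error::last_os_error())
    } else {
        Ok(dup)
    }
}

fn main() -> io::Result<()> {
    // Create a pipe and hand out two independent duplicates of its read end.
    let mut fds = [0 as libc::c_int; 2];
    if unsafe { libc::pipe(fds.as_mut_ptr()) } != 0 {
        return Err(io::Error::last_os_error());
    }
    let reader_a = duplicate_read_end(fds[0])?;
    let reader_b = duplicate_read_end(fds[0])?;
    println!("original read fd {}, duplicates {} and {}", fds[0], reader_a, reader_b);
    Ok(())
}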

@ipetkov
Collaborator

ipetkov commented Aug 28, 2018

Hi @christophebiocca, I think I have a fix for this issue! Can you add the following to your Cargo.toml and check if the fix works for you?

[replace]
"tokio-signal:0.2.4" = { git = "https://github.com/ipetkov/tokio-signal", rev = "fresh-pipe" }

ipetkov added a commit to ipetkov/tokio-process that referenced this issue Aug 29, 2018
ipetkov added a commit to ipetkov/tokio-process that referenced this issue Aug 30, 2018
* Now that we have a regression test for alexcrichton#42 whose fix was landed in
tokio-signal 0.2.5, we should make sure we don't try to build with an
earlier version than that
alexcrichton added a commit that referenced this issue Aug 30, 2018
@ipetkov
Collaborator

ipetkov commented Aug 30, 2018

Alright, so the fix has landed in tokio-signal 0.2.5, and we've landed a regression test here. @christophebiocca feel free to reopen if you're still experiencing issues!

@christophebiocca
Author

Testing on my side indicates this is resolved. Thank you.

carllerche pushed a commit to carllerche/tokio-signal that referenced this issue Sep 10, 2018
* Originally reported in alexcrichton/tokio-process#42
* The root cause appears to be due to two different PollEvented
instances trying to consume readiness events from the same file
descriptor.
* Previously we would simply swallow any `AlreadyExists` errors when
attempting to register the pipe receiver with the event loop. I'm not
sure if this means the PollEvented wrapper wasn't fully registered to
receive events, or maybe there is a potential race condition with how
PollEvented consumes mio readiness events. Using a fresh/duplicate file
descriptor appears to mitigate the issue, however.
* I was also not able to reproduce the issue as an isolated test case so
there is no regression test available within this crate (but we can add
one in tokio-process)
Marwes pushed a commit to Marwes/sccache that referenced this issue Oct 15, 2018
There is a deadlock fixed in 0.2.5 which is hopefully the deadlock I
have seen in sccache. The last thing printed in the logs is executing a
tokio-process so it seems plausible.

alexcrichton/tokio-process#42