
Semi-reliable deadlocking when combining spawn_async (or status_async) with buffered or buffer_unordered #42

Closed
christophebiocca opened this issue May 30, 2018 · 17 comments


@christophebiocca

christophebiocca commented May 30, 2018

Basically, if I try to run multiple processes in parallel using buffered or buffer_unordered, the last process to run never completes. This seems independent of I/O piping or of which particular command is being run. It happens around two thirds of the time.

There's a nonzero chance that this is actually a bug in futures itself, but so far I haven't been able to recreate it without involving tokio-process.

MVCE: https://github.com/christophebiocca/tokio-process-deadlock-example
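
For readers without the MVCE handy, the shape of the program is roughly the following. This is a hypothetical sketch, not the exact repro: it assumes the futures 0.1 / tokio 0.1 era API where spawn_async returns an io::Result<Child> and Child is a future resolving to the child's ExitStatus, and it drives the stream with tokio::run even though the actual MVCE may drive it differently.

extern crate futures;
extern crate tokio;
extern crate tokio_process;

use std::process::Command;

use futures::{stream, Future, Stream};
use tokio_process::CommandExt;

fn main() {
    // Build a stream of child-exit futures.
    let children = stream::iter_ok::<_, std::io::Error>(0..4).map(|i| {
        Command::new("sh")
            .arg("-c")
            .arg(format!("echo 'I am spawned process #{}'", i))
            .spawn_async()
            .expect("failed to spawn child")
    });

    // buffer_unordered(2) keeps two children in flight at a time; on affected
    // versions, the exit of the last child is sometimes never observed and the
    // program hangs here.
    let all_done = children.buffer_unordered(2).for_each(|status| {
        println!("child exited with {:?}", status);
        Ok(())
    });

    tokio::run(all_done.map_err(|e| panic!("error waiting on children: {}", e)));
}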

I haven't had the chance to test this on more than one machine yet, but I will soon.

Arch Linux
rustc 1.26.1 (827013a31 2018-05-25)
cargo 1.26.0 (0e7c5a931 2018-04-06)

Specific versions of all packages are in Cargo.lock in the example repository.

@christophebiocca christophebiocca changed the title Semi-reliable deadlocking when combining spawn_async with buffered or buffer_unordered Semi-reliable deadlocking when combining spawn_async (or status_async) with buffered or buffer_unordered May 30, 2018
@christophebiocca
Author

christophebiocca commented May 30, 2018

Ok, so I've done some println-driven debugging to try and understand this.

The underlying issue is in either tokio-signal or PollEvented; I'm not sure which. Basically, there's a self-pipe which gets written to by the signal handler, and sometimes a successful write happens but nobody gets notified (a simplified sketch of this self-pipe mechanism follows the logs below):

[tokio-process:src/lib.rs:CommandExt::spawn_async_with_handle]: GOING TO CREATE A CHILD TO RUN
[tokio-process:src/lib.rs:CommandExt::spawn_async_with_handle]: SUCCESSFULLY CREATED A CHILD TO RUN
[tokio-process:src/lib.rs:CommandExt::spawn_async_with_handle]: GOING TO CREATE A CHILD TO RUN
[tokio-process:src/lib.rs:CommandExt::spawn_async_with_handle]: SUCCESSFULLY CREATED A CHILD TO RUN
[tokio-process:src/lib.rs:Child/Future::poll]: POLLING Child
[tokio-process:src/unix.rs:Child::poll_exit]: 0'th time through poll_exit loop on 13283
[tokio-signal:src/unix.rs:EventedReceiver/Evented::register]: REGISTERING
[tokio-signal:src/unix.rs:Driver/Future::poll]: Going to check for signals
[tokio-signal:src/unix.rs:Driver/Future::poll]: Looks like there's no signal yet
[tokio-process:src/unix.rs:Child::poll_exit]: We polled sigchld for 13283
[tokio-process:src/unix.rs:try_wait_process]: LIBC says we should wait more for 13283
[tokio-process:src/lib.rs:Child/Future::poll]: POLLING_STATUS for Child: Ok(NotReady)
[tokio-process:src/lib.rs:Child/Future::poll]: POLLING Child
I am spawned process #0
[tokio-process:src/unix.rs:Child::poll_exit]: 0'th time through poll_exit loop on 13284
[tokio-signal:src/unix.rs:EventedReceiver/Evented::register]: REGISTERING
[tokio-signal:src/unix.rs:Driver/Future::poll]: Going to check for signals
[tokio-signal:src/unix.rs:Driver/Future::poll]: Looks like there's no signal yet
[tokio-process:src/unix.rs:Child::poll_exit]: We polled sigchld for 13284
[tokio-process:src/unix.rs:try_wait_process]: LIBC says we should wait more for 13284
[tokio-process:src/lib.rs:Child/Future::poll]: POLLING_STATUS for Child: Ok(NotReady)
[tokio-signal:src/unix.rs:handler]: PIPE SIGNALLING RESULT: Ok(1)
[tokio-process:src/lib.rs:Child/Future::poll]: POLLING Child
[tokio-process:src/unix.rs:Child::poll_exit]: 0'th time through poll_exit loop on 13283
[tokio-signal:src/unix.rs:Driver/Future::poll]: Going to check for signals
[tokio-signal:src/unix.rs:Driver/Future::poll]: Found a signal
[tokio-process:src/unix.rs:Child::poll_exit]: We polled sigchld for 13283
[tokio-process:src/unix.rs:Child::poll_exit]: We got a result from 13283
[tokio-process:src/lib.rs:Child/Future::poll]: POLLING_STATUS for Child: Ok(Ready(ExitStatus(ExitStatus(0))))
Result #0: ExitStatus(ExitStatus(0))
[tokio-signal:src/unix.rs:EventedReceiver/Evented::register]: UN-REGISTERING
[tokio-process:src/lib.rs:Child/Future::poll]: POLLING Child
[tokio-process:src/unix.rs:Child::poll_exit]: 0'th time through poll_exit loop on 13284
[tokio-signal:src/unix.rs:Driver/Future::poll]: Going to check for signals
[tokio-signal:src/unix.rs:Driver/Future::poll]: Looks like there's no signal yet
[tokio-process:src/unix.rs:Child::poll_exit]: We polled sigchld for 13284
[tokio-process:src/unix.rs:try_wait_process]: LIBC says we should wait more for 13284
[tokio-process:src/unix.rs:Child::poll_exit]: We got something else than NotReady from our sigchld poll for 13284, trying again (we got Ready(Some(17)))
[tokio-process:src/unix.rs:Child::poll_exit]: 1'th time through poll_exit loop on 13284
[tokio-signal:src/unix.rs:Driver/Future::poll]: Going to check for signals
[tokio-signal:src/unix.rs:Driver/Future::poll]: Looks like there's no signal yet
[tokio-process:src/unix.rs:Child::poll_exit]: We polled sigchld for 13284
[tokio-process:src/unix.rs:try_wait_process]: LIBC says we should wait more for 13284
[tokio-process:src/lib.rs:Child/Future::poll]: POLLING_STATUS for Child: Ok(NotReady)
I am spawned process #1
[tokio-signal:src/unix.rs:handler]: PIPE SIGNALLING RESULT: Ok(1)

For comparison, a successful run looks like this:

[tokio-process:src/lib.rs:CommandExt::spawn_async_with_handle]: GOING TO CREATE A CHILD TO RUN
[tokio-process:src/lib.rs:CommandExt::spawn_async_with_handle]: SUCCESSFULLY CREATED A CHILD TO RUN
[tokio-process:src/lib.rs:CommandExt::spawn_async_with_handle]: GOING TO CREATE A CHILD TO RUN
[tokio-process:src/lib.rs:CommandExt::spawn_async_with_handle]: SUCCESSFULLY CREATED A CHILD TO RUN
[tokio-process:src/lib.rs:Child/Future::poll]: POLLING Child
[tokio-process:src/unix.rs:Child::poll_exit]: 0'th time through poll_exit loop on 13304
I am spawned process #0
[tokio-signal:src/unix.rs:EventedReceiver/Evented::register]: REGISTERING
[tokio-signal:src/unix.rs:Driver/Future::poll]: Going to check for signals
[tokio-signal:src/unix.rs:Driver/Future::poll]: Looks like there's no signal yet
[tokio-process:src/unix.rs:Child::poll_exit]: We polled sigchld for [tokio-signal:src/unix.rs:handler]: PIPE SIGNALLING RESULT: Ok(1)
13304
[tokio-process:src/unix.rs:Child::poll_exit]: We got a result from 13304
[tokio-process:src/lib.rs:Child/Future::poll]: POLLING_STATUS for Child: Ok(Ready(ExitStatus(ExitStatus(0))))
Result #0: ExitStatus(ExitStatus(0))
[tokio-signal:src/unix.rs:EventedReceiver/Evented::register]: UN-REGISTERING
[tokio-process:src/lib.rs:Child/Future::poll]: POLLING Child
[tokio-process:src/unix.rs:Child::poll_exit]: 0'th time through poll_exit loop on 13305
[tokio-signal:src/unix.rs:EventedReceiver/Evented::register]: REGISTERING
[tokio-signal:src/unix.rs:Driver/Future::poll]: Going to check for signals
[tokio-signal:src/unix.rs:Driver/Future::poll]: Found a signal
[tokio-process:src/unix.rs:Child::poll_exit]: We polled sigchld for 13305
I am spawned process #1
[tokio-process:src/unix.rs:try_wait_process]: LIBC says we should wait more for 13305
[tokio-process:src/unix.rs:Child::poll_exit]: We got something else than NotReady from our sigchld poll for 13305, trying again (we got Ready(Some(17)))
[tokio-process:src/unix.rs:Child::poll_exit]: 1'th time through poll_exit loop on 13305
[tokio-signal:src/unix.rs:Driver/Future::poll]: Going to check for signals
[tokio-signal:src/unix.rs:Driver/Future::poll]: Looks like there's no signal yet
[tokio-process:src/unix.rs:Child::poll_exit]: We polled sigchld for 13305
[tokio-process:src/unix.rs:try_wait_process]: LIBC says we should wait more for 13305
[tokio-process:src/lib.rs:Child/Future::poll]: POLLING_STATUS for Child: Ok(NotReady)
[tokio-signal:src/unix.rs:handler]: PIPE SIGNALLING RESULT: Ok(1)
[tokio-process:src/lib.rs:Child/Future::poll]: POLLING Child
[tokio-process:src/unix.rs:Child::poll_exit]: 0'th time through poll_exit loop on 13305
[tokio-signal:src/unix.rs:Driver/Future::poll]: Going to check for signals
[tokio-signal:src/unix.rs:Driver/Future::poll]: Found a signal
[tokio-process:src/unix.rs:Child::poll_exit]: We polled sigchld for 13305
[tokio-process:src/unix.rs:Child::poll_exit]: We got a result from 13305
[tokio-process:src/lib.rs:Child/Future::poll]: POLLING_STATUS for Child: Ok(Ready(ExitStatus(ExitStatus(0))))
Result #1: ExitStatus(ExitStatus(0))
[tokio-signal:src/unix.rs:EventedReceiver/Evented::register]: UN-REGISTERING
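
For readers unfamiliar with the mechanism the handler lines above refer to, the self-pipe trick looks roughly like this in isolation. This is a simplified, hypothetical sketch, not tokio-signal's actual implementation:

extern crate libc;

use std::io::Read;
use std::os::unix::io::FromRawFd;
use std::sync::atomic::{AtomicI32, Ordering};

// Write end of the self-pipe, stashed where the signal handler can reach it.
static PIPE_WRITE_FD: AtomicI32 = AtomicI32::new(-1);

extern "C" fn on_sigchld(_signum: libc::c_int) {
    // Only async-signal-safe work here: write a single byte and return.
    let fd = PIPE_WRITE_FD.load(Ordering::Relaxed);
    unsafe {
        libc::write(fd, b"\x01".as_ptr() as *const libc::c_void, 1);
    }
}

fn main() {
    unsafe {
        let mut fds = [0 as libc::c_int; 2];
        assert_eq!(libc::pipe(fds.as_mut_ptr()), 0);
        PIPE_WRITE_FD.store(fds[1], Ordering::Relaxed);

        // Install the SIGCHLD handler.
        libc::signal(libc::SIGCHLD, on_sigchld as libc::sighandler_t);

        // When the child exits, SIGCHLD is delivered and the handler fills the pipe.
        std::process::Command::new("true").spawn().unwrap();

        // A reactor would register fds[0] with epoll/kqueue and wake up when it
        // becomes readable; here we just block on a read to keep things simple.
        let mut read_end = std::fs::File::from_raw_fd(fds[0]);
        let mut buf = [0u8; 1];
        read_end.read_exact(&mut buf).unwrap();
        println!("the signal handler wrote {:?} to the self-pipe", buf);
    }
}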

@ipetkov
Collaborator

ipetkov commented Jun 1, 2018

Hey @christophebiocca, thanks for the report and for the detailed logs and repro!

I tried to play around with your example and I'm definitely getting consistent deadlocks too. I even tried spawning two commands directly into tokio without using buffered and I saw the issue happen as well, so I think this might be a bug with tokio-signal and not the Buffered future itself.

I also tried putting a ::std::thread::sleep_ms(500) statement before spawning the command and the issue went away, so there's probably some race condition being triggered which prevents us from successfully getting woken up once the second signal arrives.

Unfortunately, I'm not very familiar with the tokio-signal internals, so I don't have much of an idea of what could be wrong there. I'd recommend opening an issue with tokio-signal and getting some better insights there first!

@ipetkov
Collaborator

ipetkov commented Jun 10, 2018

cc @alexcrichton, do you have any insight as to what could be going on here?

@alexcrichton
Owner

Hm this does indeed look suspicious! I can't seem to reproduce locally though :(

@christophebiocca should cargo run just periodically deadlock in the repo above? Does it still do so for you today?

@alexcrichton
Owner

Additionally some strace logs for both a successful and failing run may be helpful as well!

@ipetkov
Collaborator

ipetkov commented Jun 28, 2018

@alexcrichton I was able to repro the issue pretty consistently on macOS. It also appears reproducible on a Linux VM, though it took a considerable number of attempts before it occurred.

Strace logs of when the issue occurs: https://gist.github.com/ipetkov/93158a6d2ce8c256a4688b0f12e16bb6
Strace logs during a normal run for comparison: https://gist.github.com/ipetkov/8686ddee006e2fc4e6c24d37e1e166a0

@alexcrichton
Owner

Hm curious! The hang looks like...

epoll_ctl(3, EPOLL_CTL_DEL, 6, {0, {u32=0, u64=0}}) = 0
...
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=17964, si_status=0, si_utime=0, si_stime=0} ---
write(7, "\1", 1)                       = 1

where I believe that means the file descriptor that the signal handler is writing to was deregistered from the event loop before we wrote data to it. As to why that's happening, I'm not entirely sure; it may be related to the reactor changes in Tokio over the past few months (which I haven't been keeping up with).

IIRC tokio-signal has some restrictions around the main event loop having to be kept alive, and something may be getting mixed up there. An epoll set may be getting destroyed because an event loop is going away when it isn't expected to.

@ipetkov
Collaborator

ipetkov commented Jul 14, 2018

Thanks for pointing me in the right direction @alexcrichton! I don't think the event loop itself is to blame here, but rather how tokio-signal interacts with it.

In the failed strace log we can see

epoll_ctl(3, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=0, u64=0}}) = 0
epoll_ctl(3, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=4194305, u64=4194305}}) = -1 EEXIST (File exists)
epoll_ctl(3, EPOLL_CTL_DEL, 6, {0, {u32=0, u64=0}}) = 0

Clearly the same fd is being registered twice with the event loop and then deregistered only once, which causes the starvation.

Looking at a successful run, we can see the same fd being added, removed, added, and removed again from the event loop:

epoll_ctl(3, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=0, u64=0}}) = 0
epoll_ctl(3, EPOLL_CTL_DEL, 6, {0, {u32=0, u64=0}}) = 0
epoll_ctl(3, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=4194304, u64=4194304}}) = 0
epoll_ctl(3, EPOLL_CTL_DEL, 6, {0, {u32=0, u64=0}}) = 0

Basically, tokio-signal unconditionally and prematurely deregisters the event pipe from the event loop whenever a signal stream goes away. Thus, if the application happens to drop its signal stream before opening another one, everything works fine; if the futures happen to run in an overlapping manner, we see the deadlock appear.

I've opened an issue in the tokio-signal repo along with a PR attempting to fix the problem.
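
To make the overlapping case concrete, here is a hypothetical illustration (not code from the issue or the MVCE), assuming the tokio-signal 0.2 API where Signal::new(signum) yields a future that resolves to a stream of delivered signals:

extern crate futures;
extern crate libc;
extern crate tokio;
extern crate tokio_signal;

use futures::{Future, Stream};
use tokio_signal::unix::Signal;

fn main() {
    let overlapping = Signal::new(libc::SIGCHLD)
        .join(Signal::new(libc::SIGCHLD))
        .and_then(|(first, second)| {
            // Both streams are registered at this point; one exiting child
            // delivers a SIGCHLD that tokio-signal broadcasts to each of them.
            std::process::Command::new("true")
                .spawn()
                .expect("failed to spawn child");

            // Each side waits for one signal. Whichever side finishes first can
            // have its Signal stream dropped while the other is still waiting;
            // on affected versions that drop deregistered the shared self-pipe
            // from the reactor, so the surviving stream could miss its wakeup.
            first
                .take(1)
                .collect()
                .join(second.take(1).collect())
        })
        .map(|_| println!("both overlapping streams observed a SIGCHLD"))
        .map_err(|e| panic!("signal error: {}", e));

    tokio::run(overlapping);
}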

@alexcrichton
Owner

Awesome diagnosis @ipetkov, thanks for tracking this down!

@ipetkov
Collaborator

ipetkov commented Jul 28, 2018

Here's an update on this issue: alexcrichton/tokio-signal#37 and alexcrichton/tokio-signal#41 appear to massively alleviate the starvation here; however, with enough persistence in testing I'm still able to observe deadlocks on a Linux VM (I was not able to observe them on macOS).

I'm currently exploring refactoring tokio-signal to use a single global Driver instead of having each signal attempt to read from the readiness pipe (we used to do this before, but it was prone to starvation if the first event loop went away; thanks to the new tokio APIs there may be a way around this). Initial prototypes appear to make this deadlock go away, so I'll keep working on a PR to tokio-signal.

@ipetkov
Collaborator

ipetkov commented Aug 19, 2018

So a new release of tokio-signal has been cut which includes the fixes mentioned above. I tried to reproduce the issue again, but I wasn't able to on either macOS or a Linux VM (FWIW there have been some tokio updates since then, in case that has alleviated anything).

@christophebiocca can you see if the issue has gone away for you?

I was exploring an architectural change in alexcrichton/tokio-signal#43 which may have improved things, but it turned out to be a dead end due to some tokio limitations. I have one last idea, which is to clone the pipe receiver handles for each signal, but it would degrade the current performance, so I want to double-check whether this is still reproducible!

@christophebiocca
Author

I've upgraded all dependencies, making sure to specify the latest versions of tokio-signal and tokio-process, and I'm still seeing the deadlock reliably. I've updated the MVCE repo to match what I'm using locally. I've tried this on two different machines to rule out hardware issues (the other machine has the same issue, but at a much lower frequency).

Given that you're having trouble recreating the issue, is there anything I can do to help debug it from my end?

@christophebiocca
Author

Here are two strace runs, one successful and the other not:

failed-trace.txt
successful-trace.txt

@ipetkov
Collaborator

ipetkov commented Aug 23, 2018

Thanks for the update @christophebiocca! I just realized I had made some local changes to your example when testing for compatibility with the tokio runtime, but after reverting them I'm able to repro!

My hunch is that there is some lack of synchronization when running outside of tokio (or rather, some extra synchronization that tokio does before waking, which alleviates the issue). I'll explore duplicating the wake pipe handle in tokio-signal and comment back here with the findings!
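
For reference, "duplicating the wake pipe handle" could look roughly like the following. This is a hypothetical sketch of the general idea, not the eventual tokio-signal patch: each consumer registers its own duplicate of the pipe's read end, so no two PollEvented instances ever share a file descriptor.

extern crate libc;

use std::io;
use std::os::unix::io::RawFd;

// Duplicate the read end of the wake pipe so that each registration with the
// reactor owns a distinct file descriptor referring to the same pipe.
fn duplicate_read_end(fd: RawFd) -> io::Result<RawFd> {
    let dup = unsafe { libc::fcntl(fd, libc::F_DUPFD_CLOEXEC, 0) };
    if dup < 0 {
        Err(io::Error::last_os_error())
    } else {
        Ok(dup)
    }
}

fn main() -> io::Result<()> {
    // Create a pipe and hand out two independent duplicates of its read end.
    let mut fds = [0 as libc::c_int; 2];
    if unsafe { libc::pipe(fds.as_mut_ptr()) } != 0 {
        return Err(io::Error::last_os_error());
    }
    let reader_a = duplicate_read_end(fds[0])?;
    let reader_b = duplicate_read_end(fds[0])?;
    println!("original read fd {}, duplicates {} and {}", fds[0], reader_a, reader_b);
    Ok(())
}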

@ipetkov
Collaborator

ipetkov commented Aug 28, 2018

Hi @christophebiocca, I think I have a fix for this issue! Can you add the following to your Cargo.toml and check if the fix works for you?

[replace]
"tokio-signal:0.2.4" = { git = "https://github.com/ipetkov/tokio-signal", rev = "fresh-pipe" }

ipetkov added a commit to ipetkov/tokio-process that referenced this issue Aug 29, 2018
ipetkov added a commit to ipetkov/tokio-process that referenced this issue Aug 30, 2018
* Now that we have a regression test for alexcrichton#42 whose fix was landed in
tokio-signal 0.2.5, we should make sure we don't try to build with an
earlier version than that
alexcrichton added a commit that referenced this issue Aug 30, 2018
@ipetkov
Collaborator

ipetkov commented Aug 30, 2018

Alright, so the fix has landed in tokio-signal 0.2.5, and we've landed a regression test here. @christophebiocca feel free to reopen if you're still experiencing issues!

@christophebiocca
Author

Testing on my side indicates this is resolved. Thank you.

carllerche pushed a commit to carllerche/tokio-signal that referenced this issue Sep 10, 2018
* Originally reported in alexcrichton/tokio-process#42
* The root cause appears to be due to two different PollEvented
instances trying to consume readiness events from the same file
descriptor.
* Previously we would simply swallow any `AlreadyExists` errors when
attempting to register the pipe receiver with the event loop. I'm not
sure if this means the PollEvented wrapper wasn't fully registered to
receive events, or maybe there is a potential race condition with how
PollEvented consumes mio readiness events. Using a fresh/duplicate file
descriptor appears to mitigate the issue, however.
* I was also not able to reproduce the issue as an isolated test case so
there is no regression test available within this crate (but we can add
one in tokio-process)
Marwes pushed a commit to Marwes/sccache that referenced this issue Oct 15, 2018
There is a deadlock fixed in 0.2.5 which is hopefully the deadlock I
have seen in sccache. The last thing printed in the logs is executing a
tokio-process so it seems plausible.

alexcrichton/tokio-process#42