
pthread_cond_timedwait error with EINVAL on macos #4

Open
skywhale opened this issue Dec 10, 2020 · 10 comments

@skywhale

This happens when I run tests::bench_ipmpsc.

$ cd ipc-benchmarks
$ RUST_BACKTRACE=1 cargo bench tests::bench_ipmpsc -- --nocapture
...
running 1 test
thread 'main' panicked at 'error receiving: Runtime("timeout_ok(libc::pthread_cond_timedwait(self.0.condition.get(),\n                                        self.0.mutex.get(), &then)) failed: 22")', src/lib.rs:107:27
stack backtrace:
   0: rust_begin_unwind
             at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/std/src/panicking.rs:493:5
   1: std::panicking::begin_panic_fmt
             at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/std/src/panicking.rs:435:5
   2: test::bench::ns_iter_inner
   3: test::bench::Bencher::iter
   4: ipc_benchmarks::tests::bench_ipmpsc
   5: core::ops::function::FnOnce::call_once
   6: core::ops::function::FnMut::call_mut
             at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/core/src/ops/function.rs:150:5
   7: test::bench::Bencher::bench
             at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/test/src/bench.rs:45:9
   8: test::bench::benchmark::{{closure}}
             at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/test/src/bench.rs:192:51
   9: <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/std/src/panic.rs:322:9
  10: std::panicking::try::do_call
             at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/std/src/panicking.rs:379:40
  11: std::panicking::try
             at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/std/src/panicking.rs:343:19
  12: std::panic::catch_unwind
             at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/std/src/panic.rs:396:14
  13: test::bench::benchmark
             at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/test/src/bench.rs:192:18
  14: test::run_test
             at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/test/src/lib.rs:489:13
  15: test::run_tests
             at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/test/src/lib.rs:339:13
  16: test::console::run_tests_console
             at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/test/src/console.rs:289:5
  17: test::test_main
             at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/test/src/lib.rs:120:15
  18: test::test_main_static
             at /rustc/1700ca07c6dd7becff85678409a5df6ad4cf4f47/library/test/src/lib.rs:139:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
test tests::bench_ipmpsc            ... FAILED

The error originates at ipmpsc/src/lib.rs, lines 160 to 164 in 2b4357e:

nonzero!(timeout_ok(libc::pthread_cond_timedwait(
    self.0.condition.get(),
    self.0.mutex.get(),
    &then
)))

A quick search suggests that EINVAL is returned when tv_nsec is invalid, but that doesn't seem to be the case here. I added a println! right before the call, and I got:

tv_sec=1922927639
tv_nsec=612420000
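
For reference, the kind of sanity check meant here looks something like the sketch below (the helper name is made up and not part of ipmpsc; EINVAL is documented for an out-of-range tv_nsec, among other causes):

// Hypothetical helper: confirm the timespec passed to pthread_cond_timedwait
// is at least structurally valid, and print it, right before the call.
fn check_timespec(then: &libc::timespec) {
    assert!(then.tv_sec >= 0, "tv_sec is negative: {}", then.tv_sec);
    assert!(
        then.tv_nsec >= 0 && then.tv_nsec < 1_000_000_000,
        "tv_nsec out of range: {}",
        then.tv_nsec
    );
    println!("tv_sec={} tv_nsec={}", then.tv_sec, then.tv_nsec);
}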

Note that the error happens non-deterministically; some iterations succeed. The error does not happen on an Ubuntu machine with the same version of rustc.

Environment:
OS: macOS Catalina 10.15.7 (19H15)
rustc: nightly-2020-12-09-x86_64-apple-darwin

@skywhale
Author

skywhale commented Dec 10, 2020

@dicej I tried to investigate further, but I don't have a clue. I changed the code to give a constant abstime and it still happens. I'm more than happy to help investigate the issue if you could help shed a bit more light.

@dicej
Owner

dicej commented Dec 15, 2020

@skywhale Thanks so much for reporting this, and sorry for the delayed response. I'll take a look at this when I get a chance, probably this weekend.

@dicej
Owner

dicej commented Dec 19, 2020

I've refactored the code a bit so it's using pthread_cond_wait instead of pthread_cond_timedwait, so that at least eliminates the timespec as a factor. However, it's still failing occasionally:

thread 'main' panicked at 'error receiving: Runtime("libc::pthread_cond_wait(self.0.header().condition.get(),\n                        self.0.header().mutex.get()) failed: 22")', src/lib.rs:106:27

Seems like it doesn't like either the condition or the mutex (or both). And it's a race condition, because sometimes it runs to completion without any error. Will continue studying it.

@dicej
Owner

dicej commented Dec 19, 2020

I've also noticed that tests::aribitrary_case in the main test suite frequently deadlocks, with all threads waiting forever and never being notified. I'm guessing that's related, even if the symptom is different.

Here's my current theory: unlike on Linux, Android, and Windows, using the same interprocess mutex and/or condition variable from different memory mappings within the same process is not safe on MacOS (and maybe not on the BSDs in general). That's what the tests and benchmarks do: spawn separate threads for each receiver and sender as if they were other processes. I'm guessing there's something in the MacOS/BSD pthread implementation that assumes PTHREAD_PROCESS_SHARED mutexes and condition variables are not aliased in different memory locations within the same process, and although it usually works anyway, it doesn't always.
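
To make that scenario concrete, it has roughly the following shape (a minimal sketch under the assumptions above, not the actual test code; the path, size, and function names are illustrative, and error handling is omitted):

use std::{fs::OpenOptions, os::unix::io::AsRawFd};

// Map the shared file, treating the start of the mapping as a mutex.
unsafe fn map_mutex(fd: libc::c_int) -> *mut libc::pthread_mutex_t {
    libc::mmap(
        std::ptr::null_mut(),
        4096,
        libc::PROT_READ | libc::PROT_WRITE,
        libc::MAP_SHARED,
        fd,
        0,
    ) as *mut libc::pthread_mutex_t
}

unsafe fn demo_aliased_mappings() {
    let file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .open("/tmp/ipmpsc-demo") // placeholder path
        .unwrap();
    file.set_len(4096).unwrap();

    // Two mappings of the same file: one logical mutex, two addresses.
    let a = map_mutex(file.as_raw_fd());
    let b = map_mutex(file.as_raw_fd());

    // Initialize once, through the first mapping, as process-shared.
    let mut attr = std::mem::zeroed::<libc::pthread_mutexattr_t>();
    libc::pthread_mutexattr_init(&mut attr);
    libc::pthread_mutexattr_setpshared(&mut attr, libc::PTHREAD_PROCESS_SHARED);
    libc::pthread_mutex_init(a, &attr);

    // The theory: using the mutex through `b`, an alias of the same memory at
    // a different address, is what occasionally upsets MacOS, even though
    // POSIX allows it for PTHREAD_PROCESS_SHARED objects.
    libc::pthread_mutex_lock(b);
    libc::pthread_mutex_unlock(b);
}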

When I change the code to make the senders and receiver all use the same memory mapping, everything seems to work reliably. That defeats the purpose of the tests, though, since there's no way that can happen when using separate processes.

Next, I'm going to change tests to actually spawn separate processes instead of separate threads and see what happens.

@dicej
Owner

dicej commented Dec 23, 2020

Update: I went ahead and modified the tests and benchmarks to fork processes instead of spawning threads. Unfortunately, that didn't address the issue -- pthread_cond_wait still returns EINVAL unpredictably.

I don't see any way forward from here, unfortunately. I'll update the README to indicate that MacOS is not currently supported until someone figures out how to fix this.

@dicej
Owner

dicej commented Dec 23, 2020

This comment indicates that PTHREAD_PROCESS_SHARED has been broken in MacOS since Lion: bitcoin/bitcoin#19411 (comment).

@skywhale
Author

skywhale commented Dec 28, 2020

Thank you so much for looking into this, @dicej. It's unfortunate that macOS's POSIX compliance is only partial.

I did a bit more research, and it seems the process-shared pthread read/write lock has never been supported on macOS. Their pthread implementation seems to be based on BSD 7.4, which lacks support for the PTHREAD_PROCESS_SHARED attribute, as stated in the BUGS section of its man pages (due to the use of pointers in its structs). This was resolved in BSD 11.0, but it hasn't been picked up by the macOS SDK yet.

@remifontan

remifontan commented Jun 30, 2022

Hi,
out of curiosity I tried the repro steps on my Mac, and it is still failing with the same error on macOS 12.4.

I went down the internet rabbit hole, searching for similar pthread errors and possible alternatives. ipc-channel seems to rely on Mach ports to pass data from one process to another. Would it be possible to use Mach ports to implement an interprocess mutex? I am not familiar with this, so it may be a silly question.

Interestingly, Boost has an interprocess_mutex class, which seems to work on macOS.

However, digging a bit, it seems that on Mac the implementation falls back to using a spin mutex.

so... perhaps the same thing should be done in ipmpsc?
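
(For the sake of discussion, a process-shared spin mutex can be as simple as the sketch below. The names are made up, and a real implementation would want backoff and fairness handling; the appeal is that it holds no pointers and avoids pthreads entirely, so it can live directly in a shared memory mapping.)

use std::sync::atomic::{AtomicU32, Ordering};

// Minimal sketch of a spin mutex suitable for shared memory: a single atomic
// word, no pointers. Illustrative only, not ipmpsc or Boost code.
#[repr(C)]
pub struct SpinMutex {
    locked: AtomicU32, // 0 = unlocked, 1 = locked
}

impl SpinMutex {
    pub const fn new() -> Self {
        Self { locked: AtomicU32::new(0) }
    }

    pub fn lock(&self) {
        // Spin until we win the compare-and-swap; yield so we don't burn a
        // full core while another process holds the lock.
        while self
            .locked
            .compare_exchange_weak(0, 1, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            std::hint::spin_loop();
            std::thread::yield_now();
        }
    }

    pub fn unlock(&self) {
        self.locked.store(0, Ordering::Release);
    }
}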

@dicej
Owner

dicej commented Jun 30, 2022

Yes, I was wondering about some Mach-specific way to do locking, or else using named pipes just for locking (but still using the ring buffer for actually moving data around).

The spin lock approach could also work. We'd either need to port the Boost code to Rust or create a Rust wrapper for a C++ library that exports the Boost functionality using a C API Rust can talk to. The latter would be easiest, I imagine. Ideally this would be its own crate so others could reuse it.

Note that ipmpsc needs both a mutex and a condition variable implementation; Boost appears to support both of those in its interprocess spin lock suite.
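
(For what it's worth, the condition-variable half can also be approximated on top of atomics in shared memory, at the cost of polling. A rough sketch, with made-up names, ignoring counter overflow and wakeup-ordering subtleties:)

use std::sync::atomic::{AtomicU32, Ordering};

// Sketch of a polling "condition variable": a generation counter that waiters
// watch and notifiers bump. No pointers, so it can live in a shared mapping.
// Illustrative only, not ipmpsc or Boost code.
#[repr(C)]
pub struct SpinCondvar {
    generation: AtomicU32,
}

impl SpinCondvar {
    pub const fn new() -> Self {
        Self { generation: AtomicU32::new(0) }
    }

    pub fn notify_all(&self) {
        self.generation.fetch_add(1, Ordering::Release);
    }

    // `unlock` and `lock` stand in for whatever mutex the caller holds (for
    // example the spin mutex sketched earlier); they are assumptions here.
    pub fn wait(&self, unlock: impl Fn(), lock: impl Fn()) {
        let seen = self.generation.load(Ordering::Acquire);
        unlock();
        while self.generation.load(Ordering::Acquire) == seen {
            std::thread::yield_now();
        }
        lock();
    }
}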

I don't expect I'll have time to work on this myself any time soon, but I'd be happy to review a PR.

@remifontan

I'll see what I can do... no promises :-)
