[CI][R] mutex lock failed: Invalid argument on MacOS 10.13 on process shutdown #15189

Closed
Tracked by #14829
paleolimbot opened this issue Jan 4, 2023 · 8 comments · Fixed by #33613

Comments

@paleolimbot
Member

Describe the bug, including details regarding any error messages, version, and platform.

On the macOS 10.13 runners, the latest nightly build ( https://github.com/ursacomputing/crossbow/actions/runs/3827980013/jobs/6513063333#step:7:22433 ) fails with:

libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument

Jacob did some research, and it looks like this may be an attempt to lock a mutex after it has already been destroyed ( https://stackoverflow.com/questions/66773247/libcabi-dylib-terminating-with-uncaught-exception-of-type-std-1system-er ).
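
To make that hypothesis concrete, here is a minimal sketch of the ordering problem. It does not reproduce the crash: locking a destroyed mutex is undefined behavior, so the sketch only records what a destructor would find. `LockTable`, `Client`, and `ObservedAtTeardown` are hypothetical names invented for illustration, not Arrow, curl, or LibreSSL APIs.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a library-owned lock table. It only tracks
// whether it is still alive through an external flag.
struct LockTable {
  explicit LockTable(bool* alive) : alive_(alive) { *alive_ = true; }
  ~LockTable() { *alive_ = false; }
  bool* alive_;
};

// Hypothetical stand-in for a client whose destructor needs the lock table.
struct Client {
  Client(const bool* lock_alive, std::vector<std::string>* log)
      : lock_alive_(lock_alive), log_(log) {}
  ~Client() {
    // A real client would lock a mutex here. This sketch only records
    // whether the lock table still exists at that point.
    log_->push_back(*lock_alive_ ? "lock table alive"
                                 : "lock table already destroyed");
  }
  const bool* lock_alive_;
  std::vector<std::string>* log_;
};

// Destroys the two objects in the requested order and returns what the
// client's destructor observed.
std::string ObservedAtTeardown(bool destroy_lock_table_first) {
  bool alive = false;
  std::vector<std::string> log;
  auto* table = new LockTable(&alive);
  auto* client = new Client(&alive, &log);
  if (destroy_lock_table_first) {
    delete table;
    delete client;
  } else {
    delete client;
    delete table;
  }
  assert(log.size() == 1);  // exactly one Client destructor has run
  return log.front();
}
```

Calling `ObservedAtTeardown(true)` models the suspected failure: the client is torn down after the lock it depends on is gone. At process exit the order is decided by the runtime rather than by the caller, which is why this class of bug tends to show up only at shutdown.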

Unfortunately, I don't have an Apple Clang 12 environment locally, so I'm not able to reproduce this at the moment. Perhaps even more unfortunately, macOS 10.13 is the environment CRAN uses to build the R package binary that most Intel Mac users will use.

Component(s)

C++, R

@westonpace
Member

Do we have any idea when this started happening?

@paleolimbot
Member Author

Unfortunately not: that CI job had been segfaulting for several months, and the fix was only recently discovered (#14582) 😞

@paleolimbot
Member Author

OK, I'm able to build and check the R package on 10.13 locally using my wife's old laptop (which I highly doubt saw this coming as its retirement plan). To get started, I replicated the build environment I use locally, which is basically build + install + set ARROW_HOME. I had to install OpenSSL from source to get that to work, and I couldn't get snappy to build (error: invalid output constraint '=@ccz' in asm) or GCS to link into the R package. Without those two, I wasn't able to replicate the segfault in R CMD check: the check passed without warnings or errors.

I imagine the next step is to try the autobrew script...my first attempt at that errored in a way I didn't understand but it deserves another shot.

@paleolimbot
Member Author

As per Jacob's guidance, I should be running arrow/ci/scripts/r_test.sh whilst copying the brew formulas into place ( https://github.com/ursacomputing/crossbow/actions/runs/3827980013/jobs/6513063333#step:3:9 )

@paleolimbot
Member Author

I have a backtrace!

bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
  * frame #0: 0x00007fff65a701f6 libc++abi.dylib`__cxa_throw
    frame #1: 0x00007fff65a437af libc++.1.dylib`std::__1::__throw_system_error(int, char const*) + 77
    frame #2: 0x00007fff65a35c93 libc++.1.dylib`std::__1::mutex::lock() + 29
    frame #3: 0x00007fff65f07c09 libcrypto.35.dylib`CRYPTO_lock + 169
    frame #4: 0x00007fff65fb76f0 libcrypto.35.dylib`int_thread_get + 48
    frame #5: 0x00007fff65fb79fb libcrypto.35.dylib`int_thread_del_item + 43
    frame #6: 0x00007fff65fb6a98 libcrypto.35.dylib`ERR_remove_thread_state + 104
    frame #7: 0x00007fff662c824f libcurl.4.dylib`Curl_close + 186
    frame #8: 0x00007fff662db7c2 libcurl.4.dylib`curl_easy_cleanup + 42
    frame #9: 0x000000010abece80 arrow.so`Aws::Http::CurlHandleContainer::~CurlHandleContainer() + 766
    frame #10: 0x000000010abe9263 arrow.so`Aws::Http::CurlHttpClient::~CurlHttpClient() + 287
    frame #11: 0x0000000109a18255 arrow.so`std::__1::shared_ptr<std::__1::unordered_set<int, std::__1::hash<int>, std::__1::equal_to<int>, std::__1::allocator<int>>>::~shared_ptr() + 49
    frame #12: 0x000000010aba294b arrow.so`Aws::Client::AWSClient::~AWSClient() + 131
    frame #13: 0x000000010a6eb01f arrow.so`std::__1::shared_ptr<arrow::fs::(anonymous namespace)::S3Client>::~shared_ptr() + 49
    frame #14: 0x000000010a704c45 arrow.so`std::__1::__shared_ptr_pointer<arrow::fs::(anonymous namespace)::RegionResolver*, std::__1::default_delete<arrow::fs::(anonymous namespace)::RegionResolver>, std::__1::allocator<arrow::fs::(anonymous namespace)::RegionResolver>>::__on_zero_shared() + 51
    frame #15: 0x000000010a6de1bd arrow.so`std::__1::shared_ptr<arrow::fs::(anonymous namespace)::RegionResolver>::~shared_ptr() + 49
    frame #16: 0x00007fff67b5deed libsystem_c.dylib`__cxa_finalize_ranges + 351
    frame #17: 0x00007fff67b5e1fe libsystem_c.dylib`exit + 55
    frame #18: 0x00000001020f5958 libR.dylib`Rstd_CleanUp(saveact=<unavailable>, status=0, runLast=1) at sys-std.c:1266:5 [opt]
    frame #19: 0x00000001020f8401 libR.dylib`R_CleanUp(saveact=<unavailable>, status=0, runLast=<unavailable>) at system.c:87:5 [opt]
    frame #20: 0x0000000101ff8d0a libR.dylib`do_quit(call=<unavailable>, op=<unavailable>, args=<unavailable>, rho=<unavailable>) at main.c:1417:5 [opt]
    frame #21: 0x0000000101fa7035 libR.dylib`bcEval(body=0x00007f8393fac400, rho=0x00007f8393fac6d8, useCache=<unavailable>) at eval.c:7136:14 [opt]
    frame #22: 0x0000000101f9fa01 libR.dylib`Rf_eval(e=<unavailable>, rho=<unavailable>) at eval.c:748:8 [opt]
    frame #23: 0x0000000101fbf839 libR.dylib`R_execClosure(call=0x00007f8393fac048, newrho=0x00007f8393fac6d8, sysparent=<unavailable>, rho=0x00007f83a402acc8, arglist=<unavailable>, op=<unavailable>) at eval.c:0 [opt]
    frame #24: 0x0000000101fbe627 libR.dylib`Rf_applyClosure(call=0x00007f8393fac048, op=<unavailable>, arglist=<unavailable>, rho=<unavailable>, suppliedvars=<unavailable>) at eval.c:1844:16 [opt]
    frame #25: 0x0000000101f9febb libR.dylib`Rf_eval(e=<unavailable>, rho=0x00007f83a402acc8) at eval.c:871:12 [opt]
    frame #26: 0x0000000101ff6ca7 libR.dylib`Rf_ReplIteration(rho=0x00007f83a402acc8, savestack=<unavailable>, browselevel=0, state=0x00007ffeedd29670) at main.c:262:2 [opt]
    frame #27: 0x0000000101ff83b1 libR.dylib`R_ReplConsole(rho=0x00007f83a402acc8, savestack=0, browselevel=0) at main.c:314:11 [opt]
    frame #28: 0x0000000101ff8302 libR.dylib`run_Rmainloop at main.c:1137:5 [opt]
    frame #29: 0x0000000101ff843e libR.dylib`Rf_mainloop at main.c:1144:5 [opt]
    frame #30: 0x0000000101ed8f5b R`main + 27
    frame #31: 0x00007fff67ab1015 libdyld.dylib`start + 1
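
Reading the trace bottom-up: `exit` (frame 17) runs the destructors of objects with static storage (frame 16), which releases the `shared_ptr<RegionResolver>` (frames 13-15) and eventually reaches `CRYPTO_lock` (frame 3). The sketch below shows, under stated assumptions, how the point at which such a destructor runs can be moved out of `exit()` by resetting the pointer explicitly. `Resolver`, `ResolverInstance`, `FinalizeResolver`, and `Events` are hypothetical names; this is not Arrow's implementation and not the fix adopted in this issue.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Event log, intentionally leaked so it is never destroyed at exit and can
// safely be used from any destructor.
std::vector<std::string>& Events() {
  static auto* events = new std::vector<std::string>();
  return *events;
}

// Hypothetical stand-in for an object that owns HTTP clients and handles.
struct Resolver {
  ~Resolver() { Events().push_back("resolver destroyed"); }
};

// Static-storage holder, comparable to the shared_ptr in frames 13-15.
// If nobody resets it, ~Resolver() runs inside exit().
std::shared_ptr<Resolver>& ResolverInstance() {
  static std::shared_ptr<Resolver> instance = std::make_shared<Resolver>();
  return instance;
}

// Explicit teardown: ~Resolver() runs here, while the libraries it depends
// on are still fully initialized, instead of during exit().
void FinalizeResolver() {
  ResolverInstance().reset();
  assert(ResolverInstance() == nullptr);  // nothing left to destroy at exit
  Events().push_back("finalize returned");
}
```

After `FinalizeResolver()` returns, the static `shared_ptr` is empty, so its destructor during `exit()` has nothing to release.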


@westonpace
Member

Looks like you might need to try updating LibreSSL:

https://marc.info/?l=libressl&m=152385494826535&w=2

Can you confirm it is running 2.2.7?

@paleolimbot
Member Author

From everything I can tell, the default `-lcurl` results in LibreSSL 2.0.20 (although from what I read, this should be 2.2.7 on macOS 10.13). I'm pretty sure CRAN has solved this issue and has at least OpenSSL 1.1.1, but I will double-check (and at the very least we need to figure out how to make the runner replicate that).

kou changed the title from "mutex lock failed: Invalid argument on MacOS 10.13 on process shutdown" to "[CI][R] mutex lock failed: Invalid argument on MacOS 10.13 on process shutdown" on Jan 7, 2023
@paleolimbot
Member Author

I checked with Jeroen (see linked issue above) and:

  • The way we build libarrow on MacOS 10.13 links to whatever curl/SSL runtime is available on the system, which on MacOS 10.13 really is LibreSSL 2.0.20.
  • On newer MacOS, the runtime is newer, so even though CRAN builds on 10.13, this problem shouldn't occur when installed on a newer MacOS (i.e., most humans).

Jeroen suggested skipping S3 tests on 10.13. The other option, attempting to link to a more modern curl/ssl stack, is more likely to introduce errors, particularly this close to a release, and it would help only a small number of users. There is a workaround for 10.13 users who run into this, although it involves installing CMake, OpenSSL, and Arrow separately.

assignUser pushed a commit that referenced this issue Jan 12, 2023
# Which issue does this PR close?

Closes #15189

# Rationale for this change

The curl/ssl runtime on 10.13 results in a segfault when the process exits (even though all tests pass), so we get a spurious failure on our 10.13 runner test.

# What changes are included in this PR?

Updates the `skip_if_not_available()` function to special-case the "s3" feature. The "re2" feature is handled similarly to prevent spurious valgrind errors from being reported.

# Are these changes tested?

These changes can only be tested via crossbow + the 10.13 runner (and locally on my own 10.13 machine after the PR is live and can be pulled there).

# Are there any user-facing changes?

Nope!
* Closes: #15189

Lead-authored-by: Dewey Dunnington <dewey@fishandwhistle.net>
Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
Co-authored-by: Dewey Dunnington <dewey@voltrondata.com>
Signed-off-by: Jacob Wujciak-Jens <jacob@wujciak.de>
assignUser added this to the 11.0.0 milestone Jan 12, 2023