Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Application segfault in 8.10.0 #14860

Closed
carlocab opened this issue Sep 11, 2024 · 22 comments
Closed

Application segfault in 8.10.0 #14860

carlocab opened this issue Sep 11, 2024 · 22 comments
Labels

Comments

@carlocab
Copy link

carlocab commented Sep 11, 2024

I did this

We're upgrading Homebrew's version of curl to 8.10.0. When testing dependents, we found this error from julia:

Backtrace
  ==> /opt/homebrew/Cellar/julia/1.10.5/bin/julia --startup-file=no --history-file=no --project=/private/tmp/julia-test-20240911-23434-72ttcc --procs 3 --eval using Pkg; Pkg.add("Example")
    Installing known registries into `~/.julia`
         Added `General` registry to ~/.julia/registries
      Updating registry at `~/.julia/registries/General.toml`
     Resolving package versions...
     Installed Example ─ v0.5.3
      Updating `~/Project.toml`
    [7876af07] + Example v0.5.3
      Updating `~/Manifest.toml`
    [7876af07] + Example v0.5.3
  Precompiling project...
    ✓ Example
    1 dependency successfully precompiled in 1 seconds
  
  [23470] signal (11.2): Segmentation fault: 11
  in expression starting at none:0
  Curl_hash_pick at /opt/homebrew/Cellar/curl/8.10.0/lib/libcurl.4.dylib (unknown line)
  curl_multi_assign at /opt/homebrew/Cellar/curl/8.10.0/lib/libcurl.4.dylib (unknown line)
  julia_socket_callback_92891.1 at /opt/homebrew/Cellar/julia/1.10.5/lib/julia/sys.dylib (unknown line)
  jfptr_socket_callback_92892.1 at /opt/homebrew/Cellar/julia/1.10.5/lib/julia/sys.dylib (unknown line)
  ijl_apply_generic at /opt/homebrew/Cellar/julia/1.10.5/lib/julia/libjulia-internal.1.10.5.dylib (unknown line)
  jlcapi_socket_callback_92823.1 at /opt/homebrew/Cellar/julia/1.10.5/lib/julia/sys.dylib (unknown line)
  Curl_multi_pollset_ev at /opt/homebrew/Cellar/curl/8.10.0/lib/libcurl.4.dylib (unknown line)
  cpool_update_shutdown_ev at /opt/homebrew/Cellar/curl/8.10.0/lib/libcurl.4.dylib (unknown line)
  cpool_discard_conn at /opt/homebrew/Cellar/curl/8.10.0/lib/libcurl.4.dylib (unknown line)
  Curl_cpool_destroy at /opt/homebrew/Cellar/curl/8.10.0/lib/libcurl.4.dylib (unknown line)
  curl_multi_cleanup at /opt/homebrew/Cellar/curl/8.10.0/lib/libcurl.4.dylib (unknown line)
  jlplt_curl_multi_cleanup_67119.1 at /opt/homebrew/Cellar/julia/1.10.5/lib/julia/sys.dylib (unknown line)
  julia_YY.31_92837.1 at /opt/homebrew/Cellar/julia/1.10.5/lib/julia/sys.dylib (unknown line)
  jfptr_YY.31_92838.1 at /opt/homebrew/Cellar/julia/1.10.5/lib/julia/sys.dylib (unknown line)
  ijl_apply_generic at /opt/homebrew/Cellar/julia/1.10.5/lib/julia/libjulia-internal.1.10.5.dylib (unknown line)
  julia__atexit_79291.1 at /opt/homebrew/Cellar/julia/1.10.5/lib/julia/sys.dylib (unknown line)
  jfptr__atexit_79292.1 at /opt/homebrew/Cellar/julia/1.10.5/lib/julia/sys.dylib (unknown line)
  ijl_apply_generic at /opt/homebrew/Cellar/julia/1.10.5/lib/julia/libjulia-internal.1.10.5.dylib (unknown line)
  ijl_atexit_hook at /opt/homebrew/Cellar/julia/1.10.5/lib/julia/libjulia-internal.1.10.5.dylib (unknown line)
  jl_repl_entrypoint at /opt/homebrew/Cellar/julia/1.10.5/lib/julia/libjulia-internal.1.10.5.dylib (unknown line)
  Allocations: 11678659 (Pool: 11669911; Big: 8748); GC: 18

Complete logs are available here. The referenced error starts here on macOS 14 arm64.

I am unsure if the problem is in curl or in julia, but it seems like the segfault is happening inside Curl_hash_pick.

I expected the following

No segfault. The segfault doesn't occur with curl 8.9.1.

curl/libcurl version

curl 8.10.0 (aarch64-apple-darwin23.6.0) libcurl/8.10.0 OpenSSL/3.3.2 (SecureTransport) zlib/1.2.12 brotli/1.1.0 zstd/1.5.6 libidn2/2.3.7 libssh2/1.11.0 nghttp2/1.63.0 librtmp/2.3
Release-Date: 2024-09-11
Protocols: dict file ftp ftps gopher gophers http https imap imaps ipfs ipns ldap ldaps mqtt pop3 pop3s rtmp rtsp scp sftp smb smbs smtp smtps telnet tftp
Features: alt-svc AsynchDNS brotli GSS-API HSTS HTTP2 HTTPS-proxy IDN IPv6 Kerberos Largefile libz MultiSSL NTLM SPNEGO SSL threadsafe TLS-SRP UnixSockets zstd

operating system

Darwin Frontier.local 23.6.0 Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:46 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6031 arm64

But note that this occurs on multiple other versions of macOS (macOS 12, 13, and 14 on both Intel and ARM64). We haven't yet been able to verify that this happens on Linux, but I will check this later.

@bagder
Copy link
Member

bagder commented Sep 11, 2024

Any chance we can get the stack trace from a libcurl built with debug that keeps the symbols?

@carlocab
Copy link
Author

That is doable, but a little tricky. Please feel free to ping me at the end of the day tomorrow if I haven't reported back here. (It's a busy time for Homebrew because we're preparing for the release of macOS 15, so I've got a lot on my plate at the moment.)

@piru
Copy link

piru commented Sep 11, 2024

Looks like a NULL deref to me:

function done!(multi::Multi)
    stoptimer!(multi)
    handle = multi.handle
    handle == C_NULL && return
    multi.handle = C_NULL      <-- set to to NULL
    curl_multi_cleanup(handle)  <--- called next
    nothing
end

ref: https://github.com/JuliaLang/Downloads.jl/blob/89d3c7dded535a77551e763a437a6d31e4d9bf84/src/Curl/Multi.jl#L28

curl_multi_cleanup() will eventually call the socket_callback. In unction socket_callback( ...):

            @check curl_multi_assign(multi.handle, sock, watcher_p)

ref: https://github.com/JuliaLang/Downloads.jl/blob/89d3c7dded535a77551e763a437a6d31e4d9bf84/src/Curl/Multi.jl#L176

As I understand it multi.handle is NULL here.

@bagder
Copy link
Member

bagder commented Sep 11, 2024

An added check as in #14862 will help libcurl detect this user mistake and avoid the crash, but the actual problem appears to be in the application.

@bevanjkay
Copy link

Thanks for the quick turnaround here @bagder
I have rebuilt curl with the patch and our Julia test is now passing.
We will flag the issue there too, if it does indeed need rectifying.

@carlocab
Copy link
Author

Thanks for resolving this!

@giordano
Copy link

For what is worth, I've been trying to see where exactly Downloads.jl actually calls curl_multi_assign with a NULL handle, but I can't ever hit it with all the (non-minimal) reproducers mentioned in this page. If someone can actually provide a minimal reproducer, then we can try to fix the call to curl_multi_assign

@cgwalters
Copy link

FYI fallout in ostreedev/ostree#3299 - we're still evaluating whether there's an outstanding bug in ostree here or whether this is actually a libcurl regression. It seems more likely to be the former, but still.

@giordano
Copy link

I mentioned at JuliaLang/Downloads.jl#260 (comment) that I just managed to hit the null handle with libcurl 8.10, but the offending code path isn't hit at all in previous versions of libcurl (no change of code on the julia side). Is there any relevant change in libcurl 8.10 that libraries wrapping it should be aware of?

@piru
Copy link

piru commented Sep 18, 2024

I believe it's a reasonable assumption that the callback can get called while the callback is in place and the object the callback is associated with is still in existence. This means that any resource referenced by the callback must remain valid, too. Prematurely cleaning up resources (such as pointers) that might get used by the callback seems like a bug to me.

I think the above applies regardless of libcurl version.

@giordano
Copy link

curl_multi_cleanup() will eventually call the socket_callback

Is that expected? That didn't happen with libcurl 8.9.0. Going through this with lldb I can see the callback is called at

curl/lib/multi.c

Lines 2758 to 2760 in 5a26371

Curl_cpool_destroy(&multi->cpool);
sockhash_destroy(&multi->sockhash);

Prematurely cleaning up resources (such as pointers) that might get used by the callback seems like a bug to me.

If I delete the line

    multi.handle = C_NULL

at https://github.com/JuliaLang/Downloads.jl/blob/89d3c7dded535a77551e763a437a6d31e4d9bf84/src/Curl/Multi.jl#L28 then the assertion at

curl/lib/multi.c

Line 2719 in 5a26371

if(GOOD_MULTI_HANDLE(multi)) {
is hit.

@bagder
Copy link
Member

bagder commented Sep 18, 2024

curl_multi_cleanup() will eventually call the socket_callback

Is that expected? That didn't happen with libcurl 8.9.0

Yes, it is to be expected. The callback is called a little differently now due to internal changes, but the API is the same and the application should be prepared for a callback there as well. It should also be prepared to be called for an "internal" easy handle that the application did not add itself.

@bagder
Copy link
Member

bagder commented Sep 18, 2024

the assertion at curl/lib/multi.c is hit

That's not an assertion unless you run a debug build, that's just a run-time check to prevent the function from working on a handle that is not good.

The assertion in a debug build is to help users catch this situation and address it.

@giordano
Copy link

giordano commented Sep 18, 2024

That's not an assertion unless you run a debug build

That's what I did to be able to go through this with lldb, yes.

I'm not familiar with libcurl nor with Downloads.jl internals, I'm just trying to figure out how to resolve the issue, so please bear with me. Only thing I could come up with on the julia side was

diff --git a/src/Curl/Multi.jl b/src/Curl/Multi.jl
index d2be032..888246b 100644
--- a/src/Curl/Multi.jl
+++ b/src/Curl/Multi.jl
@@ -162,6 +162,7 @@ function socket_callback(
             return -1
         end
         multi = unsafe_pointer_to_objref(multi_p)::Multi
+        multi.handle == C_NULL && return -1
         if watcher_p != C_NULL
             old_watcher = unsafe_pointer_to_objref(watcher_p)::FDWatcher
             @check curl_multi_assign(multi.handle, sock, C_NULL)

at https://github.com/JuliaLang/Downloads.jl/blob/89d3c7dded535a77551e763a437a6d31e4d9bf84/src/Curl/Multi.jl#L164 to return immediately out of the callback if the handle is null. Does this make sense? Is the -1 return value sensible? At least with this change I don't see any segmentation fault and all tests of Downloads.jl are successful.

@bagder
Copy link
Member

bagder commented Sep 18, 2024

Is the -1 return value sensible?

That means the callback signals an error back. I don't think it matters much if you return ok or error since this happens when shutting down a multi handle.

@giordano
Copy link

That means the callback signals an error back

Thanks. That was my idea, yes.

I don't think it matters much if you return ok or error since this happens when shutting down a multi handle.

I see. But the rest of the change (don't proceed if the handle is null) is what you'd expect from an application to handle this situation?

@bagder
Copy link
Member

bagder commented Sep 18, 2024

But the rest of the change (don't proceed if the handle is null) is what you'd expect from an application to handle this situation?

Not really. The socket update is still valid. If it tells you it removes a socket, that is still done independently of which easy handle it is told for. If your application cares about sockets and what activities to wait for, then it should act on this update as well.

@bagder
Copy link
Member

bagder commented Sep 18, 2024

Unless of course it removes all knowledge of them anyway since it is closing down the multi handle, and then you can of course ignore it.

cgwalters added a commit to cgwalters/ostree that referenced this issue Sep 18, 2024
Because curl_multi_cleanup may invoke callbacks, we effectively have
some circular references going on here. See discussion in

curl/curl#14860

Basically what we do is the socket callback libcurl may invoke into a no-op when
we detect we're finalizing. The data structures are owned by this object and
not by the callbacks, and will be destroyed below. Note that
e.g. g_hash_table_unref() may itself invoke callbacks, which is where
some data is cleaned up.

Signed-off-by: Colin Walters <walters@verbum.org>
@cgwalters
Copy link

(Since sometimes it needs to be said: @bagder thanks for all your years of maintaining curl, it is appreciated!)

Unless of course it removes all knowledge of them anyway since it is closing down the multi handle, and then you can of course ignore it.

I haven't dug deep into this yet, but it seems correct to say that there was a behavior change in the semantics of callbacks invoked during curl_easy_cleanup?

I could imagine that some libcurl users depended on freeing data based on those callbacks (especially CURL_POLL_REMOVE). But in our case we didn't, the data is torn down regardless right after invoking curl_easy_cleanup), and this ordering/semantic change definitely broke us.

I did ostreedev/ostree#3307 which fixes this problem, and I know this might appear as an instance of Hyrum's Law but...I'd appreciate it if some consideration was giving to reverting #14862 until fixes have had a chance to propagate for ostree (which has quite a bit of users) - even just 3 months would be very helpful.

@cgwalters
Copy link

Hmm something I honestly don't understand at all is why libcurl is invoking the socket callback at shutdown time with action=1 where we have #define CURL_POLL_IN 1 - but sockp=0x0, which we've been treating for a long time as "we need to add this socket" - that's where we also invoke curl_multi_assign which is what was triggering the problematic behavior. The sample API docs seem to unconditionally dereference the passed sockp, hmm.

I guess I'd need to dig through other consumers of this API to see if we're doing something wrong but...at a quick glance e.g. systemd seems to do something very similar.

#0  0x00007ffff7d5bea8 in curl_multi_assign () at /lib64/libcurl.so.4
#1  0x00007ffff7f5d0d6 in addsock (s=13, easy=0x494970, action=1, fetcher=0x478460) at src/libostree/ostree-fetcher-curl.c:520
#2  0x00007ffff7f5d201 in sock_cb (easy=0x494970, s=13, what=1, cbp=0x478460, sockp=0x0) at src/libostree/ostree-fetcher-curl.c:541
#3  0x00007ffff7d5b143 in Curl_multi_pollset_ev () at /lib64/libcurl.so.4
#4  0x00007ffff7d1d49c in cpool_update_shutdown_ev.lto_priv () at /lib64/libcurl.so.4
#5  0x00007ffff7d1d642 in cpool_discard_conn () at /lib64/libcurl.so.4
#6  0x00007ffff7d1e794 in cpool_close_and_destroy_all.lto_priv () at /lib64/libcurl.so.4
#7  0x00007ffff7d5c13b in curl_multi_cleanup () at /lib64/libcurl.so.4
#8  0x00007ffff7f5bedd in _ostree_fetcher_finalize (object=0x478460) at src/libostree/ostree-fetcher-curl.c:183

@bagder
Copy link
Member

bagder commented Sep 19, 2024

I'd appreciate it if some consideration was giving to reverting https://github.com/curl/curl/pull/14862I'd appreciate it if some consideration was giving to reverting #14862

That change was merged to prevent application crashes - and it has already proven to do so. How is this prevention bad for your application? Presumably you don't run debug builds of libcurl in production?

bell-sw pushed a commit to bell-sw/alpaquita-aports that referenced this issue Sep 19, 2024
[ commit afc1a416e6889ed8198a9411ba09b0cb60e977cb ]

Upstream issue: curl/curl#14860
Upstream patch: curl/curl@48f61e7
@cgwalters
Copy link

cgwalters commented Sep 19, 2024

That change was merged to prevent application crashes - and it has already proven to do so. How is this prevention bad for your application? Presumably you don't run debug builds of libcurl in production?

This isn't about debug builds of curl - the change caused curl_multi_assign's use of GOOD_MULTI_HANDLE to return an error code during curl_multi_cleanup() instead of proceeding.

I did verify myself (as others did in ostreedev/ostree#3299) that it's that specific curl patch that causes ostree to start hitting an internal assertion error - reverting it on top of 8.10.1 fixes things.

To elaborate more, the assertion we're hitting isn't one from curl, it's an internal assertion we have around the association of our sockets.

With the updated curl, curl_multi_assign switches from performing its task to returning an error when invoked during a curl_multi_cleanup. Now previously, we were ignoring the return value from that function - more on that below¹. During teardown then we get a CURL_POLL_REMOVE but with socketp=NULL which we were not expecting, and the reason is because curl_multi_assign didn't set up the association.

I still haven't yet done a full analysis of what other projects are doing here - the fix I did I'm pretty confident in, but there may be a more "curl idiomatic" fix.

In any case though:

That change was merged to prevent application crashes - and it has already proven to do so.

Yes, now we're in an unfortunate situation where a change fixes one app and breaks another. I don't know that anyone is in a position to weigh julia versus ostree objectively 😄 so I won't try...but am I correct in understanding that the julia crash was itself provoked by a refactoring of the curl_multi_cleanup() shutdown ordering? If that's the case, I think this all argues there's quite a lot of subtlety involved in that and some "best practice" guidelines for the docs seem like a good followup? As I mentioned in the ostree fix one thing I think is happening here is conceptually one could have the data associated "owned" by the callback or owned externally. We always have external ownership, so we don't need the callbacks invoked at shutdown time. Maybe a guideline on that is to unset the callbacks before invoking curl_multi_cleanup()?

¹ Yes, I also changed that but it's a "shouldn't fail" situation so like many other ones if we had checked it before it certainly would have been an assertion, which would just change the crash scenario for us.

raspbian-autopush pushed a commit to raspbian-packages/ostree that referenced this issue Oct 1, 2024
Because curl_multi_cleanup may invoke callbacks, we effectively have
some circular references going on here. See discussion in

curl/curl#14860

Basically what we do is the socket callback libcurl may invoke into a no-op when
we detect we're finalizing. The data structures are owned by this object and
not by the callbacks, and will be destroyed below. Note that
e.g. g_hash_table_unref() may itself invoke callbacks, which is where
some data is cleaned up.

Signed-off-by: Colin Walters <walters@verbum.org>
Origin: upstream, 2024.8, commit:4d755a85225ea0a02d4580d088bb8a97138cb040
Bug: ostreedev/ostree#3299
Bug-Debian: https://bugs.debian.org/1082121

Gbp-Pq: Name curl-Make-socket-callback-during-cleanup-into-no-op.patch
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

No branches or pull requests

6 participants