Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression: File descriptor / Assertion error when using curl via nvchecker #15725

Closed
christian-heusel opened this issue Dec 11, 2024 · 11 comments

Comments

@christian-heusel
Copy link
Contributor

christian-heusel commented Dec 11, 2024

I did this

I have a few local configs for lilydjwg/nvchecker which I regularily run via systemd-timers and the following commandline invocation:

$ for nvcheckerfile in ~/Documents/shared_projects/{packages,aur-packages}/packages.toml ~/Documents/shared_projects/arch-infrastructure/nvchecker.toml; do nvchecker -c $nvcheckerfile; done
[I 12-11 17:40:21.160 core:416] rofi-emoji: updated from 3.4.0 to 4.0.0
[I 12-11 17:40:22.541 core:416] molly-guard: updated from 0.8.1 to 0.8.4 url=https://sources.debian.org/src/molly-guard/0.8.4/

Since today this gives me the following errors every few runs (although the assert is more prominently faced):

Assertion 'close_nointr(fd) != -EBADF' failed at src/basic/fd-util.c:75, function safe_close(). Aborting.

OR 

Unexpected error 9 on netlink descriptor 10.

I have also seen the error with the pacman package manager once so far.

I then looked into the issue in a bit more detail and found that bisecting the issue between 8.11.0 and 8.11.1 lead me to 92124838c ("socketpair: fix enabling USE_EVENTFD") as a culprit which in turn fixes enabling of the functionality in 23fe1a52d ("socketpair: add eventfd and use SOCK_NONBLOCK for socketpair()"). Reverting 9212483 on top of reliably fixes the issue on my machine.

I also found that the issue does not reproduce every time and seems to be more prevalent when I clear my DNS cache via resolvectl flush-caches.

The backtrace for the crash is the following:

(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007fcc816a5463 in __pthread_kill_internal (threadid=<optimized out>, signo=6) at pthread_kill.c:78
#2  0x00007fcc8164c120 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007fcc816334c3 in __GI_abort () at abort.c:79
#4  0x00007fcc79cc3596 in log_assert_failed (text=text@entry=0x7fcc79ce8f80 "close_nointr(fd) != -EBADF", 
    file=file@entry=0x7fcc79ce6e5b "src/basic/fd-util.c", line=line@entry=75, func=func@entry=0x7fcc79ceafa8 <__func__.35> "safe_close")
    at ../systemd/src/basic/log.c:996
#5  0x00007fcc79cd1cb1 in safe_close (fd=9) at ../systemd/src/basic/fd-util.c:75
#6  0x00007fcc79ccff2d in varlink_clear (v=0x7fcc44000d80) at ../systemd/src/libsystemd/sd-varlink/sd-varlink.c:612
#7  0x00007fcc79ce2c1c in varlink_destroy (v=0x7fcc44000d80) at ../systemd/src/libsystemd/sd-varlink/sd-varlink.c:651
#8  sd_varlink_unref.isra.0 (p=0x7fcc44000d80) at ../systemd/src/libsystemd/sd-varlink/sd-varlink.c:657
#9  0x00007fcc79cc7517 in sd_varlink_unrefp (p=0x7fcc4dffc178) at ../systemd/src/systemd/sd-varlink.h:279
#10 _nss_resolve_gethostbyname4_r (name=name@entry=0x562ab267ad00 "sources.debian.org", pat=pat@entry=0x7fcc4dffc3b0, buffer=0x7fcc4dffc4c0 "", buflen=1024, 
    errnop=0x7fcc4dffd638, h_errnop=h_errnop@entry=0x7fcc4dffd688, ttlp=0x0) at ../systemd/src/nss-resolve/nss-resolve.c:219
#11 0x00007fcc81756891 in get_nss_addresses (name=<optimized out>, req=<optimized out>, tmpbuf=0x7fcc4dffc4b0, res=0x7fcc4dffc3b0) at getaddrinfo.c:652
#12 gaih_inet (name=<optimized out>, service=<optimized out>, req=<optimized out>, pai=0x7fcc4dffc380, naddrs=<synthetic pointer>, tmpbuf=0x7fcc4dffc4b0)
    at getaddrinfo.c:1185
#13 __GI_getaddrinfo (name=<optimized out>, service=<optimized out>, service@entry=0x7fcc4dffce5c "443", hints=<optimized out>, hints@entry=0x562ab267ac90, 
    pai=pai@entry=0x7fcc4dffce50) at getaddrinfo.c:2391
#14 0x00007fcc7f82fd1f in Curl_getaddrinfo_ex (nodename=<optimized out>, servname=0x7fcc4dffce5c "443", hints=0x562ab267ac90, result=0x562ab267ac88)
    at /usr/src/debug/curl/curl/lib/curl_addrinfo.c:121
#15 getaddrinfo_thread (arg=arg@entry=0x562ab267ac58) at /usr/src/debug/curl/curl/lib/asyn-thread.c:311
#16 0x00007fcc7f83119d in curl_thread_create_thunk (arg=<optimized out>) at /usr/src/debug/curl/curl/lib/curl_threads.c:59
#17 0x00007fcc816a339d in start_thread (arg=<optimized out>) at pthread_create.c:447
#18 0x00007fcc8172849c in __GI___clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

If you think this is a bug in systemd rather than curl please tell me, so far it looked plausible that it's an issue with the curl codebase. Also if you have any idea about a minimal reproducer I'm happy to test things.

I expected the following

I expected these command to just continue working as usual.

curl/libcurl version

curl --version
curl 8.11.1 (x86_64-pc-linux-gnu) libcurl/8.11.1 OpenSSL/3.4.0 zlib/1.3.1 brotli/1.1.0 zstd/1.5.6 libidn2/2.3.7 libpsl/0.21.5 libssh2/1.11.0 nghttp2/1.64.0 nghttp3/1.6.0
Release-Date: 2024-12-11
Protocols: dict file ftp ftps gopher gophers http https imap imaps ipfs ipns mqtt pop3 pop3s rtsp scp sftp smb smbs smtp smtps telnet tftp ws wss
Features: alt-svc AsynchDNS brotli GSS-API HSTS HTTP2 HTTP3 HTTPS-proxy IDN IPv6 Kerberos Largefile libz NTLM PSL SPNEGO SSL threadsafe TLS-SRP UnixSockets zstd

operating system

Arch Linux (rolling)

uname -a:

Linux meterpeter 6.13.0-rc2-1-mainline #1 SMP PREEMPT_DYNAMIC Sun, 08 Dec 2024 23:31:29 +0000 x86_64 GNU/Linux

systemd version: 257-1

@christian-heusel
Copy link
Contributor Author

It seems like it crashes here: https://github.com/systemd/systemd/blob/b83847eb1321ee79e1ab073b79e3001320b839a4/src/basic/fd-util.c#L59-L79

Also the previous systemd version (256.9-1) exhibits the same issue.

@vszakats
Copy link
Member

vszakats commented Dec 11, 2024

9212483 enabled a pre-existing codebase that was left disabled (due to typo) since its inception (in 8.9.0).

It'd probably be more useful to fix the bug, or if this is not possible, drop this
functionality if broken. Reverting 9212483 only hides the issue.

edit: workaround for CMake: -DHAVE_EVENTFD=0. For ./configure probably: export ac_cv_func_eventfd=no

@eworm-de
Copy link
Contributor

@heftig
Copy link
Contributor

heftig commented Dec 11, 2024

I'm guessing the issue is that Curl_eventfd assigns the same FD to the read and write end of the wakeup "socket pair", and that results in close getting called twice with the same FD number.

If nss-resolve opens its varlink socket between these two calls to close and gets that very same FD number assigned, curl closes this socket and nss-resolve explodes.

Otherwise, curl gets an EBADF from the second close, which it ignores. You can see this using strace.

@heftig
Copy link
Contributor

heftig commented Dec 11, 2024

I think the bug probably involves the wakeup_close calls in asyn-thread.c.

@vszakats
Copy link
Member

@panjf2000 Would you mind taking a look?

@panjf2000
Copy link
Contributor

Sure, I'll handle it.

@heftig
Copy link
Contributor

heftig commented Dec 12, 2024

Thinking about it, it should be possible to duplicate the file descriptor to get the same behavior with two FDs you can close separately. However, I don't think there's a variant of dup that duplicates the FD to an unassigned number and atomically sets CLOEXEC.

(And avoiding unnecessarily allocating file descriptors is probably prudent.)

@bagder
Copy link
Member

bagder commented Dec 12, 2024

I don't think we need to dupe it. Can't we just add an #ifdef for the eventfd situation and avoid closing one of them then?

panjf2000 added a commit to panjf2000/curl that referenced this issue Dec 12, 2024
When employing eventfd for socketpair, there is only one file
descriptor. Closing that fd twice might result in fd corruption.
Thus, we should avoid closing the eventfd twice, following the
pattern in lib/multi.c.

Fixes curl#15725
@panjf2000
Copy link
Contributor

panjf2000 commented Dec 12, 2024

Hi @christian-heusel, could you compile and run this patch #15727 on your machine to see if it fixes the issue? Thanks!

@christian-heusel
Copy link
Contributor Author

christian-heusel commented Dec 12, 2024

This seems to fix the crash, thanks for coming up with a fix so quickly 🤗

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging a pull request may close this issue.

6 participants