Skip to content

8.8.0: Test 1553 (IMAP cleanup before a connection was created) appears flaky on some Gentoo systems #13782

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Kangie opened this issue May 25, 2024 · 9 comments

Comments

@Kangie
Copy link
Contributor

Kangie commented May 25, 2024

I am unable to have test 1553 pass consistently on Gentoo in 8.8.0, however we have not had this issue with previous versions.

I raised this briefly during the 8.8.0-older-mbedtls testing, but Stefan and Daniel were unable to reproduce so I put it down to a weird failure as it'd already passed on another of my dev machines with a very similar configuration. One of the Gentoo automated build machines (tinderboxes) and a Gentoo developer have noted this issue to me, so I figured it worth raising here.

I have one workstation where this test consistently passes. I have another where it passes if I run the tests as my user or if I run them with the portage network-sandbox feature disabled (fine for testing, but not sustainable).

We can disable this test in the Gentoo ebuild however I'm interested in why this was not an issue for us in the past and appears inconsistent.

build.log

See also

I did this

srcdir=/var/tmp/portage/net-misc/curl-8.8.0/work/curl-8.8.0/tests /usr/bin/perl -I/var/tmp/portage/net-misc/cu
rl-8.8.0/work/curl-8.8.0/tests /var/tmp/portage/net-misc/curl-8.8.0/work/curl-8.8.0/tests/runtests.pl -a -s -n
 -v -a -k -am -p -j64 !241 !1083
Using curl: ../src/curl
CMD (0): ../src/curl --version 1>log/curlverout.log 2>log/curlvererr.log
********* System characteristics ******** 
* curl 8.8.0 (x86_64-pc-linux-gnu) 
* libcurl/8.8.0 OpenSSL/3.2.1 zlib/1.3.1 c-ares/1.28.1 libpsl/0.21.5 nghttp2/1.62.0
* Features: alt-svc AsynchDNS HSTS HTTP2 HTTPS-proxy IPv6 Largefile libz NTLM PSL SSL threadsafe TLS-SRP UnixS
ockets
* Disabled: 
* Host: localhost
* System: Linux localhost 6.8.8-gentoo--bouncy-mouse #1 SMP PREEMPT_DYNAMIC Wed May  8 14:03:09 AEST 2024 x86_
64 AMD Ryzen 9 5950X 16-Core Processor AuthenticAMD GNU/Linux
* OS: linux
* Jobs: 64
* Env: 
* Seed: 241509
* Servers: HTTP-IPv6 HTTP-unix FTP-IPv6 
*****************************************

. . .

./libtest/lib1553 imap://non-existing-host.haxx.se:43035/1553 > log/4/stdout1553 2> log/4/stderr1553
 1553: exit FAILED
== Contents of files in the log/4/ dir after test 1553
=== Start of file server.cmd
 Testnum 1553
=== End of file server.cmd
=== Start of file stderr1553
 URL: imap://non-existing-host.haxx.se:43035/1553
 /var/tmp/portage/net-misc/curl-8.8.0/work/curl-8.8.0/tests/libtest/lib1553.c:93 ABORTING TEST, since it seems that it would have run forever (60057 ms > 60000 ms)

I expected the following

TESTDONE: 1700 tests were considered during 39 seconds.
TESTDONE: 1365 tests out of 1365 reported OK: 100%

curl/libcurl version

curl 8.8.0

operating system

Gentoo Linux

@Kangie Kangie changed the title 8.8.0: Test 1553 (IMAP connection timeout) appears flaky on some Gentoo systems 8.8.0: Test 1553 (IMAP cleanup before a connection was created) appears flaky on some Gentoo systems May 26, 2024
@dfandrich
Copy link
Contributor

dfandrich commented May 26, 2024 via email

@Kangie
Copy link
Contributor Author

Kangie commented May 27, 2024

@thesamesam bisected, first bad commit

2d2c27e5a319665fdfbe75422dac55ff3a2218b6

Happens regardless of parallelism.

@thesamesam
Copy link
Contributor

thesamesam commented May 27, 2024

(Indeed fails still with -j1 and -j1 + running solely that test to be sure.)

Yeah, before 2d2c27e (cc @icing), it worked for me with our packaging network sandbox enabled. Afterwards, it requires it to be off.

We can skip the test if it does indeed need some level of network access for e.g. the DNS query to fail, but we wanted to report this given it's both a new failure (when the test hasn't changed) and it felt a bit off. If you can confirm this is an OK result of that commit, skipping downstream works for us. But happy to probe more too if that's useful.

icing added a commit to icing/curl that referenced this issue May 29, 2024
- sprodic failes of test1553 are reported in curl#13782. These seem only
  to happen in very specific environments, but point to a race
  somewhere in curl's async resolver handling.
- add debug logging to test1553 to get more output
- add debug logging to the c-ares resolve monitoring, e.g.
  listing sockets and timeouts returned
@icing
Copy link
Contributor

icing commented May 29, 2024

@Kangie please run #13824 with a --enable-debug in your setup. You should see more output in stderr1553. Interesting is what the c-ares resolver monitoring logs here. Lines like:

Info: ares_getsock() -> 1, fd=6, timeout=9998

If there is a race between curl collecting this and cares setting up the connections, I'd expect a 0 to be returned here.

@Kangie
Copy link
Contributor Author

Kangie commented May 29, 2024

Runner 25878 starting
CMD (32000): ./libtest/lib1553 imap://non-existing-host.haxx.se:39863/1553 > log/1/stdout1553 2> log/1/stderr1553
* kill pid for imap-filt => 25880
RUN: Process with pid 25879 signalled to die
RUN: Process with pid 25879 forced to die with SIGKILL
Runner 25878 exiting
startnew: perl -I/var/tmp/portage/net-misc/curl-8.8.0/work/curl-8.8.0/tests /var/tmp/portage/net-misc/curl-8.8.0/work/curl-8.8.0/tests/ftpserver.pl --pidfile "log/1/server/imap_server.pid" --logfile "log/1/imap_server.log" --logdir "log/1" --portfile "log/1/server/imap_server.port" --srcdir "/var/tmp/portage/net-misc/curl-8.8.0/work/curl-8.8.0/tests" --proto imap --ipv4 --port 0 --addr "127.0.0.1"
PINGPONG runs on port 39863 (log/1/server/imap_server.port)
RUN: ../src/curl --max-time 13 --silent --verbose --globoff "imap://127.0.0.1:39863/verifiedserver" 2>log/1/imap_verify.log
RUN: Verifying our test imap server took 0 seconds
RUN: IMAP server is PID 25879 port 39863
* pid imap => 25879 25879
./libtest/lib1553 imap://non-existing-host.haxx.se:39863/1553 > log/1/stdout1553 2> log/1/stderr1553
 1553: exit FAILED
== Contents of files in the log/1/ dir after test 1553
=== Start of file imap_server.log
 22:48:42.254355 IMAP server listens on port IPv4/39863
 22:48:42.254612 logged pid 25879 in log/1/server/imap_server.pid
 22:48:42.254694 Awaiting input
 22:48:43.242094 ====> Client connect
 22:48:43.242286 > "        _   _ ____  _     [CR][LF]    ___| | | |  _ \| |    [CR][LF]   / __| | | | |_) | |    [CR][LF]  | (__| |_| |  _ {| |___ [CR][LF]   \___|\___/|_| \_\_____|[CR][LF]* OK curl IMAP server ready to serve[CR][LF]"
 22:48:43.242712 < "A001 CAPABILITY"
 22:48:43.242785 > "A001 BAD Command[CR][LF]"
 22:48:43.242979 < "A002 LIST "verifiedserver" *"
 22:48:43.243042 LIST_imap got "verifiedserver" *
 22:48:43.243101 > "* LIST () "/" "WE ROOLZ: 25879"[CR][LF]"
 22:48:43.243155 > "A002 OK LIST Completed[CR][LF]"
 22:48:43.243200 return proof we are we
 22:48:43.284798 < "A003 LOGOUT"
 22:48:43.284874 > "* BYE curl IMAP server signing off[CR][LF]"
 22:48:43.284936 > "A003 OK LOGOUT completed[CR][LF]"
 22:48:43.325581 MAIN sockfilt said DISC
 22:48:43.325661 ====> Client disconnected
 22:48:43.325745 Awaiting input
=== End of file imap_server.log
=== Start of file imap_sockfilt.log
 22:48:42.363352 Running IPv4 version
 22:48:42.363478 Listening on port 39863
 22:48:42.363591 Wrote pid 25880 to log/1/server/imap_sockfilt.pid
 22:48:42.363695 Wrote port 39863 to log/1/server/imap_server.port
 22:48:42.363739 Received PING (on stdin)
 22:48:43.351568 ====> Client connect
 22:48:43.351832 Received DATA (on stdin)
 22:48:43.351870 > 178 bytes data, server => client
 22:48:43.351902 '        _   _ ____  _     \r\n    ___| | | |  _ \| |    \r\n '
 22:48:43.351935 '  / __| | | | |_) | |    \r\n  | (__| |_| |  _ {| |___ \r\n  '
 22:48:43.351966 ' \___|\___/|_| \_\_____|\r\n* OK curl IMAP server ready to se'
 22:48:43.352008 'rve\r\n'
 22:48:43.352115 < 17 bytes data, client => server
 22:48:43.352156 'A001 CAPABILITY\r\n'
 22:48:43.352336 Received DATA (on stdin)
 22:48:43.352369 > 18 bytes data, server => client
 22:48:43.352400 'A001 BAD Command\r\n'
 22:48:43.352460 < 30 bytes data, client => server
 22:48:43.352495 'A002 LIST "verifiedserver" *\r\n'
 22:48:43.352652 Received DATA (on stdin)
 22:48:43.352684 > 33 bytes data, server => client
 22:48:43.352714 '* LIST () "/" "WE ROOLZ: 25879"\r\n'
 22:48:43.352753 Received DATA (on stdin)
 22:48:43.352786 > 24 bytes data, server => client
 22:48:43.352817 'A002 OK LIST Completed\r\n'
 22:48:43.394255 < 13 bytes data, client => server
 22:48:43.394332 'A003 LOGOUT\r\n'
 22:48:43.394427 Received DATA (on stdin)
 22:48:43.394481 > 36 bytes data, server => client
 22:48:43.394532 '* BYE curl IMAP server signing off\r\n'
 22:48:43.394596 Received DATA (on stdin)
 22:48:43.394651 > 26 bytes data, server => client
 22:48:43.394701 'A003 OK LOGOUT completed\r\n'
 22:48:43.435053 ====> Client disconnect
 22:48:43.435215 Received ACKD (on stdin)
=== End of file imap_sockfilt.log
=== Start of file server.cmd
 Testnum 1553
=== End of file server.cmd
=== Start of file stderr1553
 URL: imap://non-existing-host.haxx.se:39863/1553
 22:48:43.457144 == Info: !!! WARNING !!
 22:48:43.457183 == Info: This is a debug build of libcurl, do not use in production.
 22:48:43.457237 == Info: STATE: INIT => SETUP handle 0x56433f055788; line 1971
 22:48:43.457283 == Info: STATE: SETUP => CONNECT handle 0x56433f055788; line 1987
 22:48:43.457351 == Info: Added connection 0. The cache now contains 1 members
 22:48:43.457475 == Info: STATE: CONNECT => RESOLVING handle 0x56433f055788; line 2013
 22:48:43.457530 == Info: ares_getsock() -> 1, fd=7, timeout=3523
 /var/tmp/portage/net-misc/curl-8.8.0/work/curl-8.8.0/tests/libtest/lib1553.c:100 ABORTING TEST, since it seems that it would have run forever (60057 ms > 60000 ms)
 22:49:43.514109 == Info: multi_done[RESOLVING]: status: 0 prem: 1 done: 0
 22:49:43.514178 == Info: multi_done, not reusing connection=0, forbid=0, close=1, premature=1, conn_multiplex=0
 22:49:43.514245 == Info: The cache now contains 0 members
 22:49:43.514279 == Info: Curl_disconnect(conn #0, dead=1)
 22:49:43.514314 == Info: Closing connection
 22:49:43.514343 == Info: Expire cleared
 Test ended with result 125
=== End of file stderr1553
* kill pid for imap => 25879
TESTDONE: 1 tests were considered during 61 seconds.
TESTDONE: 0 tests out of 1 reported OK: 0%

TESTFAIL: These test cases failed: 1553 

@icing
Copy link
Contributor

icing commented May 29, 2024

This looks as if c-ares was set up correctly, but did not get a reply on the non-existing hostname for a whole minute.

icing added a commit to icing/curl that referenced this issue May 29, 2024
- determine the actual poll timeout *after* all sockets
  have been collected. Protocols and connection filters may
  install new timeouts during collection.
- add debug logging to test1533 where the mistake was noticed
- refs curl#13782
@icing
Copy link
Contributor

icing commented May 29, 2024

The issue should be fixed with #13825. The bug was indeed introduced with 2d2c27e.

test1533 sets a timeout of 60 seconds and expects the multi_wait() to return before that. This is supposed to happen when the resolver finds out that the hostname does not exist. But in your test sendbox, I assume that c-ares queries just end up in null.

What happens then is that polling gets no events and returns when the timeout expires. But the timeout calculation was no longer correct with the change. c-ares wants to be invoked again after some 3500ms, but that did not take effect. So the poll waited the full 60 seconds and the test failed.

Now, the test should take some 3+ seconds and succeed. With a working resolver, it would of course be faster.

@Kangie
Copy link
Contributor Author

Kangie commented May 29, 2024

<3. I'll be at my desk soon and report back.

Thanks for digging into this, I really appreciate it.

@Kangie
Copy link
Contributor Author

Kangie commented May 30, 2024

Perfect.

TESTDONE: 1363 tests out of 1363 reported OK: 100%

gentoo-bot pushed a commit to gentoo/gentoo that referenced this issue May 30, 2024
This release fixes an issue where under certain circumstances
(in particular when invoked within the Gentoo sandbox), `multi_wait()`
would incorrectly calculate the timeout, resulting in test failures.

Allow setting `--enable-debug` via USE=debug with an appropriate warning
to the ebuild and live ebuild.

Closes: https://bugs.gentoo.org/932660
See-also: curl/curl#13782
Signed-off-by: Matt Jolly <kangie@gentoo.org>
@bagder bagder closed this as completed in c809666 May 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

5 participants