Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TFTP retries too quickly on mingw-w64 (test 286) #12040

Closed
dfandrich opened this issue Oct 5, 2023 · 5 comments
Closed

TFTP retries too quickly on mingw-w64 (test 286) #12040

dfandrich opened this issue Oct 5, 2023 · 5 comments
Labels

Comments

@dfandrich
Copy link
Contributor

I did this

This CI run shows test 286 (TFTP send of boundary case 512 byte file) failing with a timeout. The logs shows it sending the write request and (presumably) getting the response, then sending the first (and only) data block and waiting for the ACK, but the ACK is missing (possibly due to the server being slow). It then retries and goes through this in a loop 50 times before giving up.

This is what I expect should happen in this case except for one thing: it only waits about 40 µsec between retries. The entire retry attempt (50 retries) takes about 3 msec. Something is clearly wrong with the retry timer here.

What might be relevant here is that this is the very first data block whose ACK is missing. Could it be the right retry time hasn't been set yet and so it retries for 0 seconds?

I expected the following

curl should way 6 seconds between retries. That should be enough for even an extremely slow server to respond with an ACK, and the test should succeed.

curl/libcurl version

8.4.0-DEV

operating system

MSYS_NT-10.0-14393 APPVYR-WIN 3.0.7-338.x86_64 2019-07-11 10:58 UTC x86_64 Msys

@bagder
Copy link
Member

bagder commented Oct 6, 2023

Created #12047 for researching this

@dfandrich
Copy link
Contributor Author

I tried applying this patch to delay the first data block ACK to see if my speculation about whether that is the trigger was true, but on Linux everything worked as expected and test 286 succeeded. It would be great if someone could try this on msys2 and see if it triggers the bug or not.

diff --git a/tests/server/tftpd.c b/tests/server/tftpd.c
index 670897c0d..c41aaf3e4 100644
--- a/tests/server/tftpd.c
+++ b/tests/server/tftpd.c
@@ -1271,6 +1271,7 @@ send_ack:
       goto abort;
     }
     write_behind(test, pf->f_convert);
+    sleep(90); /* DEBUGGING #12040 */
     for(;;) {
 #ifdef HAVE_ALARM
       alarm(rexmtval);

@bagder bagder added the TFTP label Oct 8, 2023
@dfandrich
Copy link
Contributor Author

I caught this test failing (CMake, mingw-w64, gcc 9, Debug x64, Schannel, Static, Unity) with your extra logging added. Here's the start of the log (it continues in the same way for 2 msec until the retries are up):

 19:37:45.933323 == Info: STATE: CONNECTING => PROTOCONNECT handle 0xe21398; line 2103
 19:37:45.933339 == Info: set timeouts for state 0; Total  300000, retry 6 maxtry 50
 19:37:45.933383 == Info: STATE: PROTOCONNECT => DO handle 0xe21398; line 2135
 19:37:45.933388 == Info: TFTP_STATE_START
 19:37:45.933522 == Info: STATE: DO => DOING handle 0xe21398; line 2218
 19:37:46.147742 == Info: TFTP_STATE_START
 19:37:46.147749 == Info: Connected for transmit
 19:37:46.147754 == Info: set timeouts for state 2; Total  0, retry 72 maxtry 50
 19:37:46.148169 == Info: tftp_state_timeout: 912868768 > 0 + 72
 19:37:46.148187 == Info: TFTP_STATE_TX
 19:37:46.148190 == Info: Timeout waiting for block 2 ACK.  Retries = 1
 19:37:46.148227 == Info: tftp_state_timeout: 65023 > 0 + 72
 19:37:46.148243 == Info: TFTP_STATE_TX
 19:37:46.148246 == Info: Timeout waiting for block 2 ACK.  Retries = 2
 19:37:46.148290 == Info: tftp_state_timeout: 65023 > 0 + 72
 19:37:46.148306 == Info: TFTP_STATE_TX
 19:37:46.148309 == Info: Timeout waiting for block 2 ACK.  Retries = 3
 19:37:46.148341 == Info: tftp_state_timeout: 65023 > 0 + 72
 19:37:46.148357 == Info: TFTP_STATE_TX
…

The logging shows state->rx_time always set to 0, despite that variable being set two lines after this log message is output. What's also interesting is the varied value for current; 912868768 the first time then 65023 for all subsequent calls. The actual value should have been 1697053066 in all cases.

I don't see any relationship between the various time values above on the bit level (although it's interesting that 65023 has 15 of 16 binary 1 bits). Both numbers are set with time(2) and they're both using time_t as the receiving type, which is the correct one. But, it's hard to believe the time(2) call would be returning such wildly incorrect values.

@dfandrich
Copy link
Contributor Author

One more data point: this test has been failing only on the CMake, mingw-w64, gcc 9, Debug x64, Schannel, Static, Unity build, and it started failing the day after the job was created. It's failing at the rate of 30.2% of runs. It's likely to be an issue specific in that one environment, perhaps, dare I suggest it, a compiler bug. I don't think it's the only build using gcc 9, though. I can't tell what version of gcc 9 but most of the Circle CI builds are on Ubuntu 20.04 which is also running a gcc 9 (9.3.0), although it could be a problem affecting the specific point revision it's using.

Whatever the root cause, I see no evidence that it's a problem with curl. My inclination is to just ignore the results of this test on this one build and be done with it.

dfandrich added a commit that referenced this issue Oct 13, 2023
This test fails sometimes when retries occur due to what may just just
be a compiler bug. The test results are ignored on the one CI job where
it occurs because there seems to be nothing we can do to fix it.

Fixes #12040
Closes #12106
@bagder
Copy link
Member

bagder commented Oct 13, 2023

My inclination is to just ignore the results of this test on this one build and be done with it.

It gets my vote as well. 👍

@dfandrich dfandrich changed the title TFTP retries to quickly on mingw-w64 (test 286) TFTP retries too quickly on mingw-w64 (test 286) Oct 13, 2023
zuoxiaofeng pushed a commit to zuoxiaofeng/curl that referenced this issue Nov 28, 2023
This test fails sometimes with a super fast retry loop due to what may
just be a compiler bug. The test results are ignored on the one CI job
where it occurs because there seems to be nothing we can do to fix it.

Fixes curl#12040
Closes curl#12106
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging a pull request may close this issue.

2 participants