Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tests, tweak server checks, lock file handling and timers #14835

Closed
wants to merge 15 commits into from

Conversation

icing
Copy link
Contributor

@icing icing commented Sep 9, 2024

  • when server are killed by a test case, do not wait for the server lock file to go away. These tests are mostly about client timeouts and the server will hang until killed.
  • when killing a server successfully, check for a remaining lock file, log its existence and remove it.
  • lower the delay timings on SLOWDOWN by half
  • add SLOWDOWNDATA server command to only slow down the FTP data bytes, not the control ones.
  • lower some timeout values

In addition, with Windows failures in test445, make sws.c's CONNECT handling non-blocking with a 1 second timeout.

Update: in addition, now always run the verifyserver checks before a testrun. This was avoided on non-torture runs and in https cases and I saw several Windows CI failures that just could not connect to the server.

@icing icing added the tests label Sep 9, 2024
tests/data/test1086 Outdated Show resolved Hide resolved
tests/data/test1112 Outdated Show resolved Hide resolved
tests/data/test1117 Outdated Show resolved Hide resolved
tests/data/test303 Outdated Show resolved Hide resolved
tests/ftpserver.pl Outdated Show resolved Hide resolved
tests/ftpserver.pl Outdated Show resolved Hide resolved
tests/runner.pm Outdated
@@ -1066,7 +1066,10 @@ sub singletest_clean {
}
}

waitlockunlock($serverlogslocktimeout);
my @killtestservers = getpart("client", "killserver");
if(!@killtestservers) { # we are not killing it anyway
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the case where a test uses multiple servers but is only killing one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is indeed a single test case that does that 1561 (and there might be future ones).

I'll submit a change in ordering here:

  • execute the postcommanddelay first
  • kill all specified servers, removing their lock files when necessary
  • wait for any remaining servers to release their locks
  • record the test timing

tests/runner.pm Outdated Show resolved Hide resolved
tests/server/sws.c Show resolved Hide resolved
tests/server/sws.c Show resolved Hide resolved
@icing
Copy link
Contributor Author

icing commented Sep 10, 2024

Moved the lock cleanup into the killservers() function with some patterns to match different lock file names used by servers. There were still warnings in Windows CI jobs about timed out checks on these.

Some test cases where the curl client times out and the server is needing a kick take very long time on Windows CI runs. Below is a sample form a run that ultimately timed out after 14 minutes:

2035: 20.213s
1272: 21.48s
1561: 24.402s
1238: 24.445s
445: 24.46s
1117: 25.807s
1244: 26.845s
313: 26.889s
312: 27.955s
3207: 28.63s
1112: 29.87s
1086: 33.425s
1097: 41.569s
303: 43.356s

@bagder
Copy link
Member

bagder commented Sep 10, 2024

When I test for example 303 on my system, I see that it always waits the five seconds when killing the servers in the end which seems like a terrible waste. Since the servers can handle being terminated anyway, I suggest we shorten the wait from 5 seconds to .5 seconds =>

diff --git a/tests/processhelp.pm b/tests/processhelp.pm
index 48f385040..99402b520 100644
--- a/tests/processhelp.pm
+++ b/tests/processhelp.pm
@@ -309,14 +309,14 @@ sub killpid {
                 }
             }
         }
     }
 
-    # Allow all signalled processes five seconds to gracefully die.
+    # Allow all signalled processes half a second to gracefully die.
     if(@signalled) {
-        my $twentieths = 5 * 20;
-        while($twentieths--) {
+        my $loops = 10;
+        while($loops--) {
             for(my $i = scalar(@signalled) - 1; $i >= 0; $i--) {
                 my $pid = $signalled[$i];
                 if(!pidexists($pid)) {
                     print("RUN: Process with pid $pid gracefully died\n")
                         if($verbose);

@icing
Copy link
Contributor Author

icing commented Sep 10, 2024

When I test for example 303 on my system, I see that it always waits the five seconds when killing the servers in the end which seems like a terrible waste. Since the servers can handle being terminated anyway, I suggest we shorten the wait from 5 seconds to .5 seconds =>

No need, really. With my last commit here, we reap the zombies that cause the wait.

@bagder
Copy link
Member

bagder commented Sep 10, 2024

Ah right, that really does the trick.

Test 303 goes from 20 seconds to 11 for me with this PR.

- when server are killed by a test case, do not wait for
  the server lock file to go away. These tests are mostly
  about client timeouts and the server will hang until killed.
- when killing a server successfully, check for a remaining
  lock file, log its existence and remove it.
- lower the delay timings on SLOWDOWN by half
- add SLOWDOWNDATA server command to only slow down the
  FTP data bytes, not the control ones.
- lower some timeout values
This makes behaviour reproducable under Windows with its weird
auto-retry attempts.
- remove obsolete comments in test files
- revert timing reductions in 1117, 303 for reliability
- add comment to explain why sws.c does a non-blocking connect()
- sws.c: revert socket to blocking after successful connect
- runner.pm: reorder logic of delay/killserver/lockwait, so that
  the test first kills specified servers and then waits for lock
  file release on any remaining ones.
Check all known variations of lock files produced for a server
for cleanup. Exact name mappings are not kept, but candidates
should not overlap.
Some processes we kill are the test runners children and go into
zombie state until we await their passing. Do that in the loop
the monitors the signaled children, so the zombie can pass on.
For a range of servers there was no responsiveness test done at
the start of a test case UNLESS the tests ran in torture mode.

Remove these shortcuts and always check responsiveness if a server
is believed to be running.
not only http and trust that the stunnel is fine.
@icing icing changed the title tests, tweak lock file handling and timers tests, tweak server checks, lock file handling and timers Sep 12, 2024
@bagder bagder closed this in e70c22b Sep 13, 2024
@charles2910
Copy link
Contributor

Quick feedback from Debian here:

We were going to investigate this flakiness on the FTP tests this weekend (it was really showing up for us all the time), but we saw this MR and I gave a try and apparently it solved the problems for us (so thanks @icing!) - both gitlab CI and locally. Tomorrow we will upload the new patch version and we'll get back to you guys with the final answer about the tests on Debian's infrastructure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging this pull request may close these issues.

4 participants