-
-
Notifications
You must be signed in to change notification settings - Fork 6.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tests, tweak server checks, lock file handling and timers #14835
Conversation
tests/runner.pm
Outdated
@@ -1066,7 +1066,10 @@ sub singletest_clean { | |||
} | |||
} | |||
|
|||
waitlockunlock($serverlogslocktimeout); | |||
my @killtestservers = getpart("client", "killserver"); | |||
if(!@killtestservers) { # we are not killing it anyway |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about the case where a test uses multiple servers but is only killing one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is indeed a single test case that does that 1561 (and there might be future ones).
I'll submit a change in ordering here:
- execute the
postcommanddelay
first - kill all specified servers, removing their lock files when necessary
- wait for any remaining servers to release their locks
- record the test timing
Moved the lock cleanup into the killservers() function with some patterns to match different lock file names used by servers. There were still warnings in Windows CI jobs about timed out checks on these. Some test cases where the curl client times out and the server is needing a kick take very long time on Windows CI runs. Below is a sample form a run that ultimately timed out after 14 minutes:
|
When I test for example 303 on my system, I see that it always waits the five seconds when killing the servers in the end which seems like a terrible waste. Since the servers can handle being terminated anyway, I suggest we shorten the wait from 5 seconds to .5 seconds => diff --git a/tests/processhelp.pm b/tests/processhelp.pm
index 48f385040..99402b520 100644
--- a/tests/processhelp.pm
+++ b/tests/processhelp.pm
@@ -309,14 +309,14 @@ sub killpid {
}
}
}
}
- # Allow all signalled processes five seconds to gracefully die.
+ # Allow all signalled processes half a second to gracefully die.
if(@signalled) {
- my $twentieths = 5 * 20;
- while($twentieths--) {
+ my $loops = 10;
+ while($loops--) {
for(my $i = scalar(@signalled) - 1; $i >= 0; $i--) {
my $pid = $signalled[$i];
if(!pidexists($pid)) {
print("RUN: Process with pid $pid gracefully died\n")
if($verbose); |
No need, really. With my last commit here, we reap the zombies that cause the wait. |
Ah right, that really does the trick. Test 303 goes from 20 seconds to 11 for me with this PR. |
- when server are killed by a test case, do not wait for the server lock file to go away. These tests are mostly about client timeouts and the server will hang until killed. - when killing a server successfully, check for a remaining lock file, log its existence and remove it. - lower the delay timings on SLOWDOWN by half - add SLOWDOWNDATA server command to only slow down the FTP data bytes, not the control ones. - lower some timeout values
This makes behaviour reproducable under Windows with its weird auto-retry attempts.
- remove obsolete comments in test files - revert timing reductions in 1117, 303 for reliability - add comment to explain why sws.c does a non-blocking connect() - sws.c: revert socket to blocking after successful connect - runner.pm: reorder logic of delay/killserver/lockwait, so that the test first kills specified servers and then waits for lock file release on any remaining ones.
Check all known variations of lock files produced for a server for cleanup. Exact name mappings are not kept, but candidates should not overlap.
Some processes we kill are the test runners children and go into zombie state until we await their passing. Do that in the loop the monitors the signaled children, so the zombie can pass on.
For a range of servers there was no responsiveness test done at the start of a test case UNLESS the tests ran in torture mode. Remove these shortcuts and always check responsiveness if a server is believed to be running.
d1764bb
to
ec6618d
Compare
not only http and trust that the stunnel is fine.
Quick feedback from Debian here: We were going to investigate this flakiness on the FTP tests this weekend (it was really showing up for us all the time), but we saw this MR and I gave a try and apparently it solved the problems for us (so thanks @icing!) - both gitlab CI and locally. Tomorrow we will upload the new patch version and we'll get back to you guys with the final answer about the tests on Debian's infrastructure. |
In addition, with Windows failures in test445, make
sws.c
's CONNECT handling non-blocking with a 1 second timeout.Update: in addition, now always run the verifyserver checks before a testrun. This was avoided on non-torture runs and in https cases and I saw several Windows CI failures that just could not connect to the server.