-
-
Notifications
You must be signed in to change notification settings - Fork 6.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Two test servers, one lock #11231
Labels
Comments
The Maybe the easiest (?) approach is to have all test servers create their own "lock file" in a dedicated per-runner lock-directory and have |
@dfandrich do you want to take a stab at converting over to a multi-file system or should I? |
Parallel testing changes the test timing so much that this issue causes many
more flaky tests. Since this is basically blocking proper use of that feature,
I'll tackle this one to get that unblocked. I'll still make a PR with parallel
testing available shortly since it's still useful in this state, even if
imperfect.
|
dfandrich
added a commit
that referenced
this issue
Jun 4, 2023
Parallel testing is enabled by using a nonzero value for the -j option to runtests.pl. Performant values seem to be about 7*num CPU cores, or 1.3*num CPU cores if Valgrind is in use. Flaky tests due to improper log locking (bug #11231) are exacerbated while parallel testing, so it is not enabled by default yet. Ref: #10818 Closes #11246
dfandrich
added a commit
that referenced
this issue
Jun 5, 2023
Parallel testing is enabled by using a nonzero value for the -j option to runtests.pl. Performant values seem to be about 7*num CPU cores, or 1.3*num CPU cores if Valgrind is in use. Flaky tests due to improper log locking (bug #11231) are exacerbated while parallel testing, so it is not enabled by default yet. Fixes #10818 Closes #11246
dfandrich
added a commit
that referenced
this issue
Jun 7, 2023
Logs are written by several servers and all of them must be finished writing before the test results can be determined. This means each server must have its own lock file rather than sharing a single one, which is how it was done up to now. Previously, the first server to complete a test would clear the lock before the other server was done, which caused flaky tests. Lock files are now all found in their own directory, so counting locks equals counting the files in that directory. The result is that the proxy logs are now reliably written which actually changes the expected output for two tests. Fixes #11231 Closes #11259
bch
pushed a commit
to bch/curl
that referenced
this issue
Jul 19, 2023
Parallel testing is enabled by using a nonzero value for the -j option to runtests.pl. Performant values seem to be about 7*num CPU cores, or 1.3*num CPU cores if Valgrind is in use. Flaky tests due to improper log locking (bug curl#11231) are exacerbated while parallel testing, so it is not enabled by default yet. Fixes curl#10818 Closes curl#11246
bch
pushed a commit
to bch/curl
that referenced
this issue
Jul 19, 2023
Logs are written by several servers and all of them must be finished writing before the test results can be determined. This means each server must have its own lock file rather than sharing a single one, which is how it was done up to now. Previously, the first server to complete a test would clear the lock before the other server was done, which caused flaky tests. Lock files are now all found in their own directory, so counting locks equals counting the files in that directory. The result is that the proxy logs are now reliably written which actually changes the expected output for two tests. Fixes curl#11231 Closes curl#11259
ptitSeb
pushed a commit
to wasix-org/curl
that referenced
this issue
Sep 25, 2023
Parallel testing is enabled by using a nonzero value for the -j option to runtests.pl. Performant values seem to be about 7*num CPU cores, or 1.3*num CPU cores if Valgrind is in use. Flaky tests due to improper log locking (bug curl#11231) are exacerbated while parallel testing, so it is not enabled by default yet. Fixes curl#10818 Closes curl#11246
ptitSeb
pushed a commit
to wasix-org/curl
that referenced
this issue
Sep 25, 2023
Logs are written by several servers and all of them must be finished writing before the test results can be determined. This means each server must have its own lock file rather than sharing a single one, which is how it was done up to now. Previously, the first server to complete a test would clear the lock before the other server was done, which caused flaky tests. Lock files are now all found in their own directory, so counting locks equals counting the files in that directory. The result is that the proxy logs are now reliably written which actually changes the expected output for two tests. Fixes curl#11231 Closes curl#11259
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
The test servers use a single file
serverlogs.lock
to indicate when a test server is still writing its logs to prevent the test harness from looking at them too soon. However, many tests use more than one server running in parallel (e.g. http and http-proxy) and the first one to finish clears the lock for both of them. This means that the test harness can look at the logs before they are all ready. I believe this is what causes many of the flaky test failures we often see in the CI tests.For example, in one particular test failure I looked at in test 1904, the
proxy.input
file started with the line[DISCONNECT]
. This should never happen because there is no client connected when the test starts. Furthermore, test 1904 does NOT setconnection-monitor
without which[DISCONNECT]
should not be printed. What must have happened was that a previous test setconnection-monitor
but the other server cleared the log lock before it had a chance to write that final message and only write it after the log files were deleted and the next test (1904) had already started.Many other flaky test failures occur because a final
QUIT
command is not logged. This may also be explained if another server clears the log lock (causing the test harness to look at all the log files too soon) before that server gets a chance to write the final command.More evidence for this is the server log line:
indicating that the lock file is expected to exist but was somehow deleted without the server doing it. The most likely explanation is that the other server deleted it.
One problem with this theory in the general case is that this kind of problem also happens on tests that only have a single server (e.g. ftp test 135). Some of these use
sockfilt
in addition to the main test server, but it doesn't appear thatsockfilt
engages in the log file locking scheme.One solution to this problem is to allow multiple lock files, probably one fixed lock file per server, and have the test harness look for and wait for them all. Another is to have all servers set
fcntl(1)
locks on a single file, but this may not be portable enough and is probably why the file approach was chosen from the start.The text was updated successfully, but these errors were encountered: