# BuildGrid tests failing #1582
Making this block 2.0, as we have to have functional remote execution, at least with BuildGrid.
Looks like this is as easy as:

```diff
diff --git a/.github/run-ci.sh b/.github/run-ci.sh
index a860be708..dde7102da 100755
--- a/.github/run-ci.sh
+++ b/.github/run-ci.sh
@@ -97,7 +97,7 @@ function runServiceTest() {
 # Lazily ensure that the script exits when a command fails
 #
-set -x
+set -e
 if [ -z "${test_names}" ]; then
     runTest "lint"
```

(i.e. someone mixed up the arguments: `set -x` only traces commands, while `set -e` is what actually makes the script exit when a command fails.)
Regarding the actual failure, what is happening is that BuildGrid returns an error and BuildStream crashes loudly (AssertionError) when that happens. Yeah, it shouldn't happen in the testsuite, but bst shouldn't crash that loudly when it does. In testing with a real build, the error returned has status code 13 (INTERNAL, according to https://github.com/grpc/grpc/blob/master/doc/statuscodes.md), and looking at the bgd logs I see "blob not found" errors. The "default" BuildGrid configuration we use has a 2G in-memory CAS, so I wouldn't be surprised if it isn't enough (especially if it is pruned aggressively). I don't think that would happen with the testsuite, but I guess it may.
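For illustration, a minimal sketch of the kind of graceful handling intended here, assuming grpcio's Python API; `CASCacheError` and the function wrapper are stand-ins, not BuildStream's actual code at this call site:

```python
import grpc

class CASCacheError(Exception):
    """Stand-in for a proper BuildStream error type."""

def capture_tree(local_cas, request):
    try:
        return local_cas.CaptureTree(request)
    except grpc.RpcError as e:
        # Status code 13 (INTERNAL) indicates a server-side failure,
        # such as the "blob not found" errors seen in the bgd logs;
        # surface it as a normal error instead of letting it escape
        # and crash BuildStream with an AssertionError.
        if e.code() == grpc.StatusCode.INTERNAL:
            raise CASCacheError(f"CaptureTree failed: {e.details()}") from e
        raise
```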
Thanks, I'm going to apply this fix to both master and
Note that the exception appears to be raised by the following line:

```python
response = local_cas.CaptureTree(request)
```

which does not expect to receive an exception (there is no handling for one). Testing the individual test can be done like this:
```sh
docker-compose --env-file .github/common.env \
    --file .github/compose/ci.buildgrid.yml \
    up \
    --renew-anon-volumes --remove-orphans
```
```sh
ARTIFACT_CACHE_SERVICE=http://localhost:50052 \
SOURCE_CACHE_SERVICE=http://localhost:50052 \
REMOTE_EXECUTION_SERVICE=http://localhost:50051 \
tox -- --remote-execution \
    tests/remoteexecution/remotecache.py::test_remote_autotools_build
```

And the test passes. Running the test multiple times here is now futile (it always passes); observing the buildgrid container output shows that no new builds are performed, which is expected since the build was already successful and stored in CAS.

### Errors without failures

Running locally, when running all of the RE tests in `tests/remoteexecution`, errors show up in the output even though no tests fail.

### Interesting different error from same test

After running the tests multiple times whilst keeping the RE cluster up and running, I was able to cause a different error from the same test.
This one seems to have stray stdout/stderr from some other process being inappropriately interspersed in the output (instead of being observed by BuildStream and logged through BuildStream); presumably we're not properly capturing the stdout/stderr of that process.
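A minimal sketch of the kind of capture presumably missing, using Python's subprocess module; the command here is a placeholder, not the actual child process involved:

```python
import subprocess

# Pipe the child's stdout/stderr instead of letting it write straight
# to the terminal, where it would interleave with our own messages.
proc = subprocess.Popen(
    ["some-child-process"],          # placeholder command
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    text=True,
)
out, err = proc.communicate()

# Route the captured output through our own logging so it shows up in
# the right place rather than interspersed in the test output.
for line in err.splitlines():
    print(f"[child stderr] {line}")
```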
The stray error message shows up after the (properly logged) error.

### Reproducible error

It seems that running the test above consistently causes there to be a leaked background thread.
Even if the RE cluster has been up and running and has built everything, every time we run the above test there is new build output in the RE cluster logs, which seems unexpected since we should have already built this content in previous runs.
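As a way to check for such leaks in test teardown, a small sketch (a hypothetical helper, not something from the BuildStream test suite):

```python
import threading

def assert_no_leaked_threads():
    # Any thread other than the main thread that is still alive after
    # teardown is treated as leaked -- e.g. a gRPC channel's background
    # thread that was never shut down.
    leaked = [
        t for t in threading.enumerate()
        if t is not threading.main_thread() and t.is_alive()
    ]
    assert not leaked, f"leaked background threads: {leaked!r}"
```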
### Bisection result

I can confirm that reverting e97d16b (which reverts cleanly at this time) from master causes the remote execution tests to pass cleanly again, and by reverting the patch and undoing the revert, the failure can be toggled on and off at will.

Digging a bit deeper, one working theory is that there is something about this particular change that triggers the failure, though I cannot yet say what.
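For reference, a sketch of how a bisection like this can be automated with `git bisect run`; the helper below is hypothetical (the actual procedure used here was not preserved) and assumes the RE cluster from the docker-compose setup above is running with the `*_SERVICE` environment variables exported:

```python
#!/usr/bin/env python3
# bisect-re-test.py (hypothetical helper)
#
# `git bisect run` treats exit status 0 as "good" and 1 as "bad":
#   git bisect start <bad-rev> <good-rev>
#   git bisect run ./bisect-re-test.py
import subprocess
import sys

result = subprocess.run(
    ["tox", "--", "--remote-execution",
     "tests/remoteexecution/remotecache.py::test_remote_autotools_build"],
)
sys.exit(0 if result.returncode == 0 else 1)
```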
I cannot explain why the commit in question happens to introduce this test failure; however, I was able to get down to the root of the problem and fix it in #1605.
The errors occurring due to leaked gRPC threads are now fixed with #1605, but we still have another failure.
### Bisection results

I've managed to isolate the breakage to commit 220dc78, which essentially just updates the docker images; this supports @abderrahim's observation below (#1582 (comment)) that this only happens on certain platforms.
### Debugging buildbox-casd

Looking at the gRPC error and the gRPC source code, I suspected that this was being caused by a client-side exception raised by gRPC-internal code, as the error string appears there. However, @juergbi suspected it was a server-side error instead and, as usual... he was right! Using a surgical strike to reproduce the error under the fedora docker image and running the failing test there, it turns out the error originates on the buildbox-casd side.
The original exception has to be one of the two exceptions which can be thrown from CASClient::upload, in the depths of buildbox-common. The exception is later caught in grpc core.
In order to observe the actual exception, I recompiled buildbox-common/buildbox-casd and added instrumentation to log it.
FTR, this failure has been there for some time. I've only ever seen it happen on fedora, but couldn't reproduce it locally. It is mostly triggered by test_track_recurse.
This should be fixed by https://gitlab.com/BuildGrid/buildbox/buildbox-casd/-/merge_requests/265 (and https://gitlab.com/BuildGrid/buildgrid/-/merge_requests/771). I've validated this locally but not yet in CI. This will require an update of buildbox-casd in the BuildStream docker images (when the fix is merged).
The BuildBox MRs have been merged. There will likely be new git tags later today or tomorrow.
While trying to merge #1570 I came across a few problems:
To reproduce this, run the CI script (`.github/run-ci.sh`) locally in a BuildStream checkout. Notice there are various failures in the buildgrid test, but the return status of the script is `0`.
Some relevant snips from the log: