RBE client is stuck when a blob file is missing #1501
Comments
Supplement to Question 1: Why isn't the existence of the output's blob files verified in the |
I've experienced this issue with autoscaling: a worker would scale down and the Bazel client would get stuck waiting forever.
Hi, 80degreeswest.
In addition, I'd like to ask how you deal with this problem.
Hi, @80degreeswest |
Yes, I use CAS+execute workers. I see this problem when my workers scale down. We use Bazel 5.3.1. To work around it, you can enable graceful shutdown, which is available in v2.6.1. This config waits x seconds for any executions in progress to finish before shutting down the worker. Obviously this won't help if your worker is already broken, but it solves the issue in the case of a normal shutdown. https://github.com/bazelbuild/bazel-buildfarm/blob/main/examples/config.yml#L128
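For anyone unfamiliar with the idea, here is a rough Python sketch of what "wait x seconds for in-progress executions, then shut down" means. It is not buildfarm's implementation; the `Worker` class and its method names are made up for illustration, and the real worker is configured through the config file linked above.

```python
import threading
import time


class Worker:
    """Toy worker (hypothetical) that tracks how many executions are in flight."""

    def __init__(self):
        self._idle = threading.Condition()
        self._in_flight = 0
        self._accepting = True

    def try_start(self):
        """Register a new execution; refused once shutdown has begun."""
        with self._idle:
            if not self._accepting:
                return False
            self._in_flight += 1
            return True

    def finish(self):
        """Mark one execution as done and wake a pending shutdown if idle."""
        with self._idle:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.notify_all()

    def graceful_shutdown(self, timeout_seconds):
        """Stop accepting work, then wait up to timeout_seconds for
        in-flight executions. Returns True if the worker drained in time."""
        deadline = time.monotonic() + timeout_seconds
        with self._idle:
            self._accepting = False
            while self._in_flight > 0:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False  # timed out; executions would be cut off
                self._idle.wait(remaining)
            return True
```

If the timeout is shorter than the longest action, the shutdown still interrupts that action, so the client can still end up waiting on a worker that is gone.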
buildfarm version: 2.4.0
android rbe: 0.57.0.4865132
buildfarm configuration:
One of the worker pods was faulty, so I had to delete the pod and re-create it with an empty cache directory. However, the `ContentAddressableStorage:` and `ActionCache:` data in Redis was not deleted. As a result, the Android RBE client gets stuck when it uses the buildfarm server for a remote build, and the client error log is:

I found that remote-api only handles a possible `NOT_FOUND` status returned by the GetActionResult, GetTree, and WaitExecution interfaces, but does not handle a `NOT_FOUND` status from the download interface.

I then set `ensureOutputsPresent: true` and tested again, deleting the blob files separately. The first build after deleting a blob file is still stuck, and the result of GetActionResult is still 200. The second build succeeds, and the result of GetActionResult is `NotFound`.

Question 2: Why is the first build still stuck when `ensureOutputsPresent: true` is set?

Question 3: Could the CAS data held by the workers also be recorded in Redis, similar to a bidirectional linked list, with the worker address as the key and a list of CAS entries as the value?
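To make the idea in Question 3 concrete, here is a minimal Python sketch of a two-way index between workers and CAS digests. It uses plain in-memory dictionaries as a stand-in for Redis sets; the `CasIndex` class, its method names, and the sample worker addresses are all hypothetical and are not part of buildfarm.

```python
from collections import defaultdict


class CasIndex:
    """Two-way mapping: digest -> workers holding it, worker -> digests it holds."""

    def __init__(self):
        self.workers_by_digest = defaultdict(set)
        self.digests_by_worker = defaultdict(set)

    def add(self, worker, digest):
        """Record that `worker` stores the blob identified by `digest`."""
        self.workers_by_digest[digest].add(worker)
        self.digests_by_worker[worker].add(digest)

    def holders(self, digest):
        """Return a copy of the set of workers known to hold `digest`."""
        return set(self.workers_by_digest.get(digest, ()))

    def remove_worker(self, worker):
        """Forget a worker and return the digests left with no holder."""
        orphaned = []
        for digest in self.digests_by_worker.pop(worker, set()):
            holders = self.workers_by_digest[digest]
            holders.discard(worker)
            if not holders:
                # No worker stores this blob any more.
                del self.workers_by_digest[digest]
                orphaned.append(digest)
        return sorted(orphaned)


index = CasIndex()
index.add("worker-a:8981", "digest-1")
index.add("worker-a:8981", "digest-2")
index.add("worker-b:8981", "digest-2")
# Re-creating worker-a with an empty cache would be modeled as:
lost = index.remove_worker("worker-a:8981")
```

With the per-worker side of the index, deleting a pod only has to walk that worker's own entries, and the returned orphaned digests are the ones for which cached action results could no longer be served.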