RBE client gets stuck when a blob file is missing #1501

Open
DarkMatterV opened this issue Oct 16, 2023 · 6 comments

@DarkMatterV

buildfarm version: 2.4.0
android rbe: 0.57.0.4865132
buildfarm configuration:

  • server: 3 k8s pods
  • shard workers: more than 10 k8s pods, and execute workers act as CAS workers

One of the worker pods was faulty, so I had to delete it and re-create it with an empty cache directory. However, the ContentAddressableStorage: and ActionCache: data in Redis was not deleted. Since then, the Android RBE client gets stuck when it uses the buildfarm server for remote builds, and the client error log shows:

cas.go:1399] Error downloading {blob file hash}/{blob file size}: rpc error: code = NotFound desc = No workers found.
cas.go:1408] Internal tool error - matching map entry

I found that remote-api only handles a possible NOT_FOUND status returned by the GetActionResult, GetTree, and WaitExecution interfaces, but does not handle a NOT_FOUND status from the download interface.

Question 1: Should the server ensure that the result of GetActionResult is consistent with what can actually be downloaded?

I then set ensureOutputsPresent: true and tested again (this time deleting only the blob files). The first build after deleting a blob file still got stuck, and GetActionResult still returned 200. The second build succeeded, and GetActionResult returned NotFound.
Question 2: Why does the first build still get stuck when ensureOutputsPresent: true is set?

Question 3: Could the CAS data held by each worker also be stored in Redis in the reverse direction, similar to a bidirectional linked list, with the worker address as the key and the list of CAS entries as the value?
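To make the idea in Question 3 concrete, here is a minimal sketch of such a two-way index. It is not buildfarm code: the class `CasIndex`, its method names, and the example worker addresses and digests are all hypothetical, and plain Python dictionaries stand in for the two Redis mappings.

```python
from collections import defaultdict


class CasIndex:
    """In-memory stand-in for the two Redis mappings (all names are hypothetical)."""

    def __init__(self):
        self.workers_by_digest = defaultdict(set)  # digest -> worker addresses
        self.digests_by_worker = defaultdict(set)  # worker address -> digests

    def add(self, worker, digest):
        # Record the blob in both directions so either side can be looked up.
        self.workers_by_digest[digest].add(worker)
        self.digests_by_worker[worker].add(digest)

    def remove_worker(self, worker):
        """Forget a worker; return the digests that no worker holds any more."""
        orphaned = []
        for digest in self.digests_by_worker.pop(worker, set()):
            holders = self.workers_by_digest[digest]
            holders.discard(worker)
            if not holders:
                del self.workers_by_digest[digest]
                orphaned.append(digest)
        return sorted(orphaned)


index = CasIndex()
index.add("worker-a:8981", "abc/10")
index.add("worker-b:8981", "abc/10")
index.add("worker-a:8981", "def/20")

# Simulate deleting the faulty pod: only the blob that had no other copy is orphaned.
orphans = index.remove_worker("worker-a:8981")
print(orphans)
```

With the reverse mapping, removing a worker touches only the digests that worker held instead of scanning every CAS entry, and the orphaned digests are the ones whose ActionCache entries would need to be invalidated.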

@DarkMatterV
Author

Supplement to Question 1: Why doesn't the GetActionResult interface verify that the output blob files exist? It could, for example, randomly select one worker from the CAS worker list of each blob and check whether that worker still has the blob file.
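A rough sketch of the check proposed above, under stated assumptions: `get_action_result`, `action_cache`, `cas_workers`, and `worker_has_blob` are hypothetical names for illustration and do not correspond to buildfarm's actual implementation. Returning `None` stands in for a NOT_FOUND response.

```python
import random


def get_action_result(action_key, action_cache, cas_workers, worker_has_blob):
    """Return the cached result, or None (NOT_FOUND) if an output blob looks missing."""
    result = action_cache.get(action_key)
    if result is None:
        return None
    for digest in result["output_digests"]:
        workers = cas_workers.get(digest, [])
        if not workers:
            return None  # the index lists no worker for this blob
        # Probe one randomly chosen worker, as suggested in the comment above.
        # This is conservative: a miss on that worker is reported as NOT_FOUND
        # even if another worker still holds the blob.
        if not worker_has_blob(random.choice(workers), digest):
            return None
    return result


# Hypothetical state: one cached action whose single output lives on one worker.
action_cache = {"action-1": {"output_digests": ["abc/10"]}}
cas_workers = {"abc/10": ["worker-a:8981"]}
blobs_on_disk = {("worker-a:8981", "abc/10")}


def has_blob(worker, digest):
    return (worker, digest) in blobs_on_disk


hit = get_action_result("action-1", action_cache, cas_workers, has_blob)

# Simulate the re-created pod: the index still lists the worker, but the file is gone.
blobs_on_disk.clear()
miss = get_action_result("action-1", action_cache, cas_workers, has_blob)
```

In this sketch the stale index entry no longer produces a cache hit, so the client would see a miss up front rather than failing later during download.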

@80degreeswest
Collaborator

I've experienced this issue with autoscaling - worker would scale down and bazel client would get stuck waiting forever.

@DarkMatterV
Author

DarkMatterV commented Oct 19, 2023

> I've experienced this issue with autoscaling - worker would scale down and bazel client would get stuck waiting forever.

Hi, 80degreeswest.
Do your execute workers and CAS workers run together (which would mean the blob file is lost while the Redis data is not synchronized)?
Which version of bazel do you use?

I previously tested this scenario with bazel 6.1 and the build succeeded. The android rbe client we use, however, got stuck.

Also, how do you deal with this problem?

@DarkMatterV
Author

Hi, @80degreeswest
I noticed that #976 could solve my problem, but it has not been merged.
I compared build times before and after adding the disk storage check. So far it doesn't seem to add much time.

@80degreeswest
Collaborator

Yes, I use cas+execute workers. I see this problem when my workers scale down. We use bazel 5.3.1. To work around it you can enable graceful shutdown, which is available in v2.6.1. This config will wait x seconds for any executions in progress to finish before shutting down the worker. Obviously that is not going to help if your worker is already broken, but it will solve the issue in the case of a normal shutdown. https://github.com/bazelbuild/bazel-buildfarm/blob/main/examples/config.yml#L128.

@80degreeswest
Collaborator

I'm not sure what the state of that PR is. @luxe may be able to provide some more detail on whether it would make sense to revisit it. @shirchen do you have that change deployed in your cluster?
