Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove empty/dup digest in findMissingBlobs calculation #1342

Merged
merged 8 commits into from
May 24, 2023

Conversation

amishra-u
Copy link
Contributor

@amishra-u amishra-u commented May 18, 2023

Issue: While testing ensureOutputPresent flag for buildfarm I found that the getActionResult most of the time returns empty digests under key stdout and stderr. The getActionResult returns not_found error if findMissingBlobs don't return empty list. As empty digest are never stored it will be always present in the missing digest list.
The old implementation (querying each worker) also filters out empty digests.
Solution: Query only nonEmptyDigests for findMissingBlobs api.

Issue: I found another problem in the ensureOutputPresent function. When creating a list of digests from the getActionResult output, duplicates are not being removed. This issue leads to failures when trying to retrieve workers for these digests from the backplane. The ImmutableMap class throws an IllegalArgumentException when duplicate keys are encountered, resulting in an "unknown" error. Code related to this issue in the following locations:

  1. List of digests creation: AbstractServerInstance.java, line 272
  2. Fetching workers from the backplane: JedisCasWorkerMap.java, line 216

Solution: Make sure list of digests are unique while querying missing blobs.

@amishra-u amishra-u requested a review from werkt as a code owner May 18, 2023 04:56
@werkt
Copy link
Collaborator

werkt commented May 18, 2023

Can you provide a rationale for this? We ignore empty blobs because, by definition, all empty blobs are not missing.

@amishra-u amishra-u marked this pull request as draft May 18, 2023 18:56
@amishra-u
Copy link
Contributor Author

Can you provide a rationale for this? We ignore empty blobs because, by definition, all empty blobs are not missing.

Thanks for looking into. Updated diff summary. I converted this diff to draft as I am still running few experiments.

@amishra-u
Copy link
Contributor Author

Can you provide a rationale for this? We ignore empty blobs because, by definition, all empty blobs are not missing.

It seems that you already knew about this issue 3 years back.

:)

@amishra-u amishra-u marked this pull request as ready for review May 19, 2023 09:36
@amishra-u
Copy link
Contributor Author

@werkt can you please review it. Thanks in advance.

@werkt werkt changed the title Remove empty digest in findMissingBlobs calculation Remove empty/dup digest in findMissingBlobs calculation May 23, 2023
Copy link
Collaborator

@werkt werkt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple of quick fixes and looks good.

@werkt werkt merged commit 307a515 into bazelbuild:main May 24, 2023
2 checks passed
werkt pushed a commit that referenced this pull request Jun 28, 2023
### Problem 
When workers die, their stored references are not removed from the backplane. This creates the possibility that new workers may come up with the same IP address or use an IP address previously used by another terminated host. As a result, the backplane becomes unreliable, requiring us to query each worker individually to find missing blobs. Clearly, this approach is not scalable since any problems encountered by a single worker can significantly impact the performance of the buildfarm.

### Past Work
We made code modifications for the `findMissingBlobs` function to exclusively query the backplane, prs: #1310, #1333, and #1342. This update implemented the `findMissingViaBackplane` flag. However, the above issues made the `findMissingViaBackplane` flag ineffective.

### Solution
To address the issue of imposter workers, updated code to compare the start time of each worker (first_registered_at) with the insertion time of the digest. Any worker whose start time is later than the digest insertion time is considered an imposter worker. Also, the code removes imposter workers associated with the digest in the same function call.

**first_registered_at**: Added new field first_registered_at to the worker data type. This field stores the initial start time of the worker. Worker  informs the backplane about its start time, which is the same as the creation time of the cache directory (where all digests are stored) on the worker's disk.

**digest insert time**: The digest insertion time is calculated using the Time to Live (TTL) of the digest and the casExpire time. The formula for determining the digest insertion time is now() - configured casExpire + remaining ttl. In the current implementation, each worker updates the TTL of the digest upon completing the write operation. This means that the cas insert time in the backplane corresponds to the time when the last worker finished writing the digest on its disk.

### Testing
Deployed the change to our buildfarm staging, and ran full monorepo build. To make sure that the code change solve terminated worker problem, terminated bunch of workers in the middle of build. This caused temporary not_found `error`, which eventually faded away (fmb call autocorrect blob location).
<img width="1385" alt="Screenshot 2023-06-21 at 12 36 47 PM" src="https://github.com/bazelbuild/bazel-buildfarm/assets/119983081/62fcf8e0-847a-4632-b49e-cef2c17321dc">
In the above graph terminated workers during first build.


### Future Improvement
The above solution might not work if user updates `cas_expire` time between two deployments as algorithm to calculate `digest_insert_time` depends to `cas_expire` time. 

closes #1371
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants