
Handle inaccurate information in backplane for fmb call #1381

Merged · 15 commits · Jun 28, 2023

Conversation

Contributor

@amishra-u amishra-u commented Jun 20, 2023

Problem

When workers die, their stored references are not removed from the backplane. This creates the possibility that a new worker may come up with the same IP address, or reuse an IP address previously held by a terminated host. As a result, the backplane becomes unreliable, requiring us to query each worker individually to find missing blobs. This approach is clearly not scalable, since a problem on a single worker can significantly degrade the performance of the buildfarm.

Past Work

We modified the findMissingBlobs function to query only the backplane (PRs #1310, #1333, and #1342), implementing the findMissingViaBackplane flag. However, the stale-worker problem described above made the findMissingViaBackplane flag ineffective.

Solution

To address the issue of imposter workers, the code now compares the start time of each worker (first_registered_at) with the insertion time of the digest. Any worker whose start time is later than the digest insertion time is considered an imposter. The code also removes imposter workers associated with the digest within the same call.

first_registered_at: Added a new field, first_registered_at, to the worker data type. This field stores the initial start time of the worker. The worker reports its start time to the backplane; it is the same as the creation time of the cache directory (where all digests are stored) on the worker's disk.
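
For illustration only, a minimal sketch of how a worker might derive first_registered_at from the creation time of its cache directory; the class and method names are hypothetical, not the PR's actual code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.concurrent.TimeUnit;

class WorkerStartTime {
  // Derive first_registered_at (epoch seconds) from the creation time of the
  // worker's CAS cache directory, so a fresh disk implies a new start time.
  static long firstRegisteredAt(Path cacheDir) throws IOException {
    BasicFileAttributes attrs = Files.readAttributes(cacheDir, BasicFileAttributes.class);
    return attrs.creationTime().to(TimeUnit.SECONDS);
  }
}
```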

digest insert time: The digest insertion time is calculated from the Time to Live (TTL) of the digest and the casExpire time. The formula for determining the digest insertion time is now() - configured casExpire + remaining TTL. In the current implementation, each worker refreshes the TTL of the digest upon completing the write operation, so the CAS insert time in the backplane corresponds to the time when the last worker finished writing the digest to its disk.
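
A minimal sketch of the check described above, assuming all times are in epoch seconds; the names digestInsertTime and isImposter are illustrative, not the PR's actual API.

```java
class ImposterWorkerCheck {
  // digest insert time = now() - configured casExpire + remaining TTL
  static long digestInsertTime(long nowEpochSeconds, long casExpireSeconds, long remainingTtlSeconds) {
    return nowEpochSeconds - casExpireSeconds + remainingTtlSeconds;
  }

  // A worker that first registered after the digest was inserted cannot have
  // written that digest, so it is treated as an imposter for this digest.
  static boolean isImposter(long workerFirstRegisteredAt, long digestInsertTime) {
    return workerFirstRegisteredAt > digestInsertTime;
  }
}
```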

Testing

Deployed the change to our buildfarm staging environment and ran a full monorepo build. To verify that the change solves the terminated-worker problem, terminated a batch of workers in the middle of the build. This caused temporary not_found errors, which eventually faded away as the fmb call auto-corrected blob locations.
[Screenshot 2023-06-21: graph from the first build, during which workers were terminated.]

Future Improvement

The above solution might not work if the user changes the casExpire time between two deployments, since the algorithm for calculating digest_insert_time depends on the casExpire value.

closes #1371

@amishra-u amishra-u requested a review from werkt as a code owner June 20, 2023 06:40
@amishra-u amishra-u changed the title Handle inaccurate information in backplane Handle inaccurate information in backplane for fmb call Jun 20, 2023
@amishra-u amishra-u marked this pull request as draft June 20, 2023 23:19
@amishra-u amishra-u marked this pull request as ready for review June 21, 2023 20:20
// ignore this, the worker will update the backplane eventually
} else if (status.getCode() != Code.DEADLINE_EXCEEDED
&& SHARD_IS_RETRIABLE.test(status)) {
// why not, always
workers.addLast(worker);
} else {
log.log(
configs.getServer().isEnsureOutputsPresent() ? Level.SEVERE : Level.WARNING,
Collaborator
this is not a severe error, even if we're checking for an AC presence via EOP. SEVERE logs are reserved for bugs or consistency failures.

Contributor Author

With the bwob flag enabled, this one failure causes the whole build to fail; I also believe the EOP flag was added to support the bwob feature, so I thought SEVERE might be the better classification.

Collaborator
Build failures because of expected behavior or conditions are still not SEVERE. Please change this and the log elevation in bytestream to remain a warning.

Contributor Author
Reverted log level as per your suggestion.

List<String> workerSet = client.call(jedis -> state.storageWorkers.mget(jedis, workerNames));

return workerSet.stream()
.filter(Objects::nonNull)
Collaborator
why would you get null entries from a workerSet? (also calling it workerSet and being a list is a little confusing)

Contributor Author
@amishra-u amishra-u Jun 22, 2023
Redis HMGET returns null if the field doesn't exist in the hash. Also updated the unit test to cover this case.

https://redis.io/commands/hmget/

For every field that does not exist in the hash, a nil value is returned. Because non-existing keys are treated as empty hashes, running HMGET against a non-existing key will return a list of nil values.

Renamed from workerSet to workerList, thanks!
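
For reference, a minimal sketch of the HMGET behavior described above using the Jedis client; the key name and helper method here are illustrative, not buildfarm's actual backplane code.

```java
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

import redis.clients.jedis.Jedis;

class HmgetNullFilter {
  // HMGET returns one entry per requested field, in order; fields that do not
  // exist in the hash come back as null and are filtered out here.
  static List<String> presentWorkers(Jedis jedis, String workersKey, String... workerNames) {
    List<String> values = jedis.hmget(workersKey, workerNames);
    return values.stream().filter(Objects::nonNull).collect(Collectors.toList());
  }
}
```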

Contributor Author

amishra-u commented Jun 23, 2023

@werkt Can you please take a look again?

I ran the shadow experiment for two days; there were no not_found errors for the read API, and overall there were fewer timeouts.

For us (with 120 workers), the average latency of the FMB call was reduced roughly 15x: 4.5 ms with the change versus 64 ms without it.
[Screenshot 2023-06-23: FMB latency comparison graph.]

@@ -660,7 +661,7 @@ public ListenableFuture<Iterable<Digest>> findMissingBlobs(
// risk of returning expired workers despite filtering by active workers below. This is because
// the strategy may return workers that have expired in the last 30 seconds. However, checking
// workers directly is not a guarantee either since workers could leave the cluster after being
// queried. Ultimitely, it will come down to the client's resiliency if the backplane is
// queried. Ultimately, it will come down to the client's resiliency if the backplane is
// out-of-date and the server lies about which blobs are actually present. We provide this
// alternative strategy for calculating missing blobs.

Collaborator

This needs a refactor, whether before or after landing, to properly partition FMB into backplane and worker segments for clarity. If it's to be after, please follow up with an issue.

Contributor Author

Created a follow-up task; I will work on this next: #1393

Collaborator
@werkt werkt left a comment

Not the biggest fan of passing epoch seconds around instead of Instant, but this seems well compartmentalized. Looking forward to the refactor.

@werkt werkt merged commit ccee33b into bazelbuild:main Jun 28, 2023
2 checks passed
@sbalabanov

A naive question: why don't workers have a unique id changing with each restart so we can uniquely identify them without referring to IP and start time?

Contributor Author

amishra-u commented Jun 29, 2023

A naive question: why don't workers have a unique id changing with each restart so we can uniquely identify them without referring to IP and start time?

That was also an option, but I chose this approach for these reasons:

  1. In our existing setup, whenever a host restarts, it starts with a clean disk. However, it is feasible to have a setup where the disk is persisted during deployment. In such cases, we would need to implement a mechanism to differentiate between old and new hosts.
  2. This approach proactively cleans up outdated entries from Redis, rather than waiting for the TTL to expire.

@amishra-u amishra-u deleted the fmb branch June 29, 2023 04:12