
Handle inaccurate information in backplane for fmb call #1381

Merged · 15 commits · Jun 28, 2023

Conversation

Contributor

@amishra-u amishra-u commented Jun 20, 2023

Problem

When workers die, their stored references are not removed from the backplane. This creates the possibility that a new worker may come up with the same IP address, or reuse an IP address previously held by a terminated host. As a result, the backplane becomes unreliable, requiring us to query each worker individually to find missing blobs. This approach is clearly not scalable, since a problem on a single worker can significantly degrade the performance of the buildfarm.

Past Work

We modified the findMissingBlobs function to query only the backplane (PRs #1310, #1333, and #1342), implementing the findMissingViaBackplane flag. However, the stale-worker problem described above made the findMissingViaBackplane flag ineffective.

Solution

To address the issue of imposter workers, the code now compares the start time of each worker (first_registered_at) with the insertion time of the digest. Any worker whose start time is later than the digest insertion time is considered an imposter. The code also removes imposter workers associated with the digest within the same call.

first_registered_at: Added a new field, first_registered_at, to the worker data type. This field stores the initial start time of the worker. The worker reports its start time to the backplane; it is the same as the creation time of the cache directory (where all digests are stored) on the worker's disk.
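
For illustration only, a minimal sketch of how a worker might derive first_registered_at from the creation time of its cache directory; the class and method names are hypothetical, not the PR's actual code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.concurrent.TimeUnit;

class WorkerStartTime {
  // Derive first_registered_at (epoch seconds) from the creation time of the
  // worker's CAS cache directory, so a fresh disk implies a new start time.
  static long firstRegisteredAt(Path cacheDir) throws IOException {
    BasicFileAttributes attrs = Files.readAttributes(cacheDir, BasicFileAttributes.class);
    return attrs.creationTime().to(TimeUnit.SECONDS);
  }
}
```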

digest insert time: The digest insertion time is calculated from the Time to Live (TTL) of the digest and the casExpire time. The formula for determining the digest insertion time is now() - configured casExpire + remaining TTL. In the current implementation, each worker refreshes the TTL of the digest upon completing the write operation, so the CAS insert time in the backplane corresponds to the time when the last worker finished writing the digest to its disk.
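
A minimal sketch of the check described above, assuming all times are in epoch seconds; the names digestInsertTime and isImposter are illustrative, not the PR's actual API.

```java
class ImposterWorkerCheck {
  // digest insert time = now() - configured casExpire + remaining TTL
  static long digestInsertTime(long nowEpochSeconds, long casExpireSeconds, long remainingTtlSeconds) {
    return nowEpochSeconds - casExpireSeconds + remainingTtlSeconds;
  }

  // A worker that first registered after the digest was inserted cannot have
  // written that digest, so it is treated as an imposter for this digest.
  static boolean isImposter(long workerFirstRegisteredAt, long digestInsertTime) {
    return workerFirstRegisteredAt > digestInsertTime;
  }
}
```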

Testing

Deployed the change to our buildfarm staging environment and ran a full monorepo build. To verify that the change solves the terminated-worker problem, terminated a batch of workers in the middle of the build. This caused temporary not_found errors, which eventually faded away as the fmb call auto-corrected blob locations.
[Screenshot 2023-06-21: graph from the first build, during which workers were terminated.]

Future Improvement

The above solution might not work if the user changes the casExpire time between two deployments, since the algorithm for calculating digest_insert_time depends on the casExpire value.

closes #1371

@amishra-u amishra-u requested a review from werkt as a code owner June 20, 2023 06:40
@amishra-u amishra-u changed the title Handle inaccurate information in backplane Handle inaccurate information in backplane for fmb call Jun 20, 2023
@amishra-u amishra-u marked this pull request as draft June 20, 2023 23:19
@amishra-u amishra-u marked this pull request as ready for review June 21, 2023 20:20
// ignore this, the worker will update the backplane eventually
} else if (status.getCode() != Code.DEADLINE_EXCEEDED
&& SHARD_IS_RETRIABLE.test(status)) {
// why not, always
workers.addLast(worker);
} else {
log.log(
configs.getServer().isEnsureOutputsPresent() ? Level.SEVERE : Level.WARNING,
Collaborator
this is not a severe error, even if we're checking for an AC presence via EOP. SEVERE logs are reserved for bugs or consistency failures.

Contributor Author

With the bwob flag enabled, this one failure causes the whole build to fail; I also believe the EOP flag was added to support the bwob feature, so I thought SEVERE might be the better classification.

Collaborator
Build failures because of expected behavior or conditions are still not SEVERE. Please change this and the log elevation in bytestream to remain a warning.

Contributor Author
Reverted log level as per your suggestion.

List<String> workerSet = client.call(jedis -> state.storageWorkers.mget(jedis, workerNames));

return workerSet.stream()
.filter(Objects::nonNull)
Collaborator
why would you get null entries from a workerSet? (also calling it workerSet and being a list is a little confusing)

Contributor Author
@amishra-u amishra-u Jun 22, 2023
Redis HMGET returns null if the field doesn't exist in the hash. Also updated the unit test to cover this case.

https://redis.io/commands/hmget/

For every field that does not exist in the hash, a nil value is returned. Because non-existing keys are treated as empty hashes, running HMGET against a non-existing key will return a list of nil values.

Renamed from workerSet to workerList, thanks!
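
For reference, a minimal sketch of the HMGET behavior described above using the Jedis client; the key name and helper method here are illustrative, not buildfarm's actual backplane code.

```java
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

import redis.clients.jedis.Jedis;

class HmgetNullFilter {
  // HMGET returns one entry per requested field, in order; fields that do not
  // exist in the hash come back as null and are filtered out here.
  static List<String> presentWorkers(Jedis jedis, String workersKey, String... workerNames) {
    List<String> values = jedis.hmget(workersKey, workerNames);
    return values.stream().filter(Objects::nonNull).collect(Collectors.toList());
  }
}
```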

Contributor Author

amishra-u commented Jun 23, 2023

@werkt Can you please take a look again?

I ran the shadow experiment for two days; there were no not_found errors for the read API, and overall there were fewer timeouts.

For us (with 120 workers), the average latency of the FMB call was reduced roughly 15x: 4.5 ms with the change versus 64 ms without it.
[Screenshot 2023-06-23: FMB latency comparison graph.]

@@ -660,7 +661,7 @@ public ListenableFuture<Iterable<Digest>> findMissingBlobs(
// risk of returning expired workers despite filtering by active workers below. This is because
// the strategy may return workers that have expired in the last 30 seconds. However, checking
// workers directly is not a guarantee either since workers could leave the cluster after being
// queried. Ultimitely, it will come down to the client's resiliency if the backplane is
// queried. Ultimately, it will come down to the client's resiliency if the backplane is
// out-of-date and the server lies about which blobs are actually present. We provide this
// alternative strategy for calculating missing blobs.

Collaborator

This needs a refactor, whether before or after landing, to properly partition FMB into backplane and worker segments for clarity. If it's to be after, please follow up with an issue.

Contributor Author

Created a follow-up task; I will work on this next: #1393

Collaborator
@werkt werkt left a comment

Not the biggest fan of passing epoch seconds around instead of Instant, but this seems well compartmentalized. Looking forward to the refactor.

@werkt werkt merged commit ccee33b into bazelbuild:main Jun 28, 2023
2 checks passed
@sbalabanov

A naive question: why don't workers have a unique id changing with each restart so we can uniquely identify them without referring to IP and start time?

Contributor Author

amishra-u commented Jun 29, 2023

A naive question: why don't workers have a unique id changing with each restart so we can uniquely identify them without referring to IP and start time?

That was also an option, but I chose this approach for these reasons:

  1. In our existing setup, whenever a host restarts, it starts with a clean disk. However, it is feasible to have a setup where the disk is persisted during deployment. In such cases, we would need to implement a mechanism to differentiate between old and new hosts.
  2. This approach proactively cleans up outdated entries from Redis, rather than waiting for the TTL to expire.

@amishra-u amishra-u deleted the fmb branch June 29, 2023 04:12