Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Periodically Refresh Active Storage Workers With StartTime #1549

Merged
merged 16 commits into from
Jan 3, 2024

Conversation

amishra-u
Copy link
Contributor

Problem

When the findMissingBlobsViaBackplane flag is enabled, every fmb or getac call (when the ensureOutputPresent flag is set) retrieves the Worker start time. Currently, each getWorkerStartTime triggers a Redis call. As a result, the Redis shard responsible for storing the worker list becomes hot-shard. At peak request this becomes bottleneck and significantly slows down the BuildfarmServer.

Solution

There is already a mechanism to periodically refresh the set of active workers every 3 seconds. Made the changes to include addition worker information on periodic call. Instead of depending on Redis for fetching worker start times, we can retrieve the worker start time from the local map which is refreshed periodically.

Unrelated change: Adjusted log level from Info to Finer to record the correction of the worker list for the digest. It was polluting the log.

@amishra-u
Copy link
Contributor Author

@werkt Can you please take a look.

Comment on lines 596 to 597
.setSeconds(shardWorker.getFirstRegisteredAt() / 1000)
.setNanos((int) ((shardWorker.getFirstRegisteredAt() % 1000) * 1000000))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use unit conversion utilities for these values, rather than encoding magic numbers

Copy link
Contributor Author

@amishra-u amishra-u Dec 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done here and other places

log.log(Level.INFO, format("adjusting locations for the digest %s", digest));
log.log(Level.FINE, format("adjusting locations for the digest %s", digest));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separate this change if you want it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created new PR
#1573

@@ -275,7 +275,7 @@ message ShardWorker {
}

message WorkerChange {
message Add {}
message Add { google.protobuf.Timestamp firstRegisteredAt = 1; }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change this to effectiveAt

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@@ -144,7 +143,8 @@ public class RedisShardBackplane implements Backplane {
private @Nullable RedisClient client = null;

private Deadline storageWorkersDeadline = null;
private final Set<String> storageWorkerSet = Collections.synchronizedSet(new HashSet<>());
private final Map<String, ShardWorker> storageWorkers =
Collections.synchronizedMap(new HashMap<>());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a ConcurrentHashMap for this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 594 to 603
long firstRegisteredSeconds = MILLISECONDS.toSeconds(shardWorker.getFirstRegisteredAt());
int firstRegisteredNanos =
(int)
(MILLISECONDS.toNanos(shardWorker.getFirstRegisteredAt())
- SECONDS.toNanos(firstRegisteredSeconds));
Timestamp registrationTime =
Timestamp.newBuilder()
.setSeconds(firstRegisteredSeconds)
.setNanos(firstRegisteredNanos)
.build();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactor this into a single function that takes a ms count and returns a timestamp. Use the var name effectiveAt, or even better, don't retain this in a variable, and put it directly into the WorkerChange.

return new HashSet<>(storageWorkerSet);
public Set<String> getStorageWorkers() throws IOException {
refreshStorageWorkersIfExpired();
return new HashSet<>(storageWorkers.keySet());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was missing before, but add a comment here that indicates that we're making a copy of the set for return, and not providing access to the shared set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comment on method.

worker -> {
ShardWorker workerInfo = storageWorkers.get(worker);
if (workerInfo != null) {
workerAndStartTime.put(worker, workerInfo.getFirstRegisteredAt() / 1000L);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unit conversion missing here.

@@ -275,7 +275,7 @@ message ShardWorker {
}

message WorkerChange {
message Add {}
message Add { google.protobuf.Timestamp effectiveAt = 1; }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't inline this, give the block the same spacing as everything else.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Formatter complains if I don't make it inline. This might be why all types with a single property are also defined inline in this file.

message DisableScaleInProtectionRequest { string instance_name = 1; }

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, good to know we have linting protection for proto.

Copy link
Collaborator

@werkt werkt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, can you address the comment add since you're in there?

@@ -275,7 +275,7 @@ message ShardWorker {
}

message WorkerChange {
message Add {}
message Add { google.protobuf.Timestamp effectiveAt = 1; }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, good to know we have linting protection for proto.

@amishra-u
Copy link
Contributor Author

amishra-u commented Dec 26, 2023

Looks good, can you address the comment add since you're in there?

Created another PR to update the proto styling. Please merge this one.
#1594

@werkt werkt merged commit e51f1e1 into bazelbuild:main Jan 3, 2024
2 checks passed
@amishra-u amishra-u deleted the redis_hot_shard branch January 3, 2024 21:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants