-
Notifications
You must be signed in to change notification settings - Fork 195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Periodically Refresh Active Storage Workers With StartTime #1549
Conversation
@werkt Can you please take a look. |
.setSeconds(shardWorker.getFirstRegisteredAt() / 1000) | ||
.setNanos((int) ((shardWorker.getFirstRegisteredAt() % 1000) * 1000000)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use unit conversion utilities for these values, rather than encoding magic numbers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done here and other places
log.log(Level.INFO, format("adjusting locations for the digest %s", digest)); | ||
log.log(Level.FINE, format("adjusting locations for the digest %s", digest)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Separate this change if you want it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created new PR
#1573
@@ -275,7 +275,7 @@ message ShardWorker { | |||
} | |||
|
|||
message WorkerChange { | |||
message Add {} | |||
message Add { google.protobuf.Timestamp firstRegisteredAt = 1; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change this to effectiveAt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
@@ -144,7 +143,8 @@ public class RedisShardBackplane implements Backplane { | |||
private @Nullable RedisClient client = null; | |||
|
|||
private Deadline storageWorkersDeadline = null; | |||
private final Set<String> storageWorkerSet = Collections.synchronizedSet(new HashSet<>()); | |||
private final Map<String, ShardWorker> storageWorkers = | |||
Collections.synchronizedMap(new HashMap<>()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use a ConcurrentHashMap for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
long firstRegisteredSeconds = MILLISECONDS.toSeconds(shardWorker.getFirstRegisteredAt()); | ||
int firstRegisteredNanos = | ||
(int) | ||
(MILLISECONDS.toNanos(shardWorker.getFirstRegisteredAt()) | ||
- SECONDS.toNanos(firstRegisteredSeconds)); | ||
Timestamp registrationTime = | ||
Timestamp.newBuilder() | ||
.setSeconds(firstRegisteredSeconds) | ||
.setNanos(firstRegisteredNanos) | ||
.build(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactor this into a single function that takes a ms count and returns a timestamp. Use the var name effectiveAt
, or even better, don't retain this in a variable, and put it directly into the WorkerChange.
return new HashSet<>(storageWorkerSet); | ||
public Set<String> getStorageWorkers() throws IOException { | ||
refreshStorageWorkersIfExpired(); | ||
return new HashSet<>(storageWorkers.keySet()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was missing before, but add a comment here that indicates that we're making a copy of the set for return, and not providing access to the shared set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comment on method.
worker -> { | ||
ShardWorker workerInfo = storageWorkers.get(worker); | ||
if (workerInfo != null) { | ||
workerAndStartTime.put(worker, workerInfo.getFirstRegisteredAt() / 1000L); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unit conversion missing here.
@@ -275,7 +275,7 @@ message ShardWorker { | |||
} | |||
|
|||
message WorkerChange { | |||
message Add {} | |||
message Add { google.protobuf.Timestamp effectiveAt = 1; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't inline this, give the block the same spacing as everything else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Formatter complains if I don't make it inline. This might be why all types with a single property are also defined inline in this file.
message DisableScaleInProtectionRequest { string instance_name = 1; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh, good to know we have linting protection for proto.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, can you address the comment add since you're in there?
@@ -275,7 +275,7 @@ message ShardWorker { | |||
} | |||
|
|||
message WorkerChange { | |||
message Add {} | |||
message Add { google.protobuf.Timestamp effectiveAt = 1; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh, good to know we have linting protection for proto.
Created another PR to update the proto styling. Please merge this one. |
Problem
When the
findMissingBlobsViaBackplane
flag is enabled, every fmb or getac call (when the ensureOutputPresent flag is set) retrieves the Worker start time. Currently, eachgetWorkerStartTime
triggers a Redis call. As a result, the Redis shard responsible for storing the worker list becomes hot-shard. At peak request this becomes bottleneck and significantly slows down the BuildfarmServer.Solution
There is already a mechanism to periodically refresh the set of active workers every 3 seconds. Made the changes to include addition worker information on periodic call. Instead of depending on Redis for fetching worker start times, we can retrieve the worker start time from the local map which is refreshed periodically.
Unrelated change: Adjusted log level from Info to Finer to record the correction of the worker list for the digest. It was polluting the log.