
Add a blob existence cache to RemoteExecutionCache #13166

Closed
wants to merge 1 commit into from

Conversation

Contributor

@ulfjack ulfjack commented Mar 6, 2021

Bazel currently performs a full findMissingBlob call for every input of
every action that is not an immediate cache hit. This causes a
significant amount of network traffic, resulting in long build times on
low-bandwidth, high-latency connections (e.g., working from home).

In particular, for a build of TensorFlow w/ remote execution over a
slow-ish network, we see multi-minute delays on actions due to these
findMissingBlob calls, and the client is not able to saturate a
50-machine remote execution cluster.

This change carefully de-duplicates findMissingBlob calls as well as
file uploads to the remote execution cluster.

It is based on a previous, simpler change that did not de-duplicate
concurrent calls. However, we have found that concurrent calls with
the same inputs are common: for example, an action completes and
unblocks a large number of actions with significant overlap in their
input sets (protoc link completion unblocks all protoc compile
actions), and we were still seeing multi-minute delays on action
execution.

With this change, a TensorFlow build w/ remote execution is down to ~20
minutes on a 50-machine cluster, and is able to fully saturate the
cluster.

We have also seen significant improvements for remote Android app
builds.

Change-Id: Ic39347a7a7a8dc7cfd463d78f0a80e0d26a970bc

@google-cla google-cla bot added the cla: yes label Mar 6, 2021
Progress on bazelbuild#12113.

Change-Id: I2b86076881f99d3d4a3ad8196abf9e2b5f6e0aa7
Contributor Author

ulfjack commented Mar 10, 2021

@coeuvre, can you take a look or find someone else who can?

@coeuvre coeuvre self-requested a review March 11, 2021 02:50
@coeuvre coeuvre added the team-Remote-Exec Issues and PRs for the Execution (Remote) team label Mar 11, 2021
Member

@coeuvre coeuvre left a comment


Thanks for the PR!

futures.add(uploadBlob(context, missingDigest, merkleTree, additionalInputs));
}
} else {
// This code deduplicates findMissingDigests calls as well as file uploads. It works by
Member

@coeuvre coeuvre Mar 11, 2021


Why do we need to distinguish owned and unowned futures?

I think we are doing too much here. Can we have a cache layer between RemoteExecutionCache and RemoteCacheClient that deduplicates the uploads and owns the futures? All futures would be unowned from the perspective of RemoteExecutionCache, e.g.:

class UploadCache {
    ListenableFuture<Void> upload(MerkleTree merkleTree, Map<Digest, Message> additionalInputs);
}

We can also hide the findMissingDigests call inside UploadCache, potentially saving the call entirely if the requested blobs are already uploaded or are in progress.

AsyncTaskCache, which was added recently to fix #12972, can be used to implement the cache.

FYI, we have deduplication logic in ByteStreamUploader, but it isn't enabled since forceUpload is set to true here. That said, that deduplication is only available for the gRPC cache client, whereas this PR allows deduplication for all cache protocols. (Today this only matters for remote execution, which only supports the gRPC protocol.)
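That UploadCache idea could be sketched roughly like this: a minimal, self-contained illustration with hypothetical names (doUpload and timesUploaded are demo stand-ins, and this is not the real AsyncTaskCache API).

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: one shared future per digest, so repeated or concurrent
// requests for the same blob reuse a single in-progress (or finished) upload.
public class UploadCache {
  private final ConcurrentHashMap<String, CompletableFuture<Void>> uploads =
      new ConcurrentHashMap<>();
  private final ConcurrentHashMap<String, Integer> uploadCount =
      new ConcurrentHashMap<>();

  // Stand-in for the real findMissingDigests check plus the actual upload.
  private CompletableFuture<Void> doUpload(String digest) {
    uploadCount.merge(digest, 1, Integer::sum);
    return CompletableFuture.completedFuture(null);
  }

  // Callers get the same future whether the upload is new, in progress, or
  // already finished; the underlying upload runs at most once per digest.
  public CompletableFuture<Void> upload(String digest) {
    return uploads.computeIfAbsent(digest, this::doUpload);
  }

  public int timesUploaded(String digest) {
    return uploadCount.getOrDefault(digest, 0);
  }

  public static void main(String[] args) {
    UploadCache cache = new UploadCache();
    cache.upload("digest-a").join();
    cache.upload("digest-a").join(); // deduplicated: no second upload
    System.out.println(cache.timesUploaded("digest-a")); // prints 1
  }
}
```

The atomicity of computeIfAbsent is what makes concurrent callers converge on a single future per key.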

Contributor


> FYI, we have the deduplication logic at ByteStreamUploader but isn't enabled since the forceUpload is set to true here. That being said, the deduplication is only available for gRPC cache client. This PR allows deduplication for all cache protocols. (This is only available for remote execution which only supports gRPC protocol)

I'm prototyping remote execution support with the HTTP cache protocol as well, because that allows using HTTP load-balancing infrastructure, and I get performance on par with gRPC. I therefore very much appreciate that this PR is generic and also works for HTTP.

Contributor Author


I don't think AsyncTaskCache would be ideal here. The problem is that the findMissingBlobs calls are bulk calls: we have to deduplicate the individual entries, but we still want to perform bulk calls for performance. That requires passing a collection in and out of this class.

I think ByteStreamUploader is not a good place for a cache. The API to check for CAS entries is findMissingBlobs, which is a bulk API, while ByteStreamUploader has a single-file upload API: it can't automatically turn a sequence of single-file uploads into a bulk check. The caching has to live outside ByteStreamUploader and itself needs a bulk API to call findMissingBlobs efficiently. I also don't like that it unconditionally caches knowledge about an upload for the duration of the build: if the CAS loses entries faster than the build completes, Bazel gets confused and fails for no good reason.

The distinction between owned and unowned futures is this: any number of calls can enter this method concurrently, and each call owns some of the digests, but not all of them. We then make one bulk findMissingBlobs call for all owned digests and upload the ones that are missing remotely.

I'm happy to remove the flag and always enable this code.
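The owned/unowned scheme described above might look roughly like this as code. This is a single-threaded sketch with illustrative names (not the actual change), where findMissingAndUpload stands in for the bulk findMissingBlobs call plus the uploads.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch only: concurrent callers share one in-flight future per digest, so
// each digest triggers at most one findMissingBlobs lookup and one upload,
// while each caller still issues a single bulk RPC for the digests it owns.
public class BlobExistenceCache {
  // digest -> future that completes once the blob is known to exist remotely
  private final ConcurrentHashMap<String, CompletableFuture<Void>> inFlight =
      new ConcurrentHashMap<>();
  private int bulkCalls = 0; // instrumentation for this demo only

  // Stand-in for the remote findMissingBlobs + upload round trip.
  private void findMissingAndUpload(Set<String> owned) {
    bulkCalls++; // one bulk RPC covers all owned digests
  }

  public void ensureInputsPresent(Set<String> digests) {
    Set<String> owned = new HashSet<>();          // digests this caller claims
    List<CompletableFuture<Void>> unowned = new ArrayList<>();
    for (String d : digests) {
      CompletableFuture<Void> mine = new CompletableFuture<>();
      CompletableFuture<Void> prev = inFlight.putIfAbsent(d, mine);
      if (prev == null) {
        owned.add(d);       // first caller: responsible for checking/uploading
      } else {
        unowned.add(prev);  // another caller is already handling this digest
      }
    }
    if (!owned.isEmpty()) {
      findMissingAndUpload(owned);                // one bulk call, not N
      for (String d : owned) inFlight.get(d).complete(null);
    }
    unowned.forEach(CompletableFuture::join);     // wait for the other owners
  }

  public int bulkCalls() { return bulkCalls; }

  public static void main(String[] args) {
    BlobExistenceCache cache = new BlobExistenceCache();
    cache.ensureInputsPresent(new HashSet<>(Arrays.asList("a", "b", "c")));
    cache.ensureInputsPresent(new HashSet<>(Arrays.asList("b", "c", "d")));
    // The overlapping digests b and c are not re-checked; the second call
    // owns only "d", so exactly two bulk RPCs happen in total.
    System.out.println(cache.bulkCalls()); // prints 2
  }
}
```

The bulk call per owner is what distinguishes this from a plain per-digest memoization: deduplication is per digest, but the RPCs stay batched.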

Member


Thanks for the clarification. I now agree with you that ByteStreamUploader cannot be used to deduplicate findMissingBlobs calls.

Feel free to ignore: AsyncTaskCache could be used to deduplicate uploads, similar to what you achieved with the ConcurrentHashMap combined with the deduplication logic.

> I'm happy to remove the flag and always enable this code.

Please remove the flag!

Can you please also add test cases for the deduplication?

@@ -132,7 +132,7 @@ public ExecutionResult execute(
     additionalInputs.put(actionDigest, action);
     additionalInputs.put(commandHash, command);

-    remoteCache.ensureInputsPresent(context, merkleTree, additionalInputs);
+    remoteCache.ensureInputsPresent(context, merkleTree, additionalInputs, true);
Member


Why is this set to true here?

help =
"If set to true, Bazel deduplicates calls to the remote service to find missing "
+ "files and also deduplicates uploads.")
public boolean experimentalRemoteDeduplicateUploads;
Member


We have too many flags already. Are there concerns that require introducing this flag and disabling the feature by default? I think people will always want this enabled.
