
Update remote execution API to v2 #178

Merged: 13 commits, Aug 28, 2018
Conversation

@werkt (Collaborator) commented Aug 19, 2018

No description provided.

@ola-rozenfeld left a comment:

I haven't looked through everything yet, but please implement Capabilities. In our service we currently return:

ServerCapabilities {
  deprecated_api_version {
    prerelease: "v1test"
  }
  low_api_version {
    major: 2
  }
  high_api_version {
    major: 2
  }
  execution_capabilities {
    digest_function: SHA256
    exec_enabled: // depends on user
  }
  cache_capabilities {
    digest_function: SHA256
    action_cache_update_capabilities {
      allow_update: // depends on user
    }
    max_batch_total_size_bytes: 10485760  // 10MB
    symlink_absolute_path_strategy: DISALLOWED
  }
}

Since you remove support for v1test here, you should not return deprecated_api_version (we're keeping both until all our clients switch). You probably also want to return some valid cache priority ranges, since you want to support that (it's not as critical for us).
I'm now sending out a Bazel patch that will query the Capabilities service, so it's better to add it right away. I will add the --cache_results_priority flag after that.

@ola-rozenfeld commented:

Also: I screwed up, sorry, you'll need bazelbuild/bazel#5934 as well. Only now discovered I forgot to define the java package in the semver.proto (our Go server didn't complain...).

@werkt (Collaborator, author) commented Aug 20, 2018

Forgive my ignorance, but how are you supporting 10MB batch sizes if the gRPC limit is 4MB? Did you just raise the cap?

@ola-rozenfeld:

No, that's a good question -- I remember looking into this limit further and discovering it wasn't trivial, but I don't remember what I found. I'll get back to you on this! We might be doing it wrong.
In any case, 4MB will work for sure.
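Whatever the server's limit turns out to be, a client that respects it has to pack blobs into batches whose total stays under max_batch_total_size_bytes (gRPC's default message cap is 4 MiB). A minimal sketch of that packing logic, with illustrative names rather than buildfarm's actual API:

```java
import java.util.ArrayList;
import java.util.List;

public class BlobBatcher {
  // Pack blob sizes into batches whose total stays under the limit.
  // A blob larger than the limit cannot be batched at all and must go
  // through the ByteStream API instead.
  public static List<List<Long>> batch(List<Long> blobSizes, long maxBatchTotalSizeBytes) {
    List<List<Long>> batches = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long currentTotal = 0;
    for (long size : blobSizes) {
      if (size > maxBatchTotalSizeBytes) {
        throw new IllegalArgumentException("blob exceeds batch limit; use ByteStream");
      }
      if (currentTotal + size > maxBatchTotalSizeBytes) {
        batches.add(current);
        current = new ArrayList<>();
        currentTotal = 0;
      }
      current.add(size);
      currentTotal += size;
    }
    if (!current.isEmpty()) {
      batches.add(current);
    }
    return batches;
  }
}
```

With a 4 MiB (4,194,304-byte) limit, blobs of 3 MB, 2 MB, and 1 MB pack into two batches: the 3 MB blob alone, then the 2 MB and 1 MB blobs together.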

@ola-rozenfeld left a comment:

Looks good! A few nits:

try {
  action = Action.parseFrom(actionBlob);
} catch (InvalidProtocolBufferException e) {
  Preconditions.checkState(

@ola-rozenfeld:

nit: I think throw new IllegalStateException(INVALID_DIGEST); is clearer.

} else if (!requeueOnFailure) {
  Command command = expectCommand(operation);
  if (command == null) {
    cancelOperation(operation.getName());

@ola-rozenfeld:

Can this happen via a low-probability race, with the command getting evicted right after we validated the action? Or can it not happen at all, because you refresh the blob TTL on access?

@werkt (Collaborator, author):

A low-probability, load-induced race - this is the long tail of the operation queue, where operations may have been sitting for an indeterminate amount of time, waiting for a worker to be ready to execute.

.get()
.getTree(GetTreeRequest.newBuilder()
.setInstanceName(getName())
.setRootDigest(rootDigest)
.setPageSize(pageSize)
.setPageToken(pageToken)
.build());
directories.addAll(response.getDirectoriesList());
return response.getNextPageToken();
// new streaming interface doesn't really fit with what we're trying to do here...

@ola-rozenfeld:

Wait, but this way you're not really implementing the streaming interface as it was intended. I mean, this is valid per spec, but you're not benefiting from the new interface, which wants to stream the entire tree once rather than make the client repeat individual getTree calls, unless a retriable error happens.

You might want to refactor the Instance API to just return the whole tree instead, and build on that to stream it properly in the service. It won't matter in practice if your default page size is sufficient, but will result in a performance boost for small page sizes, and should also make the code simpler, I think. But this could also be done in a separate patch.

@werkt (Collaborator, author):

The effect of the new streaming interface here is actually smaller than you think - I'm benefiting because what used to take multiple calls to getTree with distinct page tokens can now be fed completely, ending with a single empty nextPageToken to complete the sequence. The only call to this method is in worker/operationqueue/Worker.java, where the intended effect was to fetch the entire tree, which this does - and, as you say, it's completely in spec.

@ola-rozenfeld:

> what used to take multiple calls to getTree with distinct page tokens can now be fed completely

Yes, that's what I don't see here -- at which point are you replacing a loop of getTree calls with a single getTree call? The only service implementation I see still calls .onNext() just once before .onCompleted().

@werkt (Collaborator, author):

The only call to this particular function is
src/test/java/build/buildfarm/server/BuildFarmServerTest.java:314
which is unmodified from the previous implementation because the instance interface hasn't changed (it just manifests as a single call in the case of a successful stream, and expects that a non-empty nextPageToken will be returned for any retriable error).

Gotcha regarding the service implementation, yeah I didn't change that to take advantage. Updating now.

  // watcher processed completed state
  return true;
}
if (operation == null || operation.getDone()) {

@ola-rozenfeld:

How can this even happen? I see the watcher returns false for these cases.

@werkt (Collaborator, author):

I preferred to be decoupled from any watcher implementation, in particular one that would be unconcerned with the null-ness or done-ness of the operation it is watching. In those cases, I can do nothing further in this function.

if (operation != null) {
  responseObserver.onNext(operation);
}
if (operation == null || operation.getDone()) {

@ola-rozenfeld:

If you move the operation == null condition into line 37, then you don't need to check for it in the rest of the function and the code gets slightly simpler.

@werkt (Collaborator, author):

But in the null-operation case, I still need to complete the responseObserver stream. Moving it out of the try block here would also remove coverage by the gRPC-induced cancellation/expiration handlers present in the catches.

@ola-rozenfeld:

Oh, sorry, you're right! Disregard.

}

ByteString commandBlob = workerContext.getBlob(action.getCommandDigest());
if (actionBlob == null) {

@ola-rozenfeld:

nit: this should check commandBlob, not actionBlob.

@@ -163,29 +163,16 @@ public void addContent(ByteString content, CASInsertionPolicy policy, Consumer<B
}

private void addFile(Path file, long size, CASInsertionPolicy policy) throws IOException {

@ola-rozenfeld:

Do you still need to pass the size externally? I think DigestUtil should handle computing the digest correctly, regardless of whether the file is a symlink or not.

@werkt (Collaborator, author):

nope, cleaning that up.
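For illustration, computing the digest from the file contents alone (following symlinks) makes the external size parameter unnecessary. This is a stdlib-only sketch; FileDigest is a hypothetical helper, not buildfarm's actual DigestUtil:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FileDigest {
  // Hex-encode the SHA-256 digest of a byte array.
  public static String sha256Hex(byte[] content) throws NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance("SHA-256");
    StringBuilder hex = new StringBuilder();
    for (byte b : md.digest(content)) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString();
  }

  // Files.readAllBytes follows symlinks, so a symlinked file hashes the same
  // as its target, and the size falls out of the content itself.
  public static String sha256HexOfFile(Path file) throws IOException, NoSuchAlgorithmException {
    return sha256Hex(Files.readAllBytes(file));
  }
}
```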

@@ -225,9 +211,9 @@ public void watchOperationAddsWatcher() throws InterruptedException {
watchers.put(operation.getName(), operationWatchers);

Predicate<Operation> watcher = (Predicate<Operation>) mock(Predicate.class);
when(watcher.test(eq(operation))).thenReturn(true);

@ola-rozenfeld:

The only reason to mock a predicate is if you're going to verify it; otherwise you could just use a lambda.

}

private Action createSimpleAction() throws RetryException, InterruptedException {
DigestUtil digestUtil = new DigestUtil(DigestUtil.HashFunction.SHA256);
private Digest createAction(Function<Action, Action> onAction) throws RetryException, InterruptedException {

@ola-rozenfeld:

Consider refactoring to accept the Action builder as a parameter instead, using it as a "starter" in line 332 instead of Action.newBuilder().

@ola-rozenfeld commented:

Oh, and on the 4MB question: yes, it is totally 4MB - I checked, and it fails if I try to stuff more into it. I don't know what made us change it to 10, but this will not work, and we should fix it :-)

String pageToken = request.getPageToken();
do {
  String nextPageToken = instance.getTree(
      request.getRootDigest(), pageSize, pageToken, directories);

@ola-rozenfeld:

You want to clear directories between every onNext. I don't see https://github.com/uber/bazel-buildfarm/blob/working/src/main/java/build/buildfarm/instance/AbstractServerInstance.java#L183 doing that.

Also: while this is better than before, it could be made even more efficient by refactoring instance.getTree to return everything at once instead of repeatedly calling it here. The problem with calling it repeatedly is that you're repeatedly fetching the entire tree, and then slicing out just one portion of it. It's still O(n^2) overall, even though it's all in memory so you could ignore it, I guess.
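The loop being discussed can be sketched independently of gRPC. PageFetcher and the observer callback below are hypothetical stand-ins for Instance::getTree and StreamObserver::onNext; the key point is clearing the page buffer between calls:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class TreeStreamer {
  // Hypothetical page fetcher mirroring Instance::getTree: it fills `out`
  // with one page of directory names and returns the next page token
  // ("" once the tree is exhausted).
  public interface PageFetcher {
    String getTree(String pageToken, List<String> out);
  }

  // Stream every page to the observer, clearing the page buffer between
  // calls so no directory is emitted twice.
  public static void streamTree(PageFetcher fetcher, Consumer<List<String>> onNext) {
    List<String> page = new ArrayList<>();
    String pageToken = "";
    do {
      page.clear();  // reset between pages -- the bug being pointed out
      pageToken = fetcher.getTree(pageToken, page);
      onNext.accept(new ArrayList<>(page));
    } while (!pageToken.isEmpty());
  }
}
```

Without the clear() between iterations, each onNext would resend every directory already streamed in earlier pages.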

@werkt (Collaborator, author):

Missed the directories builder reset, thanks.

You should take another look at my implementation of things here - I'm not refetching the entire tree on each call to MemoryInstance::getTree - my page token contains a path hierarchy of digests and my restarts are logarithmic.

@ola-rozenfeld:

Oh, you're right, I should have looked closer at TreeIterator.java. Wow. OK, that works too. What we do is our token is a simple offset and we just fetch/stream everything. This makes the restart case slow, sure, but we're fine with that, because the general case is much, much simpler.
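The simple-offset token scheme described here can be sketched as follows (illustrative, not the actual service code). A restart re-fetches and re-flattens the whole tree and then skips to the decoded offset, which is slow on retry but keeps the common path simple:

```java
import java.util.List;

public class OffsetPager {
  // Serve one page of an already-flattened tree using a numeric-offset
  // page token. The cost of a restart lives in re-producing
  // allDirectories, outside this function; resuming is just a skip.
  public static String getPage(List<String> allDirectories, int pageSize,
                               String pageToken, List<String> out) {
    int offset = pageToken.isEmpty() ? 0 : Integer.parseInt(pageToken);
    int end = Math.min(offset + pageSize, allDirectories.size());
    out.addAll(allDirectories.subList(offset, end));
    // Empty token signals the final page, matching the getTree convention.
    return end < allDirectories.size() ? Integer.toString(end) : "";
  }
}
```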

@werkt werkt merged commit 6fa5e73 into bazelbuild:master Aug 28, 2018
@werkt werkt deleted the remote-execution-v2 branch August 28, 2018 10:37