
Update remote execution API to v2 #178

Merged: 13 commits, Aug 28, 2018
Conversation

@werkt (Collaborator) commented Aug 19, 2018

No description provided.

@ola-rozenfeld left a comment:

I haven't looked through everything yet, but please implement Capabilities. In our service we currently return:

ServerCapabilities {
  deprecated_api_version {
    prerelease: "v1test"
  }
  low_api_version {
    major: 2
  }
  high_api_version {
    major: 2
  }
  execution_capabilities {
    digest_function: SHA256
    exec_enabled: // depends on user
  }
  cache_capabilities {
    digest_function: SHA256
    action_cache_update_capabilities {
      allow_update: // depends on user
    }
    max_batch_total_size_bytes: 10485760  // 10MB
    symlink_absolute_path_strategy: DISALLOWED
  }
}

Since you remove support for v1test here, you should not return deprecated_api_version (we're keeping both until all our clients switch). You probably also want to return some valid cache priority ranges, since you want to support that (it's not as critical for us).
I'm now sending out a Bazel patch that will query the Capabilities service, so it's better to add it right away. I will add the --cache_results_priority flag after that.

@ola-rozenfeld commented:

Also: I screwed up, sorry, you'll need bazelbuild/bazel#5934 as well. Only now discovered I forgot to define the java package in the semver.proto (our Go server didn't complain...).

@werkt (Collaborator, author) commented Aug 20, 2018

Forgive my ignorance, but how are you supporting 10MB batch sizes if the gRPC limit is 4MB? Did you just raise the cap?

@ola-rozenfeld:

No, that's a good question -- I remember looking into this limit further and discovering it wasn't trivial, but I don't remember what I found. I'll get back to you on this! We might be doing it wrong.
In any case, 4MB will work for sure.
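Whatever the server's limit turns out to be, a client that respects it has to pack blobs into batches whose total stays under max_batch_total_size_bytes (gRPC's default message cap is 4 MiB). A minimal sketch of that packing logic, with illustrative names rather than buildfarm's actual API:

```java
import java.util.ArrayList;
import java.util.List;

public class BlobBatcher {
  // Pack blob sizes into batches whose total stays under the limit.
  // A blob larger than the limit cannot be batched at all and must go
  // through the ByteStream API instead.
  public static List<List<Long>> batch(List<Long> blobSizes, long maxBatchTotalSizeBytes) {
    List<List<Long>> batches = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long currentTotal = 0;
    for (long size : blobSizes) {
      if (size > maxBatchTotalSizeBytes) {
        throw new IllegalArgumentException("blob exceeds batch limit; use ByteStream");
      }
      if (currentTotal + size > maxBatchTotalSizeBytes) {
        batches.add(current);
        current = new ArrayList<>();
        currentTotal = 0;
      }
      current.add(size);
      currentTotal += size;
    }
    if (!current.isEmpty()) {
      batches.add(current);
    }
    return batches;
  }
}
```

With a 4 MiB (4,194,304-byte) limit, blobs of 3 MB, 2 MB, and 1 MB pack into two batches: the 3 MB blob alone, then the 2 MB and 1 MB blobs together.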

@ola-rozenfeld left a comment:

Looks good! A few nits:

try {
  action = Action.parseFrom(actionBlob);
} catch (InvalidProtocolBufferException e) {
  Preconditions.checkState(

@ola-rozenfeld:

nit: I think throw new IllegalStateException(INVALID_DIGEST); is clearer.

} else if (!requeueOnFailure) {
  Command command = expectCommand(operation);
  if (command == null) {
    cancelOperation(operation.getName());

@ola-rozenfeld:

Can this happen via a low-probability race, with the command getting evicted right after we validated the action? Or can it not happen at all, because you refresh the blob TTL on access?

@werkt (Collaborator, author):

A low-probability, load-induced race - this is the long tail of the operation queue, where operations may have been sitting for an indeterminate amount of time, waiting for a worker to be ready to execute.

.get()
.getTree(GetTreeRequest.newBuilder()
.setInstanceName(getName())
.setRootDigest(rootDigest)
.setPageSize(pageSize)
.setPageToken(pageToken)
.build());
directories.addAll(response.getDirectoriesList());
return response.getNextPageToken();
// new streaming interface doesn't really fit with what we're trying to do here...

@ola-rozenfeld:

Wait, but this way you're not really implementing the streaming interface as it was intended. I mean, this is valid per spec, but you're not benefiting from the new interface, which wants to stream the entire tree once rather than make the client repeat individual getTree calls, unless a retriable error happens.

You might want to refactor the Instance API to just return the whole tree instead, and build on that to stream it properly in the service. It won't matter in practice if your default page size is sufficient, but will result in a performance boost for small page sizes, and should also make the code simpler, I think. But this could also be done in a separate patch.

@werkt (Collaborator, author):

The effect of the new streaming interface here is actually smaller than you think - I'm benefiting because what used to take multiple calls to getTree with distinct page tokens can now be fed completely, ending with a single empty nextPageToken to complete the sequence. The only call to this method is in worker/operationqueue/Worker.java, where the intended effect was to fetch the entire tree, which this does - and, as you say, it's completely in spec.

@ola-rozenfeld:

> what used to take multiple calls to getTree with distinct page tokens can now be fed completely

Yes, that's what I don't see here -- at which point are you replacing a loop of getTree calls with a single getTree call? The only service implementation I see still calls .onNext() just once before .onCompleted().

@werkt (Collaborator, author):

The only call to this particular function is
src/test/java/build/buildfarm/server/BuildFarmServerTest.java:314
which is unmodified from the previous implementation because the instance interface hasn't changed (it just manifests as a single call in the case of a successful stream, and expects that a non-empty nextPageToken will be returned for any retriable error).

Gotcha regarding the service implementation, yeah I didn't change that to take advantage. Updating now.

  // watcher processed completed state
  return true;
}
if (operation == null || operation.getDone()) {

@ola-rozenfeld:

How can this even happen? I see the watcher returns false for these cases.

@werkt (Collaborator, author):

I preferred to be decoupled from any watcher implementation, in particular one that would be unconcerned with the null-ness or done-ness of the operation it is watching. In those cases, I can do nothing further in this function.

if (operation != null) {
  responseObserver.onNext(operation);
}
if (operation == null || operation.getDone()) {

@ola-rozenfeld:

If you move the operation == null condition into line 37, then you don't need to check for it in the rest of the function and the code gets slightly simpler.

@werkt (Collaborator, author):

But in the null-operation case, I still need to complete the responseObserver stream. Moving it out of the try block here would also remove coverage by the gRPC-induced cancellation/expiration handlers present in the catches.

@ola-rozenfeld:

Oh, sorry, you're right! Disregard.

}

ByteString commandBlob = workerContext.getBlob(action.getCommandDigest());
if (actionBlob == null) {

@ola-rozenfeld:

nit: this should check commandBlob, not actionBlob.

@@ -163,29 +163,16 @@ public void addContent(ByteString content, CASInsertionPolicy policy, Consumer<B
}

private void addFile(Path file, long size, CASInsertionPolicy policy) throws IOException {

@ola-rozenfeld:

Do you still need to pass the size externally? I think DigestUtil should handle computing the digest correctly, regardless of whether the file is a symlink or not.

@werkt (Collaborator, author):

nope, cleaning that up.
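For illustration, computing the digest from the file contents alone (following symlinks) makes the external size parameter unnecessary. This is a stdlib-only sketch; FileDigest is a hypothetical helper, not buildfarm's actual DigestUtil:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FileDigest {
  // Hex-encode the SHA-256 digest of a byte array.
  public static String sha256Hex(byte[] content) throws NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance("SHA-256");
    StringBuilder hex = new StringBuilder();
    for (byte b : md.digest(content)) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString();
  }

  // Files.readAllBytes follows symlinks, so a symlinked file hashes the same
  // as its target, and the size falls out of the content itself.
  public static String sha256HexOfFile(Path file) throws IOException, NoSuchAlgorithmException {
    return sha256Hex(Files.readAllBytes(file));
  }
}
```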

@@ -225,9 +211,9 @@ public void watchOperationAddsWatcher() throws InterruptedException {
watchers.put(operation.getName(), operationWatchers);

Predicate<Operation> watcher = (Predicate<Operation>) mock(Predicate.class);
when(watcher.test(eq(operation))).thenReturn(true);

@ola-rozenfeld:

The only reason to mock a predicate is if you're going to verify it; otherwise you could just use a lambda.

}

private Action createSimpleAction() throws RetryException, InterruptedException {
DigestUtil digestUtil = new DigestUtil(DigestUtil.HashFunction.SHA256);
private Digest createAction(Function<Action, Action> onAction) throws RetryException, InterruptedException {

@ola-rozenfeld:

Consider refactoring to accept the Action builder as a parameter instead, using it as a "starter" in line 332 instead of Action.newBuilder().

@ola-rozenfeld commented:

Oh, and on the 4MB question: yes, it is totally 4MB - I checked, and it fails if I try to stuff more into it. I don't know what made us change it to 10, but this will not work, and we should fix it :-)

String pageToken = request.getPageToken();
do {
  String nextPageToken = instance.getTree(
      request.getRootDigest(), pageSize, pageToken, directories);

@ola-rozenfeld:

You want to clear directories between every onNext. I don't see https://github.com/uber/bazel-buildfarm/blob/working/src/main/java/build/buildfarm/instance/AbstractServerInstance.java#L183 doing that.

Also: while this is better than before, it could be made even more efficient by refactoring instance.getTree to return everything at once instead of repeatedly calling it here. The problem with calling it repeatedly is that you're repeatedly fetching the entire tree, and then slicing out just one portion of it. It's still O(n^2) overall, even though it's all in memory so you could ignore it, I guess.
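The loop being discussed can be sketched independently of gRPC. PageFetcher and the observer callback below are hypothetical stand-ins for Instance::getTree and StreamObserver::onNext; the key point is clearing the page buffer between calls:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class TreeStreamer {
  // Hypothetical page fetcher mirroring Instance::getTree: it fills `out`
  // with one page of directory names and returns the next page token
  // ("" once the tree is exhausted).
  public interface PageFetcher {
    String getTree(String pageToken, List<String> out);
  }

  // Stream every page to the observer, clearing the page buffer between
  // calls so no directory is emitted twice.
  public static void streamTree(PageFetcher fetcher, Consumer<List<String>> onNext) {
    List<String> page = new ArrayList<>();
    String pageToken = "";
    do {
      page.clear();  // reset between pages -- the bug being pointed out
      pageToken = fetcher.getTree(pageToken, page);
      onNext.accept(new ArrayList<>(page));
    } while (!pageToken.isEmpty());
  }
}
```

Without the clear() between iterations, each onNext would resend every directory already streamed in earlier pages.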

@werkt (Collaborator, author):

Missed the directories builder reset, thanks.

You should take another look at my implementation of things here - I'm not refetching the entire tree on each call to MemoryInstance::getTree - my page token contains a path hierarchy of digests and my restarts are logarithmic.

@ola-rozenfeld:

Oh, you're right, I should have looked closer at TreeIterator.java. Wow. OK, that works too. What we do is our token is a simple offset and we just fetch/stream everything. This makes the restart case slow, sure, but we're fine with that, because the general case is much, much simpler.
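The simple-offset token scheme described here can be sketched as follows (illustrative, not the actual service code). A restart re-fetches and re-flattens the whole tree and then skips to the decoded offset, which is slow on retry but keeps the common path simple:

```java
import java.util.List;

public class OffsetPager {
  // Serve one page of an already-flattened tree using a numeric-offset
  // page token. The cost of a restart lives in re-producing
  // allDirectories, outside this function; resuming is just a skip.
  public static String getPage(List<String> allDirectories, int pageSize,
                               String pageToken, List<String> out) {
    int offset = pageToken.isEmpty() ? 0 : Integer.parseInt(pageToken);
    int end = Math.min(offset + pageSize, allDirectories.size());
    out.addAll(allDirectories.subList(offset, end));
    // Empty token signals the final page, matching the getTree convention.
    return end < allDirectories.size() ? Integer.toString(end) : "";
  }
}
```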

@werkt werkt merged commit 6fa5e73 into bazelbuild:master Aug 28, 2018
@werkt werkt deleted the remote-execution-v2 branch August 28, 2018 10:37