Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specify that stderr and stdout streams are resource names #116

Merged
merged 1 commit into from
May 4, 2020

Conversation

edbaunton
Copy link
Collaborator

In lieu of a more fully fledged service discovery layer, explicitly specify that the stderr and stdout stream names are in fact URIs from which one can Bytestream.Read.

@googlebot googlebot added the cla: yes Pull requests whose authors are covered by a CLA with Google. label Feb 11, 2020
@EricBurnett
Copy link
Collaborator

EricBurnett commented Feb 11, 2020

Could we go further and give an example? In particular, I'm interested in what protocol we use, and how things are formatted. I believe the scheme Bazel uses for bytestreams listed in the Build Event Stream is:

bytestream://endpoint/resource

So for example for a RBE bytestream, we'd use

bytestream://remotebuildexecution.googleapis.com/projects/foo/instances/bar/logstreams/00aabb...

Would there be any objections to using the exact same scheme here? In particular, the protocol is 'bytestream', and the API version is not included in the URI (there's no /v1/ component), only protocol/endpoint/resource are concretely specified.

Note that the resource itself is arbitrary - everyone is welcome to use any path resource they wish, no restrictions on components or use of logstreams or anything like that. It just must be readable from the specified endpoint, using the ByteStream protocol.

We could also leave the protocol off the URI, since it's listed in the API as being a bytestream URI anyways - that'd look I guess like //endpoint/resource or endpoint/resource. I feel less strongly here, but suspect that if we're including the endpoint in the URI, including 'bytestream:' at the beginning is reasonable too.

@edbaunton
Copy link
Collaborator Author

edbaunton commented Feb 11, 2020

I am not particularly experienced with gRPC, so please bear with me.

I definitely like the idea of having the bytestream:// part be explicit however I am curious to know how compatible that would be with something like an http proxy. I know this interoprerability hasn't been discussed anywhere else, so possibly out of scope for remote-apis in general (and up to the consumer to transform the URI). However, remote_execution does have the google.api.http specifications already.

My idea would be that someone could execute a remote command with an RE client then hand off to curl to stream stdout_stream_name and display to the user. This would require the http gateway (which would not understand bytestream:// presumably).

Now, Bytestream doesn't actually support that at all because it is missing the option (google.api.http) options from the proto; but if it did, it would be great if it would just be a case of replacing the protocol from bytestream:// to http://.

So, I would suggest that the worker is allowed to return any protocol (http, https, grpc, etc.) and it be up to the client to do its best to understand it (knowing that it will at least be bytestream).

Additionally, how does this play with the stripping of v1 from the endpoint? Is that a standard thing? Does the user have to add it back?

@EricBurnett
Copy link
Collaborator

Roughly speaking, there's two directions you could go with a URI/URL field:

  1. Return a URL - fully specified, including how the resource must be accessed. So if the worker passes back a HTTPS URL, then it must be accessed by HTTPS, in the way curl would.
  2. Return a URI - abstract specification of a bytestream resource, but up to you how you call it.

The first one is not really viable though, for reasons which I'll give you a longwinded explanation :). So it really has to be an abstract URI, and then the question is mostly just details of formatting, where "what bazel already standardized" is a pretty good choice.

The URI I listed above is an example of (2), the abstract specification - it tells you of a bytestream resource and the endpoint it's available from, but it's up to you how to call it. In practice you'd probably translate this URI to a gRPC request (details below), but you could in theory access it over REST instead. That does come with the cost that you have to translate the URI before you can use it, but that's actually going to be true for gRPC anyways so only really a cost if you'd be returning HTTPS URLs that would be accessed by HTTPS.

As for using one of these URIs via gRPC, it breaks down as (approximately):

  1. Unpack the URI into protocol / endpoint / resource
  2. Establish a gRPC connection to <endpoint>
  3. Populate a ReadRequest proto with the resource specified.
  4. Call ByteStream.Read() with the given ReadRequest.

I was mistaken in mentioning V1 - ByteStream as currently defined is a versionless protocol, so there's no version to include anywhere. But in the abstract, if you imagined there being a V2 of ByteStream, where this would come in is as a client you'd have to decide if you were calling ByteStream V1 or ByteStream V2, and link in the appropriate gRPC library. When you went to call it, it'd be adding the appropriate version to the underlying HTTP URL under the hood.

In practice, clients are usually bound to one specific version of a protocol, so whether a client could talk to a server would be a question of whether they speak compatible versions of the protocol. That's independent of the resources being referenced. But as I said, academic for ByteStream - it's a versionless API with no plans to ever have a V2, so there should be no cases where the client and server ever can speak incompatible versions.

I should note that bytestream://whatever assumes that your client will talk to the server using a compatible protocol, and doesn't specify which one(s). But I think that's a reasonable assumption - it's a safe bet that the resource will be available over gRPC; if you have a client that wants to access that resource over HTTP/REST instead (or something else), it's up to you to ensure that the server also makes it available over HTTP/REST. That's not too different in practice from e.g. ensuring your client is authorized to access it from that endpoint, which is required independent of protocol.


By contrast then, going back the idea of (1) - a fully-specified URL that might be gRPC or HTTP - then means the resource can be only accessed over one specific protocol, and if you want to access it over something else you either can't, or have to translate one URL into a different URL which is a more hacky solution and requires essentially the same assumption. It also has a glaring flaw: grpc URLs are not well defined. I mean, gRPC when transported over HTTPS is defined to have a URL like https://endpoint/google.bytestream.ByteStream/Read, but the resource itself is encoded as part of the request message in the payload, not the URL.

REST translations do have a well-defined URL, but as you pointed out, the ByteStream API doesn't actually embed a REST translation. So we'd have to define our own URLs, but then we're really just back to the "abstract" pattern except trying to embed both the fact that it should be accessed through ByteStream and which specific protocol (gRPC or HTTP or other) that should be done through - which to me is not actually better than just specifying the resource.

@edbaunton
Copy link
Collaborator Author

Ah ha, thank you for the explanation!

If the client has to deconstruct the URL in some cases to determine the components and the "endpoint" is likely to be the already existing configured bytestream service: should we just leave this as a resource name and allow an optionally specified endpoint addition to override the "default" Bytestream service being used elsewhere in the spec.

The bytestream:// prefix seems redundant if it will always be a Bytestream service.

@EricBurnett
Copy link
Collaborator

I'd also be fine defining it to be a resource available on the same endpoint as the CAS Bytestream interface, and so only the relative component. That's how we'd use it, at any rate (for us, projects/foo/instances/bar/logstreams/00aabb or similar.) I remember there being some objection to that assumption though in the monthly meeting? I think from @sstriker, though I may be mis-remembering.

I guess specifying that it may be either

`resource` - implied to be a complete resource path relative to the same endpoint as the CAS Bytestream

or

`//endpoint/resource` - a fully specified resource path with endpoint.

would serve that purpose, though I'd want to confirm there's anyone expecting to use the endpoint form before we commit to having two variants specified.

@edbaunton
Copy link
Collaborator Author

I think it was me who wanted to have a different endpoint :) I understand that not everyone would need that, so separating them into two components doesn't change the spec significantly. The implication being that the resource is available at the same place as your already configured CAS.

@sstriker
Copy link
Collaborator

sstriker commented Feb 21, 2020 via email

@EricBurnett
Copy link
Collaborator

Ok, thanks both!

Ed: ahh, apologies! I have a terrible memory sometimes. Can you clarify if you're voting for

  1. 'resource'
  2. resource' OR '//endpoint/resource'
  3. '//endpoint/resource'

I'm having trouble parsing from your response which you'd currently prefer to standardize.

Sander: I think one risk here is that if we standardize both 'resource' and '//endpoint/resource', but most clients only end up seeing 'resource' paths in practice, that's all they'll support in practice until updated to know the other form explicitly. Not sure how you feel about that.

@sstriker
Copy link
Collaborator

sstriker commented Feb 21, 2020 via email

@moroten
Copy link
Contributor

moroten commented Feb 22, 2020 via email

@EricBurnett
Copy link
Collaborator

@moroten : The current API only covers stdout and stderr, but in principle it could be extended to more, and if it were I'm comfortable saying they'd be expressed in a consistent way as these ones. I don't know of anyone working on the spec changes for that at present though, as we'd need to discuss how clients request files to be streamed back and whatnot.

@sstriker : The current API doesn't specify whether Services are hosted together or separately, and in principle they could all be different endpoints. No client I'm aware of has that flexibility at present however - bazel can currently support only two iirc, and we ourselves always set them to the same value.

I wonder if that suggests another path forward here: specify in this change that all URIs are relative form (resource), relative to "the endpoint that hosts streamed responses". And leave the endpoint entirely unspecified. That's not actually different than e.g. how the Execution API returns Digests but doesn't tell you the endpoint upon which the CAS can be found. So in practice, if anyone chooses to implement client(s) where the streaming resources are on a different endpoint as CAS resources, that's allowable at the spec level.

I realize it entirely sidesteps the question of what we expect, but I think that's in part because expectations vary. On the API side, it's possible today to host every API on a different endpoint; on the client side, bazel at least expects use of only one or two, and other clients have potentially different expectations but probably not full variability.

@edbaunton : does that also satisfy your use-case? Relative resources, but relative to an endpoint specified out-of-band and explicitly allowed to be something other than the CAS endpoint?


And then, in terms of cleaning up the broader endpoint confusion so client expectations can be made more consistent with each other and with services, there are two directions we could go as a community. One is to say that clients should expect all APIs available at a single endpoint, period. The other is to define a clear set of different 'virtual' endpoints (including the concrete Services themselves, like the Execution service, plus 'CAS blob bytestreams', 'streamed bytestreams', etc), specify how they're bucketed in terms of which are allowed to vary independently and which should be co-hosted, and recommend a consistent way to configure clients and/or let them discover it (via Capabilities probably).

To me that feels worthwhile to settle, but also a big enough thing we should probably make it its own bug and target it at V3. WDYT?

@illicitonion
Copy link
Contributor

@sstriker : The current API doesn't specify whether Services are hosted together or separately, and in principle they could all be different endpoints. No client I'm aware of has that flexibility at present however - bazel can currently support only two iirc, and we ourselves always set them to the same value.

FWIW, currently Pants allows specifying the CAS and Execution servers differently, and Scoot, Twitter's server implementation, currently assumes they will be set distinctly.

@EricBurnett
Copy link
Collaborator

@illicitonion Scoot expecting them to be different is interesting, but otherwise I think that fits the idea that we should hash out what are and independently variable outside the scope of this PR. Clearly there are interesting and different constraints to collect!

@edbaunton : if you don't have any concerns, mind updating the PR in line with the suggestion above? Copied for convenience:

specify in this change that all URIs are relative form (bare resource), relative to "the endpoint that hosts streamed responses".

In lieu of a more fully fledged service discovery layer, explicitly specify
that the stderr and stdout stream names are in fact resource names from which
one can Bytestream.Read. The actual endpoints are to be configured out of band.

Signed-off-by: Ed Baunton <ebaunton1@bloomberg.net>
@EricBurnett
Copy link
Collaborator

Thanks Ed! LGTM from me.

@edbaunton edbaunton changed the title Specify that stderr and stdout streams are URIs Specify that stderr and stdout streams are resource names Mar 12, 2020
@EricBurnett
Copy link
Collaborator

I've pulled out the discussion of "what to do in V3?" into its own bug: #132

@sstriker sstriker merged commit 0b2ba96 into bazelbuild:master May 4, 2020
@edbaunton edbaunton deleted the patch-3 branch May 4, 2020 12:35
santigl pushed a commit to santigl/remote-apis that referenced this pull request Aug 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes Pull requests whose authors are covered by a CLA with Google.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants