Specify that stderr and stdout streams are resource names #116

edbaunton · 2020-02-11T15:51:38Z

In lieu of a more fully fledged service discovery layer, explicitly specify that the stderr and stdout stream names are in fact URIs from which one can Bytestream.Read.

EricBurnett · 2020-02-11T16:24:36Z

Could we go further and give an example? In particular, I'm interested in what protocol we use, and how things are formatted. I believe the scheme Bazel uses for bytestreams listed in the Build Event Stream is:

bytestream://endpoint/resource

So for example for a RBE bytestream, we'd use

bytestream://remotebuildexecution.googleapis.com/projects/foo/instances/bar/logstreams/00aabb...

Would there be any objections to using the exact same scheme here? In particular, the protocol is 'bytestream', and the API version is not included in the URI (there's no /v1/ component), only protocol/endpoint/resource are concretely specified.

Note that the resource itself is arbitrary - everyone is welcome to use any path resource they wish, no restrictions on components or use of logstreams or anything like that. It just must be readable from the specified endpoint, using the ByteStream protocol.

We could also leave the protocol off the URI, since it's listed in the API as being a bytestream URI anyways - that'd look I guess like //endpoint/resource or endpoint/resource. I feel less strongly here, but suspect that if we're including the endpoint in the URI, including 'bytestream:' at the beginning is reasonable too.

edbaunton · 2020-02-11T18:38:21Z

I am not particularly experienced with gRPC, so please bear with me.

I definitely like the idea of having the bytestream:// part be explicit however I am curious to know how compatible that would be with something like an http proxy. I know this interoprerability hasn't been discussed anywhere else, so possibly out of scope for remote-apis in general (and up to the consumer to transform the URI). However, remote_execution does have the google.api.http specifications already.

My idea would be that someone could execute a remote command with an RE client then hand off to curl to stream stdout_stream_name and display to the user. This would require the http gateway (which would not understand bytestream:// presumably).

Now, Bytestream doesn't actually support that at all because it is missing the option (google.api.http) options from the proto; but if it did, it would be great if it would just be a case of replacing the protocol from bytestream:// to http://.

So, I would suggest that the worker is allowed to return any protocol (http, https, grpc, etc.) and it be up to the client to do its best to understand it (knowing that it will at least be bytestream).

Additionally, how does this play with the stripping of v1 from the endpoint? Is that a standard thing? Does the user have to add it back?

EricBurnett · 2020-02-11T20:45:16Z

Roughly speaking, there's two directions you could go with a URI/URL field:

Return a URL - fully specified, including how the resource must be accessed. So if the worker passes back a HTTPS URL, then it must be accessed by HTTPS, in the way curl would.
Return a URI - abstract specification of a bytestream resource, but up to you how you call it.

The first one is not really viable though, for reasons which I'll give you a longwinded explanation :). So it really has to be an abstract URI, and then the question is mostly just details of formatting, where "what bazel already standardized" is a pretty good choice.

The URI I listed above is an example of (2), the abstract specification - it tells you of a bytestream resource and the endpoint it's available from, but it's up to you how to call it. In practice you'd probably translate this URI to a gRPC request (details below), but you could in theory access it over REST instead. That does come with the cost that you have to translate the URI before you can use it, but that's actually going to be true for gRPC anyways so only really a cost if you'd be returning HTTPS URLs that would be accessed by HTTPS.

As for using one of these URIs via gRPC, it breaks down as (approximately):

Unpack the URI into protocol / endpoint / resource
Establish a gRPC connection to <endpoint>
Populate a ReadRequest proto with the resource specified.
Call ByteStream.Read() with the given ReadRequest.

I was mistaken in mentioning V1 - ByteStream as currently defined is a versionless protocol, so there's no version to include anywhere. But in the abstract, if you imagined there being a V2 of ByteStream, where this would come in is as a client you'd have to decide if you were calling ByteStream V1 or ByteStream V2, and link in the appropriate gRPC library. When you went to call it, it'd be adding the appropriate version to the underlying HTTP URL under the hood.

In practice, clients are usually bound to one specific version of a protocol, so whether a client could talk to a server would be a question of whether they speak compatible versions of the protocol. That's independent of the resources being referenced. But as I said, academic for ByteStream - it's a versionless API with no plans to ever have a V2, so there should be no cases where the client and server ever can speak incompatible versions.

I should note that bytestream://whatever assumes that your client will talk to the server using a compatible protocol, and doesn't specify which one(s). But I think that's a reasonable assumption - it's a safe bet that the resource will be available over gRPC; if you have a client that wants to access that resource over HTTP/REST instead (or something else), it's up to you to ensure that the server also makes it available over HTTP/REST. That's not too different in practice from e.g. ensuring your client is authorized to access it from that endpoint, which is required independent of protocol.

By contrast then, going back the idea of (1) - a fully-specified URL that might be gRPC or HTTP - then means the resource can be only accessed over one specific protocol, and if you want to access it over something else you either can't, or have to translate one URL into a different URL which is a more hacky solution and requires essentially the same assumption. It also has a glaring flaw: grpc URLs are not well defined. I mean, gRPC when transported over HTTPS is defined to have a URL like https://endpoint/google.bytestream.ByteStream/Read, but the resource itself is encoded as part of the request message in the payload, not the URL.

REST translations do have a well-defined URL, but as you pointed out, the ByteStream API doesn't actually embed a REST translation. So we'd have to define our own URLs, but then we're really just back to the "abstract" pattern except trying to embed both the fact that it should be accessed through ByteStream and which specific protocol (gRPC or HTTP or other) that should be done through - which to me is not actually better than just specifying the resource.

edbaunton · 2020-02-13T23:10:34Z

Ah ha, thank you for the explanation!

If the client has to deconstruct the URL in some cases to determine the components and the "endpoint" is likely to be the already existing configured bytestream service: should we just leave this as a resource name and allow an optionally specified endpoint addition to override the "default" Bytestream service being used elsewhere in the spec.

The bytestream:// prefix seems redundant if it will always be a Bytestream service.

EricBurnett · 2020-02-21T18:57:34Z

I'd also be fine defining it to be a resource available on the same endpoint as the CAS Bytestream interface, and so only the relative component. That's how we'd use it, at any rate (for us, projects/foo/instances/bar/logstreams/00aabb or similar.) I remember there being some objection to that assumption though in the monthly meeting? I think from @sstriker, though I may be mis-remembering.

I guess specifying that it may be either

`resource` - implied to be a complete resource path relative to the same endpoint as the CAS Bytestream

or

`//endpoint/resource` - a fully specified resource path with endpoint.

would serve that purpose, though I'd want to confirm there's anyone expecting to use the endpoint form before we commit to having two variants specified.

edbaunton · 2020-02-21T19:17:02Z

I think it was me who wanted to have a different endpoint :) I understand that not everyone would need that, so separating them into two components doesn't change the spec significantly. The implication being that the resource is available at the same place as your already configured CAS.

sstriker · 2020-02-21T21:09:27Z

@eric Burnett <ericburnett@google.com> I wanted to clarify the language so that clients know what level of configurability they should support - if CAS, Execute, ActionCache and Remote Asset Fetch/Push and potentially log streams can live on different endpoints, they would need to be able to be configured. Otherwise servers that implement these services to live on separate endpoints won't be able to be used with clients that expect everything on the same endpoint. In short, I wanted to remove the ambiguity around service endpoints.

…

On Fri, Feb 21, 2020 at 7:57 PM Eric Burnett ***@***.***> wrote: I'd also be fine defining it to be a resource available on the same endpoint as the CAS Bytestream interface, and so only the relative component. That's how we'd use it, at any rate (for us, projects/foo/instances/bar/logstreams/00aabb or similar.) I remember there being some objection to that assumption though in the monthly meeting? I think from @sstriker <https://github.com/sstriker>, though I may be mis-remembering. I guess specifying that it may be either `resource` - implied to be a complete resource path relative to the same endpoint as the CAS Bytestream or `//endpoint/resource` - a fully specified resource path with endpoint. would serve that purpose, though I'd want to confirm there's anyone expecting to use the endpoint form before we commit to having *two* variants specified. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#116?email_source=notifications&email_token=AAQVXZ7T4WAVYWERXGXK6M3REAPZ5A5CNFSM4KTDMNP2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEMTXGUY#issuecomment-589787987>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAQVXZ5PF2JMN3FCBC6DL2TREAPZ5ANCNFSM4KTDMNPQ> .

EricBurnett · 2020-02-21T21:22:38Z

Ok, thanks both!

Ed: ahh, apologies! I have a terrible memory sometimes. Can you clarify if you're voting for

'resource'
resource' OR '//endpoint/resource'
'//endpoint/resource'

I'm having trouble parsing from your response which you'd currently prefer to standardize.

Sander: I think one risk here is that if we standardize both 'resource' and '//endpoint/resource', but most clients only end up seeing 'resource' paths in practice, that's all they'll support in practice until updated to know the other form explicitly. Not sure how you feel about that.

sstriker · 2020-02-21T21:51:15Z

Thanks @eric Burnett <ericburnett@google.com>. I was thinking more of either saying: 1. We expect all of the services defined within the Remote Execution API and within the Remote Asset API to be exposed via a single endpoint, or 2. We expect each of the services defined within the Remote Execution API and within the Remote Asset API to potentially be exposed via separate endpoints, or 3. We expect services X, Y and Z defined within the Remote Execution API to be exposed via a single endpoint, and service A and B... We can take this to a separate issue though. With respect to streams, I agree with you that there is a risk of clients only really implementing 1, if we choose 2. In other words, if we want to ability to direct clients to specific endpoints for log streams, we need to make it 3.

…

On Fri, Feb 21, 2020 at 10:22 PM Eric Burnett ***@***.***> wrote: Ok, thanks both! Ed: ahh, apologies! I have a terrible memory sometimes. Can you clarify if you're voting for 1. 'resource' 2. resource' OR '//endpoint/resource' 3. '//endpoint/resource' I'm having trouble parsing from your response which you'd currently prefer to standardize. Sander: I think one risk here is that if we standardize both 'resource' and '//endpoint/resource', but most clients only end up seeing 'resource' paths in practice, that's all they'll support in practice until updated to know the other form explicitly. Not sure how you feel about that. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#116?email_source=notifications&email_token=AAQVXZZM6Z73BXFNEUNBQJLREBAZ5A5CNFSM4KTDMNP2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEMUEBPQ#issuecomment-589840574>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAQVXZZHYF2WNOP6HHMKUDDREBAZ5ANCNFSM4KTDMNPQ> .

moroten · 2020-02-22T08:12:20Z

Just to clarify, you are talking about stdout and stderr, but will the API support streaming other log files too? My long running actions sometimes write other debug log files which would be useful to stream. They are not written to stdout as that would spam the console output in Bazel. /Fredrik

EricBurnett · 2020-02-22T17:45:35Z

@moroten : The current API only covers stdout and stderr, but in principle it could be extended to more, and if it were I'm comfortable saying they'd be expressed in a consistent way as these ones. I don't know of anyone working on the spec changes for that at present though, as we'd need to discuss how clients request files to be streamed back and whatnot.

@sstriker : The current API doesn't specify whether Services are hosted together or separately, and in principle they could all be different endpoints. No client I'm aware of has that flexibility at present however - bazel can currently support only two iirc, and we ourselves always set them to the same value.

I wonder if that suggests another path forward here: specify in this change that all URIs are relative form (resource), relative to "the endpoint that hosts streamed responses". And leave the endpoint entirely unspecified. That's not actually different than e.g. how the Execution API returns Digests but doesn't tell you the endpoint upon which the CAS can be found. So in practice, if anyone chooses to implement client(s) where the streaming resources are on a different endpoint as CAS resources, that's allowable at the spec level.

I realize it entirely sidesteps the question of what we expect, but I think that's in part because expectations vary. On the API side, it's possible today to host every API on a different endpoint; on the client side, bazel at least expects use of only one or two, and other clients have potentially different expectations but probably not full variability.

@edbaunton : does that also satisfy your use-case? Relative resources, but relative to an endpoint specified out-of-band and explicitly allowed to be something other than the CAS endpoint?

And then, in terms of cleaning up the broader endpoint confusion so client expectations can be made more consistent with each other and with services, there are two directions we could go as a community. One is to say that clients should expect all APIs available at a single endpoint, period. The other is to define a clear set of different 'virtual' endpoints (including the concrete Services themselves, like the Execution service, plus 'CAS blob bytestreams', 'streamed bytestreams', etc), specify how they're bucketed in terms of which are allowed to vary independently and which should be co-hosted, and recommend a consistent way to configure clients and/or let them discover it (via Capabilities probably).

To me that feels worthwhile to settle, but also a big enough thing we should probably make it its own bug and target it at V3. WDYT?

illicitonion · 2020-02-25T17:23:40Z

@sstriker : The current API doesn't specify whether Services are hosted together or separately, and in principle they could all be different endpoints. No client I'm aware of has that flexibility at present however - bazel can currently support only two iirc, and we ourselves always set them to the same value.

FWIW, currently Pants allows specifying the CAS and Execution servers differently, and Scoot, Twitter's server implementation, currently assumes they will be set distinctly.

EricBurnett · 2020-03-06T18:22:32Z

@illicitonion Scoot expecting them to be different is interesting, but otherwise I think that fits the idea that we should hash out what are and independently variable outside the scope of this PR. Clearly there are interesting and different constraints to collect!

@edbaunton : if you don't have any concerns, mind updating the PR in line with the suggestion above? Copied for convenience:

specify in this change that all URIs are relative form (bare resource), relative to "the endpoint that hosts streamed responses".

In lieu of a more fully fledged service discovery layer, explicitly specify that the stderr and stdout stream names are in fact resource names from which one can Bytestream.Read. The actual endpoints are to be configured out of band. Signed-off-by: Ed Baunton <ebaunton1@bloomberg.net>

EricBurnett · 2020-03-10T20:10:27Z

Thanks Ed! LGTM from me.

EricBurnett · 2020-04-20T14:07:38Z

I've pulled out the discussion of "what to do in V3?" into its own bug: #132

)

edbaunton requested review from agoulti, bergsieker, buchgr and ola-rozenfeld as code owners February 11, 2020 15:51

googlebot added the cla: yes Pull requests whose authors are covered by a CLA with Google. label Feb 11, 2020

edbaunton force-pushed the patch-3 branch from 74f349f to 8a1021a Compare March 10, 2020 13:18

edbaunton force-pushed the patch-3 branch from 8a1021a to b6a19e6 Compare March 10, 2020 19:01

edbaunton changed the title ~~Specify that stderr and stdout streams are URIs~~ Specify that stderr and stdout streams are resource names Mar 12, 2020

EricBurnett mentioned this pull request Apr 20, 2020

V3: Clarify relationship between APIs and Endpoints #132

Open

sstriker approved these changes May 4, 2020

View reviewed changes

sstriker merged commit 0b2ba96 into bazelbuild:master May 4, 2020

edbaunton deleted the patch-3 branch May 4, 2020 12:35

santigl pushed a commit to santigl/remote-apis that referenced this pull request Aug 26, 2020

Handling empty directory input structure (bazelbuild#42) (bazelbuild#116

722563a

)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Specify that stderr and stdout streams are resource names #116

Specify that stderr and stdout streams are resource names #116

edbaunton commented Feb 11, 2020

EricBurnett commented Feb 11, 2020 •

edited

Loading

edbaunton commented Feb 11, 2020 •

edited

Loading

EricBurnett commented Feb 11, 2020

edbaunton commented Feb 13, 2020

EricBurnett commented Feb 21, 2020

edbaunton commented Feb 21, 2020

sstriker commented Feb 21, 2020 via email

EricBurnett commented Feb 21, 2020

sstriker commented Feb 21, 2020 via email

moroten commented Feb 22, 2020 via email

EricBurnett commented Feb 22, 2020

illicitonion commented Feb 25, 2020

EricBurnett commented Mar 6, 2020

EricBurnett commented Mar 10, 2020

EricBurnett commented Apr 20, 2020

Specify that stderr and stdout streams are resource names #116

Specify that stderr and stdout streams are resource names #116

Conversation

edbaunton commented Feb 11, 2020

EricBurnett commented Feb 11, 2020 • edited Loading

edbaunton commented Feb 11, 2020 • edited Loading

EricBurnett commented Feb 11, 2020

edbaunton commented Feb 13, 2020

EricBurnett commented Feb 21, 2020

edbaunton commented Feb 21, 2020

sstriker commented Feb 21, 2020 via email

EricBurnett commented Feb 21, 2020

sstriker commented Feb 21, 2020 via email

moroten commented Feb 22, 2020 via email

EricBurnett commented Feb 22, 2020

illicitonion commented Feb 25, 2020

EricBurnett commented Mar 6, 2020

EricBurnett commented Mar 10, 2020

EricBurnett commented Apr 20, 2020

EricBurnett commented Feb 11, 2020 •

edited

Loading

edbaunton commented Feb 11, 2020 •

edited

Loading