Execute keep alive response #57

werkt · 2019-02-19T15:23:40Z

Clarify and document the behavior for using an operation response with done=false. This has not been documented with expected behavior by the server or client, and should be detailed as such.

Client behavior should be to ignore the update and continue to wait for responses.

Server behavior should be to emit this update on some reasonable interval - with or without updates, perhaps upon some (maximally limited) client request period, since this will vary by client and should be tunable to address connectivity issues over multi-hop routes.

ola-rozenfeld · 2019-05-09T16:40:22Z

Note: it's technically okay if the client doesn't necessarily wait for the end of the streamed responses. They can just poll using GetOperation instead, or call WaitExecution again.

Wait, by tunable you mean you want the client to be able to tell the server "update me every minute" or something? If so, how about we split this into two patches -- one would just add the comment that you suggested, something like:

The server should stream operation updates (with done=false) on reasonable intervals (e.g. each minute). The client is free to ignore any updates where done=false, and to keep waiting for the operation to complete.

And then we can have a separate feature request to add a Duration to ExecutionRequest for how often you need to stream back the updates. That duration can be limited by something in Capabilities, if we want to. WDYT?

bergsieker · 2019-06-03T20:20:36Z

@werkt Is there something yet to be done here, or can this issue be closed?

* Adding inputs metadata calculation (bazelbuild#45). * Addressing comments

ulfjack · 2020-09-08T17:33:05Z

It would be nice if the client had some guarantees around the timing of update responses. E.g., if the server sends responses every 60 seconds, then the client can use that to defer a timeout.

We are seeing a problem with clients getting stuck if the HTTP connection to the server is silently dropped by an intermediary (we suspect GCP's TCP load balancer). I added support for client-side keep-alives to Bazel. However, there was a discussion about setting a deadline for execute requests on the client.

Personally, I do not like the idea of setting - say - a hard one-minute deadline on Execute and then calling WaitExecution. This doesn't seem right to me. However, if the server guaranteed that it would send responses at some intervals, the client could set a timer, and reset the timer when it sees a response. That would allow it to stick with the original Execute request, rather than (intentionally) falling over to WaitExecution.

It can also apply timeouts at a finer grain than the entire Execute request - the expected process is that the server goes through several phases before and after actual execution, e.g., queuing, input staging, execution, output upload. The client can then apply timeouts to individual phases rather than the entire request (to some extent this is possible today, but not all servers provide phase updates).

EdSchouten · 2020-09-08T17:48:38Z

@ulfjack Sounds like a good idea!

FYI: Buildbarn’s bb_scheduler already writes an update back to the client exactly once every minute, both in the case of Execute and WaitExecution. So we don’t announce worker state changes immediately, but do allow clients to observe individual states if they take long enough.

Changing REv2 to require this is therefore fine from our end.

EricBurnett · 2020-09-08T17:58:04Z

Personally, I do not like the idea of setting - say - a hard one-minute deadline on Execute and then calling WaitExecution.

@ulfjack I think you're responding to a suggestion from me? I agree, a hard 1m deadline seems a little too low, and I'll note that we'd need WaitExecution to be handled exactly the same way since it suffers from the same risks. My preference would be 5-10m RPC-level timeouts alone, or 30-60m RPC timeout plus "more than X (2m?) since last message" preemption logic. (I like the latter better, but would take either.) Though note in either case we also have to have clients exempt DEADLINE_EXCEEDED from eating up retries still.

Either of those should still work with servers that don't send progressive messages, so I'd also be happy if we added e.g. "Servers SHOULD return a message every 60s or less..." to the API, which itself is backwards compatible since it's not a hard requirement.

ulfjack · 2020-09-08T18:26:13Z

@EricBurnett, yes. I think requiring it would have to be deferred to 3.0.

bergsieker added this to the Remote Execution API v2 milestone Mar 12, 2019

bergsieker mentioned this issue Mar 12, 2019

Execute keepalive client expectations #48

Closed

santigl pushed a commit to santigl/remote-apis that referenced this issue Aug 26, 2020

Adding inputs metadata calculation (bazelbuild#45). (bazelbuild#57)

79c7902

* Adding inputs metadata calculation (bazelbuild#45). * Addressing comments

ulfjack mentioned this issue Oct 9, 2020

add --experimental_remote_execute_timeout bazelbuild/bazel#12231

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Execute keep alive response #57

Execute keep alive response #57

werkt commented Feb 19, 2019

ola-rozenfeld commented May 9, 2019

bergsieker commented Jun 3, 2019

ulfjack commented Sep 8, 2020

EdSchouten commented Sep 8, 2020

EricBurnett commented Sep 8, 2020

ulfjack commented Sep 8, 2020

Execute keep alive response #57

Execute keep alive response #57

Comments

werkt commented Feb 19, 2019

ola-rozenfeld commented May 9, 2019

bergsieker commented Jun 3, 2019

ulfjack commented Sep 8, 2020

EdSchouten commented Sep 8, 2020

EricBurnett commented Sep 8, 2020

ulfjack commented Sep 8, 2020