Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Execute keep alive response #57

Open
werkt opened this issue Feb 19, 2019 · 6 comments
Open

Execute keep alive response #57

werkt opened this issue Feb 19, 2019 · 6 comments

Comments

@werkt
Copy link
Collaborator

werkt commented Feb 19, 2019

Clarify and document the behavior for using an operation response with done=false. This has not been documented with expected behavior by the server or client, and should be detailed as such.

Client behavior should be to ignore the update and continue to wait for responses.

Server behavior should be to emit this update on some reasonable interval - with or without updates, perhaps upon some (maximally limited) client request period, since this will vary by client and should be tunable to address connectivity issues over multi-hop routes.

@ola-rozenfeld
Copy link
Contributor

Note: it's technically okay if the client doesn't necessarily wait for the end of the streamed responses. They can just poll using GetOperation instead, or call WaitExecution again.

Wait, by tunable you mean you want the client to be able to tell the server "update me every minute" or something? If so, how about we split this into two patches -- one would just add the comment that you suggested, something like:

The server should stream operation updates (with done=false) on reasonable intervals (e.g. each minute). The client is free to ignore any updates where done=false, and to keep waiting for the operation to complete.

And then we can have a separate feature request to add a Duration to ExecutionRequest for how often you need to stream back the updates. That duration can be limited by something in Capabilities, if we want to. WDYT?

@bergsieker
Copy link
Collaborator

@werkt Is there something yet to be done here, or can this issue be closed?

santigl pushed a commit to santigl/remote-apis that referenced this issue Aug 26, 2020
* Adding inputs metadata calculation (bazelbuild#45).

* Addressing comments
@ulfjack
Copy link
Collaborator

ulfjack commented Sep 8, 2020

It would be nice if the client had some guarantees around the timing of update responses. E.g., if the server sends responses every 60 seconds, then the client can use that to defer a timeout.

We are seeing a problem with clients getting stuck if the HTTP connection to the server is silently dropped by an intermediary (we suspect GCP's TCP load balancer). I added support for client-side keep-alives to Bazel. However, there was a discussion about setting a deadline for execute requests on the client.

Personally, I do not like the idea of setting - say - a hard one-minute deadline on Execute and then calling WaitExecution. This doesn't seem right to me. However, if the server guaranteed that it would send responses at some intervals, the client could set a timer, and reset the timer when it sees a response. That would allow it to stick with the original Execute request, rather than (intentionally) falling over to WaitExecution.

It can also apply timeouts at a finer grain than the entire Execute request - the expected process is that the server goes through several phases before and after actual execution, e.g., queuing, input staging, execution, output upload. The client can then apply timeouts to individual phases rather than the entire request (to some extent this is possible today, but not all servers provide phase updates).

@EdSchouten
Copy link
Collaborator

@ulfjack Sounds like a good idea!

FYI: Buildbarn’s bb_scheduler already writes an update back to the client exactly once every minute, both in the case of Execute and WaitExecution. So we don’t announce worker state changes immediately, but do allow clients to observe individual states if they take long enough.

Changing REv2 to require this is therefore fine from our end.

@EricBurnett
Copy link
Collaborator

Personally, I do not like the idea of setting - say - a hard one-minute deadline on Execute and then calling WaitExecution.

@ulfjack I think you're responding to a suggestion from me? I agree, a hard 1m deadline seems a little too low, and I'll note that we'd need WaitExecution to be handled exactly the same way since it suffers from the same risks. My preference would be 5-10m RPC-level timeouts alone, or 30-60m RPC timeout plus "more than X (2m?) since last message" preemption logic. (I like the latter better, but would take either.) Though note in either case we also have to have clients exempt DEADLINE_EXCEEDED from eating up retries still.

Either of those should still work with servers that don't send progressive messages, so I'd also be happy if we added e.g. "Servers SHOULD return a message every 60s or less..." to the API, which itself is backwards compatible since it's not a hard requirement.

@ulfjack
Copy link
Collaborator

ulfjack commented Sep 8, 2020

@EricBurnett, yes. I think requiring it would have to be deferred to 3.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants