-
Notifications
You must be signed in to change notification settings - Fork 113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Execute keep alive response #57
Comments
Note: it's technically okay if the client doesn't necessarily wait for the end of the streamed responses. They can just poll using GetOperation instead, or call WaitExecution again. Wait, by tunable you mean you want the client to be able to tell the server "update me every minute" or something? If so, how about we split this into two patches -- one would just add the comment that you suggested, something like:
And then we can have a separate feature request to add a Duration to ExecutionRequest for how often you need to stream back the updates. That duration can be limited by something in Capabilities, if we want to. WDYT? |
@werkt Is there something yet to be done here, or can this issue be closed? |
* Adding inputs metadata calculation (bazelbuild#45). * Addressing comments
It would be nice if the client had some guarantees around the timing of update responses. E.g., if the server sends responses every 60 seconds, then the client can use that to defer a timeout. We are seeing a problem with clients getting stuck if the HTTP connection to the server is silently dropped by an intermediary (we suspect GCP's TCP load balancer). I added support for client-side keep-alives to Bazel. However, there was a discussion about setting a deadline for execute requests on the client. Personally, I do not like the idea of setting - say - a hard one-minute deadline on Execute and then calling WaitExecution. This doesn't seem right to me. However, if the server guaranteed that it would send responses at some intervals, the client could set a timer, and reset the timer when it sees a response. That would allow it to stick with the original Execute request, rather than (intentionally) falling over to WaitExecution. It can also apply timeouts at a finer grain than the entire Execute request - the expected process is that the server goes through several phases before and after actual execution, e.g., queuing, input staging, execution, output upload. The client can then apply timeouts to individual phases rather than the entire request (to some extent this is possible today, but not all servers provide phase updates). |
@ulfjack Sounds like a good idea! FYI: Buildbarn’s bb_scheduler already writes an update back to the client exactly once every minute, both in the case of Execute and WaitExecution. So we don’t announce worker state changes immediately, but do allow clients to observe individual states if they take long enough. Changing REv2 to require this is therefore fine from our end. |
@ulfjack I think you're responding to a suggestion from me? I agree, a hard 1m deadline seems a little too low, and I'll note that we'd need WaitExecution to be handled exactly the same way since it suffers from the same risks. My preference would be 5-10m RPC-level timeouts alone, or 30-60m RPC timeout plus "more than X (2m?) since last message" preemption logic. (I like the latter better, but would take either.) Though note in either case we also have to have clients exempt DEADLINE_EXCEEDED from eating up retries still. Either of those should still work with servers that don't send progressive messages, so I'd also be happy if we added e.g. "Servers SHOULD return a message every 60s or less..." to the API, which itself is backwards compatible since it's not a hard requirement. |
@EricBurnett, yes. I think requiring it would have to be deferred to 3.0. |
Clarify and document the behavior for using an operation response with done=false. This has not been documented with expected behavior by the server or client, and should be detailed as such.
Client behavior should be to ignore the update and continue to wait for responses.
Server behavior should be to emit this update on some reasonable interval - with or without updates, perhaps upon some (maximally limited) client request period, since this will vary by client and should be tunable to address connectivity issues over multi-hop routes.
The text was updated successfully, but these errors were encountered: