Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add --experimental_remote_execute_timeout #12231

Closed
wants to merge 4 commits into from

Conversation

luxe
Copy link

@luxe luxe commented Oct 8, 2020

Problem:

Bazel remote executions can experience indefinite wait times due to an unfulfilled remote execution request. This situation can happen for two reasons. First, the remote server may not honor the action timeout provided (this is more often a bug in the server's implementation as it violates the assumptions of REAPI). Second, an action timeout may not be specified and the remote server does not impose one (though it should).

Here is what you would see in these situations:

bazel test :any_target --remote_executor=grpc://localhost:8980 --remote_retries=0
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
INFO: Invocation ID: f956394e-834d-4a07-aa73-b926aff0860c
INFO: Analyzed target :any_target (28 packages loaded, 3257 targets configured).
INFO: Found 1 test target...
[4 / 5] [Sched] Testing :any_target; 999999s

Solution:

In both cases, we can blame the server for this, and claim that bazel is operating as it should. However, we want to guarantee to our bazel users that their builds will not hang indefinitely. This is particularly important across CI jobs, where the specific build innovation information is not being monitored and a hung build can delay other jobs due to a much longer CI timeout. If we could provide a timeout for remote executions on the client side, we would detect these issues sooner, and even allow bazel to retry given the existing retry mechanism (--remote_retries).

Code:

We've added a new remote flag called: --experimental_remote_execute_timeout.
Only when this flag is used will a deadline be imposed on the grpc execute call.

This is the result:

bazel :any_target --remote_executor=grpc://localhost:8980 --experimental_remote_execute_timeout="5s"
INFO: Invocation ID: d3137cb6-659e-46ca-aa23-b8ff93005846
INFO: Analyzed target :any_target (0 packages loaded, 0 targets configured).
INFO: Found 1 test target...
ERROR: /home/laptop/Desktop/code/BUILD:5:8:  failed (Exit 34): DEADLINE_EXCEEDED: deadline exceeded after 4.999713178s. [remote_addr=localhost/127.0.0.1:8980]. Note: Remote connection/protocol failed with: execution failed DEADLINE_EXCEEDED: deadline exceeded after 4.999713178s. [remote_addr=localhost/127.0.0.1:8980]
Target :any_target up-to-date:
  bazel-bin/any_target
INFO: Elapsed time: 5.456s, Critical Path: 5.04s
INFO: 2 processes: 2 internal.
FAILED: Build did NOT complete successfully
:any_target                       NO STATUS

FAILED: Build did NOT complete successfully

@google-cla google-cla bot added the cla: yes label Oct 8, 2020
@luxe luxe marked this pull request as ready for review October 8, 2020 15:47
@luxe
Copy link
Author

luxe commented Oct 8, 2020

@werkt / @ulfjack for visibility

@dws
Copy link
Contributor

dws commented Oct 8, 2020

@jmmv for visibility

@google-cla
Copy link

google-cla bot commented Oct 8, 2020

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@ulfjack
Copy link
Contributor

ulfjack commented Oct 9, 2020

I posted about my personal preference here: bazelbuild/remote-apis#57 (comment)

Personally, I do not like the idea of setting - say - a hard one-minute deadline on Execute and then calling WaitExecution. This doesn't seem right to me. However, if the server guaranteed that it would send responses at some intervals, the client could set a timer, and reset the timer when it sees a response. That would allow it to stick with the original Execute request, rather than (intentionally) falling over to WaitExecution.

(Unfortunately, I have not had time to implement something like that yet, and it also requires server-side support.)

I've also documented some of my previous explorations here: #11782

I added support for client-side keep-alive here: #11957

It might be worth trying if enabling keep-alive helps with the problems you're seeing. Is that something you could try? If they do help, then you might be running into the same problem I've seen before. I still suspect that GCP's load balancer is silently dropping connections, but I can't confirm either way.

@aiuto aiuto added team-Remote-Exec Issues and PRs for the Execution (Remote) team untriaged labels Oct 15, 2020
@philwo
Copy link
Member

philwo commented Oct 21, 2020

@coeuvre Is this useful / relevant to your recent work on the remote execution timeouts? 🤔

@coeuvre
Copy link
Member

coeuvre commented Oct 21, 2020

Hi Philipp, thanks for mentioning here.

This PR tries to solve two problems:

  1. Provide an upper limit on how long the execution calls can run for.
  2. Avoid Bazel hangs indefinitely on remote builds.

My recent work is primarily focus on problem 2 and the initial idea is same as this PR: add a hard timeout to remote execution calls. However, after discussed with Jakob internally, I believe this is not a good idea since execution times for actions vary and it's hard to pick an effective value. I am working on a design doc of another way to fix this and will public it once finished.

@luxe
Copy link
Author

luxe commented Oct 21, 2020

We were trying to address operation hangs by enforcing grpc timeouts on both sides (bazel & buildfarm).
buildfarm/buildfarm#543

The motivation now:
Assume you have a faulty remote execution server, that sometimes keeps connections alive, and does not finish the executions.
Should bazel take control of its own process lifetime? Or should it's lifetime be intrinsically tied to the server's implementation, causing it to be unable to terminate? It's a grpc best practice to set deadlines on client stubs. Should we follow that advice on the execute call? Or is REAPI's specification an exception to this?

It might be worth trying if enabling keep-alive helps with the problems you're seeing. Is that something you could try? If they do help, then you might be running into the same problem I've seen before. I still suspect that GCP's load balancer is silently dropping connections, but I can't confirm either way.

Thanks for this advice. I was having trouble getting it to work predictably with buildfarm. I might have to change buildfarm's keepalive settings- or see if its related to netty or something. This might be a better way to prevent connections from staying open. Maybe less granularity on particular calls, and I'm not sure if it would be the same problem when a server always keeps connections alive. Will reply as I learn more.

I believe this is not a good idea since execution times for actions vary and it's hard to pick an effective value. I am working on a design doc of another way to fix this and will public it once finished.

Makes sense. Our intention was to set the timeout at the longest running test execution + buffer time.

@ulfjack
Copy link
Contributor

ulfjack commented Oct 21, 2020

@luxe the BuildFarm source code on GitHub says that keep-alive isn't allowed by the server. It's a bit odd that gRPC's ServerBuilder class (https://grpc.github.io/grpc-java/javadoc/io/grpc/ServerBuilder.html) doesn't have a method for this. I think BuildFarm is actually using NettyServerBuilder, which has a permitKeepAliveTime method (https://grpc.github.io/grpc-java/javadoc/io/grpc/netty/NettyServerBuilder.html#permitKeepAliveTime-long-java.util.concurrent.TimeUnit-). Not sure if this helps.

@werkt
Copy link
Contributor

werkt commented Apr 11, 2022

the BuildFarm source code on GitHub says that keep-alive isn't allowed by the server.

@ulfjack Can you provide a link indicating the comment/code you're referring to here

I'd like to restart consideration for this change - being experimental and disabled by default, I see no reason why it shouldn't be available for specification.

@luxe Please rebase and resolve conflicts to enable further consideration.

@luxe
Copy link
Author

luxe commented Apr 15, 2022

Looks like there is already an ExperimentalGrpcRemoteExecutor.java that sets a deadline. We should try to leverage
that instead passing a new deadline option into the existing GrpcRemoteExecutor.java

@sgowroji
Copy link
Member

Hello @luxe, Could you please have a look on the buildkite presubmit check failures and any update on the above PR. Thanks!

@sgowroji sgowroji added the awaiting-user-response Awaiting a response from the author label Apr 20, 2022
@coeuvre
Copy link
Member

coeuvre commented Apr 20, 2022

ExperimentalGrpcRemoteExecutor sets deadline on each exec call. You can use it by setting --experimental_remote_execution_keepalive.

There are also --grpc_keepalive_time and --grpc_keepalive_timeout if server supports keepalive.

Closing.

@coeuvre coeuvre closed this Apr 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting-user-response Awaiting a response from the author cla: yes team-Remote-Exec Issues and PRs for the Execution (Remote) team untriaged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants