Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timeout execution requirement doesn't work on starlark rules and genrule #12349

Open
uri-canva opened this issue Oct 26, 2020 · 16 comments
Open
Assignees
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: feature request

Comments

@uri-canva
Copy link
Contributor

uri-canva commented Oct 26, 2020

Description of the problem / feature request:

Support timeout execution requirement on starlark rules and genrules as a way to terminate stuck / hanging actions, so that their stderr / stdout can be printed out to aid debugging.

Feature requests: what underlying problem are you trying to solve with this feature?

Without such a feature when an action hangs, it can be hard to see the output it has printed so far, especially on non-interactive systems like CI. Another option might be to stream the output of the actions like requested in #7213, but that would require running actions sequentially, so it could only used when reproducing the issue.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Create a starlark rule and a genrule that never terminate, set execution_requirements = {"timeout": "5"} on the starlark rule, and tags = ["timeout:5"] on both, then run the build with --experimental_allow_tags_propagation, note that neither rule is terminated.

What operating system are you running Bazel on?

Both macOS and Linux

What's the output of bazel info release?

release 3.3.0

Have you found anything relevant by searching the web?

#7766
#8830

@jin jin added team-Core Skyframe, bazel query, BEP, options parsing, bazelrc type: feature request untriaged labels Oct 26, 2020
@ulfjack
Copy link
Contributor

ulfjack commented Oct 26, 2020

Local execution already enforces action timeouts where specified. However, there is no default timeout. You can add a tag "timeout:" to enforce a timeout for local execution (I think). That might be enough to cover the use case you have. It would also not be too difficult to add a flag to Bazel to set a default timeout for all locally executed actions.

I think you're proposing to add a feature to ActionExecutionInactivityWatchdog to print the stdout/stderr of currently executing actions after a certain 'timeout'. I am not sure I would do that. It would add another unrelated mechanism that doesn't seem like it carries its own weight.

@uri-canva
Copy link
Contributor Author

I see the timeout execution requirement in the source, but it doesn't seem to work. I tried execution_requirements = {"timeout": "5"} on a starlark action and tags = ["timeout:5"] or tags = ["timeout=5"] on both a starlark target and a genrule target, both with and without --experimental_allow_tags_propagation.

I couldn't find any documentation on it, am I using it wrong? I'll change the issue in the meantime.

@uri-canva uri-canva changed the title Watchdog feature to kill stuck / hanging actions timeout execution requirement on starlark rules and genrule Oct 26, 2020
@uri-canva uri-canva changed the title timeout execution requirement on starlark rules and genrule Support timeout execution requirement on starlark rules and genrule Oct 26, 2020
@uri-canva
Copy link
Contributor Author

@jin would you mind double checking the tags on this issue? Now that it's about making the timeout execution requirement work on starlark rules it might be a bug rather than a feature.

@uri-canva uri-canva changed the title Support timeout execution requirement on starlark rules and genrule Timeout execution requirement doesn't work on starlark rules and genrule Oct 26, 2020
@uri-canva
Copy link
Contributor Author

@ishikhman do you know if there's a known issue with the interaction between experimental_allow_tags_propagation and timeout?

@jin jin added team-Local-Exec Issues and PRs for the Execution (Local) team type: bug and removed team-Core Skyframe, bazel query, BEP, options parsing, bazelrc type: feature request labels Oct 27, 2020
@ulfjack
Copy link
Contributor

ulfjack commented Oct 27, 2020

That is odd. Do you have a simple reproduction that we could look at?

@larsrc-google
Copy link
Contributor

Can you show a bit more of what you're doing that's not working? A rule definition or a build log? I just ran across the timeout in connection with multiplex workers, and it certainly looks like timeouts are not treated right in either worker - the timeout is set on the process, but for workers the process lives forever, and nobody seems to call the methods that check the timeout under normal circumstances.

@meisterT
Copy link
Member

Please reply, if you have a repro case and we can reopen this issue.

@sushain97
Copy link
Contributor

sushain97 commented Feb 17, 2021

@meisterT

Here's a minimal reproduction under remote execution (bazel client v3.0.0):

BUILD.bazel:

load("sleep.bzl", "sleep")
sleep(name = "sleep")

sleep.bzl:

def _sleep(ctx):
    out = ctx.actions.declare_file("out")
    ctx.actions.run_shell(
        command = "sleep 5; touch {}".format(out.path),
        outputs = [out],
        execution_requirements = {
            "timeout": "1",
        },
    )
    return [DefaultInfo(files = depset([out]))]

sleep = rule(
    _sleep,
)

The action succeeds:

$ ./bazel build --config=remote //:sleep --experimental_remote_grpc_log=/tmp/bazel-grpc.log
INFO: Invocation ID: c5454b1d-28f6-4ea6-92d1-154cf2ba30ba
INFO: Analyzed target //:sleep (1 packages loaded, 2 targets configured).
INFO: Found 1 target...
Target //:sleep up-to-date:
  bazel-bin/out
INFO: Elapsed time: 7.529s, Critical Path: 5.25s
INFO: 1 process: 1 remote.
INFO: Build completed successfully, 2 total actions

Of course, this would happen if our remote execution implementation was faulty.

However, there's simply no timeout field in the ExecuteRequest:

$ grpc call remotebuild-storage.service.envoy:10080 google.bytestream.ByteStream/Read "resource_name: 'blobs/a7dc37d0fe4be154eca041e062b16ea147bb8831d148ae9eb29ff48606e0cb18/139'"
connecting to remotebuild-storage.service.envoy:10080
Received initial metadata from server:
date : Wed, 17 Feb 2021 06:41:00 GMT
server : envoy
x-envoy-upstream-service-time : 1
data: "\nE\n@a53b153b6206dc7f6212abff622be4ab8c7249aba882b7d6b68545e5d2b80615\020\217\002\022B\n@e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"

Rpc succeeded with OK status
$ printf "\nE\n@a53b153b6206dc7f6212abff622be4ab8c7249aba882b7d6b68545e5d2b80615\020\217\002\022B\n@e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855" | protoc --decode_raw
1 {
  1: "a53b153b6206dc7f6212abff622be4ab8c7249aba882b7d6b68545e5d2b80615"
  2: 271
}
2 {
  1: "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
}

Full GRPC log: https://gist.github.com/sushain97/366d9efa18a529fb2c62cb18117d1ce0

A naive inspection of the relevant code doesn't reveal an obvious bug:

However, there are many layers of abstractions here so I'm probably missing something.

Related, I wasn't able to find any documentation for build action timeouts, whether specified as tags or execution requirements. Happy to be corrected but otherwise, this seems like a gap to fill?

@meisterT meisterT reopened this Feb 17, 2021
@meisterT meisterT assigned coeuvre and unassigned uri-canva Feb 17, 2021
@meisterT meisterT added team-Remote-Exec Issues and PRs for the Execution (Remote) team and removed team-Local-Exec Issues and PRs for the Execution (Local) team labels Feb 17, 2021
@sventiffe
Copy link
Contributor

Grooming the backlog, is "more data needed" still accurate here?
Else, Chi, would you be able to triage this issue, please?

@uri-canva
Copy link
Contributor Author

I can also add more information if needed.

@coeuvre
Copy link
Member

coeuvre commented May 11, 2021

Remote execution doesn't support timeout execution requirement yet. I have the entry to revisit timeout behavior for remote executor on my TODO list and will address this during that effort.

@coeuvre coeuvre added P2 We'll consider working on this in future. (Assignee optional) type: feature request and removed type: bug untriaged labels May 11, 2021
@uri-canva
Copy link
Contributor Author

It doesn't work on local execution either though, that's what this issue is about.

@coeuvre
Copy link
Member

coeuvre commented May 11, 2021

Can you please share a repro case with local execution?

@sluongng
Copy link
Contributor

sluongng commented Aug 23, 2022

@coeuvre @ulfjack I am running into this problem right now where my user have a buggy / flaky genrule that could hang the build forever. I really want to have a global flag that enforce a certain timeout duration on all actions (especially build actions) if possible.

It seems like in this issue, the solution is deviating toward using exec_requirements. After having review the execution requirement code at https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/actions/ExecutionRequirements.java;l=122;drc=a2e9f6e401c3fea92a6d515083553ffefdafc24a, Im pretty confident in saying that the TIMEOUT exec_requirement is currently only applied through src/main/java/com/google/devtools/build/lib/exec/StandaloneTestStrategy.java which only apply to test action and cannot be set via tags = ["timeout:<decimal>"]

Are you guys ok with changing this into a ParseableRequirement similar to CPU?


Here is a simple reproducible

> cat BUILD
genrule(
    name = "create_a",
    outs = ["a.txt"],
    cmd = """
    sleep infinity;
    touch $@
    """,
    tags = [
        "timeout:1",
    ],
)```

@linzhp
Copy link
Contributor

linzhp commented May 24, 2023

I really want to have a global flag that enforce a certain timeout duration on all actions (especially build actions) if possible.

We need this flag too. We've seen some actions taking half an hour on a CI machine, while only taking less than half a minute on a local machine. We are still not sure what's going on in those CI machines, but it would really help if we can timeout a local action so we can restart the build as early, possibly on a different machine.

@sbjorkle-qualcomm
Copy link

Remote execution doesn't support timeout execution requirement yet. I have the entry to revisit timeout behavior for remote executor on my TODO list and will address this during that effort.

@coeuvre, what's the status of getting support for custom action timeouts in remote action execution requirements?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: feature request
Projects
None yet
Development

No branches or pull requests