Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bazel 0.21.0 flipped --incompatible_strict_action_env to true and broke itself #7026

Open
philwo opened this issue Jan 2, 2019 · 58 comments
Labels
not stale Issues or PRs that are inactive but not considered stale P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: process

Comments

@philwo
Copy link
Member

philwo commented Jan 2, 2019

Description of the problem / feature request:

Bazel 0.21.0 flipped --incompatible_strict_action_env to true via #6648, which broke CI on macOS and Windows CI. On macOS we can no longer find "md5" (which is in /sbin) and on Windows we can no longer find Python, PowerShell and other tools.

We didn't discover these breakages during testing, because apparently we don't actually test Bazel itself in the "Bazel@HEAD + Downstream Projects".

We'll revert this via a flag in the CI config files, but flag owners should work on fixing our tests (and taking user feedback into account whether this new behavior is actually working well for others) so that we can eventually remove the config entry again.

Relevant logs:

@philwo philwo added P1 I'll work on this now. (Assignee required) breakage labels Jan 2, 2019
@laszlocsomor
Copy link
Contributor

laszlocsomor commented Jan 2, 2019

Philipp, thanks for finding, filing, and working around this bug.

We (@philwo , @hlopko , and myself) argued about this flag today and realized that, while sh_* rules should probably not run python directly (why would a shell script need to do that?), they should be able to run their py_binary data-dependencies, which rightfully expect to have python / python.exe on the PATH.

It's unclear to me what should Bazel do. It seems that PATH's whole purpose is to contain the locations of these common tools, in which case PATH should be exempt from the "strict" env and be available to the action. On the other hand, PATH is inherently machine-specific and as such it poisons remote caches.

Another important observation @philwo made is that the "strict" action env is just as non-hermetic as the non-strict one; we just replace the clientenv's PATH with a hardcoded PATH string, but whatever is in those directories is out of our control (and not tracked by Bazel). And on one machine they might have python (and point to python2) but on another they might have no python at all or have python pointing to python3.

@benjaminp
Copy link
Collaborator

One option would be to make PATH work more like TMPDIR, where it's not part of the action key and injected by the spawn runner at the last moment. That would be a step towards fixing #6392, too.

meteorcloudy added a commit to bazelbuild/continuous-integration that referenced this issue Jan 2, 2019
@meteorcloudy
Copy link
Member

To prevent this from happening again, I filed bazelbuild/continuous-integration#434

@brandjon
Copy link
Member

brandjon commented Jan 4, 2019

they should be able to run their py_binary data-dependencies, which rightfully expect to have python / python.exe on the PATH. [...] On the other hand, PATH is inherently machine-specific and as such it poisons remote caches.

I can't speak for PATH in general, but from the point of view of the Python interpreter: The long-term solution is to define the interpreter as a toolchain rule, where a particular toolchain target is appropriate for a specific system / path installation. The toolchain target for the current host system could be generated by a repository rule.

Currently py_runtime is the closest thing we have to a toolchain. The Python stub script only runs the python on PATH if no py_runtime was specified via --python_top.

@laszlocsomor
Copy link
Contributor

@brandjon : Can you explain how executing a py_binary-generated target would work if the python interpreter is a toolchain? As I see it, running bazel-bin/foo/py/bin (which is currently a symlink to the stub python script, whose shebang line is #!/usr/bin/env python) requires an interpreter on the PATH.

@laszlocsomor
Copy link
Contributor

And this is only the Linux/macOS stub script. On Windows we have another stub that is a C++ program (called the "launcher"). @meteorcloudy could correct me on this, but IIRC the local python interpreter's path is embedded in the launcher.

@meteorcloudy
Copy link
Member

@laszlocsomor The native launcher also only uses python on PATH if no py_runtime was specified, basically it always uses the same python binary as the python stub template, see

return createWindowsExeLauncher(ruleContext, pythonBinary, executable, true);

That said, the python binary (including the native launcher for python) won't be a problem if python path is explicitly specified even with --incompatible_strict_action_env

@ulfjack ulfjack removed their assignment Jan 7, 2019
@ulfjack
Copy link
Contributor

ulfjack commented Jan 7, 2019

I think we need to change the default PATH on macos and Windows. It might be worthwhile to roll back the flag flip, except it may be too late now?

@laszlocsomor
Copy link
Contributor

@meteorcloudy : Thanks, that's comforting to know.
@ulfjack : I agree that it's probably too late to roll back the flag flip commit now. What do you think the default PATH should contain on macOS and Windows?

@filipesilva
Copy link

Regarding Windows:

@ulfjack
Copy link
Contributor

ulfjack commented Jan 7, 2019

I don't have a mac available right now, but I take it /sbin is on the PATH by default, so we should add it to the default path in Bazel as well.

It's
/Users/[user]/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:
according to https://www.reddit.com/r/MacOS/comments/5zabo3/question_what_is_the_default_path_for_macos/, although we should leave out the users home bin directory for hermeticity.

@ulfjack
Copy link
Contributor

ulfjack commented Jan 7, 2019

For a specific proposal, I'd say we should at least set it to /usr/bin:/bin:/usr/sbin:/sbin on macOS.

@meteorcloudy
Copy link
Member

meteorcloudy commented Jan 7, 2019

I assume the strict PATH env should contain the most common native tools available on that platform. For Windows, I think they should be:

  • C:\Windows and C:\Windows\System32, for common Windows cmd commands, eg. findstr.exe, whoami.exe, where.exe etc.
  • C:\Windows\System32\WindowsPowerShell\v1.0, for powershell.exe. @philwo pointed out different powershell versions are all installed in this same path, even if v1.0 is in the path.
  • C:\tools\msys64\usr\bin and C:\tools\msys64\bin, for bash bin tools. This will change if users specify a different bash executable by --shell_executable or shell toolchain.

But there are two issues to be considered on Windows:

  1. System root could change. Although the most common system root is C:\Windows, users could have different system root. If that's not C:\Windows, the default PATH env is wrong. One possible solution is to replace C:\Windows with %SYSTEMROOT%. We always set %SYSTEMROOT% for every action on Windows and it's a special env var set before launching the process, which won't affect any action cache. See
    String[] systemEnvironmentVars = {"SYSTEMROOT", "SYSTEMDRIVE"};
  2. File paths are case insensitive on Windows. Currently, if you change PATH from C:\foo\bar to C:\Foo\bar, the action will be re-executed, I think we should fix this.

@ulfjack
Copy link
Contributor

ulfjack commented Jan 8, 2019

We need to achieve three things here:

  1. Have a default that's generally usable for local builds
  2. The default must work with remote execution (e.g., no local usernames, machine-specific directories, etc.); this is to avoid cache thrashing / cross-user cache misses by default
  3. Be easy to override as needed (with the caveat that different settings across users can cause cache misses / thrashing)

I think the images that will be built / are already being built for remote execution will largely put binaries in the locations matching the default PATH.

(I don't think I'm saying anything new, just trying to clarify the situation. Using %SYSTEMROOT% in PATH should be fine; however, we do need to figure out how to handle %SYSTEMROOT% with remote execution to prevent remote cache thrashing.)

@laszlocsomor
Copy link
Contributor

Re @meteorcloudy :
I agree those paths (powershell, system32, msys2 bin and usr/bin) should be on the strict PATH, and nothing more.
Challenge 1 seems to be workaround-able.
I don't see why challenge 2 is a problem in the strict env case (or did you mean this for non-strict env builds?) What am I missing?

Also, it's not clear to me anymore who's waiting on whose input and who are decision makers on this bug. How can we move forward?

I'm happy to volunteer to add an incompatible flag enabling the strict PATH that @meteorcloudy proposed.

@alexeagle
Copy link
Contributor

Is anyone working to roll this flag forward again?

In node, it's common to run Bazel underneath some other workflow tool, and these can set a different environment every time (yarn sets $PATH with a new temp directory for each run). Users will observe non-incremental builds until they discover the need for this flag.

@laszlocsomor
Copy link
Contributor

If anyone is, it's @buchgr or @ishikhman .
(@buchgr is out for this week and the next though.)

@ishikhman
Copy link
Contributor

there were some discussions around this. @buchgr will be able to give a better update when he is back in a week.

lucamilanesio pushed a commit to GerritCodeReview/gerrit that referenced this issue Sep 20, 2019
'--experimental_strict_action_env' was a Bazel 0.20.0 option which got
renamed to '--incompatible_strict_action_env' and was planned to be the
default as of 0.21.0 [1]. However, this planned flip still hasn't
happened for at least the baseline of 1.0.0 (as of rc3), so we can't
remove it as of now [2].

In general, we should try to avoid having --incompatible_* or
--experimental_* flags in the bazelrc, as per explanation in Issue 11122
[3] as well as on the 'Backward Compatibility' page of Bazel [4]:

> Users should never run their production builds with --experimental_*
> or --incompatible_* flags.

However, for this case we have a compelling reason to have
'--incompatible_strict_action_env'.

[1]: bazelbuild/bazel#6648
[2]: bazelbuild/bazel#7026
[3]: https://bugs.chromium.org/p/gerrit/issues/detail?id=11122
[4]: https://docs.bazel.build/versions/0.29.1/backward-compatibility.html

Change-Id: Ie78708ffaeb1bfe9ebceb924939833f7c30eaeaf
@buchgr buchgr removed their assignment Jan 9, 2020
@UebelAndre
Copy link
Contributor

Any updates here? Can this be flipped again?

@philwo
Copy link
Member Author

philwo commented Nov 11, 2020

Unfortunately I’m not aware of any further work that has happened in this area. I think the situation is still unchanged from a year ago.

@UebelAndre
Copy link
Contributor

is there an RFC for how this might be addressed?

@alexeagle
Copy link
Contributor

tjgq added a commit to tjgq/bazel that referenced this issue Jun 1, 2022
I created this PR to trigger our CI and see if Bazel still fails to build without it.

See bazelbuild#7026 for context.
tjgq added a commit to tjgq/bazel that referenced this issue Jun 1, 2022
I created this PR to trigger our CI and see if Bazel still fails to build with this flag.

See bazelbuild#7026 for context.
@github-actions
Copy link

Thank you for contributing to the Bazel repository! This issue has been marked as stale since it has not had any activity in the last 1+ years. It will be closed in the next 14 days unless any other activity occurs or one of the following labels is added: "not stale", "awaiting-bazeler". Please reach out to the triage team (@bazelbuild/triage) if you think this issue is still relevant or you are interested in getting the issue resolved.

@github-actions github-actions bot added the stale Issues or PRs that are stale (no activity for 30 days) label Jun 13, 2023
@cpsauer
Copy link
Contributor

cpsauer commented Jun 14, 2023

@bazelbuild/triage, this issue is definitely still relevant, IMO, per Alex's comment above. Seems like this default should really get flipped.

(@bazelbuild/triage didn't seem to create a tag last I tried (no link) so I'm going to also tag @sgowroji and @Pavank1992 manually. Please coach me if you'd have preferred otherwise--and maybe update the bot's instructions)

@Pavank1992 Pavank1992 added not stale Issues or PRs that are inactive but not considered stale and removed stale Issues or PRs that are stale (no activity for 30 days) labels Jun 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
not stale Issues or PRs that are inactive but not considered stale P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: process
Projects
No open projects
Development

No branches or pull requests