Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

--action_env ignored when cfg = "exec" is used #17062

Closed
Flamefire opened this issue Dec 21, 2022 · 24 comments
Closed

--action_env ignored when cfg = "exec" is used #17062

Flamefire opened this issue Dec 21, 2022 · 24 comments

Comments

@Flamefire
Copy link
Contributor

Flamefire commented Dec 21, 2022

Description of the bug:

We need to pass some environment variables like "$CPATH" to the compiler when building TensorFlow with Bazel. This is cumbersome itself and has led to hard-to-debug issues like #12059 in the past already.

Now we again see failures caused by action-env values not passed to the compiler invocation in TensorFlow 2.8.4 which I tracked down to tensorflow/tensorflow@07cbc7b which changes cfg = "host" to cfg = "exec"

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Build TensorFlow 2.8.4 with Bazel 4.2.2 passing --action-env=CPATH and observe that it is not passed to some compiler invocations resulting in e.g.:

In file included from bazel-out/k8-opt-exec-50AE0418/bin/tensorflow/core/framework/dataset_options.pb.cc:4:
bazel-out/k8-opt-exec-50AE0418/bin/tensorflow/core/framework/dataset_options.pb.h:10:10: fatal error: google/protobuf/port_def.inc: No such file or directory
   10 | #include <google/protobuf/port_def.inc>

So can you provide information on how to use --action-env (or similar) in such circumstances?

An explanation on what is actually being done with the change to "exec" from "host" would also be very welcome. In our case we are not cross-compiling so host, target and build machine are all the same.

I would clearly classify this behavior as a bug because the documentation states:

Specifies the set of environment variables available during the execution of all actions.

But obviously there are now actions where those are missing but they should be in "all actions"

Which operating system are you running Bazel on?

REHL 7

Flamefire referenced this issue in tensorflow/tensorflow Dec 21, 2022
"host" cfg is deprecated, and "exec" cfg saves ~10% compile time on kernels like polygamma.

PiperOrigin-RevId: 404540569
Change-Id: I2a8be13501a42dc9ad7fbee0ac20b6bdf3124262
@fmeum
Copy link
Collaborator

fmeum commented Dec 21, 2022

Does passing in CPATH via --host_action_env work? The user manual you link to does seem to be outdated, the full command-line reference mentions that --action_env applies to the target configuration whereas --host_action_env applies to host and exec configuration.

@Flamefire
Copy link
Contributor Author

Does passing in CPATH via --host_action_env work? The user manual you link to does seem to be outdated

Yes that seems to work. But I guess now I have to duplicate all action_env into host_action_env. We are using --distinct_host_configuration=false which looks like it should work. But now can't find it in the docs.

@fmeum
Copy link
Collaborator

fmeum commented Dec 21, 2022

@Flamefire --distinct_host_configuration is a no-op at HEAD. Whether to bring it back has already been discussed in the context of Tensorflow at some point, but I can't find the issue anymore. @meteorcloudy who I think was part of that discussion.

@meteorcloudy
Copy link
Member

@fmeum I believe it's #14848 (comment)
/cc @gregestren who was also in that discussion.

@Flamefire
Copy link
Contributor Author

Flamefire commented Dec 24, 2022

Seems my comment didn't got saved so again the most important bits as multiple people spent days trying to figure out why TF started to fail to compile due to the wrong/unset env variables (I finally found the commit that broke/changed it by bisecting over 2000 commits):

We are in, what should be, the most simple situation: Build a software (TensorFlow) with Bazel on the machine it is supposed to run on. On other build systems it is a variation of configure && make && make install but for Bazel which by default deletes all environment variables we need to pass commandline options to have them set (this is about e.g. $CPATH and $LIBRARY_PATH)

Previously --action_env has mostly worked as it was documented as "environment variables available during the execution of all actions". Above I linked the documentation of Bazel 4.2.2 which is the version of Bazel I've been using. @fmeum Mentioned that this is wrong. As behavior of Bazel changes considerably between versions it would be good to have correct documentation for each specific version and I would count this at least as a bug report against the 4.2.2 documentation if the observed behavior is intended and hence the new documentation (which doesn't seem to be specific to a version?) is correct.

However even before the change in TF it was brittle as the used environment depended on many other factors, for example:

I don't fully understand the difference between "host" and "exec" configurations, they sound very similar. Also the observed behavior surprises me: --action_env applies to "target" configurations per the new docs but reverting the TF commit (so changing cfg = "exec" back to cfg = "host") makes the environment variables be passed. But with cfg = "exec" I need to set them via --host_action_env. Why is that? The last part makes sense per the new docs but why --action_env does apply to "host" cfgs isn't clear to me. Has this changed? Or is this the expected effect of --distinct_host_configuration=false? In which Bazel version will this be a no-op?

Next question is why use_default_shell_env isn't set by default? The name implies that.

And finally I'd like to ask what the "correct" way would be to build something with Bazel on the machine it is to be run i.e. a local, **non-**crosscompilation build.

Things that come to mind:

  • Keep/Restore the behavior of --action_env applying to all actions and introduce a separate --target_action_env similar to --host_action_env
  • Have a flag like --no-distinct-configurations so that "host", "target" and "exec" are all the same avoiding potential rebuilds and issues like we have seen with unexpected changes to action environments

I would also greatly appreciate an easy way to find out why a specific C++ file is compiled (or library linked) with specific env variables (i.e. in a specific configuration). E.g. it would help if --subcommands would show the configuration name of an action and there would be a way to query how & why a specific file is built. It was very confusing that a library was build with missing env variables but calling bazel build on the seemingly obvious target had the env variables and hence worked. Turned out that the library was build again as part of a dependency of a dependency of a tool which had somewhere cfg = "exec". But I haven't found a way in the documentation to find that from a source tree and/or with bazel. So the only way I found feasible is find the commit which changed the behavior and check those changes.

I hope that helps and Merry Christmas! 🎅

@gregestren
Copy link
Contributor

Generally exec and host should be basically the same. I say basically because exec is intentionally supposed to generalize the concept of 'host. In particular, it's possible to have multiple execution architectures per build. exec supports this, while host doesn't.

I understand you're saying this doesn't matter for your use case, where all architectures are the same.

action_env should pass the same exact way to exec as it does to host. I suspect the reason you're not seeing that is --distinct_host_configuration=false, as you mention.

As @fmeum mentioned, --distinct_host_configuration=false is a no-op after a certain Bazel version (I'd have to check which). For any Bazel after that version I'd expect --action_env to behave equivalent for host and exec.

But if your baseline is an older version where --distinct_host_configuration=false works, that flag makes host an alias for the target configuration. So even though --action_env only applies to the target configuration, that kicks in here. We've never had a way to make exec an alias for the target configuration, hence the difference.

Most important, I think, is if you really need some equivalent of --distinct_host_configuration=false going forward. TensorFlow has argued for that kind of functionality. I think we could revive that by making an exec transition a no-op. And if it works for TensorFlow that's great. But I'd want to double check we're sure that the semantics that implies are okay. Particularly that any flags that should only apply to a host action (like, say, --compilation_mode=opt instead of --compilation_mode=dbg will not trigger in this mode. So there's some fundamental awkwardness there.

@Flamefire
Copy link
Contributor Author

For any Bazel after that version I'd expect --action_env to behave equivalent for host and exec.

You mean "behave equivalent for" as "have no effect on", correct? As I understood --host_action_env must be used for host and exec.

Most important, I think, is if you really need some equivalent of --distinct_host_configuration=false going forward. TensorFlow has argued for that kind of functionality. I think we could revive that by making an exec transition a no-op.

A small step would be to have --action_env (i.e. no prefix like host_) pass those variables to literally *all actions. Or a flag with that effect, e.g. --all_action_env. Our current workaround is to duplicate --action_env to --host_action_env which makes the commandline huge and may at some point hit a limit.

What would help us even more is to have no such "transitions", i.e. keep all environment variables from the current environment for all actions. We have modules which on load set up paths in $CPATH, $PATH, $LD_LIBRARY_PATH etc. as there are multiple versions of the same software installed side-by-side so only one can be "active" at a time which means not having a central location for all software installations, i.e. not installing everything into /usr or similar.

Particularly that any flags that should only apply to a host action (like, say, --compilation_mode=opt instead of --compilation_mode=dbg will not trigger in this mode. So there's some fundamental awkwardness there.

I don't understand that part. Can you elaborate or provide a link to the docs what flags only apply to some actions and the implications?

@gregestren
Copy link
Contributor

For any Bazel after that version I'd expect --action_env to behave equivalent for host and exec.

You mean "behave equivalent for" as "have no effect on", correct? As I understood --host_action_env must be used for host and exec.

Yes. According to my reading --action_env doesn't propagate to either the host or exec configs. While --host_action_env exclusively applies to the host / exec configs.

The only exception to this (the only place where host and exec could diverge) is when setting --distinct_host_configuration=false on a sufficiently old Bazel version.

A small step would be to have --action_env (i.e. no prefix like host_) pass those variables to literally *all actions. Or a flag with that effect, e.g. --all_action_env. Our current workaround is to duplicate --action_env to --host_action_env which makes the commandline huge and may at some point hit a limit.

What would help us even more is to have no such "transitions", i.e. keep all environment variables from the current environment for all actions. We have modules which on load set up paths in $CPATH, $PATH, $LD_LIBRARY_PATH etc. as there are multiple versions of the same software installed side-by-side so only one can be "active" at a time which means not having a central location for all software installations, i.e. not installing everything into /usr or similar.

Doing some spelunking, #4008 looks pretty relevant - an old bug that looks like it's asking for the same sort of thing. That was fixed by #12122 which led to the current --action_env / --host_action_env split. Which I understand you're saying is insufficient.

@meteorcloudy asked me at #12122 (comment) if we could just propagate --action-env to the host / exec config. I didn't give a direct answer because I don't consider myself an authority on the --action_env feature. I don't feel I fully understand the implications of such a change one way or another.

What we need is a clear owner of the --action_env feature who can evaluate that and recommend. @meteorcloudy any idea who that would be?

Particularly that any flags that should only apply to a host action (like, say, --compilation_mode=opt instead of --compilation_mode=dbg will not trigger in this mode. So there's some fundamental awkwardness there.

I don't understand that part. Can you elaborate or provide a link to the docs what flags only apply to some actions and the implications?

I don't think there's a clear API or documentation for this now. The authoritative source for now is looking at the code: https://cs.opensource.google/search?q=%22FragmentOptions%20getExec%22&sq=&ss=bazel.

For example https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/analysis/config/CoreOptions.java;l=985;drc=3d29b2e14f143cd56dfee4d49d991a6e330f3a85 has the line that propagates --host_action_env.

@keith
Copy link
Member

keith commented Jan 9, 2023

I don't think we should make --action_env apply to the exec configuration by default. I think similar to this issue #13839 the inability to use flags that only target the target configuration causes issues.

@fmeum
Copy link
Collaborator

fmeum commented Jan 9, 2023

@Flamefire Just in case you didn't know: You can use --action_env=FOOBAR to use the value of FOOBAR from the environment, which does help with command line length limits.

I agree with Keith that --action_env should not automatically apply to the exec configuration. That would end up causing plenty of unexpected rebuilds where they are not expected (looking at you, protoc).

@Flamefire
Copy link
Contributor Author

For example https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/analysis/config/CoreOptions.java;l=985;drc=3d29b2e14f143cd56dfee4d49d991a6e330f3a85 has the line that propagates --host_action_env.

As someone not involved in Bazel development that is hard to understand. There are 2 lines: actionEnvironment and hostActionEnvironment are both passed hostActionEnvironment. Now I'm non the wiser... Hence some documentation on those "contexts" and where and why and when envs apply. E.g. I never understood why use_default_action_env is not the default which has led to issues for us as variables were not forwarded and finding that this is what was missing was hard.

I don't think we should make --action_env apply to the exec configuration by default. I think similar to this issue #13839 the inability to use flags that only target the target configuration causes issues.

My suggestion was to have a uniform notation. E.g. there could be target_action_env and exec_action_env for specific options and action_env for "all" actions (as it was documented before and IMO makes sense naming-wise)

That would end up causing plenty of unexpected rebuilds where they are not expected (looking at you, protoc).

That would indeed be an issue. But is that situation different to passing all variables to action_env and host_action_env? Because that's what we do now and given the situation that there is/should be only a single configuration (build machine=target machine) I'd expect things to be build only once.

@uri-canva
Copy link
Contributor

Doing some spelunking, #4008 looks pretty relevant - an old bug that looks like it's asking for the same sort of thing. That was fixed by #12122 which led to the current --action_env / --host_action_env split.

I'm quite confused by that sequence of events. #4008 is an issue that happens because a lot of rules in the ecosystem don't support --action_env because they're implemented with run_shell, which by default doesn't use the action env (use_default_shell_env = False). In my mind the fix for that would be for run_shell to use the action env by default, not to introduce a --host_action_env.

@uri-canva
Copy link
Contributor

Maybe this issue is just another aspect of #3320?

@oquenchil
Copy link
Contributor

It looks like people who are most familiar with this area do not think this is a good idea so closing for now but please feel free to continue the discussion.

@Flamefire
Copy link
Contributor Author

I'm quite confused by that sequence of events. #4008 is an issue that happens because a lot of rules in the ecosystem don't support --action_env because they're implemented with run_shell, which by default doesn't use the action env (use_default_shell_env = False).

That's also what confused me: The option is named "use_default" and the default for this is "False"!

Maybe this issue is just another aspect of #3320?

Pretty much yes. It's at least another instance where env variables are not used in actions unexpectedly.

@fmeum @keith

I don't think we should make --action_env apply to the exec configuration by default. I think similar to this issue #13839 the inability to use flags that only target the target configuration causes issues.

I agree with Keith that --action_env should not automatically apply to the exec configuration. That would end up causing plenty of unexpected rebuilds where they are not expected (looking at you, protoc).

Yes maybe not "automatically apply". But there is a need for an option to pass env variables to all actions. This would even reduce the number of rebuilds because the environment is the same, wouldn't it?
It is just that --action_env would be the best fitting name for such an option but I'd also be happy with a new name just so we don't need to figure out again and again why a change in Bazel or TensorFlow now causes env variables to not be passed again which is very time consuming.

@fmeum
Copy link
Collaborator

fmeum commented Jan 20, 2023

This would even reduce the number of rebuilds because the environment is the same, wouldn't it?

I don't think it would as the target and exec configurations differ in other ways (on purpose) even if you align --action_env and --host_action.

@Flamefire I still think that the best solution for your use case is a --distinct_exec_configuration flag. With --nodistinct_exec_configuration, your --action_env would apply to all actions: It applies to all actions built in the target configuration, but with --nodistinct_exec_configuration, nothing ever transitions from the target to the exec configuration. That should resolve both issues you are experiencing:

  1. No need to duplicate --action_env as --host_action_env.
  2. No spurious rebuilds because compilers are built in the exec configuration.

What do you think about that?

Re use_default_shell_env: I personally think that --action_env isn't a good concept and more of an escape hatch as it potentially applies to all actions (bad for caching) but then again only to those that explicitly opt into it (bad when you really need it). Most custom actions inherently don't care about the environment (e.g. generating a Javadoc does not require CC or PATH) and there should be ways to more selectively inherit environment variables from the shell environment.

@Flamefire
Copy link
Contributor Author

@Flamefire I still think that the best solution for your use case is a --distinct_exec_configuration flag. With --nodistinct_exec_configuration, your --action_env would apply to all actions: It applies to all actions built in the target configuration, but with --nodistinct_exec_configuration, nothing ever transitions from the target to the exec configuration. That should resolve both issues you are experiencing:

Just so I understand this correctly: There are 3 configurations: host, exec and target. I don't understand the distinction between host and exec but that is what caused the issue I reported here so there seem to be differences.
So if that --nodistinct_exec_configuration never causes any transition so "--action_env would apply to all actions" then this is what we need. I just want to make sure that a spurious "host" exec somewhere does not cause the same problem again.

Re use_default_shell_env: I personally think that --action_env isn't a good concept and more of an escape hatch as it potentially applies to all actions (bad for caching) but then again only to those that explicitly opt into it (bad when you really need it). Most custom actions inherently don't care about the environment (e.g. generating a Javadoc does not require CC or PATH) and there should be ways to more selectively inherit environment variables from the shell environment.

We are building TensorFlow using Bazel in a fresh folder, so no cache there, and want it to apply to all actions (or just don't clear the environment in the first place) because surprisingly many tools do care about the environment which we e.g. noticed by the change in TensorFlow from exec_tools to tools (or vice versa) where the already installed protobuf compiler failed to execute due to missing libraries (caused by unset LD_LIBRARY-PATH). So yes not all actions care about all environment variables but IMO Bazel as the build system and even build file authors cannot always know which are actually required. For example we have PYTHONNOUSERSITE=1 set in our action_env because user modules can cause sudden build failures when installing software for the cluster. And some (build) tools from TensorFlow are Python modules and hence potentially affected by that but might be called by binaries that then use a Python module and so on.

@fmeum
Copy link
Collaborator

fmeum commented Jan 20, 2023

There is no host configuration anymore, everything that used to be host is now exec. However, as part of that change, --distinct_host_configuration went away, which is why you saw a regression.

I agree that BUILD file authors can't know which env variables are required, which is precisely why I believe --action_env is generally the wrong tool: It puts the burden on the Bazel end user to maintain a global list of variables that actions forgot to set themselves. For very few, very generic variables such as PATH and LD_LIBRARY_PATH I can see that being reasonable, but PYTHONNOUSERSITE is definitely something the Python rules should set (or at least set if configured in a certain rules_python-specific way).

@Flamefire
Copy link
Contributor Author

There is no host configuration anymore, everything that used to be host is now exec. However, as part of that change, --distinct_host_configuration went away, which is why you saw a regression.

I'm a bit confused here. In which way is there "no host configuration anymore"? Because I was using TensorFlow 2.8.4 with Bazel 4.2.2 which failed and after reverting tensorflow/tensorflow@07cbc7b, i.e. changing to exec="host" it worked again passing --action-env.

Or are you referring to some change in Bazel after 4.2.2 where using exec="host" would be an error? Or how was Bazel changed, i.e. what would happen now when using exec="host"?

If you could tell me which version introduced which change related to this it would really help our tooling around that. Thanks a lot!

@fmeum
Copy link
Collaborator

fmeum commented Jan 20, 2023

@gregestren would know best when exactly cfg = "host" became/becomes equivalent to cfg = "exec". That definitely wasn't the case in Bazel 4.

@keith
Copy link
Member

keith commented Jan 20, 2023

I agree that the default for use_default_shell_env should be reconsidered, I'm not sure if there's an issue for that or not. I agree with fabian that having a --distinct_exec_configuration is most likely to serve your needs

@uri-canva
Copy link
Contributor

It sounds like in general --action_env, --test_env and --repo_env should be avoided? Instead of them rules should receive input from the user using bazel native mechanisms such as build settings and attributes, and then set the appropriate environment when invoking executables based on them? Is this documented somewhere? We've been minimising our use of those flags for a while now but it would have been nice to have known this earlier in our bazel journey.

@fmeum
Copy link
Collaborator

fmeum commented Jan 21, 2023

--repo_env is quite useful as it still only applies to repo rules that declare a dependency on the particular environment variable and changes to the flag take effect immediately rather than after a server restart.

Having better doc on when to use either of these flags and when some other approach is better would be very useful. You could file a doc issue for that.

@gregestren
Copy link
Contributor

@gregestren would know best when exactly cfg = "host" became/becomes equivalent to cfg = "exec". That definitely wasn't the case in Bazel 4.

I believe Bazel 6.0.

--distinct_exec_configurationis definitely an option. I'm not sure how widespread the need is, but it clear is in this case.

Recently I had some more out of the box ideas on how we might determine what makes up exec configurations. But that's probably a conversation for another place.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants