Please allow remote execution to have its own Platform, separate from host and target. #5309

Closed
jmillikin-stripe opened this issue Jun 1, 2018 · 21 comments
Labels
team-Configurability Issues for Configurability team type: feature request

Comments

@jmillikin-stripe
Contributor

jmillikin-stripe commented Jun 1, 2018

Description of the problem / feature request:

Bazel distinguishes between the "target platform", which binaries are being compiled for, and "host platform", which binaries run from build actions are compiled for. The introduction of remote execution means that there's a third machine involved, the remote worker, which may be of a different platform. This confuses Bazel, because it assumes that all actions run on a platform compatible with the host.

Please add a --remote_platform flag (easy), with associated toolchain resolution plumbing (hard + lots of work + messy?), so that I can run bazel build --remote_platform=//some-platforms:linux on a MacOS machine and have Linux binaries be compiled when they're going to be run remotely.

What operating system are you running Bazel on?

The Bazel client is running on MacOS, the Bazel build worker is running on Linux.

@jin
Member

jin commented Jun 1, 2018

cc @katre @buchgr

@katre katre self-assigned this Jun 1, 2018
@katre
Member

katre commented Jun 1, 2018

Bazel currently has the execution platform, which is intended to represent both remote and local execution.

You can specify the available execution platforms in two ways: the --extra_execution_platforms command-line flag, or the register_execution_platforms function in the WORKSPACE file.

This is already present and already works with toolchain resolution, with one important restriction: it only works with rules that support toolchain resolution. This currently includes many Skylark rules, and we are in the process of fixing bugs in the implementation for C++ toolchain resolution using the native rules.

Let me know what further questions you have, and where the existing documentation is deficient, so we can clear everything up and make this easier to use.

@jmillikin-stripe
Contributor Author

> Bazel currently has the execution platform, which is intended to represent both remote and local execution.

I understand that, but a single execution platform is not sufficient when actions within the same build might execute on different platforms.

Case 1: When running Bazel with --genrule_strategy=remote, any tools used in that genrule must be compiled for the remote platform. They will fail to run if built for the host platform. Other tools, used in non-genrule actions, must be built for the host platform.

Case 2: Since Bazel makes the host/remote distinction by action instead of rule, it's possible for two actions in the same rule to run on different platforms. In this case, the current rule-level toolchain resolution will get confused and may resolve a toolchain that isn't usable for one of the actions.

@katre
Member

katre commented Jun 1, 2018

Each configured target has a single execution platform, which is used for all actions generated by that configured target.

Bazel as a whole can have any number of registered execution platforms (via the command-line flag, or the WORKSPACE function), and will use toolchain resolution to choose the best one. The logic is documented here: https://docs.bazel.build/versions/master/toolchains.html#toolchain-resolution
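To make the resolution order concrete, here is a minimal Python sketch of how an execution platform might be picked from the registered list. It is a simplified model, not Bazel's implementation: the platform and toolchain names, the constraint strings, and the function names are all made up for illustration, and constraints are modelled as plain sets of strings.

```python
# Simplified model of execution-platform selection during toolchain resolution.
# All names and constraint strings below are hypothetical.

def satisfies(platform_constraints, required):
    """Return True if the platform provides every required constraint value."""
    return all(value in platform_constraints for value in required)


def resolve(target_constraints, exec_platforms, toolchains, required_types):
    """Pick the first execution platform that has a toolchain for every type.

    exec_platforms: ordered list of (name, set_of_constraints); earlier wins.
    toolchains: ordered list of dicts with keys "name", "type",
        "exec_compatible_with" and "target_compatible_with".
    Returns (exec_platform_name, {toolchain_type: toolchain_name}) or None.
    """
    for exec_name, exec_constraints in exec_platforms:
        chosen = {}
        for toolchain_type in required_types:
            for toolchain in toolchains:
                if (toolchain["type"] == toolchain_type
                        and satisfies(target_constraints, toolchain["target_compatible_with"])
                        and satisfies(exec_constraints, toolchain["exec_compatible_with"])):
                    chosen[toolchain_type] = toolchain["name"]
                    break  # first matching toolchain wins for this type
        if all(t in chosen for t in required_types):
            return exec_name, chosen
    return None


EXEC_PLATFORMS = [
    ("linux_worker", {"linux", "x86_64"}),
    ("mac_host", {"macos", "x86_64"}),
]

TOOLCHAINS = [
    {"name": "scalac_on_mac", "type": "scala",
     "exec_compatible_with": ["macos"], "target_compatible_with": []},
    {"name": "scalac_on_linux", "type": "scala",
     "exec_compatible_with": ["linux"], "target_compatible_with": []},
]

result = resolve({"linux", "x86_64"}, EXEC_PLATFORMS, TOOLCHAINS, ["scala"])
```

In this model the order of the registered platforms matters: the Linux worker is listed first, so it is chosen whenever a compatible toolchain exists for it.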

Case 1: Genrule currently doesn't respect toolchain resolution. This is actually a pretty difficult problem, but one we intend to tackle shortly.

Case 2: This is not currently the case. There's no fundamental reason for it, except that the APIs don't support it, but currently every action generated from a single configured target has the same execution platform set.

We are open to relaxing that restriction in the future, but it will need a pretty compelling use case.

Do you have a specific problem that I can take a look at to help you solve the problem you are having?

@jmillikin-stripe
Contributor Author

> Each configured target has a single execution platform, which is used for all actions generated by that configured target.

This is not true in Bazel 0.13 (haven't updated to 0.14 yet) -- actions in the same target can execute on different platforms, depending on execution strategy:

def _test_rule(ctx):
  # Declare two outputs, each produced by its own action.
  out_1 = ctx.actions.declare_file("test_rule_1.txt")
  out_2 = ctx.actions.declare_file("test_rule_2.txt")
  # Both actions record the kernel of the machine they ran on.
  # The distinct mnemonics let --strategy pick a strategy per action.
  ctx.actions.run_shell(
      outputs = [out_1],
      command = "uname -a > " + out_1.path,
      mnemonic = "TestRule1",
  )
  ctx.actions.run_shell(
      outputs = [out_2],
      command = "uname -a > " + out_2.path,
      mnemonic = "TestRule2",
  )
  return [
      DefaultInfo(
          files = depset([out_1, out_2]),
      ),
  ]

test_rule = rule(_test_rule)

$ bazel build --strategy=TestRule2=remote //:my_test_target
[...]
Target //:toolchain_target up-to-date:
  bazel-bin/test_rule_1.txt
  bazel-bin/test_rule_2.txt
$ cat bazel-bin/test_rule_1.txt
Darwin st-jmillikin1.local 17.4.0 Darwin Kernel Version 17.4.0: Sun Dec 17 09:19:54 PST 2017; root:xnu-4570.41.2~1/RELEASE_X86_64 x86_64
$ cat bazel-bin/test_rule_2.txt
Linux 6a8038c4ed0b 4.9.36-moby #1 SMP Wed Jul 12 15:29:07 UTC 2017 x86_64 GNU/Linux

> Genrule currently doesn't respect toolchain resolution. This is actually a pretty difficult problem, but one we intend to tackle shortly.

This also appears to be untrue in 0.13 -- when I pass a target in the genrule's tools attr, that target's rule is called with a resolved toolchain matching the host platform. I've verified this toolchain changes when I manually set --host_platform.

> Do you have a specific problem that I can take a look at to help you solve the problem you are having?

I want to run Bazel on a MacOS laptop, and have certain heavy actions (Java/Scala compilation) run on a distributed buildfarm via the remote execution protocol. The buildfarm workers are running on Linux.

@katre
Member

katre commented Jun 1, 2018

I see where I was confused. You are right, strategies such as remote, sandboxed, etc can be set per-action. However, the execution platform is set per-target. I can see where this will cause problems when being mixed together. Do you have an error case I can use to try and fix this? I know @ulfjack is planning some changes to the entire strategy system, but I don't know how that will affect remote/local execution.

All rules that don't participate in toolchain resolution set the host platform as the execution platform, leading to the cases you saw. For these, a --default_execution_platform flag might make sense, to allow specifying the difference. Would that help with the problems you are seeing with genrule?

@jmillikin-stripe
Contributor Author

I'm not sure that the set of execution platforms is related to this issue. As described in the original post, I think the best way to support the use case of a multi-platform distributed build is to have a --remote_platform flag that would tell Bazel to resolve toolchains for that platform when running actions remotely.

@katre
Member

katre commented Jun 4, 2018

The core issue is that there isn't a single remote platform: your remote execution system could have several types of workers, and they can all be used in a build.

However, I definitely see the value of having a way to specify the default remote platform for legacy cases, and will look into implementing that shortly. That should help fix your issue with genrules, if I understand it correctly.

katre added a commit to katre/bazel that referenced this issue Jun 4, 2018
resolution is not used.

If the flag is not set, the execution platform will be the host
platform.

Fixes bazelbuild#5309.

RELNOTES: Adds the --legacy_fallback_execution_platform flag to specify
a fallback execution platform when toolchain resolution is not used.

Change-Id: I5e91c209cf5f043e29fb512c0ef81385a44d4817
@ulfjack
Contributor

ulfjack commented Jun 4, 2018

@katre, I don't think that'll work. Let me elaborate.

@jmillikin-stripe, can you confirm that this is what you want: You want to run certain actions remotely, even though the remote execution platform is incompatible with your local machine, e.g., local machine is a Mac, remote machine is a Linux machine.

This is at odds with Bazel's current design, although the existence of the --*_strategy flags allows you to effectively override where an individual action is executed.

In its current design, Bazel allows rules to introspect the execution platform on which actions will run. Rule analysis generates the action, and it has to generate the action in its final form, taking into account any peculiarities of the underlying platform. For example, an action running on Windows has to use Windows-style paths (this is the biggest immediately visible difference) for all paths referenced in the action (binary to run, inputs, outputs, temporary locations, etc.).
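A small Python sketch of why the execution platform has to be known at analysis time: the same logical action yields a different command line depending on the platform's path style. The function name and the tool and output names are hypothetical, and real Bazel rules do not build paths this way; the sketch only illustrates the dependency.

```python
from pathlib import PurePosixPath, PureWindowsPath


def action_argv(exec_os, tool_parts, output_parts):
    """Build the command line for one action, using the exec platform's path style."""
    path_type = PureWindowsPath if exec_os == "windows" else PurePosixPath
    return [str(path_type(*tool_parts)), "--output", str(path_type(*output_parts))]


# The same action, generated for two different execution platforms.
linux_argv = action_argv("linux", ("tools", "gen"), ("out", "result.txt"))
windows_argv = action_argv("windows", ("tools", "gen"), ("out", "result.txt"))
```

Because the two argument lists differ, an action generated for one platform cannot simply be handed to a worker on the other after analysis has finished.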

There is currently a hard barrier between analyzing a rule and executing an action, and therefore, the execution platform has to be set before we analyze a rule. It also has to be set manually for two reasons - there may be any number of execution platforms, and we also don't want loading+analysis to depend on whether or not a remote execution system is known and whether we can talk to it.

At this time, the rules API only provides access to a single execution platform - there used to be no explicit model of the execution platform, but this is what was implicitly the case as part of the BuildConfiguration. As such, a single rule cannot generate actions for different execution platforms.

There are cases where the execution platforms are sufficiently similar that things happen to work, and the availability of the --*_strategy flags allows the user to redirect execution to a, strictly speaking, incompatible execution platform without the rule knowing. However, this is not a viable long-term strategy.

In the short term, I'd suggest that we provide a way to override an action's associated execution platform. This allows users to get Bazel to do what they say, even if it doesn't make sense from Bazel's point of view.

In the long term, I'd suggest that we allow rules to access multiple execution platforms within a single rule. We'll have to carefully think about the right APIs for this, and the right APIs may require 'rule fragments'. The concept behind a rule fragment is that rule authors can identify parts of the action graph that have execution platform consistency requirements, and enforce that all corresponding actions are configured with the same execution platform by declaring smaller 'rule fragments', each of which has one execution platform. By making this explicit, and by making each fragment independent (with Bazel explicitly controlling communication between fragments), we can allow Bazel to defer analysis of such a fragment until execution time, to retrofit it to the actually selected execution platform.

Rule fragments would also allow us to solve another issue with configurations related to output paths. Consider a java_binary rule with native code - the java_binary rule has to declare a dependency on the C++ configuration, even though the pure Java compilation part of the rule only requires the Java configuration. Even with configuration subsetting, this means that it's non-trivial to make the Java compilation use different output paths that do not contain the C++ configuration fragment.

I hope this all made sense.

@ulfjack
Contributor

ulfjack commented Jun 4, 2018

(We have been using some hacks to allow certain rules to generate actions for different execution platforms.)

@jmillikin-stripe
Contributor Author

> You want to run certain actions remotely, even though the remote execution platform is incompatible with your local machine, e.g., local machine is a Mac, remote machine is a Linux machine.

I want to run heavy processes, such as compilation, remotely. The remote workers' machines are incompatible with my local machine. I want to retain the ability to run certain rules locally, such as for tests that are not yet fully hermetic. These targets would be specified by tags like local.

> This is at odds with Bazel's current design, although the existence of the --*_strategy flags allows you to effectively override where an individual action is executed.

To be clear, while I've documented Bazel's current behavior here, I don't actually want that behavior. I have no use case where executing a target's actions on different platforms is useful. Selecting local/remote strategy by action mnemonic, instead of by rule type, seems confusing and fragile to me.

> It also has to be set manually for two reasons - there may be any number of execution platforms, and we also don't want loading+analysis to depend on whether or not a remote execution system is known and whether we can talk to it.

I think this requirement is broadly compatible with a --remote_platform flag. Or --remote_platforms if you want to have multiple ones. I agree that analysis should not depend on the reachability of remote workers, so Bazel shouldn't try to auto-discover the remote worker's platform. It should obey the flag set by the user.

> In the long term, I'd suggest that we allow rules to access multiple execution platforms within a single rule.

This seems like it would be difficult for rule authors to use, and I'm worried that most open-source language rule implementations would not implement it properly.

@ulfjack
Contributor

ulfjack commented Jun 4, 2018

> I think this requirement is broadly compatible with a --remote_platform flag. Or --remote_platforms if you want to have multiple ones. I agree that analysis should not depend on the reachability of remote workers, so Bazel shouldn't try to auto-discover the remote worker's platform. It should obey the flag set by the user.

What would the --remote_platform flag do, though? We can't make it available to rules because the execution platform is part of the BuildConfiguration - if we made the 'remote_platform' part of the BuildConfiguration, all rules would have access to the 'remote_platform' in addition to the execution platform. What should the rules do with that information?

Maybe the suggestion is that Bazel would pick one or the other as execution platform on a per rule basis and only make that one available to the rule? That'd obey the one execution platform per rule constraint. How would that interact with toolchains?

(On a related note, please please please do not call it --remote_platform. There is nothing 'remote' about it - it might be compatible with the local host, or we might be running it in a local docker container or in a local VM. If we have to, then call it --per_rule_execution_platform, or something.)

@katre
Member

katre commented Jun 4, 2018

This morning I sent out #5322, which adds a flag to set the execution platform for rules that don't use toolchain resolution (including all legacy rules). This can't be any more fine-grained, because there is no way during analysis to know whether a target/action will be executed remotely or locally.

@ulfjack
Contributor

ulfjack commented Jun 4, 2018

If we're using the execution platform to decide where to execute the actions for a rule, then we could have a per-rule selection of execution platforms.

@katre
Member

katre commented Jun 4, 2018

@ulfjack We're not doing that currently, is that planned to be changed? I think it's a great idea but I am not sure I have time to make the change.

@jmillikin-stripe
Contributor Author

> What would the --remote_platform flag do, though? We can't make it available to rules because the execution platform is part of the BuildConfiguration - if we made the 'remote_platform' part of the BuildConfiguration, all rules would have access to the 'remote_platform' in addition to the execution platform. What should the rules do with that information?
>
> Maybe the suggestion is that Bazel would pick one or the other as execution platform on a per rule basis and only make that one available to the rule? That'd obey the one execution platform per rule constraint. How would that interact with toolchains?

I would imagine that when Bazel is configuring a target, it would decide whether that target is executed entirely on "host" or entirely on "remote". Depending on this choice, it would use either --host_platform or --remote_platform to resolve toolchains.
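A minimal Python sketch of that per-target decision, purely illustrative: --remote_platform is the flag proposed in this issue and does not exist in Bazel, and the function name, the macOS platform label, and the use of the local tag as the deciding signal are assumptions.

```python
# Hypothetical model of the proposal: each target runs entirely on the host
# or entirely on the remote workers, and toolchains resolve accordingly.

def platform_for_target(tags, host_platform, remote_platform=None):
    """Return the platform label to use for a target's toolchain resolution."""
    # No remote platform configured, or the target is pinned to the local
    # machine: resolve toolchains for the host.
    if remote_platform is None or "local" in tags:
        return host_platform
    return remote_platform


HOST = "//some-platforms:macos"    # hypothetical label
REMOTE = "//some-platforms:linux"  # label used in the original request

compile_platform = platform_for_target([], HOST, REMOTE)
test_platform = platform_for_target(["local"], HOST, REMOTE)
```

The point of the sketch is that the choice is made once per target, before toolchain resolution, so every action of that target sees a single platform.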

> (On a related note, please please please do not call it --remote_platform. There is nothing 'remote' about it - it might be compatible with the local host, or we might be running it in a local docker container or in a local VM. If we have to, then call it --per_rule_execution_platform, or something.)

It's named --remote_platform to match the existing flag it pairs with, --remote_executor. There are other names that might be used like --executor_platform or --worker_platform, but they have their own issues.

@ulfjack
Contributor

ulfjack commented Jun 4, 2018

@katre this issue is currently marked as a feature request, although there may be a regression here as well because of your work?

@ulfjack
Contributor

ulfjack commented Jun 4, 2018

I am not completely certain about the --host_platform flag, but the "host" prefix in Bazel usually describes the execution platform, which may or may not be compatible with the local machine. Bazel can run builds without doing any actual work on the local machine whatsoever, in which case the "host" moniker is completely misleading. Adding --remote_platform makes the confusion even worse, because that one may actually refer to the local machine.

For example, you might have a 'host' configuration of linux, with all actions run on a remote machine, cross-compiling for a 'target' configuration of mac os, which happens to be the local machine. We really need to get our terminology straight.

@katre
Member

katre commented Jun 4, 2018

The --host_platform flag specifically refers to the host that Bazel is running on, and is separate from the set of execution platforms available to the entire build, or the specific execution platform chosen for a particular configured target (although the host platform is also available for use as an execution platform).

There is definitely a lot of confusion here, unfortunately, which is why I'm working on new platform and toolchain features to simplify the configuration and make this more straightforward. Also, moving more native rules to use toolchain resolution will help.

@katre
Member

katre commented Nov 20, 2018

Closing this due to it not being consistent with the current direction of the code.

@katre katre closed this as completed Nov 20, 2018
@aiuto aiuto removed team-Configurability Issues for Configurability team labels Feb 4, 2019
@uri-canva
Contributor

This is now supported via exec groups: https://bazel.build/extending/exec-groups

Execution groups allow for multiple execution platforms within a single target. Each execution group has its own toolchain dependencies and performs its own toolchain resolution.
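As a rough illustration of that idea, the Python sketch below models each execution group selecting its own execution platform from the same registered list. It is deliberately simplified (execution constraints only, no toolchains), and the group names, platform names, and function names are invented for the example rather than taken from Bazel.

```python
# Simplified model: every exec group of a target picks its own execution
# platform. Names and constraint strings are hypothetical.

def pick_exec_platform(exec_platforms, exec_compatible_with):
    """Return the first platform providing all required constraints, else None."""
    for name, constraints in exec_platforms:
        if all(value in constraints for value in exec_compatible_with):
            return name
    return None


def resolve_exec_groups(exec_platforms, exec_groups):
    """Map each exec group name to the platform chosen for its actions."""
    return {
        group: pick_exec_platform(exec_platforms, required)
        for group, required in exec_groups.items()
    }


EXEC_PLATFORMS = [
    ("mac_host", {"macos", "x86_64"}),
    ("linux_worker", {"linux", "x86_64"}),
]

# One target with two groups: unconstrained actions and Linux-only compilation.
GROUPS = {"default": [], "compile": ["linux"]}

assignment = resolve_exec_groups(EXEC_PLATFORMS, GROUPS)
```

Here the unconstrained group lands on the first registered platform while the Linux-only group lands on the worker, which is the mixed Mac and Linux setup the original request asked for.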
