Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Overhaul command line args for choosing a spawn strategy #4153

Closed
mmorearty opened this issue Nov 22, 2017 · 10 comments
Closed

Proposal: Overhaul command line args for choosing a spawn strategy #4153

mmorearty opened this issue Nov 22, 2017 · 10 comments
Labels
category: misc > misc P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: feature request

Comments

@mmorearty
Copy link
Contributor

This is a proposal for a pretty big change to the command line arguments that are used to choose a build strategy. Please consider it to be a design doc of sorts. It's a strawman; probably needs improvement. It's based on our experiences at Asana, and a number of complex scenarios we have run into.

Introduction

There are currently a number of command line arguments that are used to choose which spawn strategy (e.g. standalone, sandboxed, worker, remote) is used for a given action. Each of these command line arguments serves a specific need; but taken as a whole, they (a) are somewhat difficult to learn, and (b) make it difficult to express some commonly desired workflows.

Here are some (all?) of the current flags:

  • --spawn_strategy=...: specify the overall default strategy
  • --genrule_strategy=...: specify the strategy to use for genrules, if different from spawn_strategy
  • --strategy=<mnemonic>=...: specify an override, the strategy to use with a specific mnemonic
  • --worker_sandboxing: doesn't actually choose a spawn strategy, but it's related — turns on sandboxing for persistent workers

Also, currently the strategy names local and standalone are synonyms. Apparently local is the old, deprecated name, and standalone is the preferred name meaning "a regular, local, non-sandboxed, non-worker build." In this document, if I use the word "local", I mean "any action that is running on the same machine that Bazel is running on." In other words, "local" is the opposite of "remote"; but in my terminology, a local build might be standalone or sandboxed or worker.

Some current problems

  • If you want to use a remote cache, and fall back to a local build (--spawn_strategy=remote --remote_rest_cache=...), the local build will always be standalone. There is no way to tell Bazel to fall back to worker or sandboxed.

  • If you specify worker but, for whatever reason, the action can't be run in worker, it always falls back to standalone. No way to specify sandboxed.

  • Worker sandboxing (--worker_sandboxing) is separate from other sandboxing (--spawn_strategy=sandboxed). This could lead to lack of clarity on the part of the developer: They might think they have turned on safe, fully sandboxed builds with --spawn_strategy=sandboxed, when in fact anything they run in a worker isn't actually safe.

  • In more complex scenarios, switching between sandboxed and standalone can require numerous changes to the command line. For example, as mentioned above, if you are using workers then it might require changing both the --spawn_strategy and the --worker_sandboxing arguments. Or, suppose you usually use remote, but you have several specific actions that you don't want to be cached remotely. So you might have a bazelrc file that looks like this:

      build --spawn_strategy=remote
      build --strategy=SpecialCase1=standalone
      build --strategy=SpecialCase2=standalone
      build --strategy=SpecialCase3=standalone
      build --strategy=SpecialCase4=standalone
      build --strategy=SpecialCase5=standalone
    

    Switching that to use sandboxing will require changing five lines. Putting standalone did not actually express your goal; your goal was actually "do the build locally, somehow; don't use remote."

  • Finally, a different kind of problem: Today's command-line flags can take a long time to get comfortable with. For example, the difference between --spawn_strategy and --strategy; questions such as "if an item is not in the cache, what sort of local build does it fall back to, and can I control that" (I learned by reading the source); and so on.

Strategy selection goals

I feel that it would be desirable for the strategy-related command line arguments had the following characteristics:

  • A flag can apply to either all mnemonics, or to only one. For example, eliminate the difference between --spawn_strategy and --strategy.
  • A flag expresses one characteristic of the strategy selection process; it does not entirely define what strategy to use. For example, "when doing any local build, use sandboxing if possible" (separate from specifying whether to use a worker), or "if the remote cache does not contain the artifacts, do some kind of local build" (without defining what kind).
  • The flags are a little easier for a newcomer to get a handle in, in terms of the relationships between them.

It would then be up to Bazel to take this set of guidelines, and resolve it to an appropriate build action.

Examples

A complex scenario might play out like this:

  1. The args say to allow downloads from the remote cache (maybe with just --remote_rest_cache=..., no more args needed). So Bazel queries the remote cache, but does not find the build artifacts.
  2. The args say local builds are allowed (--remote_local_fallback, on by default). So Bazel needs to decide which local strategy to use.
  3. The args say to allow persistent workers for any actions that support them (maybe --workers or something). But this action does not.
  4. The args say to use sandboxing when possible (maybe --sandboxing). So we do a local, sandboxed build.
  5. The args say that locally built, sandboxed artifacts are safe to upload to the remote cache (maybe --remote_upload_sandboxed_results). So Bazel uploads the build artifacts.

Another scenario, this time with a non-sandboxed local build:

  1. same as above
  2. same as above
  3. same as above
  4. The args allow sandboxing for most actions, but not this one (maybe --sandboxing plus --nosandboxing=ThisMnemonic). So, no sandboxing.
  5. Bazel does a standalone build.
  6. The args say that locally built, sandboxed artifacts are safe to upload to the remote cache, but un-sandboxed local builds are not safe (maybe --noremote_upload_standalone_results; the default). So Bazel does not upload the build artifacts.

Another scenario, with no remote cache, and with a worker:

  1. No remote cache is specified. So, we will do a local build.
  2. The args say to allow persistent workers for any actions that support them (maybe --workers or something). This action does support workers. So, great.
  3. The args say to use sandboxing when available (--sandboxing). So, we do the build in a sandboxed worker.

Strategy selection algorithm

The strategy selection approach is pretty clearly indicated by the above examples. To spell it out:

  1. If there is a remote cache, check to see if the artifacts are there already. they are, we're done. Otherwise, continue to step 2.
  2. Choose between remote or local execution. If remote execution is available, choose that, and we're done. Otherwise choose local, and continue to step 3.
  3. Choose worker vs. non-worker.
  4. Choose sandboxed vs. non-sandboxed.
  5. Do the local build. At this point we are doing some combination of (worker/non-worker) + (sandboxed/non-sandboxed).
  6. After the build, if there is not a remote cache, we're done. Otherwise, continue to step 7.
  7. If the build was sandboxed, and sandboxed artifacts are allowed to upload to the remote cache, then upload.
  8. If the build was non-sandboxed, and non-sandboxed artifacts are allowed to upload to the remote cache, then upload.

Suggested flags

Although I'm happy with the spirit of the following suggestions, I'm not sure if these are the best names for the flags.

Remote execution:

  • I don't really know; I don't have any experience with remote execution. But I think --remote_executor, with nothing more, would imply that remote execution should be used.

Remote caching:

  • --remote_rest_cache=...: Same as today. Implies allowing downloads from the remote cache, and falling back to a local build.
  • --[no]remote_upload_sandboxed_results (default true) and --[no]remote_upload_nonsandboxed_results (default false). Note, it says nonsandboxed instead of standalone, because e.g. non-sandboxed worker builds should fit in the remote_upload_nonsandboxed_results category.
  • --[no]remote_upload_worker_results (default true). (As an aside, It doesn't make any sense to add a --[no]remote_upload_nonworker_results, because that is implied by the presence of --remote_rest_cache.)

With those defaults for remote-cache uploads, a sandboxed worker build would upload; a non-sandboxed worker build would not.

Sandboxing:

  • --[no]sandboxing: This would mean: "For all mnemonics: If we are doing a local build, and if sandboxes are supported, do it in a sandbox." Or with "no" prefix, never use a sandbox. Applies to local worker worker in addition to local non-worker actions.
  • --[no]sandboxing=<mnemonic>: Same, but for a specific mnemonic.

(Open issue: Maybe it should be --strategy_sandboxing instead of --sandboxing)

Workers:

  • --[no]workers: This would mean: "For all mnemonics: If we are doing a local build, and if persistent workers are supported, do it in a persistent worker." Or with "no" prefix, never use a persistent worker.
  • --[no]workers=<mnemonic>: Same, but for a specific mnemonic.
  • --[no]worker_sandboxing: Whether to allow worker+sandbox. It's important to default to true (the opposite of today's default) in order to make the model consistent and easy to undertstand.

(Open issue: Maybe it should be --strategy_workers instead of --workers. But that would not affect --worker_sandboxing, which is in a separate option group.)

No need for strategy-specific "fallback" strategies

One result of all of the above is that the list of strategies now becomes a "flat" list, with no need for one strategy to know about any others. (The worker-with-sandboxing scenario is a little subtle; but that is basically just the worker strategy with an option for whether to use sandboxing.)

As the code is written today, many strategies -- remote, worker, and sandbox -- need a "fallback" capability right in the runner code (and they usually fall back to standalone). With this new approach, the selection process would happen before we get to the runner; the runner can just do its work.

Not only does this make the code simpler, but it also (as described extensively above) simplifies the mental model for the user. There is no need to ask questions like "what does a remote build fall back to"; it just falls back to an appropriate kind of local build.

In fact, to some degree, the concept of "strategies" becomes less important to the end user. To the user, it's more of just a multidimensional selection process of different characteristics of the build (remote execution yes/no, remote cache yes/no, worker yes/no, sandboxed yes/no). It's a little easier (for me anyway) to think of it that way than to think of it as "I'm using the remote strategy, which fell back to worker" or "I'm using the worker strategy, which fell back to standalone".

The "problem" scenarios from above

With these new flags, here is how the problem scenarios from above would be handled:

  • If you want to use a remote cache, and fall back to a local build, just use --remote_rest_cache=...; the rest of the selection will fall out naturally from other flags.
  • If you specify worker but, for whatever reason, the action can't be run in worker, it will fall back to sandboxed if that has been specified and is available.
  • Worker sandboxing is (by default) separate from other sandboxing. So if you turn on sandboxing, you have turned it on for everything.
  • Switching between sandboxed and standalone now just requires --[no]sandboxing, and nothing more.
@meteorcloudy
Copy link
Member

Wow, this looks like a very nice suggestion, thank you!
But I'm not the expert here, FYI, @philwo @buchgr

@meteorcloudy meteorcloudy added P2 We'll consider working on this in future. (Assignee optional) type: feature request labels Nov 23, 2017
@philwo
Copy link
Member

philwo commented Nov 23, 2017

Thanks a lot! I'll read and think through it in the next days. :)

@ulfjack
Copy link
Contributor

ulfjack commented Nov 23, 2017

There is another dimension which you didn't take into account. In particular, we allow action 'tags' to enable or disable specific features. For example, an action marked as 'nosandbox' can disable sandboxing for that action, and an action marked as 'worker' enables persistent workers for that action.

My plan was to have one flag to specify strategy order and fallback, and another flag to add or remove tags from actions identified by mnemonic. (And of course, per-strategy flags to configure the strategy.) One difference to your proposal is that my proposal does not unify worker sandboxing and sandboxing.

TODO: I need to write down a more complete description.

@ittaiz
Copy link
Member

ittaiz commented Nov 24, 2017 via email

@mmorearty
Copy link
Contributor Author

mmorearty commented Nov 24, 2017 via email

@mmorearty
Copy link
Contributor Author

mmorearty commented Nov 24, 2017 via email

@ittaiz
Copy link
Member

ittaiz commented Nov 24, 2017

that if you can't find something in the cache then you'll have
to run a compiler

big +1

@ulfjack
Copy link
Contributor

ulfjack commented Nov 24, 2017

Ha! I disagree that it's natural to think in terms of strategies. We've actually had a lot of problems in the past, and I've independently come to the conclusion that we'll need to redo the way this is currently done.

In general, I agree with the points you're making, but I wanted to point out that the proposal is incomplete, and there are a few additional use cases we need to cover.

bazel-io pushed a commit that referenced this issue Nov 29, 2017
- remove BaseSpawn.Local; instead, all callers pass in the full set of
  execution requirements they want to set
- disable caching and sandboxing for the symlink tree action - it does not
  declare outputs, so it can't be cached or sandboxed (fixes #4041)
- centralize the existing execution requirements in the ExecutionRequirements
  class
- centralize checking for execution requirements in the Spawn class
  (it's possible that we may need a more decentralized, extensible design in
  the future, but for now having them in a single place is simple and
  effective)
- update the documentation
- forward the relevant tags to execution requirements in TargetUtils (progress
  on #3960)
- this also contributes to #4153

PiperOrigin-RevId: 177288598
ochafik pushed a commit to ochafik/bazel that referenced this issue Nov 29, 2017
- remove BaseSpawn.Local; instead, all callers pass in the full set of
  execution requirements they want to set
- disable caching and sandboxing for the symlink tree action - it does not
  declare outputs, so it can't be cached or sandboxed (fixes bazelbuild#4041)
- centralize the existing execution requirements in the ExecutionRequirements
  class
- centralize checking for execution requirements in the Spawn class
  (it's possible that we may need a more decentralized, extensible design in
  the future, but for now having them in a single place is simple and
  effective)
- update the documentation
- forward the relevant tags to execution requirements in TargetUtils (progress
  on bazelbuild#3960)
- this also contributes to bazelbuild#4153

PiperOrigin-RevId: 177288598
@jin
Copy link
Member

jin commented Sep 3, 2019

cc @buchgr @ishikhman with https://blog.bazel.build/2019/06/19/list-strategy.html and other work on strategy selection recently, is this issue resolved?

@jin jin added the team-Remote-Exec Issues and PRs for the Execution (Remote) team label Sep 3, 2019
@ishikhman
Copy link
Contributor

Yes, I think so. It seems that it's been covered by #7480

Closing the issue for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category: misc > misc P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: feature request
Projects
None yet
Development

No branches or pull requests

7 participants