Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking issue for "Remote Builds without the Bytes" #6862

Open
buchgr opened this issue Dec 7, 2018 · 40 comments

Comments

@buchgr
Copy link
Contributor

commented Dec 7, 2018

This issue tracks the progress towards Remote Builds without the Bytes.

Milestones

  • Implement basic support for not downloading intermediate artifacts (In Bazel 0.25)
    The feature can be enabled by adding the flag --experimental_remote_download_outputs=minimal to the remote caching / execution invocation. The feature mostly works, but comes with the limitations stated in the subsequent tasks.
  • Fix local actions that need to fetch remote outputs (In Bazel 0.25)
    While Bazel can run most actions remotely there are a few dozen or so action implementations that currently can't. This can be problematic because such actions often require Bazel to download remote files which this change tries to avoid. An example of such an action is the SymlinkAction.
  • Implement fetching of top level artifacts (In Bazel 0.26)
    Implement --experimental_remote_download_outputs=toplevel as described in the design document.
  • Store per output file metadata in the persistent action cache
    The persistent action cache stores the state of Bazel's output base. It does not know about remotely stored files. We need to extend it to do so. See #8248
  • Support remote cache eviction
    As detailed in the design document it's possible for Bazel to wrongly think that an output file exists on the remote system because it has been evicted ("garbage collected") in the meantime. If a Bazel action then depends on an evicted file there is currently no other way but to fail the build/test. See #8250
  • Add up- and download stats to the profiler
    A user should have insight into what is being up/downloaded and why.
  • Be compatible with the Build Event Service local file uploader
    This change currently doesn't interact well with the BEP and BES. See #8249
  • Don't create a runfiles tree for remote actions
    For a test that runs remotely we currently create a runfiles tree in the output base with dangling symlinks that's unused. We need to fix the SymlinkTreeStrategy.
  • Work seamlessly with IDEs

Related Issues

#1922 #5505

@buchgr buchgr self-assigned this Dec 7, 2018

@iirina

This comment has been minimized.

Copy link
Contributor

commented Dec 7, 2018

Please assign a priority (see the maintainers guide).

@buchgr buchgr added the P1 label Dec 7, 2018

buchgr added a commit that referenced this issue Dec 9, 2018

buchgr added a commit to buchgr/bazel that referenced this issue Dec 9, 2018

buchgr added a commit to buchgr/bazel that referenced this issue Dec 13, 2018

@buchgr buchgr referenced this issue Dec 13, 2018

Closed

test ci #6914

@ittaiz

This comment has been minimized.

Copy link
Member

commented Dec 28, 2018

@buchgr Do I understand correctly that Fix local actions that need to fetch remote outputs is a pre-requisite for developer machines to have cache hits from CI?
Also can you elaborate a bit about Work seamlessly with IDEs? What won't work off the bat?
Alternatively will we be able to run on CI with --experimental_remote_fetch_outputs=minimal and developer machines without and still have cache hits?

buchgr added a commit to buchgr/bazel that referenced this issue Jan 11, 2019

buchgr added a commit to buchgr/bazel that referenced this issue Feb 5, 2019

Pass MetadataHandler to ActionInputPrefetcher
Also, allow it to respond to interrupts.

Progress towards bazelbuild#6862

buchgr added a commit to buchgr/bazel that referenced this issue Feb 5, 2019

Add ability for a Spawn to declare a required local output
Some native actions run both remotely and locally. For example,
the CppCompileAction runs compilation remotely but does .d file
pruning locally. This change adds the necessary infrastructure
for an action to tell a Spawn which of its outputs it needs on
the local filesytem.

Progress towards bazelbuild#6862

buchgr added a commit to buchgr/bazel that referenced this issue Feb 5, 2019

Allow a SpawnRunner to inject output metadata.
Adds a MetadataInjector type that can be accessed via the
SpawnExecutionContext.

Progress towards bazelbuild#6862

buchgr added a commit to buchgr/bazel that referenced this issue Feb 5, 2019

Teach the FilesystemValueChecker about remote outputs
The FilesystemValueChecker is used by Bazel to detect modified
outputs before a command. This change teaches it about remote
outputs that don't exist in the output base. That is if
SkyFrame has metadata about a remote output file which does not
exist in the output base it will not invalidate the output. However,
if the file exists in the output base it will be taken as the source
of truth.

Progress towards bazelbuild#6862
@ittaiz

This comment has been minimized.

Copy link
Member

commented Feb 9, 2019

@buchgr does this mean the change is only one commit? (github says This branch is 1 commit ahead, 770 commits behind bazelbuild:master.)
I'd like to try it out and since we're using 0.22.0 I want to rebase your changes (if easy and applicable) on top of 0.22.0

@ola-rozenfeld

This comment has been minimized.

Copy link
Collaborator

commented Feb 13, 2019

Edit: sorry, disregard, my bad, it does compile with 0.22.0!

Which bazel version should we use to compile this with? For me, bazel build //src:bazel on this repository didn't work with any bazel releases in {0.16.1, 0.17.1, 0.19.2, 0.20.0, 0.21.0, 0.22.0}. I tried bootstrapping it, but that didn't work either. Thank you!

@moritzkiefer-da

This comment has been minimized.

Copy link

commented Feb 14, 2019

Hey, I was trying this branch on a very simple project today and it seems to fail to fetch binaries in some cases. I have a very simple BUILD.bazel file

cc_binary(
    name = "cbin",
    srcs = ["main.c"],
)

where main.c is just a hello world program.
I’m working in clean repository against a fresh cache started with docker run -v /tmp/bazel_empty_cache:/data -p 9090:8080 buchgr/bazel-remote-cache.

I’m now running the following commands:

~/code/bazel/bazel-bin/src/bazel run //:cbin --experimental_remote_fetch_outputs=minimal
~/code/bazel/bazel-bin/src/bazel clean --expunge
~/code/bazel/bazel-bin/src/bazel run //:cbin --experimental_remote_fetch_outputs=minimal

The first invocation works completely fine and outputs "Hello World". However, after cleaning the cache this fails and I get the following output:

Starting local Bazel server and connecting to it...
INFO: Invocation ID: c2199026-669c-4271-9412-fb8abce40f1b
INFO: Analysed target //:cbin (8 packages loaded, 58 targets configured).
INFO: Found 1 target...
Target //:cbin up-to-date:
  bazel-bin/cbin
INFO: Elapsed time: 3.058s, Critical Path: 0.56s
INFO: 2 processes: 2 remote cache hit.
INFO: Build completed successfully, 6 total actions
ERROR: Non-existent or non-executable /home/moritz/.cache/bazel/_bazel_moritz/9461a41f957a379ce0aaa8755f8e70f9/execroot/__main__/bazel-ouINFO: Build completed successfully, 6 total actions

So the binary was never downloaded. The problem appears to be in RemoteSpawnCache, if I change the MINIMAL case to behave like ALL it is downloaded again but ofc that is not the solution as it goes against the whole purpose of this patch. I am not entirely sure how this is supposed to work. If I change CppLinkAction to add the output to requiredLocalOutputs then it also works but if I understand the patch correctly this shouldn’t be necessary

@werkt

This comment has been minimized.

Copy link
Contributor

commented May 3, 2019

@werkt

It looks like the file Bazel is trying to upload is an output file. So it seems like the file fell out of the cache during the build, which means that no copy of it exists anymore anywhere. There is currently no way we can recover from this. I ll send a patch to provide a better error message.

This is not just transient - this occurs any time an action result is fetched that has missing content in the CAS. This feature should not couple the AC to the CAS such that it cannot handle incomplete data within the configuration.

irengrig added a commit to irengrig/bazel that referenced this issue May 3, 2019

remote: move options to its own package
progress towards bazelbuild#6862

Closes bazelbuild#7851.

PiperOrigin-RevId: 240590258

irengrig added a commit to irengrig/bazel that referenced this issue May 3, 2019

Flip --incompatible_symlinked_sandbox_expands_tree_artifacts_in_runfi…
…les_tree

This flag has been introduced six months ago and all downstream projects are green with it flipped. It's time to flip the flag and remove it in the 0.26.0 release.

Progress towards bazelbuild#6862

RELNOTES: --incompatible_symlinked_sandbox_expands_tree_artifacts_in_runfiles_tree has been flipped
PiperOrigin-RevId: 240883710

irengrig added a commit to irengrig/bazel that referenced this issue May 3, 2019

remote: add action input fetcher
Add a RemoteActionInputFetcher class which can stage action inputs that
are only available on a remote system. This change only introduces the
class and tests. Will enable it in a follow up CL.

Progress towards bazelbuild#6862

Closes bazelbuild#7866.

PiperOrigin-RevId: 240953164

irengrig added a commit to irengrig/bazel that referenced this issue May 3, 2019

irengrig added a commit to irengrig/bazel that referenced this issue May 3, 2019

Use ArtifactPathResolver everywhere in StandaloneTestStrategy
I discovered that some places in StandaloneTestStrategy don't properly
wrap their paths with ArtifactPathResolver. I manually audited the
StandaloneTestStrategy and fixed the remaining places.

Progress towards bazelbuild#6862

Closes bazelbuild#7888.

PiperOrigin-RevId: 240970935

irengrig added a commit to irengrig/bazel that referenced this issue May 3, 2019

remote: introduce downloadMinimal
Adds downloadMinimal() which will only download action outputs that
are strictly required and inject the remaining action metadata instead.

This change also extracts the logic of retrieving complete action result
metadata into its own method so that it can be used by both
downloadMinimal() and download().

downloadMinimal is only introduced in this CL and will be enabled in
a follow up CL.

Progress towards bazelbuild#6862

Closes bazelbuild#7867.

PiperOrigin-RevId: 240987655

irengrig added a commit to irengrig/bazel that referenced this issue May 3, 2019

Introduce DelegateFileSystem
An alternative to this implementation would be to mark all methods on
the FileSystem class public.

This will be used to implement an action file system in Bazel with the
implementation living in the remote package instead of vfs package.

Progress towards bazelbuild#6862
Closes bazelbuild#7889.

PiperOrigin-RevId: 241295198

irengrig added a commit to irengrig/bazel that referenced this issue May 3, 2019

remote: introduce --experimental_remote_download_outputs=minimal
This implements the first milestone of bazelbuild#6862. When specifying
--experimental_remote_download_outputs=minimal Bazel will avoid
downloading most action outputs when used with remote caching
and remote execution. That is, it will only download an action's
stdout/stderr as well as .d and .jdeps output files when building
C++ and Java. All other output files will only be downloaded on
demand, that is if a local action declares a remote output as an
input.

Enabling this mode currently requires specifying three additional
flags to the set of the remote caching/execution flags:
$ bazel build ... \
  --experimental_inmemory_jdeps_files \
  --experimental_inmemory_dotd_files \
  --experimental_remote_download_outputs=minimal

This mode does not yet interact well with the Build Event
Service (--bes_backend=...). Specifically, local file uploads
and remote file name conversions are disabled when
--experimental_remote_download_outputs=minimal is specified.

Lastly, this mode also does not at all interact with Bazel's
on disk action cache. That is, all (local) cached state will
be lost after restarting the Bazel server. Also changing the
value of --experimental_remote_download_outputs between builds
will invalidate all (cached) actions.

Closes bazelbuild#7905.

PiperOrigin-RevId: 241681028

irengrig added a commit to irengrig/bazel that referenced this issue May 3, 2019

remote: implement a naive action-scoped file system
This change accompanies d480c5f [1] in that it registers
an action-scoped filesystem that supports lazily fetching
action inputs that are remotely stored outputs of an upstream
action. While in theory any native java action in Bazel could
do arbitrary file system operations in practice this is mostly
useful for
 * AbstractFileWriteAction (calls Path.getInputStream())
 * TemplateExpansionAction (calls Path.getInputStream())
 * SymlinkAction (calls Path.(isFile|isExecutable|createSymbolicLink)())
 * SolibSymlinkAction (calls Path.createSymbolicLink())

For Starlark actions we found that only ctx.actions.expand_template(...)
directly interacts with the file system (via TemplateExpansionAction).
All other methods on ctx.actions create spawns internally which are
covered by d480c5f.

Progress towards bazelbuild#6862.

[1] bazelbuild@d480c5f

Closes bazelbuild#7926.

PiperOrigin-RevId: 241717157

irengrig added a commit to irengrig/bazel that referenced this issue May 3, 2019

remote: implement --experimental_remote_download_outputs=toplevel
This implements the second milestone of "Remote builds without the
Bytes" and introduces the "toplevel" option for the
--experimental_remote_download_outputs flag. This mode will only
download outputs of top level targets but not outputs of intermediate
actions.

On a build of Bazel against RBE I am still seeing a 50% speed up
compared to the "all" mode.

Progress towards bazelbuild#6862.

Closes bazelbuild#7984.

PiperOrigin-RevId: 243029119
@buchgr

This comment has been minimized.

Copy link
Contributor Author

commented May 3, 2019

This is not just transient - this occurs any time an action result is fetched that has missing content in the CAS. This feature should not couple the AC to the CAS such that it cannot handle incomplete data within the configuration.

Do I understand correctly that your backend regularly returns action cache entries that references output files that don't exist? We could introduce a second RPC to call FindMissingBlobs(...) but that would require two round trips for every cache hit which might be acceptable.

However, I am not convinced that it would materially improve things as it would only catch cases as mentioned above but not outputs being evicted at a later stage.

@werkt

This comment has been minimized.

Copy link
Contributor

commented May 3, 2019

This is not just transient - this occurs any time an action result is fetched that has missing content in the CAS. This feature should not couple the AC to the CAS such that it cannot handle incomplete data within the configuration.

Do I understand correctly that your backend regularly returns action cache entries that references output files that don't exist? We could introduce a second RPC to call FindMissingBlobs(...) but that would require two round trips for every cache hit which might be acceptable.

You are correct. I removed all validation from AC fetches long ago. I suppose there are some heuristics to be applied here with metadata and misses, but that will be unable to correct an individual build with it's client behavior, just future fetches

However, I am not convinced that it would materially improve things as it would only catch cases as mentioned above but not outputs being evicted at a later stage.

Yeah, transiency is going to be an issue here in general, but this is certainly not meant to combat it directly

@ulfjack

This comment has been minimized.

Copy link
Contributor

commented May 6, 2019

I'm not sure how I feel about it, but Mark added code to Bazel to support action rewinding, which might cover this use case.

@buchgr

This comment has been minimized.

Copy link
Contributor Author

commented May 6, 2019

@ulfjack thanks for pointing this out.

At this point I'd prefer to not use action rewinding for missing build outputs in Bazel. My major argument against it is that a prerequisite for this is for all actions to be bit by bit reproducible (we could of course detect non reproducible actions, but some actions might legitimately not be bit by bit reproducible) I think I prefer a solution could be found around attaching TTLs to build outputs.

@ufalklof

This comment has been minimized.

Copy link

commented May 6, 2019

@werkt

This comment has been minimized.

Copy link
Contributor

commented May 8, 2019

I don't really see how action rewinding precludes reproducibility - with the required re-execution of an action in order to produce an output file comes any new digest that should satisfy the downstream actions, just remap/recalculate the actions downstream of a rewound action based on the newly mapped digest result (and give me an option to make that a build ERROR, because I don't want non-reproducible actions. ever).

@ulfjack

This comment has been minimized.

Copy link
Contributor

commented May 9, 2019

I don't think it precludes reproducibility, and we don't guarantee bit-for-bit reproducibility either.

@benjaminp

This comment has been minimized.

Copy link
Collaborator

commented May 9, 2019

Does rewinding work with open-source Skyframe? I haven't seen a single public test for it.

@ulfjack

This comment has been minimized.

Copy link
Contributor

commented May 9, 2019

It does not work for the mode of Skyframe where it keeps reverse edges. I think that code is all in Skyframe, however, I'm not sure if it can be enabled or if it works. It can probably be made to work with reverse edges, but that has not been a priority.

@anakanemison

This comment has been minimized.

Copy link

commented May 9, 2019

(Mark here.) What @werkt and @ulfjack say about rewinding sounds right to me too.

The current implementation of rewinding deals with nondeterministic rewound actions by invalidating the intersection of the rewound action's transitive reverse deps and the failed (due to the lost input) action's transitive deps.

That subgraph is usually small: failed action -> lost output artifact -> generating action. It can sometimes be larger, e.g. when the lost output artifact is in a runfiles collection.

This could be extended to invalidating all of the rewound action's transitive reverse deps. That would have to happen after it's made compatible with Skyframe modes that track reverse edges, and would raise questions about how to handle reporting for actions run twice in a single build.

Agreed that failing the build in the event of an unequal rewind is a feasible extension.

We could make some existing rewind tests public after making it compatible with --notrack_incremental_state.

@ittaiz

This comment has been minimized.

Copy link
Member

commented May 13, 2019

@buchgr we weren't able to find internal users willing to use it without ResultStore :(
Are you still aiming to land support for ResultStore in 0.26?

I'm guessing this didn't made it in, right? Any chance it will be a simple cherry-pick on top of 0.26 so we will be able to run with a custom bazel version until 0.27 lands?

@buchgr

This comment has been minimized.

Copy link
Contributor Author

commented May 16, 2019

I can provide you a cherry pick for 0.26. I ll hope to merge it next week in time for 0.27.

@ittaiz

This comment has been minimized.

Copy link
Member

commented May 19, 2019

I’d love that, thanks!

buchgr added a commit to buchgr/remote-apis that referenced this issue May 22, 2019

clarify consistency requirements of GetActionResult
- move the statement from the service documentation to the call.
- add a section about extending TTLs

related discussions:
- https://groups.google.com/d/msg/remote-execution-apis/zmcEV8kubRc/5PPcHOVZAgAJ
- bazelbuild/bazel#6862

ola-rozenfeld added a commit to bazelbuild/remote-apis that referenced this issue May 22, 2019

clarify consistency requirements of GetActionResult (#79)
- move the statement from the service documentation to the call.
- add a section about extending TTLs

related discussions:
- https://groups.google.com/d/msg/remote-execution-apis/zmcEV8kubRc/5PPcHOVZAgAJ
- bazelbuild/bazel#6862
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.