
Important race condition when using remote cache without remote build #3360

Open
mmorearty opened this issue Jul 11, 2017 · 61 comments
Labels: not stale, P2 (We'll consider working on this in future. Assignee optional), team-Remote-Exec, type: bug

Comments

@mmorearty
Contributor

Description of the problem / feature request / question:

Our org has about 50 engineers using Bazel, with a remote cache but without remote build. About once every 3-4 weeks, I would hear of a case of someone's build that was failing in strange ways when the remote cache was turned on, but would succeed when the cache was off.

I finally got a repro case, and was able to figure out what was happening. Bazel's remote spawn strategy does its work in this order:

  1. Compute the hash of the action (which includes the hash of all the input files)
  2. Check if the remote cache contains the action result matching that hash
  3. If it does, then download it
  4. If it doesn't, then do a local build and upload the action result

But if the user makes any changes to the sources (e.g. saves a file, or uses git to check out a different branch of their code) after step 1, but before the local build in step 4, then you will end up with an incorrect (corrupt) entry in the cache: its key matches the input files as they were when step 1 executed, but its contents do not match what would have been built from those input files — its contents match what was built from the modified input files.

From that point on, all other developers on the team who try to build that same code will download the incorrect build results.
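
To make the window concrete, here is a minimal, hypothetical sketch of the flow above in Java (not Bazel's actual spawn-strategy code; FakeCache, hashInputs, and buildWithRemoteCache are illustrative names). The comment marks the point at which a concurrent edit poisons the upload:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

class RemoteCacheRaceSketch {
    interface FakeCache {
        byte[] lookup(String actionKey);
        void upload(String actionKey, byte[] actionResult);
    }

    // Step 1: hash the inputs (here, a single file) to form the action key.
    static String hashInputs(Path input) throws IOException, NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-1").digest(Files.readAllBytes(input));
        return HexFormat.of().formatHex(digest);
    }

    static void buildWithRemoteCache(Path input, FakeCache cache) throws Exception {
        String actionKey = hashInputs(input);        // step 1
        byte[] cached = cache.lookup(actionKey);     // step 2
        if (cached != null) {
            return;                                  // step 3: cache hit, download and use it
        }
        // <-- if `input` is edited here, the result below no longer matches actionKey
        byte[] output = Files.readAllBytes(input);   // step 4: "local build" (like the cat in a genrule)
        cache.upload(actionKey, output);             // uploads under the now-stale key
    }
}

In the genrule repro below, step 4 is the cat command, and editing foo.in during the sleep 10 lands exactly in that window.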

Interestingly, I think this cannot cause cache problems if you are using remote cache WITH remote build. In that scenario, Bazel's SimpleBlobStoreActionCache.uploadFileContents() will first read the file into memory, then compute its hash, and upload the already in-memory blob. So the file contents in the cache are guaranteed to be accurate (the source file's hash matches the source file's contents). I haven't tried it, but I imagine that the worst that could happen — and I'm not even sure if this could happen — is that the remote build might fail because it can't find the input files it wants. No serious harm done.

If possible, provide a minimal example to reproduce the problem:

This BUILD file has a rule called :race which, when built, will first sleep for a few seconds, and then will simply cat foo.in to foo.out:

genrule(
    name = "race",
    srcs = ["foo.in"],
    outs = ["foo.out"],
    cmd = "sleep 10; cat $(SRCS) > $@",
)

Make a file foo.in: echo 1 > foo.in

Start a build with a command line similar to this one (for this test, I have nginx running on localhost acting as a remote cache; you will need to set up some sort of similar remote cache):

bazel --host_jvm_args=-Dbazel.DigestFunction=SHA1 build --spawn_strategy=remote --remote_rest_cache=http://localhost:8000/bazel-cache/race :race

While the build is happening — during the sleep 10 — in another shell, change the contents of foo.in: echo 2 > foo.in

The build will complete, and bazel-genfiles/foo.out will, not surprisingly, contain 2. That's not a bug. But then:

  • Do a clean: bazel --host_jvm_args=-Dbazel.DigestFunction=SHA1 clean
  • Reset foo.in to its old contents: echo 1 > foo.in
  • Do another bazel build, with the same command line as above.

At this point, bazel-genfiles/foo.out will contain 2, because that's what Bazel found in the remote cache. This is definitely wrong.

Environment info

  • Operating System:

macOS; probably Linux too, though I haven't tested it

  • Bazel version (output of bazel info release):

0.5.2

@ola-rozenfeld
Contributor

That's awesome, thank you for reporting it!

@ola-rozenfeld ola-rozenfeld self-assigned this Jul 11, 2017
@ola-rozenfeld
Contributor

Wait, but this is a race condition in local Bazel caching as well. If you compile :race without any remote cache, change foo.in mid-flight, then after the build change it back and try to recompile :race again, it will say that the target is up-to-date, but give you incorrect results.
I suspect the right thing to do would be to abort the build with an error as soon as Bazel notices that one of the input files was changed mid-compile (where changed means changed timestamp, disregarding the contents), before putting the results into any cache (remote or local).

@mmorearty
Contributor Author

I see you're right; it's a local Bazel caching issue as well. I hadn't noticed that, but it makes sense.

I agree with your suggestion of checking timestamps. I can't think of a better way to handle it than that.

When I was thinking this was only a problem with the remote cache, I thought bazel could just silently decide not to upload the incorrect files to the remote cache, but continue the local build. But since, as you say, the local cache is wrong too, maybe the build has to be aborted. Ugh. That feels undesirable, but I'm not sure what else to do. I suppose it could requeue the action and run it again, with a limit of 3 tries or something? That might be nice, but it's getting pretty fancy.

I'm also trying to figure out if there are more complicated scenarios that need to be considered. For example, suppose you have Package A, which compiles Input A, and Package B, which compiles Input B. A build is going at the same time that a git checkout is taking place. So Package A builds correctly, no errors; then, after it's done, Input A changes; and then Package B starts building. Does this mean that the build for Package B needs to recursively check the timestamps of all ancestor inputs in order to avoid corrupting the cache? I don't know.

@mmorearty
Contributor Author

As a longer-term solution, it might be interesting to explore copy-on-write filesystems: If sandboxed builds made copy-on-write copies of the input files instead of using symlinks, then the process of creating the copies should be fast and shouldn't use much additional disk space, but I think it would protect us against this (as long as the hashes or timestamps are double-checked after the copies are made).

But this is sort of exotic, and only some newer filesystems support this, such as macOS APFS, Windows ReFS, and Linux ZFS (I'm saying this based on a quick Google search, I may be wrong about support in those filesystems).

@buchgr
Contributor

buchgr commented Jul 12, 2017

cc: @ulfjack

@ulfjack ulfjack added category: service APIs P2 We'll consider working on this in future. (Assignee optional) type: bug labels Jul 12, 2017
@ulfjack
Contributor

ulfjack commented Jul 12, 2017

Thanks for the report. Well spotted.

Technically, yes, there is the possibility of a race in the local case. However, it's unlikely enough to happen in practice. We can't make it completely reliable in all cases anyway, since Bazel runs on your local machine and most people have root access to their local machines, so they can do whatever they want.

The remote cache is a bigger concern, because the bad result from one machine can poison the results for other users. Note that any user can already poison the results for any other user with a shared cache. You should only use a shared cache if you completely, fully trust everyone in your team; anyone with write access to the shared cache can own everyone else's machine. As such, using a user-writable shared cache is something that we'll warn people against. I apparently didn't add that to the documentation yet.

The compromise is to have a cache that's written by an automated process, e.g., a CI system, which very few people have access to, and configure the clients for read-only access. That is, if you don't do remote execution on machines that are locked down.

That said, we might want to double-check the checksums of the files before we upload an entry to the remote cache, as modifying source files during a build is a common-enough operation that this can easily happen by accident.

If we enable --watchfs by default, we could check for relevant events for the relevant files between starting an action and writing the cache entry. The watcher APIs don't guarantee timely delivery, so an event that happens before writing a cache entry could still be delayed until after (so there's still a race), but it would be much less likely to happen. It also won't work for network file systems, because the relevant APIs usually don't.

I think this is going to require some infrastructure to make it possible to check for file change events in the right places. We could maybe double-check the checksum before uploading the cache entry, but that could also be prohibitively expensive.
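
A rough illustration of the watcher idea using java.nio's WatchService (hypothetical code, not the --watchfs implementation; runAction is a placeholder, and a real version would watch all relevant input directories recursively). As noted above, event delivery is not guaranteed to be timely, so this narrows the race window rather than closing it:

import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

class WatchGuardSketch {
    public static void main(String[] args) throws IOException, InterruptedException {
        Path workspace = Path.of(args.length > 0 ? args[0] : ".");
        try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
            // Watches a single directory; a real implementation would cover all inputs.
            workspace.register(watcher,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_MODIFY,
                StandardWatchEventKinds.ENTRY_DELETE);

            runAction();  // placeholder for executing the spawn locally

            // Drain events that arrived while the action ran; if anything changed,
            // conservatively skip writing the cache entry.
            boolean sawChange = false;
            WatchKey key;
            while ((key = watcher.poll()) != null) {
                sawChange |= !key.pollEvents().isEmpty();
                key.reset();
            }
            System.out.println(sawChange
                ? "Inputs may have changed; skipping the cache upload."
                : "No change events observed; uploading (the race is narrowed, not closed).");
        }
    }

    private static void runAction() throws InterruptedException {
        Thread.sleep(1000);  // stand-in for the real action
    }
}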

Here's what isn't going to work:

  • we can't stop the build on any modified file - a lot of people leave Bazel running in a loop, or realize that they still need to change something after they started the build; it usually doesn't cause any problems, and we'd get a lot of angry users if we broke the build every time that happens
  • we're not going to copy the input files for every action; too expensive (we tried)
  • we can't use hard-links for every action; too expensive (we tried)
  • we can't rely on the file system - most systems don't have copy-on-write file systems readily available; relying on esoteric file systems is not an option

@ola-rozenfeld
Contributor

Checking the checksums doesn't eliminate the race condition in general (see provided example, where the file changes from 1 to 2 and back to 1). We need to check the timestamps, which might actually be cheaper.
How about if we only add this change: check the timestamps of all input files before writing the result to a remote cache, and if we see changes from after the action execution started, we skip setting the action result in the remote cache? Perhaps that won't be too expensive -- I can try and profile that and see how bad it is.
I believe that the cache poisoning should not be "transitive", due to how our Merkle trees work -- the next actions will depend on inputs with particular digests, so we shouldn't worry about propagating the errors, right?
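
A minimal sketch of that check (hypothetical helper names, not the patch that was eventually committed; it uses mtime via the standard Java file API, while the comments here also discuss ctime): snapshot the inputs' timestamps just before the action runs, re-check them after it finishes, and skip the upload if anything changed.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class InputChangeGuardSketch {
    // Snapshot each input's timestamp right before the action executes.
    static Map<Path, FileTime> snapshot(List<Path> inputs) throws IOException {
        Map<Path, FileTime> times = new HashMap<>();
        for (Path p : inputs) {
            times.put(p, Files.getLastModifiedTime(p));
        }
        return times;
    }

    // After the action has run, re-stat the inputs; any difference means an input
    // was touched mid-action, so the result must not be written to the remote cache.
    static boolean unchanged(Map<Path, FileTime> before) throws IOException {
        for (Map.Entry<Path, FileTime> e : before.entrySet()) {
            if (!e.getValue().equals(Files.getLastModifiedTime(e.getKey()))) {
                return false;
            }
        }
        return true;
    }

    static void executeAndMaybeCache(List<Path> inputs, Runnable runAction, Runnable uploadResult)
            throws IOException {
        Map<Path, FileTime> before = snapshot(inputs);
        runAction.run();
        if (unchanged(before)) {
            uploadResult.run();
        }
        // else: skip setting the action result in the remote cache.
    }
}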

The local cache problem means that the remote cache problem will be reproducible for a particular user -- their local cache will be hit first, return an incorrect result, and the build will continue wrongly from there. So users who run Bazel in a loop will need to remember to do a bazel clean often...

@ulfjack
Contributor

ulfjack commented Jul 12, 2017

Sure, I think using the watcher is actually a better plan. Absent that, we'd actually check the ctime, which we already record in some cases (but maybe not all?).

Uhm, what? There's no problem with the local cache. The case that you describe is something we've never observed in practice, and anyone running bazel clean is doing it wrong.

@buchgr
Contributor

buchgr commented Jul 13, 2017

That said, we might want to double-check the checksums of the files before we upload an entry to the remote cache, as modifying source files during a build is a common-enough operation that this can easily happen by accident.

I think before uploading we should read the whole file into memory, then compute the checksum, then do the upload. This at least guarantees that the hash and the contents match.

Note that any user can already poison the results for any other user with a shared cache.

I agree and disagree. It seems to me that anyone would want to run with a content-addressable storage cache that computes the hash server-side and doesn't trust the client's hash. So unless a malicious user manages to create a hash-collision, things should be fine. The remote cache can use the hash as a hint and return an error if its own computation and the client provided hash mismatch, but should never just accept the client's hash.
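
A sketch of what treating the client's digest as a hint might look like on the server side (a hypothetical in-memory store, not any real cache backend):

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class VerifyingCasSketch {
    private final Map<String, byte[]> store = new ConcurrentHashMap<>();

    // The server recomputes the digest of the uploaded bytes and rejects a mismatch,
    // so a blob can never be stored under a key that doesn't match its contents.
    void put(String claimedSha1Hex, byte[] content) throws NoSuchAlgorithmException {
        String actual = HexFormat.of().formatHex(
            MessageDigest.getInstance("SHA-1").digest(content));
        if (!actual.equals(claimedSha1Hex)) {
            throw new IllegalArgumentException(
                "digest mismatch: client claimed " + claimedSha1Hex
                    + " but content hashes to " + actual);
        }
        store.put(actual, content);  // keyed by the verified digest
    }

    byte[] get(String sha1Hex) {
        return store.get(sha1Hex);
    }
}

As the next reply points out, this protects the CAS blobs themselves, but the action-key-to-result entries are not content-addressed, so server-side hashing alone cannot catch the race described in this issue.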

@ulfjack
Contributor

ulfjack commented Jul 13, 2017

We aren't uploading the input files to the remote cache. We do upload the output files, but the remote worker has code to double-check that the checksum matches.

The action cache is not a pure CAS. The action results are entries from action hash to action result. You're right that it's impossible to poison a CAS (if the remote server checks the checksums), but that's not the problem. I think it's not ideal that we store CAS and non-CAS entries in exactly the same way - implementations might need to distinguish between the two and use different storage mechanisms. It's on my TODO list to change the SimpleBlobStore interface (or introduce a new one) that allows implementations to store these separately, and then also update the Ondisk implementation to put the CAS and non-CAS entries into different directories.

Also, I apologize for my last comment. Thanks to Ola for pointing out the flaws in my not-well-thought-through previous post, and I agree that we should track file modification time. I'm still worried about costs, since it requires one additional stat per input file per action, but it's probably ok.

@ulfjack
Contributor

ulfjack commented Jul 13, 2017

The action hash is a hash of the action description, not of the action result. If we knew the action result already, we wouldn't need to ask the cache.

@ulfjack ulfjack added P1 I'll work on this now. (Assignee required) Release blocker and removed P2 We'll consider working on this in future. (Assignee optional) labels Jul 14, 2017
@ulfjack
Contributor

ulfjack commented Jul 14, 2017

I'm convincing myself that we shouldn't make a release without a bug fix or workaround for this issue.

mmorearty added a commit to Asana/bazel that referenced this issue Jul 15, 2017
Workaround for bazelbuild#3360. After an
action finishes executing, we then re-check the timestamps of all of the
input files to that action. If any of them have changed, we don't upload
the action to the remote cache.

This does not address the problem that even the local cache is
incorrect. Fortunately, that problem is more rare, and also it can be
worked around with "bazel clean".
@buchgr
Contributor

buchgr commented Jul 17, 2017

The 0.5.3 release will contain a bug fix by @ola-rozenfeld that should lower the chances of hitting the remote cache race by checking the input files' ctime before and after execution. However, as @ulfjack explained to me, this will not fix the bug. The problem is that we need to capture the inputs' ctime at the time the action's hash is computed, which is usually well before the action is executed. So this fix still doesn't fully address the problem; it just reduces the time window in which it can occur. We don't know how significant that is, though. @ulfjack mentioned we need to do some internal refactoring to be able to fix it for real. This will NOT be done in time for the 0.5.3 release.

bazel-io pushed a commit that referenced this issue Jul 17, 2017
Lower the chances of triggering #3360. Immediately before uploading
output artifacts, check if any of the inputs ctime changed, and if
so don't upload the outputs to the remote cache.

I profiled the runs on //src:bazel before vs. after the change; as expected, the
number of VFS_STAT increases by a factor of 180% (!), but, according to the profiler,
it adds less than 1ms to the overall build time of 130s.

PiperOrigin-RevId: 162179308
@mmorearty
Contributor Author

Thanks for your work on this. @ola-rozenfeld's fix addresses the example I gave in my original description, but it's still not difficult to run into this problem with only slightly more complicated scenarios. This is real, and I think it's likely that my team would hit this if we were using the current version of Bazel. (We are currently using a fork of Bazel with my own fix, similar to what I describe below.) Luckily, I think a fix similar to Ola's is not too difficult (although unfortunately it wouldn't be as localized as her fix was).

The problem is that the current fix gets the ctime immediately before the action runs, and then immediately after the action runs; but often, the actual digests of the inputs files were computed much earlier during Bazel's execution.

The real-world scenario is: The user starts a build (with local compilation but with remote cache) which will take "a little while" (let's say, at least 30-60 seconds). While the build is going, the user saves a file in their editor — a file that Bazel has not yet used in any action (but Bazel has already computed its digest, because that happened fairly early in Bazel's execution). The result is cache corruption.

One way to address this would be: Immediately before any file's digest is computed, get its ctime; and save that somewhere (probably alongside the digest somehow). Then, much later, immediately after an action has executed, check if the ctimes of any of its input files have changed.
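
A minimal sketch of that idea (hypothetical names, not Bazel's actual digest or metadata machinery; "unix:ctime" assumes a JDK and file system that expose the unix attribute view, as Linux and macOS do):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class DigestTimeGuardSketch {
    // ctime observed immediately before each file's digest was computed, keyed by the
    // file's resolved (symlink-free) path.
    private final Map<Path, Object> ctimeAtDigest = new ConcurrentHashMap<>();

    byte[] computeDigest(Path file) throws IOException, NoSuchAlgorithmException {
        Path real = file.toRealPath();  // resolve symlinks so aliases share one entry
        ctimeAtDigest.put(real, Files.getAttribute(real, "unix:ctime"));
        return MessageDigest.getInstance("SHA-1").digest(Files.readAllBytes(real));
    }

    // After an action has executed, verify that none of its inputs changed since the
    // moment their digests were taken; if any did, skip the remote-cache upload.
    boolean safeToUpload(List<Path> actionInputs) throws IOException {
        for (Path input : actionInputs) {
            Path real = input.toRealPath();
            Object recorded = ctimeAtDigest.get(real);
            if (recorded == null || !recorded.equals(Files.getAttribute(real, "unix:ctime"))) {
                return false;
            }
        }
        return true;
    }
}

The workaround committed later in this thread records mtime instead of ctime, for the reason given in that commit message.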

Here's a repro case for this scenario, structured quite similarly to the one above, except with two actions instead of one:

Make this BUILD file. It has two targets:

  • lib, which waits ten seconds and then copies lib.in to lib.out; and
  • main, which depends on lib, and also waits ten seconds and then copies both lib.out and main.in to main.out

genrule(
    name = "main",
    srcs = [":lib", "main.in"],
    outs = ["main.out"],
    cmd = "sleep 10; cat $(SRCS) > $@",
)

genrule(
    name = "lib",
    srcs = ["lib.in"],
    outs = ["lib.out"],
    cmd = "sleep 10; cat $(SRCS) > $@",
)

Make the files lib.in and main.in:

echo lib 1 > lib.in
echo main 1 > main.in

Start a build of :main with a command line similar to this one (for this test, I have nginx running on localhost acting as a remote cache; you will need to set up some sort of similar remote cache):

bazel --host_jvm_args=-Dbazel.DigestFunction=SHA1 build -s --spawn_strategy=remote --remote_rest_cache=http://localhost:8000/bazel-cache/race :main

While the build is happening — during the first sleep 10 (in other words, while :lib is sleeping) — in another shell, change the contents of main.in: echo main 2 > main.in

The build will complete, and bazel-genfiles/main.out will, not surprisingly, contain:

lib 1
main 2

That's not a bug. But then:

  • Do a clean: bazel --host_jvm_args=-Dbazel.DigestFunction=SHA1 clean
  • Reset main.in to its old contents: echo main 1 > main.in
  • Do another bazel build, with the same command line as above.

At this point, bazel-genfiles/main.out will still have main 2 on its second line, because that's what Bazel found in the remote cache. This is wrong.

@mmorearty
Contributor Author

The best way I can think of to address this is to change FileSystem.getDigest() so that instead of returning just the digest (as a byte[]), it returns a small class that has two members: the digest, plus the ctime that the file had just before the digest was computed. This is unfortunate because it will ripple through a whole lot of code. I can't think of a better way to do it.

How are filesystems exposed internally in Bazel? If they are located through some sort of registry, then perhaps another way to do it would be that FileSystem.getDigest() wouldn't change, but instead the FileSystem class would keep track of the ctime just before the digest was computed. Then there could be a separate call, FileSystem.getDigestCtime() or something, to get the ctime that the file had just before the digest was computed.
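
For concreteness, the first option might look roughly like this (a hypothetical signature, not Bazel's actual FileSystem API):

import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

// The digest paired with the ctime observed immediately before it was computed.
record DigestWithCtime(byte[] digest, FileTime ctimeBeforeDigest) {}

interface DigestingFileSystem {
    // Instead of a plain byte[] getDigest(Path), return both pieces so callers can
    // later detect that an input changed after its digest was taken.
    DigestWithCtime getDigest(Path path) throws IOException;
}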

mmorearty added a commit to Asana/bazel that referenced this issue Nov 22, 2017
Avoid triggering bazelbuild#3360. Immediately before a file's digest is computed
for the first time, cache its last-modified time (mtime). Later, before
uploading output artifacts, check if any of the inputs' mtimes have
changed since their digest was computed, and if so, don't upload the
outputs to the remote cache.

We need to use mtime, not ctime (last-change time), because some files
(e.g. genrule-setup.sh) have their metadata changed after their digest
is computed.

Also, the cache needs to resolve symbolic links, because some files are
accessed from multiple paths that all resolve to the same file.
@buchgr
Contributor

buchgr commented Dec 4, 2017

What do you think about recomputing the action hash (incl. source file hashes, affected merkle tree paths) before uploading to the remote cache?

The code changes would be limited to the remote package. It's not clear to me that this would be noticeable in performance. Furthermore, the cost of hashing could be hidden by concurrently uploading output artifacts and only uploading the action hash if it did not change.

@ulfjack
Contributor

ulfjack commented Dec 5, 2017

@buchgr, I'm not sure what you're proposing here, but rehashing all the input files is likely prohibitively expensive. Just think of rehashing the JDK for every Javac action - not an option. The right thing is to use the ctime to check for file changes, and conservatively not cache if the ctime changed (or re-execute). Note that it's unsafe to upload a cache entry if you aren't sure that the inputs match what the compiler saw during the action.

@hchauvin
Contributor

hchauvin commented Dec 5, 2017

Maybe I'm missing something about the race condition here, but the way we avoid corrupting the cache is to recompute the hash on the cache service for every upload and throw an error if there is a mismatch. That way, if the file changes locally between the computation of the hash and the call to the remote cache service, the build just fails and has to be started again.

@ulfjack
Contributor

ulfjack commented Nov 20, 2018

@buchgr Are you able to reproduce with experimental_guard_against_concurrent_changes enabled?

@uri-canva
Contributor

What's the remaining issue with this flag? I'm considering running a cache locally on developer laptops instead of using the dir cache as discussed in #5139.

@buchgr
Contributor

buchgr commented Jan 19, 2019

@uri-canva on our CI, as of last week, we have seen actions being marked as modified on macOS even though no files had been modified. I have not had time to look into the issue yet. I am not aware of any problems on Linux.

@ob
Contributor

ob commented Jan 19, 2019

@buchgr, do you mean with the flag or without? I am not enabling this flag and have seen the same behavior on macOS: Actions run locally when I know for a fact they are fully cached.

I opened #7084 to help with debugging.

@buchgr buchgr added team-Remote-Exec Issues and PRs for the Execution (Remote) team P2 We'll consider working on this in future. (Assignee optional) and removed category: http caching labels Feb 5, 2019
@virusdave
Contributor

What's the status of this?

@buchgr buchgr removed their assignment Jan 9, 2020
@MikhailTymchukFT

@ulfjack

Absent that, we'd actually check the ctime, which we already record in some cases (but maybe not all?).

Where do you store this info?

@ulfjack
Contributor

ulfjack commented Apr 15, 2020

In FileContentsProxy via FileArtifactValue.

The check is already implemented, though. Just enable

--experimental_guard_against_concurrent_changes

@cpsauer
Contributor

cpsauer commented Nov 5, 2020

Is there something blocking this from being enabled by default? Seems like a pretty clear, serious correctness issue, solved by @ulfjack's great work.

@nouiz

nouiz commented Nov 6, 2020

I agree, the default should be safe. For people who can guarantee that source files don't change during a build, a speed-up knob can be useful. But it should be safe by default.

@philwo
Member

philwo commented Mar 30, 2021

@buchgr @ulfjack Does anyone remember why this flag isn't enabled by default? 😅 Is it because of the "There may be cases where the Linux kernel delays writing of files, which could cause false positives." part in the documentation of the flag?

@uri-canva
Contributor

According to the comments above, the issue happens on macOS too. We disabled it regardless because the overhead is very high, but I assume that since it's not enabled by default there haven't been many eyes on it anyway.

@mmorearty
Contributor Author

If this context helps, I can tell you that I don’t know how we would use Bazel if this flag wasn’t available. (We are a Mac shop.) Before this flag existed, corruption of our remote cache was somewhat rare, but when it did happen — perhaps once a month — we were pretty mystified and would have to spend a long time trying to figure out what happened and what to do about it.

@nouiz

nouiz commented May 12, 2021

This page (https://docs.bazel.build/versions/master/remote-caching.html#known-issues) says:
"
Input file modification during a build

When an input file is modified during a build, Bazel might upload invalid results to the remote cache. You can enable a change detection with the --experimental_guard_against_concurrent_changes flag. There are no known issues and it will be enabled by default in a future release. See issue #3360 for updates. Generally, avoid modifying source files during a build.
"

Can it be enabled by default now, as the quoted text says it will be? People who do not want it can disable it.
We need the cache to be safe by default.

@uri-canva
Contributor

I think the docs aren't accurate anymore; there are known issues, and that's why it's not enabled by default.

@cpsauer
Contributor

cpsauer commented May 13, 2021

It feels like there's a "{Fast, Correct} - Choose two" joke in here somewhere :)

@cpsauer
Contributor

cpsauer commented May 13, 2021

Do we have a handle on where the drawbacks are? I don't remember noticing a meaningful slowdown when I enabled it on macOS.

@uri-canva
Contributor

See the comments above.

@uri-canva
Contributor

@buchgr can you open a new issue specifically for the false positives when using --experimental_guard_against_concurrent_changes? That would help summarise the current state, and would help people who find this issue. As it is now, someone who reads the first replies and then jumps to the last replies won't understand what is happening.

@uri-canva
Contributor

I'm silly, here's the issue: #5566, sorry about the noise @buchgr.

@cpsauer
Contributor

cpsauer commented May 13, 2021

No, I get it, and I have read them. But are you seeing it manifest as a serious performance issue for you, @uri-canva, worse than the cache poisoning? Any more details you can provide (here or in the other issue)?

What I was seeing on my repo/machines was that it trades a minor performance cost for fixing a serious correctness/cache-poisoning issue, and human time is much more valuable than CPU time. It seems like some others feel that way too... but it's totally possible we're not hitting the issue the same way you are and are missing the severity of the trade-off.

@nouiz

nouiz commented May 19, 2021

We can force which Bazel version is used, but we can't force or verify that each developer adds the extra flag when needed.
We are all human, so it is normal that mistakes will sometimes be made and the flag will be forgotten...
And when that happens, the consequences are significant and waste a lot of human time before being corrected.

At this point, I think someone needs to demonstrate a big speed regression to justify tolerating a cache corruption problem: something like a comparison of build times with and without the flag. Without that, I think we should just change the default.
Right now, the speed regression looks more like an anecdote than a big, real issue (it seems to have been reported by only one person, without any numbers).

The cache poisoning is a big, real issue that has affected many people.

@github-actions

Thank you for contributing to the Bazel repository! This issue has been marked as stale since it has not had any activity in the last 1+ years. It will be closed in the next 14 days unless any other activity occurs or one of the following labels is added: "not stale", "awaiting-bazeler". Please reach out to the triage team (@bazelbuild/triage) if you think this issue is still relevant or you are interested in getting the issue resolved.

@github-actions github-actions bot added the stale Issues or PRs that are stale (no activity for 30 days) label May 10, 2023
@mmorearty
Contributor Author

I don’t understand why the flag still exists, instead of the fix being fully integrated rather than hidden behind a command-line flag. For builds with local execution, Bazel's caching has been perfect with the flag; without the flag it would not be usable, due to random, extremely difficult-to-debug cache corruption.
