Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

disk action cache can become out of sync with in-memory action cache #2660

Closed
benjaminp opened this issue Mar 9, 2017 · 11 comments
Closed
Assignees
Labels
P1 I'll work on this now. (Assignee required) type: bug under investigation

Comments

@benjaminp
Copy link
Collaborator

benjaminp commented Mar 9, 2017

Let's begin with a reproducible example of this bug (output omitted where it's not relevant):

$ cat BUILD 
[genrule(
  name = 'noise-{}'.format(i),
  cmd = 'echo "hi" > $@',
  outs = ['noise-{}.txt'.format(i)],
) for i in range(5000)]

filegroup(
  name = 'noise',
  srcs = ['noise-{}.txt'.format(i) for i in range(5000)],
)

genrule(
  name = 'a',
  cmd = 'echo 1 > $@',
  outs = ['a.txt'],
)

genrule(
  name = 'b',
  cmd = 'echo 101 > $@',
  outs = ['b.txt'],
)
$ bazel clean --expunge
$ bazel build noise
$ bazel build a
$ bazel build b
$ bazel build --check_up_to_date a
INFO: Found 1 target...
Target //:a up-to-date:
  bazel-genfiles/a.txt
INFO: Elapsed time: 0.263s, Critical Path: 0.00s
$ bazel shutdown
$ bazel build --check_up_to_date a
........
INFO: Found 1 target...
ERROR: action 'Executing genrule //:a' is not up-to-date.
Target //:a failed to build
INFO: Elapsed time: 2.389s, Critical Path: 0.00s

As you can see, Bazel now thinks a is out of date despite us doing nothing but restarting the server!

I believe the problem is CompactPersistentActionCache.ActionMap's ability to retain the cache journal on save(). Specifically, Bazel has a heuristic that if the action cache journal is much smaller than the action cache, it keeps the journal instead of applying it to (and rewriting) the main action cache. This is fine in principle, but in reality, PersistentMap simply forgets about the journal once it closes it. This means that if the journal-keeping heuristic is applied twice in a row, the first cache journal will simply be overwritten by the new one, losing data.

In the example, the problem plays out as follows:

  1. Executing the noise rule makes a large action cache.
  2. When we build a, the large action cache causes Bazel to write only to the cache journal.
  3. When we build b, the same heuristic applies and Bazel writes to the action cache journal stamping over the the one it wrote in step 2. The on-disk action cache is now out of sync with the in-memory one.
  4. bazel build --check_up_to_date a works because the in-memory action cache still knows about a.txt.
  5. When we restart the server, the action cache and journal from step 3 is read. All references to the work done in step 2 are lost because its journal was never applied.
@damienmg
Copy link
Contributor

What is the version of Bazel you are using? We had a bug in 0.4.4 (#2490) and we are currently testing 0.4.5 that address that issue, could that be it?

/cc @ulfjack also

@benjaminp
Copy link
Collaborator Author

benjaminp commented Mar 10, 2017

I'm using 0.4.3—I actually avoided upgrading because the cpp input pruning code change looked unstable—but from code inspection it looks like the issue has been present since the dawn of Bazel through today. (You have to get into a very special state to see this issue.) This issue looks like a problem with PersistentMap and doesn't involve anything fancy like input-discovering actions as #2490 does.

@damienmg damienmg assigned ulfjack and unassigned damienmg Mar 14, 2017
@damienmg
Copy link
Contributor

Ok @ulfjack is definitely way more able to understand the action cache that I am, so reassigning.

Setting to P1 because correctness

@damienmg damienmg added P1 I'll work on this now. (Assignee required) type: bug labels Mar 14, 2017
@ulfjack
Copy link
Contributor

ulfjack commented Mar 14, 2017

That's why you shouldn't write your own database. I don't think there's a correctness issue here, it's 'just' forgetting stuff, presumably resulting in slower than necessary builds. Correct?

@ulfjack
Copy link
Contributor

ulfjack commented Mar 14, 2017

If the analysis above is correct, then this bug is many years old already.

@benjaminp
Copy link
Collaborator Author

Right, I don't believe this can cause incorrect incremental builds.

I note that if this problem did lead to correctness issues, it would highlight defects more systemic than just this bug in PersistentMap or at least in the assumptions of PersistentMap's clients. Since PersistentMap only updates its backing store periodically and never fsyncs , clients have to expect to receive garbage from it.

@ulfjack
Copy link
Contributor

ulfjack commented Apr 7, 2017

I think Java may be calling fsync on file close. Anyway, thanks for the patch, I'll merge it shortly.

@meteorcloudy
Copy link
Member

I believe 184faf6 is causing #3043 due to the leak of the journal file handle. On Windows, you cannot delete a file when it's open in another process.
Any idea? @ulfjack @benjaminp

@meteorcloudy
Copy link
Member

I noticed this commit by blaming PersistentMap.java, I am trying to confirm this by reverting this change .

@ulfjack
Copy link
Contributor

ulfjack commented Jun 14, 2017

Clean should tell the action cache to close all files and remove all in-memory data. If it's not doing that right now, then that'd be a bug already. If it does clear the in-memory action cache, then it shouldn't be difficult to close any open files as part of that.

@benjaminp
Copy link
Collaborator Author

One thing I noticed when working on this bug was that there isn't a way to close the resources associated with a PersistentMap except by calling save. BlazeWorkspace.clearCaches simply does actionCache = null, which presumably leaks the journalfd if it's open because the last build didn't call save().

luca-digrazia pushed a commit to luca-digrazia/DatasetCommitsDiffSearch that referenced this issue Sep 4, 2022
    This fixes bazelbuild/bazel#2660. Basically,
    if we elect to keep the journal during PersistentMap.save(), we
    shouldn't stomp over it the next time save() is called.

    In writeJournal(), we now check if the journal file exists, and open
    it in append mode if it does. Alternatively, we could simply not close
    (and thus forget about) the journal in save(), but that would leak the
    journal file handle if save() was never called with keepJournal()
    returning false.

    Change-Id: Id00732f161c8b5a082a6c109aee115591ace2ea7
    PiperOrigin-RevId: 152480978
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P1 I'll work on this now. (Assignee required) type: bug under investigation
Projects
None yet
Development

No branches or pull requests

4 participants