@demerphq
Contributor

We have been regularly encountering bugs in the caching logic of 1.18.3, typically related to migrating modules, but also in other circumstances. I believe that the cache logic contains flaws that are triggered under the heavier load of our 20k+ modules.

While investigating this I did a deep analysis of the cache logic, and was able to create the following tests for two problems with the cache. This logic from compiler/elixir.ex in particular seems incorrect:

           size != last_size or
              has_any_key?(stale_modules, modules) or
                (last_mtime > mtime and
                   (missing_beam_file?(dest, modules) or digest_changed?(source, digest))) ->
              [source]

This code only checks whether the beam files exist when the mtimes are ordered a particular way. If the mtime is unchanged, it fails to notice missing beam files.

I noticed the cache logic embeds the assumption that system time is strictly increasing, which is not correct. Clock drift is a real thing, and system times are adjusted. IMO all ordering comparisons on time should be changed to plain inequality checks, and some, like the one above, should just be removed. Personally I don't think there is much to be won by guarding the digest_changed() function with an mtime change check the way this code does. It seems to me it is just a cache bug waiting to happen.

I'd expect this code to look like this:

           size != last_size or
              has_any_key?(stale_modules, modules) or
                missing_beam_file?(dest, modules) or
                digest_changed?(source, digest) ->
              [source]
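
A minimal sketch of how the unguarded condition could behave, assuming a recorded size and digest per source. The module `StaleSketch`, its function names, the shape of `recorded`, and the use of `:erlang.md5/1` are hypothetical stand-ins, not the helpers in compiler/elixir.ex, and the stale-modules check is left out for brevity:

```elixir
defmodule StaleSketch do
  # Hypothetical illustration only; names and data shapes are assumptions.
  # `recorded` is a map with the :size and :digest stored at the last compile.
  def stale?(source_path, recorded, beam_paths) do
    case File.stat(source_path) do
      {:ok, %File.Stat{size: size}} ->
        # No mtime guard: missing beams and digest changes are always checked.
        size != recorded.size or
          missing_beam_file?(beam_paths) or
          digest(source_path) != recorded.digest

      {:error, _reason} ->
        # A source that cannot be read is treated as stale.
        true
    end
  end

  def missing_beam_file?(beam_paths) do
    Enum.any?(beam_paths, fn path -> not File.regular?(path) end)
  end

  # md5 is used only because it needs no extra application; the real digest may differ.
  def digest(path), do: :erlang.md5(File.read!(path))
end
```

With this shape, deleting a beam file marks the source as stale even though the source's mtime never changed, at the cost of one existence check per beam file on every run.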

I'd also consider coupling inode with mtime. Don't just use mtime alone.
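
As a rough sketch of that idea, a file could be identified by a tuple of mtime, inode, and size, compared by inequality rather than by ordering. `FingerprintSketch` and its functions are hypothetical names; only `File.stat!/2` and the `File.Stat` fields are real Elixir APIs. Note that the `inode` field is reported as zero on non-Unix file systems, so it adds nothing there:

```elixir
defmodule FingerprintSketch do
  # Hypothetical illustration only; not part of Mix.
  # Returns {mtime, inode, size} with mtime as POSIX seconds.
  def fingerprint(path) do
    %File.Stat{mtime: mtime, inode: inode, size: size} = File.stat!(path, time: :posix)
    {mtime, inode, size}
  end

  # Any difference counts as a change, including an mtime that moved backwards.
  def changed?(recorded, path), do: recorded != fingerprint(path)
end
```

Because the comparison is `!=`, a file whose mtime was set to an earlier time (for example after a clock adjustment or a cache restore) is still detected as changed.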

As far as the manifest goes, it does not include any metadata to validate that it is not corrupt. In the cache error scenarios we have investigated, the manifest file was missing data it should have included. I looked for a hash of the inputs used to construct the manifest but there is none, so under some circumstances it is unable to tell that the manifest is broken.
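
One possible shape for such a check, sketched under the assumption that the manifest is a serialized Erlang term: prefix the payload with a checksum and verify it on read. `ManifestSketch` is a hypothetical module and this is not the actual manifest format. It also checksums the stored payload rather than the inputs used to build it, so it only detects truncation or corruption of the file itself:

```elixir
defmodule ManifestSketch do
  # Hypothetical illustration only; the real manifest layout is different.
  # File layout: 16-byte md5 of the payload, followed by the payload.
  def write(path, term) do
    payload = :erlang.term_to_binary(term)
    File.write!(path, [:erlang.md5(payload), payload])
  end

  def read(path) do
    case File.read(path) do
      {:ok, <<checksum::binary-size(16), payload::binary>>} ->
        if :erlang.md5(payload) == checksum do
          {:ok, :erlang.binary_to_term(payload)}
        else
          {:error, :corrupt}
        end

      {:ok, _too_short} ->
        {:error, :corrupt}

      {:error, reason} ->
        {:error, reason}
    end
  end
end
```

A caller that receives `{:error, :corrupt}` could discard the manifest and fall back to a full recompile instead of trusting partial data.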

I plan to apply a set of patches to greatly expand cache validation at startup and compare results between old and new, so that the next time this bug strikes I have more data.

Why does the cache use the working directory of the repo as its cache key? It seems to be an unforced barrier to relocatable trees, what problem does it solve?

This patch is against 1.19.1; I can provide a replacement against the latest build if you wish.

@josevalim josevalim changed the base branch from main to v1.19 October 28, 2025 08:45
@josevalim
Member

Hi @demerphq! Thank you for the PR and the tests. Those are great!

There were indeed bugs in this part of the code, some of them fixed as part of v1.19 here: cb108d5

I will run your tests locally and verify if there are cases we did not fully cover yet.

> Why does the cache use the working directory of the repo as its cache key? It seems to be an unforced barrier to relocatable trees, what problem does it solve?

Because .beam files point to their source, so if you relocate them, then they point to invalid sources unless recompiled. We have been considering options to allow opting-out from this check.

@josevalim
Member

Ok, both tests fail in main, great job! I have pushed three commits to main:

  1. --no-check-cwd to disable cwd checks
  2. Strict comparison of timestamps to improve reliability
  3. Make sure manifests are deleted before modules are cleaned

That solves one of the tests. Regarding .beam files being removed, I am not yet convinced. From the four root causes:

a. "Partial cache restore in CI" - CI should validate the restore is not partial by also hashing the contents, as we did with our manifest
b. "Manual deletion of .beam files" - don't change build artifacts directly, as removing .beam files is not the only aspect that may go wrong
c. "Corrupted _build directory" - same as above. If we were to check all outputs on every command, it would slow down the whole CLI experience. Perhaps something opt-in?
d. "Partial mix clean operations" - addressed by making sure manifest is deleted first
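
For root cause (a), a CI job could record a digest of the build directory when saving the cache and compare it after restoring. The following is only a sketch of that idea: `RestoreCheckSketch` and `tree_digest/1` are hypothetical names, not a Mix feature, and md5 is chosen purely to keep the example dependency-free:

```elixir
defmodule RestoreCheckSketch do
  # Hypothetical illustration only; not part of Mix or any CI tool.
  # Digest over every regular file under `dir`: relative path plus content hash.
  def tree_digest(dir) do
    dir
    |> Path.join("**/*")
    |> Path.wildcard(match_dot: true)
    |> Enum.filter(&File.regular?/1)
    |> Enum.sort()
    |> Enum.map(fn file -> {Path.relative_to(file, dir), :erlang.md5(File.read!(file))} end)
    |> :erlang.term_to_binary()
    |> :erlang.md5()
    |> Base.encode16(case: :lower)
  end
end
```

If the digest computed after a restore differs from the one stored alongside the cache, the restore was partial or altered and the job can discard the directory rather than let the compiler trust it.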

So I can see us adding more validations for output consistency, especially for when builds are restored from external stores, but at the moment I'd see that as an opt-in feature. @demerphq said they'd like to explore content addressable caches as well, which is something we'd welcome feedback on as part of new features.

@josevalim josevalim closed this in 1a2c0a2 Oct 28, 2025
josevalim added a commit that referenced this pull request Oct 28, 2025