Compile cache bugs - missing beam file test. #14866
Closed
+185
−0
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
We have been regularly encountering bugs in the caching logic of 1.18.3, typically related to migrating modules, but also in other circumstances. I believe that the cache logic contains flaws that are triggered under the heavier load of our 20k+ modules.
While investigating this I did a deep analysis of the cache logic, and was able to create the following tests for two problems with the cache. This logic from compiler/elixir.ex in particular seems incorrect:
this code guards checking if the beam file exist if the mtime are not appropriately ordered. If the mtime is unchanged then it fails to notice missing beam files.
I noticed the cache logic embeds the assumption that system time is strictly increasing, which is not correct. Clock drift is a real thing, and system times are adjusted. IMO all ordering comparisons on time should be changed to straight out inequality statemens, and some like the one above should just be removed. Personally i dont think there is much to be won by guarding the digest_changed() function with an mtime change check the way this code does. It seems to be it is just a cache bug waiting to happen.
I'd expect this code to look like this:
I'd also consider coupling inode with mtime. Don't just use mtime alone.
As far as the manifest goes, it does not include any metadata to validate that it is not corrupt. We have seen that in the cache error scenarios we have investigated that the manifest file misses data it should include. I looked for a hash of the inputs used to construct the manifest but there is none, so under some circumstances it is unable to tell the manifest is broken.
I plan to apply a set of patches to greatly expand cache validation at startup and compare results between old and new, so that the next time this bug strikes I have more data.
Why does the cache use the working directory of the repo as its cache key? It seems to be an unforced barrier to relocatable trees, what problem does it solve?
This patch is against 1.19.1, i can provide a relacement against the latest build if you wish.