Fail cache downloads when required, retry on memory alloc fail #1
Issue
Cache downloads can fail when containers are under memory pressure. This happens fairly regularly with our jobs, particularly when the cached ZIP contains a lot of files, such as a node_modules cache.
The problem is that an unzip failure does NOT fail the job, yet the cache is effectively corrupt: files will be missing or truncated.
My theory is that this is due to GC pressure. The extract code unzips all of the files, iterating the archive in a loop. From inspecting the Golang `flate` library, there does not appear to be pooling between the decompressors, i.e. they each allocate their own memory. It looks like they try to use fairly small buffers, but it's not clear how much they might hold at a given time. In any case, with an archive that may have 100K files, this loop will create decompressors quickly and GC may fall behind.
So this fix retries when memory allocation fails.
Separately, it introduces a `required` field on `cache` so that failures to unzip the cache will fail the job immediately. This seems like the ideal behavior (a broken cache is going to be a broken build), but to make this non-breaking this field is introduced as a boolean.
Symptom
These failures are intermittent; in many cases the job succeeds when retried.
Added tests, etc.
Might be worth splitting into 2 PRs, one for the extractor, one for the command / cache changes.
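To illustrate the cache half of the change, here is a minimal sketch of how a boolean `required` flag could decide whether an unzip failure fails the job. `CacheSettings` and `handleRestoreError` are hypothetical names rather than the types in this PR, and the sketch assumes the flag defaults to false so that existing jobs keep the current behavior.

```go
package main

import (
	"errors"
	"fmt"
)

// CacheSettings is a hypothetical stand-in for the job's cache configuration.
// Required is false by default, so existing jobs keep the old behavior.
type CacheSettings struct {
	Required bool
}

// errUnzip is a sample failure used for the demonstration below.
var errUnzip = errors.New("unzip failed: truncated file")

// handleRestoreError decides what a failed cache restore means for the job.
// It returns a non-nil error only when the job should fail.
func handleRestoreError(c CacheSettings, restoreErr error) error {
	if restoreErr == nil {
		return nil
	}
	if c.Required {
		return fmt.Errorf("required cache could not be restored: %w", restoreErr)
	}
	// Old behavior: report the problem and carry on without the cache.
	fmt.Println("warning: cache restore failed, continuing:", restoreErr)
	return nil
}

func main() {
	fmt.Println("optional:", handleRestoreError(CacheSettings{}, errUnzip))
	fmt.Println("required:", handleRestoreError(CacheSettings{Required: true}, errUnzip))
}
```

Keeping the zero value as "not required" is what makes the field non-breaking: a job that never sets it behaves as before.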