Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fail cache downloads when required, retry on memory alloc fail #1

Closed
wants to merge 2 commits into from

Conversation

shawnburke
Copy link

@shawnburke shawnburke commented Feb 4, 2022

Issue

Cache downloads can fail when containers have memory pressure. This happens somewhat regularly with our jobs, particularly when the cache has a lot of files in the ZIP, such as a node_modules cache.

The problem is that these do NOT fail the job on the cache unzip failure but the cache is effectively corrupt: files will be missing or truncated.

My theory is that this is due to GC pressure. The extract code unzips all of the files, iterating the archive in a loop. Inspecting the Golang flate library, there does not appear to be pooling between the decompressors, eg. they each allocate their own memory. It looks like they try to use fairly small buffers, but its not clear how much they might store at a given time. In any case, with an archive that may have 100K files, this loop will create these things quickly and GC may get behind.

So this fix:

  1. Detects memory errors
  2. Retries memory allocation errors 3 times
  3. Waits 1s and triggers a GC on memory error

Separately, it introduces a required field on cache so that failures to unzip the cache will fail the job immediately. This seems like the ideal behavior (a broken cache is going to be a broken build), but to make this non-breaking this field is introduced as a boolean.

Symptom

Downloading cache.zip from https://s3.dualstack.us-west-2.amazonaws.com/ci-gitlab.foo.com/project/3304/ui-cache 
WARNING: ui/node_modules/date-fns/locale/ta/_lib/match/index.js: write ui/node_modules/date-fns/locale/ta/_lib/match/index.js: cannot allocate memory (suppressing repeats) 
WARNING: logn-ui/node_modules/date-fns/locale/ta/index.d.ts: read ../../../../../../cache/...-monorepo/cache.zip: cannot allocate memory (suppressing repeats) 
/scripts-dafas-2051793280/restore_cache: line 203:   113 Killed                  '/usr/bin/gitlab-runner-helper' "cache-extractor" 

These are intermittent and will succeed on retrying the job in many cases.

Added tests, etc.

Might be worth splitting into 2 PRs, one for the extractor, one for the command / cache changes.

@shawnburke shawnburke force-pushed the fail-memory-errors branch 2 times, most recently from f211f8e to 7e98bb3 Compare February 4, 2022 01:50
@Reprazent
Copy link
Member

@shawnburke Apologies, this repository is a mirror of https://gitlab.com/gitlab-org/gitlab-runner where the development happens. Would you mind opening a merge request there?

@Valentinodelicious
Copy link

  • ********

@StingRayZA
Copy link
Member

Closing this PR, would be greate to see you on gitlab.com...

@StingRayZA StingRayZA closed this Oct 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
4 participants