
Consider mitigation strategies for workflows facing OOM issues #1918

@cotti


A couple of times now, the AWS CLI has been killed during the production workflow in the "Update CloudFront KeyValueStore redirects" (docs-assembler deploy update-redirects) step. The cause appears to be the underlying system running out of memory, which is reflected in the exit code reported by the CLI: 137, i.e. 128 + 9, meaning the process was terminated by SIGKILL, which is how the kernel OOM killer terminates processes.

The main suspect so far is bad luck with a noisy neighbor, but it's still odd that this never happened before and then happened twice in one week, so we can't rule out that we're simply hitting the runner's memory limits now. As far as I understand, the tooling in the earlier steps consistently exits cleanly, but maybe its memory somehow isn't freed from the runner's perspective...? Our input in this step doesn't seem to be a factor: we hit an OOM while running aws cloudfront describe-key-value-store to acquire an ARN, which is a much cheaper call than the paginated update that happens later in the process.
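
If it helps, here's a minimal sketch of how we could confirm the OOM killer is actually responsible, assuming a Linux runner where we can read the kernel ring buffer after the failing command (the store name is a placeholder):

```sh
# Run the call that has been dying and capture its exit code.
aws cloudfront describe-key-value-store --name docs-redirects  # placeholder name
status=$?

# 137 = 128 + 9: the process was SIGKILLed. On Linux the kernel OOM
# killer logs every kill to the ring buffer, so grep for its traces.
if [ "$status" -eq 137 ]; then
  sudo dmesg | grep -iE 'out of memory|oom-killer|killed process' || true
fi
exit "$status"
```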

How should we tackle this?

  • Can we increase the runner's available memory?
  • Add a swapfile/partition? (see the sketch after this list)
  • Add memory consumption logs/metrics? (also sketched below)
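
For the last two options, a rough sketch of what the workflow step could look like, assuming a Linux runner with sudo available; the swap size, sample interval, and swapfile path are all placeholders:

```sh
# Option 2: add a 2 GiB swapfile as a pressure valve (size is a guess).
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Option 3: sample memory usage in the background while the step runs,
# so the next OOM leaves a trail we can inspect in the workflow logs.
( while sleep 5; do
    printf '%s ' "$(date -u +%FT%TZ)"
    free -m | sed -n '2p'   # the "Mem:" line: total/used/free in MiB
  done ) &
sampler_pid=$!

docs-assembler deploy update-redirects   # the step that has been OOM-killed

kill "$sampler_pid"
```

Swap would only paper over a leak, though; the sampler output should at least tell us whether memory climbs steadily across steps or spikes inside this one.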
