
Consider mitigation strategies for workflows facing OOM issues #1918

@cotti


A couple of times now, the AWS CLI has been killed during the production workflow in the "Update CloudFront KeyValueStore redirects" (docs-assembler deploy update-redirects) step. The cause appears to be the underlying system running out of memory, which is reflected in the exit code reported by the CLI: 137, i.e. 128 + 9, meaning the process was terminated by SIGKILL, which is how the kernel OOM killer terminates processes.

The main suspect so far is bad luck with a noisy neighbor, but it's still odd that this never happened before and then happened twice in one week, so we can't rule out that we're simply hitting the runner's memory limits now. As far as I understand, the tooling in the earlier steps consistently exits cleanly, but maybe its memory somehow isn't freed from the runner's perspective...? Our input in this step doesn't seem to be a factor: we hit an OOM while running aws cloudfront describe-key-value-store to acquire an ARN, which is a much cheaper call than the paginated update that happens later in the process.
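
If it helps, here's a minimal sketch of how we could confirm the OOM killer is actually responsible, assuming a Linux runner where we can read the kernel ring buffer after the failing command (the store name is a placeholder):

```sh
# Run the call that has been dying and capture its exit code.
aws cloudfront describe-key-value-store --name docs-redirects  # placeholder name
status=$?

# 137 = 128 + 9: the process was SIGKILLed. On Linux the kernel OOM
# killer logs every kill to the ring buffer, so grep for its traces.
if [ "$status" -eq 137 ]; then
  sudo dmesg | grep -iE 'out of memory|oom-killer|killed process' || true
fi
exit "$status"
```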

How should we tackle this?

  • Can we increase the runner's available memory?
  • Add a swapfile/partition? (see the sketch after this list)
  • Add memory consumption logs/metrics? (also sketched below)
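
For the last two options, a rough sketch of what the workflow step could look like, assuming a Linux runner with sudo available; the swap size, sample interval, and swapfile path are all placeholders:

```sh
# Option 2: add a 2 GiB swapfile as a pressure valve (size is a guess).
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Option 3: sample memory usage in the background while the step runs,
# so the next OOM leaves a trail we can inspect in the workflow logs.
( while sleep 5; do
    printf '%s ' "$(date -u +%FT%TZ)"
    free -m | sed -n '2p'   # the "Mem:" line: total/used/free in MiB
  done ) &
sampler_pid=$!

docs-assembler deploy update-redirects   # the step that has been OOM-killed

kill "$sampler_pid"
```

Swap would only paper over a leak, though; the sampler output should at least tell us whether memory climbs steadily across steps or spikes inside this one.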
