OOM status not always reported for Task/Batch APIs #1806

@RobertLucian

Description

There are several cases in which an OOM error is not reported by cortex get:

  1. The logs show exit code 137/236/237/350/363/370, but the container status shows exit code 0 with reason OOMKilled, and the Job is marked as Successful.
  2. The logs show exit code 137/236/237/350, but the container status shows exit code 0 with reason OOMKilled, and the Job is marked as Failed.
  3. The pod is evicted by the k8s engine and the Job is marked as Successful, but the pod reason carries a message along the lines of “memory was too low, had to be evicted”.
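
The three cases above share one property: the container exit code and the Job status cannot be trusted on their own, so a check would have to look at the termination reason and the pod-level eviction reason instead. A minimal sketch of such a check follows. It is not Cortex's implementation; the function name `oom_reason` is hypothetical, and it assumes the pod is available as a dict shaped like the JSON that the Kubernetes API returns for a Pod (`status.reason`, `status.message`, `status.containerStatuses[].state.terminated`).

```python
from typing import Any, Dict, Optional


def oom_reason(pod: Dict[str, Any]) -> Optional[str]:
    """Return a short description if the pod shows signs of an OOM, else None.

    Hypothetical helper: `pod` is assumed to be a Kubernetes Pod object as a dict.
    """
    status = pod.get("status", {})

    # Case 3: the pod was evicted; the memory hint is only in the message text.
    if status.get("reason") == "Evicted":
        message = status.get("message", "")
        if "memory" in message.lower():
            return f"evicted: {message}"

    # Cases 1 and 2: ignore the exit code (it may be 0) and the Job status,
    # and look only at the termination reason of each container.
    for cs in status.get("containerStatuses", []):
        for key in ("state", "lastState"):
            terminated = (cs.get(key) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                return (
                    f"container {cs.get('name')} OOMKilled "
                    f"(exit code {terminated.get('exitCode')})"
                )
    return None
```

Checking `lastState` as well as `state` is a design choice in this sketch, so that a container that was restarted after being killed is still caught.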

Reproducibility

Set a very low memory request in the cortex.yaml config, then create a large numpy array in the job.
Submit the job and observe that the job status is not set to OOM.
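
For the allocation step, something along these lines should be enough to exceed a small memory limit. This is a sketch only: the function name `allocate` and the size in the usage note are illustrative assumptions, and wiring it into the Task/Batch job code is left out. It uses `np.ones` rather than `np.zeros` because zero-filled arrays may not have their memory committed until it is written to.

```python
import numpy as np


def allocate(gib: float) -> np.ndarray:
    """Allocate roughly `gib` GiB of float64 values and write to all of them."""
    n_elements = int(gib * 1024**3) // 8  # 8 bytes per float64
    # np.ones writes to every element, so the memory is actually used.
    return np.ones(n_elements, dtype=np.float64)
```

Calling `allocate(4.0)` inside the job, with a memory request well below 4 GiB, is one way to trigger the kill.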

Labels

bug: Something isn't working
