Description
There are several cases in which an OOM error is not reported by cortex get:
- The logs show exit code 137/236/237/350/363/370, but the container status shows exit code 0
with reason OOMKilled, and the Job is marked as Successful.
- The logs show exit code 137/236/237/350, but the container status shows exit code 0
with reason OOMKilled, and the Job is marked as Failed.
- The pod is evicted by Kubernetes and the Job is marked as Successful, even though the pod reason carries a message along the lines of “memory was too low, had to be evicted”.
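The cases above suggest that the container exit code alone is not a reliable signal, since it is 0 in the OOMKilled cases. A minimal sketch of a check that looks at the reasons instead is shown below. It assumes a pod dict shaped like the Kubernetes Pod JSON (status.reason, status.containerStatuses[].state.terminated.reason); the function name classify_pod and its return values are hypothetical and are not part of cortex.

```python
def classify_pod(pod):
    """Return "OOM", "EVICTED", or None for a Kubernetes-style pod dict.

    Hypothetical helper: it inspects the termination/eviction reasons and
    deliberately ignores exitCode, which is 0 in the cases reported here.
    """
    status = pod.get("status") or {}

    # Evicted pods carry the reason at the pod level.
    if status.get("reason") == "Evicted":
        return "EVICTED"

    # OOM-killed containers carry the reason in the terminated state,
    # either the current state or the last one.
    for cs in status.get("containerStatuses") or []:
        for key in ("state", "lastState"):
            terminated = (cs.get(key) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                return "OOM"

    return None
```

With a check like this, a pod whose container reports exit code 0 and reason OOMKilled would still be classified as OOM, regardless of whether the Job itself is marked as Successful or Failed.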
Reproducibility
Set a very low memory request in the cortex.yaml config, then create a large numpy array in the job.
Submit the job and observe that the job status is not set to OOM.
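A minimal sketch of job code that should trigger the condition is shown below. The function name allocate and the 8 GiB size in the comment are illustrative assumptions; any size well above the configured memory request should do.

```python
import numpy as np


def allocate(n_bytes):
    """Allocate an array of n_bytes bytes and write to every element.

    np.ones is used instead of np.zeros so that the memory is actually
    written, not just reserved.
    """
    return np.ones(n_bytes, dtype=np.uint8)


# In the job, call this with a size far above the memory request, e.g.:
#   big = allocate(8 * 1024**3)  # 8 GiB, an arbitrary illustrative size
```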