How can I update pod request memory and restart it when Pod OOMKilled #4493
Comments
Alternative: #3106
There was a request to retry any failed workflow pod using different parameters (e.g. resource requests/limits). #3106 is about helping identify these prior to execution.
Thanks @alexec. Is there a convenient way to get the pod message?
I agree with @zen-xu. We have the same issue. In our case, we could use input file size as a rule of thumb to set memory limits, but even then, there are some workloads where memory usage depends to a great extent on the contents (entropy, even) of the files. I am not able to distinguish authentic failures (the job returning exit code 1) from the kernel killing our pods, so the risk is that we might end up retrying something that will fail at any memory size. Exposing the message would go a long way. Would a PR be accepted if I wrote one?
@therc you're more than welcome to create a PR. There is a hacky workaround: you can have a pre-step that determines the resources for the processing step and passes them in via podSpecPatch (see the sketch below).
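For concreteness, here is a minimal sketch of that workaround: a two-step workflow where the first step prints a memory size to stdout and the second step consumes it through podSpecPatch. The template and parameter names (decide-memory, process, mem) are made up for illustration.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: oom-presize-
spec:
  entrypoint: main
  templates:
    - name: main
      steps:
        - - name: decide
            template: decide-memory
        - - name: process
            template: process
            arguments:
              parameters:
                - name: mem
                  value: "{{steps.decide.outputs.result}}"

    # Pre-step: emit the memory size to use (hard-coded here; in practice
    # you might derive it from e.g. input file size).
    - name: decide-memory
      script:
        image: alpine:3.18
        command: [sh]
        source: |
          echo "2Gi"

    # Processing step: patch its own pod spec with the computed request.
    - name: process
      inputs:
        parameters:
          - name: mem
      podSpecPatch: |
        containers:
          - name: main
            resources:
              requests:
                memory: "{{inputs.parameters.mem}}"
      container:
        image: alpine:3.18
        command: [sh, -c]
        args: ["echo processing with {{inputs.parameters.mem}}"]
```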
@sarabala1979 I use podSpecPatch to set a higher limit when retrying. What I need to avoid is re-running a graph node that didn't run out of memory. Some of these workloads use GPUs, so doing all this useless extra work can get expensive quickly. Both the code and the data in a given workflow are very experimental and thus have a high degree of variability, so we'd probably want more than one retry. Ideally, I'd create retry nodes conditioned on a pair of booleans (node failed, node OOM-killed), or just the second boolean, since that would imply a Failed status.
@simster7 was previously exploring adding condition-based retries.
I recently added a retry policy in #4999 that allows retrying only on transient errors. You can also specify a regular expression that captures additional errors to be treated as transient and retryable (based on node messages like "OOMKilled" that you are interested in); see TRANSIENT_ERROR_PATTERN in https://github.com/argoproj/argo-workflows/blob/master/docs/environment-variables.md. Is this what you need?
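For anyone landing here, a minimal sketch of how those two pieces fit together. The pattern value "OOMKilled" is just an example; check the linked environment-variables doc for the exact matching semantics in your version.

```yaml
# 1) On the workflow-controller Deployment, mark OOM kills as transient:
env:
  - name: TRANSIENT_ERROR_PATTERN
    value: "OOMKilled"

# 2) In the workflow template, retry only on transient errors:
retryStrategy:
  limit: "3"
  retryPolicy: OnTransientError
```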
That's close. Is the retry counter exposed to the templates? Say you want the first retry to use twice as much memory as the original attempt. We don't want to rerun with the same resources if there was an OOM error.
Yes, you should be able to use the variable "retries"; see https://github.com/argoproj/argo-workflows/blob/master/docs/variables.md#containerscript-templates
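A minimal sketch of combining "retries" with podSpecPatch, assuming your Argo version supports expression tags ({{=...}}) in podSpecPatch; the image and command here are hypothetical.

```yaml
templates:
  - name: heavy-task
    retryStrategy:
      limit: "3"
    # Scale the memory limit with the attempt number: 512Mi, 1024Mi, 1536Mi, ...
    podSpecPatch: |
      containers:
        - name: main
          resources:
            limits:
              memory: "{{= (sprig.int(retries) + 1) * 512 }}Mi"
    container:
      image: heavy-image:latest   # hypothetical
      command: [run-task]         # hypothetical
```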
Great! I'll give that a try tomorrow. Thank you so much.
@terrytangyuan On a slightly different topic: I think we should add an example or extend the documentation to let users know there's an OnTransientError retry policy.
Definitely. Just documented this in #5196.
Maybe I'm doing something stupid, but I can't get "retries" and "podSpecPatch" to play nice.
Use Cases
When would you use this?
It's hard to guess how much memory our tasks need to run, so if a pod is OOMKilled, we want Argo to scale up its memory and restart it automatically.