
How can I update pod request memory and restart it when Pod OOMKilled #4493

Closed
zen-xu opened this issue Nov 9, 2020 · 14 comments
Labels: area/retryStrategy (Template-level retryStrategy), type/feature (Feature request)

Comments

@zen-xu commented Nov 9, 2020

Use Cases

When would you use this?

It's hard to guess how much memory our tasks need to run. So if a pod is OOMKilled, we want Argo to scale up its memory and restart it automatically.

zen-xu added the type/feature (Feature request) label on Nov 9, 2020
zen-xu changed the title from "How can I update one step pod request memory and restart it when Pod OOMKilled" to "How can I update pod request memory and restart it when Pod OOMKilled" on Nov 9, 2020
@alexec (Contributor) commented Nov 9, 2020

Alternative #3106

@alexec (Contributor) commented Nov 9, 2020

There was a request to retry any failed workflow pod using different parameters (e.g. resource requests/limits).

#3106 is about helping identify these prior to execution.

@zen-xu (Author) commented Nov 10, 2020

Thanks @alexec

Is there a convenient way to get the pod message?

@therc commented Feb 24, 2021

I agree with @zen-xu

We have the same issue. In our case, we could use input file size as a rule of thumb to set memory limits, but even then, there are some workloads where memory usage depends to a great extent on the contents (entropy, even) of the files. I am not able to distinguish authentic failures (the job returning exit code 1) from the kernel killing our pods, so the risk is that we might end up retrying something that will fail at any memory size. Exposing the message would go a long way. Would a PR be accepted if I wrote one?

@sarabala1979 (Member)
@therc You're more than welcome to create a PR. There is a hacky workaround: you can have a pre-step determine the resources for the processing step and pass them to the processing step via podSpecPatch.
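
A minimal sketch of that workaround (step names, images, and values are illustrative, and it assumes input parameters resolve inside podSpecPatch):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dynamic-memory-
spec:
  entrypoint: main
  templates:
    - name: main
      steps:
        - - name: pick-memory
            template: pick-memory
        - - name: process
            template: process
            arguments:
              parameters:
                - name: mem
                  value: "{{steps.pick-memory.outputs.result}}"
    # Pre-step: decide how much memory the processing step should request,
    # e.g. based on input file size. Hard-coded here for brevity.
    - name: pick-memory
      script:
        image: python:3.9-alpine
        command: [python]
        source: |
          print("1Gi")
    # Processing step: patch its pod spec with the value computed above.
    - name: process
      inputs:
        parameters:
          - name: mem
      podSpecPatch: |
        containers:
          - name: main
            resources:
              requests:
                memory: "{{inputs.parameters.mem}}"
      container:
        image: alpine:3.13
        command: [sh, -c, "echo processing"]
```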

@therc commented Feb 24, 2021

@sarabala1979 I use podSpecPatch to set a higher limit when retrying. What I need to avoid is re-running a graph node that didn't run out of memory. Some of these workloads use GPUs, so doing all this useless extra work can get expensive quickly. Both the code and the data in a given workflow are very experimental and thus have a high degree of variability, so we'd probably want more than one retry.

Ideally, I'd create retry nodes with

when: "({{tasks.previous.status}} == Failed) && (tasks.previous.message == OOMKilled)"

or just the second boolean, since that would imply a Failed status.
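
Roughly what that could look like on a DAG (a sketch of the proposal only, not an existing feature: task names are made up, and a `{{tasks.<NAME>.message}}` variable exposing strings such as OOMKilled is exactly the piece being asked for here):

```yaml
dag:
  tasks:
    - name: first-attempt
      template: process
    - name: retry-bigger
      template: process-more-memory
      # Run only if the first attempt failed...
      depends: "first-attempt.Failed"
      # ...and only if it failed because of an OOM kill. A "message"
      # variable like this is the feature being requested in this thread.
      when: "{{tasks.first-attempt.message}} == OOMKilled"
```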

@sarabala1979 (Member)
@simster7 was previously exploring adding a condition-based retryStrategy that selects on the node status and other properties.
Simon, can you share your thoughts?

@terrytangyuan (Member) commented Feb 25, 2021

I recently added a retry policy in #4999 that allows retrying only on transient errors. You can also specify a regular expression that captures additional errors to be treated as transient and retryable (based on node messages like "OOMKilled" that you are interested in); see TRANSIENT_ERROR_PATTERN in https://github.com/argoproj/argo-workflows/blob/master/docs/environment-variables.md.

Is this what you need?
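
A rough sketch of that combination (the limit and the pattern value are placeholders; TRANSIENT_ERROR_PATTERN is set on the workflow-controller container, not in the Workflow):

```yaml
# In the Workflow, retry the template only when the error is classified as transient:
retryStrategy:
  limit: "3"
  retryPolicy: OnTransientError

# On the workflow-controller container, extend what counts as transient so that
# OOM kills (matched against the node/error message) are retried:
# env:
#   - name: TRANSIENT_ERROR_PATTERN
#     value: "OOMKilled"
```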

@therc commented Feb 25, 2021 via email

@terrytangyuan (Member)
Yes, you should be able to use the variable "retries" in https://github.com/argoproj/argo-workflows/blob/master/docs/variables.md#containerscript-templates
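
For example, something along these lines should request more memory on every attempt (a sketch: it assumes the `{{=...}}` expression syntax with sprig functions available in newer releases, and the 512Mi baseline is arbitrary):

```yaml
templates:
  - name: process
    retryStrategy:
      limit: "3"
    # "retries" is 0 on the first attempt, 1 on the first retry, and so on,
    # so each attempt requests progressively more memory.
    podSpecPatch: |
      containers:
        - name: main
          resources:
            requests:
              memory: "{{=(sprig.int(retries) + 1) * 512}}Mi"
    container:
      image: alpine:3.13
      command: [sh, -c, "echo processing"]
```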

@therc commented Feb 25, 2021 via email

@tczhao (Member) commented Feb 25, 2021

@terrytangyuan On a slightly different topic, I think we'd better add an example or extend the documentation to let users know there's an OnTransientError retry policy.

@terrytangyuan (Member)
> @terrytangyuan On a slightly different topic, I think we'd better add an example or extend the documentation to let users know there's an OnTransientError retry policy.

Definitely. Just documented this in #5196.

@therc commented Feb 26, 2021

> Yes, you should be able to use the variable "retries" in https://github.com/argoproj/argo-workflows/blob/master/docs/variables.md#containerscript-templates

Maybe I'm doing something stupid, but I can't get "retries" and "podSpecPatch" to play nice.
#5219

alexec closed this as completed on Oct 1, 2021
agilgur5 added the area/retryStrategy (Template-level retryStrategy) label on Apr 25, 2024