How can I update pod request memory and restart it when Pod OOMKilled #4493
Comments
Alternative: #3106
There was a request to retry any failed workflow pod using different parameters (e.g. resource requests/limits). #3106 is about helping identify these prior to execution.
Thanks @alexec. Is there a convenient way to get the pod message?
I agree with @zen-xu. We have the same issue. In our case, we could use input file size as a rule of thumb to set memory limits, but even then, there are some workloads where memory usage depends to a great extent on the contents (entropy, even) of the files. I am not able to distinguish authentic failures (the job returning exit code 1) from the kernel killing our pods, so the risk is that we might end up retrying something that will fail at any memory size. Exposing the message would go a long way. Would a PR be accepted if I wrote one?
@therc you're more than welcome to create a PR. There is a hacky workaround: you can have a pre-step that determines the resources for the processing step and passes them in via podSpecPatch (see the sketch below).
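For concreteness, here is a minimal sketch of that workaround: a two-step workflow where the first step prints a memory size to stdout and the second step consumes it through podSpecPatch. The template and parameter names (decide-memory, process, mem) are made up for illustration.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: oom-presize-
spec:
  entrypoint: main
  templates:
    - name: main
      steps:
        - - name: decide
            template: decide-memory
        - - name: process
            template: process
            arguments:
              parameters:
                - name: mem
                  value: "{{steps.decide.outputs.result}}"

    # Pre-step: emit the memory size to use (hard-coded here; in practice
    # you might derive it from e.g. input file size).
    - name: decide-memory
      script:
        image: alpine:3.18
        command: [sh]
        source: |
          echo "2Gi"

    # Processing step: patch its own pod spec with the computed request.
    - name: process
      inputs:
        parameters:
          - name: mem
      podSpecPatch: |
        containers:
          - name: main
            resources:
              requests:
                memory: "{{inputs.parameters.mem}}"
      container:
        image: alpine:3.18
        command: [sh, -c]
        args: ["echo processing with {{inputs.parameters.mem}}"]
```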
@sarabala1979 I use podSpecPatch to set a higher limit when retrying. What I need to avoid is re-running a graph node that didn't run out of memory. Some of these workloads use GPUs, so doing all this useless extra work can get expensive quickly. Both the code and the data in a given workflow are very experimental and thus have a high degree of variability, so we'd probably want more than one retry. Ideally, I'd create retry nodes conditioned on a pair of booleans (node failed, node OOM-killed), or just the second boolean, since that would imply a Failed status.
@simster7 was previously exploring adding condition-based retries.
I recently added a retry policy in #4999 that allows retrying only on transient errors. You can also specify a regular expression that captures additional errors to be treated as transient and retryable (based on node messages like "OOMKilled" that you are interested in); see TRANSIENT_ERROR_PATTERN in https://github.com/argoproj/argo-workflows/blob/master/docs/environment-variables.md. Is this what you need?
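For anyone landing here, a minimal sketch of how those two pieces fit together. The pattern value "OOMKilled" is just an example; check the linked environment-variables doc for the exact matching semantics in your version.

```yaml
# 1) On the workflow-controller Deployment, mark OOM kills as transient:
env:
  - name: TRANSIENT_ERROR_PATTERN
    value: "OOMKilled"

# 2) In the workflow template, retry only on transient errors:
retryStrategy:
  limit: "3"
  retryPolicy: OnTransientError
```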
That's close. Is the retry counter exposed to the templates? Say you want the first retry to use twice as much memory as the original attempt. We don't want to rerun with the same resources if there was an OOM error.
Yes, you should be able to use the variable "retries"; see https://github.com/argoproj/argo-workflows/blob/master/docs/variables.md#containerscript-templates
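A minimal sketch of combining "retries" with podSpecPatch, assuming your Argo version supports expression tags ({{=...}}) in podSpecPatch; the image and command here are hypothetical.

```yaml
templates:
  - name: heavy-task
    retryStrategy:
      limit: "3"
    # Scale the memory limit with the attempt number: 512Mi, 1024Mi, 1536Mi, ...
    podSpecPatch: |
      containers:
        - name: main
          resources:
            limits:
              memory: "{{= (sprig.int(retries) + 1) * 512 }}Mi"
    container:
      image: heavy-image:latest   # hypothetical
      command: [run-task]         # hypothetical
```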
Great! I'll give that a try tomorrow. Thank you so much.
@terrytangyuan On a slightly different topic: I think we should add an example or extend the documentation to let users know there's an OnTransientError retry policy.
Definitely. Just documented this in #5196.
Maybe I'm doing something stupid, but I can't get "retries" and "podSpecPatch" to play nice.
Use Cases
When would you use this?
It's hard to guess how much memory our tasks need to run, so if a pod is OOMKilled, we want Argo to scale up its memory and restart it automatically.