
Workflow steps fail with a 'pod deleted' message. #4565

Closed · ebr opened this issue Nov 19, 2020 · 40 comments
Labels: type/bug, type/regression (regression from previous behavior)

Comments


ebr commented Nov 19, 2020

Summary

Maybe related to #3381?

Some of the workflow steps end up in the Error state with the message pod deleted. I am not sure which of the following data points are relevant, but listing all observations:

  • the workflow uses PodGC: strategy: OnPodSuccess (see the spec sketch after this list)
  • we are seeing this for ~5% of workflow steps
  • affected steps are part of a withItems loop
  • the workflow is not large - ~170 to 300 concurrent nodes
  • this has been observed since deploying v2.12.0rc2 yesterday, including the v2.12.0rc2 executor image; we were previously on v2.11.6 and briefly on v2.11.7, and did not see this there
  • k8s events confirm the pods ran to completion
  • cluster scaling has been ruled out as the cause - this is observed on multiple k8s nodes, all of which are still running
  • we have not tried the same workflow without PodGC yet
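
For reference, a minimal sketch of the kind of spec involved - the podGC strategy plus a withItems loop - with illustrative names and images, not the real workflow:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: pod-deleted-repro-
spec:
  entrypoint: main
  # Successful pods are garbage-collected as soon as they finish
  podGC:
    strategy: OnPodSuccess
  templates:
    - name: main
      steps:
        - - name: process
            template: process-item
            arguments:
              parameters:
                - name: item
                  value: "{{item}}"
            # ~170-300 of these run concurrently in the affected workflow
            withItems: ["a", "b", "c"]
    - name: process-item
      inputs:
        parameters:
          - name: item
      container:
        image: alpine:3.12
        command: [sh, -c]
        args: ["echo processing {{inputs.parameters.item}}"]
```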

Diagnostics

What Kubernetes provider are you using?

docker

What version of Argo Workflows are you running?

v2.12.0rc2 for all components


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@ebr ebr added the type/bug label Nov 19, 2020

alexec commented Nov 19, 2020

I'm betting this is caused by #4024, maybe when it interacts with #4025.


alexec commented Nov 19, 2020

Can you please try running your controller with INFORMER_WRITE_BACK=false?
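
For anyone following along, that flag is set as an environment variable on the workflow-controller Deployment; a minimal sketch, assuming the container name from a standard install (fragment only, not a complete manifest):

```yaml
spec:
  template:
    spec:
      containers:
        - name: workflow-controller
          env:
            # Unsupported development flag, used here only for diagnosis
            - name: INFORMER_WRITE_BACK
              value: "false"
```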

@alexec alexec added the type/regression Regression from previous behavior (a specific type of bug) label Nov 19, 2020
@alexec alexec added this to the v2.12 milestone Nov 19, 2020
@alexec alexec self-assigned this Nov 19, 2020

ebr commented Nov 19, 2020

Can confirm that we are seeing zero occurrences of this with INFORMER_WRITE_BACK=false, on a 10x larger workflow. We'll leave it like this for now. Are there any performance/reliability issues to look out for with this flag?


alexec commented Nov 19, 2020

INFORMER_WRITE_BACK=false only points to a solution - it is not the solution itself, and it comes with a performance cost.


alexec commented Nov 19, 2020

If I create an engineering build - can you please test it?


ebr commented Nov 19, 2020

Certainly! Will it also include what we discussed re: #4560?


alexec commented Nov 19, 2020

Please can you try argoproj/workflow-controller:grace and check whether a message about allowing a short grace period before marking the node as an error is logged?
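
In case it is useful to anyone reproducing this, trying the engineering build just means pointing the controller Deployment at that image tag; a sketch of the relevant fragment, with the container name assumed from a standard install:

```yaml
containers:
  - name: workflow-controller
    # Engineering build under test in this thread
    image: argoproj/workflow-controller:grace
    # Always pull so a re-pushed :grace tag is picked up
    imagePullPolicy: Always
```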


ebr commented Nov 19, 2020

@alexec should I also remove the INFORMER_WRITE_BACK envvar?


alexec commented Nov 19, 2020

Yes. Remove INFORMER_WRITE_BACK.


ebr commented Nov 19, 2020

Still seeing ~3% of pods being deleted on :grace, and no log messages with grace-period. I'd like to revert to 2.12.0rc2 and set INFORMER_WRITE_BACK=false for now - is that safe aside from the performance cost?


alexec commented Nov 19, 2020

@ebr - no - that is an unsupported development flag.

This is a blocking issue for v2.12 and does need to be fixed.


alexec commented Nov 19, 2020

I've just updated the build with a longer grace period and more logging. Could you please try it out?


ebr commented Nov 19, 2020

No longer seeing this on :grace! But no log messages either.


alexec commented Nov 19, 2020

Can you tell me if you see recentlyStarted in your logs?


ebr commented Nov 19, 2020

No recentlyStarted... I'm going to try reproducing this again on v2.12.0rc2.


alexec commented Nov 19, 2020

That's really odd. You should absolutely see it. What is your log level? info? Is your imagePullPolicy set to Always?


ebr commented Nov 19, 2020

Log level is debug, as we set during our Zoom troubleshooting, and I haven't changed it; imagePullPolicy is Always.
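
For anyone checking the same settings, a sketch of the relevant controller container fragment; the --loglevel flag and configmap name are assumed from a standard install rather than taken from this thread:

```yaml
containers:
  - name: workflow-controller
    image: argoproj/workflow-controller:grace
    imagePullPolicy: Always
    args:
      - --configmap
      - workflow-controller-configmap
      # Debug logging, so messages such as recentlyStarted should be visible
      - --loglevel
      - debug
```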


ebr commented Nov 19, 2020

There are definitely dozens of pods being started and completed, and none of the given strings appear in the controller logs... I double-checked that I'm still on :grace.


alexec commented Nov 19, 2020

Do you see Pod for node deleted? You should not.


ebr commented Nov 19, 2020

No, I'm not seeing that, but I'm no longer seeing deleted pods either.


alexec commented Nov 19, 2020

OK, I had not docker push-ed the very latest version. I have now pushed it, if you have time to test and check for recentlyStarted in your logs.


alexec commented Nov 19, 2020

Why am I asking again? Seeing recentlyStarted: true will give me confidence that the fix is correct, and that it's not something else that changed.


ebr commented Nov 19, 2020

Sounds good, I will try right now before signing off for the day. Thanks for all your attention to this!


ebr commented Nov 20, 2020

I wasn't able to reproduce this at all (still on :grace), and I see none of the above log messages...
Tomorrow I'm going to try reproducing this again on v2.12.0rc2.
A silly thought perhaps, but could this have been caused by another workflow? The only thing that changed throughout the day (aside from my testing) was that we deleted some previously failed workflows.


ebr commented Nov 20, 2020

Update: we did see quite a lot of deleted pods through the night - hard to say exactly how many because we were running over 12,000 pods, but it's definitely in the thousands. This is still on :grace. So I guess it's not as immediately reproducible, but it is definitely still happening.
I can't say whether there were any log messages with recentlyStarted: true, as the controller is pumping out such a massive log volume that they've already been rotated out.


alexec commented Nov 23, 2020

Not fixed - but less of a problem?


alexec commented Nov 23, 2020

Ah - the :grace version does not have any fixes from :mot (which is now merged to :latest). If I create a new engineering build of :grace that includes that - could you test them together?


alexec commented Nov 23, 2020

Done. argoproj/workflow-controller:grace has both fixes.


ebr commented Nov 24, 2020

Hey @alexec - thank you - I was offline for a couple of days, but will test it as soon as I can. It is unlikely to happen today.

Not fixed - but less of a problem?

This is correct. Definitely not as reproducible anymore. For what it's worth, my team has been running jobs on the :mot build since Friday (4 days), and many very large workflows have completed perfectly, while some smaller ones experienced pod deletion. There doesn't seem to be any systematic pattern to it - it's very intermittent. I'll try the new :grace build when it's safe to do so; we're running some really critical workloads right now so I can't promise to do it soon, unfortunately.


ebr commented Nov 25, 2020

Some more deleted pods today; I switched to :grace. Seeing this:

[screenshot]

  • significant increase in RAM usage of the workflow controller
  • still some deleted pods, very few and very intermittent
  • still not seeing any log messages like recentlyStarted: true or grace, but also due to the frequent controller restarts and infrequent pod deletions, I can't be sure.

The workflows seem to be running successfully and without changes though; I'm debating whether to revert to :mot or just leave it alone. Increased RAM usage is not an issue for us.


alexec commented Nov 25, 2020

  • recentlyStarted was logged at debug level. I've increased this to info level and published a new :grace image.
  • I've added RECENTLY_STARTED_POD_DURATION=10s; you could try setting it to a higher value, e.g. RECENTLY_STARTED_POD_DURATION=30s, to see if that improves the issue (see the sketch after this list).
  • :mot is merged to master and the fix is on :latest.
  • I would not expect to see increased RAM usage. Are you able to test :latest so we can be sure that the issue is with :grace and not something else?
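
A sketch of how that variable would be set on the controller container (only the variable name and values come from the comment above; the rest is assumed from a standard install):

```yaml
containers:
  - name: workflow-controller
    image: argoproj/workflow-controller:grace
    env:
      # Grace period before a node whose pod has gone missing is marked as
      # errored; try raising it from the 10s default if pods still disappear
      - name: RECENTLY_STARTED_POD_DURATION
        value: "30s"
```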

It is Thanksgiving at the moment - I'll look in more detail on Monday.


ebr commented Nov 25, 2020

No worries and no rush - happy Thanksgiving! I actually did change to :latest shortly after posting the last reply, as I wasn't comfortable with the spurious restarts. And yes, on :latest the restarts stopped, and the RAM usage went back down and flattened. I noticed that on :grace the RAM usage increased proportionately to the number of workflow pods in the cluster - which is not the case on :latest.

I've been at debug loglevel the entire time, so should've seen the recentlyStarted messages, but perhaps the issue is too intermittent. We don't have log aggregation in place, and tailing the logs doesn't seem to be sufficient, there's just too much volume relative to the frequency of these occurrences. Anyway, I'm going to avoid changing builds at least today and tomorrow, so this may have to wait until Monday for me as well.

Am I the only user reporting this? Perhaps it would be helpful for someone else to also try reproducing the issue, just to be sure it's not a red herring caused by something related to our EKS cluster set up (I don't see how it could be, but stranger things have happened).


alexec commented Nov 30, 2020

@ebr I think I've heard one or two users mention this (but not report it by creating a ticket), so I assume they are not seriously impacted.

Let me know how you get on with :grace with RECENTLY_STARTED_POD_DURATION=30s.


alexec commented Dec 16, 2020

I believe this has been likely fixed/mitigated on master.

@alexec alexec closed this as completed Dec 16, 2020
@vladlosev

We can confirm the issue is gone in our runs on a build off commit 625e3ce on master.


dynamicmindset commented Feb 11, 2021

Hello, we faced the same issue with Argo v2.12.8.
Garbage collection is set to "onPodCompletion".
Affected steps are part of a withItems loop, and it happens with parallelism of both 100 and 50 (haven't tried less).

[screenshot]
(36 out of 2,000 steps failed with "pod deleted")

If I change garbage collection to wait until the end of the workflow, it does not reproduce anymore (see the sketch below).
Do you have any suggestions on how I can prevent this from happening?
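
For others hitting this, the workaround described above (waiting until the end of the workflow before garbage-collecting pods) corresponds to a workflow-level podGC strategy; a minimal sketch, with the strategy name taken from the Argo Workflows docs rather than from this thread:

```yaml
spec:
  # Defer pod garbage collection until the whole workflow has finished,
  # instead of deleting each pod as soon as it completes
  podGC:
    strategy: OnWorkflowCompletion
```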

@anshumanbh

Hi @alexec, we are seeing this as well. What was the fix here? Or, what can we add to our workflow files to introduce the additional grace period?


jinnovation commented Mar 16, 2022

Hi @alexec, we are seeing this as well. What was the fix here? Or, what can we add to our workflow files to introduce the additional grace period?

+1 to this. @alexec, I -- and I'm sure others -- would appreciate a general explanation of what the issue and corresponding fixes ended up being. That'd help teams like mine who run into this issue figure out if upgrading our version of Argo is the right remediation.

I understand though that this issue was closed nearly two years ago, which might complicate things. :P


alexec commented Mar 16, 2022


This issue was closed.