Immediately fail the task in case of worker pod having a fatal container state #37670
LGTM. small nit
But also would love to get someone else to take a look
This is a nice improvement. I was worried about the sub-reason for each reason (e.g. "pull QPS exceeded", which I'm not sure is worded the same way in all K8s versions and distributions), but since the user has the option to update the reasons list, they can fix it without upgrading the provider version if we detect another similar case or a bug.
LGTM
…ner state (apache#37670) * fail the task in case of worker pod having fatal container state * version number updated
What happened
When a worker pod's init or base containers are stuck in a pending state for a fatal container state reason, the task eventually fails and the pod is deleted. Currently, however, this happens only after worker_pods_pending_timeout elapses, even though the worker pod never recovers.
What do you think should happen instead
When a worker pod's init or base containers are stuck in a pending state for a fatal container state reason, the worker pod doesn't recover, so it doesn't make sense to wait for worker_pods_pending_timeout. Instead, mark the task as failed and delete the worker pod right away.
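
To make the proposed check concrete, here is a minimal sketch of how a pending pod's container statuses could be screened for fatal waiting reasons. It is not the code from this PR: `ContainerStatus`, `find_fatal_container`, and the contents of `FATAL_WAITING_REASONS` are hypothetical names and values chosen for illustration, and the "pull QPS exceeded" carve-out only mirrors the concern raised in the review above. The reasons list is passed as a parameter to reflect the point that users can adjust it.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Illustrative values only (assumption): the real list is user-configurable
# and may contain different reasons.
FATAL_WAITING_REASONS = frozenset(
    {"CreateContainerConfigError", "ErrImagePull", "InvalidImageName"}
)

# Assumption: some messages under an otherwise fatal reason are transient,
# such as the rate-limit case mentioned in the review.
TRANSIENT_MESSAGE_FRAGMENTS = ("pull QPS exceeded",)


@dataclass
class ContainerStatus:
    """Hypothetical, simplified view of one init/base container's state."""

    name: str
    waiting_reason: Optional[str] = None  # None when the container is not waiting
    waiting_message: str = ""


def find_fatal_container(
    statuses: Iterable[ContainerStatus],
    fatal_reasons: Iterable[str] = FATAL_WAITING_REASONS,
) -> Optional[ContainerStatus]:
    """Return the first container waiting for a fatal reason, or None."""
    fatal = set(fatal_reasons)
    for status in statuses:
        if status.waiting_reason not in fatal:
            continue  # not waiting, or waiting for a recoverable reason
        if any(frag in status.waiting_message for frag in TRANSIENT_MESSAGE_FRAGMENTS):
            continue  # likely to clear up on retry, so keep waiting
        return status
    return None
```

A caller would fail the task and delete the pod as soon as `find_fatal_container` returns a status, and fall back to the existing worker_pods_pending_timeout behavior when it returns `None`.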