kubernetes_executor does not properly fail task-instances in cleanup_stuck_queued_tasks #39078

Open
2 tasks done
waldoppper opened this issue Apr 16, 2024 · 3 comments
Labels
area:core, kind:bug, needs-triage

Comments

@waldoppper

Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.8.3

What happened?

I'm chasing a hard-to-reproduce issue in which deferrable operators get "stuck" in the Queued state.

While debugging, I noticed what seems to be a deficiency: the kubernetes_executor's override of cleanup_stuck_queued_tasks does not actually fail the task instance.

What you think should happen instead?

Background: While debugging, I focused on the logs available at the default log level, which included:

Marking task instance <TaskInstance: my_dag.my_group.my_task scheduled__2024-03-25T20:00:00+00:00 [queued]> stuck in queued as failed. If the task instance has available retries, it will be retried.

From reviewing the code that prints this message, it seems clear that a base_executor subclass is expected to fail the task instances. For reference, the celery_executor does.

Problem: The kubernetes_executor does not.

Solution: It seems to me like it should.
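
To illustrate the gap described above, here is a minimal sketch that uses hypothetical toy classes rather than Airflow's real executors. `ToyTaskInstance`, `ToyExecutor`, `event_buffer`, `deleted_pods`, and both subclasses are assumptions made up for this example. Only the method name `cleanup_stuck_queued_tasks` comes from the issue. The sketch contrasts a cleanup that only removes the worker with one that also records a failure for the scheduler to act on.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToyTaskInstance:
    """Hypothetical stand-in for a task instance; not Airflow's TaskInstance."""
    key: str
    state: str = "queued"


@dataclass
class ToyExecutor:
    """Hypothetical stand-in for an executor; not Airflow's BaseExecutor."""
    event_buffer: dict = field(default_factory=dict)
    deleted_pods: list = field(default_factory=list)

    def fail(self, key):
        # Record a failure event that a scheduler loop could later pick up.
        self.event_buffer[key] = "failed"


class CleanupWithoutFail(ToyExecutor):
    """Models the behavior reported in this issue (assumed, simplified)."""

    def cleanup_stuck_queued_tasks(self, tis):
        readable = []
        for ti in tis:
            readable.append(repr(ti))
            self.deleted_pods.append(ti.key)  # removes the worker only
        return readable


class CleanupWithFail(ToyExecutor):
    """Models the behavior the log message implies (assumed, simplified)."""

    def cleanup_stuck_queued_tasks(self, tis):
        readable = []
        for ti in tis:
            readable.append(repr(ti))
            self.deleted_pods.append(ti.key)
            self.fail(ti.key)  # also marks the task instance as failed
        return readable


stuck = [ToyTaskInstance(key="my_dag.my_task")]
without_fail = CleanupWithoutFail()
with_fail = CleanupWithFail()
reported_a = without_fail.cleanup_stuck_queued_tasks(stuck)
reported_b = with_fail.cleanup_stuck_queued_tasks(stuck)

print(without_fail.event_buffer)  # no failure recorded
print(with_fail.event_buffer)     # failure recorded for the stuck task
```

Both variants report the same task instances to the caller, so the "stuck in queued as failed" log line would look identical in either case. Only the second variant leaves a failure event behind.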

How to reproduce

The true root cause of my issue is still a mystery to me. This is an attempt at fixing this safety net.

Operating System

debian

Versions of Apache Airflow Providers

apache-airflow-providers-cncf-kubernetes==8.0.0

Deployment

Other 3rd-party Helm chart

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@waldoppper added the area:core, kind:bug, and needs-triage labels on Apr 16, 2024
@paramjeet01

@waldoppper, did you mitigate this issue? Have you found any temporary workaround?

@waldoppper
Author

We've tried manually deleting the task instance, with no luck. This makes me think that addressing this particular issue may not help us.

The only workaround I'm aware of is to disable deferral on operators.

@waldoppper
Author

@paramjeet01, we found that restarting the scheduler enables the task to start.
