Possible Race Condition caused by creation of duplicated Task Attempts in Kubernetes Executor #68071
Unanswered
GustavoBuosi
asked this question in
Q&A
Replies: 0 comments
Hello! At my company, @NiltonDuarte, @leandroszikora and I are deploying Airflow 3.1.7 for many teams and supporting their infrastructure. In the case we are reporting here, our current setup has:
apache-airflow-providers-cncf-kubernetes==10.12.3
We have already observed some issues that may affect task runtimes, and most of them were indirectly related to infra resources that interfered with Scheduler and API Server behavior.
However, in an environment that runs only on OnDemand instances in Azure Kubernetes Service, we found that some tasks still die abruptly for reasons unrelated to OOM. We observed this during one of the attempts of an ExternalTaskSensor in reschedule mode that had been running for approximately 3 hours with a timeout of 8 hours.
Two workers were created for the same sensor in a ~1 second time window.
Attempt 1 (started at 2026-06-04T00:38:48):
{"timestamp":"2026-06-04T00:38:48.358475Z","level":"info","event":"Executing workload","workload": ...}
{"timestamp":"2026-06-04T00:38:55.696277Z","level":"info","event":"Rescheduling task, marking task as UP_FOR_RESCHEDULE","logger":"task","filename":"supervisor.py","lineno":1806}
{"timestamp":"2026-06-04T00:38:55.712667Z","level":"error","event":"API server error","status_code":409,"detail":{"detail":{"reason":"invalid_state","message":"TI was not in the running state so it cannot be updated","previous_state":"failed"}}, ...}
Attempt 2 (started at 2026-06-04T00:38:49, one second later):
{"timestamp":"2026-06-04T00:38:49.686111Z","level":"info","event":"Executing workload","workload": ...}
{"timestamp":"2026-06-04T00:38:50.500902Z","level":"info","event":"Process exited","pid":17,"exit_code":-9,"signal_sent":"SIGKILL","logger":"supervisor","filename":"supervisor.py","lineno":710}
We suspect that two of our schedulers raced each other, and that the end result is behavior similar to what is described in #63183.
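For anyone who wants to check their own worker logs for the same pattern, here is a minimal sketch of how overlapping starts could be detected. It assumes each pod's log is a sequence of JSON lines with `timestamp` and `event` fields, as in the excerpts above. The pod names, the 5 second window, and the helper names (`start_times`, `close_starts`) are our own illustrative choices, and the sample lines are trimmed copies of the log entries above with the elided `workload` field left out.

```python
import json
from datetime import datetime, timedelta


def parse_ts(value: str) -> datetime:
    # fromisoformat() only accepts a trailing "Z" on Python 3.11+, so normalise it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def start_times(log_text: str) -> list:
    """Return the timestamp of every "Executing workload" event in one pod log."""
    starts = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not valid JSON
        if record.get("event") == "Executing workload":
            starts.append(parse_ts(record["timestamp"]))
    return starts


def close_starts(logs_by_pod: dict, window: timedelta = timedelta(seconds=5)) -> list:
    """Return (pod_a, pod_b, gap_seconds) for consecutive pods starting within `window`."""
    firsts = []
    for pod, text in logs_by_pod.items():
        starts = start_times(text)
        if starts:
            firsts.append((min(starts), pod))
    firsts.sort()
    pairs = []
    for (t1, p1), (t2, p2) in zip(firsts, firsts[1:]):
        if t2 - t1 <= window:
            pairs.append((p1, p2, (t2 - t1).total_seconds()))
    return pairs


# Hypothetical pod names; log lines trimmed from the two attempts above.
sample_logs = {
    "sensor-pod-a": '{"timestamp":"2026-06-04T00:38:48.358475Z","level":"info","event":"Executing workload"}',
    "sensor-pod-b": '{"timestamp":"2026-06-04T00:38:49.686111Z","level":"info","event":"Executing workload"}',
}

for pod_a, pod_b, gap in close_starts(sample_logs):
    print(f"{pod_a} and {pod_b} started {gap:.3f}s apart")
```

This only compares the first start of each pod for a single task, so the logs would need to be grouped per task instance before being passed in.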
We looked at the logs of two of our schedulers (the third did not contain relevant traces) and, interestingly, found that:
We expected Scheduler 2 to see that attempt 1 was already in UP_FOR_RESCHEDULE and therefore not to start a second run.
After searching the repo, the only similar discussion we found is #57041, which has not received a reply. Like its author, we have confirmed that use_row_level_locking=True is set.
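To make the expectation concrete, here is a toy model of the guarantee we assumed we were getting. It is not Airflow's scheduler code: the table layout, the state names used for the transition, and the `try_claim` helper are hypothetical. SQLite has no row-level `SELECT ... FOR UPDATE`, so a conditional `UPDATE` stands in for the lock. The point is only that when two schedulers try to pick up the same task instance, exactly one of them should win.

```python
import sqlite3


def try_claim(conn: sqlite3.Connection, ti_id: str, scheduler: str) -> bool:
    """Move a task instance from 'scheduled' to 'queued' if nobody else has.

    Returns True only for the caller whose UPDATE actually changed the row.
    """
    cur = conn.execute(
        "UPDATE task_instance SET state = 'queued', queued_by = ? "
        "WHERE id = ? AND state = 'scheduled'",
        (scheduler, ti_id),
    )
    conn.commit()
    return cur.rowcount == 1


# Hypothetical, heavily simplified stand-in for the metadata database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_instance (id TEXT PRIMARY KEY, state TEXT, queued_by TEXT)"
)
conn.execute("INSERT INTO task_instance VALUES ('sensor_ti', 'scheduled', NULL)")
conn.commit()

first = try_claim(conn, "sensor_ti", "scheduler-1")
second = try_claim(conn, "sensor_ti", "scheduler-2")
print(first, second)
```

In this model the second call changes no rows, so only one worker would ever be launched. What we observe looks as if both schedulers "won" for the same sensor.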
There does seem to be a PR, #63355, that would fix the issue for cases where both task runs end in success, but our scenario appears to be different.
We can provide additional logs and info if desired.
Edit 1
We've come across #60330, which would at least prevent a new task attempt id from being generated in Airflow 3.1.8+. However, there are still open PRs, such as #63355, for issues that may cause task execution inconsistencies.