Possible Race Condition caused by creation of duplicated Task Attempts in Kubernetes Executor #68071

GustavoBuosi · 2026-06-05T09:29:28Z


Hello! At my company, @NiltonDuarte, @leandroszikora and I are deploying Airflow 3.1.7 for many teams and supporting their infrastructure. The setup for the case reported here is:

Airflow 3.1.7
Python 3.13.11
Postgres Metastore
apache-airflow-providers-cncf-kubernetes==10.12.3
Linux Images
3 API Server Pods
3 Scheduler Pods
3 DAG Processor Pods

We have already observed some issues that may affect task runtimes; most of them were indirectly related to infrastructure resources interfering with Scheduler and API Server behavior.

However, in an environment that runs only on on-demand instances in Azure Kubernetes Service, we noticed that some tasks still die abruptly, for reasons unrelated to OOM issues. We observed this during one attempt of an ExternalTaskSensor in reschedule mode, which had been running for approximately 3 hours and had a timeout of 8 hours.

Two workers were created for the same sensor in a ~1 second time window.

Attempt 1 (started at 2026-06-04T00:38:48):

{"timestamp":"2026-06-04T00:38:48.358475Z","level":"info","event":"Executing workload","workload": ...}

{"timestamp":"2026-06-04T00:38:55.696277Z","level":"info","event":"Rescheduling task, marking task as UP_FOR_RESCHEDULE","logger":"task","filename":"supervisor.py","lineno":1806}
{"timestamp":"2026-06-04T00:38:55.712667Z","level":"error","event":"API server error","status_code":409,"detail":{"detail":{"reason":"invalid_state","message":"TI was not in the running state so it cannot be updated","previous_state":"failed"}}, ...}

Attempt 2 (started at 2026-06-04T00:38:49, one second later):

{"timestamp":"2026-06-04T00:38:49.686111Z","level":"info","event":"Executing workload","workload": ...}
{"timestamp":"2026-06-04T00:38:50.500902Z","level":"info","event":"Process exited","pid":17,"exit_code":-9,"signal_sent":"SIGKILL","logger":"supervisor","filename":"supervisor.py","lineno":710}
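For reference, the gap between the two starts can be computed from the worker logs with a minimal sketch like the one below. It only relies on the `timestamp` and `event` fields visible in the lines above; the `start_time` helper and the trimmed sample lines are illustrative, not part of Airflow.

```python
import json
from datetime import datetime


def start_time(log_lines):
    """Return the timestamp of the first 'Executing workload' event, or None."""
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not structured JSON logs
        if record.get("event") == "Executing workload":
            # Normalize the trailing 'Z' so older Python versions can parse it.
            return datetime.fromisoformat(record["timestamp"].replace("Z", "+00:00"))
    return None


# Sample lines trimmed to the fields shown in the report (workload payload omitted).
attempt_1 = ['{"timestamp":"2026-06-04T00:38:48.358475Z","level":"info","event":"Executing workload"}']
attempt_2 = ['{"timestamp":"2026-06-04T00:38:49.686111Z","level":"info","event":"Executing workload"}']

gap = (start_time(attempt_2) - start_time(attempt_1)).total_seconds()
print(f"Attempt 2 started {gap:.3f}s after attempt 1")
```

Running the same helper over the full worker logs of any suspicious task should show whether two attempts overlapped in a similarly short window.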

We suspect two of our schedulers created a race condition, and in the end we get behavior similar to that seen in #63183.

We looked at the logs of two of our schedulers (the third did not contain relevant traces) and, interestingly, found that:

Scheduler 1 started attempt 1 of the task. After the attempt was killed, the scheduler kept emitting the event that attempt 1 failed, most likely because it keeps fetching an incorrect state:

2026-06-04T02:34:12.394506Z [warning  ] Event: dag-... Failed, task: dag.task.1, annotations: <omitted> [airflow.providers.cncf.kubernetes.executors.kubernetes_executor_utils.KubernetesJobWatcher] loc=kubernetes_executor_utils.py:309

Scheduler 2 started attempt 2 of the task. However, it emitted the event about this attempt once and moved on:

2026-06-04T02:34:12.394506Z [warning  ] Event: dag-... Failed, task: dag.task.2, annotations: <omitted> [airflow.providers.cncf.kubernetes.executors.kubernetes_executor_utils.KubernetesJobWatcher] loc=kubernetes_executor_utils.py:309

We expected Scheduler 2 to detect that attempt 1 already existed in UP_FOR_RESCHEDULE and therefore not generate a second run.
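The kind of guard expected here can be illustrated with a conditional state transition: only the caller whose `UPDATE` still matches the expected state gets to launch a worker. This is a minimal sketch under stated assumptions, not Airflow's actual scheduler code or schema. It uses an in-memory SQLite database as a stand-in for the Postgres metastore, and the `task_instance` layout, the `claimed_by` column and the `try_claim` helper are all hypothetical.

```python
import sqlite3


def try_claim(conn, ti_id, scheduler):
    """Move a task instance from 'scheduled' to 'queued' in a single statement.

    Returns True only for the caller whose UPDATE matched the row, so a second
    scheduler that arrives later sees zero affected rows and backs off.
    """
    cur = conn.execute(
        "UPDATE task_instance SET state = 'queued', claimed_by = ? "
        "WHERE id = ? AND state = 'scheduled'",
        (scheduler, ti_id),
    )
    conn.commit()
    return cur.rowcount == 1


# Hypothetical, simplified table: one row per task instance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_instance (id TEXT PRIMARY KEY, state TEXT, claimed_by TEXT)")
conn.execute("INSERT INTO task_instance VALUES ('dag.task', 'scheduled', NULL)")
conn.commit()

first = try_claim(conn, "dag.task", "scheduler-1")
second = try_claim(conn, "dag.task", "scheduler-2")
print(first, second)
```

In this sketch the second call is rejected because the row is no longer in the expected state. The behavior reported above looks as if no equivalent check stopped the second scheduler from creating a worker.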

After searching the repo, the only similar discussion we found is #57041; however, it has received no replies. Like its author, we have confirmed that we have use_row_level_locking=True.

There does seem to be a PR (#63355) that would fix the issue for cases where both task runs succeed, but our scenario appears to be different.

We can provide additional logs and info if desired.

Edit 1

We've come across #60330, which would at least prevent a new task attempt id from being generated in Airflow 3.1.8+. However, we are still seeing open PRs, like #63355, for issues that may cause task execution inconsistencies.

Replies: 0 comments
