Skip to content

Race condition between scheduler processing events and trigger completion — queued-state #67287

@sw-cyderes

Description

@sw-cyderes

Under which category would you file this issue?

Airflow Core

Apache Airflow version

3.2.1

What happened and how to reproduce it?

Same race condition as #66374, but the post-defer TI lands in queued state rather than scheduled, so the fix in #66431 / backport #67089 (merged for 3.2.2) may not cover it. The scheduler still treats the stale executor success event from the worker's defer-exit as a state mismatch and kills the task externally.

Timeline from one occurrence:

[2026-05-20 13:37:14] worker TI started (try_number=1)
[2026-05-20 13:37:18] worker Pausing task as DEFERRED (defer() called, worker exits cleanly)
[2026-05-20 13:37:45] scheduler Executor LocalExecutor reported that the task
instance <TaskInstance: .... [queued]> finished with
state success, but the task instance's state attribute is queued.
Task marked as up_for_retry
[2026-05-20 13:38:05] worker try_number=2 starts, success

Notice the TI state attribute is queued (and the TI bracket label is also [queued]), not scheduled. The fix added by #66431 in process_executor_events only marks the event as ti_requeued when state == SCHEDULED and next_method is not None. The same logical condition (resume-after-defer with a stale executor success) can also leave the TI in QUEUED under load — that path falls through the new branch and the task is still failed/retried.

We see this race fire on a subset of deferred tasks per DAG run. It is not reproducible deterministically — same code, same workload, sometimes races, sometimes doesn't.

Differences and linkages with current tickets

Reproducer (probabilistic)

  1. Airflow 3.2.x, LocalExecutor
  2. A deferrable operator that: Calls self.defer(trigger=..., method_name="execute_complete").
  3. Observe the audit/event log. A subset of TIs may hit the [queued] mismatch instead of scheduled.

What you think should happen instead?

The ti_requeued branch in process_executor_events should also handle the queued-state variant, e.g.:

if (
state == TaskInstanceState.SUCCESS
and ti.next_method is not None
and ti.state in (TaskInstanceState.SCHEDULED, TaskInstanceState.QUEUED)
):
# stale defer-exit success, treat as requeue
...

Operating System

Airflow runs in the official apache/airflow:3.2.1 Docker image on Linux. Host: Ubuntu / Linux.

Deployment

Docker-Compose

Apache Airflow Provider(s)

No response

Versions of Apache Airflow Providers

No response

Official Helm Chart version

Not Applicable

Kubernetes Version

No response

Helm Chart configuration

No response

Docker Image customizations

No response

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Metadata

Metadata

Assignees

No one assigned

    Labels

    area:Schedulerincluding HA (high availability) schedulerarea:async-operatorsAIP-40: Deferrable ("Async") Operatorskind:bugThis is a clearly a bugneeds-triagelabel for new issues that we didn't triage yet

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions