Scheduler queued-timeout can reschedule a deferred TimeDeltaSensorAsync and cause premature success #65248

@hkc-8010

Description

Apache Airflow version

2.11.2

What happened?

We observed a TimeDeltaSensorAsync task succeed long before its target time after the scheduler's stuck-in-queued reschedule event fired just after the task had already started and deferred.

This was on KubernetesExecutor in a production environment, but the interesting part appears to be the scheduler/deferral interaction rather than the surrounding platform.

The task was:

  • DAG: generate_eval_imp_tfexample_install_to_revenue
  • Task: d28_eval_imp_generation.wait_for_d28_window
  • Run ID: scheduled__2026-03-18T00:00:00+00:00
  • Operator: TimeDeltaSensorAsync

The trigger log showed the sensor still had about 29 days remaining:

[2026-03-19T00:08:06.515+0000] {temporal.py:74} INFO - 2526713 seconds remaining; sleeping 3600 seconds

But the task still resumed and succeeded about 100 seconds later.

What you think should happen instead?

Once a task has transitioned from queued to running and then deferred, the scheduler's queued-timeout watchdog should no longer be able to reschedule that task instance based on stale queued state.

A deferred TimeDeltaSensorAsync should also not be able to resume into execute_complete() and succeed unless the trigger has legitimately fired because the target datetime was reached.
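To illustrate why any premature resume turns into success: a minimal standalone sketch of the sensor's shape (class and exception names here are stand-ins, not Airflow's real classes; the key behavior mirrored from the report is that `execute_complete()` returns None unconditionally).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


class TaskDeferred(Exception):
    """Stand-in for Airflow's TaskDeferred control-flow exception."""
    def __init__(self, moment: datetime, method_name: str):
        self.moment = moment
        self.method_name = method_name


@dataclass
class TimeDeltaSensorSketch:
    """Simplified model of a deferrable time-delta sensor (illustrative only)."""
    delta: timedelta

    def execute(self, now: datetime) -> None:
        # First run: defer until now + delta, resuming in execute_complete().
        raise TaskDeferred(moment=now + self.delta, method_name="execute_complete")

    def execute_complete(self, event=None) -> None:
        # Mirrors the reported behavior: returns None unconditionally, so the
        # task succeeds on *any* resume, whether or not the moment was reached.
        return None


now = datetime(2026, 3, 19, 0, 8, 6, tzinfo=timezone.utc)
sensor = TimeDeltaSensorSketch(delta=timedelta(seconds=2526713))
try:
    sensor.execute(now)
except TaskDeferred:
    # A premature reschedule resumes straight into execute_complete():
    result = sensor.execute_complete()
assert result is None  # counts as success, ~29 days early
```

In other words, the resumed path has no guard of its own; correctness depends entirely on the trigger not firing early and the scheduler not relaunching a deferred TI.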

How to reproduce

I do not yet have a minimal standalone reproducer, but the sequence below looks sufficient to explain the bug:

  1. A task instance remains in queued longer than [scheduler] task_queued_timeout.
  2. The task is finally picked up by a worker and starts running.
  3. It defers successfully via TimeDeltaSensorAsync / DateTimeTrigger.
  4. The scheduler's stuck_in_queued_reschedule path still processes the TI and reschedules it.
  5. The task is launched again, resumes via execute_complete(), and succeeds even though the original trigger had not reached the target time.
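The race in steps 1-4 can be sketched as follows (a toy model, not Airflow code: the watchdog filters on a stale `queued` snapshot and then reschedules without re-checking the TI's live state):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

QUEUED, RUNNING, DEFERRED, SCHEDULED = "queued", "running", "deferred", "scheduled"


@dataclass
class TI:
    state: str
    queued_dttm: Optional[datetime]


def stuck_in_queued_snapshot(tis, now, timeout):
    """Snapshot the watchdog takes: state == QUEUED and queued too long."""
    return [ti for ti in tis
            if ti.state == QUEUED
            and ti.queued_dttm is not None
            and ti.queued_dttm < now - timeout]


def reschedule_without_recheck(ti):
    # Models the suspected buggy path: acts on the stale snapshot even
    # though the TI has since moved queued -> running -> deferred.
    ti.state = SCHEDULED
    ti.queued_dttm = None


now = datetime(2026, 3, 19, 0, 8, 8, tzinfo=timezone.utc)
ti = TI(state=QUEUED, queued_dttm=now - timedelta(seconds=430))
stale = stuck_in_queued_snapshot([ti], now, timedelta(seconds=300))

# Worker catches up between snapshot and action: task runs and defers.
ti.state = DEFERRED

for victim in stale:
    reschedule_without_recheck(victim)

assert ti.state == SCHEDULED  # deferred TI clobbered back to scheduled
```

The timestamps above (430 s queued against a 300 s timeout) match the incident's ~7-minute queue delay against the 300 s `task_queued_timeout` in the event log.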

Operating System

Linux

Versions of Apache Airflow Providers

No response

Deployment

Other

Deployment details

KubernetesExecutor on Astro Hosted Runtime 13.6.0

Anything else?

Here is the timeline reconstructed from the task log, triggerer log, scheduler log, and Airflow metadata DB:

2026-03-19 00:00:46.827 scheduler: sent TI to KubernetesExecutor
2026-03-19 00:00:54.020 scheduler: received executor event with state queued
2026-03-19 00:00:54.054 scheduler: setting external_id=4387116

2026-03-19 00:08:03.860 task log: task starts running on worker 100.64.21.164
2026-03-19 00:08:06.158 task log: Pausing task as DEFERRED
2026-03-19 00:08:06.280 task log: exited with return code 100 (task deferral)
2026-03-19 00:08:06.515 trigger log: trigger starting
2026-03-19 00:08:06.515 trigger log: 2526713 seconds remaining; sleeping 3600 seconds

2026-03-19 00:08:08.031 Airflow event log: stuck in queued reschedule
  "Task was in queued state for longer than 300.0 seconds; task state will be set back to scheduled."

2026-03-19 00:08:09.601 triggerer log: Trigger 796267 completed
2026-03-19 00:08:10.140 scheduler: sending TI to KubernetesExecutor again
2026-03-19 00:08:17.098 scheduler: received executor event with state queued
2026-03-19 00:08:17.162 scheduler: setting external_id=4387227

2026-03-19 00:09:42.773 task log: second execution starts on worker 100.64.22.189
2026-03-19 00:09:42.926 task log: Resuming after deferral
2026-03-19 00:09:48.163 task log: task marks SUCCESS

The key inconsistency is that the stuck-in-queued reschedule event happens after the task has already run and deferred within the same try number.

The trigger log also never shows the normal path where the target time is actually reached. It only shows:

[2026-03-19T00:08:06.515+0000] {temporal.py:70} INFO - trigger starting
[2026-03-19T00:08:06.515+0000] {temporal.py:74} INFO - 2526713 seconds remaining; sleeping 3600 seconds

There is no later "time condition reached" message before the task resumes.
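If, as the "sleeping 3600 seconds" line suggests, the trigger sleeps in bounded hourly chunks, the remaining time can only decrease monotonically and the event should fire only when it reaches zero (this is a sketch of the assumed chunked-sleep schedule, not the actual temporal.py code):

```python
def sleep_chunks(total_remaining: int, cap: int = 3600):
    """Yield the sleep intervals a chunked time trigger would take.

    The 3600 s cap is inferred from the 'sleeping 3600 seconds' log line.
    """
    remaining = total_remaining
    while remaining > 0:
        chunk = min(remaining, cap)
        yield chunk
        remaining -= chunk


chunks = list(sleep_chunks(2526713))
assert sum(chunks) == 2526713   # total sleep covers the full delay
assert max(chunks) == 3600      # never sleeps more than an hour at a time
assert len(chunks) == 702       # ~29 days of hourly sleeps plus a remainder
```

With ~702 hourly sleeps still pending, "Trigger 796267 completed" three seconds after the trigger started is consistent with the trigger being cancelled or superseded, not with the time condition being reached.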

I also checked the live 2.11.2 code in the running scheduler image:

  • airflow/jobs/scheduler_job_runner.py _handle_tasks_stuck_in_queued() selects TIs by state == QUEUED, queued_dttm < now - task_queued_timeout, and queued_by_job_id == self.job.id, then calls _maybe_requeue_stuck_ti().
  • _reschedule_stuck_task() updates the TI back to SCHEDULED and clears queued_dttm.
  • airflow/models/taskinstance.py logs Resuming after deferral when next_kwargs is not None.
  • airflow/sensors/time_delta.py has TimeDeltaSensorAsync.execute_complete() simply return None.

So the behavior looks consistent with a race where a TI chosen as "stuck queued" is rescheduled after it has already moved through running -> deferred, and the resumed execution path then succeeds without re-checking the time condition.
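One possible shape of a fix, sketched as a standalone function (hypothetical, not a patch: in Airflow this would presumably be a row-locked UPDATE guarded by `state == QUEUED` inside the same transaction, rather than an in-memory check):

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class TI:
    state: str
    queued_dttm: Optional[datetime]


def reschedule_if_still_queued(ti: TI) -> bool:
    """Guarded variant of the stuck-in-queued reschedule.

    Re-checks the TI's live state before clobbering it, so a TI that has
    already moved through running -> deferred is left untouched.
    """
    if ti.state != "queued":
        return False  # already running or deferred; leave it alone
    ti.state = "scheduled"
    ti.queued_dttm = None
    return True


deferred_ti = TI(state="deferred",
                 queued_dttm=datetime(2026, 3, 19, tzinfo=timezone.utc))
assert reschedule_if_still_queued(deferred_ti) is False
assert deferred_ti.state == "deferred"  # deferral survives the watchdog
```

A compare-and-set like this would close the window between the watchdog's query and its write, which is exactly the window the timeline above shows being hit.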

I also ruled out obvious infrastructure causes during the incident window:

  • no scheduler pod restarts
  • no triggerer pod restarts
  • no triggerer tracebacks / OOMs / process death
  • no relevant Kubernetes warning events

A possibly related issue is #51301, but that report is about queued-timeout task failures and alerting. This issue seems to be about queued-timeout interacting incorrectly with deferred tasks and leading to premature success.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

Labels

  • area:Scheduler (including HA scheduler)
  • area:Triggerer
  • area:async-operators (AIP-40: Deferrable ("Async") Operators)
  • kind:bug
  • priority:high
