Apache Airflow version
2.11.2
What happened?
We observed a `TimeDeltaSensorAsync` task succeed long before its target time, after the scheduler's "stuck in queued reschedule" event fired just after the task had already started and deferred.
This was on `KubernetesExecutor` in a production environment, but the interesting part appears to be in the scheduler / deferral interaction rather than in the surrounding platform.
The task was:
- DAG: `generate_eval_imp_tfexample_install_to_revenue`
- Task: `d28_eval_imp_generation.wait_for_d28_window`
- Run ID: `scheduled__2026-03-18T00:00:00+00:00`
- Operator: `TimeDeltaSensorAsync`
The trigger log showed the sensor still had about 29 days remaining:

```
[2026-03-19T00:08:06.515+0000] {temporal.py:74} INFO - 2526713 seconds remaining; sleeping 3600 seconds
```
But the task still resumed and succeeded about 100 seconds later.
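For reference, the remaining time reported in that log line works out to just over 29 days:

```python
from datetime import timedelta

# Sanity check on the trigger log line: 2526713 seconds really is ~29 days,
# so the DateTimeTrigger target was nowhere near due when the task resumed.
remaining = timedelta(seconds=2526713)
print(remaining)  # 29 days, 5:51:53
```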
What you think should happen instead?
Once a task has transitioned from `queued` to `running` and then `deferred`, the scheduler's queued-timeout watchdog should no longer be able to reschedule that task instance based on stale queued state.
A deferred `TimeDeltaSensorAsync` should also not be able to resume into `execute_complete()` and succeed unless the trigger has legitimately fired because the target datetime was reached.
How to reproduce
I do not yet have a minimal standalone reproducer, but the sequence below looks sufficient to explain the bug:
- A task instance remains in `queued` longer than `[scheduler] task_queued_timeout`.
- The task is finally picked up by a worker and starts running.
- It defers successfully via `TimeDeltaSensorAsync` / `DateTimeTrigger`.
- The scheduler's `stuck_in_queued_reschedule` path still processes the TI and reschedules it.
- The task is launched again, resumes via `execute_complete()`, and succeeds even though the original trigger had not reached the target time.
Operating System
Linux
Versions of Apache Airflow Providers
No response
Deployment
Other
Deployment details
KubernetesExecutor on Astro Hosted Runtime 13.6.0
Anything else?
Here is the timeline reconstructed from the task log, triggerer log, scheduler log, and Airflow metadata DB:
```
2026-03-19 00:00:46.827  scheduler: sent TI to KubernetesExecutor
2026-03-19 00:00:54.020  scheduler: received executor event with state queued
2026-03-19 00:00:54.054  scheduler: setting external_id=4387116
2026-03-19 00:08:03.860  task log: task starts running on worker 100.64.21.164
2026-03-19 00:08:06.158  task log: Pausing task as DEFERRED
2026-03-19 00:08:06.280  task log: exited with return code 100 (task deferral)
2026-03-19 00:08:06.515  trigger log: trigger starting
2026-03-19 00:08:06.515  trigger log: 2526713 seconds remaining; sleeping 3600 seconds
2026-03-19 00:08:08.031  Airflow event log: stuck in queued reschedule
                         "Task was in queued state for longer than 300.0 seconds;
                          task state will be set back to scheduled."
2026-03-19 00:08:09.601  triggerer log: Trigger 796267 completed
2026-03-19 00:08:10.140  scheduler: sending TI to KubernetesExecutor again
2026-03-19 00:08:17.098  scheduler: received executor event with state queued
2026-03-19 00:08:17.162  scheduler: setting external_id=4387227
2026-03-19 00:09:42.773  task log: second execution starts on worker 100.64.22.189
2026-03-19 00:09:42.926  task log: Resuming after deferral
2026-03-19 00:09:48.163  task log: task marks SUCCESS
```
The key inconsistency is that the "stuck in queued reschedule" event happens after the task has already run and deferred in the same try.
The trigger log also never shows the normal path where the target time is actually reached. It only shows:

```
[2026-03-19T00:08:06.515+0000] {temporal.py:70} INFO - trigger starting
[2026-03-19T00:08:06.515+0000] {temporal.py:74} INFO - 2526713 seconds remaining; sleeping 3600 seconds
```

There is no later "time condition reached" message before the task resumes.
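For context, here is a simplified paraphrase of what the `DateTimeTrigger` loop in `temporal.py` appears to do, based on the log format above (this is a sketch, not the actual source):

```python
import asyncio
from datetime import datetime, timedelta, timezone

# Simplified paraphrase of the DateTimeTrigger run loop suggested by the
# temporal.py log lines: sleep in bounded chunks until the target moment,
# then fire. On this logic the trigger cannot legitimately fire ~29 days early.
async def run(moment: datetime) -> str:
    while (remaining := (moment - datetime.now(timezone.utc)).total_seconds()) > 0:
        chunk = min(remaining, 3600)
        print(f"{int(remaining)} seconds remaining; sleeping {int(chunk)} seconds")
        await asyncio.sleep(chunk)
    return "time condition reached"  # the log message that never appeared

# A target already in the past fires immediately:
print(asyncio.run(run(datetime.now(timezone.utc) - timedelta(seconds=1))))
```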
I also checked the live 2.11.2 code in the running scheduler image:
- `airflow/jobs/scheduler_job_runner.py`: `_handle_tasks_stuck_in_queued()` selects TIs by `state == QUEUED`, `queued_dttm < now - task_queued_timeout`, and `queued_by_job_id == self.job.id`, then calls `_maybe_requeue_stuck_ti()`.
- `_reschedule_stuck_task()` updates the TI back to `SCHEDULED` and clears `queued_dttm`.
- `airflow/models/taskinstance.py` logs `Resuming after deferral` when `next_kwargs is not None`.
- `airflow/sensors/time_delta.py` has `TimeDeltaSensorAsync.execute_complete()` simply `return None`.
So the behavior looks consistent with a race where a TI chosen as "stuck queued" is rescheduled after it has already moved through `running -> deferred`, and the resumed execution path then succeeds without re-checking the time condition.
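One possible direction for a fix (a hypothetical sketch, not Airflow's code): have the watchdog re-validate the TI state at reschedule time, i.e. after taking the row lock, so anything that has already left `QUEUED` is skipped:

```python
# Hypothetical guard sketch, not Airflow's implementation: re-check the TI
# state at reschedule time (after the row lock is taken), so a TI that has
# already moved queued -> running -> deferred is left alone.
def maybe_reschedule_stuck_ti(ti: dict) -> bool:
    """Return True only if the TI was still queued and got rescheduled."""
    if ti["state"] != "queued":
        # Raced with the worker: the TI is running/deferred, not stuck.
        return False
    ti["state"] = "scheduled"
    ti["queued_dttm"] = None
    return True

deferred_ti = {"state": "deferred", "queued_dttm": None}
print(maybe_reschedule_stuck_ti(deferred_ti))  # False: deferred TI untouched
```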
I also ruled out obvious infrastructure causes during the incident window:
- no scheduler pod restarts
- no triggerer pod restarts
- no triggerer tracebacks / OOMs / process death
- no relevant Kubernetes warning events
A possibly related issue is #51301, but that report is about queued-timeout task failures and alerting. This issue seems to be about queued-timeout interacting incorrectly with deferred tasks and leading to premature success.
Are you willing to submit PR?
Code of Conduct