Apache Airflow version
2.11.2
If "Other Airflow 2 version" selected, which one?
N/A
What happened?
We have multiple anonymized production incidents on Airflow 2.11.2 with CeleryExecutor where retry attempts that definitely ran are missing from task_instance_history.
The visible user symptom is that the Task logs / tries UI does not show all real attempts. But the more important observation is that this is not just a UI problem: one or more retry attempts are actually absent from task_instance_history even though scheduler and worker logs prove those attempts executed.
Observed pattern in affected runs:
- the surviving task_instance row reflects the current or final attempt
- one or more earlier retry attempts are missing from task_instance_history
- /tries therefore omits those attempts because it is built from task_instance_history plus the current task_instance
- in at least one case, logs for a missing attempt were still retrievable directly, which makes the UI inconsistency more confusing
What you think should happen instead?
Every retry attempt that actually executes should be preserved in task_instance_history.
If a task reaches deferred, up_for_retry, failed, or another terminal transition for a given try number, that try should still exist in task_instance_history afterward, and /tries should list it.
How to reproduce
I do not yet have a minimal standalone reproducer, but the repeated field pattern is:
- Run Airflow 2.11.2 with CeleryExecutor
- Use a task that can retry, including cases that may defer and then retry again
- Let the task execute multiple tries
- Inspect scheduler logs for the task/run and confirm a given try number was sent to the executor and finished
- Query task_instance_history for that same task/run
Observed result in affected runs:
- task_instance.try_number advances normally
- some earlier try numbers below the current try are missing from task_instance_history
- /tries omits those missing attempts
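The gap check in the last step can be sketched in a few lines of Python. This is only an illustration, not the script used during the incidents: find_missing_tries is a hypothetical helper name, and it assumes try numbers start at 1 and that every try below the current one is expected to have a task_instance_history row.

```python
from typing import Iterable, List


def find_missing_tries(current_try_number: int,
                       history_try_numbers: Iterable[int]) -> List[int]:
    """Return try numbers below the current try that have no history row.

    Assumption (not confirmed against Airflow internals): tries are numbered
    from 1, and each try below the current one should have been archived.
    """
    recorded = set(history_try_numbers)
    return [n for n in range(1, current_try_number) if n not in recorded]


if __name__ == "__main__":
    # Inputs: try_number from the task_instance row, plus the try_number
    # values returned by the task_instance_history query for the same
    # dag_id / task_id / run_id / map_index.
    print(find_missing_tries(3, [1]))
```

An empty result means the history is complete for that task instance; any returned number is a try that /tries would be unable to list.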
Operating System
Linux / Kubernetes
Versions of Apache Airflow Providers
Not yet isolated to a provider-specific issue.
Deployment
Other Kubernetes deployment
Deployment details
These incidents were observed on Astro Hosted deployments running Runtime 13.6.0 (Airflow 2.11.2+astro.2) with:
- CeleryExecutor
- two scheduler replicas present
- PostgreSQL metadata DB
I am filing this upstream because the symptom is directly in core retry history persistence (task_instance_history), not in a provider package.
Anything else?
This does not look like a UI-only bug.
Concrete example 1
Dag: cs-forecast-cl-data-preprocessing-bk-eks-intg
Task: publish_data.pyspark
Run: manual__2026-04-20T13:05:04+00:00
Direct DB query from the scheduler container after the incident:
{'current_try_number': 3, 'current_state': 'failed', 'current_start_date': '2026-04-22 13:00:30.668381+00:00', 'current_end_date': '2026-04-22 13:06:51.673156+00:00', 'current_hostname': '10.94.10.200', 'current_external_executor_id': 'a29904ce-e180-4b40-80d6-366e3a3b8cd2'}
history_rows=
{'try_number': 1, 'state': 'success', 'start_date': '2026-04-20 13:19:53.454647+00:00', 'end_date': '2026-04-20 13:25:16.544094+00:00', 'hostname': '10.94.23.102', 'external_executor_id': 'b7ba76f7-d337-4634-a5b6-989dd041eef1'}
{'history_try_numbers': [1], 'missing_try_numbers': [2]}
So the current row says try_number=3, but task_instance_history contains only try 1. Try 2 is missing.
Scheduler and worker logs prove try 2 actually ran:
[2026-04-22T12:51:27.388+0000] {scheduler_job_runner.py:692} INFO - Sending TaskInstanceKey(dag_id='cs-forecast-cl-data-preprocessing-bk-eks-intg', task_id='publish_data.pyspark', run_id='manual__2026-04-20T13:05:04+00:00', try_number=2, map_index=-1) to CeleryExecutor with priority 2 and queue default
[2026-04-22 12:51:27,841: INFO/ForkPoolWorker-3] Running <TaskInstance: cs-forecast-cl-data-preprocessing-bk-eks-intg.publish_data.pyspark manual__2026-04-20T13:05:04+00:00 [queued]> on host 10.94.10.200
[2026-04-22T12:51:46.816+0000] {scheduler_job_runner.py:813} INFO - TaskInstance Finished: dag_id=cs-forecast-cl-data-preprocessing-bk-eks-intg, task_id=publish_data.pyspark, run_id=manual__2026-04-20T13:05:04+00:00, map_index=-1, run_start_date=2026-04-22 12:51:28.238431+00:00, run_end_date=None, run_duration=323.089447, state=deferred, executor=CeleryExecutor(parallelism=25), executor_state=success, try_number=2, max_tries=2, job_id=584134, pool=default_pool, queue=default, priority_weight=2, operator=PySparkOperator, queued_dttm=2026-04-22 12:51:27.386468+00:00, queued_by_job_id=583149, pid=3570
[2026-04-22T12:55:24.447+0000] {scheduler_job_runner.py:692} INFO - Sending TaskInstanceKey(dag_id='cs-forecast-cl-data-preprocessing-bk-eks-intg', task_id='publish_data.pyspark', run_id='manual__2026-04-20T13:05:04+00:00', try_number=2, map_index=-1) to CeleryExecutor with priority 2 and queue default
[2026-04-22T12:55:30.928+0000] {scheduler_job_runner.py:813} INFO - TaskInstance Finished: dag_id=cs-forecast-cl-data-preprocessing-bk-eks-intg, task_id=publish_data.pyspark, run_id=manual__2026-04-20T13:05:04+00:00, map_index=-1, run_start_date=2026-04-22 12:51:28.238431+00:00, run_end_date=2026-04-22 12:55:29.120599+00:00, run_duration=240.882168, state=up_for_retry, executor=CeleryExecutor(parallelism=25), executor_state=success, try_number=2, max_tries=2, job_id=584141, pool=default_pool, queue=default, priority_weight=2, operator=PySparkOperator, queued_dttm=2026-04-22 12:55:24.445065+00:00, queued_by_job_id=584108, pid=3734
[2026-04-22T13:00:30.033+0000] {scheduler_job_runner.py:692} INFO - Sending TaskInstanceKey(dag_id='cs-forecast-cl-data-preprocessing-bk-eks-intg', task_id='publish_data.pyspark', run_id='manual__2026-04-20T13:05:04+00:00', try_number=3, map_index=-1) to CeleryExecutor with priority 2 and queue default
[2026-04-22T13:06:55.545+0000] {scheduler_job_runner.py:813} INFO - TaskInstance Finished: dag_id=cs-forecast-cl-data-preprocessing-bk-eks-intg, task_id=publish_data.pyspark, run_id=manual__2026-04-20T13:05:04+00:00, map_index=-1, run_start_date=2026-04-22 13:00:30.668381+00:00, run_end_date=2026-04-22 13:06:51.673156+00:00, run_duration=381.004775, state=failed, executor=CeleryExecutor(parallelism=25), executor_state=success, try_number=3, max_tries=2, job_id=584157, pool=default_pool, queue=default, priority_weight=2, operator=PySparkOperator, queued_dttm=2026-04-22 13:06:46.703023+00:00, queued_by_job_id=583149, pid=4154
That sequence shows try 2 absolutely existed and reached deferred and then up_for_retry, but afterward there is still no task_instance_history row for try 2.
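For anyone who wants to cross-check their own scheduler logs the same way, here is a rough sketch that pulls (try_number, state) pairs out of "TaskInstance Finished" lines. The regular expression is an assumption based only on the line format quoted above, finished_transitions is a hypothetical helper name, and the sample lines are abbreviated copies of the excerpts with most fields dropped.

```python
import re
from typing import Iterable, List, Tuple

# Assumed format: "... TaskInstance Finished: ..., state=<state>, ..., try_number=<n>, ..."
# \b keeps "state=" from matching inside "executor_state=".
FINISHED_RE = re.compile(
    r"TaskInstance Finished: .*?\bstate=(?P<state>\w+),.*?\btry_number=(?P<try_number>\d+)"
)


def finished_transitions(log_lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (try_number, state) for each 'TaskInstance Finished' line, in order."""
    transitions = []
    for line in log_lines:
        match = FINISHED_RE.search(line)
        if match:
            transitions.append((int(match.group("try_number")), match.group("state")))
    return transitions


# Abbreviated versions of the scheduler lines quoted above (fields trimmed).
SAMPLE_LINES = [
    "INFO - Sending TaskInstanceKey(task_id='publish_data.pyspark', try_number=2, map_index=-1) to CeleryExecutor",
    "INFO - TaskInstance Finished: task_id=publish_data.pyspark, map_index=-1, state=deferred, executor_state=success, try_number=2, max_tries=2",
    "INFO - TaskInstance Finished: task_id=publish_data.pyspark, map_index=-1, state=up_for_retry, executor_state=success, try_number=2, max_tries=2",
    "INFO - TaskInstance Finished: task_id=publish_data.pyspark, map_index=-1, state=failed, executor_state=success, try_number=3, max_tries=2",
]

if __name__ == "__main__":
    for try_number, state in finished_transitions(SAMPLE_LINES):
        print(try_number, state)
```

Any try number that shows up in this output but not in task_instance_history (and is not the current try) is a candidate for the loss described here.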
Concrete example 2
Dag: bd-sourcery-odm-snapshot-daily-bk-eks
Task: create_snapshot.bkng_data
Run: scheduled__2026-04-23T01:00:00+00:00
Direct DB query from the scheduler container:
{'current_try_number': 20, 'current_state': 'success', 'current_start_date': '2026-04-24 12:31:41.179719+00:00', 'current_end_date': '2026-04-24 14:14:41.020332+00:00', 'current_hostname': '10.94.27.178', 'current_external_executor_id': '3b9419d0-e97f-489c-9cde-5ef072d99854'}
history_rows=
{'try_number': 3, 'state': 'failed', 'start_date': '2026-04-24 02:22:26.799412+00:00', 'end_date': '2026-04-24 02:46:03.074407+00:00', 'hostname': '10.94.30.159', 'external_executor_id': '5122ae59-e21c-4e69-bc3b-29bc2f55944c'}
{'try_number': 6, 'state': 'failed', 'start_date': '2026-04-24 08:32:26.583491+00:00', 'end_date': '2026-04-24 08:38:02.100532+00:00', 'hostname': '10.94.23.156', 'external_executor_id': 'd965e200-196f-46f2-b6f5-40de834202df'}
{'try_number': 8, 'state': 'failed', 'start_date': '2026-04-24 09:08:22.046598+00:00', 'end_date': '2026-04-24 09:12:15.401738+00:00', 'hostname': '10.94.13.159', 'external_executor_id': '83efe05b-119c-4083-bf5b-3fa49f8f9e94'}
{'try_number': 10, 'state': 'failed', 'start_date': '2026-04-24 09:18:25.321445+00:00', 'end_date': '2026-04-24 09:18:30.991704+00:00', 'hostname': '10.94.30.159', 'external_executor_id': 'cb110363-675e-471d-9c9d-06cdf88ff6b6'}
{'try_number': 11, 'state': 'failed', 'start_date': '2026-04-24 09:23:45.550726+00:00', 'end_date': '2026-04-24 09:23:59.497683+00:00', 'hostname': '10.94.30.159', 'external_executor_id': '136b1a2c-8a1d-4947-ab78-300d0c5e911a'}
{'try_number': 14, 'state': 'failed', 'start_date': '2026-04-24 10:23:08.636760+00:00', 'end_date': '2026-04-24 10:31:06.771334+00:00', 'hostname': '10.94.17.127', 'external_executor_id': '35776857-c608-4a13-95ba-9d900daeaa6f'}
{'try_number': 15, 'state': 'failed', 'start_date': '2026-04-24 10:54:08.495904+00:00', 'end_date': '2026-04-24 11:06:53.790567+00:00', 'hostname': '10.94.20.200', 'external_executor_id': 'd3ccbaa3-3504-4b00-b248-3b51e750e25e'}
{'try_number': 16, 'state': 'failed', 'start_date': '2026-04-24 11:06:56.858252+00:00', 'end_date': '2026-04-24 11:15:49.369237+00:00', 'hostname': '10.94.9.51', 'external_executor_id': 'c065e193-c0fd-4081-a45e-09ead9bd613b'}
{'try_number': 17, 'state': 'failed', 'start_date': '2026-04-24 11:15:57.006413+00:00', 'end_date': '2026-04-24 11:52:47.887098+00:00', 'hostname': '10.94.20.17', 'external_executor_id': 'c2c0fe70-2f9f-4901-af1b-b22fc603bb67'}
{'try_number': 18, 'state': 'failed', 'start_date': '2026-04-24 11:52:54.897007+00:00', 'end_date': '2026-04-24 12:16:18.761605+00:00', 'hostname': '10.94.22.79', 'external_executor_id': '4641c634-b572-4cbc-84ce-9d8b983210c3'}
{'try_number': 19, 'state': 'failed', 'start_date': '2026-04-24 12:16:21.797645+00:00', 'end_date': '2026-04-24 12:31:33.573200+00:00', 'hostname': '10.94.27.178', 'external_executor_id': 'f2a1b243-562d-4334-ae68-dc4054b5e8c9'}
{'history_try_numbers': [3, 6, 8, 10, 11, 14, 15, 16, 17, 18, 19], 'missing_try_numbers': [1, 2, 4, 5, 7, 9, 12, 13]}
So this is not a one-off single-gap case. Here, a task that reached try_number=20 is missing many earlier attempts from task_instance_history.
Related issue
This looks related in bug family, but not identical in executor or exact symptom, to:
That open issue reports retry-history loss symptoms on Airflow 3.1.x under KubernetesExecutor.
I have not directly reproduced this on a running Airflow 3.2 deployment yet, so I do not want to overclaim version scope here. But the retry-history snapshot seam still appears materially similar in current 3.x code, so this may not be isolated to the 2.x line.