CeleryExecutor retries can be missing from task_instance_history, so /tries omits real attempts #67238

@hkc-8010

Apache Airflow version

2.11.2

If "Other Airflow 2 version" selected, which one?

N/A

What happened?

We have multiple anonymized production incidents on Airflow 2.11.2 with CeleryExecutor where retry attempts that definitely ran are missing from task_instance_history.

The visible user symptom is that the Task logs / tries UI does not show all real attempts. But the more important observation is that this is not just a UI problem: one or more retry attempts are actually absent from task_instance_history even though scheduler and worker logs prove those attempts executed.

Observed pattern in affected runs:

  • the surviving task_instance row reflects the current or final attempt
  • one or more earlier retry attempts are missing from task_instance_history
  • /tries therefore omits those attempts because it is built from task_instance_history plus the current task_instance
  • in at least one case, logs for a missing attempt were still retrievable directly, which makes the UI inconsistency more confusing
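To make the consequence of that pattern concrete, here is a minimal sketch. It is a simplified model, not Airflow's actual endpoint code: `assemble_tries` and `missing_try_numbers` are hypothetical helper names, and rows are plain dicts that only carry the fields needed here. The values come from concrete example 1 in this report.

```python
def assemble_tries(history_rows, current_row):
    """Model a /tries-style listing: history rows plus the current row.

    Assumption: each row is a dict with at least a 'try_number' key.
    """
    rows = list(history_rows) + [current_row]
    return sorted(rows, key=lambda row: row["try_number"])


def missing_try_numbers(history_rows, current_row):
    """Return try numbers below the current try that have no history row."""
    recorded = {row["try_number"] for row in history_rows}
    return [n for n in range(1, current_row["try_number"]) if n not in recorded]


# Values taken from concrete example 1 in this report.
history = [{"try_number": 1, "state": "success"}]
current = {"try_number": 3, "state": "failed"}

listed = [row["try_number"] for row in assemble_tries(history, current)]
missing = missing_try_numbers(history, current)
print(listed)   # [1, 3]
print(missing)  # [2]
```

Any listing assembled this way can only show what the two tables contain, so a try that never got a history row cannot appear no matter how the UI renders it.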

What you think should happen instead?

Every retry attempt that actually executes should be preserved in task_instance_history.

If a given try number reaches deferred, up_for_retry, failed, or any terminal state, that try should still exist in task_instance_history afterward, and /tries should list it.

How to reproduce

I do not yet have a minimal standalone reproducer, but the repeated field pattern is:

  1. Run Airflow 2.11.2 with CeleryExecutor
  2. Use a task that can retry, including cases that may defer and then retry again
  3. Let the task execute multiple tries
  4. Inspect scheduler logs for the task/run and confirm a given try number was sent to the executor and finished
  5. Query task_instance_history for that same task/run

Observed result in affected runs:

  • task_instance.try_number advances normally
  • some earlier try numbers below the current try are missing from task_instance_history
  • /tries omits those missing attempts
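Step 5 can be sketched as the comparison below. This is an illustration under stated assumptions, not the exact script used for the incidents: it uses an in-memory SQLite database as a stand-in for the PostgreSQL metadata DB, models only a handful of columns from the two tables, and `check_history` plus the example key values are hypothetical names.

```python
import sqlite3

# Simplified stand-in for the metadata DB. The real tables have many more
# columns and the real database in these incidents is PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE task_instance (
        dag_id TEXT, task_id TEXT, run_id TEXT, map_index INTEGER,
        try_number INTEGER, state TEXT
    );
    CREATE TABLE task_instance_history (
        dag_id TEXT, task_id TEXT, run_id TEXT, map_index INTEGER,
        try_number INTEGER, state TEXT
    );
    """
)

# Hypothetical task key; the try numbers mirror concrete example 1.
key = ("example_dag", "example_task", "manual__example", -1)
conn.execute("INSERT INTO task_instance VALUES (?, ?, ?, ?, ?, ?)", key + (3, "failed"))
conn.execute("INSERT INTO task_instance_history VALUES (?, ?, ?, ?, ?, ?)", key + (1, "success"))


def check_history(conn, dag_id, task_id, run_id, map_index=-1):
    """Compare the current try number with the tries recorded in history."""
    where = "dag_id = ? AND task_id = ? AND run_id = ? AND map_index = ?"
    params = (dag_id, task_id, run_id, map_index)
    (current_try,) = conn.execute(
        f"SELECT try_number FROM task_instance WHERE {where}", params
    ).fetchone()
    recorded = [
        row[0]
        for row in conn.execute(
            f"SELECT try_number FROM task_instance_history WHERE {where} "
            "ORDER BY try_number",
            params,
        )
    ]
    # Every try below the current one should have a history row.
    missing = [n for n in range(1, current_try) if n not in recorded]
    return {
        "current_try_number": current_try,
        "history_try_numbers": recorded,
        "missing_try_numbers": missing,
    }


result = check_history(conn, *key)
print(result)
```

A non-empty `missing_try_numbers` list for a run whose logs show those tries executing is the signature described in this report.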

Operating System

Linux / Kubernetes

Versions of Apache Airflow Providers

Not yet isolated to a provider-specific issue.

Deployment

Other Kubernetes deployment

Deployment details

These incidents were observed on Astro Hosted deployments running Runtime 13.6.0 (Airflow 2.11.2+astro.2) with:

  • CeleryExecutor
  • two scheduler replicas present
  • PostgreSQL metadata DB

I am filing this upstream because the symptom is directly in core retry history persistence (task_instance_history), not in a provider package.

Anything else?

This does not look like a UI-only bug.

Concrete example 1

Dag: cs-forecast-cl-data-preprocessing-bk-eks-intg
Task: publish_data.pyspark
Run: manual__2026-04-20T13:05:04+00:00

Direct DB query from the scheduler container after the incident:

{'current_try_number': 3, 'current_state': 'failed', 'current_start_date': '2026-04-22 13:00:30.668381+00:00', 'current_end_date': '2026-04-22 13:06:51.673156+00:00', 'current_hostname': '10.94.10.200', 'current_external_executor_id': 'a29904ce-e180-4b40-80d6-366e3a3b8cd2'}
history_rows=
{'try_number': 1, 'state': 'success', 'start_date': '2026-04-20 13:19:53.454647+00:00', 'end_date': '2026-04-20 13:25:16.544094+00:00', 'hostname': '10.94.23.102', 'external_executor_id': 'b7ba76f7-d337-4634-a5b6-989dd041eef1'}
{'history_try_numbers': [1], 'missing_try_numbers': [2]}

So the current row says try_number=3, but task_instance_history contains only try 1. Try 2 is missing.

Scheduler and worker logs prove try 2 actually ran:

[2026-04-22T12:51:27.388+0000] {scheduler_job_runner.py:692} INFO - Sending TaskInstanceKey(dag_id='cs-forecast-cl-data-preprocessing-bk-eks-intg', task_id='publish_data.pyspark', run_id='manual__2026-04-20T13:05:04+00:00', try_number=2, map_index=-1) to CeleryExecutor with priority 2 and queue default

[2026-04-22 12:51:27,841: INFO/ForkPoolWorker-3] Running <TaskInstance: cs-forecast-cl-data-preprocessing-bk-eks-intg.publish_data.pyspark manual__2026-04-20T13:05:04+00:00 [queued]> on host 10.94.10.200

[2026-04-22T12:51:46.816+0000] {scheduler_job_runner.py:813} INFO - TaskInstance Finished: dag_id=cs-forecast-cl-data-preprocessing-bk-eks-intg, task_id=publish_data.pyspark, run_id=manual__2026-04-20T13:05:04+00:00, map_index=-1, run_start_date=2026-04-22 12:51:28.238431+00:00, run_end_date=None, run_duration=323.089447, state=deferred, executor=CeleryExecutor(parallelism=25), executor_state=success, try_number=2, max_tries=2, job_id=584134, pool=default_pool, queue=default, priority_weight=2, operator=PySparkOperator, queued_dttm=2026-04-22 12:51:27.386468+00:00, queued_by_job_id=583149, pid=3570

[2026-04-22T12:55:24.447+0000] {scheduler_job_runner.py:692} INFO - Sending TaskInstanceKey(dag_id='cs-forecast-cl-data-preprocessing-bk-eks-intg', task_id='publish_data.pyspark', run_id='manual__2026-04-20T13:05:04+00:00', try_number=2, map_index=-1) to CeleryExecutor with priority 2 and queue default

[2026-04-22T12:55:30.928+0000] {scheduler_job_runner.py:813} INFO - TaskInstance Finished: dag_id=cs-forecast-cl-data-preprocessing-bk-eks-intg, task_id=publish_data.pyspark, run_id=manual__2026-04-20T13:05:04+00:00, map_index=-1, run_start_date=2026-04-22 12:51:28.238431+00:00, run_end_date=2026-04-22 12:55:29.120599+00:00, run_duration=240.882168, state=up_for_retry, executor=CeleryExecutor(parallelism=25), executor_state=success, try_number=2, max_tries=2, job_id=584141, pool=default_pool, queue=default, priority_weight=2, operator=PySparkOperator, queued_dttm=2026-04-22 12:55:24.445065+00:00, queued_by_job_id=584108, pid=3734

[2026-04-22T13:00:30.033+0000] {scheduler_job_runner.py:692} INFO - Sending TaskInstanceKey(dag_id='cs-forecast-cl-data-preprocessing-bk-eks-intg', task_id='publish_data.pyspark', run_id='manual__2026-04-20T13:05:04+00:00', try_number=3, map_index=-1) to CeleryExecutor with priority 2 and queue default

[2026-04-22T13:06:55.545+0000] {scheduler_job_runner.py:813} INFO - TaskInstance Finished: dag_id=cs-forecast-cl-data-preprocessing-bk-eks-intg, task_id=publish_data.pyspark, run_id=manual__2026-04-20T13:05:04+00:00, map_index=-1, run_start_date=2026-04-22 13:00:30.668381+00:00, run_end_date=2026-04-22 13:06:51.673156+00:00, run_duration=381.004775, state=failed, executor=CeleryExecutor(parallelism=25), executor_state=success, try_number=3, max_tries=2, job_id=584157, pool=default_pool, queue=default, priority_weight=2, operator=PySparkOperator, queued_dttm=2026-04-22 13:06:46.703023+00:00, queued_by_job_id=583149, pid=4154

That sequence shows that try 2 ran, reaching deferred and then up_for_retry, yet afterward there is still no task_instance_history row for try 2.
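The cross-check between logs and history can be sketched as follows. This is a rough illustration, not a supported tool: the regular expression assumes the "TaskInstance Finished" line layout shown above, `finished_events` is a hypothetical helper, and the sample lines are abbreviated copies of the scheduler lines above with most fields dropped.

```python
import re

# \bstate= skips "executor_state=" because "_" is a word character,
# so there is no word boundary between "executor_" and "state".
FINISHED_RE = re.compile(
    r"TaskInstance Finished: .*?\bstate=(?P<state>\w+),"
    r".*?\btry_number=(?P<try_number>\d+)"
)


def finished_events(log_lines):
    """Extract (try_number, state) pairs from 'TaskInstance Finished' lines."""
    events = []
    for line in log_lines:
        match = FINISHED_RE.search(line)
        if match:
            events.append((int(match.group("try_number")), match.group("state")))
    return events


# Abbreviated copies of the scheduler log lines above (most fields omitted).
sample_lines = [
    "INFO - Sending TaskInstanceKey(try_number=2, map_index=-1) to CeleryExecutor",
    "INFO - TaskInstance Finished: state=deferred, executor_state=success, try_number=2, max_tries=2",
    "INFO - TaskInstance Finished: state=up_for_retry, executor_state=success, try_number=2, max_tries=2",
    "INFO - TaskInstance Finished: state=failed, executor_state=success, try_number=3, max_tries=2",
]

events = finished_events(sample_lines)
print(events)  # [(2, 'deferred'), (2, 'up_for_retry'), (3, 'failed')]

# Tries that finished according to the logs but are neither in history
# nor the current task_instance row (values from the DB query above).
executed = {try_number for try_number, _ in events}
history_try_numbers = {1}
current_try_number = 3
unrecorded = sorted(executed - history_try_numbers - {current_try_number})
print(unrecorded)  # [2]
```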

Concrete example 2

Dag: bd-sourcery-odm-snapshot-daily-bk-eks
Task: create_snapshot.bkng_data
Run: scheduled__2026-04-23T01:00:00+00:00

Direct DB query from the scheduler container:

{'current_try_number': 20, 'current_state': 'success', 'current_start_date': '2026-04-24 12:31:41.179719+00:00', 'current_end_date': '2026-04-24 14:14:41.020332+00:00', 'current_hostname': '10.94.27.178', 'current_external_executor_id': '3b9419d0-e97f-489c-9cde-5ef072d99854'}
history_rows=
{'try_number': 3, 'state': 'failed', 'start_date': '2026-04-24 02:22:26.799412+00:00', 'end_date': '2026-04-24 02:46:03.074407+00:00', 'hostname': '10.94.30.159', 'external_executor_id': '5122ae59-e21c-4e69-bc3b-29bc2f55944c'}
{'try_number': 6, 'state': 'failed', 'start_date': '2026-04-24 08:32:26.583491+00:00', 'end_date': '2026-04-24 08:38:02.100532+00:00', 'hostname': '10.94.23.156', 'external_executor_id': 'd965e200-196f-46f2-b6f5-40de834202df'}
{'try_number': 8, 'state': 'failed', 'start_date': '2026-04-24 09:08:22.046598+00:00', 'end_date': '2026-04-24 09:12:15.401738+00:00', 'hostname': '10.94.13.159', 'external_executor_id': '83efe05b-119c-4083-bf5b-3fa49f8f9e94'}
{'try_number': 10, 'state': 'failed', 'start_date': '2026-04-24 09:18:25.321445+00:00', 'end_date': '2026-04-24 09:18:30.991704+00:00', 'hostname': '10.94.30.159', 'external_executor_id': 'cb110363-675e-471d-9c9d-06cdf88ff6b6'}
{'try_number': 11, 'state': 'failed', 'start_date': '2026-04-24 09:23:45.550726+00:00', 'end_date': '2026-04-24 09:23:59.497683+00:00', 'hostname': '10.94.30.159', 'external_executor_id': '136b1a2c-8a1d-4947-ab78-300d0c5e911a'}
{'try_number': 14, 'state': 'failed', 'start_date': '2026-04-24 10:23:08.636760+00:00', 'end_date': '2026-04-24 10:31:06.771334+00:00', 'hostname': '10.94.17.127', 'external_executor_id': '35776857-c608-4a13-95ba-9d900daeaa6f'}
{'try_number': 15, 'state': 'failed', 'start_date': '2026-04-24 10:54:08.495904+00:00', 'end_date': '2026-04-24 11:06:53.790567+00:00', 'hostname': '10.94.20.200', 'external_executor_id': 'd3ccbaa3-3504-4b00-b248-3b51e750e25e'}
{'try_number': 16, 'state': 'failed', 'start_date': '2026-04-24 11:06:56.858252+00:00', 'end_date': '2026-04-24 11:15:49.369237+00:00', 'hostname': '10.94.9.51', 'external_executor_id': 'c065e193-c0fd-4081-a45e-09ead9bd613b'}
{'try_number': 17, 'state': 'failed', 'start_date': '2026-04-24 11:15:57.006413+00:00', 'end_date': '2026-04-24 11:52:47.887098+00:00', 'hostname': '10.94.20.17', 'external_executor_id': 'c2c0fe70-2f9f-4901-af1b-b22fc603bb67'}
{'try_number': 18, 'state': 'failed', 'start_date': '2026-04-24 11:52:54.897007+00:00', 'end_date': '2026-04-24 12:16:18.761605+00:00', 'hostname': '10.94.22.79', 'external_executor_id': '4641c634-b572-4cbc-84ce-9d8b983210c3'}
{'try_number': 19, 'state': 'failed', 'start_date': '2026-04-24 12:16:21.797645+00:00', 'end_date': '2026-04-24 12:31:33.573200+00:00', 'hostname': '10.94.27.178', 'external_executor_id': 'f2a1b243-562d-4334-ae68-dc4054b5e8c9'}
{'history_try_numbers': [3, 6, 8, 10, 11, 14, 15, 16, 17, 18, 19], 'missing_try_numbers': [1, 2, 4, 5, 7, 9, 12, 13]}

So this is not a one-off single-gap case. Here, a task that reached try_number=20 is missing many earlier attempts from task_instance_history.

Related issue

This looks related in bug family, but not identical in executor or exact symptom, to an open issue that reports retry-history loss symptoms on Airflow 3.1.x under KubernetesExecutor.

I have not directly reproduced this on a running Airflow 3.2 deployment yet, so I do not want to overclaim version scope here. But the retry-history snapshot seam still appears materially similar in current 3.x code, so this may not be isolated to the 2.x line.

Labels

  • area:core
  • kind:bug
  • priority:high
