Pre-assign Celery task ID at queuing time to prevent duplicate execution on scheduler crash#65594

Merged
ashb merged 1 commit into apache:main from astronomer:ext-executor-id-race
Apr 23, 2026

@ashb ashb commented Apr 21, 2026

When a scheduler crashes between dispatching a task to Celery and processing the QUEUED event that persists external_executor_id, the replacement scheduler cannot adopt the in-flight task. Without the Celery task ID in the database, try_adopt_task_instances has no AsyncResult to look up, so the task is reset and re-queued — causing duplicate execution of an already-running task.

Fix this by generating a UUID for external_executor_id at queuing time (in _enqueue_task_instances_with_queued_state), committed to the database atomically with the QUEUED state transition. The same ID is carried through the workload and passed as task_id to Celery's apply_async(), making the Celery task ID deterministic from database state. A fresh UUID is generated on every queuing — including reschedule sensor re-queuing — avoiding stale result backend collisions.
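
The flow described above can be sketched with plain Python objects. This is a minimal illustration, not the Airflow implementation: `TaskInstance`, `FakeBroker`, `enqueue`, `dispatch`, and `try_adopt` are hypothetical stand-ins, and the broker only imitates the fact that Celery's `apply_async()` accepts a caller-supplied `task_id`.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TaskInstance:
    # Hypothetical, simplified stand-in for a task instance row.
    key: str
    state: str = "scheduled"
    external_executor_id: Optional[str] = None


@dataclass
class FakeBroker:
    # Stand-in for Celery: remembers which task_id each message was sent with.
    sent: Dict[str, str] = field(default_factory=dict)

    def apply_async(self, ti_key: str, task_id: str) -> None:
        self.sent[task_id] = ti_key


def enqueue(ti: TaskInstance) -> None:
    # The ID is assigned together with the QUEUED transition, before dispatch.
    ti.external_executor_id = str(uuid.uuid4())
    ti.state = "queued"


def dispatch(ti: TaskInstance, broker: FakeBroker) -> None:
    # The broker-side ID comes from the stored row, not from the broker.
    broker.apply_async(ti.key, task_id=ti.external_executor_id)


def try_adopt(ti: TaskInstance, broker: FakeBroker) -> bool:
    # A replacement scheduler can adopt only if the stored ID matches an in-flight task.
    return ti.external_executor_id in broker.sent
```

Because the ID exists on the row before anything is sent, a scheduler that starts after a crash can still match the row to the in-flight task. Calling `enqueue` again produces a new ID, which mirrors the "fresh UUID on every queuing" behaviour.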

This also fixes the separate race in #55004 where external_executor_id is lost when the task instance row is locked during event processing. process_executor_events uses skip_locked=True, and get_event_buffer() flushes the executor's in-memory buffer into a local variable. If a TI is locked and skipped, its QUEUED event is consumed from the buffer but never processed — the event and its task ID are silently dropped. With the ID now written to the database before the task is even sent to Celery, adoption no longer depends on the event being processed.
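
The dropped-event race can be reproduced in a few lines. The sketch below is a simplified model under stated assumptions: `FakeExecutor`, the dict-based `db`, and the `locked` set are hypothetical, and the `continue` branch only imitates what `skip_locked=True` does to a locked row.

```python
class FakeExecutor:
    """Hypothetical executor holding (state, external id) events per task key."""

    def __init__(self):
        self.event_buffer = {}

    def get_event_buffer(self):
        # Flush: hand the events to the caller and clear the executor's copy.
        events, self.event_buffer = self.event_buffer, {}
        return events


def process_executor_events(executor, db, locked):
    """Apply buffered events to db, skipping rows that are currently locked."""
    events = executor.get_event_buffer()
    for key, (state, external_id) in events.items():
        if key in locked:
            # Imitates skip_locked=True: the row is not returned, so the
            # event is never applied, and it is no longer in the buffer.
            continue
        db[key] = external_id


executor = FakeExecutor()
executor.event_buffer = {
    "ti_a": ("queued", "celery-id-a"),
    "ti_b": ("queued", "celery-id-b"),
}
db = {"ti_a": None, "ti_b": None}
process_executor_events(executor, db, locked={"ti_b"})
```

After this run `ti_b` has no stored ID and the buffer is empty, so nothing will ever retry the write. If the ID is already on the row before dispatch, losing the event no longer matters for adoption.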

The ID is added to TaskInstanceDTO with Field(exclude=True) (same pattern as executor_config) so it is available on the in-memory model but excluded from the JSON payload sent to workers.

Closes: #55004
Closes: #58570
Closes: #64997

@boring-cyborg boring-cyborg Bot added the area:Executors-core, area:providers, area:Scheduler, and provider:celery labels Apr 21, 2026
Comment thread airflow-core/src/airflow/jobs/scheduler_job_runner.py Outdated
@ashb ashb force-pushed the ext-executor-id-race branch from 2fbba3e to c01855c Compare April 21, 2026 13:57
@ashb ashb requested a review from ephraimbuddy April 21, 2026 13:58
Comment thread airflow-core/src/airflow/jobs/scheduler_job_runner.py Outdated
@ashb ashb force-pushed the ext-executor-id-race branch from c01855c to d04f387 Compare April 21, 2026 15:34
@ashb ashb added the backport-to-v3-2-test label Apr 21, 2026
Comment thread airflow-core/src/airflow/jobs/scheduler_job_runner.py Outdated
Comment thread providers/celery/src/airflow/providers/celery/executors/celery_executor.py Outdated
Comment thread providers/celery/tests/unit/celery/executors/test_celery_executor.py Outdated
Comment thread airflow-core/tests/unit/jobs/test_scheduler_job.py Outdated
Comment thread airflow-core/src/airflow/utils/orm_event_handlers.py Outdated
@ashb ashb force-pushed the ext-executor-id-race branch 2 times, most recently from b8785ab to 36c6afc Compare April 22, 2026 14:52
@ashb ashb requested a review from kaxil April 22, 2026 14:52
@kaxil kaxil left a comment

LGTM. A few follow-ups inline -- L944 (SQLite RETURNING on 3.15-3.34) looks like a real bug on the supported SQLite floor, worth a quick fix before merge. Rest are minor.

Comment thread airflow-core/src/airflow/jobs/scheduler_job_runner.py
Comment thread airflow-core/src/airflow/jobs/scheduler_job_runner.py Outdated
Comment thread airflow-core/tests/unit/jobs/test_scheduler_job.py Outdated
@ashb ashb force-pushed the ext-executor-id-race branch from 36c6afc to 6d02e2d Compare April 22, 2026 18:58
…execution on scheduler crash

When a scheduler crashes between dispatching a task to Celery and
processing the QUEUED event that persists `external_executor_id`, the
replacement scheduler cannot adopt the in-flight task. Without the
Celery task ID in the database, `try_adopt_task_instances` has no
`AsyncResult` to look up, so the task is reset and re-queued — causing
duplicate execution of an already-running task.

Fix this by generating `external_executor_id` via a DB-side UUID
function (`gen_random_uuid` on PostgreSQL, `UUID()` on MySQL, a
Python `uuid4` registered on SQLite) in the same bulk UPDATE that
sets state=QUEUED. The ID is committed atomically with the state
transition — no second write, no race window. RETURNING is used on
PostgreSQL and SQLite to read back the generated UUIDs without a
second round-trip; MySQL falls back to a SELECT.
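
A minimal sketch of the SQLite variant, using only the standard library. The table layout, the SQL function name `uuid4`, and the helper `queue_task_instances` are assumptions made for illustration; the real schema and code differ. `RETURNING` requires SQLite 3.35 or newer, so the sketch falls back to a `SELECT` on older versions.

```python
import sqlite3
import uuid


def queue_task_instances(conn, ti_ids):
    """Set state='queued' and assign a fresh UUID per row in a single UPDATE."""
    placeholders = ",".join("?" for _ in ti_ids)
    update = (
        "UPDATE task_instance "
        "SET state = 'queued', external_executor_id = uuid4() "
        f"WHERE id IN ({placeholders})"
    )
    with conn:  # one transaction: commit on success, roll back on error
        if sqlite3.sqlite_version_info >= (3, 35, 0):
            # Read the generated IDs back from the UPDATE itself.
            rows = conn.execute(
                update + " RETURNING id, external_executor_id", ti_ids
            ).fetchall()
        else:
            # Older SQLite has no RETURNING: update, then select.
            conn.execute(update, ti_ids)
            rows = conn.execute(
                "SELECT id, external_executor_id FROM task_instance "
                f"WHERE id IN ({placeholders})",
                ti_ids,
            ).fetchall()
    return dict(rows)


conn = sqlite3.connect(":memory:")
# Register a Python uuid4 as a zero-argument SQL function on this connection.
conn.create_function("uuid4", 0, lambda: str(uuid.uuid4()))
conn.execute(
    "CREATE TABLE task_instance "
    "(id INTEGER PRIMARY KEY, state TEXT, external_executor_id TEXT)"
)
conn.executemany(
    "INSERT INTO task_instance (id, state) VALUES (?, 'scheduled')",
    [(1,), (2,), (3,)],
)
conn.commit()

assigned = queue_task_instances(conn, [1, 2])
```

The registered function is non-deterministic by default, so SQLite evaluates it once per updated row and each task instance receives its own ID.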

The CeleryExecutor passes the pre-assigned ID to `apply_async()` as
the Celery `task_id`, making it deterministic from DB state. Other
executors ignore it and overwrite with their own ID (e.g. ECS task
ARN) during event processing.

This also fixes the separate race in apache#55004 where `external_executor_id`
is lost when the task instance row is locked during event processing.
`process_executor_events` uses `skip_locked=True`, and
`get_event_buffer()` flushes the executor's in-memory buffer into a
local variable. If a TI is locked and skipped, its QUEUED event is
consumed from the buffer but never processed — the event and its
task ID are silently dropped. With the ID now written to the database
before the task is even sent to Celery, adoption no longer depends on
the event being processed.

Closes: apache#55004
Closes: apache#58570
Closes: apache#64971
@ashb ashb force-pushed the ext-executor-id-race branch from 6d02e2d to c71d5c8 Compare April 22, 2026 20:55
@ashb ashb merged commit 3b188b9 into apache:main Apr 23, 2026
94 checks passed
@ashb ashb deleted the ext-executor-id-race branch April 23, 2026 08:18
@github-actions github-actions Bot added this to the Airflow 3.2.2 milestone Apr 23, 2026
@github-actions

Hi maintainer, this PR was merged without a milestone set.
We've automatically set the milestone to Airflow 3.2.2 based on: backport label targeting v3-2-test
If this milestone is not correct, please update it to the appropriate milestone.

This comment was generated by Milestone Tag Assistant.

@github-actions

Backport failed to create: v3-2-test. View the failure log Run details

Note: As per "Merging PRs targeted for Airflow 3.X", the committer who merges the PR is responsible for backporting PRs that are bug fixes (generally speaking) to the maintenance branches.

If in doubt, please ask in the #release-management Slack channel.

Status Branch Result
v3-2-test Commit Link

You can attempt to backport this manually by running:

cherry_picker 3b188b9 v3-2-test

This should apply the commit to the v3-2-test branch and leave the commit in conflict state marking
the files that need manual conflict resolution.

After you have resolved the conflicts, you can continue the backport process by running:

cherry_picker --continue

If you don't have cherry-picker installed, see the installation guide.
