Sensor reschedule path holds long row locks on TaskInstance under high concurrency #66778

@1fanwang

Description


When a sensor task with mode="reschedule" raises AirflowRescheduleException, the task supervisor calls the execution API's PATCH /task-instances/{id}/state endpoint with a TIRescheduleStatePayload. That handler takes a SELECT ... FOR UPDATE on the TaskInstance row (to serialize against scheduler state writes) and then inserts a TaskReschedule row in the same transaction.

Under high sensor concurrency (hundreds of sensors with short poke intervals against the same DAG run, common in fan-out patterns), multiple workers can contend for related rows. With MySQL's default innodb_lock_wait_timeout = 50s, a blocked worker ties up a DB connection waiting on the lock for up to 50 seconds before raising OperationalError(1205) Lock wait timeout exceeded — long enough to exhaust the connection pool and cause cascading 5xx responses from the API server for the rest of the workload.

Two related issues at the reschedule write site in airflow-core/src/airflow/api_fastapi/execution_api/routes/task_instances.py:

  1. The 50 s lock wait is far longer than the reschedule write should ever need. A blocked reschedule should fail fast and retry, not hold a worker connection idle for nearly a minute.
  2. The reschedule branch is not wrapped in retry logic, so a transient Error 1213: Deadlock found or 1205 timeout surfaces to the worker as a task failure — even though the operation is idempotent and Airflow already has retry_db_transaction for exactly this pattern (used on DagRun, RenderedTaskInstanceFields, DagWarning, etc.).
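
For reference, the retry shape in question looks roughly like the following. This is a simplified stand-in for Airflow's actual retry_db_transaction (which lives in airflow.utils.retries); the TransientDBError class and backoff numbers here are placeholders for the driver's OperationalError 1205/1213 handling, not the real implementation:

```python
import functools
import logging
import time

log = logging.getLogger(__name__)


class TransientDBError(Exception):
    """Placeholder for the driver's OperationalError carrying 1205/1213."""


def retry_db_transaction(retries=3, backoff=0.1):
    """Simplified sketch of the retry_db_transaction pattern: roll back and
    retry a session-scoped operation when a transient DB error is raised."""
    def decorator(func):
        @functools.wraps(func)
        def wrapped(session, *args, **kwargs):
            for attempt in range(1, retries + 1):
                try:
                    return func(session, *args, **kwargs)
                except TransientDBError:
                    # The failed transaction must be rolled back before retrying.
                    session.rollback()
                    if attempt == retries:
                        raise
                    log.warning("Transient DB error, retrying (%d/%d)", attempt, retries)
                    time.sleep(backoff * attempt)
        return wrapped
    return decorator
```

The reschedule write qualifies for this treatment precisely because, as noted above, the operation is idempotent: re-running the state update and TaskReschedule insert after a rollback produces the same end state.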

Reproducer

A DAG with 50 sensor tasks, each mode="reschedule", poke_interval=10s, all in the same DAG run. With 8+ schedulers/workers and a MySQL backend, running this for ~15 minutes reproduces sporadic OperationalError: (1205, 'Lock wait timeout exceeded; try restarting transaction') task failures, alongside (1213, 'Deadlock found') errors that escape the route handler unretried and fail the task.

Proposal

In the TIRescheduleStatePayload branch of _create_ti_state_update_query_and_update_state (and the ti_update_state route wrapper):

  1. Wrap the reschedule write with the existing @retry_db_transaction(retries=10) decorator so transient deadlocks/lock-wait timeouts retry instead of failing the task.
  2. On MySQL, set a per-session innodb_lock_wait_timeout of 4 s (configurable via a new [scheduler] reschedule_lock_timeout_seconds setting) before the lock is acquired, and restore the previous value at the end of the request. The result: reschedules either succeed quickly, deadlock-retry through the decorator, or fail-fast after a few seconds — never blocking the connection pool for 50 s.
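
Item 2 could be scoped as a context manager around the locked section. A minimal sketch, assuming a SQLAlchemy-style session exposing execute()/scalar() on raw SQL strings; the SET SESSION innodb_lock_wait_timeout statement is real MySQL syntax, but the helper name and the 4 s default (mirroring the proposed setting) are hypothetical:

```python
from contextlib import contextmanager


@contextmanager
def short_innodb_lock_wait(session, timeout_seconds=4):
    """Temporarily lower MySQL's per-session lock wait timeout, restoring the
    previous value even when the write inside the block raises 1205/1213.

    Sketch only: `session` is assumed to accept raw SQL strings via
    execute()/scalar(); real code would use sqlalchemy.text().
    """
    previous = session.scalar("SELECT @@SESSION.innodb_lock_wait_timeout")
    session.execute(f"SET SESSION innodb_lock_wait_timeout = {int(timeout_seconds)}")
    try:
        yield
    finally:
        # Runs on both success and error paths, so the pooled connection is
        # handed back with its original timeout.
        session.execute(f"SET SESSION innodb_lock_wait_timeout = {int(previous)}")
```

Wrapping the FOR UPDATE read and the TaskReschedule insert in this scope, combined with the retry decorator from item 1, would turn a fast 1205 into a retried write rather than a 50 s stall followed by a task failure.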

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!


Labels: area:core, kind:bug (This is clearly a bug), priority:high (High priority bug that should be patched quickly but does not require immediate new release)
