Sensor reschedule path holds long row locks on TaskInstance under high concurrency #66778

@1fanwang

Description


When a sensor task with mode="reschedule" raises AirflowRescheduleException, the task supervisor calls the execution API's PATCH /task-instances/{id}/state endpoint with a TIRescheduleStatePayload. That handler takes a SELECT ... FOR UPDATE on the TaskInstance row (to serialize against scheduler state writes) and then inserts a TaskReschedule row in the same transaction.

Under high sensor concurrency (hundreds of sensors with short poke intervals against the same DAG run, common in fan-out patterns), multiple workers can contend for related rows. With MySQL's default innodb_lock_wait_timeout = 50s, a blocked worker ties up a DB connection waiting on the lock for up to 50 seconds before raising OperationalError(1205) Lock wait timeout exceeded — long enough to exhaust the connection pool and cause cascading 5xx responses from the API server for the rest of the workload.

Two related issues at the reschedule write site in airflow-core/src/airflow/api_fastapi/execution_api/routes/task_instances.py:

  1. The 50 s lock wait is far longer than the reschedule write should ever need. A blocked reschedule should fail fast and retry, not hold a worker connection idle for nearly a minute.
  2. The reschedule branch is not wrapped in retry logic, so a transient Error 1213: Deadlock found or 1205 timeout surfaces to the worker as a task failure — even though the operation is idempotent and Airflow already has retry_db_transaction for exactly this pattern (used on DagRun, RenderedTaskInstanceFields, DagWarning, etc.).
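
For reference, the retry shape in question looks roughly like the following. This is a simplified stand-in for Airflow's actual retry_db_transaction (which lives in airflow.utils.retries); the TransientDBError class and backoff numbers here are placeholders for the driver's OperationalError 1205/1213 handling, not the real implementation:

```python
import functools
import logging
import time

log = logging.getLogger(__name__)


class TransientDBError(Exception):
    """Placeholder for the driver's OperationalError carrying 1205/1213."""


def retry_db_transaction(retries=3, backoff=0.1):
    """Simplified sketch of the retry_db_transaction pattern: roll back and
    retry a session-scoped operation when a transient DB error is raised."""
    def decorator(func):
        @functools.wraps(func)
        def wrapped(session, *args, **kwargs):
            for attempt in range(1, retries + 1):
                try:
                    return func(session, *args, **kwargs)
                except TransientDBError:
                    # The failed transaction must be rolled back before retrying.
                    session.rollback()
                    if attempt == retries:
                        raise
                    log.warning("Transient DB error, retrying (%d/%d)", attempt, retries)
                    time.sleep(backoff * attempt)
        return wrapped
    return decorator
```

The reschedule write qualifies for this treatment precisely because, as noted above, the operation is idempotent: re-running the state update and TaskReschedule insert after a rollback produces the same end state.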

Reproducer

A DAG with 50 sensor tasks, each mode="reschedule", poke_interval=10s, all in the same DAG run. With 8+ schedulers/workers and a MySQL backend, running this for ~15 minutes reproduces sporadic OperationalError: (1205, 'Lock wait timeout exceeded; try restarting transaction') task failures, alongside (1213, 'Deadlock found') errors that escape the route handler unretried and fail the task.

Proposal

In the TIRescheduleStatePayload branch of _create_ti_state_update_query_and_update_state (and the ti_update_state route wrapper):

  1. Wrap the reschedule write with the existing @retry_db_transaction(retries=10) decorator so transient deadlocks/lock-wait timeouts retry instead of failing the task.
  2. On MySQL, set a per-session innodb_lock_wait_timeout of 4 s (configurable via a new [scheduler] reschedule_lock_timeout_seconds setting) before the lock is acquired, and restore the previous value at the end of the request. The result: reschedules either succeed quickly, deadlock-retry through the decorator, or fail-fast after a few seconds — never blocking the connection pool for 50 s.
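
Item 2 could be scoped as a context manager around the locked section. A minimal sketch, assuming a SQLAlchemy-style session exposing execute()/scalar() on raw SQL strings; the SET SESSION innodb_lock_wait_timeout statement is real MySQL syntax, but the helper name and the 4 s default (mirroring the proposed setting) are hypothetical:

```python
from contextlib import contextmanager


@contextmanager
def short_innodb_lock_wait(session, timeout_seconds=4):
    """Temporarily lower MySQL's per-session lock wait timeout, restoring the
    previous value even when the write inside the block raises 1205/1213.

    Sketch only: `session` is assumed to accept raw SQL strings via
    execute()/scalar(); real code would use sqlalchemy.text().
    """
    previous = session.scalar("SELECT @@SESSION.innodb_lock_wait_timeout")
    session.execute(f"SET SESSION innodb_lock_wait_timeout = {int(timeout_seconds)}")
    try:
        yield
    finally:
        # Runs on both success and error paths, so the pooled connection is
        # handed back with its original timeout.
        session.execute(f"SET SESSION innodb_lock_wait_timeout = {int(previous)}")
```

Wrapping the FOR UPDATE read and the TaskReschedule insert in this scope, combined with the retry decorator from item 1, would turn a fast 1205 into a retried write rather than a 50 s stall followed by a task failure.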

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!


Labels: area:core, kind:bug (This is clearly a bug), priority:high (High priority bug that should be patched quickly but does not require immediate new release)
