Description
When a sensor task with mode="reschedule" raises AirflowRescheduleException, the task supervisor calls the execution API's PATCH /task-instances/{id}/state endpoint with a TIRescheduleStatePayload. That handler takes a SELECT ... FOR UPDATE on the TaskInstance row (to serialize against scheduler state writes) and then inserts a TaskReschedule row in the same transaction.
Under high sensor concurrency (hundreds of sensors with short poke intervals against the same DAG run, common in fan-out patterns), multiple workers can contend for related rows. With MySQL's default innodb_lock_wait_timeout = 50s, a blocked worker ties up a DB connection for up to 50 seconds before raising OperationalError (1205) Lock wait timeout exceeded. That is long enough for blocked requests to stack up against the connection pool limit and cause cascading 5xx responses from the API server for the rest of the workload.
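For concreteness, a minimal sketch of the contended write pattern. This is simplified: the function name and the TaskReschedule constructor fields are illustrative, not the actual route code.

```python
# Simplified sketch of the write pattern described above; the function name
# and the TaskReschedule fields are illustrative, not the actual route code.
from sqlalchemy import select

from airflow.models.taskinstance import TaskInstance
from airflow.models.taskreschedule import TaskReschedule


def _reschedule_write(session, ti_id, payload):
    # Row lock that serializes against scheduler state writes. Under MySQL,
    # a second worker blocks here for up to innodb_lock_wait_timeout
    # (50 s by default) before the driver raises OperationalError 1205.
    ti = session.scalars(
        select(TaskInstance).where(TaskInstance.id == ti_id).with_for_update()
    ).one()
    ti.state = "up_for_reschedule"
    # The TaskReschedule insert is part of the same transaction, so the row
    # lock is held until commit.
    session.add(
        TaskReschedule(
            ti_id=ti.id,  # constructor fields shown here are illustrative
            start_date=payload.start_date,
            end_date=payload.end_date,
            reschedule_date=payload.reschedule_date,
        )
    )
    session.commit()
```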
Two related issues at the reschedule write site in airflow-core/src/airflow/api_fastapi/execution_api/routes/task_instances.py:
- The 50 s lock wait is far longer than a reschedule write should ever need. A blocked reschedule should fail fast and retry, not tie up a worker connection for nearly a minute.
- The reschedule branch is not wrapped in retry logic, so a transient error 1213 (Deadlock found) or 1205 (Lock wait timeout) surfaces to the worker as a task failure, even though the operation is idempotent and Airflow already has retry_db_transaction for exactly this pattern (used on DagRun, RenderedTaskInstanceFields, DagWarning, etc.).
Reproducer
A DAG with 50 sensor tasks, each mode="reschedule", poke_interval=10 s, all in the same DAG run. With 8+ schedulers/workers and a MySQL backend, running this for ~15 minutes reproduces sporadic OperationalError: (1205, 'Lock wait timeout exceeded; try restarting transaction') task failures, alongside (1213, 'Deadlock found') errors that escape the route handler unretried.
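A minimal DAG along these lines reproduces it. The dag_id, task naming, and the always-false poke callable are placeholders; import paths assume Airflow 3 with the standard provider (on Airflow 2.x the sensor lives at airflow.sensors.python).

```python
# Hypothetical reproducer DAG; dag_id, task naming, and the always-false
# poke callable are placeholders.
import datetime

from airflow import DAG
from airflow.providers.standard.sensors.python import PythonSensor

with DAG(
    dag_id="reschedule_contention_repro",
    start_date=datetime.datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
):
    for i in range(50):
        PythonSensor(
            task_id=f"wait_{i}",
            python_callable=lambda: False,  # never true, so every poke reschedules
            mode="reschedule",
            poke_interval=10,
            timeout=60 * 60,
        )
```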
Proposal
In the TIRescheduleStatePayload branch of _create_ti_state_update_query_and_update_state (and the ti_update_state route wrapper):
- Wrap the reschedule write with the existing @retry_db_transaction(retries=10) decorator so transient deadlocks/lock-wait timeouts retry instead of failing the task.
- On MySQL, set a per-session innodb_lock_wait_timeout of 4 s (configurable via a new [scheduler] reschedule_lock_timeout_seconds setting) before the lock is acquired, and restore the previous value at the end of the request. The result: reschedules either succeed quickly, deadlock-retry through the decorator, or fail fast after a few seconds, never blocking the connection pool for 50 s. A sketch of both changes follows below.
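A rough sketch of the intended shape, assuming the existing retry_db_transaction helper and the new config option proposed above. _apply_reschedule and _do_locked_reschedule_write are illustrative names, not existing functions.

```python
# Sketch of the proposed shape; _apply_reschedule and
# _do_locked_reschedule_write are illustrative names, and the
# [scheduler] reschedule_lock_timeout_seconds option is the new setting
# proposed above. retry_db_transaction and conf are existing Airflow helpers.
from sqlalchemy import text
from sqlalchemy.exc import OperationalError

from airflow.configuration import conf
from airflow.utils.retries import retry_db_transaction


@retry_db_transaction(retries=10)  # rolls back and retries transient 1213/1205
def _apply_reschedule(ti_id, payload, session):
    is_mysql = session.get_bind().dialect.name == "mysql"
    prev_timeout = None
    if is_mysql:
        timeout = conf.getint(
            "scheduler", "reschedule_lock_timeout_seconds", fallback=4
        )
        prev_timeout = session.execute(
            text("SELECT @@SESSION.innodb_lock_wait_timeout")
        ).scalar()
        # Fail fast instead of parking the pooled connection for 50 s.
        session.execute(
            text(f"SET SESSION innodb_lock_wait_timeout = {int(timeout)}")
        )
    try:
        _do_locked_reschedule_write(ti_id, payload, session)  # existing write
    except OperationalError:
        # Clear the failed transaction so the restore below can execute
        # before the decorator's retry kicks in.
        session.rollback()
        raise
    finally:
        if is_mysql:
            session.execute(
                text(f"SET SESSION innodb_lock_wait_timeout = {int(prev_timeout)}")
            )
```

Restoring the previous value explicitly matters because MySQL session variables persist across connection pool checkouts, so a lowered timeout would otherwise leak into unrelated requests that reuse the connection.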
Are you willing to submit a PR?