Skip to content

Scheduler MySQL: misplaced USE INDEX (ti_state) hint on outer rejoin causes task_instance full scan in _executable_task_instances_to_queued (Airflow 3.2 regression from PR #54103) #66763

@8silvergun

Description

@8silvergun

Under which category would you file this issue?

Airflow Core

Apache Airflow version

3.2.1

What happened and how to reproduce it?

What happened

On a production Airflow 3.2.1 + MySQL deployment with ~3.2M rows in task_instance, the scheduler stopped queueing tasks. TaskInstances remained in scheduled and did not transition to queued. Worker pods that had already completed their work could not report their final state back through the execution API, because metadata DB lock waits piled up behind the scheduler's critical-section query.

Worker-side symptom (Airflow 3 execution API client):

ReadTimeout: timed out
airflow.sdk.api.client.Client.request
PATCH /execution/task-instances/<...>/state
EOFError: Request socket closed before length
BrokenPipeError: [Errno 32] Broken pipe

Scheduler-side symptom:

sqlalchemy.exc.OperationalError: Query execution was interrupted
SELECT task_instance ... LIMIT 16 FOR UPDATE OF task_instance SKIP LOCKED

Root cause

The slow query is generated by
airflow.jobs.scheduler_job_runner.SchedulerJobRunner._executable_task_instances_to_queued.

PR #54103 (shipped in 3.2.0) reshaped this method into a two-step pattern: a ranked_query that picks candidate SCHEDULED task instances, followed by an outer query that rejoins the candidates back to task_instance by the composite identity (dag_id, task_id, run_id, map_index).

Both the inner ranked query and the outer rejoin currently apply the same MySQL index hint:

.with_hint(TI, "USE INDEX (ti_state)", dialect_name="mysql")

The hint is appropriate on the inner query — that query filters by TI.state == SCHEDULED and the ti_state(state) index is the right one.

The hint is harmful on the outer query. The outer query is not a state lookup; it is an exact lookup keyed by (dag_id, task_id, run_id, map_index). For that pattern, MySQL should use the task_instance_composite_key(dag_id, task_id, run_id, map_index) index. But the hint restricts the optimizer to ti_state, so the composite key is unreachable and the optimizer falls back to a full scan of task_instance.

Source check (current main, file airflow-core/src/airflow/jobs/scheduler_job_runner.py)

The function _executable_task_instances_to_queued contains two occurrences of .with_hint(TI, "USE INDEX (ti_state)", dialect_name="mysql"):

  1. On the inner candidate query that filters DR.state == RUNNING and TI.state == SCHEDULED — appropriate.
  2. On the outer query: select(TI).with_hint(...).select_from(ranked_query).join(TI, (TI.dag_id == ranked_query.c.dag_id) & ...)misplaced.

Evidence (production EXPLAIN on Aurora MySQL 8.0)

Metadata table state at the time of incident (2026-05-11 KST):

task_instance rows: ~3,221,348
SCHEDULED task_instance: ~187
RUNNING/QUEUED task_instance: 0
RUNNING dag_run: ~141

Relevant existing indexes on task_instance:

task_instance_composite_key(dag_id, task_id, run_id, map_index)
ti_state(state)
ti_dag_run(dag_id, run_id)
ti_state_lkp(dag_id, task_id, run_id, state)

EXPLAIN of the unmodified scheduler query (outer hint present):

PRIMARY task_instance ALL rows=3,221,348
Using where; Using join buffer

EXPLAIN after removing only the outer USE INDEX (ti_state) hint:

PRIMARY task_instance eq_ref
key = task_instance_composite_key
rows = 1
Using index

Inner ranked query in both cases (kept as-is):

DERIVED task_instance ref
key = ti_state
rows = 187

So the same scheduler loop reads either ~3.2M rows or 1 row per candidate, depending solely on the outer hint.

Minimal reproduction

  1. Stand up Airflow 3.2.0 or 3.2.1 against MySQL 8.0.
  2. Populate task_instance so it has on the order of millions of total rows but only a small handful in SCHEDULED (this matches a long-lived production deployment).
  3. Run the scheduler under load.
  4. EXPLAIN the query emitted from _executable_task_instances_to_queued; the outer join to task_instance will be type=ALL.
  5. Remove only the second .with_hint(TI, "USE INDEX (ti_state)", dialect_name="mysql") (the one immediately before .select_from(ranked_query)). Re-run and EXPLAIN — the outer join becomes eq_ref on task_instance_composite_key.

What you think should happen instead?

The outer query (the rejoin to task_instance keyed by the composite identity) should use task_instance_composite_key, not ti_state. The fix is to remove the second .with_hint(TI, "USE INDEX (ti_state)", dialect_name="mysql") only. The inner hint should stay.

After the fix, on the same workload:

outer task_instance: task_instance_composite_key, eq_ref, rows=1
inner scheduled-TI scan: ti_state, ref/range over scheduled states

Operating System

Linux (kernel 6.x), Amazon EKS nodes on Bottlerocket / AL2023.

Deployment

Other 3rd-party Helm chart

Apache Airflow Provider(s)

No response

Versions of Apache Airflow Providers

No response

Official Helm Chart version

1.20.0

Kubernetes Version

1.33

Helm Chart configuration

No response

Docker Image customizations

No response

Anything else?

  • The original USE INDEX (ti_state) hint was added in MySQL Not Using Correct Index for Scheduler Critical Section Query #25627 (≈2022, Airflow 2.2.5) to fix a separate MySQL optimizer issue where the optimizer ignored ti_state on a single-step query. At the time, the hint placement was correct.
  • PR Include the max_active_tasks limit in the query fetching TIs to be queued #54103 (merged 2026-01-28, ships in 3.2.0) refactored _executable_task_instances_to_queued into the ranked + outer-rejoin shape and replicated the hint onto the outer rejoin. This is the regression introduction point.
  • The bug is dialect-scoped (dialect_name="mysql"), so PostgreSQL deployments are unaffected. This may be why the regression has gone unreported by managed-service users (MWAA/Composer typically default to Postgres).
  • The bug surfaces only when task_instance is large enough that a full scan is materially slower than an eq_ref lookup. Small test environments will not reproduce visible symptoms even though the EXPLAIN is wrong.
  • We've deployed an internal sitecustomize-based hotpatch that strips only the outer hint at process start, with no other behavior changes.
  • Possibly related but not the same issue: Tasks returning from deferred should be prioritized #57210 (deferred-task priority in the scheduler selection query) — references the same code path but a different symptom.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Metadata

Metadata

Assignees

No one assigned

    Labels

    area:Schedulerincluding HA (high availability) schedulerarea:performancekind:bugThis is a clearly a bugneeds-triagelabel for new issues that we didn't triage yetpriority:highHigh priority bug that should be patched quickly but does not require immediate new release

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions