Speed up TaskInstance bulk insert on PostgreSQL with unnest #66868

Draft
safaehar wants to merge 2 commits into apache:main from safaehar:perf/postgres-unnest-task-instance-insert

@safaehar
Summary

When the scheduler creates a DagRun, DagRun._create_task_instances bulk-inserts every TaskInstance for that run in a single call. On PostgreSQL this currently goes through SQLAlchemy ORM's bulk_insert_mappings, which emits

INSERT INTO task_instance (...) VALUES (...), (...), ...

with one bind tuple per row and roughly 35 columns each. The statement text and the number of bind parameters both scale with rows × columns, which is costly for DagRuns with mapped-task expansion or wide DAGs.

This PR adds a PostgreSQL-only fast path that instead emits

INSERT INTO task_instance (<cols>)
SELECT * FROM unnest(:c1::t1[], :c2::t2[], ...)

with one typed array per column, so the number of bind parameters scales with columns rather than rows × columns, and the planner sees a single static statement regardless of batch size.
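
To make the difference in statement shape concrete, here is a minimal sketch in plain Python that string-builds both forms for a hypothetical three-column table and counts the bind parameters each one needs. The helper names (`values_insert_sql`, `unnest_insert_sql`) and the column/type list are illustrative assumptions, not code from this PR; the real statement is built by SQLAlchemy from the mapper.

```python
# Illustrative only: string-builds the two statement shapes to compare
# how many bind parameters each needs. Not the PR's implementation.

# Hypothetical subset of columns with PostgreSQL element types.
COLUMNS = [("task_id", "VARCHAR"), ("run_id", "VARCHAR"), ("map_index", "INTEGER")]


def values_insert_sql(table: str, n_rows: int) -> tuple[str, int]:
    """Multi-row VALUES form: one placeholder per row per column."""
    names = [name for name, _ in COLUMNS]
    tuples = [
        "(" + ", ".join(f":{name}_{i}" for name in names) + ")"
        for i in range(n_rows)
    ]
    sql = f"INSERT INTO {table} ({', '.join(names)}) VALUES {', '.join(tuples)}"
    return sql, n_rows * len(names)


def unnest_insert_sql(table: str) -> tuple[str, int]:
    """unnest form: one array placeholder per column, whatever the row count."""
    names = [name for name, _ in COLUMNS]
    # CAST(...) is used here instead of :name::type[] because the PR notes
    # that placeholder parsing can break on the :name::type[] spelling.
    arrays = [f"CAST(:{name} AS {pg_type}[])" for name, pg_type in COLUMNS]
    sql = (
        f"INSERT INTO {table} ({', '.join(names)}) "
        f"SELECT * FROM unnest({', '.join(arrays)})"
    )
    return sql, len(names)


values_sql, values_params = values_insert_sql("task_instance", n_rows=1000)
unnest_sql, unnest_params = unnest_insert_sql("task_instance")
print(values_params, unnest_params)  # prints: 3000 3
```

The array values themselves still carry one element per row, so this sketch only illustrates the change in placeholder count and statement text, not a measured change in bytes on the wire.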

The dispatch follows the existing dialect-branch precedent in airflow/dag_processing/collection.py::activate_assets_if_possible. Other backends (MySQL, SQLite) and the task_instance_mutation_hook path (which needs per-object ORM access) are unchanged.
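
The branching described above can be summarised as a small decision function. This is a hedged sketch rather than the PR's code: `choose_insert_path` is a hypothetical name, the returned strings are labels for the three paths named in this description, and it assumes the dialect name reported for PostgreSQL is `"postgresql"`.

```python
def choose_insert_path(dialect_name: str, hook_is_noop: bool) -> str:
    """Return a label for the insert strategy described in this PR.

    Hypothetical helper for illustration; the real logic lives inside
    DagRun._create_task_instances.
    """
    if not hook_is_noop:
        # A real mutation hook needs per-object ORM access, so every
        # backend (including PostgreSQL) stays on the ORM object path.
        return "bulk_save_objects"
    if dialect_name == "postgresql":
        # Assumed dialect name for PostgreSQL sessions.
        return "unnest"
    # MySQL, SQLite and any other backend keep the existing path.
    return "bulk_insert_mappings"


for dialect in ("postgresql", "mysql", "sqlite"):
    print(dialect, choose_insert_path(dialect, hook_is_noop=True))
```

Checking the hook first keeps the fast path strictly opt-in: it can only be reached when no per-object mutation is required.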

Details

  • New private helpers in airflow/models/dagrun.py:
    • _build_postgres_unnest_insert(keys) — builds the INSERT … SELECT * FROM unnest(…) statement from TaskInstance.__mapper__. The SQL column list, ordering, and PG element types are all derived from the mapper (UtcDateTime → TIMESTAMP WITH TIME ZONE[], ExtendedJSON → JSONB[], ExecutorConfigType → BYTEA[], etc.), so new columns flow through without code changes here. The cast is injected by SQLAlchemy via bindparam(type_=postgresql.ARRAY(col.type)) rather than hand-rolled — avoiding a real footgun (text() placeholder parsing breaks on :id::UUID[] without an intervening space).
    • _bulk_insert_task_instance_dicts_postgres(task_dicts, session) — materializes the dict iterator, looks up the cached statement by frozenset(keys), and executes with column-major arrays.
  • _create_task_instances now branches on get_dialect_name(session) inside the hook_is_noop arm.
  • TaskInstance.insert_mapping now fills id (via the same uuid7 default as the column) and updated_at (via timezone.utcnow()) explicitly, so the unnest path does not need to replicate SQLAlchemy's column-default application. The pre-fill is behaviour-equivalent for non-Postgres backends because bulk_insert_mappings would have applied the same defaults.
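
The row-to-column step in the helper bullets above can be sketched in plain Python. Everything here is an assumption made for illustration: `KNOWN_COLUMNS` stands in for the mapper-derived column set, `column_order` stands in for the cached statement lookup keyed by `frozenset(keys)`, and the sketch assumes every dict in a batch has the same keys.

```python
from collections.abc import Iterable, Mapping
from functools import lru_cache
from typing import Any

# Hypothetical stand-in for the column set derived from the mapper.
KNOWN_COLUMNS = frozenset({"id", "task_id", "run_id", "map_index"})


@lru_cache(maxsize=None)
def column_order(keys: frozenset) -> tuple:
    """Stand-in for the cached statement: one entry per dict shape."""
    return tuple(sorted(keys))


def to_column_major(task_dicts: Iterable[Mapping[str, Any]]) -> dict[str, list[Any]]:
    """Turn row dicts into one list per column, dropping non-column keys."""
    rows = list(task_dicts)  # materialize the iterator exactly once
    if not rows:
        return {}  # empty input: nothing to insert
    # Assumes all rows share the first row's keys.
    keys = frozenset(rows[0]) & KNOWN_COLUMNS
    return {name: [row[name] for row in rows] for name in column_order(keys)}


rows = (
    {"id": i, "task_id": "extract", "run_id": "r1", "map_index": i, "not_a_column": True}
    for i in range(3)
)
arrays = to_column_major(rows)
print(arrays)
```

Each resulting list would be bound as one array parameter, so the number of parameters depends only on the dict shape and not on how many rows the batch contains.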

Tests

Added in airflow-core/tests/unit/models/test_dagrun.py:

  • TestPostgresUnnestBulkInsert
    • drops dict keys that are not columns (mirrors bulk_insert_mappings)
    • emits the expected BYTEA[] / JSONB[] / TIMESTAMP WITH TIME ZONE[] / VARCHAR(1000)[] casts when compiled for postgres
    • uses the SQL column name (task_display_name) even when the Python attr is _task_display_property_value
    • is a no-op on empty input
    • emits column-major arrays as bind params
  • test_create_task_instances_uses_unnest_path_on_postgres — dispatch goes through the helper, not bulk_insert_mappings.
  • test_create_task_instances_uses_bulk_insert_mappings_on_non_postgres — sqlite/mysql keep the existing path.
  • test_create_task_instances_mutation_hook_still_uses_bulk_save_objects — non-noop hook stays on the ORM path even on PG.

All 8 new tests + the 167 other tests in test_dagrun.py pass locally on SQLite. mypy-airflow-core and ruff are clean.

Benchmarks

The driving motivation is a speedup measured on a downstream fork at Datadog. Numbers from infra-staging will follow before this PR leaves draft.

Test plan

  • Validate in infra-staging with a real PostgreSQL — confirm scheduler throughput improves on DagRuns with large mapped-task expansion and no functional regression.
  • Run breeze testing core-tests --backend postgres --test-type Core -k "TestPostgresUnnest or test_create_task_instances" against postgres.
  • Rename airflow-core/newsfragments/pr_number.improvement.rst to <this-PR-number>.improvement.rst once the PR number is assigned.
  • Confirm CI is green.

Was generative AI tooling used to co-author this PR?
  • Yes — Claude Code (Opus 4.7)

Generated-by: Claude Code (Opus 4.7) following the guidelines

DagRun._create_task_instances bulk-inserts every TaskInstance for a
DagRun in a single call. On PostgreSQL the ORM bulk_insert_mappings
path emits a multi-row INSERT ... VALUES (...), (...) with one bind
tuple per row, which scales poorly for large DagRuns (mapped task
expansion, wide DAGs).

This change branches on dialect and, on PostgreSQL, emits

    INSERT INTO task_instance (<cols>)
    SELECT * FROM unnest(:c1::t1[], :c2::t2[], ...)

so the driver sends one typed array per column instead of one bind
tuple per row. The statement is built once per dict-shape from
TaskInstance.__mapper__ (so new columns flow through automatically)
and cached at module scope. Other backends and the mutation-hook
path (hook_is_noop=False) are unchanged.

TaskInstance.insert_mapping now fills ``id`` and ``updated_at``
explicitly so the unnest path does not have to replicate
SQLAlchemy's column-default application; the values match the
existing column ``default=`` callables, so behavior is preserved
across all backends.
@boring-cyborg
boring-cyborg Bot commented May 13, 2026

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide.
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our prek-hooks will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide. Consider adding an example Dag that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.

Apache Airflow is a community-driven project and together we are making it better 🚀.
In case of doubts contact the developers at:
Mailing List: dev@airflow.apache.org
Slack: https://s.apache.org/airflow-slack

