Speed up migrations 0094 and 0108 by removing ORM imports and batching large updates #63867

Closed
ephraimbuddy wants to merge 1 commit into apache:main from astronomer:optimize-94_108

@ephraimbuddy
Contributor

  • Replace ORM model imports (CallbackState, CallbackType, etc.) with inline constants so migration loading doesn't pull in the full Airflow runtime
  • Inline Task SDK serde deserialization in 0094 to eliminate the airflow.sdk.serde import
  • Use keyset pagination for batch queries in 0094 instead of re-scanning unmigrated rows
  • Batch task_instance NULL backfill in 0108 (10k rows at a time) to reduce lock duration on large tables
  • Consolidate per-column UPDATE statements into single per-table UPDATEs to reduce database round-trips
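
The keyset pagination mentioned above can be sketched as follows. This is a minimal illustration, not the code from migration 0094: the table `demo_deadline`, its integer primary key, the `payload` column, and the tiny batch size are all assumptions made so the example is self-contained with `sqlite3`. The idea is to remember the last key seen and ask only for rows after it, so each batch query starts where the previous one ended instead of scanning from the start again.

```python
import sqlite3

BATCH_SIZE = 3  # hypothetical; deliberately tiny so the batching is visible


def iter_batches(conn, batch_size=BATCH_SIZE):
    """Yield lists of (id, payload) rows using keyset pagination on the primary key."""
    last_id = 0  # assumes positive integer ids; a real table may use another key type
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM demo_deadline "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        # Resume strictly after the last key of this batch.
        last_id = rows[-1][0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_deadline (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO demo_deadline (id, payload) VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(1, 8)],
)

batch_sizes = []
seen_ids = []
for batch in iter_batches(conn):
    batch_sizes.append(len(batch))
    seen_ids.extend(row[0] for row in batch)
```

Because the `WHERE id > ?` predicate is served by the primary-key index, the cost of each batch does not grow with the number of rows already processed, which is the usual reason to prefer this over `OFFSET` or over re-selecting "not yet migrated" rows.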

Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)

GPT-5.4


  • Read the Pull Request Guidelines for more information. Note: commit author/co-author name and email in commits become permanently public when merged.
  • For fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
  • When adding dependency, check compliance with the ASF 3rd Party License Policy.
  • For significant user-facing changes create newsfragment: {pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.

@boring-cyborg boring-cyborg bot added area:db-migrations PRs with DB migration area:deadline-alerts AIP-86 (former AIP-57) labels Mar 18, 2026
@ephraimbuddy ephraimbuddy requested a review from Copilot March 18, 2026 08:41
Contributor

Copilot AI left a comment

Pull request overview

This PR optimizes two Alembic migrations (0094 and 0108) to reduce import overhead and improve performance on large metadata DBs by avoiding ORM/runtime imports and batching high-volume updates.

Changes:

  • Refactors multiple per-column UPDATEs into consolidated per-table UPDATE statements (0108).
  • Adds batched backfill for task_instance NULL columns to reduce long-running locks (0108).
  • Removes airflow.sdk.serde/ORM dependencies by inlining constants + a minimal deserializer and switching to keyset pagination for batch scanning (0094).
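
As a rough sketch of the first two changes, the snippet below combines a consolidated UPDATE (one statement fills several columns) with batching by primary key. It is not the migration's code: the table `demo_task_instance`, the columns `col_a` and `col_b`, their default values, and the batch size are hypothetical, and `sqlite3` stands in for the real database so the example runs on its own.

```python
import sqlite3


def backfill_in_batches(conn, batch_size):
    """Fill NULLs in two columns with one UPDATE per batch; return the batch count."""
    batches = 0
    while True:
        # Pick the next slice of rows that still need a backfill.
        ids = [
            row[0]
            for row in conn.execute(
                "SELECT id FROM demo_task_instance "
                "WHERE col_a IS NULL OR col_b IS NULL "
                "ORDER BY id LIMIT ?",
                (batch_size,),
            )
        ]
        if not ids:
            return batches
        placeholders = ",".join("?" * len(ids))
        # One statement updates both columns; COALESCE keeps existing values.
        conn.execute(
            "UPDATE demo_task_instance "
            "SET col_a = COALESCE(col_a, 0), col_b = COALESCE(col_b, 1) "
            f"WHERE id IN ({placeholders})",
            ids,
        )
        conn.commit()  # committing per batch keeps each transaction short
        batches += 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo_task_instance (id INTEGER PRIMARY KEY, col_a INTEGER, col_b INTEGER)"
)
conn.executemany(
    "INSERT INTO demo_task_instance (id, col_a, col_b) VALUES (?, ?, ?)",
    [(1, None, None), (2, 5, None), (3, None, 7), (4, 1, 1), (5, None, None)],
)

batches_run = backfill_in_batches(conn, batch_size=2)
final_rows = conn.execute(
    "SELECT id, col_a, col_b FROM demo_task_instance ORDER BY id"
).fetchall()
```

Whether a real migration can commit between batches depends on how the migration runner manages transactions, so the per-batch `commit()` here should be read as part of the illustration only.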

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| airflow-core/src/airflow/migrations/versions/0108_3_2_0_fix_migration_file_ORM_inconsistencies.py | Consolidates raw SQL updates and introduces batched task_instance backfill to reduce lock duration/import cost. |
| airflow-core/src/airflow/migrations/versions/0094_3_2_0_replace_deadline_inline_callback_with_fkey.py | Removes SDK serde import by inlining a minimal deserializer and uses keyset pagination to avoid rescanning rows. |

Comment on lines +194 to +197
    conn.execute(
        task_instance_table.update()
        .where(task_instance_table.c.id.in_(batch_ids))
        .values(

    def _deserialize_task_sdk_value(value):
        """Deserialize a minimal subset of Task SDK serde values used in callback kwargs."""
        if value is None or isinstance(value, bool | float | int | str):
Contributor Author

Airflow requires Python >= 3.10, and isinstance(value, bool | float | int | str) works fine from 3.10 onward.

Comment on lines +131 to +141
    if isinstance(value, int):
        return timezone(timedelta(seconds=value))

    if isinstance(value, str):
        return ZoneInfo(value)

    if isinstance(value, list) and len(value) == 3:
        data, classname, _version = value
        if classname in _SERDE_TIMEZONE_TYPES:
            return _deserialize_task_sdk_timezone(data)
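
To make the branches in this fragment easier to follow, here is a self-contained sketch of the same shape. It is an illustration under assumptions, not the migration's function: the name `deserialize_timezone_sketch`, the contents of `_SERDE_TIMEZONE_TYPES`, and the `ValueError` fallback are all hypothetical, since the review excerpt shows neither the set's members nor what happens when no branch matches.

```python
from datetime import timedelta, timezone
from zoneinfo import ZoneInfo

# Hypothetical class names; the real set in the migration is not shown in the excerpt.
_SERDE_TIMEZONE_TYPES = {"zoneinfo.ZoneInfo", "pendulum.tz.timezone.Timezone"}


def deserialize_timezone_sketch(value):
    """Turn a serialized timezone into a tzinfo object.

    Accepted shapes, mirroring the reviewed fragment:
    - int: a fixed UTC offset in seconds
    - str: an IANA zone name such as "Europe/Amsterdam"
    - [data, classname, version]: a wrapped value whose data is one of the above
    """
    if isinstance(value, int):
        return timezone(timedelta(seconds=value))

    if isinstance(value, str):
        return ZoneInfo(value)

    if isinstance(value, list) and len(value) == 3:
        data, classname, _version = value
        if classname in _SERDE_TIMEZONE_TYPES:
            return deserialize_timezone_sketch(data)

    # Assumed fallback: fail loudly on anything unrecognized.
    raise ValueError(f"Unsupported timezone payload: {value!r}")


one_hour = deserialize_timezone_sketch(3600)
wrapped = deserialize_timezone_sketch([7200, "zoneinfo.ZoneInfo", 1])

try:
    deserialize_timezone_sketch(["UTC", "some.other.Class", 1])
    rejected = False
except ValueError:
    rejected = True
```

The string branch relies on the IANA database being available to `zoneinfo` (from the operating system or the `tzdata` package), so the usage lines above exercise only the integer and wrapped forms.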

@ephraimbuddy ephraimbuddy deleted the optimize-94_108 branch March 18, 2026 18:52