Helm chart: support bidirectional Airflow metadata DB reconciliation on helm upgrade (downgrade as well as upgrade) #68072

@jykae

Description

Today the official Airflow Helm chart's migrateDatabaseJob only runs the forward migration, airflow db migrate. A helm upgrade that targets an older airflowVersion than the one currently running leaves the metadata DB schema ahead of the running image, and the api-server pod fails to start. The chart should reconcile the DB schema in both directions, upgrade and downgrade, based on the airflowVersion being deployed.

Use case/motivation

We operate Airflow on Kubernetes via this chart and ship to multiple environments through CI. We need rollback to be a first-class operation:

  • Deploying an older release tag (helm upgrade with an older airflowVersion) should bring the cluster, schema included, back to that release.
  • Today this requires out-of-band tooling: detect the current and target versions, then kubectl exec into the still-running old api-server pod and run airflow db downgrade --to-version <target> --yes before helm starts rolling the new image. We've implemented this as a workflow step driven by a small bash script, but every team using the chart has to duplicate that logic, and it only helps people deploying via GitHub Actions, not ArgoCD or manual helm.

With a chart-native solution, the workflow becomes: set airflowVersion: <older> in values and run helm upgrade; the chart reconciles the schema, and the pods running the older image come up.

Hard constraint that shapes the design

airflow db downgrade --to-version X.Y.Z requires the alembic revision scripts for every revision between the current head and the target. A revision script only ships in images built from the version that introduced it or a later one, so an older image never contains the scripts for newer revisions. So:

| Direction | Image that must run the operation | Why |
|---|---|---|
| Upgrade (current < target) | Target image | Forward revisions ship in the target image |
| Downgrade (current > target) | Currently running image | Reverse-direction code for revisions to undo only exists in the current image |
| Same | none | No-op |

Today's migrateDatabaseJob is correct for the first row only. A pre-upgrade hook running with airflow_image_for_migrations (the target image) cannot perform a downgrade — the target image doesn't carry the scripts that need to be reversed.
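
The table above reduces to a three-way version comparison. The following is a minimal sketch of that decision, assuming plain dotted version strings (such as 3.0.2) and a `sort` that supports `-V` (GNU coreutils). The function name `runner_image_for` is hypothetical and not part of the chart.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decide which image must run the schema operation.
# Assumes plain dotted versions (e.g. 3.0.2) and GNU sort with -V support.
runner_image_for() {
  local current="$1" target="$2" lowest
  if [ "$current" = "$target" ]; then
    echo "none"       # same version: nothing to do
    return 0
  fi
  # sort -V orders version strings component by component; the first line is the lower one.
  lowest=$(printf '%s\n%s\n' "$current" "$target" | sort -V | head -n 1)
  if [ "$lowest" = "$current" ]; then
    echo "target"     # upgrade: forward revisions ship in the target image
  else
    echo "current"    # downgrade: reverse scripts only exist in the running image
  fi
}

runner_image_for 3.0.2 3.1.0   # prints: target
runner_image_for 3.1.0 3.0.2   # prints: current
runner_image_for 3.1.0 3.1.0   # prints: none
```

Version ordering rather than string ordering matters here: a plain string comparison would rank 3.0.10 below 3.0.9.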

Proposed design — single reconcile job, runtime decision

Helm renders templates before any hook runs and without access to the metadata DB, so the chart can't pick the action at template time. But it doesn't need to: one job decides at runtime which action is required.

Keep the existing migrateDatabaseJob (rendered as <release>-run-airflow-migrations) — same name, same value keys, same ServiceAccount. Only the hook annotations and the container's command change.

helm.sh/hook               helm.sh/hook-weight   Job
────────────────────────   ───────────────────   ─────────────────────────────────────────────────────
pre-install,pre-upgrade    1                     <release>-run-airflow-migrations (same job as today)

(Was post-install,post-upgrade — moved to pre-upgrade so the schema is aligned before new pods roll, and so the downgrade branch can kubectl exec into still-running old pods.)

The container runs with the chart-templated target image. Pseudocode:

target=$AIRFLOW_TARGET_VERSION       # injected from .Values.airflowVersion
current=$(discover_current_version)  # query alembic_version table + mapping shipped in chart

# True when $1 sorts strictly before $2 (version order, GNU sort -V).
version_lt() { [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]; }

if [ -z "$current" ]; then                       # fresh install
  exec airflow db migrate
elif [ "$current" = "$target" ]; then            # no-op
  exit 0
elif version_lt "$current" "$target"; then       # forward: target image has the scripts
  exec airflow db migrate
else                                             # backward: must use old image
  old_pod=$(kubectl get pod -n "$NAMESPACE" -l component=api-server -o jsonpath='...' | head -1)
  exec kubectl exec -n "$NAMESPACE" "$old_pod" -c api-server -- airflow db downgrade --to-version "$target" --yes
fi

Why pre-upgrade for both directions works
  • Forward migrate with the target image in pre-upgrade is what the chart already supports today via airflow_image_for_migrations. Moving it from post-upgrade to pre-upgrade means the schema is correct before the new pods start rolling, instead of the migration racing the waitForMigrations initContainer. Functionally equivalent for existing users.
  • Downgrade in pre-upgrade is the only window that works: the old api-server pods are still alive and reachable via kubectl exec, and their image carries the alembic reverse scripts. Once pre-upgrade returns and helm starts applying manifests, those pods get replaced.

Why a single job rather than two

| Aspect | Two-job design | Single reconcile job |
|---|---|---|
| Templates rendered | 2 | 1 |
| Hook weights to reason about | -10 / 1 | 1 |
| Race between hooks | yes (downgrade must finish before migrate starts) | none |
| "Same version" code path | both jobs no-op | one early-exit |
| Cluster reads | each job re-discovers current | once |
| waitForMigrations race | unchanged | gone: schema is aligned before new pods roll |

Discovery of current

Preference: query alembic_version table + ship a small alembic-rev → Airflow-version map alongside appVersion bumps. Avoids needing extra RBAC for version discovery — DB credentials are already available via standard_airflow_environment.
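
The lookup half of this approach could be sketched as follows, assuming the chart ships the map as a plain text file with one "<alembic-revision> <airflow-version>" pair per line. The file format, the function name `rev_to_version`, and the revision IDs are all hypothetical placeholders; reading the revision out of the alembic_version table is not shown.

```shell
#!/usr/bin/env bash
# Hypothetical map format: "<alembic-revision> <airflow-version>", one pair per line.
# The revision IDs used below are placeholders, not real Airflow revisions.
rev_to_version() {
  local map_file="$1" rev="$2"
  # Print the version for the first exact revision match;
  # exit non-zero when the revision is not in the map.
  awk -v rev="$rev" '$1 == rev { print $2; found = 1; exit } END { exit !found }' "$map_file"
}

map_file=$(mktemp)
cat > "$map_file" <<'EOF'
rev_placeholder_a 3.0.0
rev_placeholder_b 3.1.0
EOF

rev_to_version "$map_file" rev_placeholder_b                      # prints: 3.1.0
rev_to_version "$map_file" unknown_rev || echo "unknown revision" # prints: unknown revision
rm -f "$map_file"
```

Patch releases that add no migrations share a head revision with the release before them, so a real map would need a rule for ties. This sketch simply returns the first match, and an unknown revision is treated as an error rather than as a fresh install.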

Alternatives if the mapping is undesirable: read Deployment/<release>-api-server pod spec image (requires deployments.get), or kubectl exec -- airflow version on the running pod (uses the same pods/exec RBAC the downgrade itself needs).
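
For the Deployment-image alternative, the version has to be parsed out of the image reference. A rough sketch, assuming the tag begins with the Airflow version (as in 3.0.2 or 3.0.2-python3.12); custom images with arbitrary tags would break this assumption. The function name `image_to_version` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract an Airflow version from a container image reference.
# Assumes the tag starts with the version (e.g. "3.0.2" or "3.0.2-python3.12").
# The input could come from something like (container index 0 is an assumption):
#   kubectl get deployment <release>-api-server \
#     -o jsonpath='{.spec.template.spec.containers[0].image}'
image_to_version() {
  local image="$1" name tag
  image="${image%%@*}"        # drop a trailing @sha256:... digest, if any
  name="${image##*/}"         # last path segment, so a registry port is not mistaken for a tag
  case "$name" in
    *:*) tag="${name##*:}" ;;
    *)   return 1 ;;          # untagged reference: the version cannot be derived
  esac
  echo "${tag%%-*}"           # strip suffixes such as -python3.12
}

image_to_version "apache/airflow:3.0.2"                                        # prints: 3.0.2
image_to_version "registry.example.com:5000/apache/airflow:3.1.0-python3.12"   # prints: 3.1.0
```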

RBAC

Extend the existing migrateDatabaseJob ServiceAccount (<release>-migrate-database-job) with a Role scoped to the release namespace:

  • pods, pods/exec (verbs: get, list, create) — to run airflow db downgrade against the live api-server pod.

Forward migrate doesn't need this — it's only consumed by the downgrade branch.

Backward compatibility
  • Same job name, same value keys (migrateDatabaseJob.resources, tolerations, etc.), same ServiceAccount name. Users' values files don't need changes.
  • Hook moves from post-install,post-upgrade to pre-install,pre-upgrade. Functionally equivalent for forward migrations — just removes the race with the waitForMigrations initContainer on the new pods.
  • Upgrade and same-version paths produce the same schema outcome as today for existing users (they only ever hit the forward-migrate or no-op branch).
  • Downgrade is always permitted, with no opt-in flag. Today a helm upgrade to an older version half-succeeds and leaves the cluster broken; after this change it completes cleanly. There is no "safer" status quo to preserve.

Test matrix to add under chart/tests/
  1. Fresh install (no alembic_version row) → forward migrate
  2. Same version → early-exit no-op
  3. Forward (current < target) → forward migrate with target image
  4. Backward (current > target) → kubectl exec into discovered old pod with airflow db downgrade --to-version <target> --yes
  5. migrate-database-job ServiceAccount's Role renders with pods/exec
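
Cases 1 to 4 could be checked without a cluster if the branch logic is factored into a function that prints the command instead of executing it. The sketch below illustrates that idea only; `plan_reconcile` and `check` are hypothetical names, it assumes GNU `sort -V`, and it is not a statement of how the chart's existing tests are structured.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run planner: prints the command the reconcile job would run
# instead of executing it, so each branch can be checked without a cluster.
plan_reconcile() {
  local current="$1" target="$2" old_pod="$3" lowest
  if [ -z "$current" ]; then echo "airflow db migrate"; return 0; fi   # fresh install
  if [ "$current" = "$target" ]; then echo "noop"; return 0; fi        # same version
  lowest=$(printf '%s\n%s\n' "$current" "$target" | sort -V | head -n 1)
  if [ "$lowest" = "$current" ]; then
    echo "airflow db migrate"                                          # forward
  else
    echo "kubectl exec $old_pod -c api-server -- airflow db downgrade --to-version $target --yes"
  fi
}

# Compare the planned command for (current, target) against the expected one.
check() {
  local got
  got=$(plan_reconcile "$1" "$2" "old-pod")
  if [ "$got" = "$3" ]; then
    echo "ok: '$1' -> '$2'"
  else
    echo "FAIL: '$1' -> '$2': $got"
    return 1
  fi
}

check ""    3.0.2 "airflow db migrate"
check 3.0.2 3.0.2 "noop"
check 3.0.2 3.1.0 "airflow db migrate"
check 3.1.0 3.0.2 "kubectl exec old-pod -c api-server -- airflow db downgrade --to-version 3.0.2 --yes"
```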

End-to-end (kind via breeze k8s tests):

  • Install 3.0.x → upgrade to 3.1.x → downgrade to 3.0.x → verify alembic head matches the 3.0.x branch tip and api-server starts.

Why a chart hook rather than out-of-band tooling

Because every operator (GitHub Actions, ArgoCD, Flux, manual helm upgrade) hits the same problem, the chart is the single right place to own the contract. The current state forces each deployer to maintain their own pre-helm script.

Related issues

None I could find on apache/airflow that propose chart-side support for downgrade. Closest relatives:

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
