
Upgrade from 3.1.7 to 3.2.0 stuck on Database migrations (MySQL) #65350

@davidjfrickert

Description


Under which category would you file this issue?

Helm chart

Apache Airflow version

3.1.7 -> 3.2.0

What happened and how to reproduce it?

Upgraded Airflow from 3.1.7 to 3.2.0. The DB migrations pod was stuck for over 20 minutes.

Some logs:

2026-04-16T02:51:13.804504Z [info     ] Context impl MySQLImpl.        [alembic.runtime.migration] loc=migration.py:210
2026-04-16T02:51:13.804607Z [info     ] Will assume non-transactional DDL. [alembic.runtime.migration] loc=migration.py:213
2026-04-16T02:51:13.806593Z [info     ] Migrating the Airflow database (MySQL) [airflow.utils.db] loc=db.py:1179
2026-04-16T02:51:13.806680Z [info     ] MySQL: Committing session to release metadata locks [airflow.utils.db] loc=db.py:794
2026-04-16T02:58:44.621134Z [info     ] Context impl MySQLImpl.        [alembic.runtime.migration] loc=migration.py:210
2026-04-16T02:58:44.621234Z [info     ] Will assume non-transactional DDL. [alembic.runtime.migration] loc=migration.py:213
2026-04-16T02:58:44.687475Z [info     ] Running upgrade cc92b33c6709 -> 82dbd68e6171, Add composite index (ti_id, id DESC) to task_reschedule. [alembic.runtime.migration] loc=migration.py:621

First, it sat for 7.5 minutes before any migration even started; then the actual migration (adding the composite index) was stuck for a further 20+ minutes.

The only way to get it to finish was to scale down all pods, which let the migration complete. Is this expected?

We are using MySQL.

Graph of processes waiting for a lock. Migrations started around the beginning of this graph; the waits ended when the pods were scaled down and the migration finished.
[screenshot: graph of processes waiting for lock]
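For anyone hitting the same hang, the blocking sessions can usually be identified from MySQL directly while the DDL is stuck. A minimal diagnostic sketch, assuming MySQL 5.7.9+ with the bundled `sys` schema, the metadata DB named `airflow`, and connection details in `$DB_HOST`/`$DB_USER` (all placeholders):

```shell
# List sessions stuck behind a table metadata lock (MySQL reports this
# state as "Waiting for table metadata lock").
mysql -h "$DB_HOST" -u "$DB_USER" -p airflow -e "
  SELECT id, user, state, LEFT(info, 80) AS query
  FROM information_schema.processlist
  WHERE state = 'Waiting for table metadata lock';"

# With the sys schema, show which connection is blocking the DDL on
# task_reschedule, plus a ready-made KILL statement for it.
mysql -h "$DB_HOST" -u "$DB_USER" -p airflow -e "
  SELECT object_name, waiting_query, blocking_pid, sql_kill_blocking_connection
  FROM sys.schema_table_lock_waits
  WHERE object_name = 'task_reschedule';"
```

In our case the blockers were the still-running Airflow components holding open transactions that had touched `task_reschedule`, which is why scaling the pods down released the lock.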

What do you think should happen instead?

The Helm chart should handle migrations cleanly, without requiring the user to scale down workloads.
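Until the chart handles this itself, one workaround is to quiesce everything that keeps long-lived sessions open against the metadata DB before upgrading. A rough sketch, assuming a release named `airflow` in namespace `airflow` and the chart's usual component names (all resource kinds and names here are assumptions; the triggerer may be a StatefulSet in your setup):

```shell
# Scale down components that hold connections to the metadata DB.
kubectl -n airflow scale deployment \
  airflow-scheduler airflow-dag-processor airflow-api-server --replicas=0
kubectl -n airflow scale statefulset airflow-triggerer --replicas=0

# Run the upgrade; the migration job can now acquire the metadata locks it needs.
helm upgrade airflow apache-airflow/airflow -n airflow -f values.yaml

# helm upgrade reapplies the replica counts from values.yaml, so the
# components come back on their own; if not, scale them back up manually.
```

This trades a short control-plane outage for a migration that completes promptly instead of hanging on metadata locks.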

Operating System

No response

Deployment

Official Apache Airflow Helm Chart

Apache Airflow Provider(s)

No response

Versions of Apache Airflow Providers

No response

Official Helm Chart version

1.20.0 (latest released)

Kubernetes Version

v1.30.14

Helm Chart configuration

defaultAirflowTag: "3.2.0"
airflowVersion: "3.2.0"

executor: "KubernetesExecutor"
allowPodLaunching: true

# Avoid having the Helm chart manage these secrets, which would rotate them on upgrades and break DAGs or web UI access
jwtSecretName: airflow-jwt-secret
fernetKeySecretName: airflow-fernet-key
apiSecretKeySecretName: airflow-api-secret-key

env:
  - name: AIRFLOW__API__EXPOSE_CONFIG
    value: "False"
  - name: AIRFLOW__API__BASE_URL
    value: "https://(redacted)"
  - name: AIRFLOW__EMAIL__EMAIL_BACKEND
    value: "airflow.utils.email.send_email_smtp"
  - name: AIRFLOW__SMTP__SMTP_HOST
    value: "redacted"
  - name: AIRFLOW__SMTP__SMTP_MAIL_FROM
    value: "redacted"
  - name: AIRFLOW__SMTP__SMTP_STARTTLS
    value: "False"
  - name: AIRFLOW__WEBSERVER__SHOW_TRIGGER_FORM_IF_NO_PARAMS
    value: "True"
  - name: AIRFLOW__WEBSERVER__WARN_DEPLOYMENT_EXPOSURE
    value: "False"
  ## in Airflow v2, parallelism=0 meant infinite;
  ## in v3, a value must be supplied.
  - name: AIRFLOW__CORE__PARALLELISM
    value: "256"

# dags are mounted via localPath
dags:
  persistence:
    enabled: false

logs:
  persistence:
    enabled: true
    #size: 5Gi
    #annotations: {}
    #storageClassName: "rook-cephfs-fs00"
    existingClaim: airflow-logs

redis:
  enabled: false

postgresql:
  enabled: false

statsd:
  enabled: false

extraEnv: |
  - name: AIRFLOW_KUBE_NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
  - name: AIRFLOW_ENV
    valueFrom:
      configMapKeyRef:
        name: airflow.deployment
        key: environment

extraEnvFrom: |
  - secretRef:
      name: airflow-okta-creds
  - secretRef:
      name: airflow-admin-user-pwd

data:
  metadataSecretName: airflow-db-conn
  resultBackendSecretName: airflow-db-conn

# Root-level tolerations and affinity apply to all components (scheduler, webserver, triggerer, etc.)
# including the KubernetesExecutor pod template.
# If desired to have different rules for the pod template, set workers.affinity/workers.tolerations instead.
# See: https://github.com/apache/airflow/blob/main/chart/files/pod-template-file.kubernetes-helm-yaml
tolerations:
  - key: airflow
    operator: Equal
    value: "true"
    effect: NoSchedule

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: node-role.kubernetes.io/worker-airflow
              operator: In
              values:
                - "true"

webserver:
  defaultUser:
    enabled: true
    role: Admin
    username: redacted
    firstName: redacted
    lastName: redacted
    email: redacted

# Note: this is only applied on account creation; if the account already exists, its details won't be updated
createUserJob:
  ttlSecondsAfterFinished: 300
  command: ~
  args:
    - "bash"
    - "-c"
    - |-
      exec \
      airflow users create \
          -r "{{ .Values.webserver.defaultUser.role }}" \
          -u "{{ .Values.webserver.defaultUser.username }}" \
          -e "{{ .Values.webserver.defaultUser.email }}" \
          -f "{{ .Values.webserver.defaultUser.firstName }}" \
          -l "{{ .Values.webserver.defaultUser.lastName }}" \
          -p "${AIRFLOW_ADMIN_PASSWORD}"
  applyCustomEnv: true

securityContexts:
  pod:
    runAsUser: 50000
    runAsGroup: 0
    fsGroup: 50000
    # changing permissions is slow, especially on big volumes like the logs volume,
    # so only change them if the root permissions do not match the expected ones
    fsGroupChangePolicy: "OnRootMismatch"
    runAsNonRoot: true

triggerer:
  replicas: 2
  resources:
    requests:
      memory: "500Mi"
      cpu: "1000m"
    limits:
      memory: "1Gi"
      cpu: "2000m"
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: node-role.kubernetes.io/worker-airflow
                operator: In
                values:
                  - "true"
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: component
                operator: In
                values:
                  - triggerer
          topologyKey: "kubernetes.io/hostname"
  extraVolumes:
  - name: dags
    hostPath:
      path: redacted
      type: Directory
  extraVolumeMounts:
    - name: dags
      mountPath: /opt/airflow/dags
      readOnly: true
      subPath: dags-v3

scheduler:
  replicas: 2
  resources:
    requests:
      memory: "4Gi"
      cpu: "2"
    limits:
      memory: "8Gi"
      cpu: "4"
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: node-role.kubernetes.io/worker-airflow
                operator: In
                values:
                  - "true"
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: component
                operator: In
                values:
                  - scheduler
          topologyKey: "kubernetes.io/hostname"

dagProcessor:
  replicas: 2
  # dagProcessor has a memory leak, growing roughly 300MB/day
  resources:
    requests:
      memory: "4Gi"
      cpu: "1"
    limits:
      memory: "8Gi"
      cpu: "2"
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: node-role.kubernetes.io/worker-airflow
                operator: In
                values:
                  - "true"
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: component
                operator: In
                values:
                  - dag-processor
          topologyKey: "kubernetes.io/hostname"
  extraVolumes:
  - name: dags
    hostPath:
      path: redacted
      type: Directory
  extraVolumeMounts:
    - name: dags
      mountPath: /opt/airflow/dags
      readOnly: true
      subPath: dags-v3

apiServer:
  replicas: 2
  resources:
    requests:
      memory: "2Gi"
      cpu: "1000m"
    limits:
      memory: "3Gi"
      cpu: "2000m"
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: node-role.kubernetes.io/worker-airflow
                operator: In
                values:
                  - "true"
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: component
                operator: In
                values:
                  - api-server
          topologyKey: "kubernetes.io/hostname"
  env:
    ## this should achieve the same as the community chart's `airflow.extraPipPackages`.
    ## Note: the leading underscore is intentional, not a typo!
    - name: _PIP_ADDITIONAL_REQUIREMENTS
      value: "airflow-exporter==1.6.0"
  service:
    type: NodePort
    ports:
      - name: api-server
        port: "{{ .Values.ports.apiServer }}"
        nodePort: 30010
## Uncomment below to enable Okta
## note: does not work when port-forwarding
  apiServerConfig: |
    import os
    from flask_appbuilder.security.manager import AUTH_OAUTH

    AUTH_TYPE = AUTH_OAUTH

    # registration configs
    AUTH_USER_REGISTRATION = True  # allow users who are not already in the FAB DB
    AUTH_USER_REGISTRATION_ROLE = "Viewer"  # this role will be given in addition to any AUTH_ROLES_MAPPING

    # the list of providers which the user can choose from
    OAUTH_PROVIDERS = [
      {
          "name": "okta",
          "icon": "fa-circle-o",
          "token_key": "access_token",
          "remote_app": {
              "client_id": os.environ["OKTA_CLIENT_ID"],
              "client_secret": os.environ["OKTA_CLIENT_SECRET"],
              "api_base_url": "redacted",
              "client_kwargs": {"scope": "openid profile email groups"},
              "server_metadata_url": "redacted",
              "access_token_url": "redacted",
              "authorize_url": "redacted",
          },
      },
    ]

    # a mapping from the values of `userinfo["role_keys"]` to a list of FAB roles
    AUTH_ROLES_MAPPING = {
        "Viewer": ["Viewer"],
        "redacted": ["Admin"],
    }

    # if we should replace ALL the user's roles each login, or only on registration
    AUTH_ROLES_SYNC_AT_LOGIN = True

    # force users to re-auth after 30min of inactivity (to keep roles in sync)
    PERMANENT_SESSION_LIFETIME = 1800

## A slightly confusing section: `workers` technically configures Celery workers, which we don't use,
## but its extraVolumes/extraVolumeMounts are appended to the pod manifests of task pods launched by Airflow.
## As of chart 1.20.0, workers.resources is deprecated in favor of workers.kubernetes.resources
workers:
  extraVolumes:
  - name: dags
    hostPath:
      path: redacted
      type: Directory
  extraVolumeMounts:
    - name: dags
      mountPath: /opt/airflow/dags
      readOnly: true
      subPath: dags-v3
  kubernetes:
    resources:
      requests:
        memory: "500Mi"
        cpu: "200m"
      limits:
        memory: "1Gi"
        cpu: "2000m"

Docker Image customizations

No response

Anything else?

No response

Are you willing to submit PR?

  • Yes, I am willing to submit a PR!

    Labels

    area:db-migrations (PRs with DB migration), area:helm-chart (Airflow Helm Chart), kind:bug (This is clearly a bug), pending-response, priority:high (High priority bug that should be patched quickly but does not require an immediate new release)
