Replies: 1 comment 1 reply
We cannot really give advice on concrete numbers -> https://airflow.apache.org/docs/apache-airflow/stable/installation/index.html#notes-about-minimum-requirements explains why. I think we would also like to learn from users like you about some best practices, so I guess we need to get more feedback and learnings (and replace some workarounds with permanent fixes) to arrive at some good guidelines/best practices. So it would be great to hear whether the above comment helps you optimize things.
Airflow 3 on EKS is way hungrier than Airflow 2 — hitting OOMs, PgBouncer bottlenecks, and flaky health checks at scale
We're migrating from Airflow 2.10.0 to 3.1.7 (self-managed EKS, not Astronomer/MWAA) and running into scaling issues during stress testing that we never had in Airflow 2. Our platform is fairly large — ~450 DAGs, some with ~200 tasks, doing about 1,500 DAG runs / 80K task instances per day. At peak we're looking at ~140 concurrent DAG runs and ~8,000 tasks running at the same time across a mix of Celery and KubernetesExecutor.
Would love to hear from anyone running Airflow 3 at similar scale.
Our setup
- Executors: `CeleryExecutor,KubernetesExecutor`
- Custom XCom backend (`S3XComBackend`)

Current stress-test topology

- Celery `worker_concurrency: 16`
- PgBouncer: `metadataPoolSize: 500`, `maxClientConn: 5000`

Key config:
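A minimal sketch of how that maps onto the chart's `values.yaml`, assuming the standard apache-airflow Helm chart keys (the XCom backend module path below is a placeholder, not our real one):

```yaml
# Sketch only: the topology above expressed as Helm chart values (key names assumed)
executor: "CeleryExecutor,KubernetesExecutor"

config:
  celery:
    worker_concurrency: 16
  core:
    xcom_backend: my_company.xcom.S3XComBackend   # placeholder module path

pgbouncer:
  enabled: true
  metadataPoolSize: 500     # chart default is 10
  maxClientConn: 5000       # chart default is 100
```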
We also had to relax liveness probes across the board (`timeoutSeconds: 60`, `failureThreshold: 10`) and extend the API server startup probe to 5 minutes; the Helm chart defaults were way too aggressive for our load.

One thing worth calling out: we never set CPU requests/limits on the API server, scheduler, or DagProcessor. We got away with that in Airflow 2, but it matters a lot more now that the API server handles execution traffic too.
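Roughly what the relaxed liveness probes look like in our values, assuming the chart exposes the usual per-component probe keys (shown for the scheduler and triggerer; we applied the same numbers everywhere):

```yaml
# Sketch: liveness probes loosened so heartbeat checks survive DB contention
scheduler:
  livenessProbe:
    timeoutSeconds: 60      # raised from the chart default, which timed out under load
    failureThreshold: 10    # tolerate a run of slow checks before restarting the pod

triggerer:
  livenessProbe:
    timeoutSeconds: 60
    failureThreshold: 10
```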
What's going wrong
1. API server keeps getting OOMKilled
This is the big one. Under load, the API server pods hit their memory limit and get killed (exit code 137). We first saw this with just ~50 DAG runs and 150–200 concurrent tasks — nowhere near our production load.
Here's what we're seeing:
Our best guess: Airflow 3 serves both the Core API (UI, REST) and the Execution API (task heartbeats, XCom pushes, state transitions) on the same Uvicorn workers. So when hundreds of worker pods are hammering the API server with heartbeats and XCom data, it creates memory pressure that takes down everything — including the UI.
We saw #58395 which describes something similar (fixed in 3.1.5 via DB query fixes). We're on 3.1.7 and still hitting it — our issue seems more about raw request volume than query inefficiency.
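Given that we never set requests/limits (see above), one obvious next step is explicit API server sizing and more replicas. A sketch, where the `apiServer` key and every number are placeholders rather than recommendations:

```yaml
# Sketch only: explicit API server sizing (all values are placeholders, not tested advice)
apiServer:
  replicas: 4               # spread Execution API traffic (heartbeats, XCom) across more pods
  resources:
    requests:
      cpu: "2"
      memory: 4Gi
    limits:
      memory: 8Gi           # headroom so heartbeat/XCom bursts don't hit the OOM killer
```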
2. PgBouncer is the bottleneck
With 64 Celery workers + hundreds of K8s executor pods + schedulers + API servers + DagProcessors all going through a single PgBouncer pod, the connection pool gets saturated:
- Health checks (`airflow jobs check`) queue up waiting for a DB connection
- "connection refused" errors when PgBouncer is overloaded

We've already bumped pool sizes from the defaults (`metadataPoolSize: 10`, `maxClientConn: 100`) up to 500/5000, but it still saturates at peak.

One thing I really want to understand: with AIP-72 in Airflow 3, are KubernetesExecutor worker pods still connecting directly to the metadata DB through PgBouncer? The pod template still includes `SQL_ALCHEMY_CONN` and the init containers still run `airflow db check`. #60271 seems to track this. If every K8s executor pod is opening its own PgBouncer connection, that would explain why our pool is exhausted.
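For anyone comparing notes, the rest of a fuller PgBouncer block might look like this; `resultBackendPoolSize` and the pod resources here are illustrative assumptions, not values from our setup:

```yaml
# Sketch: the single PgBouncer pod fronts all DB traffic, so it needs its own headroom
pgbouncer:
  resultBackendPoolSize: 20    # assumed: Celery result backend pool, separate from metadata
  resources:
    requests:
      cpu: "1"                 # assumed: CPU for the lone PgBouncer pod at peak
      memory: 512Mi
```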
3. API server takes forever to start

Each Uvicorn worker independently loads the full Airflow stack: FastAPI routes, providers, plugins, DAG parsing init, DB connection pools. With 6 workers, startup takes 4+ minutes. The Helm chart default startup probe (60s) is nowhere close to enough, and rolling deployments are painfully slow because of it.
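The startup probe change that made rolling deploys survivable is roughly the following; the key names are assumed from the chart, and the point is simply that failureThreshold × periodSeconds has to sit comfortably above the ~4-5 minute cold start:

```yaml
# Sketch: API server startup probe stretched to ~5 minutes (key names assumed)
apiServer:
  startupProbe:
    periodSeconds: 10
    timeoutSeconds: 10
    failureThreshold: 30    # 30 x 10s = 300s before Kubernetes restarts the container
```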
4. False-positive health check failures
Even with `SCHEDULER_HEALTH_CHECK_THRESHOLD=60`, the UI flags components as unhealthy during peak load. They're actually fine; they just can't write heartbeats fast enough because PgBouncer is contended.
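The threshold itself is just an `airflow.cfg` value, settable through the chart's `config` block; raising it further only hides the symptom, since the heartbeats are stuck behind PgBouncer:

```yaml
# Sketch: heartbeat tolerance before the UI marks the scheduler unhealthy
config:
  scheduler:
    scheduler_health_check_threshold: 60   # seconds since last heartbeat before flagging
```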
What we'd like help with

Given our scale (450 DAGs, 8K concurrent tasks at peak, 80K task instances daily), any guidance on these would be great:
What we've already tried
- Removed `cluster-autoscaler.kubernetes.io/safe-to-evict: "true"` from the API server (it was causing premature eviction)
- Bumped `WORKER_PODS_CREATION_BATCH_SIZE` (16 → 32) and `parallelism` (1024 → 2048)
- Added `max_prepared_statements = 100` to PgBouncer (fixed KEDA prepared statement errors)
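In `values.yaml` terms, the batch-size, parallelism, and prepared-statement changes look roughly like this (the `extraIni` key and the config section names are our reading of the chart, so treat them as assumptions):

```yaml
# Sketch of the mitigations above as chart values
config:
  core:
    parallelism: 2048                        # was 1024
  kubernetes_executor:
    worker_pods_creation_batch_size: 32      # was 16

pgbouncer:
  # appended to pgbouncer.ini; this is what fixed KEDA's prepared-statement errors
  extraIni: |
    max_prepared_statements = 100
```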
Airflow 2 vs 3: what changed

For context, here's a summary of the differences between our Airflow 2 production setup and what we've had to do for Airflow 3. The general trend is that everything needs more resources and more tolerance for slowness:
Happy to share Helm values, logs, or whatever else would help. Would really appreciate hearing from anyone dealing with similar stuff.