Replies: 1 comment 1 reply
We cannot really give advice on concrete numbers -> https://airflow.apache.org/docs/apache-airflow/stable/installation/index.html#notes-about-minimum-requirements explains why. I think we would also like to learn from users like you about some best practices, so I guess we need to get more feedback and learnings (and replace some workarounds with permanent fixes) to arrive at some good guidelines/best practices. So it would be great to hear whether the above comment helps you optimize things.
Airflow 3 on EKS is way hungrier than Airflow 2 — hitting OOMs, PgBouncer bottlenecks, and flaky health checks at scale
We're migrating from Airflow 2.10.0 to 3.1.7 (self-managed EKS, not Astronomer/MWAA) and running into scaling issues during stress testing that we never had in Airflow 2. Our platform is fairly large — ~450 DAGs, some with ~200 tasks, doing about 1,500 DAG runs / 80K task instances per day. At peak we're looking at ~140 concurrent DAG runs and ~8,000 tasks running at the same time across a mix of Celery and KubernetesExecutor.
Would love to hear from anyone running Airflow 3 at similar scale.
Our setup
- Executors: `CeleryExecutor,KubernetesExecutor`
- Custom XCom backend (`S3XComBackend`)

Current stress-test topology

- Celery `worker_concurrency: 16`
- PgBouncer: `metadataPoolSize: 500`, `maxClientConn: 5000`

Key config:
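A minimal sketch of how that maps onto the chart's `values.yaml`, assuming the standard apache-airflow Helm chart keys (the XCom backend module path below is a placeholder, not our real one):

```yaml
# Sketch only: the topology above expressed as Helm chart values (key names assumed)
executor: "CeleryExecutor,KubernetesExecutor"

config:
  celery:
    worker_concurrency: 16
  core:
    xcom_backend: my_company.xcom.S3XComBackend   # placeholder module path

pgbouncer:
  enabled: true
  metadataPoolSize: 500     # chart default is 10
  maxClientConn: 5000       # chart default is 100
```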
We also had to relax liveness probes across the board (`timeoutSeconds: 60`, `failureThreshold: 10`) and extend the API server startup probe to 5 minutes; the Helm chart defaults were way too aggressive for our load.

One thing worth calling out: we never set CPU requests/limits on the API server, scheduler, or DagProcessor. We got away with that in Airflow 2, but it matters a lot more now that the API server handles execution traffic too.
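Roughly what the relaxed liveness probes look like in our values, assuming the chart exposes the usual per-component probe keys (shown for the scheduler and triggerer; we applied the same numbers everywhere):

```yaml
# Sketch: liveness probes loosened so heartbeat checks survive DB contention
scheduler:
  livenessProbe:
    timeoutSeconds: 60      # raised from the chart default, which timed out under load
    failureThreshold: 10    # tolerate a run of slow checks before restarting the pod

triggerer:
  livenessProbe:
    timeoutSeconds: 60
    failureThreshold: 10
```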
What's going wrong
1. API server keeps getting OOMKilled
This is the big one. Under load, the API server pods hit their memory limit and get killed (exit code 137). We first saw this with just ~50 DAG runs and 150–200 concurrent tasks — nowhere near our production load.
Here's what we're seeing:
Our best guess: Airflow 3 serves both the Core API (UI, REST) and the Execution API (task heartbeats, XCom pushes, state transitions) on the same Uvicorn workers. So when hundreds of worker pods are hammering the API server with heartbeats and XCom data, it creates memory pressure that takes down everything — including the UI.
We saw #58395 which describes something similar (fixed in 3.1.5 via DB query fixes). We're on 3.1.7 and still hitting it — our issue seems more about raw request volume than query inefficiency.
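Given that we never set requests/limits (see above), one obvious next step is explicit API server sizing and more replicas. A sketch, where the `apiServer` key and every number are placeholders rather than recommendations:

```yaml
# Sketch only: explicit API server sizing (all values are placeholders, not tested advice)
apiServer:
  replicas: 4               # spread Execution API traffic (heartbeats, XCom) across more pods
  resources:
    requests:
      cpu: "2"
      memory: 4Gi
    limits:
      memory: 8Gi           # headroom so heartbeat/XCom bursts don't hit the OOM killer
```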
2. PgBouncer is the bottleneck
With 64 Celery workers + hundreds of K8s executor pods + schedulers + API servers + DagProcessors all going through a single PgBouncer pod, the connection pool gets saturated:
- Health checks (`airflow jobs check`) queue up waiting for a DB connection
- "connection refused" errors when PgBouncer is overloaded

We've already bumped pool sizes from the defaults (`metadataPoolSize: 10`, `maxClientConn: 100`) up to 500/5000, but it still saturates at peak.

One thing I really want to understand: with AIP-72 in Airflow 3, are KubernetesExecutor worker pods still connecting directly to the metadata DB through PgBouncer? The pod template still includes `SQL_ALCHEMY_CONN` and the init containers still run `airflow db check`. #60271 seems to track this. If every K8s executor pod is opening its own PgBouncer connection, that would explain why our pool is exhausted.
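For anyone comparing notes, the rest of a fuller PgBouncer block might look like this; `resultBackendPoolSize` and the pod resources here are illustrative assumptions, not values from our setup:

```yaml
# Sketch: the single PgBouncer pod fronts all DB traffic, so it needs its own headroom
pgbouncer:
  resultBackendPoolSize: 20    # assumed: Celery result backend pool, separate from metadata
  resources:
    requests:
      cpu: "1"                 # assumed: CPU for the lone PgBouncer pod at peak
      memory: 512Mi
```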
3. API server takes forever to start

Each Uvicorn worker independently loads the full Airflow stack: FastAPI routes, providers, plugins, DAG parsing init, DB connection pools. With 6 workers, startup takes 4+ minutes. The Helm chart default startup probe (60s) is nowhere close to enough, and rolling deployments are painfully slow because of it.
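The startup probe change that made rolling deploys survivable is roughly the following; the key names are assumed from the chart, and the point is simply that failureThreshold × periodSeconds has to sit comfortably above the ~4-5 minute cold start:

```yaml
# Sketch: API server startup probe stretched to ~5 minutes (key names assumed)
apiServer:
  startupProbe:
    periodSeconds: 10
    timeoutSeconds: 10
    failureThreshold: 30    # 30 x 10s = 300s before Kubernetes restarts the container
```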
4. False-positive health check failures
Even with `SCHEDULER_HEALTH_CHECK_THRESHOLD=60`, the UI flags components as unhealthy during peak load. They're actually fine; they just can't write heartbeats fast enough because PgBouncer is contended.
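The threshold itself is just an `airflow.cfg` value, settable through the chart's `config` block; raising it further only hides the symptom, since the heartbeats are stuck behind PgBouncer:

```yaml
# Sketch: heartbeat tolerance before the UI marks the scheduler unhealthy
config:
  scheduler:
    scheduler_health_check_threshold: 60   # seconds since last heartbeat before flagging
```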
What we'd like help with

Given our scale (450 DAGs, 8K concurrent tasks at peak, 80K task instances daily), any guidance on these would be great:
What we've already tried
- Removed `cluster-autoscaler.kubernetes.io/safe-to-evict: "true"` from the API server (it was causing premature eviction)
- Bumped `WORKER_PODS_CREATION_BATCH_SIZE` (16 → 32) and `parallelism` (1024 → 2048)
- Added `max_prepared_statements = 100` to PgBouncer (fixed KEDA prepared statement errors)
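In `values.yaml` terms, the batch-size, parallelism, and prepared-statement changes look roughly like this (the `extraIni` key and the config section names are our reading of the chart, so treat them as assumptions):

```yaml
# Sketch of the mitigations above as chart values
config:
  core:
    parallelism: 2048                        # was 1024
  kubernetes_executor:
    worker_pods_creation_batch_size: 32      # was 16

pgbouncer:
  # appended to pgbouncer.ini; this is what fixed KEDA's prepared-statement errors
  extraIni: |
    max_prepared_statements = 100
```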
Airflow 2 vs 3: what changed

For context, here's a summary of the differences between our Airflow 2 production setup and what we've had to do for Airflow 3. The general trend is that everything needs more resources and more tolerance for slowness:
Happy to share Helm values, logs, or whatever else would help. Would really appreciate hearing from anyone dealing with similar stuff.