Problem
Containerfile:31 runs uvicorn with no --limit-concurrency. Each /execute request can spawn a subprocess that allocates up to 512 MB (the RLIMIT_AS ceiling from the minimal profile). With the default 704Mi pod memory limit and ~80 MB for the FastAPI parent, the pod can absorb roughly one concurrent execution before the cgroup OOM-killer fires.
In practice:
- 1 concurrent request: ~80 MB (parent) + 512 MB (subprocess) = ~592 MB — fits in 704Mi
- 2 concurrent requests: ~80 MB + 2 × 512 MB = ~1104 MB — exceeds 704Mi → cgroup OOM-kill
The current chart implicitly assumes the caller serializes requests, but nothing enforces that.
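For reference, the spawn pattern that produces the 512 MB ceiling typically looks something like the sketch below. This is an illustration only, not the actual pipeline.py code: the function names and the preexec_fn mechanism are assumptions; only the 512 MB figure comes from the profile described above.

```python
import asyncio
import resource

# 512 MB ceiling from the minimal profile (figure taken from the issue text).
RLIMIT_AS_BYTES = 512 * 1024 * 1024


def _apply_rlimits() -> None:
    # Runs in the child between fork and exec: caps the address space so the
    # child fails allocation instead of growing past the profile's ceiling.
    resource.setrlimit(resource.RLIMIT_AS, (RLIMIT_AS_BYTES, RLIMIT_AS_BYTES))


async def run_user_code(cmd: list[str]) -> int:
    # Each in-flight call can hold up to RLIMIT_AS_BYTES of address space on
    # top of the FastAPI parent, which is what makes unbounded concurrency
    # unsafe inside a 704Mi pod.
    proc = await asyncio.create_subprocess_exec(*cmd, preexec_fn=_apply_rlimits)
    return await proc.wait()
```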
Options
1. --limit-concurrency 1 on uvicorn: simplest, guarantees only one in-flight request per pod. Scale horizontally via replicas.
2. Semaphore in pipeline.py: an asyncio.Semaphore(1) around the subprocess spawn, returning 429 to excess callers. More informative to the caller than a connection queue.
3. Both: the semaphore for clean 429s, --limit-concurrency as a backstop.
Option 3 is belt-and-suspenders but cleanest operationally. The semaphore limit could be made configurable via env var to support pods with higher memory limits.
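A minimal sketch of option 3, assuming a MAX_CONCURRENT_EXECUTIONS env var and a run_pipeline() helper that does the subprocess spawn; neither name comes from the current code:

```python
import asyncio
import os

from fastapi import FastAPI, HTTPException

app = FastAPI()

# Assumed env var name; a default of 1 matches the 704Mi sizing worked out above.
MAX_CONCURRENT_EXECUTIONS = int(os.environ.get("MAX_CONCURRENT_EXECUTIONS", "1"))
_execute_slots = asyncio.Semaphore(MAX_CONCURRENT_EXECUTIONS)


async def run_pipeline(payload: dict) -> dict:
    # Stand-in for the real pipeline.py call that spawns the RLIMIT_AS-capped
    # subprocess; the actual signature in the repo may differ.
    raise NotImplementedError


@app.post("/execute")
async def execute(payload: dict) -> dict:
    # Reject instead of queueing: the caller gets an explicit 429 rather than
    # waiting behind a connection backlog it cannot see.
    if _execute_slots.locked():
        raise HTTPException(status_code=429, detail="execution slot busy, retry later")
    async with _execute_slots:
        return await run_pipeline(payload)
```

The uvicorn backstop still applies at the connection level, but --limit-concurrency answers overflow requests with a 503 rather than a 429, which is why the in-app semaphore gives callers the clearer signal.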
Context
Found during review of #20 (Python 3.12 memory fix). This was already true at the old 200 MB default (4 × 200 = 800 > 256Mi), so it predates that PR.