Background
PR #13255 adds an `ngx.sleep(0)` at the end of `parse_streaming_response()` (`apisix/plugins/ai-providers/base.lua`) to yield to the nginx scheduler inside the SSE streaming loop. Without it, when the upstream socket already has data buffered and the downstream client drains immediately, neither `body_reader()` nor `ngx.flush()` yields, so the loop monopolizes the worker CPU and blocks health checks and concurrent requests on the same worker.
As pointed out in #13255 (comment), that fix is a workaround: it prevents a single request from monopolizing the worker, but it does not solve the underlying problem.
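For context, the loop shape roughly looks like this. This is a simplified sketch, not the exact source of `parse_streaming_response()`; the `relay_sse` name and its body are illustrative only:

```lua
-- Sketch of the SSE relay loop (illustrative names, not the real code).
local function relay_sse(body_reader)
    while true do
        -- body_reader() can return already-buffered data without ever
        -- hitting the event loop, so this call may never yield.
        local chunk, err = body_reader()
        if err then
            return nil, err
        end
        if not chunk then
            break  -- upstream finished the stream
        end

        ngx.print(chunk)
        -- ngx.flush(true) also returns immediately when the downstream
        -- client drains the send buffer fast enough -- again, no yield.
        ngx.flush(true)

        -- PR #13255's workaround: force one trip through the nginx
        -- scheduler per chunk so other coroutines on this worker can run.
        ngx.sleep(0)
    end
    return true
end
```

When both the upstream read and the downstream flush complete synchronously, the `ngx.sleep(0)` is the only yield point in the entire loop.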
Real problems still to solve
- One worker, one client — if a single SSE client keeps the upstream busy forever, that client still consumes one full worker for the entire lifetime of the stream. `ngx.sleep(0)` only interleaves it with other coroutines on the same worker; it does not bound per-request CPU time.
- No backpressure / fairness across requests — a slow downstream client that never drains will keep the buffer full and the loop hot. We have no per-stream rate limiting or fair scheduling for SSE.
- No timeout for stalled streams — there is no upper bound on how long a streaming response can stay in the loop.
- Yield granularity is coarse — `ngx.sleep(0)` after every chunk is cheap-ish but still adds an event-loop hop per SSE event; for very chatty providers this is wasteful.
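On the missing-timeout point: one way to bound a stream's total lifetime is a deadline check inside the loop. This is a hedged sketch only; `max_stream_time` is a hypothetical limit, not an existing ai-proxy option:

```lua
-- Sketch: cap a streaming response's total wall-clock duration.
-- max_stream_time is hypothetical; it would come from plugin config.
local max_stream_time = 600  -- seconds
local started_at = ngx.now()

while true do
    local chunk, err = body_reader()
    if err or not chunk then
        break
    end

    -- ngx.now() is cached per event-loop iteration, but the
    -- ngx.sleep(0) below yields each pass, so it stays fresh enough.
    if ngx.now() - started_at > max_stream_time then
        ngx.log(ngx.WARN, "SSE stream exceeded max duration, aborting")
        break
    end

    ngx.print(chunk)
    ngx.flush(true)
    ngx.sleep(0)
end
```

A per-read idle timeout on the upstream cosocket would complement this, catching streams that stall rather than streams that merely run long.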
Possible directions
- Move SSE proxying to a dedicated lightweight path that uses cosocket reads with explicit yield points and a configurable max chunks-per-yield.
- Add per-stream timeouts and total-duration limits configurable on the ai-proxy plugin.
- Investigate whether nginx `proxy_buffering off` + native streaming (without going through the Lua body filter) can handle a subset of cases.
- Add a worker-level concurrency cap for streaming AI requests.
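The first direction above (a configurable max chunks-per-yield) could look roughly like this. A hedged sketch under the assumption that a new plugin config field exists; `chunks_per_yield` is hypothetical:

```lua
-- Sketch: yield to the scheduler every N chunks instead of every chunk,
-- amortizing the event-loop hop for chatty providers.
-- chunks_per_yield is a hypothetical ai-proxy config field.
local chunks_per_yield = 16
local sent = 0

while true do
    local chunk, err = body_reader()
    if err or not chunk then
        break
    end

    ngx.print(chunk)
    ngx.flush(true)

    sent = sent + 1
    if sent % chunks_per_yield == 0 then
        -- One scheduler hop per N SSE events keeps fairness while
        -- cutting the per-event overhead of ngx.sleep(0).
        ngx.sleep(0)
    end
end
```

`chunks_per_yield = 1` would reproduce today's behavior, so the knob degrades gracefully to the current workaround.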
Acceptance
`ngx.sleep(0)` workaround.