fix(ai-proxy): yield to scheduler in streaming SSE loop to avoid worker CPU starvation #13255
Merged
nic-6443 merged 2 commits into apache:master on Apr 20, 2026
Conversation
…er CPU starvation

When an upstream LLM emits SSE chunks in a tight burst (e.g. a model hallucinating and producing tokens at 100+ per second), the streaming loop in parse_streaming_response can run for an extended period without yielding to the nginx scheduler.

body_reader() (cosocket recv) only yields when the recv buffer is empty; if the kernel has already buffered several chunks, successive calls return immediately. ngx.flush(true) only yields when the downstream send buffer is full; a fast client drains it immediately. So neither end of the loop guarantees a yield, and the SSE coroutine ends up monopolizing the worker — starving health checks, concurrent requests, and timer callbacks on the same worker.

Add an explicit ngx.sleep(0) at the end of each loop iteration. This is a no-op timer that just yields the current coroutine, allowing other ready coroutines to run. The cost is negligible: in normal AI traffic, chunks already arrive with inter-chunk gaps, so an extra yield per chunk is invisible; in burst scenarios it caps per-coroutine runtime at one chunk's worth of work.
Pull request overview
This PR mitigates OpenResty worker CPU starvation during bursty AI SSE streaming by forcing a cooperative yield in the AI provider streaming loop, ensuring other coroutines (health checks, concurrent requests, timers) can run even when both upstream reads and downstream flushes return immediately.
Changes:
- Add an explicit ngx.sleep(0) yield at the end of each iteration of the SSE streaming loop.
- Document in-code why the yield is necessary under bursty upstream + fast downstream conditions.
moonming previously approved these changes on Apr 20, 2026
membphis requested changes on Apr 20, 2026
…e#13256

Per review feedback, the comment now states explicitly that the yield prevents one request from monopolizing the worker but does not bound per-stream CPU time, add backpressure, or time out stalled streams. A real fix is tracked in apache#13256.
membphis approved these changes on Apr 20, 2026
moonming approved these changes on Apr 20, 2026
shreemaan-abhishek approved these changes on Apr 20, 2026
What

Add an explicit ngx.sleep(0) at the end of each iteration of the streaming SSE loop in apisix/plugins/ai-providers/base.lua::parse_streaming_response. This guarantees the coroutine yields to the nginx scheduler at least once per upstream chunk.

Why
In production we observed worker processes pinned at 100% CPU during AI proxy traffic. Root cause: when an upstream LLM emits SSE chunks in a tight burst (e.g. a model hallucinating and producing tokens at 100+ per second, or upstreams that batch multiple SSE events into a single TCP segment), the streaming loop runs for an extended period without yielding.
Specifically:
- body_reader() (cosocket socket:receive()) only yields when the recv buffer is empty. If the kernel has already buffered several chunks, successive calls return immediately without yielding.
- ngx.flush(true) (used downstream) only yields when the send buffer is full. A fast downstream client drains the buffer immediately, so flush returns without yielding.

Neither end of the loop guarantees a yield. The result: the SSE coroutine monopolizes the worker — starving health checks, concurrent requests on the same worker, and timer callbacks. Even modest traffic can saturate a single core, because Lua coroutines on the same OpenResty worker share one OS thread.
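To make the failure mode concrete, here is a minimal sketch of the loop shape being described. This is illustrative only — the names and structure are simplified, not the actual parse_streaming_response code:

```lua
-- Illustrative sketch (not the actual base.lua code): a streaming loop
-- in which no statement is guaranteed to yield during a burst.
-- body_reader is assumed to be a cosocket-backed chunk reader.
local function stream_loop(body_reader, handle_chunk)
    while true do
        -- Yields ONLY if the cosocket recv buffer is empty. When the
        -- kernel has already buffered several chunks, this returns
        -- immediately without yielding.
        local chunk, err = body_reader()
        if err then
            return nil, err
        end
        if not chunk then
            break  -- upstream closed the stream
        end

        handle_chunk(chunk)  -- parse SSE events
        ngx.print(chunk)

        -- Yields ONLY if the downstream send buffer is full. A fast
        -- client keeps it drained, so this too returns immediately.
        ngx.flush(true)

        -- => during a burst, this loop never reaches the scheduler.
    end
    return true
end
```

Since this code depends on the nginx runtime (`ngx.*`, cosockets), it is a shape sketch rather than a runnable snippet.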
ngx.sleep(0) is the canonical OpenResty primitive for this — it queues a 0-second timer and yields the current coroutine, letting the scheduler pick up any other ready coroutines, then resumes.

Cost
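The fix itself reduces to one line at the bottom of the loop body. Sketched with the same simplified, illustrative names as above (not the exact base.lua code):

```lua
-- Illustrative sketch of the fix: force a cooperative yield once per
-- chunk, regardless of whether recv or flush happened to yield.
while true do
    local chunk, err = body_reader()  -- may return without yielding
    if err then
        return nil, err
    end
    if not chunk then
        break
    end

    handle_chunk(chunk)
    ngx.print(chunk)
    ngx.flush(true)                   -- may also return without yielding

    -- Queue a zero-delay timer and yield this coroutine; the scheduler
    -- runs any other ready coroutines (health checks, other requests,
    -- timers) before resuming us. Caps per-turn runtime at one chunk.
    ngx.sleep(0)
end
```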
body_reader() already yields naturally between chunks, so in normal traffic the extra ngx.sleep(0) is invisible.

Test plan
This is a concurrency / scheduling fix where deterministic reproduction in test-nginx is difficult — burst behavior depends on TCP buffering between the mock upstream and the proxy, both of which run in the same nginx instance during tests, so timing rarely matches the real-world scenario. Existing streaming correctness tests (t/plugin/ai-proxy*.t, t/plugin/ai-proxy-client-disconnect.t) cover that the loop still produces correct output and that the new yield doesn't break the disconnect-detection or limit-enforcement paths. Per the project's testing exception for "concurrency issues that are hard to simulate", I'm relying on existing tests for correctness regression coverage.