# perf(server): unblock event loop and grow concurrency budget #903
Add ServerConfig fields to make uvicorn process count, concurrency limits, socket backlog, and event-loop/HTTP parser implementation configurable. Defaults preserve current behavior (workers=1) while enabling operators to scale a single pod across multiple Python processes when apiserver capacity allows.

- pyproject.toml: switch to uvicorn[standard] for uvloop/httptools/watchfiles
- config.py: ServerConfig.workers, limit_concurrency, backlog, loop, http
- cli.py: thread new fields into uvicorn.run; force workers=1 under --reload
- main.py: pass loop/http to dev __main__ entry
- examples + configuration.md: document tunables and apiserver tradeoff

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…pool

Sandbox/snapshot/pool route handlers were async def but called synchronous service methods that issue blocking Kubernetes/Docker API requests (50-200 ms each). Each in-flight call stalled the entire event loop, serializing every concurrent request.

Convert blocking-only handlers to sync def so FastAPI offloads them to the anyio threadpool, letting concurrent requests run in parallel. create_sandbox stays async (its service is async with cooperative polling).

- api/lifecycle.py: 12 handlers async -> sync; drop manual to_thread in create_snapshot now that the route itself runs in the threadpool; drop unused asyncio import
- api/pool.py: 5 pool handlers async -> sync
- tests/test_routes_list_sandboxes.py: regression locks in threadpool parallelism (8 x 200 ms calls finish in ~250 ms, not ~1.6 s)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
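The effect described in this commit can be reproduced without FastAPI. The sketch below is only an illustration using the standard library: `blocking_service_call`, `handler_on_event_loop`, and `handler_in_threadpool` are hypothetical names, `time.sleep` stands in for a Kubernetes/Docker round-trip, and a `ThreadPoolExecutor` approximates the threadpool that sync routes are offloaded to. It is not the repository's regression test.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

CALLS = 8
LATENCY = 0.05  # shortened stand-in for a 50-200 ms API round-trip


def blocking_service_call() -> str:
    """Hypothetical synchronous service method; time.sleep models network I/O."""
    time.sleep(LATENCY)
    return "ok"


async def handler_on_event_loop() -> str:
    # An async def route calling blocking code directly: the loop is stalled
    # for the whole call, so concurrent requests run one after another.
    return blocking_service_call()


async def handler_in_threadpool(pool: ThreadPoolExecutor) -> str:
    # Approximately what happens to a sync def route: the call runs on a
    # worker thread and the event loop stays free to start the next request.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(pool, blocking_service_call)


async def measure() -> tuple[float, float]:
    start = time.perf_counter()
    await asyncio.gather(*(handler_on_event_loop() for _ in range(CALLS)))
    serial = time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=CALLS) as pool:
        start = time.perf_counter()
        await asyncio.gather(*(handler_in_threadpool(pool) for _ in range(CALLS)))
        parallel = time.perf_counter() - start
    return serial, parallel


serial_s, parallel_s = asyncio.run(measure())
print(f"on event loop: {serial_s:.2f}s, in threadpool: {parallel_s:.2f}s")
```

On the event loop the eight calls cost roughly the sum of their latencies; in the pool they cost roughly one latency plus scheduling overhead, which is the same shape as the ~1.6 s versus ~250 ms figures quoted above.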
list_custom_objects always issued a direct apiserver call, even though the informer is already watching the same namespace and serves get_custom_object from cache. Under multi-worker deployments the list QPS scales with workers x replicas and pressures the apiserver unnecessarily.

Prefer the informer cache when synced and the label selector falls within the supported in-memory grammar (empty, bare key existence, key=value, comma-joined AND). Anything else falls back to the existing direct API path, preserving today's behavior.

- services/k8s/label_selector.py: minimal parser/matcher for the subset of selectors callers in this repo actually emit
- services/k8s/informer.py: WorkloadInformer.list() snapshot helper
- services/k8s/client.py: list_custom_objects consults the cache first
- tests/k8s: cover label_selector grammar + cache-hit/miss/fallback behavior on the client

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
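A parser for that selector subset might look roughly like the following. This is a sketch under assumptions, not the contents of `label_selector.py`: only the name `parse_selector` and its "return `None` when unsupported" contract come from the PR description; `Requirement`, `matches`, and `list_from_cache` are hypothetical, and treating `==` as unsupported is a deliberately conservative choice made here.

```python
from typing import Optional

# Hypothetical representation: (key, value); value None means "key exists".
Requirement = tuple[str, Optional[str]]


def parse_selector(selector: str) -> Optional[list[Requirement]]:
    """Return requirements for the supported subset, or None if unsupported."""
    selector = selector.strip()
    if not selector:
        return []  # empty selector matches every object
    requirements: list[Requirement] = []
    for term in selector.split(","):
        term = term.strip()
        # Anything beyond bare keys and key=value (!=, ==, in, notin, !key)
        # is reported as unsupported so the caller can use the direct API path.
        if not term or "==" in term or any(ch in term for ch in "!()<> "):
            return None
        if "=" in term:
            key, _, value = term.partition("=")
            if not key or "=" in value:
                return None
            requirements.append((key, value))
        else:
            requirements.append((term, None))
    return requirements


def matches(labels: dict[str, str], requirements: list[Requirement]) -> bool:
    """All requirements must hold (comma-joined AND)."""
    return all(
        key in labels and (value is None or labels[key] == value)
        for key, value in requirements
    )


def list_from_cache(objects: list[dict], selector: str) -> Optional[list[dict]]:
    """Filter a cache snapshot; None tells the caller to fall back to the API."""
    requirements = parse_selector(selector)
    if requirements is None:
        return None
    return [
        obj for obj in objects
        if matches(obj.get("metadata", {}).get("labels") or {}, requirements)
    ]


objects = [
    {"metadata": {"name": "a", "labels": {"app": "web", "tier": "front"}}},
    {"metadata": {"name": "b", "labels": {"app": "db"}}},
]
print([o["metadata"]["name"] for o in list_from_cache(objects, "app=web,tier")])
print(list_from_cache(objects, "app in (web, db)"))
```

Returning `None` rather than raising keeps the fallback decision in the caller, so an unrecognized selector costs one direct API call instead of an error.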
The previous round moved blocking list/get/delete handlers onto sync def routes so FastAPI offloads them to anyio's default threadpool. Two follow-up bottlenecks remain:

1. anyio's default threadpool is 40 tokens; bursts of concurrent sandbox CRUD requests start queueing once that ceiling is hit.
2. lifecycle.create_sandbox is async and the Kubernetes service body still issues sync K8s calls (_ensure_pvc_volumes, workload_provider create/get/delete) directly on the event loop. Each 50-200 ms round-trip stalls every other in-flight request, and the rate limiter's time.sleep makes it worse when read/write QPS is set.

Add a configurable thread_pool_size (default 200) applied at lifespan startup, and wrap the blocking K8s calls inside the create path with asyncio.to_thread so the event loop stays responsive.

- config.py: ServerConfig.thread_pool_size
- main.py: lifespan sets anyio current_default_thread_limiter total_tokens
- services/k8s/kubernetes_service.py: to_thread wraps the four sync K8s calls in create_sandbox / _wait_for_sandbox_ready
- configuration.md, tests/test_config.py: doc and field tests

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
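To make the second bottleneck concrete, the sketch below counts how often a background coroutine gets to run while a blocking call is in flight, with and without `asyncio.to_thread`. It assumes Python 3.9+ and uses only the standard library; `create_workload_blocking`, `heartbeat`, and `count_ticks` are hypothetical names, and the 200 ms sleep stands in for a synchronous Kubernetes client call rather than reproducing the service code.

```python
import asyncio
import time


def create_workload_blocking() -> str:
    """Hypothetical stand-in for a synchronous Kubernetes client call."""
    time.sleep(0.2)
    return "created"


async def heartbeat(ticks: list[float]) -> None:
    # Records a timestamp every 20 ms, but only while the loop is free to run it.
    while True:
        ticks.append(time.perf_counter())
        await asyncio.sleep(0.02)


async def count_ticks(offload: bool) -> int:
    ticks: list[float] = []
    task = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0)  # let the heartbeat record its first tick
    if offload:
        # The blocking call runs on a worker thread; the loop keeps ticking.
        await asyncio.to_thread(create_workload_blocking)
    else:
        # The blocking call runs on the loop thread; nothing else is scheduled.
        create_workload_blocking()
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return len(ticks)


blocked_ticks = asyncio.run(count_ticks(offload=False))
offloaded_ticks = asyncio.run(count_ticks(offload=True))
print(f"ticks while blocked: {blocked_ticks}, while offloaded: {offloaded_ticks}")
```

The heartbeat plays the role of every other in-flight request: when the call is made directly it is starved for the full round-trip, and when the call is wrapped it keeps running.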
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f3763f4f16
Importing opensandbox_server.main in the CLI eagerly constructed sandbox_service, restoring containers and starting expiration Timer threads in the supervisor process before uvicorn.run was called. With [server].workers > 1 that left orphan timers in the supervisor and (on spawn) duplicated them across workers. Read config and logging directly in the CLI so only worker processes initialize the service graph.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
list_custom_objects returns the informer cache snapshot once synced, but create/patch/delete previously left the cache untouched, so a list immediately after a write could include the old or freshly-deleted object until the watch event arrived. Add delete_from_cache to the informer and have the K8sClient write paths upsert or evict cache entries through a non-creating informer lookup.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
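The write-through idea can be sketched with a small in-memory cache. Everything below is illustrative: `WorkloadCache`, `CachingClient`, `upsert`, `snapshot`, and `_existing_informer` are hypothetical names, the apiserver calls are elided, and only `delete_from_cache` is taken from the commit message above.

```python
import threading
from typing import Optional


class WorkloadCache:
    """Hypothetical in-memory stand-in for the informer's object cache."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._objects: dict[str, dict] = {}

    def upsert(self, obj: dict) -> None:
        with self._lock:
            self._objects[obj["metadata"]["name"]] = obj

    def delete_from_cache(self, name: str) -> None:
        with self._lock:
            self._objects.pop(name, None)

    def snapshot(self) -> list[dict]:
        with self._lock:
            return list(self._objects.values())


class CachingClient:
    """Hypothetical client whose writes keep an existing cache in step."""

    def __init__(self) -> None:
        self.informers: dict[str, WorkloadCache] = {}

    def _existing_informer(self, namespace: str) -> Optional[WorkloadCache]:
        # Non-creating lookup: a write must not start a watch as a side effect.
        return self.informers.get(namespace)

    def create(self, namespace: str, obj: dict) -> dict:
        # (a real client would send the create request to the apiserver here)
        informer = self._existing_informer(namespace)
        if informer is not None:
            informer.upsert(obj)
        return obj

    def delete(self, namespace: str, name: str) -> None:
        # (a real client would send the delete request to the apiserver here)
        informer = self._existing_informer(namespace)
        if informer is not None:
            informer.delete_from_cache(name)


client = CachingClient()
client.informers["sandboxes"] = WorkloadCache()
client.create("sandboxes", {"metadata": {"name": "sb-1"}})
after_create = [o["metadata"]["name"] for o in client.informers["sandboxes"].snapshot()]
client.delete("sandboxes", "sb-1")
after_delete = client.informers["sandboxes"].snapshot()
client.create("other", {"metadata": {"name": "sb-2"}})  # no informer for this namespace
print(after_create, after_delete, sorted(client.informers))
```

With this shape a list served from the cache right after a write already reflects that write, and the later watch event simply re-applies the same state.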
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 145555c178
Docker expiration timers live in process-local state on DockerSandboxService, so each uvicorn worker schedules its own threading.Timer per sandbox. A renewal handled by one worker only updates that process's _sandbox_expirations, leaving other workers to fire stale timers at the pre-renewal time and remove the sandbox. Reject the combination of the Docker runtime and [server].workers > 1 at AppConfig validation until the Docker runtime grows shared expiration state. Kubernetes is unaffected.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ble from TOML

Remove the [server].workers field. Multi-worker mode exposed too many foot-guns (per-process Docker expiration timers racing on renew, k8s informer cache divergence, import-time side effects in the supervisor) and the supported way to scale on Kubernetes is replica count, not in-process worker fan-out. uvicorn now runs single-process; the deferred-import comment in cli.py is kept for the reload supervisor.

Fix [server].limit_concurrency so the documented disable path actually works from TOML. TOML has no null literal, so Optional[int] could not be set to None: the field now accepts 0 as a sentinel and a field_validator collapses it to None before uvicorn sees it. Default 1024 is unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
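The sentinel handling can be illustrated without pydantic. The commit uses a `field_validator` on `ServerConfig`; the sketch below swaps in a plain dataclass with `__post_init__` as a dependency-free stand-in, and `ServerSettings` is a hypothetical name that models only this one field.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServerSettings:
    """Hypothetical stand-in for ServerConfig; models only limit_concurrency."""

    limit_concurrency: Optional[int] = 1024

    def __post_init__(self) -> None:
        # TOML cannot express null, so 0 is accepted as "no limit" and
        # collapsed to None, the value that disables the limit downstream.
        if self.limit_concurrency == 0:
            self.limit_concurrency = None


# Values as they might arrive after parsing a [server] table.
default = ServerSettings()
disabled = ServerSettings(**{"limit_concurrency": 0})
custom = ServerSettings(**{"limit_concurrency": 256})
print(default.limit_concurrency, disabled.limit_concurrency, custom.limit_concurrency)
```

Normalizing at the config boundary means the rest of the code only ever sees an integer limit or `None`, never the sentinel.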
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ae74253a66
Summary

…workers=1); operators dial up via the `[server]` TOML section.

Changes (commit by commit)

1. perf(server): expose uvicorn worker/concurrency knobs (745c1945)
   - `pyproject.toml`: `uvicorn` → `uvicorn[standard]` (pulls uvloop / httptools / watchfiles).
   - `ServerConfig`: `workers` (default 1), `limit_concurrency` (1024), `backlog` (2048), `loop` ("auto"), `http` ("auto").
   - `cli.py`: thread fields into `uvicorn.run`; `--reload` forces `workers=1` and prints a notice.
   - `main.py` dev `__main__`: pass `loop`/`http`.
   - …`configuration.md`) + unit tests (`tests/test_config.py`).
2. perf(server): unblock event loop by running blocking routes in threadpool (77327880)
   - `api/lifecycle.py`: 12 handlers `async def` → sync `def` (list/get/patch/delete/pause/resume/renew sandbox + create/list/get/delete snapshot + `get_sandbox_endpoint`). FastAPI auto-offloads sync routes to the anyio threadpool.
   - `api/pool.py`: 5 pool handlers same conversion.
   - `create_sandbox` stays async (its service is genuinely async).
   - `asyncio` import and manual `to_thread` inside `create_snapshot`.
   - `list_sandboxes` finishes in ~250 ms (vs 1.6 s serial floor).
3. perf(server): serve list_custom_objects from informer cache (35d9bf47)
   - `services/k8s/label_selector.py`: minimal grammar (empty / bare key / `key=value` / comma-AND); unsupported syntax → `parse_selector` returns `None` and the caller falls back to the direct API path.
   - `WorkloadInformer.list()` snapshot helper.
   - `K8sClient.list_custom_objects` consults the informer cache when synced; otherwise unchanged.
4. perf(server): grow anyio threadpool, unblock create path (225e5ece)
   - `ServerConfig.thread_pool_size` (default 200).
   - `lifespan`: `current_default_thread_limiter().total_tokens = thread_pool_size`.
   - `kubernetes_service`: wrap the four sync Kubernetes calls inside `create_sandbox` / `_wait_for_sandbox_ready` (`_ensure_pvc_volumes`, `workload_provider.create_workload`, `get_workload`, `delete_workload`) with `asyncio.to_thread` so the event loop stays responsive while the create path runs.

Behavior changes

- `[server]` keys (`workers`, `limit_concurrency`, `backlog`, `thread_pool_size`, `loop`, `http`). All additive; existing configs keep working.
- `list_custom_objects` now serves from the informer cache when synced; same eventual-consistency window as the existing `get_custom_object` path.

Risks

- `thread_pool_size` 200 per process; oversize trades fd / apiserver QPS pressure for parallelism; tune with `workers × replicaCount`.
- `informer_resync_seconds` (default 300 s); same as today's `get_custom_object`.
- `workers > 1` means N × informer watch streams to the apiserver; default stays at 1 to keep apiserver baseline unchanged.

Out of scope (follow-up PRs)

- `pool_service.list_pools` cache (uses its own list path).
- `list_pods` cache (CoreV1, arbitrary selectors).
- `docker_service.create_sandbox` event-loop blocking.

Testing
Breaking Changes
Checklist