# perf(server): unblock event loop and grow concurrency budget by Pangjiping · Pull Request #903 · alibaba/OpenSandbox

Pangjiping · 2026-05-17T07:43:30Z

Summary

Remove sync Kubernetes/Docker calls from the event loop so concurrent control-plane requests stop serializing.
Bump per-process concurrency knobs: uvicorn workers/limits, anyio threadpool size, informer-cached list path.
Defaults preserve current behavior (workers=1); operators dial up via the [server] TOML section.

Changes (commit by commit)

1. `perf(server): expose uvicorn worker/concurrency knobs` (`745c1945`)

pyproject.toml: uvicorn → uvicorn[standard] (pulls uvloop / httptools / watchfiles).
ServerConfig: workers (default 1), limit_concurrency (1024), backlog (2048), loop ("auto"), http ("auto").
cli.py: thread fields into uvicorn.run; --reload forces workers=1 and prints a notice.
main.py dev __main__: pass loop / http.
Docs (configuration.md) + unit tests (tests/test_config.py).
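
As a rough sketch of how the new fields might be threaded into `uvicorn.run`, the helper below builds the keyword arguments from a config object. The dataclass is a stdlib stand-in for the real `ServerConfig`, and `build_uvicorn_kwargs` is a hypothetical name that does not come from the PR; only the keyword names (`workers`, `limit_concurrency`, `backlog`, `loop`, `http`, `reload`) are actual `uvicorn.run` parameters.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ServerConfig:
    """Stand-in for the real ServerConfig; defaults follow the PR description."""

    workers: int = 1
    limit_concurrency: Optional[int] = 1024
    backlog: int = 2048
    loop: str = "auto"
    http: str = "auto"


def build_uvicorn_kwargs(cfg: ServerConfig, reload: bool = False) -> dict[str, Any]:
    """Translate config fields into uvicorn.run keyword arguments (hypothetical helper)."""
    workers = cfg.workers
    if reload and workers != 1:
        # Mirrors the described behavior: --reload forces a single worker.
        print("--reload is set; forcing workers=1")
        workers = 1
    return {
        "workers": workers,
        "limit_concurrency": cfg.limit_concurrency,
        "backlog": cfg.backlog,
        "loop": cfg.loop,
        "http": cfg.http,
        "reload": reload,
    }


kwargs = build_uvicorn_kwargs(ServerConfig(workers=4), reload=True)
print(kwargs["workers"])  # 1
# A caller would then do something like (the "app" attribute name is an assumption):
# uvicorn.run("opensandbox_server.main:app", **kwargs)
```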

2. `perf(server): unblock event loop by running blocking routes in threadpool` (`77327880`)

api/lifecycle.py: 12 handlers async def → sync def (list/get/patch/delete/pause/resume/renew sandbox + create/list/get/delete snapshot + get_sandbox_endpoint). FastAPI auto-offloads sync routes to the anyio threadpool.
api/pool.py: 5 pool handlers same conversion.
create_sandbox stays async (its service is genuinely async).
Drop now-unused asyncio import and manual to_thread inside create_snapshot.
New regression: 8 × 200 ms concurrent list_sandboxes finishes in ~250 ms (vs 1.6 s serial floor).
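
The effect the regression test locks in can be reproduced with the standard library alone. In this minimal sketch, `time.sleep` stands in for a blocking Kubernetes/Docker call and `asyncio.to_thread` stands in for FastAPI's offload of sync routes to the anyio threadpool; all function names are illustrative and not taken from the PR.

```python
import asyncio
import time

BLOCK_SECONDS = 0.2  # stands in for one blocking Kubernetes/Docker round-trip
REQUESTS = 8


def blocking_list_sandboxes() -> str:
    """Pretend service call that blocks the calling thread."""
    time.sleep(BLOCK_SECONDS)
    return "ok"


async def handler_on_event_loop() -> str:
    # An async def route that calls sync code directly: the loop is stuck
    # for the whole call, so concurrent requests run one after another.
    return blocking_list_sandboxes()


async def handler_in_threadpool() -> str:
    # Offloading to a worker thread keeps the loop free to start other requests.
    return await asyncio.to_thread(blocking_list_sandboxes)


async def timed(handler) -> float:
    """Fire REQUESTS concurrent calls and return the wall-clock time taken."""
    start = time.perf_counter()
    await asyncio.gather(*(handler() for _ in range(REQUESTS)))
    return time.perf_counter() - start


serial_elapsed = asyncio.run(timed(handler_on_event_loop))
parallel_elapsed = asyncio.run(timed(handler_in_threadpool))
print(f"on event loop: {serial_elapsed:.2f}s, in threadpool: {parallel_elapsed:.2f}s")
```

The first figure cannot drop below 8 × 0.2 s = 1.6 s; the second depends on how many worker threads the default executor provides on the machine.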

3. `perf(server): serve list_custom_objects from informer cache` (`35d9bf47`)

New services/k8s/label_selector.py: minimal grammar (empty / bare key / key=value / comma-AND); unsupported syntax → parse_selector returns None and the caller falls back to the direct API path.
WorkloadInformer.list() snapshot helper.
K8sClient.list_custom_objects consults the informer cache when synced; otherwise unchanged.
Tests cover grammar plus cache-hit / unsynced-fallback / unsupported-selector-fallback paths.
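
To make the grammar concrete, here is a minimal sketch of a parser and matcher for that subset. It is not the code from `label_selector.py`: the `Requirement` alias and the `matches` helper are hypothetical, and treating `==` as unsupported is an assumption of this sketch. Only `parse_selector` and its return-`None`-on-unsupported contract come from the description above.

```python
from typing import Optional

# (key, value); a value of None means "the key merely has to exist".
Requirement = tuple[str, Optional[str]]


def parse_selector(selector: str) -> Optional[list[Requirement]]:
    """Return parsed requirements, or None when the syntax is unsupported."""
    selector = selector.strip()
    if not selector:
        return []  # an empty selector matches everything
    requirements: list[Requirement] = []
    for term in selector.split(","):
        term = term.strip()
        # Set-based terms (in / notin), negation and '==' fall outside this subset.
        if not term or any(ch in term for ch in "!() ") or term.count("=") > 1:
            return None
        key, sep, value = term.partition("=")
        if not key:
            return None
        requirements.append((key, value if sep else None))
    return requirements


def matches(requirements: list[Requirement], labels: dict[str, str]) -> bool:
    """AND together every requirement against one object's labels."""
    return all(
        key in labels if value is None else labels.get(key) == value
        for key, value in requirements
    )


reqs = parse_selector("app=web,tier")
print(reqs)  # [('app', 'web'), ('tier', None)]
print(parse_selector("env in (a,b)"))  # None, so the caller uses the direct API path
```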

4. `perf(server): grow anyio threadpool, unblock create path` (`225e5ece`)

ServerConfig.thread_pool_size (default 200).
lifespan: current_default_thread_limiter().total_tokens = thread_pool_size.
kubernetes_service: wrap the four sync Kubernetes calls inside create_sandbox / _wait_for_sandbox_ready (_ensure_pvc_volumes, workload_provider.create_workload, get_workload, delete_workload) with asyncio.to_thread so the event loop stays responsive while the create path runs.
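
The create-path pattern can be sketched as follows. The two sync functions are hypothetical stand-ins for the blocking provider calls, and the heartbeat task exists only to show that the loop keeps scheduling other work while the create path runs; none of this is the actual `kubernetes_service` code.

```python
import asyncio
import time


def create_workload(name: str) -> str:
    """Hypothetical sync call: blocks for one API round-trip."""
    time.sleep(0.1)
    return name


def get_workload(name: str) -> dict:
    """Hypothetical sync call: reports the workload as ready after one round-trip."""
    time.sleep(0.1)
    return {"name": name, "ready": True}


async def create_sandbox(name: str) -> dict:
    # Each blocking call runs in a worker thread instead of on the event loop.
    await asyncio.to_thread(create_workload, name)
    while True:
        workload = await asyncio.to_thread(get_workload, name)
        if workload["ready"]:
            return workload
        await asyncio.sleep(0.05)  # cooperative polling between checks


async def heartbeat(stop: asyncio.Event) -> int:
    """Count how often the loop gets to run this task while create is in flight."""
    ticks = 0
    while not stop.is_set():
        ticks += 1
        await asyncio.sleep(0.01)
    return ticks


async def main() -> tuple[dict, int]:
    stop = asyncio.Event()
    beat = asyncio.create_task(heartbeat(stop))
    workload = await create_sandbox("demo")
    stop.set()
    return workload, await beat


workload, ticks = asyncio.run(main())
print(workload, ticks)
```

If the two calls ran directly on the loop, the heartbeat would barely tick during the roughly 0.2 s the create path takes.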

Behavior changes

New [server] keys (workers, limit_concurrency, backlog, thread_pool_size, loop, http). All additive; existing configs keep working.
list_custom_objects now serves from the informer cache when synced; same eventual-consistency window as the existing get_custom_object path.
No change to API contracts, response shapes, error codes, or HTTP semantics.

Risks

thread_pool_size defaults to 200 per process; oversizing it trades file-descriptor and apiserver QPS pressure for parallelism, so tune it together with workers × replicaCount.
Informer cache lag is bounded by watch latency (ms) and informer_resync_seconds (default 300 s); same as today's get_custom_object.
workers > 1 means N × informer watch streams to the apiserver; default stays at 1 to keep apiserver baseline unchanged.

Out of scope (follow-up PRs)

HPA Helm template.
Observability / Prometheus metrics.
pool_service.list_pools cache (uses its own list path).
list_pods cache (CoreV1, arbitrary selectors).
docker_service.create_sandbox event-loop blocking.

Testing

Not run (explain why)
Unit tests
Integration tests
e2e / manual verification

Breaking Changes

None
Yes (describe impact and migration path)

Checklist

Linked Issue or clearly described motivation: closes #887 (server worker block)
Added/updated docs (if needed)
Added/updated tests (if needed)
Security impact considered
Backward compatibility considered

Add ServerConfig fields to make uvicorn process count, concurrency limits, socket backlog, and event-loop/HTTP parser implementation configurable. Defaults preserve current behavior (workers=1) while enabling operators to scale a single pod across multiple Python processes when apiserver capacity allows.

- pyproject.toml: switch to uvicorn[standard] for uvloop/httptools/watchfiles
- config.py: ServerConfig.workers, limit_concurrency, backlog, loop, http
- cli.py: thread new fields into uvicorn.run; force workers=1 under --reload
- main.py: pass loop/http to dev __main__ entry
- examples + configuration.md: document tunables and apiserver tradeoff

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…pool

Sandbox/snapshot/pool route handlers were async def but called synchronous service methods that issue blocking Kubernetes/Docker API requests (50-200 ms each). Each in-flight call stalled the entire event loop, serializing every concurrent request.

Convert blocking-only handlers to sync def so FastAPI offloads them to the anyio threadpool, letting concurrent requests run in parallel. create_sandbox stays async (its service is async with cooperative polling).

- api/lifecycle.py: 12 handlers async -> sync; drop manual to_thread in create_snapshot now that the route itself runs in the threadpool; drop unused asyncio import
- api/pool.py: 5 pool handlers async -> sync
- tests/test_routes_list_sandboxes.py: regression locks in threadpool parallelism (8 x 200 ms calls finish in ~250 ms, not ~1.6 s)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

list_custom_objects always issued a direct apiserver call, even though the informer is already watching the same namespace and serves get_custom_object from cache. Under multi-worker deployments the list QPS scales with workers x replicas and pressures the apiserver unnecessarily.

Prefer the informer cache when synced and the label selector falls within the supported in-memory grammar (empty, bare key existence, key=value, comma-joined AND). Anything else falls back to the existing direct API path, preserving today's behavior.

- services/k8s/label_selector.py: minimal parser/matcher for the subset of selectors callers in this repo actually emit
- services/k8s/informer.py: WorkloadInformer.list() snapshot helper
- services/k8s/client.py: list_custom_objects consults the cache first
- tests/k8s: cover label_selector grammar + cache-hit/miss/fallback behavior on the client

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The previous round moved blocking list/get/delete handlers onto sync def routes so FastAPI offloads them to anyio's default threadpool. Two follow-up bottlenecks remain:

1. anyio's default threadpool is 40 tokens; bursts of concurrent sandbox CRUD requests start queueing once that ceiling is hit.
2. lifecycle.create_sandbox is async and the Kubernetes service body still issues sync K8s calls (_ensure_pvc_volumes, workload_provider create/get/delete) directly on the event loop. Each 50-200 ms round-trip stalls every other in-flight request, and the rate limiter's time.sleep makes it worse when read/write QPS is set.

Add a configurable thread_pool_size (default 200) applied at lifespan startup, and wrap the blocking K8s calls inside the create path with asyncio.to_thread so the event loop stays responsive.

- config.py: ServerConfig.thread_pool_size
- main.py: lifespan sets anyio current_default_thread_limiter total_tokens
- services/k8s/kubernetes_service.py: to_thread wraps the four sync K8s calls in create_sandbox / _wait_for_sandbox_ready
- configuration.md, tests/test_config.py: doc and field tests

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
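
The queueing effect of the token ceiling comes down to simple arithmetic. The sketch below is an idealized lower bound, assuming every call takes the same time and ignoring scheduling overhead; `queueing_floor_ms` is a hypothetical helper, not part of the PR.

```python
import math


def queueing_floor_ms(concurrent_requests: int, tokens: int, call_ms: float) -> float:
    """Best-case time to drain a burst when only `tokens` calls can run at once."""
    waves = math.ceil(concurrent_requests / tokens)
    return waves * call_ms


# A burst of 200 requests, each blocking for 100 ms:
print(queueing_floor_ms(200, tokens=40, call_ms=100.0))   # 500.0 (five waves)
print(queueing_floor_ms(200, tokens=200, call_ms=100.0))  # 100.0 (one wave)
```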


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f3763f4f16

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Importing opensandbox_server.main in the CLI eagerly constructed sandbox_service, restoring containers and starting expiration Timer threads in the supervisor process before uvicorn.run was called. With [server].workers > 1 that left orphan timers in the supervisor and (on spawn) duplicated them across workers. Read config and logging directly in the CLI so only worker processes initialize the service graph.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

list_custom_objects returns the informer cache snapshot once synced, but create/patch/delete previously left the cache untouched, so a list immediately after a write could include the old or freshly-deleted object until the watch event arrived. Add delete_from_cache to the informer and have the K8sClient write paths upsert or evict cache entries through a non-creating informer lookup.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
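
The write-through idea can be illustrated with a toy cache. Everything here is a simplified stand-in: `WorkloadCache`, `Client`, and `snapshot` are hypothetical names (the description mentions `WorkloadInformer.list()` and `delete_from_cache`), and the real client would also issue the apiserver request before touching the cache.

```python
import threading
from typing import Optional


class WorkloadCache:
    """Toy stand-in for an informer cache keyed by object name."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._objects: dict[str, dict] = {}

    def upsert(self, obj: dict) -> None:
        with self._lock:
            self._objects[obj["metadata"]["name"]] = obj

    def delete_from_cache(self, name: str) -> None:
        with self._lock:
            self._objects.pop(name, None)

    def snapshot(self) -> list[dict]:
        with self._lock:
            return list(self._objects.values())


class Client:
    """Write paths update the cache right away instead of waiting for a watch event."""

    def __init__(self, cache: Optional[WorkloadCache]) -> None:
        # None models "no informer exists yet"; writes must not create one.
        self._cache = cache

    def create(self, obj: dict) -> dict:
        # (the real client would call the apiserver here)
        if self._cache is not None:
            self._cache.upsert(obj)
        return obj

    def delete(self, name: str) -> None:
        # (the real client would call the apiserver here)
        if self._cache is not None:
            self._cache.delete_from_cache(name)


cache = WorkloadCache()
client = Client(cache)
client.create({"metadata": {"name": "sandbox-a"}})
print([o["metadata"]["name"] for o in cache.snapshot()])  # ['sandbox-a']
client.delete("sandbox-a")
print(cache.snapshot())  # []
```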

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 145555c178


Docker expiration timers live in process-local state on DockerSandboxService, so each uvicorn worker schedules its own threading.Timer per sandbox. A renewal handled by one worker only updates that process's _sandbox_expirations, leaving other workers to fire stale timers at the pre-renewal time and remove the sandbox. Reject the combination at AppConfig validation until the Docker runtime grows shared expiration state. Kubernetes is unaffected.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
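
A validation guard of this kind might look like the sketch below. The real `AppConfig` is not shown in the PR text, so the class name, the `runtime_type` and `workers` field names, and the error message are all assumptions, and a stdlib dataclass stands in for whatever validation mechanism the project uses.

```python
from dataclasses import dataclass


@dataclass
class AppConfigSketch:
    """Hypothetical stand-in for AppConfig; field names are assumptions."""

    runtime_type: str = "docker"
    workers: int = 1

    def __post_init__(self) -> None:
        # Per-process expiration timers cannot see renewals handled by
        # another worker, so refuse the combination up front.
        if self.runtime_type == "docker" and self.workers > 1:
            raise ValueError(
                "workers > 1 is not supported with the Docker runtime: "
                "expiration timers are process-local"
            )


AppConfigSketch(runtime_type="kubernetes", workers=4)  # accepted
try:
    AppConfigSketch(runtime_type="docker", workers=2)
except ValueError as exc:
    print(f"rejected: {exc}")
```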

…ble from TOML

Remove the [server].workers field. Multi-worker mode exposed too many foot-guns (per-process Docker expiration timers racing on renew, k8s informer cache divergence, import-time side effects in the supervisor) and the supported way to scale on Kubernetes is replica count, not in-process worker fan-out. uvicorn now runs single-process; the deferred-import comment in cli.py is kept for the reload supervisor.

Fix [server].limit_concurrency so the documented disable path actually works from TOML. TOML has no null literal, so Optional[int] could not be set to None: the field now accepts 0 as a sentinel and a field_validator collapses it to None before uvicorn sees it. Default 1024 is unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
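
The sentinel handling can be sketched with the standard library. The commit describes a `field_validator` on the real config model; the dataclass and `__post_init__` hook below are a stdlib stand-in under that assumption, not the actual implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServerConfigSketch:
    """Stand-in for ServerConfig; only limit_concurrency is modeled."""

    limit_concurrency: Optional[int] = 1024

    def __post_init__(self) -> None:
        # TOML cannot express null, so `limit_concurrency = 0` in the
        # [server] section is read as "no limit" and collapsed to None.
        if self.limit_concurrency == 0:
            self.limit_concurrency = None


print(ServerConfigSketch().limit_concurrency)                     # 1024
print(ServerConfigSketch(limit_concurrency=0).limit_concurrency)  # None
print(ServerConfigSketch(limit_concurrency=64).limit_concurrency) # 64
```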

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ae74253a66


ninan-nn

LGTM

Pangjiping and others added 4 commits May 17, 2026 15:07

Pangjiping assigned Generalwin and ninan-nn May 17, 2026

Pangjiping requested review from Generalwin, hittyt, jwx0925 and ninan-nn as code owners May 17, 2026 07:43

Pangjiping added the bug and component/server labels May 17, 2026

Pangjiping added 2 commits May 18, 2026 12:34

Merge remote-tracking branch 'origin/main' into perf/server-uvicorn-t…

0b401b7

…uning

chore: trigger kubernetes mini e2e test

f3763f4

chatgpt-codex-connector Bot reviewed May 18, 2026


Comment thread server/opensandbox_server/cli.py Outdated

Comment thread server/opensandbox_server/services/k8s/client.py

Pangjiping and others added 2 commits May 18, 2026 13:40

chatgpt-codex-connector Bot reviewed May 18, 2026


Comment thread server/opensandbox_server/cli.py Outdated

Comment thread server/opensandbox_server/config.py Outdated

Comment thread server/opensandbox_server/cli.py Outdated

Generalwin previously approved these changes May 18, 2026


Pangjiping dismissed Generalwin’s stale review via ae74253 May 18, 2026 06:31

chatgpt-codex-connector Bot reviewed May 18, 2026


Comment thread server/opensandbox_server/services/k8s/client.py

Comment thread server/opensandbox_server/services/k8s/client.py

Generalwin approved these changes May 18, 2026


ninan-nn approved these changes May 18, 2026


Pangjiping merged commit a25dcb3 into alibaba:main May 18, 2026
20 of 21 checks passed

Pangjiping deleted the perf/server-uvicorn-tuning branch May 18, 2026 10:54


Pangjiping merged 10 commits into alibaba:main from Pangjiping:perf/server-uvicorn-tuning

3 participants