fix(k8s): replace blocking time.sleep with asyncio.sleep in _wait_for_sandbox_ready #841

Merged
Pangjiping merged 1 commit into alibaba:main from
qingyuppp:fix/k8s-wait-for-sandbox-blocking-sleep
May 7, 2026

Conversation

@qingyuppp
Contributor

Problem

In _wait_for_sandbox_ready, the polling loop has two sleep branches with inconsistent behavior:

if not workload:
    time.sleep(poll_interval_seconds)    # blocks the event loop
    continue

# ...

await asyncio.sleep(poll_interval_seconds)  # correct async sleep

When a workload is not yet visible in the K8s API immediately after creation, the code falls into the time.sleep branch. Since _wait_for_sandbox_ready is an async method, this blocks the entire event loop for poll_interval_seconds (default: 1s) per iteration.
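
The difference between the two branches can be reproduced outside the server. The following is a minimal sketch, not code from this repository: `heartbeat`, `poll_blocking`, `poll_async`, and `count_ticks_during` are hypothetical names, and the heartbeat task merely stands in for any other work on the loop, such as a `/health` handler. It counts how often the heartbeat gets to run while each polling style is waiting.

```python
import asyncio
import time
from contextlib import suppress
from typing import Awaitable, Callable


async def heartbeat(ticks: list, interval: float = 0.01) -> None:
    # Hypothetical stand-in for other work sharing the event loop.
    while True:
        ticks[0] += 1
        await asyncio.sleep(interval)


async def poll_blocking(iterations: int, poll_interval_seconds: float) -> None:
    for _ in range(iterations):
        # time.sleep never yields control, so no other task can run.
        time.sleep(poll_interval_seconds)


async def poll_async(iterations: int, poll_interval_seconds: float) -> None:
    for _ in range(iterations):
        # asyncio.sleep suspends this coroutine and lets other tasks run.
        await asyncio.sleep(poll_interval_seconds)


async def count_ticks_during(
    poller: Callable[[int, float], Awaitable[None]],
) -> int:
    ticks = [0]
    task = asyncio.create_task(heartbeat(ticks))
    await poller(3, 0.05)
    observed = ticks[0]  # read before yielding to the loop again
    task.cancel()
    with suppress(asyncio.CancelledError):
        await task
    return observed


if __name__ == "__main__":
    print("ticks while blocking:", asyncio.run(count_ticks_during(poll_blocking)))
    print("ticks while awaiting:", asyncio.run(count_ticks_during(poll_async)))
```

In the blocking variant the heartbeat task is never scheduled during the wait, because the polling coroutine never suspends. In the awaiting variant it runs repeatedly.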

Impact

  • FastAPI cannot handle any other requests during the blocked period, including /health
  • The Kubernetes liveness probe can time out → the pod is killed and restarted
  • Affects every sandbox creation under K8s runtime, since the workload-not-found window occurs naturally right after creation
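
One way to make this impact observable is to measure event-loop lag, meaning how much later than requested a short `asyncio.sleep` returns. The sketch below is an illustration under assumptions, not part of the server: `measure_max_lag`, `blocking_step`, `async_step`, and `run_with_monitor` are hypothetical helpers. The measured lag approximates how long a pending request such as `/health` would have to wait before the loop could serve it.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def measure_max_lag(stop: asyncio.Event, interval: float = 0.01) -> float:
    # Repeatedly sleep for `interval` and record the worst overshoot.
    loop = asyncio.get_running_loop()
    max_lag = 0.0
    while not stop.is_set():
        start = loop.time()
        await asyncio.sleep(interval)
        max_lag = max(max_lag, loop.time() - start - interval)
    return max_lag


async def blocking_step(seconds: float) -> None:
    time.sleep(seconds)  # holds the event loop for the whole duration


async def async_step(seconds: float) -> None:
    await asyncio.sleep(seconds)  # releases the event loop while waiting


async def run_with_monitor(
    step: Callable[[float], Awaitable[None]], seconds: float
) -> float:
    stop = asyncio.Event()
    monitor = asyncio.create_task(measure_max_lag(stop))
    await asyncio.sleep(0)  # let the monitor begin its first sleep
    await step(seconds)
    stop.set()
    return await monitor


if __name__ == "__main__":
    print("max lag, blocking:", asyncio.run(run_with_monitor(blocking_step, 0.2)))
    print("max lag, async:   ", asyncio.run(run_with_monitor(async_step, 0.2)))
```

With a blocking step the worst-case lag is close to the full sleep duration. With an awaited step it should stay near zero on an otherwise idle loop.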

Fix

Replace time.sleep with await asyncio.sleep in the not workload branch, consistent with the sleep call later in the same loop.

File: server/opensandbox_server/services/k8s/kubernetes_service.py:199

# Before
time.sleep(poll_interval_seconds)

# After
await asyncio.sleep(poll_interval_seconds)
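
For context, a polling loop with the fixed shape might look like the sketch below. This is not the body of `_wait_for_sandbox_ready`, whose full signature is not shown in this PR. The function `wait_until_ready` and its `get_workload`, `is_ready`, and `timeout_seconds` parameters are assumptions made for illustration. Only `poll_interval_seconds` and the `if not workload` branch come from the description above.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def wait_until_ready(
    get_workload: Callable[[], Awaitable[Optional[dict]]],  # hypothetical lookup
    is_ready: Callable[[dict], bool],                        # hypothetical check
    timeout_seconds: float,
    poll_interval_seconds: float = 1.0,
) -> dict:
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_seconds
    while loop.time() < deadline:
        workload = await get_workload()
        if not workload:
            # Workload not visible yet: wait without blocking the loop.
            await asyncio.sleep(poll_interval_seconds)
            continue
        if is_ready(workload):
            return workload
        # Workload exists but is not ready: same non-blocking wait.
        await asyncio.sleep(poll_interval_seconds)
    raise TimeoutError("workload did not become ready in time")
```

Both wait paths use `await asyncio.sleep`, so the loop stays responsive whether the workload is missing or merely not ready.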

…_sandbox_ready

When the workload is not yet visible after creation, the polling loop
fell into a time.sleep branch instead of await asyncio.sleep. This
blocked the event loop during sandbox creation, preventing FastAPI from
handling other requests (including /health), which could cause liveness
probe failures and pod restarts.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 7, 2026 06:21
@CLAassistant

CLAassistant commented May 7, 2026

CLA assistant check
All committers have signed the CLA.

Contributor

Copilot AI left a comment

Pull request overview

This PR fixes an event-loop blocking bug in the Kubernetes sandbox readiness polling loop by replacing a synchronous sleep call with an async sleep inside the _wait_for_sandbox_ready coroutine. This prevents FastAPI/ASGI request handling (e.g., /health) from being stalled while waiting for the workload to appear in the Kubernetes API.

Changes:

  • Replace time.sleep(...) with await asyncio.sleep(...) when the workload is not yet visible during polling.
  • Keep polling behavior consistent across both sleep branches in the loop.
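
A regression of this kind could be caught with a small static check. The sketch below is a hypothetical helper, not something this PR adds: `find_blocking_sleeps` uses the standard `ast` module to report `time.sleep(...)` calls that appear inside `async def` functions. It only matches the literal `time.sleep` spelling, so `from time import sleep` or aliased imports would be missed, and a synchronous function nested inside a coroutine would also be reported.

```python
import ast


def find_blocking_sleeps(source: str) -> list:
    """Return (function name, line number) for each time.sleep call in an async def."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.AsyncFunctionDef):
            continue
        for inner in ast.walk(node):
            if (
                isinstance(inner, ast.Call)
                and isinstance(inner.func, ast.Attribute)
                and inner.func.attr == "sleep"
                and isinstance(inner.func.value, ast.Name)
                and inner.func.value.id == "time"
            ):
                hits.append((node.name, inner.lineno))
    return hits


# Simplified stand-ins for the loop before and after the change.
BEFORE = (
    "import asyncio, time\n"
    "async def wait(workload, interval):\n"
    "    if not workload:\n"
    "        time.sleep(interval)\n"
    "    await asyncio.sleep(interval)\n"
)
AFTER = BEFORE.replace("time.sleep(interval)", "await asyncio.sleep(interval)")
```

Run against real source files, a check like this could be wired into a unit test so that a blocking sleep in a coroutine fails CI rather than surfacing as probe timeouts.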

@Pangjiping Pangjiping self-assigned this May 7, 2026
@Pangjiping Pangjiping added the bug and component/server labels May 7, 2026
Collaborator

@Pangjiping Pangjiping left a comment

LGTM

@Pangjiping Pangjiping merged commit ad5c364 into alibaba:main May 7, 2026
21 of 23 checks passed
@qingyuppp
Contributor Author

Thanks for merging.

@qingyuppp qingyuppp deleted the fix/k8s-wait-for-sandbox-blocking-sleep branch May 7, 2026 07:01
Labels

bug, component/server

4 participants