
fix: release lock before streaming to prevent orphan on client disconnect#234

Closed
beran-t wants to merge 1 commit into e2b-dev:main from beran-t:fix/lock-orphan-on-disconnect

Conversation

@beran-t beran-t commented Mar 23, 2026

Summary

Fixes #213: asyncio.Lock in messaging.py not released on client disconnect, causing cascading timeouts.

  • Narrows the lock scope in ContextWebSocket.execute() so it only covers the prepare+send phase (Phase A), not result streaming (Phase B)
  • Adds try/finally around streaming to clean up the _executions entry even if the generator is abandoned
  • Uses dict.pop() instead of del for defensive cleanup
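
The narrowed scope can be sketched as follows. Class and attribute names (`ContextWebSocket`, `_executions`) follow the PR description, but the queue plumbing and `_send_execute_request` body are illustrative stand-ins, not the actual implementation:

```python
import asyncio
import itertools


class ContextWebSocket:
    """Illustrative sketch of the fix; method bodies are stand-ins."""

    def __init__(self):
        self._lock = asyncio.Lock()
        self._executions = {}          # message_id -> result queue
        self._ids = itertools.count()

    async def _send_execute_request(self, code):
        # Stand-in for serializing and sending the request to the kernel.
        msg_id = str(next(self._ids))
        queue = asyncio.Queue()
        queue.put_nowait(f"result of {code}")
        queue.put_nowait(None)         # sentinel: execution finished
        return msg_id, queue

    async def execute(self, code):
        # Phase A: prepare + send, guarded by the lock.
        async with self._lock:
            msg_id, queue = await self._send_execute_request(code)
            self._executions[msg_id] = queue
        # Phase B: stream results WITHOUT the lock, so an abandoned
        # generator cannot leave it orphaned.
        try:
            while (item := await queue.get()) is not None:
                yield item
        finally:
            # pop() instead of del: safe even if the entry is already gone.
            self._executions.pop(msg_id, None)


async def demo():
    ws = ContextWebSocket()
    results = [r async for r in ws.execute("1+1")]
    return ws, results


ws, results = asyncio.run(demo())
print(results)            # ['result of 1+1']
print(ws._executions)     # {} -- cleaned up by the finally block
```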

Root Cause

The lock was held for the entire async generator lifetime, including `_wait_for_result()`, which blocks on `await queue.get()`. When a client disconnects (e.g. an SDK timeout), Starlette abandons the generator while it is stuck at `await queue.get()`; `aclose()` is never called, because Starlette only detects disconnects when trying to write. The lock stays held until the kernel finishes internally, blocking all subsequent executions.
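
The failure mode can be reproduced in isolation with plain asyncio (a self-contained toy, not the project's code): a generator suspended at `await queue.get()` inside `async with lock` keeps the lock, and if the task driving it is simply abandoned, nothing ever releases it:

```python
import asyncio


async def old_execute(lock, queue):
    # Old behavior: lock held for the ENTIRE generator lifetime.
    async with lock:
        while (item := await queue.get()) is not None:
            yield item


async def drive_once(gen):
    return await gen.__anext__()


async def demo():
    lock = asyncio.Lock()
    stuck = asyncio.Queue()                 # kernel still busy: no results yet
    gen = old_execute(lock, stuck)
    driver = asyncio.create_task(drive_once(gen))
    await asyncio.sleep(0)                  # generator grabs the lock, then
                                            # suspends at queue.get()
    # "Client disconnects": nothing cancels `driver`, aclose() never runs.
    try:
        # Next execution on the same context blocks behind the orphaned lock.
        await asyncio.wait_for(lock.acquire(), timeout=0.1)
        outcome = "acquired"
    except asyncio.TimeoutError:
        outcome = "timed out"
    driver.cancel()
    return outcome


outcome = asyncio.run(demo())
print(outcome)  # timed out
```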

Why This Fix Is Safe

Streaming results (Phase B) doesn't need the lock because:

  • Results are routed by unique message_id in _process_message() — no shared state during streaming
  • The kernel serializes execution internally — the lock doesn't need to enforce this
  • _global_env_vars (lazy init) and _cleanup_task (lifecycle management) are only accessed during Phase A
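
Why per-`message_id` routing makes lock-free streaming safe can be shown with a small stand-in for `_process_message()` (the field names such as `parent_msg_id` are assumptions for this sketch, not the project's wire format):

```python
import asyncio


class Router:
    """Toy version of result routing by message_id; not the real code."""

    def __init__(self):
        self._executions = {}  # message_id -> per-execution result queue

    def _process_message(self, message):
        # Each kernel message carries the id of the request it answers,
        # so concurrent streams never interleave even without a lock.
        queue = self._executions.get(message["parent_msg_id"])
        if queue is not None:  # results for abandoned executions are dropped
            queue.put_nowait(message["content"])


router = Router()
router._executions["a"] = asyncio.Queue()
router._executions["b"] = asyncio.Queue()
router._process_message({"parent_msg_id": "a", "content": "out-1"})
router._process_message({"parent_msg_id": "b", "content": "out-2"})
router._process_message({"parent_msg_id": "gone", "content": "dropped"})

out_a = router._executions["a"].get_nowait()
out_b = router._executions["b"].get_nowait()
print(out_a, out_b)  # out-1 out-2
```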

Reproduction & Verification

Tested on live E2B sandboxes with a custom code-interpreter-dev template built from this branch.

Before fix (stock code-interpreter template):

sleep(30) + 5s SDK timeout → print('hello') blocked 25.1s (orphaned lock)
sleep(60) + 3s timeout → 3 retries all timed out (cascade)

After fix (code-interpreter-dev template):

sleep(15) + 3s SDK timeout → print('hello') executes in 0.21s after kernel finishes
All 8 test cases pass: baseline, state persistence, env vars, error handling,
multiple contexts, JavaScript, lock-not-orphaned, no-cascade

Test plan

  • Baseline execution: print('hello') works in <1s
  • State persistence: variables survive across run_code calls
  • Environment variables: envs parameter works correctly
  • Error handling: ZeroDivisionError returned properly
  • Multiple contexts: independent state
  • JavaScript execution: console.log works
  • Lock not orphaned: after timeout+disconnect, next execution succeeds once kernel is free (0.21s, not 25s)
  • No cascade: retries after disconnect don't pile up behind each other

…t disconnect

Narrow the lock scope in ContextWebSocket.execute() so that the lock
is only held during the prepare+send phase, not during result streaming.

Previously, the lock was held for the entire generator lifetime including
the _wait_for_result() streaming loop.  When a client disconnected
(e.g. SDK timeout), Starlette abandoned the generator while it was
blocked at `await queue.get()`.  The lock stayed held until the kernel
finished internally, blocking all subsequent executions on the same
context and causing cascading timeouts.

The fix moves the streaming phase (Phase B) outside the `async with
self._lock` block.  This is safe because results are routed by unique
message_id in _process_message() — no shared state is accessed during
streaming.  A try/finally ensures the execution entry is cleaned up
even if the generator is abandoned.

Fixes e2b-dev#213

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 15ad5ab02e


Comment on lines 382 to 383
# Clean up env vars in a separate request after the main code has run
if env_vars:


P1 Badge Serialize env-var cleanup before allowing next execute

_cleanup_task is now created after the lock is released, so a concurrent call can enter execute() first, observe no pending cleanup, and queue its own code while the previous call’s per-request env_vars are still active. In practice, if request A sets env_vars and request B arrives before this assignment runs, B can execute with A’s environment, which is a regression from the old behavior that always waited for cleanup before starting the next execution.
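
One way to close that gap, matching the referenced follow-up commit's approach of scheduling the cleanup task before the lock is released, could look like this sketch (names and bodies are illustrative stand-ins):

```python
import asyncio


class ContextWebSocket:
    """Sketch of serializing env-var cleanup; bodies are stand-ins."""

    def __init__(self):
        self._lock = asyncio.Lock()
        self._cleanup_task = None
        self.log = []

    async def _clear_env_vars(self):
        await asyncio.sleep(0)  # stand-in for the cleanup request
        self.log.append("cleanup")

    async def execute(self, code, env_vars=None):
        async with self._lock:
            # Wait for the previous call's cleanup before sending anything,
            # so this request can never run with stale per-request env vars.
            if self._cleanup_task is not None:
                await self._cleanup_task
                self._cleanup_task = None
            self.log.append(f"send {code}")
            if env_vars:
                # Schedule cleanup BEFORE releasing the lock: a concurrent
                # execute() can no longer slip in and observe "no pending
                # cleanup" while these env vars are still in effect.
                self._cleanup_task = asyncio.create_task(self._clear_env_vars())
        # Phase B: stream results outside the lock (elided here).
        yield f"ran {code}"


async def demo():
    ws = ContextWebSocket()
    [_ async for _ in ws.execute("a=1", env_vars={"K": "V"})]
    [_ async for _ in ws.execute("b=2")]
    return ws


ws = asyncio.run(demo())
print(ws.log)  # ['send a=1', 'cleanup', 'send b=2']
```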


devin-ai-integration bot added a commit that referenced this pull request Mar 24, 2026
Fixes #213 — asyncio.Lock in messaging.py not released on client disconnect,
causing cascading timeouts.

Changes:
- Narrow lock scope in ContextWebSocket.execute() to only cover the
  prepare+send phase (Phase A), releasing it before result streaming
  (Phase B). This prevents orphaned locks on client disconnect.
- Schedule env var cleanup task under the lock (before release) to
  avoid the race condition flagged in PRs #234/#235.
- Add POST /contexts/{id}/interrupt endpoint that calls Jupyter's
  kernel interrupt API, allowing clients to stop long-running code
  without restarting the kernel (preserves state).
- Add interrupt_code_context/interruptCodeContext to Python and JS SDKs.

Co-Authored-By: vasek <vasek.mlejnsky@gmail.com>


Development

Successfully merging this pull request may close these issues.

asyncio.Lock in messaging.py not released on client disconnect → cascading timeouts
