Fix code context state isolation #249

ghostwriternr · 2025-11-25T21:03:16Z

Fixes state leakage between code contexts by assigning each context a dedicated executor process from creation to deletion. Removes maximum pool size limits to allow organic scaling.

Changes

Context-to-executor 1:1 binding with dedicated process lifecycle
Remove pool size caps for organic scaling
Add test verifying isolation across 12+ contexts

Each code context now gets a dedicated executor process from creation to deletion, preventing state leakage between contexts. Removed maximum pool size limits to allow organic scaling based on context count.

changeset-bot · 2025-11-25T21:03:19Z

🦋 Changeset detected

Latest commit: 2feea4e

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package

Name	Type
@cloudflare/sandbox	Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

…ation

cloudflare-workers-and-pages · 2025-11-25T21:04:55Z

Deploying with Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

Status	Name	Latest Commit	Preview URL	Updated (UTC)
✅ Deployment successful! View logs	sandbox-site	`8c8999c`	Commit Preview URL Branch Preview URL	Nov 25 2025, 09:05 PM

Add documentation explaining that each code context is completely isolated with a dedicated executor process. Includes example showing concurrent execution across multiple contexts. Related to cloudflare/sandbox-sdk#249 which fixed state isolation bug.

pkg-pr-new · 2025-11-25T21:14:51Z

Open in StackBlitz

npm i https://pkg.pr.new/cloudflare/sandbox-sdk/@cloudflare/sandbox@249

commit: 2feea4e

github-actions · 2025-11-25T21:14:52Z

🐳 Docker Image Published

FROM cloudflare/sandbox:0.0.0-pr-249-562e595

Version: 0.0.0-pr-249-562e595

You can use this Docker image with the preview package from this PR.

Add atomic concurrent execution checks, exit handler cleanup, error handling for executor release failures, per-language process limits, and language validation. Include concurrent execution test.

Adds documentation for improved code context isolation: - Each context now gets a dedicated executor process (1:1 binding) - Contexts prevent concurrent execution on the same context - Complete state isolation between contexts guaranteed Related to cloudflare/sandbox-sdk#249 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Remove redundant try-catch blocks that just rethrow errors, use try-finally pattern for guaranteed cleanup, simplify comments about atomicity, and enhance concurrent execution test to verify actual state isolation rather than just error counts.

Replace isAvailable flag with executionPromise to fix race condition. The promise acts as an atomic lock preventing concurrent execution on the same context.

Revert Dockerfile platform change that should not be committed. Increase concurrent test requests to 20 for better stress testing.

Use async-mutex to prevent race conditions when concurrent requests target the same context. Execute code before creating stream to hold mutex during actual execution.

…ation

Prevents race conditions when multiple contexts reserve executors concurrently, and detects crashed executors before execution.

Documents the fix for code context isolation bug where contexts leaked state after 10 executions. Each context now has dedicated executor resources throughout its lifetime, ensuring complete isolation. Changes: - Add context isolation note explaining dedicated resources - Emphasize isolation guarantees in createCodeContext() - Add cleanup best practice for deleteCodeContext() - Remove implied pool size limitations Related to cloudflare/sandbox-sdk#249 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

- Add cleanup on context creation failure to prevent executor leaks - Serialize stateless executions with per-language mutex

Moved all mutex locking from InterpreterService to ProcessPoolManager, using per-executor locks instead of per-context locks. This enables parallel stateless execution via borrow/return pattern and simplifies code by centralizing all concurrency control in one layer.

Adds E2E test that verifies state isolation is maintained when executing code concurrently across 12 different contexts. This validates that the executor locking mechanism prevents race conditions while maintaining proper context isolation.

The test validates concurrent execution isolation, not concurrent creation. Creating contexts concurrently causes mutex-serialized spawning that times out in CI. Sequential creation matches the working test pattern and still validates the intended behavior.

claude

Claude Code Review

The isolation fix architecture is sound and the two-level locking strategy is well-designed. However, there are two critical bugs in the locking implementation that could cause the exact state leakage this PR aims to fix.

Critical Issues (Must Fix Before Merge)

Race condition in getExecutorLock() creates duplicate mutexes - Two concurrent executions can create separate mutex instances for the same executor, allowing concurrent execution and defeating the isolation guarantee. See inline comment.
Memory leak: Executor locks never cleaned up - The exit handler doesn't remove mutexes from executorLocks map, causing unbounded growth in long-running containers. See inline comment.

Important Issues

Test coverage gap - The test at line 582 doesn't actually verify isolation (it only checks for errors). The code would pass even with the bug. See inline comment for better test approach.
Documentation needs - Force-delete behavior and unlimited process defaults should be documented prominently to prevent user confusion and resource exhaustion.

Architecture Assessment

The 1:1 context-to-executor binding is the right design for stateful code interpreter use cases. The borrow/return pattern for stateless execution is correct. The per-language pool locks properly protect structural changes. Once the critical locking bugs are fixed, this will be production-ready.

Verdict: Needs fixes before merge - the race condition in executor lock creation can cause state isolation failures.

claude · 2025-11-26T16:36:11Z

packages/sandbox-container/src/runtime/process-pool.ts

    });
  }

+  private getExecutorLock(executorId: string): Mutex {


CRITICAL: Race condition in mutex creation

Two concurrent calls with the same executorId can both create separate Mutex instances:

Thread A checks executorLocks.get(id) → undefined

Thread B checks executorLocks.get(id) → undefined

Thread A creates Mutex1, sets it

Thread B creates Mutex2, sets it (overwrites Mutex1)

Thread A acquires Mutex1, Thread B acquires Mutex2

Both execute concurrently on same process → state corruption

This defeats the entire isolation fix.

Fix:

private executorLockCreationMutex = new Mutex(); private async getExecutorLock(executorId: string): Promise<Mutex> { let mutex = this.executorLocks.get(executorId); if (mutex) return mutex; return await this.executorLockCreationMutex.runExclusive(() => { // Double-check after acquiring lock let mutex = this.executorLocks.get(executorId); if (!mutex) { mutex = new Mutex(); this.executorLocks.set(executorId, mutex); } return mutex; }); }

Update all call sites from getExecutorLock(id) to await getExecutorLock(id) (lines 219, 227).

claude · 2025-11-26T16:36:11Z

packages/sandbox-container/src/runtime/process-pool.ts

+      lastUsed: new Date()
    };

+    // Register exit handler for cleanup (prevents memory leaks)


CRITICAL: Memory leak - executor locks never cleaned up

The exit handler removes the executor from all tracking structures but doesn't remove its mutex from executorLocks. Over time, this map grows unbounded with mutexes for dead processes.

Fix: Add cleanup after removing from available executors:

const available = this.availableExecutors.get(language); if (available) { const index = available.indexOf(interpreterProcess); if (index > -1) available.splice(index, 1); } // Clean up executor lock this.executorLocks.delete(interpreterProcess.id);

Note: releaseExecutorForContext at line 578 already has this cleanup, but the exit handler needs it too for cases where processes crash.

claude · 2025-11-26T16:36:11Z

tests/e2e/code-interpreter-workflow.test.ts

    );
  }, 120000);

+  test('should maintain isolation across many contexts (12+)', async () => {


Test coverage gap: Doesn't verify isolation

This test creates 12 contexts sequentially and executes const value = 2; in each. It only checks that executions succeed without errors.

Problem: The test would pass even if contexts shared state because:

Each context declares a new const value in its own scope

JavaScript allows this even if the process is shared

No verification that state from context N doesn't leak to N+1

To actually test isolation:

for (let i = 0; i < 12; i++) { const context = await createContext(); // Set unique value in global scope await executeCode(context, `globalThis.testValue_${i} = ${i};`); // Verify no leakage from previous contexts for (let j = 0; j < i; j++) { const checkResult = await executeCode( context, `typeof globalThis.testValue_${j}` ); expect(checkResult.results[0].text).toContain('undefined'); } // Verify own value is set const ownResult = await executeCode( context, `globalThis.testValue_${i}` ); expect(ownResult.results[0].text).toContain(String(i)); await deleteContext(context); }

This would catch the original bug where contexts shared executors after 10 executions.

claude · 2025-11-26T16:36:11Z

packages/sandbox-container/src/runtime/process-pool.ts

+    return this.availableExecutors.get(language) || [];
+  }
+
  async shutdown(): Promise<void> {


Memory leak in shutdown

The shutdown() method clears pools but doesn't clear the other tracking maps:

async shutdown(): Promise<void> { if (this.cleanupInterval) { clearInterval(this.cleanupInterval); } // Kill all processes for (const [executor, pool] of this.pools.entries()) { for (const process of pool) { process.process.kill(); } } // Clear all tracking structures this.pools.clear(); this.contextExecutors.clear(); this.availableExecutors.clear(); this.executorLocks.clear(); this.poolLocks.clear(); }

Not critical for production (singleton pattern) but causes memory leaks in test suites.

Fix code context state isolation

5e08bf6

Each code context now gets a dedicated executor process from creation to deletion, preventing state leakage between contexts. Removed maximum pool size limits to allow organic scaling based on context count.

Merge remote-tracking branch 'origin/main' into fix-code-context-isol…

8c8999c

…ation

agents-git-bot bot mentioned this pull request Nov 25, 2025

Docs: Fix code context state isolation cloudflare/cloudflare-docs#26749

Closed