Skip to content

Conversation

@ghostwriternr
Copy link
Member

Fixes state leakage between code contexts by assigning each context a dedicated executor process from creation to deletion. Removes maximum pool size limits to allow organic scaling.

Changes

  • Context-to-executor 1:1 binding with dedicated process lifecycle
  • Remove pool size caps for organic scaling
  • Add test verifying isolation across 12+ contexts

Each code context now gets a dedicated executor process from creation to deletion, preventing state leakage between contexts. Removed maximum pool size limits to allow organic scaling based on context count.
@changeset-bot
Copy link

changeset-bot bot commented Nov 25, 2025

🦋 Changeset detected

Latest commit: 2feea4e

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
@cloudflare/sandbox Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@cloudflare-workers-and-pages
Copy link

cloudflare-workers-and-pages bot commented Nov 25, 2025

Deploying with  Cloudflare Workers  Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

Status Name Latest Commit Preview URL Updated (UTC)
✅ Deployment successful!
View logs
sandbox-site 8c8999c Commit Preview URL

Branch Preview URL
Nov 25 2025, 09:05 PM

agents-git-bot bot pushed a commit to cloudflare/cloudflare-docs that referenced this pull request Nov 25, 2025
Add documentation explaining that each code context is completely
isolated with a dedicated executor process. Includes example showing
concurrent execution across multiple contexts.

Related to cloudflare/sandbox-sdk#249 which fixed state isolation bug.
claude[bot]

This comment was marked as outdated.

@pkg-pr-new
Copy link

pkg-pr-new bot commented Nov 25, 2025

Open in StackBlitz

npm i https://pkg.pr.new/cloudflare/sandbox-sdk/@cloudflare/sandbox@249

commit: 2feea4e

@github-actions
Copy link
Contributor

github-actions bot commented Nov 25, 2025

🐳 Docker Image Published

FROM cloudflare/sandbox:0.0.0-pr-249-562e595

Version: 0.0.0-pr-249-562e595

You can use this Docker image with the preview package from this PR.

Add atomic concurrent execution checks, exit handler cleanup, error handling for executor release failures, per-language process limits, and language validation. Include concurrent execution test.
agents-git-bot bot pushed a commit to cloudflare/cloudflare-docs that referenced this pull request Nov 25, 2025
Adds documentation for improved code context isolation:
- Each context now gets a dedicated executor process (1:1 binding)
- Contexts prevent concurrent execution on the same context
- Complete state isolation between contexts guaranteed

Related to cloudflare/sandbox-sdk#249

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
claude[bot]

This comment was marked as outdated.

Remove redundant try-catch blocks that just rethrow errors, use
try-finally pattern for guaranteed cleanup, simplify comments about
atomicity, and enhance concurrent execution test to verify actual
state isolation rather than just error counts.
claude[bot]

This comment was marked as outdated.

Replace isAvailable flag with executionPromise to fix race condition.
The promise acts as an atomic lock preventing concurrent execution
on the same context.
claude[bot]

This comment was marked as outdated.

Revert Dockerfile platform change that should not be committed.
Increase concurrent test requests to 20 for better stress testing.
claude[bot]

This comment was marked as outdated.

Use async-mutex to prevent race conditions when concurrent requests
target the same context. Execute code before creating stream to hold
mutex during actual execution.
claude[bot]

This comment was marked as outdated.

claude[bot]

This comment was marked as outdated.

Prevents race conditions when multiple contexts reserve executors
concurrently, and detects crashed executors before execution.
agents-git-bot bot pushed a commit to cloudflare/cloudflare-docs that referenced this pull request Nov 26, 2025
Documents the fix for code context isolation bug where contexts leaked
state after 10 executions. Each context now has dedicated executor
resources throughout its lifetime, ensuring complete isolation.

Changes:
- Add context isolation note explaining dedicated resources
- Emphasize isolation guarantees in createCodeContext()
- Add cleanup best practice for deleteCodeContext()
- Remove implied pool size limitations

Related to cloudflare/sandbox-sdk#249

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
claude[bot]

This comment was marked as outdated.

- Add cleanup on context creation failure to prevent executor leaks
- Serialize stateless executions with per-language mutex
Moved all mutex locking from InterpreterService to ProcessPoolManager,
using per-executor locks instead of per-context locks. This enables
parallel stateless execution via borrow/return pattern and simplifies
code by centralizing all concurrency control in one layer.
Adds E2E test that verifies state isolation is maintained when
executing code concurrently across 12 different contexts. This
validates that the executor locking mechanism prevents race
conditions while maintaining proper context isolation.
claude[bot]

This comment was marked as outdated.

@ghostwriternr ghostwriternr enabled auto-merge (squash) November 26, 2025 14:39
The test validates concurrent execution isolation, not concurrent creation.
Creating contexts concurrently causes mutex-serialized spawning that times
out in CI. Sequential creation matches the working test pattern and still
validates the intended behavior.
Copy link
Contributor

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

The isolation fix architecture is sound and the two-level locking strategy is well-designed. However, there are two critical bugs in the locking implementation that could cause the exact state leakage this PR aims to fix.

Critical Issues (Must Fix Before Merge)

  1. Race condition in getExecutorLock() creates duplicate mutexes - Two concurrent executions can create separate mutex instances for the same executor, allowing concurrent execution and defeating the isolation guarantee. See inline comment.

  2. Memory leak: Executor locks never cleaned up - The exit handler doesn't remove mutexes from executorLocks map, causing unbounded growth in long-running containers. See inline comment.

Important Issues

  1. Test coverage gap - The test at line 582 doesn't actually verify isolation (it only checks for errors). The code would pass even with the bug. See inline comment for better test approach.

  2. Documentation needs - Force-delete behavior and unlimited process defaults should be documented prominently to prevent user confusion and resource exhaustion.

Architecture Assessment

The 1:1 context-to-executor binding is the right design for stateful code interpreter use cases. The borrow/return pattern for stateless execution is correct. The per-language pool locks properly protect structural changes. Once the critical locking bugs are fixed, this will be production-ready.

Verdict: Needs fixes before merge - the race condition in executor lock creation can cause state isolation failures.

});
}

private getExecutorLock(executorId: string): Mutex {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CRITICAL: Race condition in mutex creation

Two concurrent calls with the same executorId can both create separate Mutex instances:

  1. Thread A checks executorLocks.get(id) → undefined
  2. Thread B checks executorLocks.get(id) → undefined
  3. Thread A creates Mutex1, sets it
  4. Thread B creates Mutex2, sets it (overwrites Mutex1)
  5. Thread A acquires Mutex1, Thread B acquires Mutex2
  6. Both execute concurrently on same process → state corruption

This defeats the entire isolation fix.

Fix:

private executorLockCreationMutex = new Mutex();

private async getExecutorLock(executorId: string): Promise<Mutex> {
  let mutex = this.executorLocks.get(executorId);
  if (mutex) return mutex;
  
  return await this.executorLockCreationMutex.runExclusive(() => {
    // Double-check after acquiring lock
    let mutex = this.executorLocks.get(executorId);
    if (!mutex) {
      mutex = new Mutex();
      this.executorLocks.set(executorId, mutex);
    }
    return mutex;
  });
}

Update all call sites from getExecutorLock(id) to await getExecutorLock(id) (lines 219, 227).

lastUsed: new Date()
};

// Register exit handler for cleanup (prevents memory leaks)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CRITICAL: Memory leak - executor locks never cleaned up

The exit handler removes the executor from all tracking structures but doesn't remove its mutex from executorLocks. Over time, this map grows unbounded with mutexes for dead processes.

Fix: Add cleanup after removing from available executors:

const available = this.availableExecutors.get(language);
if (available) {
  const index = available.indexOf(interpreterProcess);
  if (index > -1) available.splice(index, 1);
}

// Clean up executor lock
this.executorLocks.delete(interpreterProcess.id);

Note: releaseExecutorForContext at line 578 already has this cleanup, but the exit handler needs it too for cases where processes crash.

);
}, 120000);

test('should maintain isolation across many contexts (12+)', async () => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test coverage gap: Doesn't verify isolation

This test creates 12 contexts sequentially and executes const value = 2; in each. It only checks that executions succeed without errors.

Problem: The test would pass even if contexts shared state because:

  • Each context declares a new const value in its own scope
  • JavaScript allows this even if the process is shared
  • No verification that state from context N doesn't leak to N+1

To actually test isolation:

for (let i = 0; i < 12; i++) {
  const context = await createContext();
  
  // Set unique value in global scope
  await executeCode(context, `globalThis.testValue_${i} = ${i};`);
  
  // Verify no leakage from previous contexts
  for (let j = 0; j < i; j++) {
    const checkResult = await executeCode(
      context, 
      `typeof globalThis.testValue_${j}`
    );
    expect(checkResult.results[0].text).toContain('undefined');
  }
  
  // Verify own value is set
  const ownResult = await executeCode(
    context,
    `globalThis.testValue_${i}`
  );
  expect(ownResult.results[0].text).toContain(String(i));
  
  await deleteContext(context);
}

This would catch the original bug where contexts shared executors after 10 executions.

return this.availableExecutors.get(language) || [];
}

async shutdown(): Promise<void> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Memory leak in shutdown

The shutdown() method clears pools but doesn't clear the other tracking maps:

async shutdown(): Promise<void> {
  if (this.cleanupInterval) {
    clearInterval(this.cleanupInterval);
  }
  
  // Kill all processes
  for (const [executor, pool] of this.pools.entries()) {
    for (const process of pool) {
      process.process.kill();
    }
  }
  
  // Clear all tracking structures
  this.pools.clear();
  this.contextExecutors.clear();
  this.availableExecutors.clear();
  this.executorLocks.clear();
  this.poolLocks.clear();
}

Not critical for production (singleton pattern) but causes memory leaks in test suites.

@ghostwriternr ghostwriternr merged commit e69dce5 into main Nov 26, 2025
16 checks passed
@ghostwriternr ghostwriternr deleted the fix-code-context-isolation branch November 26, 2025 16:46
@github-actions github-actions bot mentioned this pull request Nov 26, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant