
ensure worker stub subrequest channel is kept alive until internal startRequest call#6547

Merged
git-bruh merged 1 commit into main from pkhanna/fix-worker-loader-crash
Apr 10, 2026

Conversation

@git-bruh
Member

When using newPromisedWorkerInterface, by the time the .then() callback executes, the subrequest channel itself may have gone out of scope (GC having kicked in), causing a crash when startRequestImpl tries to access the worker service.

```js
// This pattern triggers the bug (the temporary Fetcher can get GC'd)
const result = await dynamicWorker.getEntrypoint().foo(...);

// This doesn't (the Fetcher is kept in scope, keeping the subrequest channel alive)
const e = dynamicWorker.getEntrypoint();
const result = await e.foo(...);
```

Closes #6441

@git-bruh git-bruh requested review from a team as code owners April 10, 2026 05:41
@ask-bonk
Contributor

ask-bonk bot commented Apr 10, 2026

The fix is straightforward and correct. Let me summarize my analysis:

The change:

  • Old code: [this, metadata = kj::mv(metadata)] — raw this capture of SubrequestChannelImpl in a .then() callback
  • New code: [self = kj::addRef(*this), metadata = kj::mv(metadata)] — ref-counted capture via kj::addRef()

Why it's correct:

  1. SubrequestChannelImpl extends IoChannelFactory::SubrequestChannel which extends kj::Refcounted, so kj::addRef(*this) is valid and returns kj::Own<SubrequestChannelImpl>.
  2. The .then() callback outlives the caller's scope because it's wrapped in newPromisedWorkerInterface() — the caller gets back a WorkerInterface proxy immediately, and if the JS-side Fetcher gets GC'd, the SubrequestChannelImpl (and its raw this pointer) would be destroyed before the callback fires.
  3. The self = kj::addRef(*this) capture keeps the SubrequestChannelImpl alive until the callback completes.
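The lifetime hazard described above can be demonstrated outside of workerd. The following is a minimal sketch using `std::enable_shared_from_this`/`std::shared_ptr` as a stand-in for `kj::Refcounted`/`kj::addRef` (the `Channel` type and `runDeferred` helper are hypothetical, purely for illustration): a deferred callback that captures a strong ref to `self` keeps the object alive until the callback runs, even after the original owner drops its reference — exactly what a raw `this` capture fails to guarantee.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>

struct Channel : std::enable_shared_from_this<Channel> {
  std::string name = "subrequest-channel";

  // Like the fixed code: capture a strong ref-counted self, not raw `this`.
  // With [this] instead, the lambda would dangle once the owner releases
  // its reference.
  std::function<std::string()> deferredStart() {
    return [self = shared_from_this()] { return self->name; };
  }
};

std::string runDeferred() {
  std::function<std::string()> cb;
  {
    auto channel = std::make_shared<Channel>();
    cb = channel->deferredStart();
  }  // Owner's reference dropped here (analogous to the Fetcher being GC'd).
  // The callback's captured ref still keeps the channel alive.
  return cb();
}
```

The same reasoning applies to `kj::Own<SubrequestChannelImpl>` returned by `kj::addRef(*this)`: the lambda's capture holds a reference count, so destruction is deferred until the callback itself is destroyed.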

The ActorClassImpl::whenReady() case at line 4311 uses [this] but is safe because the promise is awaited directly by the caller who owns the object — the object cannot be GC'd during the await.

This is a clean, minimal, correct fix. No test is included, but the issue is inherently GC-timing-dependent and difficult to reliably test. The fix addresses the root cause.

LGTM


@codspeed-hq

codspeed-hq bot commented Apr 10, 2026

Merging this PR will not alter performance

✅ 70 untouched benchmarks
⏩ 129 skipped benchmarks [1]


Comparing pkhanna/fix-worker-loader-crash (81651ae) with main (6fc69a6)

Open in CodSpeed

Footnotes

  [1]: 129 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them to remove them from the performance reports.

@git-bruh git-bruh force-pushed the pkhanna/fix-worker-loader-crash branch from 11c9e1e to 81651ae on April 10, 2026 07:15
@git-bruh git-bruh enabled auto-merge (squash) April 10, 2026 07:16
@git-bruh git-bruh merged commit 7fa78f7 into main Apr 10, 2026
33 of 34 checks passed
@git-bruh git-bruh deleted the pkhanna/fix-worker-loader-crash branch April 10, 2026 08:09
airhorns added a commit to airhorns/workerd that referenced this pull request Apr 10, 2026
Follow-up to cloudflare#6547, which fixed the deferred startup path but missed
two additional crash vectors for the same root cause (cloudflare#6441).

cloudflare#6547 fixed the `[this, ...]` capture in `SubrequestChannelImpl::
startRequest()` for the case where `isolate->service == kj::none`
(async startup not yet complete). However, the crash reported in cloudflare#6441
also reproduces on the synchronous startup path, and with the same
pattern on `ActorClassImpl::whenReady()`.

The core problem: when JS code chains temporary objects like

    loader.get(name, getCode).getEntrypoint().evaluate(args)

V8 can GC the Fetcher mid-request. This destroys the
SubrequestChannelImpl, which releases its Rc<WorkerStubImpl>, which
triggers WorkerStubImpl::unlink() → WorkerService::unlink(), clearing
the LinkedIoChannels. The child worker's IoContext still holds raw
pointers (via NullDisposer) to the WorkerService as its
IoChannelFactory and LimitEnforcer, so the next I/O operation (e.g.
an RPC callback to the parent) dereferences freed memory → SIGSEGV
or SIGBUS.

This remains 100% reproducible on current main using the reproduction
from cloudflare#6441 (@cloudflare/codemode DynamicWorkerExecutor).

Two additional fixes, both in WorkerLoaderNamespace:

- SubrequestChannelImpl::startRequestImpl(): Attach
  kj::addRef(*this) to the returned WorkerInterface, keeping the
  SubrequestChannelImpl (and thus WorkerStubImpl and WorkerService)
  alive for the full request duration. This is the fix for the
  synchronous startup path that cloudflare#6547 did not address.

- ActorClassImpl::whenReady(): Replace raw `[this]` capture with
  `[self = kj::addRef(*this)]` — same pattern as the
  SubrequestChannelImpl fix from cloudflare#6547, applied to the actor class
  deferred startup path.
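The first of these fixes uses a slightly different shape than a lambda capture: a reference is attached to the object handed back to the caller. As a minimal sketch (again with `std::shared_ptr` standing in for `kj::addRef`, and the `Service`/`RequestHandle` names being hypothetical), the returned handle carries a strong reference so the service chain cannot be torn down mid-request, even after the caller's last visible handle is dropped:

```cpp
#include <cassert>
#include <memory>

// Global flag standing in for "is the WorkerService still linked?".
static bool serviceAlive = false;

struct Service {
  Service() { serviceAlive = true; }
  ~Service() { serviceAlive = false; }  // stands in for WorkerService::unlink()
};

// The handle returned to the caller (the WorkerInterface in the real code)
// carries a strong reference so the service outlives the request.
struct RequestHandle {
  std::shared_ptr<Service> keepAlive;
};

RequestHandle startRequest(const std::shared_ptr<Service>& service) {
  return RequestHandle{service};  // attach a ref for the request's lifetime
}

bool demoAttachKeepsServiceAlive() {
  auto service = std::make_shared<Service>();
  RequestHandle req = startRequest(service);
  service.reset();      // caller's last handle dropped (the Fetcher being GC'd)
  return serviceAlive;  // still true: the attached ref keeps Service alive
}
```

In the real code the attachment is `result.attach(kj::addRef(*this))`-style rather than a struct member, but the lifetime guarantee is the same: the request owns a reference for its full duration.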

## Reproduction

Requires `@cloudflare/codemode` and `wrangler`:

```json
// package.json
{ "dependencies": { "@cloudflare/codemode": "^0.3.2", "wrangler": "^4.77.0" } }
```

```jsonc
// wrangler.jsonc
{
  "name": "repro",
  "main": "src/index.ts",
  "compatibility_date": "2025-06-01",
  "compatibility_flags": ["nodejs_compat"],
  "worker_loaders": [{ "binding": "LOADER" }]
}
```

```ts
// src/index.ts
import { DynamicWorkerExecutor, resolveProvider } from '@cloudflare/codemode';
interface Env {
  LOADER: ConstructorParameters<typeof DynamicWorkerExecutor>[0]['loader'];
}
export default {
  async fetch(request: Request, env: Env) {
    const executor = new DynamicWorkerExecutor({ loader: env.LOADER, timeout: 30_000 });
    const tools = {
      get_items: async () =>
        Array.from({ length: 112 }, (_, i) => ({
          id: `item_${i}`, name: `Item ${i}`, memo: 'x'.repeat(220),
        })),
    };
    for (let i = 0; i < 6; i++) {
      const result = await executor.execute(
        `async () => { return await codemode.get_items(); }`,
        [resolveProvider({ name: 'codemode', tools })]
      );
      if (result.error) return Response.json({ round: i, error: result.error }, { status: 500 });
    }
    return Response.json({ ok: true });
  },
};
```

Then: `wrangler dev` and `curl http://localhost:8787` → segfault every time.

To test a local workerd build against this reproduction:

    MINIFLARE_WORKERD_PATH=bazel-bin/src/workerd/server/workerd wrangler dev

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

wrangler dev segfaults in local worker_loaders during repeated DynamicWorkerExecutor executions

3 participants