The fix is straightforward and correct. This is a clean, minimal fix. No test is included, but the issue is inherently GC-timing-dependent and difficult to reliably test. The fix addresses the root cause. LGTM
Follow-up to cloudflare#6547, which fixed the deferred startup path but missed two additional crash vectors for the same root cause (cloudflare#6441).

cloudflare#6547 fixed the `[this, ...]` capture in `SubrequestChannelImpl::startRequest()` for the case where `isolate->service == kj::none` (async startup not yet complete). However, the crash reported in cloudflare#6441 also reproduces on the synchronous startup path, and with the same pattern in `ActorClassImpl::whenReady()`.

The core problem: when JS code chains temporary objects like `loader.get(name, getCode).getEntrypoint().evaluate(args)`, V8 can GC the `Fetcher` mid-request. This destroys the `SubrequestChannelImpl`, which releases its `Rc<WorkerStubImpl>`, which triggers `WorkerStubImpl::unlink()` → `WorkerService::unlink()`, clearing the `LinkedIoChannels`. The child worker's `IoContext` still holds raw pointers (via `NullDisposer`) to the `WorkerService` as its `IoChannelFactory` and `LimitEnforcer`, so the next I/O operation (e.g. an RPC callback to the parent) dereferences freed memory → SIGSEGV or SIGBUS.

This remains 100% reproducible on current main using the reproduction from cloudflare#6441 (`@cloudflare/codemode` `DynamicWorkerExecutor`).

Two additional fixes, both in `WorkerLoaderNamespace`:

- `SubrequestChannelImpl::startRequestImpl()`: attach `kj::addRef(*this)` to the returned `WorkerInterface`, keeping the `SubrequestChannelImpl` (and thus `WorkerStubImpl` and `WorkerService`) alive for the full request duration. This is the fix for the synchronous startup path that cloudflare#6547 did not address.
- `ActorClassImpl::whenReady()`: replace the raw `[this]` capture with `[self = kj::addRef(*this)]`, the same pattern as the `SubrequestChannelImpl` fix from cloudflare#6547, applied to the actor class deferred startup path.
## Reproduction

Requires `@cloudflare/codemode` and `wrangler`:

```json
// package.json
{
  "dependencies": {
    "@cloudflare/codemode": "^0.3.2",
    "wrangler": "^4.77.0"
  }
}
```

```jsonc
// wrangler.jsonc
{
  "name": "repro",
  "main": "src/index.ts",
  "compatibility_date": "2025-06-01",
  "compatibility_flags": ["nodejs_compat"],
  "worker_loaders": [{ "binding": "LOADER" }]
}
```

```ts
// src/index.ts
import { DynamicWorkerExecutor, resolveProvider } from '@cloudflare/codemode';

interface Env {
  LOADER: ConstructorParameters<typeof DynamicWorkerExecutor>[0]['loader'];
}

export default {
  async fetch(request: Request, env: Env) {
    const executor = new DynamicWorkerExecutor({ loader: env.LOADER, timeout: 30_000 });
    const tools = {
      get_items: async () =>
        Array.from({ length: 112 }, (_, i) => ({
          id: `item_${i}`,
          name: `Item ${i}`,
          memo: 'x'.repeat(220),
        })),
    };
    for (let i = 0; i < 6; i++) {
      const result = await executor.execute(
        `async () => { return await codemode.get_items(); }`,
        [resolveProvider({ name: 'codemode', tools })]
      );
      if (result.error) return Response.json({ round: i, error: result.error }, { status: 500 });
    }
    return Response.json({ ok: true });
  },
};
```

Then: `wrangler dev` and `curl http://localhost:8787` → segfault every time.

To test a local workerd build against this reproduction: `MINIFLARE_WORKERD_PATH=bazel-bin/src/workerd/server/workerd wrangler dev`

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When using `newPromisedWorkerInterface`, by the time the `.then()` callback is executed, it is possible for the subrequest channel itself to have gone out of scope due to GC kicking in, causing a crash when `startRequestImpl` tries to access the worker service.

Closes #6441