Fix Durable Object ordering when mixing RPC and fetch calls #6562
threepointone wants to merge 1 commit into main
Conversation
LGTM
When a caller fires interleaved `stub.rpc()` and `stub.fetch()` calls on the same DO stub without awaiting, all fetch calls were processed before all RPC calls, regardless of send order. This violated the expected E-order guarantee for actors.

Root cause: fetch reaches the InputGate in ~1 async hop (via `WorkerEntrypoint::request()` -> `context.run()` -> `InputGate::wait()`), while RPC takes ~4+ hops (customEvent -> `JsRpcSessionCustomEvent::run()` -> Cap'n Proto session setup -> capability fulfillment -> pipelined call dispatch -> `JsRpcTargetBase::call()` -> `kj::yield()` -> `ctx.run()` -> `InputGate::wait()`). Since all operations originate from the same synchronous JS execution, fetch calls always reached the FIFO queue before any RPC calls.

Fix: eagerly acquire the InputGate position in `JsRpcSessionCustomEvent::run()`, at the same level where `WorkerEntrypoint::request()` acquires it, then thread the lock through to the first `ctx.run()` call via the existing `IoContext::run(func, Maybe<InputGate::Lock>)` overload. The lock is consumed by the first RPC method invocation and is NOT held for the session lifetime. The `kj::yield()` in `JsRpcTargetBase::call()` and ExternalPusher ordering are unaffected.

Fixes #6561

Made-with: Cursor
Force-pushed from 42bf758 to 3cad02c
This is a clever solution and I mostly like it. One possible issue is, technically, this would allow a client (of the capnp RPC interface) to lock up a DO by sending the initial call to open an RPC session, and then failing to send the actual call. It basically gives clients the ability to take a lock on a DO, which obviously they shouldn't have.

This is an internal RPC interface, though, not exposed to malicious clients, so we don't really have to worry about this being used in some sort of a DoS attack. But I suppose it's technically possible that, due to a bug or a poorly-timed network hiccup, the "call" message may be significantly delayed after the "open RPC session" message. Ugh, I can't quite convince myself that this is fine, nor can I think of an easy way to make it safe.

A much deeper, more thorough solution would be to do what I've always wanted to do and make the KJ event loop ordering fully depth-first. I think that would just solve this problem inherently. But that's a pretty delicate change with impact across the whole runtime: it should be strictly good in theory, but may have unintended consequences (stuff unintentionally depending on the current ordering)... hmm.

Meanwhile, though, I wonder how you're using this. I'm not sure you can actually safely rely on this in practice, due to hibernation. If you are relying on an initial call to initialize some in-memory state, and need that in-memory state to be initialized for the second call, there's no guarantee that the DO doesn't hibernate in between. True, if you make these calls in rapid succession, it's extremely unlikely that hibernation would happen, but it's always theoretically possible if a packet gets lost somewhere resulting in a delay, or if the runtime suddenly decides it needs to shut the DO down for an update or whatnot.

The safe thing to do is to have the first call return an RpcTarget, and have the second call pipelined on that. That always guarantees the two calls land on the same instance (or, rarely, the second call fails). Could that approach work for your use case? I personally have been using this pattern a fair amount and feel like it turns out nicely.
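The suggested pattern can be sketched in plain JavaScript (class and method names here are invented for illustration, not taken from the PR): the first call returns a target object bound to the instance's in-memory state, and the second call is made on that target, so both calls are guaranteed to observe the same instance.

```javascript
// Stand-in for an object that would extend RpcTarget in real Workers RPC.
class CounterTarget {
  constructor(state) { this.state = state; }
  increment() { this.state.n += 1; return this.state.n; }
}

// Stand-in for the Durable Object / entrypoint stub.
class ActorStub {
  constructor() { this.state = { n: 0 }; }
  // Call 1: initialize in-memory state and return a target bound to it.
  openCounter() { return new CounterTarget(this.state); }
}

async function demo() {
  const stub = new ActorStub();
  const counter = stub.openCounter(); // call 1: returns the target
  return counter.increment();         // call 2: made on the returned target
}
```

In real Workers RPC, the second call can even be pipelined on the promise for the first without an intermediate await, since stubs support promise pipelining; if the instance goes away in between, the pipelined call fails rather than landing on a fresh instance.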
Summary
Fixes #6561.
When a caller fires interleaved `stub.rpc()` and `stub.fetch()` calls on the same Durable Object stub without awaiting, all fetch calls were processed before all RPC calls, regardless of send order. This violated the expected E-order guarantee. Same-type ordering (pure RPC or pure fetch) was already correct; this fix addresses cross-type ordering.
Root Cause
`stub.fetch()` and `stub.rpc()` both create a `WorkerInterface` via `ActorChannel::startRequest()`, but they call different methods on it: `request()` for fetch, `customEvent()` for RPC. These two paths reach the DO's `InputGate` through a very different number of async hops.

fetch path (~1 hop to InputGate): `WorkerEntrypoint::request()` -> `context.run()` -> `InputGate::wait()`

RPC path (~4+ hops to InputGate): `customEvent` -> `JsRpcSessionCustomEvent::run()` -> Cap'n Proto session setup -> capability fulfillment -> pipelined call dispatch -> `JsRpcTargetBase::call()` -> `kj::yield()` -> `ctx.run()` -> `InputGate::wait()`
Since all operations originate from the same synchronous JS execution, fetch calls (fewer hops) always enqueued at the InputGate before any RPC calls (more hops).
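The hop asymmetry can be reproduced with a toy microtask simulation (illustrative only, not workerd code): a path that takes one async hop to reach the gate beats a path that takes four, even when the slower path is started first.

```javascript
const arrivals = [];

function fetchPath(tag) {
  // ~1 hop: a single microtask between send and arrival at the gate.
  return Promise.resolve().then(() => arrivals.push(tag));
}

function rpcPath(tag) {
  // ~4 hops: stands in for session setup, capability fulfillment,
  // call dispatch, and the yield before ctx.run().
  let p = Promise.resolve();
  for (let i = 0; i < 4; i++) p = p.then(() => {});
  return p.then(() => arrivals.push(tag));
}

async function demo() {
  // Both sends happen in the same synchronous execution; RPC is sent first.
  await Promise.all([rpcPath("rpc"), fetchPath("fetch")]);
  return arrivals; // the fetch overtakes the rpc
}
```

Running `demo()` yields `["fetch", "rpc"]`: the fetch arrives first despite being sent second, which is the cross-type ordering violation.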
Fix
Eagerly acquire the InputGate position in `JsRpcSessionCustomEvent::run()`, right after `delivered()`, at the same level where `WorkerEntrypoint::request()` acquires it via `context.run()`. The lock is stored on `EntrypointJsRpcTarget` and consumed by the first `ctx.run()` call via the existing `IoContext::run(func, Maybe<InputGate::Lock>)` overload (the same pattern used by `ensureConstructedImpl`).

Key properties:
- `kj::yield()` in `JsRpcTargetBase::call()` still runs (the InputGate position is already reserved, so ExternalPusher ordering is preserved)
- The lock is only acquired when the target is an actor (`KJ_IF_SOME(a, ioctx.getActor())`)
- `TransientJsRpcTarget` (within-session stubs) is unaffected; only `EntrypointJsRpcTarget` uses the pre-acquired lock
- `ServerTopLevelMembrane` ensures exactly one call per `getClientForOneCall()` session, so the lock always matches the single call
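The shape of the fix can be sketched with a toy FIFO gate in JavaScript (all names here are invented; workerd's actual `InputGate` is C++ and more involved): the RPC path reserves its gate position eagerly at session-open time, and that reservation is consumed later by the first call, so the extra setup hops no longer cost it its place in line.

```javascript
// Toy FIFO gate: reserve() hands out positions in call order; each holder's
// successors wait until it releases.
class Gate {
  constructor() { this.tail = Promise.resolve(); }
  reserve() {
    let release;
    const held = new Promise(r => { release = r; });
    const wait = this.tail;             // our turn: after all predecessors
    this.tail = wait.then(() => held);  // successors wait for our release
    return { wait, release };
  }
}

const order = [];
const gate = new Gate();

async function rpcCall(tag) {
  const slot = gate.reserve();          // eager acquisition, BEFORE the hops
  let p = Promise.resolve();
  for (let i = 0; i < 4; i++) p = p.then(() => {}); // session-setup hops
  await p;
  await slot.wait;                      // reservation consumed by the call
  order.push(tag);
  slot.release();
}

async function fetchCall(tag) {
  const slot = gate.reserve();
  await Promise.resolve();              // single hop
  await slot.wait;
  order.push(tag);
  slot.release();
}

async function demo() {
  // RPC is sent first, fetch second, from the same synchronous execution.
  await Promise.all([rpcCall("rpc"), fetchCall("fetch")]);
  return order;
}
```

Here `demo()` yields `["rpc", "fetch"]`: send order is preserved. If `rpcCall` instead reserved its slot only after the setup hops, the fetch would overtake it, which is the bug this PR fixes.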
Performance

The InputGate lock is acquired a few event loop turns earlier than before: during the Cap'n Proto session setup and `kj::yield()` period (microseconds). ExternalPusher calls (`pushByteStream`, `pushAbortSignal`) don't need the InputGate, so they are unaffected. Other events wait one extra turn at most. For single-caller latency, the gate is free and `wait()` returns immediately.

Changes
- `worker-rpc.c++`: 27 lines; early InputGate acquisition in `JsRpcSessionCustomEvent::run()`, lock threading through `JsRpcTargetBase` to `ctx.run()`
- `js-rpc-test.js`: new `OrderingActor` DO class + 3 test cases (`mixedRpcFetchOrdering`, `pureRpcOrdering`, `pureFetchOrdering`)
- `js-rpc-test.wd-test`: register `OrderingActor` binding and namespace

Test plan
- `mixedRpcFetchOrdering`: interleaves 20 RPC and fetch calls, asserts send-order delivery (the primary bug)
- `pureRpcOrdering`: 20 fire-and-forget RPC calls preserve order (regression)
- `pureFetchOrdering`: 20 fire-and-forget fetch calls preserve order (regression)
- `eOrderTest` passes (ExternalPusher + e-order regression)
- `receiveStubOverRpc` passes (service stub RPC ordering regression)
- `namedActorBinding` passes (DO RPC regression)
- The `js-rpc-test@` suite passes locally

Reproduction
Repro repo: https://github.com/threepointone/test-do-rpc-fetch-ordering
Live demo:

`curl https://test-ordering.threepointone.workers.dev/test-all?n=20`

Root cause analysis, fix design, and implementation developed in collaboration with Cursor.
Made with Cursor