Defer TCP socket connect() until DO output gate clears. by vaibhavSharma-cf · Pull Request #6533 · cloudflare/workerd

vaibhavSharma-cf · 2026-04-08T19:21:26Z

Adds TCP_SOCKET_CONNECT_OUTPUT_GATE autogate that defers httpClient->connect() until IoContext::waitForOutputLocks() resolves. Socket is returned immediately backed by kj::newPromisedStream so reads/writes queue until the real connection arrives. Connect task lives in ConnectionData alongside tlsStarter to prevent
TlsStarterCallback UAF on Socket GC.

vaibhavSharma-cf · 2026-04-08T19:25:36Z

/bonk review

ask-bonk

Review Summary

[HIGH] Use-after-free: addressStr captured as non-owning kj::StringPtr in coroutine, then the owning kj::String is moved away — the coroutine will dereference freed memory when it resumes after waitForOutputLocks().
[MEDIUM] ConnectionData member ordering: connectTask is destroyed after tlsStarter — while currently safe due to coroutine cancellation semantics, this contradicts the stated intent and is fragile against future changes. Moving connectTask after tlsStarter would make the safety explicit.
[LOW] Stale comment on connectImpl — the original TODO was removed but the replacement comment is misleading about where the gate wait actually occurs.

This review was generated by an AI assistant and may contain errors. Please verify all findings independently.

src/workerd/api/sockets.c++

ask-bonk · 2026-04-08T19:32:24Z

src/workerd/api/sockets.h

+    kj::Maybe<kj::Promise<void>> connectTask;
    // tlsStarter must be declared after connectionStream so that it is destroyed first,
    // since it holds a reference that keeps the connection alive.
    kj::Own<kj::TlsStarterCallback> tlsStarter;


[MEDIUM] Destruction order concern: C++ destroys members in reverse declaration order, so the current order is:

tlsStarter destroyed first (declared last)

connectTask destroyed second

watchForDisconnectTask destroyed third

connectionStream destroyed last

This means tlsStarter is freed before connectTask is cancelled. The PR description says the purpose of storing connectTask here is to "cancel the connect task before tlsStarter is destroyed" — but the current declaration order achieves the opposite.

In practice this is currently safe because cancelling a KJ coroutine just destroys the coroutine frame without running more code, so the dangling httpConnectSettings.tlsStarter pointer inside the frame is never dereferenced during destruction. However, this is a fragile invariant — if anyone later adds cleanup logic or attaches destructors that touch tlsStarter, this becomes a real UAF.

Suggestion: move connectTask after tlsStarter to match the stated intent:

Suggested change

kj::Maybe<kj::Promise<void>> connectTask;

// tlsStarter must be declared after connectionStream so that it is destroyed first,

// since it holds a reference that keeps the connection alive.

kj::Own<kj::TlsStarterCallback> tlsStarter;

kj::Maybe<kj::Promise<void>> connectTask;

// tlsStarter must be declared after connectionStream so that it is destroyed first,

// since it holds a reference that keeps the connection alive.

kj::Own<kj::TlsStarterCallback> tlsStarter;

should become:

kj::Own<kj::TlsStarterCallback> tlsStarter; // connectTask must be declared after tlsStarter so that it is destroyed first, // since the coroutine frame holds a raw pointer to tlsStarter via httpConnectSettings. kj::Maybe<kj::Promise<void>> connectTask;

Note: this would also require reordering the constructor initializer list accordingly.

src/workerd/api/sockets.c++

ask-bonk · 2026-04-08T19:32:45Z

Review posted on PR #6533 with 3 inline comments. Here's a summary of the findings:

[HIGH] Use-after-free on addressStr (sockets.c++:299): The doConnect coroutine captures addressStr as a non-owning kj::StringPtr, but the owning kj::String is moved into setupSocket() on line 323 after the coroutine is created. When the coroutine resumes after co_await ioContext.waitForOutputLocks(), it dereferences a dangling pointer. Fix: have the coroutine take ownership of the string (change parameter to kj::String).

[MEDIUM] ConnectionData member destruction order (sockets.h:189-192): The declared order puts connectTask before tlsStarter, meaning tlsStarter is destroyed first (reverse order). This contradicts the PR description's stated intent of cancelling the connect task before tlsStarter is destroyed. While currently safe due to coroutine cancellation semantics, the ordering should be swapped to make the safety guarantee explicit and resilient to future changes.

[LOW] Misleading function name (sockets.c++:354): The comment on connectImpl explains the output gate is now handled inside connectImplNoOutputLock, but the function name implies it does not handle output locks.

github run

jqmmes · 2026-04-09T11:48:13Z

src/workerd/api/tests/socket-output-gate-test.wd-test

+      name = "main",
+      worker = (
+        modules = [
+          (name = "worker", esModule = embed "socket-output-gate-test.js"),


Can we also get a no-autogate variant of this test, just to show the behavior change when the autogate is not enabled?

I have removed this test case as it can be the case that putCompleted is set to true immediately after storage put, making this test flaky. Instead I have added test case specific to connectImplNoOutputLock with blocking/unblocking output gate.

src/workerd/api/sockets.c++

dom96

What is the rationale for doing this? Why do we need to prevent network outputs while storage writes are pending?

It also doesn't seem like that is what is happening in this PR. It is actually blocking socket connections completely, shouldn't only writes be blocked instead? Focusing on blocking writes may also simplify the implementation, since connect itself is synchronous, so the code to block it is quite involved and requires these fulfillers (which I don't think are ideal).

Adds TCP_SOCKET_CONNECT_OUTPUT_GATE autogate that defers httpClient->connect() until IoContext::waitForOutputLocks() resolves. Socket is returned immediately backed by kj::newPromisedStream so reads/writes queue until the real connection arrives.

kentonv · 2026-04-09T18:24:55Z

What is the rationale for doing this? Why do we need to prevent network outputs while storage writes are pending?

That's how output gates work. We don't want applications to have to wait for writes to be confirmed durable before we allow them to continue executing. Instead, we make it impossible to observe that the app has continued executing until the write has become durable. If the write ends up failing, then we prevent the rest of the world from ever knowing that the app kept running -- so from the PoV of the rest of the world, the app stopped exactly where the failed write was performed. This is a huge optimization because it allows an app to perform several writes in rapid succession with only one round trip to durable storage (SRS followers).

https://blog.cloudflare.com/durable-objects-easy-fast-correct-choose-three/

shouldn't only writes be blocked instead?

Anything that can be observed by the outside world needs to be blocked. That means we have to block both writes and connection initiation -- the remote end can observe that it has received a connection, and in theory you could imagine a protocol where merely connecting to a remote service serves to notify it of something, thus having side effects.

kentonv · 2026-04-09T18:27:45Z

src/workerd/api/sockets.c++

-  return result;
+
+  if (util::Autogate::isEnabled(util::AutogateKey::TCP_SOCKET_CONNECT_OUTPUT_GATE)) {
+    // Deferred-connect path: return a Socket backed by a promised stream immediately, deferring


I think you can make this a lot simpler. Use newPromisedWorkerInterface() to wrap the underlying output channel that connect() is being called on, then fulfill the promise when the output gate resolves.

vaibhavSharma-cf requested review from a team as code owners April 8, 2026 19:21

vaibhavSharma-cf self-assigned this Apr 8, 2026

vaibhavSharma-cf requested review from a-robinson and jqmmes April 8, 2026 19:28

ask-bonk bot requested changes Apr 8, 2026

View reviewed changes

jqmmes reviewed Apr 9, 2026

View reviewed changes

src/workerd/api/sockets.c++ Outdated Show resolved Hide resolved

dom96 requested changes Apr 9, 2026

View reviewed changes

vaibhavSharma-cf force-pushed the vaibhav/socket-output-gate branch from afc70fa to 2093ffb Compare April 9, 2026 15:08

vaibhavSharma-cf force-pushed the vaibhav/socket-output-gate branch from 2093ffb to 828423e Compare April 9, 2026 15:10

kentonv reviewed Apr 9, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Defer TCP socket connect() until DO output gate clears.#6533

Defer TCP socket connect() until DO output gate clears.#6533
vaibhavSharma-cf wants to merge 1 commit intomainfrom
vaibhav/socket-output-gate

vaibhavSharma-cf commented Apr 8, 2026

Uh oh!

vaibhavSharma-cf commented Apr 8, 2026

Uh oh!

ask-bonk bot left a comment

Uh oh!

Uh oh!

ask-bonk bot Apr 8, 2026

Uh oh!

Uh oh!

ask-bonk bot commented Apr 8, 2026

Uh oh!

jqmmes Apr 9, 2026

Uh oh!

vaibhavSharma-cf Apr 9, 2026

Uh oh!

Uh oh!

dom96 left a comment

Uh oh!

kentonv commented Apr 9, 2026

Uh oh!

kentonv Apr 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

vaibhavSharma-cf commented Apr 8, 2026

Uh oh!

vaibhavSharma-cf commented Apr 8, 2026

Uh oh!

ask-bonk bot left a comment

Choose a reason for hiding this comment

Review Summary

Uh oh!

Uh oh!

ask-bonk bot Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ask-bonk bot commented Apr 8, 2026

Uh oh!

jqmmes Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

vaibhavSharma-cf Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

dom96 left a comment

Choose a reason for hiding this comment

Uh oh!

kentonv commented Apr 9, 2026

Uh oh!

kentonv Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants