-
Notifications
You must be signed in to change notification settings - Fork 1
Async Postgres and Pipelining
vanilla talks to PostgreSQL through pg_async, a native wire‑protocol client — no libpq. It speaks the v3 frontend/backend protocol directly (SCRAM‑SHA‑256 auth, the extended query protocol: Parse/Bind/Describe/Execute/Sync), frames messages off the recv buffer without copying, and integrates with the Architecture#the-async-runtime-opt-in via park/resume so a worker is never blocked on the database.
The HttpArena DB endpoints (async-db, fortunes) are reads. They are not WAL/fsync‑bound; they're bound by how efficiently the client can issue queries and decode rows. A native, non‑blocking client that can pipeline many queries on one connection and decode rows with zero copying is the right lever for that, in a way libpq's synchronous, copy‑heavy API is not. (crud create/update are writes and are ultimately DB‑ceiling‑bound; the client still matters for not blocking the worker.)
protocol.v framing + message builders + row iteration (pure, I/O-free)
▲ builders append to a caller buffer; parsers return borrowed slices
client.v connection state machine (startup, SCRAM, ReadyForQuery)
conn_async.v non-blocking submit/flush/drain; the per-conn pipeline
pool.v per-worker pool; acquire_pipelined() picks the least-loaded conn
Row decoding reads column lengths at their offset without slicing the payload (slicing allocated an array descriptor per column read — it was ~31 % of the async‑db per‑request CPU profile). Values are returned as borrowed []u8 views valid only inside the resume callback.
A single Postgres connection can carry up to max_inflight = 8 queries in flight at once. acquire_pipelined() picks the connection with the shortest pipeline rather than requiring an idle one, so the server multiplexes many clients' queries onto few connections.
On the reactor side, when a second distinct client parks on a connection that already has a watch, the WatchEntry promotes to a FIFO queue of ParkSlots. The invariant that makes everything work:
queue[k]↔ the connection'sinflight[k]— the k‑th parked client corresponds to the k‑th in‑flight query. One readable edge drains replies in submission order, and each reply's continuation gets its client's result.
drain_pipelined runs the head continuations until one cannot complete yet (.suspend); by FIFO, if the front query isn't ready no later one is either, so it stops. Replies are popped with queue.delete(0).
The queue buffer is reused, not re‑allocated each cycle: promote/orphan push into the retained buffer, the drain‑to‑empty path resets len=0 instead of assigning a fresh []ParkSlot{}. Under -gc none a fresh empty array per drain cycle was the dominant per‑request allocation on the pipelined path. See Memory Management under gc none.
A pooled DB connection must not be closed when the client that parked on it disconnects mid‑query — closing it would force a reconnect and a fresh SCRAM/PBKDF2 handshake on the next borrow (expensive, and pointless). So pooled fds are armed with watch_persistent, and on a mid‑query client disconnect the runtime:
-
Tombstones the parked slot (
ParkSlot.dead = true) instead of closing the fd. The slot stays in the queue so the connection's in‑flight FIFO stays aligned;drain_pipelinedstill consumes the orphaned reply in order (against a throwaway buffer) and discards it. - Identifies the dead client by
client_fd, never by re‑looking‑upconns[], so a reused fd (ABA) can't be mistaken for the disconnected client.
For a connection parked alone (no queue) on a persistent fd, reactor_orphan_single converts the single watch into a one‑slot dead tombstone, reusing the same drain path. The net effect: pool connections stay warm and open across client churn.
When every pooled connection is at max_inflight, the pool is saturated. Rather than block the worker, park() sheds the request: acquire_pipelined() (or async_submit/async_flush) fails, and the call site returns a caller‑chosen fallback response.
This is a deliberate design — blocking would collapse throughput on the closed‑loop benchmark clients — but the status a shed returns is a real decision:
-
Read endpoints shed to a benign empty
200({"items":[],"count":0}etc.). -
crud write/get shed to
503 Service Unavailable— the honest "server momentarily out of DB pipeline capacity" status. (Earlier these returned400/404, which misreported a backpressure shed as a client error and showed up as spurious 4xx in the benchmark; see Gotchas and Lessons Learned#the-crud-4xx-that-wasnt-a-bug.)
Genuine errors are unchanged: a malformed JSON body is still 400, a missing item is still 404. Only the shed fallback is 503.
The deeper questions — should the pool expose an explicit shed signal so callers don't guess the status; is max_inflight=8 the right number (raising it reduces shedding but costs max_inflight × 16 KiB of reply buffer per connection, against the memory budget); should a transient saturation queue briefly instead of shedding — are tracked in vanilla#51.