ShapeCache Bottlenecks Under Thundering Herd
Analysis of where load piles up when many shape requests arrive concurrently (both ?offset=-1 initial requests and requests carrying an existing handle), specifically the load that lands before AdmissionControl can shed it.
Request flow (relevant slice)
Electric.Plug.ServeShapePlug pipeline, in order:
1. :resolve_existing_shape (runs before admission; SQLite read).
2. :check_admission (ETS counter gate, Electric.AdmissionControl).
3. :load_shape (calls into ShapeCache).
4. :serve_shape_response.
So every request, including the ones that admission will reject, performs a shape lookup first: a SQLite read unless the WriteBuffer ETS lookup hits.
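The effect of that ordering can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in rather than Electric code: the module name PipelineOrderSketch is invented, the "SQLite read" is just a counter increment, and the admission gate is a simplified counter that never releases slots.

```elixir
defmodule PipelineOrderSketch do
  # Hypothetical stand-in for the plug pipeline; not Electric's code.
  # `reads` counts simulated SQLite reads, `in_flight` counts admitted requests.

  # Current order: the lookup (read) happens first, then admission decides.
  def read_then_admit(reads, in_flight, limit) do
    :counters.add(reads, 1, 1)
    admit(in_flight, limit)
  end

  # Alternative order: admission decides first, the read happens only if admitted.
  def admit_then_read(reads, in_flight, limit) do
    case admit(in_flight, limit) do
      :ok ->
        :counters.add(reads, 1, 1)
        :ok

      :rejected ->
        :rejected
    end
  end

  # Simplified admission gate: take a slot, give it back if over the limit.
  defp admit(in_flight, limit) do
    :counters.add(in_flight, 1, 1)

    if :counters.get(in_flight, 1) > limit do
      :counters.sub(in_flight, 1, 1)
      :rejected
    else
      :ok
    end
  end

  # Pushes `n` requests through `fun`. Slots are never released, which models
  # a herd arriving faster than admitted requests complete.
  def run(fun, n, limit) do
    reads = :counters.new(1, [])
    in_flight = :counters.new(1, [])
    results = for _ <- 1..n, do: fun.(reads, in_flight, limit)
    %{reads: :counters.get(reads, 1), admitted: Enum.count(results, &(&1 == :ok))}
  end
end
```

With 1000 requests and a limit of 10, `PipelineOrderSketch.run(&PipelineOrderSketch.read_then_admit/3, 1000, 10)` returns `%{reads: 1000, admitted: 10}`, while the same call with `&PipelineOrderSketch.admit_then_read/3` returns `%{reads: 10, admitted: 10}`. The admitted count is identical; only the read load differs.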
Bottleneck 1 — Pre-admission SQLite reads
resolve_existing_shape runs on the Bandit request process and queries the shape DB (ShapeStatus.handle_for_shape/2). Under a thundering herd this saturates the SQLite read pool before AdmissionControl gets a chance to reject anything. Admission control therefore can't actually shed load on the hottest path — it can only shed work that happens after the read.
The WriteBuffer ETS lookup (WriteBuffer.lookup_handle/2) short-circuits recent writes, so steady-state hits are cheap, but cold/missing shapes still fall through to a real read connection checkout.
Bottleneck 2 — ShapeCache GenServer mailbox
ShapeCache.get_or_create_shape_handle/3:
    with {:ok, handle} <- fetch_handle_by_shape(shape, stack_id),
         {:ok, offset} <- fetch_latest_offset(stack_id, handle) do
      {handle, offset}
    else
      :error ->
        GenServer.call(name(stack_id), {:create_or_wait_shape_handle, ...}, @call_timeout)
    end
For an offset=-1 request on a shape that doesn't exist yet, the fast path fails and every caller funnels into a single GenServer.call against the per-stack ShapeCache process. Even with internal coalescing (see below), all calls serialize through one mailbox: 1000 requests for the same shape become 1000 sequential mailbox entries.
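A toy model of this funnel, assuming nothing about the real ShapeCache internals: MailboxSketch is a hypothetical GenServer that "creates" a shape by sleeping, then answers later calls for the same shape from its state. It only illustrates that coalesced calls still pass through the mailbox one at a time.

```elixir
defmodule MailboxSketch do
  use GenServer

  # Hypothetical stand-in for the per-stack ShapeCache process.
  def start_link(setup_ms), do: GenServer.start_link(__MODULE__, setup_ms)

  @impl true
  def init(setup_ms), do: {:ok, %{setup_ms: setup_ms, handles: %{}, served: 0}}

  @impl true
  def handle_call({:create_or_wait, shape}, _from, state) do
    # Every call is counted, whether or not it ends up doing real work.
    state = %{state | served: state.served + 1}

    case state.handles do
      %{^shape => handle} ->
        # Coalesced: cheap, but it still waited its turn in the mailbox.
        {:reply, handle, state}

      _ ->
        # Simulated synchronous shape setup.
        Process.sleep(state.setup_ms)
        handle = "handle-#{map_size(state.handles) + 1}"
        {:reply, handle, %{state | handles: Map.put(state.handles, shape, handle)}}
    end
  end

  def handle_call(:served, _from, state), do: {:reply, state.served, state}

  # Fires `n` concurrent callers for the same shape and collects their replies.
  def herd(pid, shape, n) do
    1..n
    |> Enum.map(fn _ -> Task.async(fn -> GenServer.call(pid, {:create_or_wait, shape}) end) end)
    |> Enum.map(&Task.await/1)
  end
end
```

After `{:ok, pid} = MailboxSketch.start_link(50)` and `MailboxSketch.herd(pid, :shape_a, 100)`, every caller receives the same handle, yet `GenServer.call(pid, :served)` reports 100: one mailbox entry per request. Sampling `Process.info(pid, :message_queue_len)` while a herd is in progress is one way to watch the backlog on a real process.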
Bottleneck 3 — Write connection contention inside the GenServer
Inside the GenServer, maybe_create_shape calls ShapeStatus.handle_for_shape_critical/2, which uses the write connection:
    def handle_for_shape_critical(stack_id, %Shape{} = shape, timeout \\ 10_000) do
      checkout_fun = &checkout_write!(stack_id, :handle_for_shape_critical, &1, timeout)
      handle_for_shape_inner(stack_id, shape, checkout_fun)
    end
Coalescing already exists: handle_for_shape_inner checks WriteBuffer.lookup_handle/2 first, so once the leader has written the shape, queued requests for the same shape return from ETS without touching SQLite.
But:
- Different shape hashes still need the write connection.
- The write connection is shared with WriteBuffer flushing: see @max_drain_per_cycle 100 in write_buffer.ex and its comment about "yield[ing] the write connection to handle_for_shape_critical/2 reasonably often." This is an explicit, known design tension.
- The :critical path uses the write connection precisely because SQLite WAL mode does not guarantee that a fresh read connection sees the latest committed write, so the read pool can't safely substitute here.
Bottleneck 4 — Downstream synchronous work the GenServer waits on
maybe_create_shape is not just a metadata insert. While holding the GenServer, it (transitively) starts the consumer/snapshotter machinery — Electric.Shapes.Consumer.Snapshotter.start_link and the surrounding supervisor wiring. The Snapshotter itself runs in {:continue, :start_snapshot}, so the Postgres query is async, but the setup (process registration, PublicationManager.add_shape, supervisor child start) happens on the synchronous path. While ShapeCache is in that critical section, the mailbox keeps growing.
So while each call is paying the full setup cost, the mailbox during a herd grows at roughly (with shape_setup_latency in seconds):
mailbox_growth_rate = incoming_rps − (1 / shape_setup_latency)
If shape_setup_latency regresses (slow PG, slow disk, lock waits), every request behind it stalls — including requests that would have been cheap (coalesced or for unrelated shapes).
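To put numbers on the growth-rate relation, here is a small worked sketch. The module QueueGrowthSketch and the figures used (1000 requests per second, 250 ms setup latency) are illustrative assumptions, not measurements, and the model treats every call as paying the full setup cost, so it is a worst case that ignores coalesced calls.

```elixir
defmodule QueueGrowthSketch do
  # Net mailbox growth in requests per second.
  # `setup_latency_s` is the synchronous setup time per call, in seconds,
  # so 1 / setup_latency_s is the GenServer's service rate.
  def growth_rate(incoming_rps, setup_latency_s) do
    incoming_rps - 1 / setup_latency_s
  end

  # Approximate backlog after `seconds` of sustained load.
  # A negative growth rate means the GenServer keeps up, so the backlog is 0.
  def backlog(incoming_rps, setup_latency_s, seconds) do
    max(growth_rate(incoming_rps, setup_latency_s) * seconds, 0.0)
  end
end
```

At 1000 rps with a 0.25 s setup, the service rate is 4 calls per second, so `QueueGrowthSketch.growth_rate(1000, 0.25)` is 996.0 and a 10-second herd leaves `QueueGrowthSketch.backlog(1000, 0.25, 10)` at 9960.0 queued calls. Drop the load to 2 rps and the backlog is 0.0, since the process then drains faster than requests arrive.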
Summary of where load actually lands pre-admission
| Layer | Per-request cost | Serialization point | Notes |
|---|---|---|---|
| resolve_existing_shape | 1 SQLite read (or ETS hit) | Read pool | Runs on every request, including those that will be rejected |
| check_admission | ETS counter | None | Cheap |
| ShapeCache.get_or_create_shape_handle fast path | ETS / read pool | Read pool | Coalesces at WriteBuffer level |
| ShapeCache GenServer call | Mailbox queue + write connection | Single GenServer, single write connection | The real chokepoint for new shapes |
| Snapshotter/Consumer setup | Process start + PublicationManager | Held inside GenServer call | Tail latency here = head-of-line blocking |
Why coalescing alone is not enough
Coalescing (the WriteBuffer ETS short-circuit inside handle_for_shape_inner) eliminates redundant SQLite work during a herd on the same shape. It does not eliminate:
- Redundant GenServer.calls (each request still queues and is served sequentially, even if its work inside is fast).
- Pre-admission read load on every incoming request.
- Head-of-line blocking from any one slow shape setup.
Directions worth considering
These are sketches, not proposals — each has trade-offs.
- Run admission ahead of the SQLite read. Reorder the pipeline so that the existence check no longer precedes admission. If admission rejects, no read should happen.
- Pre-GenServer ETS-based dedup of in-flight creations. A :ets.insert_new keyed by shape hash designates one leader; followers subscribe via Registry/monitor and wait for the result. This collapses N concurrent GenServer.calls for the same shape into 1 call + N waiters, keeping the mailbox short.
- Pull setup work out of the synchronous GenServer path. Reduce what maybe_create_shape does while the GenServer is "busy" so head-of-line blocking shrinks. Anything not strictly needed to return a handle should be cast/continue'd.
- Bound or shed at the GenServer boundary. Today @call_timeout = 30_000 lets very deep mailbox queues build before anyone fails fast. A mailbox-depth-aware reject (in concert with admission control) would convert latency into explicit backpressure.
- Per-shape sharding of ShapeCache. Splitting the single per-stack GenServer into a pool sharded by shape hash removes the global mailbox bottleneck while preserving per-shape ordering guarantees.
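As one illustration of the in-flight dedup direction, here is a rough sketch of leader election with :ets.insert_new plus Registry-based wake-ups. All names (InFlightDedupSketch, the table and registry names, the {:shape_ready, ...} message) are hypothetical, and `create_fun` stands in for the single GenServer.call that actually creates the shape. Leader crashes, stale wake-up messages, and eviction of finished entries are deliberately not handled.

```elixir
defmodule InFlightDedupSketch do
  # Hypothetical names throughout; this is not Electric's code.
  @table :in_flight_shapes_sketch
  @registry InFlightDedupSketch.Waiters

  # Creates the public ETS table and the waiter registry.
  # The calling process owns the table and is linked to the registry.
  def setup do
    :ets.new(@table, [:named_table, :public, :set])
    {:ok, _} = Registry.start_link(keys: :duplicate, name: @registry)
    :ok
  end

  def get_or_create(shape_hash, create_fun, timeout \\ 5_000) do
    if :ets.insert_new(@table, {shape_hash, :pending}) do
      # Leader: the only caller that runs the expensive creation.
      handle = create_fun.()
      # Publish the result first, then wake everyone registered so far.
      :ets.insert(@table, {shape_hash, {:done, handle}})

      Registry.dispatch(@registry, shape_hash, fn waiters ->
        for {pid, _value} <- waiters, do: send(pid, {:shape_ready, shape_hash, handle})
      end)

      {:leader, handle}
    else
      # Follower: register BEFORE re-reading the table, so that either the
      # table already holds the result or the leader's dispatch will see us.
      {:ok, _} = Registry.register(@registry, shape_hash, nil)

      result =
        case :ets.lookup(@table, shape_hash) do
          [{^shape_hash, {:done, handle}}] ->
            {:follower, handle}

          _pending ->
            receive do
              {:shape_ready, ^shape_hash, handle} -> {:follower, handle}
            after
              timeout -> {:error, :timeout}
            end
        end

      Registry.unregister(@registry, shape_hash)
      result
    end
  end
end
```

After `InFlightDedupSketch.setup()`, fifty concurrent calls to `InFlightDedupSketch.get_or_create(:hash_a, create_fun)` invoke `create_fun` once: one caller returns `{:leader, handle}` and the other forty-nine return `{:follower, handle}` without going near the creating process. Registering before the second table read is what closes the missed-wake-up window; a production version would also need to monitor the leader so followers do not wait out the full timeout if it dies.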