Skip to content

ShapeCache: coalesce concurrent create_or_wait callers via GenServer-owned per-shape lock + polling #4372

@alco

Description

@alco

Background

Electric.ShapeCache.get_or_create_shape_handle/3 is called from Api.load_shape_info on every shape request that misses the in-process fast path. The fast path checks ShapeStatus.fetch_handle_by_shape (WriteBuffer ETS, then SQLite read connection) followed by fetch_latest_offset (Storage). On miss, the caller falls through to GenServer.call(ShapeCache, {:create_or_wait_shape_handle, …}, 30_000).

When N concurrent requests for the same uncached shape arrive, all of them miss the fast path and all of them enqueue a GenServer.call. Handler 1 does the actual creation work (safe_maybe_create_shapeadd_shape + start_shape); handlers 2..N short-circuit via fetch_handle_by_shape_critical finding the shape that handler 1 just created. So the work is deduplicated, but every caller still occupies a mailbox slot for the full duration of handler 1's creation plus its own dispatch latency. Under filesystem tail latency, shape creation reaches multiple seconds and the queue depth is real.

This is a sibling of #4370 (EtsInspector) and #4371 (StatusMonitor) under the thundering-herd umbrella #4266. The cheap admission control (#4292 / #4291 / PR #4359) bounds the concurrent population that can reach ShapeCache at all via the :initial bucket, but does not prevent admitted requests for the same shape from piling into the ShapeCache mailbox.

Goal

Prevent the ShapeCache mailbox from accumulating concurrent calls for the same fresh shape. Only one caller per unique shape-creation event should issue a GenServer.call; every other caller should observe the in-progress state via ETS and poll until the result is ready.

Design

The lock is GenServer-owned for writes, caller-readable for routing decisions. Callers never write to the lock table, eliminating any stale-claim concerns.

Caller flow

1. fast_path(shape) hit? → return
2. :ets.member(shape_create_lock(stack_id), shape_hash)?
     true  → PollWait.until(poll_predicate, @call_timeout, <ShapeCache opts>)
     false → GenServer.call(name(stack_id), {:create_or_wait_shape_handle, …}, @call_timeout)

Server handler (replaces today's body)

case fetch_handle_by_shape_critical(stack_id, shape) do
  {:ok, handle} -> reply {handle, fetch_latest_offset(...)}
  :error        ->
    :ets.insert(shape_create_lock(stack_id), {shape_hash, self()})
    result = safe_maybe_create_shape(shape, state)
    :ets.delete(shape_create_lock(stack_id), shape_hash)
    reply result
end

Poll predicate

with {:ok, handle} <- ShapeStatus.fetch_handle_by_shape(stack_id, shape),
     true          <- ShapeStatus.shape_has_been_activated?(stack_id, handle),
     {:ok, offset} <- fetch_latest_offset(stack_id, handle) do
  {:ready, {handle, offset}}
else
  _ -> :not_ready
end

shape_has_been_activated?/2 checks the last_used_timestamp that update_last_read_time_to_now/2 sets at the very end of start_shape, so it serves as a "creation fully succeeded" signal. If creation fails, clean_shape removes the row and waiters never observe :ready, eventually timing out.

Properties

  • Mailbox growth bounded by a single race window. Once a shape's lock is set, all new callers bypass the GenServer entirely. The mailbox grows only during the window between "first caller's call queued" and "handler dequeues and sets the lock" — i.e. mailbox dispatch latency in the common case, the previous shape's creation time in the contended case (and even then, capped at "one entry per fast-path-miss within the window").
  • No :wait reply needed. Any caller whose call reaches the handler with the lock unset is either the creator (does the work) or finds the shape already exists via the critical fetch (reply with the cached result). There's no third response code.
  • No stale-lock recovery needed. Callers don't write the lock, so they can't strand it. The GenServer always clears the lock in the same handler invocation that set it. On GenServer restart, init/1 wipes the lock table.
  • Specific error reasons are lost for polling waiters. They get :timeout after the deadline. Same trade-off as EtsInspector: prevent mailbox overload and avoid serving orphaned replies #4370 / StatusMonitor: replace mailbox-based wait_until with adaptive per-process polling #4371.

Backoff considerations — PollWait must accept per-caller config

This is the part to be aware of when implementing.

The defaults in #4371 (initial 25ms → doubling → 500ms cap) are tuned for stack-readiness waiting, where the underlying event flips on a second-to-minute timescale and aggressive polling is wasted. ShapeCache is the opposite: a single shape creation completes in tens to low-hundreds of milliseconds in the common case. Polling at a 500ms cap would routinely add hundreds of milliseconds of latency for no benefit.

ShapeCache's caller should pass its own backoff:

PollWait.until(poll_predicate, @call_timeout,
  initial_interval: 5,
  max_interval: 100,
  backoff: 2.0
)

Schedule: 5 → 10 → 20 → 40 → 80 → 100 → 100 … ms. Most shape creations are answered within the first 2–3 polls.

The PollWait primitive proposed in #4371 already accepts these as opts, so no primitive change is needed — only that the implementation honour the per-call overrides without baking the StatusMonitor defaults into a global config.

Crash safety

If the GenServer crashes mid-creation, its state is lost but the named, public lock table persists. On restart, init/1 runs :ets.delete_all_objects(shape_create_lock(stack_id)) to wipe stale entries. Any in-flight waiters that were polling time out after their existing deadline; new requests start fresh.

Suggested execution

  1. Depends on Electric.PollWait landing as part of StatusMonitor: replace mailbox-based wait_until with adaptive per-process polling #4371.
  2. Add the per-shape lock ETS table (shape_create_lock:<stack_id>), created in ShapeCache's init/1 with :public, :set, :named_table, read_concurrency: true, write_concurrency: :auto. Cleared in init/1 to handle crash recovery.
  3. Modify get_or_create_shape_handle/3 to add the :ets.member check before GenServer.call, with the polling fallback.
  4. Modify the handle_call({:create_or_wait_shape_handle, …}, …) body to set/clear the lock around safe_maybe_create_shape/2.
  5. Test:
    • Concurrent fan-out for the same fresh shape: assert mailbox depth is bounded by the brief-window count, and N–1 waiters receive the result via polling.
    • Crash recovery: kill the GenServer mid-creation, assert lock table is empty after restart, assert new requests succeed.
    • Backoff: assert poll cadence matches the configured schedule (within jitter).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions