ShapeCache: coalesce concurrent create_or_wait callers via GenServer-owned per-shape lock + polling

## Background

`Electric.ShapeCache.get_or_create_shape_handle/3` is called from `Api.load_shape_info` on every shape request that misses the in-process fast path. The fast path checks `ShapeStatus.fetch_handle_by_shape` (WriteBuffer ETS, then SQLite read connection) followed by `fetch_latest_offset` (Storage). On miss, the caller falls through to `GenServer.call(ShapeCache, {:create_or_wait_shape_handle, …}, 30_000)`.

When N concurrent requests for the same uncached shape arrive, all of them miss the fast path and all of them enqueue a `GenServer.call`. Handler 1 does the actual creation work (`safe_maybe_create_shape` → `add_shape` + `start_shape`); handlers 2..N short-circuit via `fetch_handle_by_shape_critical` finding the shape that handler 1 just created. So the *work* is deduplicated, but every caller still occupies a mailbox slot for the full duration of handler 1's creation plus its own dispatch latency. Under filesystem tail latency, shape creation reaches multiple seconds and the queue depth is real.

This is a sibling of #4370 (EtsInspector) and #4371 (StatusMonitor) under the thundering-herd umbrella #4266. The cheap admission control (#4292 / #4291 / PR #4359) bounds the concurrent population that can reach ShapeCache at all via the `:initial` bucket, but does not prevent admitted requests for the same shape from piling into the ShapeCache mailbox.

## Goal

Prevent the ShapeCache mailbox from accumulating concurrent calls for the same fresh shape. Only one caller per unique shape-creation event should issue a `GenServer.call`; every other caller should observe the in-progress state via ETS and poll until the result is ready.

## Design

The lock is **GenServer-owned for writes, caller-readable for routing decisions**. Callers never write to the lock table, eliminating any stale-claim concerns.

### Caller flow

```
1. fast_path(shape) hit? → return
2. :ets.member(shape_create_lock(stack_id), shape_hash)?
     true  → PollWait.until(poll_predicate, @call_timeout, <ShapeCache opts>)
     false → GenServer.call(name(stack_id), {:create_or_wait_shape_handle, …}, @call_timeout)
```

### Server handler (replaces today's body)

```elixir
case fetch_handle_by_shape_critical(stack_id, shape) do
  {:ok, handle} -> reply {handle, fetch_latest_offset(...)}
  :error        ->
    :ets.insert(shape_create_lock(stack_id), {shape_hash, self()})
    result = safe_maybe_create_shape(shape, state)
    :ets.delete(shape_create_lock(stack_id), shape_hash)
    reply result
end
```

### Poll predicate

```elixir
with {:ok, handle} <- ShapeStatus.fetch_handle_by_shape(stack_id, shape),
     true          <- ShapeStatus.shape_has_been_activated?(stack_id, handle),
     {:ok, offset} <- fetch_latest_offset(stack_id, handle) do
  {:ready, {handle, offset}}
else
  _ -> :not_ready
end
```

`shape_has_been_activated?/2` checks the `last_used_timestamp` that `update_last_read_time_to_now/2` sets at the very end of `start_shape`, so it serves as a "creation fully succeeded" signal. If creation fails, `clean_shape` removes the row and waiters never observe `:ready`, eventually timing out.

## Properties

- **Mailbox growth bounded by a single race window.** Once a shape's lock is set, all new callers bypass the GenServer entirely. The mailbox grows only during the window between "first caller's call queued" and "handler dequeues and sets the lock" — i.e. mailbox dispatch latency in the common case, the previous shape's creation time in the contended case (and even then, capped at "one entry per fast-path-miss within the window").
- **No `:wait` reply needed.** Any caller whose call reaches the handler with the lock unset is either the creator (does the work) or finds the shape already exists via the critical fetch (reply with the cached result). There's no third response code.
- **No stale-lock recovery needed.** Callers don't write the lock, so they can't strand it. The GenServer always clears the lock in the same handler invocation that set it. On GenServer restart, `init/1` wipes the lock table.
- **Specific error reasons are lost for polling waiters.** They get `:timeout` after the deadline. Same trade-off as #4370 / #4371.

## Backoff considerations — PollWait must accept per-caller config

This is the part to be aware of when implementing.

The defaults in #4371 (initial 25ms → doubling → 500ms cap) are tuned for stack-readiness waiting, where the underlying event flips on a second-to-minute timescale and aggressive polling is wasted. ShapeCache is the opposite: a single shape creation completes in tens to low-hundreds of milliseconds in the common case. Polling at a 500ms cap would routinely add hundreds of milliseconds of latency for no benefit.

ShapeCache's caller should pass its own backoff:

```elixir
PollWait.until(poll_predicate, @call_timeout,
  initial_interval: 5,
  max_interval: 100,
  backoff: 2.0
)
```

Schedule: **5 → 10 → 20 → 40 → 80 → 100 → 100 …** ms. Most shape creations are answered within the first 2–3 polls.

The `PollWait` primitive proposed in #4371 already accepts these as opts, so no primitive change is needed — only that the implementation honour the per-call overrides without baking the StatusMonitor defaults into a global config.

## Crash safety

If the GenServer crashes mid-creation, its state is lost but the named, public lock table persists. On restart, `init/1` runs `:ets.delete_all_objects(shape_create_lock(stack_id))` to wipe stale entries. Any in-flight waiters that were polling time out after their existing deadline; new requests start fresh.

## Suggested execution

1. Depends on `Electric.PollWait` landing as part of #4371.
2. Add the per-shape lock ETS table (`shape_create_lock:<stack_id>`), created in ShapeCache's `init/1` with `:public, :set, :named_table, read_concurrency: true, write_concurrency: :auto`. Cleared in `init/1` to handle crash recovery.
3. Modify `get_or_create_shape_handle/3` to add the `:ets.member` check before `GenServer.call`, with the polling fallback.
4. Modify the `handle_call({:create_or_wait_shape_handle, …}, …)` body to set/clear the lock around `safe_maybe_create_shape/2`.
5. Test:
   - Concurrent fan-out for the same fresh shape: assert mailbox depth is bounded by the brief-window count, and N–1 waiters receive the result via polling.
   - Crash recovery: kill the GenServer mid-creation, assert lock table is empty after restart, assert new requests succeed.
   - Backoff: assert poll cadence matches the configured schedule (within jitter).


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ShapeCache: coalesce concurrent create_or_wait callers via GenServer-owned per-shape lock + polling #4372

Background

Goal

Design

Caller flow

Server handler (replaces today's body)

Poll predicate

Properties

Backoff considerations — PollWait must accept per-caller config

Crash safety

Suggested execution

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

ShapeCache: coalesce concurrent create_or_wait callers via GenServer-owned per-shape lock + polling #4372

Description

Background

Goal

Design

Caller flow

Server handler (replaces today's body)

Poll predicate

Properties

Backoff considerations — PollWait must accept per-caller config

Crash safety

Suggested execution

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions