Skip to content

fix(sync-service): restore admission control release in error handlers#4130

Merged
msfstef merged 3 commits intomainfrom
msfstef/fix-admission-control-electric-boogaloo
Apr 16, 2026
Merged

fix(sync-service): restore admission control release in error handlers#4130
msfstef merged 3 commits intomainfrom
msfstef/fix-admission-control-electric-boogaloo

Conversation

@msfstef
Copy link
Copy Markdown
Contributor

@msfstef msfstef commented Apr 15, 2026

Summary

  • fix: move admission control before shape creation #4101 removed ensure_admission_control_release calls from handle_errors, assuming register_before_send callbacks would handle it — but Plug.ErrorHandler passes the original conn (before plugs ran) to handle_errors, so callbacks registered by check_admission are lost and permits leak on every error.
  • Restored ensure_admission_control_release calls in both handle_errors clauses with a nil guard for safety and comments explaining why register_before_send doesn't work here.
  • Extracted admission_kind/1 helper to share offset logic between check_admission and ensure_admission_control_release.
  • Added regression tests covering RuntimeError, DBConnection.ConnectionError, and nil stack_id error paths.

Test plan

  • Permit release on RuntimeError — fails without fix, passes with fix
  • Permit release on DBConnection.ConnectionError — covers the first handle_errors clause
  • Nil stack_id guard — verifies no spurious ETS entry when config is unavailable
  • Full serve_shape_plug_test.exs suite passes (38 tests)

🤖 Generated with Claude Code

#4101 removed ensure_admission_control_release calls from handle_errors,
assuming register_before_send callbacks would fire. They don't —
Plug.ErrorHandler uses the original conn (before plugs ran) in its catch
clause, so the callbacks registered by check_admission are lost and
permits leak on every error.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@msfstef msfstef added the claude label Apr 15, 2026
@msfstef msfstef requested review from alco and icehaunter April 15, 2026 16:54
@claude
Copy link
Copy Markdown

claude Bot commented Apr 15, 2026

Claude Code Review\n\n### Summary\n\nIteration 3 is a single commit (2b72da51) that replaces the hardcoded :ets.lookup(:electric_admission_control, nil) assertion with refute Repatch.called?(Electric.AdmissionControl, :release, 3). This directly addresses the suggestion from iteration 2 — the test no longer couples to the internal ETS table name atom. The production fix is unchanged and correct.\n\n### What's Working Well\n\n- ETS coupling removed: The Repatch spy approach tests observable behavior (was release called?) rather than internal structure.\n- Test name updated: "does not call release when exception occurs before config is assigned" precisely describes the guarded invariant.\n\n### Issues Found\n\nNo critical or important issues.\n\n#### Suggestions (Nice to Have)\n\nRepatch arity check may not match the actual call site\n\nFile: packages/sync-service/test/electric/plug/serve_shape_plug_test.exs:1148\n\nElectric.AdmissionControl.release has a default argument for opts, so Elixir compiles it into two BEAM exports: release/2 and release/3. The call in ensure_admission_control_release/1 passes two arguments, dispatching through release/2. Based on other test-suite usages (e.g. Repatch.called?(Postgrex, :transaction, 2, ...) specifically checks the 2-arg form), Repatch records the arity at the call site. If someone removed the nil guard and called release(nil, kind) — still a 2-arg call — the arity-3 assertion would silently miss the regression.\n\nSuggested fix:\n\nelixir\nrefute Repatch.called?(Electric.AdmissionControl, :release, 2)\n\n\nNot blocking — the nil guard itself is correct and the first two tests validate the happy-path release behavior.\n\n### Issue Conformance\n\nNo linked issue (unchanged). PR description is clear and accurate.\n\n### Previous Review Status\n\n- ✅ Suggestion (iter 2) — Hardcoded ETS table name: addressed by switching to Repatch.spy / Repatch.called?.\n\n---\n\n_Review iteration: 3 | 2026-04-15_

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.20%. Comparing base (59a96b8) to head (2b72da5).
⚠️ Report is 3 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4130   +/-   ##
=======================================
  Coverage   89.20%   89.20%           
=======================================
  Files          25       25           
  Lines        2520     2520           
  Branches      633      641    +8     
=======================================
  Hits         2248     2248           
  Misses        270      270           
  Partials        2        2           
Flag Coverage Δ
packages/experimental 87.73% <ø> (ø)
packages/react-hooks 86.48% <ø> (ø)
packages/start 82.83% <ø> (ø)
packages/typescript-client 94.30% <ø> (ø)
packages/y-electric 56.05% <ø> (ø)
typescript 89.20% <ø> (ø)
unit-tests 89.20% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

- Guard ensure_admission_control_release against nil stack_id to prevent
  spurious ETS entries when config is unavailable on the conn
- Extract admission_kind/1 helper to DRY up the offset check between
  check_admission and ensure_admission_control_release
- Extract build_plug_opts/1 and call_plug_expecting_crash/1 test helpers
  to reduce duplication across admission control tests
- Add test verifying nil stack_id doesn't pollute ETS

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@netlify
Copy link
Copy Markdown

netlify Bot commented Apr 15, 2026

Deploy Preview for electric-next ready!

Name Link
🔨 Latest commit fe206b9
🔍 Latest deploy log https://app.netlify.com/projects/electric-next/deploys/69dfc5c5aa0cda00084ddfd8
😎 Deploy Preview https://deploy-preview-4130--electric-next.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

Replace direct ETS table lookup with refute Repatch.called? to avoid
coupling the plug test to AdmissionControl's internal table name.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@msfstef
Copy link
Copy Markdown
Contributor Author

msfstef commented Apr 15, 2026

Opened #4131 as well to supplement this - currently after a redeploy, all requests coming in would be marked as "existing" because they are not -1, but they would be for inexistent shapes so they would get a 409 and trigger a shape creation

This effectively means that the existing cap is what would be used for the initial onslaught of requests, which is not great for the semantics of the admission control.

@msfstef msfstef merged commit 0fd30fc into main Apr 16, 2026
50 of 51 checks passed
@msfstef msfstef deleted the msfstef/fix-admission-control-electric-boogaloo branch April 16, 2026 08:12
alco added a commit that referenced this pull request Apr 21, 2026
…4131)

## Summary

> Stacked on #4130

Admission control currently classifies requests as `:initial` (strict
limit, default 300) or `:existing` (generous limit, default 10,000)
based solely on whether `offset == "-1"`. This is a proxy signal that
breaks down in two critical scenarios:

### Problem 1: Restarts and redeploys — stale handles bypass initial
limits

After a restart or redeploy with fresh storage, **all connected clients
reconnect with handles pointing to shapes that no longer exist**. These
requests carry a handle and a non-`-1` offset, so they are classified as
`:existing` — even though the 409 redirect flow will create a brand new
shape for each one. Under load, this means hundreds or thousands of
shape creations happening under the permissive `:existing` bucket, which
is exactly the overload scenario admission control exists to prevent.

### Problem 2: Shared shapes — initial requests penalised unnecessarily

When many clients subscribe to the same shape (e.g. a shared table), the
first client creates it. Every subsequent client sends `offset=-1` (no
handle yet) and gets classified as `:initial` — counted against the
strict 300 limit — despite the shape already existing and the request
being a cheap lookup. Under a burst of new subscribers this can trigger
503s for requests that would have been trivially served.

## Approach

Add a `resolve_existing_shape` plug between `validate_request` and
`check_admission` that checks whether the shape actually exists, then
classify based on reality rather than the offset string:

| Case | Check | Cost | Classification |
|---|---|---|---|
| No handle, shape exists | `fetch_handle_by_shape` (ETS + SQLite) |
**Replaces** same lookup in `load_shape` — net zero | `:existing` |
| No handle, shape doesn't exist | `fetch_handle_by_shape` → `:error` |
One read on the creation path (already expensive) | `:initial` |
| Handle present, shape alive | `fetch_handle_by_shape` finds shape |
Same lookup, classifies accurately | `:existing` |
| Handle present, shape dead | `fetch_handle_by_shape` → `:error` | One
read, correctly gates creation | `:initial` |

### Why the cost is worth it

- **No-handle case saves work**: `fetch_handle_by_shape` in
`resolve_existing_shape` replaces the identical call inside
`ShapeCache.get_or_create_shape_handle` that `load_shape` would do. The
handle is set on the request params, so `load_shape` takes the cheap
`resolve_shape_handle` path instead.
- **Handle-present case classifies accurately**: The same
`fetch_handle_by_shape` call checks shape existence by definition. The
alternative — not checking — means a restart/redeploy sends all
reconnecting clients through the `:existing` bucket while they all
trigger shape creation.
- **Handle is kept for dead shapes**: We don't clear the handle when the
shape is dead — `load_shape` still needs it for the 409 redirect flow.
We only use the existence check for admission classification.

## Admission release on error — approaches tried

Because `check_admission` now runs before `load_shape` (it needs to know
whether the shape exists, which `resolve_existing_shape` determines
up-front), any exception in `load_shape` or `serve_shape_response` must
still release the permit that was acquired earlier. We iterated through
three designs to get this right:

### Attempt 1 — `register_before_send` callback on the conn

`check_admission` attached a `register_before_send` callback that
released the permit before `send_resp`. This fails for two reasons:

1. **Streaming**: `before_send` fires when `send_chunked` *starts*
streaming, not when it finishes. The permit would be returned while
chunks were still in flight, and if the ErrorHandler also ended the
telemetry span via `before_send`, we'd lose `streaming_bytes_sent` and
`duration` metrics.
2. **Error path**: `Plug.ErrorHandler` captures the *original* conn
passed into `call/2`, not the accumulated conn. So the before_send
callback attached by `check_admission` was simply absent from the conn
seen by `handle_errors`, meaning exceptions in `load_shape` leaked the
permit entirely.

### Attempt 2 — outer/inner `Plug.Builder` split (commit 6971b8a, now
reverted)

Following José Valim's suggestion in plug#486: split into an outer
pipeline (`ServeShapePlug`) and inner pipeline (`ServeShapePlug.Inner`).
The inner module's `Plug.ErrorHandler` sees the conn with full outer
state — including `conn.private[:admission_kind]` and the before_send
callback. Exceptions in `load_shape` now released correctly.

Costs of this approach:
- Two modules, with five `@doc false` public callbacks so the inner
module can delegate back.
- Custom `halt/1` override (can't use before_send for telemetry span end
because of the streaming problem in Attempt 1 — so `halt` had to end the
span explicitly).
- Four-branch `admission_kind/1` fallback chain including an
offset-based last-resort for the outer ErrorHandler where neither
`check_admission` nor `resolve_existing_shape` had run.
- Duplicated `@impl Plug.ErrorHandler` in both modules.

### Attempt 3 — `try/catch` + `try/after` in `call/2` (current, credit
to @alco)

[@alco](https://github.com/alco)
[proposed](be1008b)
collapsing back to a single module by overriding `call/2` directly:

```elixir
def call(conn, opts) do
  conn = conn |> assign(:config, opts) |> fetch_query_params() |> start_telemetry_span()

  try do
    conn =
      try do
        super(conn, opts)
      catch
        kind, reason -> handle_caught(conn, kind, reason, __STACKTRACE__)
      end

    emit_shape_telemetry(conn)
    conn
  after
    OpentelemetryTelemetry.end_telemetry_span(OpenTelemetry, %{})
    release_admission_permit()
  end
end
```

`check_admission` stashes the acquired permit in the process dictionary;
the `after` clause releases it unconditionally on every exit path
(success, halt, uncaught exception). The OTEL span is popped in the same
`after` block.

**Why this is strictly better than Attempt 2:**
- Fewer moving parts: single module, no cross-module callbacks, no
`halt/1` override, `admission_kind/1` collapses to two branches
(`:shape_exists` true/false).
- Streaming is correct by construction: `super(conn, opts)` only returns
once `Api.Response.send_stream/2` has synchronously drained the body, so
`emit_shape_telemetry` captures the accurate `streaming_bytes_sent` and
`duration` *before* the span is popped.
- Every defensive fallback we'd accumulated in Attempt 2 existed because
`Plug.ErrorHandler` sees the wrong conn — Attempt 3 removes the root
cause, so the fallbacks can go with it.

Net −56 lines versus Attempt 2, and strictly more correct on the
streaming path.

## Test plan

- [x] New test: request for non-existing shape with `offset=-1` is
classified as `:initial` (rejected when slots full)
- [x] New test: request for existing shape with `offset=-1` is
classified as `:existing` (succeeds when initial slots full)
- [x] New test: request with dead handle is classified as `:initial`
(rejected when initial slots full)
- [x] New test: permit released when `load_shape` raises `RuntimeError`
- [x] New test: permit released when `load_shape` raises
`DBConnection.ConnectionError`
- [x] New test: correct `:existing` permit released when shape exists
and offset is `-1`
- [x] New test: correct `:initial` permit released when shape does not
exist
- [x] New test: release not called when exception occurs before
`check_admission` runs
- [x] Full router test suite passes (87 tests)
- [x] Full serve_shape_plug test suite passes (40 tests)
- [x] Full admission-control test suite passes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Oleksii Sholik <oleksii@sholik.dev>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants