fix(sync-service): restore admission control release in error handlers by msfstef · Pull Request #4130 · electric-sql/electric

msfstef · 2026-04-15T16:54:15Z

Summary

- #4101 (fix: move admission control before shape creation) removed the `ensure_admission_control_release` calls from `handle_errors`, assuming that `register_before_send` callbacks would handle the release. They don't: `Plug.ErrorHandler` passes the original conn (before any plugs ran) to `handle_errors`, so the callbacks registered by `check_admission` are lost and a permit leaks on every error.
- Restored the `ensure_admission_control_release` calls in both `handle_errors` clauses, with a nil guard for safety and comments explaining why `register_before_send` doesn't work here.
- Extracted an `admission_kind/1` helper so `check_admission` and `ensure_admission_control_release` share the same offset logic.
- Added regression tests covering the `RuntimeError`, `DBConnection.ConnectionError`, and nil `stack_id` error paths.
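A minimal sketch of the restored release path, with plain maps standing in for `Plug.Conn`. It assumes the config sits in `assigns[:config]` as a keyword list or map with a `:stack_id` key, that the offset arrives as `params["offset"]`, and that `release` takes `(stack_id, kind)` in that order. The module name and the process-dictionary spy are hypothetical, not Electric's actual code; the sketch only shows the shared `admission_kind/1` offset check and the nil guard that skips `release` when no stack id is available.

```elixir
defmodule AdmissionReleaseSketch do
  # Hypothetical stand-in for Electric.AdmissionControl.release/2:
  # records each call in the process dictionary so the effect is observable.
  def release(stack_id, kind) do
    Process.put(:releases, [{stack_id, kind} | Process.get(:releases, [])])
    :ok
  end

  # Shared classification (offset-based, as described in the PR):
  # offset "-1" marks an initial request, anything else an existing one.
  def admission_kind(%{params: %{"offset" => "-1"}}), do: :initial
  def admission_kind(_conn), do: :existing

  # Called from both handle_errors clauses. If config (and hence stack_id)
  # was never assigned, there is no permit to return, so release is skipped.
  def ensure_admission_control_release(%{assigns: assigns} = conn) do
    case assigns[:config][:stack_id] do
      nil -> :ok
      stack_id -> release(stack_id, admission_kind(conn))
    end
  end
end
```

Because the function head only matches on `%{assigns: assigns}`, the same clause would also accept a real `Plug.Conn` struct; `nil[:stack_id]` evaluates to `nil`, which is what makes the missing-config case fall through to the guard.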

Test plan

- Permit release on `RuntimeError`: fails without the fix, passes with it
- Permit release on `DBConnection.ConnectionError`: covers the first `handle_errors` clause
- Nil `stack_id` guard: verifies no spurious ETS entry when config is unavailable
- Full `serve_shape_plug_test.exs` suite passes (38 tests)

🤖 Generated with Claude Code

#4101 removed ensure_admission_control_release calls from handle_errors, assuming register_before_send callbacks would fire. They don't: Plug.ErrorHandler uses the original conn (before plugs ran) in its catch clause, so the callbacks registered by check_admission are lost and permits leak on every error.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

claude · 2026-04-15T16:58:14Z

Claude Code Review

### Summary

Iteration 3 is a single commit (`2b72da51`) that replaces the hardcoded `:ets.lookup(:electric_admission_control, nil)` assertion with `refute Repatch.called?(Electric.AdmissionControl, :release, 3)`. This directly addresses the suggestion from iteration 2: the test no longer couples to the internal ETS table name atom. The production fix is unchanged and correct.

### What's Working Well

- ETS coupling removed: the Repatch spy approach tests observable behavior (was `release` called?) rather than internal structure.
- Test name updated: `"does not call release when exception occurs before config is assigned"` precisely describes the guarded invariant.

### Issues Found

No critical or important issues.

#### Suggestions (Nice to Have)

**Repatch arity check may not match the actual call site**

File: `packages/sync-service/test/electric/plug/serve_shape_plug_test.exs:1148`

`Electric.AdmissionControl.release` has a default argument for `opts`, so Elixir compiles it into two BEAM exports: `release/2` and `release/3`. The call in `ensure_admission_control_release/1` passes two arguments, dispatching through `release/2`. Based on other usages in the test suite (e.g. `Repatch.called?(Postgrex, :transaction, 2, ...)` specifically checks the 2-arg form), Repatch records the arity at the call site. If someone removed the nil guard and called `release(nil, kind)` (still a 2-arg call), the arity-3 assertion would silently miss the regression.

Suggested fix:

```elixir
refute Repatch.called?(Electric.AdmissionControl, :release, 2)
```

Not blocking: the nil guard itself is correct, and the first two tests validate the happy-path release behavior.

### Issue Conformance

No linked issue (unchanged). PR description is clear and accurate.

### Previous Review Status

- ✅ Suggestion (iter 2), hardcoded ETS table name: addressed by switching to `Repatch.spy` / `Repatch.called?`.

---

_Review iteration: 3 | 2026-04-15_
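The review's arity point can be checked in isolation: a function with a trailing default argument compiles to one exported function per arity. `ReleaseArity` below is a hypothetical stand-in; only the `opts \\ []` default mirrors the review's description of `Electric.AdmissionControl.release`, whose real signature is not shown here.

```elixir
defmodule ReleaseArity do
  # Hypothetical stand-in shaped like the review's description of
  # Electric.AdmissionControl.release: one trailing default argument.
  def release(stack_id, kind, opts \\ []) do
    {stack_id, kind, opts}
  end
end

# The compiler emits one export per arity; release/2 fills in opts and calls release/3.
ReleaseArity.__info__(:functions)
|> Enum.filter(fn {name, _arity} -> name == :release end)
|> Enum.sort()
|> IO.inspect()
# prints [release: 2, release: 3]
```

This confirms that both `release/2` and `release/3` exist as exports; whether Repatch records the call-site arity (2) rather than the underlying arity (3) is the review's inference from other tests, not something this snippet demonstrates.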

codecov · 2026-04-15T16:59:21Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.20%. Comparing base (59a96b8) to head (2b72da5).
⚠️ Report is 3 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main    #4130   +/-   ##
=======================================
  Coverage   89.20%   89.20%
=======================================
  Files          25       25
  Lines        2520     2520
  Branches      633      641    +8
=======================================
  Hits         2248     2248
  Misses        270      270
  Partials        2        2
```

| Flag | Coverage Δ |
|---|---|
| packages/experimental | `87.73% <ø> (ø)` |
| packages/react-hooks | `86.48% <ø> (ø)` |
| packages/start | `82.83% <ø> (ø)` |
| packages/typescript-client | `94.30% <ø> (ø)` |
| packages/y-electric | `56.05% <ø> (ø)` |
| typescript | `89.20% <ø> (ø)` |
| unit-tests | `89.20% <ø> (ø)` |


- Guard ensure_admission_control_release against nil stack_id to prevent spurious ETS entries when config is unavailable on the conn
- Extract admission_kind/1 helper to DRY up the offset check between check_admission and ensure_admission_control_release
- Extract build_plug_opts/1 and call_plug_expecting_crash/1 test helpers to reduce duplication across admission control tests
- Add test verifying nil stack_id doesn't pollute ETS

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

netlify · 2026-04-15T17:08:30Z

✅ Deploy Preview for electric-next ready!

| Name | Link |
|---|---|
| 🔨 Latest commit | `fe206b9` |
| 🔍 Latest deploy log | https://app.netlify.com/projects/electric-next/deploys/69dfc5c5aa0cda00084ddfd8 |
| 😎 Deploy Preview | https://deploy-preview-4130--electric-next.netlify.app |


Replace direct ETS table lookup with refute Repatch.called? to avoid coupling the plug test to AdmissionControl's internal table name.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

msfstef · 2026-04-15T17:51:53Z

Opened #4131 as well to supplement this. Currently, after a redeploy, every incoming request would be classified as "existing" because its offset is not -1, yet it would point at a shape that no longer exists, so it would get a 409 and trigger a shape creation.

This effectively means the existing cap governs the initial onslaught of requests, which undermines the semantics of admission control.

@alco
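The reclassification proposed in #4131 can be sketched as a pure function: classify by whether the shape currently exists, regardless of the offset. Here `lookup` stands in for the existence check (`fetch_handle_by_shape` in #4131), and its `{:ok, handle} | :error` return shape is an assumption; the module and function names are hypothetical.

```elixir
defmodule ClassifySketch do
  # Old rule: the offset string alone decides the bucket.
  def by_offset("-1"), do: :initial
  def by_offset(_offset), do: :existing

  # New rule (per #4131): ask whether the shape exists right now.
  # `lookup` is a hypothetical function returning {:ok, handle} | :error.
  def by_existence(shape_def, lookup) when is_function(lookup, 1) do
    case lookup.(shape_def) do
      {:ok, _handle} -> :existing
      :error -> :initial
    end
  end
end

# After a redeploy with fresh storage, nothing exists yet.
empty = fn _shape -> :error end
ClassifySketch.by_offset("1234_0")          # :existing (the stale-handle problem)
ClassifySketch.by_existence(:todos, empty)  # :initial
```

The same function also covers the shared-shape case: an `offset=-1` request whose shape already exists lands in `:existing` instead of consuming an `:initial` slot.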

…4131)

## Summary

> Stacked on #4130

Admission control currently classifies requests as `:initial` (strict limit, default 300) or `:existing` (generous limit, default 10,000) based solely on whether `offset == "-1"`. This is a proxy signal that breaks down in two critical scenarios:

### Problem 1: Restarts and redeploys (stale handles bypass initial limits)

After a restart or redeploy with fresh storage, **all connected clients reconnect with handles pointing to shapes that no longer exist**. These requests carry a handle and a non-`-1` offset, so they are classified as `:existing`, even though the 409 redirect flow will create a brand new shape for each one. Under load, this means hundreds or thousands of shape creations happening under the permissive `:existing` bucket, which is exactly the overload scenario admission control exists to prevent.

### Problem 2: Shared shapes (initial requests penalised unnecessarily)

When many clients subscribe to the same shape (e.g. a shared table), the first client creates it. Every subsequent client sends `offset=-1` (no handle yet) and gets classified as `:initial`, counted against the strict 300 limit, despite the shape already existing and the request being a cheap lookup. Under a burst of new subscribers this can trigger 503s for requests that would have been trivially served.
## Approach

Add a `resolve_existing_shape` plug between `validate_request` and `check_admission` that checks whether the shape actually exists, then classify based on reality rather than the offset string:

| Case | Check | Cost | Classification |
|---|---|---|---|
| No handle, shape exists | `fetch_handle_by_shape` (ETS + SQLite) | **Replaces** same lookup in `load_shape` (net zero) | `:existing` |
| No handle, shape doesn't exist | `fetch_handle_by_shape` → `:error` | One read on the creation path (already expensive) | `:initial` |
| Handle present, shape alive | `fetch_handle_by_shape` finds shape | Same lookup, classifies accurately | `:existing` |
| Handle present, shape dead | `fetch_handle_by_shape` → `:error` | One read, correctly gates creation | `:initial` |

### Why the cost is worth it

- **No-handle case saves work**: `fetch_handle_by_shape` in `resolve_existing_shape` replaces the identical call inside `ShapeCache.get_or_create_shape_handle` that `load_shape` would do. The handle is set on the request params, so `load_shape` takes the cheap `resolve_shape_handle` path instead.
- **Handle-present case classifies accurately**: The same `fetch_handle_by_shape` call checks shape existence by definition. The alternative (not checking) means a restart/redeploy sends all reconnecting clients through the `:existing` bucket while they all trigger shape creation.
- **Handle is kept for dead shapes**: We don't clear the handle when the shape is dead, because `load_shape` still needs it for the 409 redirect flow. We only use the existence check for admission classification.

## Admission release on error: approaches tried

Because `check_admission` now runs before `load_shape` (it needs to know whether the shape exists, which `resolve_existing_shape` determines up-front), any exception in `load_shape` or `serve_shape_response` must still release the permit that was acquired earlier.
We iterated through three designs to get this right:

### Attempt 1: `register_before_send` callback on the conn

`check_admission` attached a `register_before_send` callback that released the permit before `send_resp`. This fails for two reasons:

1. **Streaming**: `before_send` fires when `send_chunked` *starts* streaming, not when it finishes. The permit would be returned while chunks were still in flight, and if the ErrorHandler also ended the telemetry span via `before_send`, we'd lose `streaming_bytes_sent` and `duration` metrics.
2. **Error path**: `Plug.ErrorHandler` captures the *original* conn passed into `call/2`, not the accumulated conn. So the before_send callback attached by `check_admission` was simply absent from the conn seen by `handle_errors`, meaning exceptions in `load_shape` leaked the permit entirely.

### Attempt 2: outer/inner `Plug.Builder` split (commit 6971b8a, now reverted)

Following José Valim's suggestion in plug#486: split into an outer pipeline (`ServeShapePlug`) and an inner pipeline (`ServeShapePlug.Inner`). The inner module's `Plug.ErrorHandler` sees the conn with full outer state, including `conn.private[:admission_kind]` and the before_send callback. Exceptions in `load_shape` now released the permit correctly.

Costs of this approach:

- Two modules, with five `@doc false` public callbacks so the inner module can delegate back.
- Custom `halt/1` override (we can't use before_send to end the telemetry span because of the streaming problem in Attempt 1, so `halt` had to end the span explicitly).
- Four-branch `admission_kind/1` fallback chain, including an offset-based last resort for the outer ErrorHandler where neither `check_admission` nor `resolve_existing_shape` had run.
- Duplicated `@impl Plug.ErrorHandler` in both modules.
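The error-path failure in Attempt 1 comes down to Elixir's immutable bindings: a handler wrapped around a pipeline can only see the value it captured before the pipeline ran, not the one the plugs built up; Attempt 2 sidestepped this by moving the handler into an inner module that receives the already-accumulated conn. A plain-Elixir sketch, with a map standing in for `Plug.Conn` and a `:before_send` list standing in for registered callbacks (both hypothetical, not Plug's internals):

```elixir
defmodule BindingSketch do
  # Stand-in for check_admission: registers a release callback on the conn.
  def check_admission(conn), do: Map.update!(conn, :before_send, &[:release_permit | &1])

  # Stand-in for a later plug (e.g. load_shape) that fails.
  def load_shape(_conn), do: raise("boom")

  # Roughly what an error handler around the whole pipeline can do:
  # the rescue clause only has the original `conn` binding in scope.
  def call(conn) do
    try do
      conn |> check_admission() |> load_shape()
    rescue
      _error -> conn.before_send
    end
  end
end

BindingSketch.call(%{before_send: []})
# => [] (the :release_permit callback registered by check_admission is not visible)
```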
### Attempt 3: `try/catch` + `try/after` in `call/2` (current, credit to @alco)

[@alco](https://github.com/alco) [proposed](be1008b) collapsing back to a single module by overriding `call/2` directly:

```elixir
def call(conn, opts) do
  conn =
    conn
    |> assign(:config, opts)
    |> fetch_query_params()
    |> start_telemetry_span()

  try do
    conn =
      try do
        super(conn, opts)
      catch
        kind, reason -> handle_caught(conn, kind, reason, __STACKTRACE__)
      end

    emit_shape_telemetry(conn)
    conn
  after
    OpentelemetryTelemetry.end_telemetry_span(OpenTelemetry, %{})
    release_admission_permit()
  end
end
```

`check_admission` stashes the acquired permit in the process dictionary; the `after` clause releases it unconditionally on every exit path (success, halt, uncaught exception). The OTEL span is popped in the same `after` block.

**Why this is strictly better than Attempt 2:**

- Fewer moving parts: single module, no cross-module callbacks, no `halt/1` override, and `admission_kind/1` collapses to two branches (`:shape_exists` true/false).
- Streaming is correct by construction: `super(conn, opts)` only returns once `Api.Response.send_stream/2` has synchronously drained the body, so `emit_shape_telemetry` captures the accurate `streaming_bytes_sent` and `duration` *before* the span is popped.
- Every defensive fallback we'd accumulated in Attempt 2 existed because `Plug.ErrorHandler` sees the wrong conn. Attempt 3 removes the root cause, so the fallbacks can go with it.

Net −56 lines versus Attempt 2, and strictly more correct on the streaming path.
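The permit bookkeeping behind Attempt 3 can be sketched without Plug: `check_admission` stashes the acquired permit in the process dictionary, and an `after` clause releases it on every exit path. The `:counters`-based pool, the `:admission_permit` key and the function names are assumptions for illustration, not Electric's actual implementation.

```elixir
defmodule PermitSketch do
  @key :admission_permit

  # Hypothetical permit pool: a single counter of permits currently held.
  def new_pool, do: :counters.new(1, [])
  def held(pool), do: :counters.get(pool, 1)

  # Stand-in for check_admission: acquire, then remember what to release.
  def check_admission(pool, kind) do
    :counters.add(pool, 1, 1)
    Process.put(@key, {pool, kind})
    :ok
  end

  # Releases whatever was stashed; a no-op if check_admission never ran.
  def release_admission_permit do
    case Process.delete(@key) do
      nil -> :ok
      {pool, _kind} -> :counters.sub(pool, 1, 1)
    end
  end

  # Stand-in for call/2: `pipeline` runs the plugs and may raise at any point.
  def call(pool, pipeline) do
    try do
      pipeline.(pool)
    after
      release_admission_permit()
    end
  end
end
```

Whether the pipeline returns normally, raises after `check_admission`, or raises before it ever runs, `held(pool)` ends at 0: `Process.delete/1` returns `nil` when nothing was stashed, so the "no permit acquired" path needs no separate guard.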
## Test plan

- [x] New test: request for non-existing shape with `offset=-1` is classified as `:initial` (rejected when slots full)
- [x] New test: request for existing shape with `offset=-1` is classified as `:existing` (succeeds when initial slots full)
- [x] New test: request with dead handle is classified as `:initial` (rejected when initial slots full)
- [x] New test: permit released when `load_shape` raises `RuntimeError`
- [x] New test: permit released when `load_shape` raises `DBConnection.ConnectionError`
- [x] New test: correct `:existing` permit released when shape exists and offset is `-1`
- [x] New test: correct `:initial` permit released when shape does not exist
- [x] New test: release not called when exception occurs before `check_admission` runs
- [x] Full router test suite passes (87 tests)
- [x] Full serve_shape_plug test suite passes (40 tests)
- [x] Full admission-control test suite passes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Oleksii Sholik <oleksii@sholik.dev>

msfstef added the claude label Apr 15, 2026

msfstef requested review from alco and icehaunter April 15, 2026 16:54

icehaunter approved these changes Apr 15, 2026


test: use Repatch.spy to assert release is not called for nil stack_id

2b72da5


msfstef mentioned this pull request Apr 15, 2026

fix: classify admission control requests by actual shape existence #4131

Merged


msfstef merged commit 0fd30fc into main Apr 16, 2026
50 of 51 checks passed

msfstef deleted the msfstef/fix-admission-control-electric-boogaloo branch April 16, 2026 08:12

msfstef merged 3 commits into main from msfstef/fix-admission-control-electric-boogaloo
