fix(sync-service): restore admission control release in error handlers#4130
fix(sync-service): restore admission control release in error handlers#4130
Conversation
#4101 removed ensure_admission_control_release calls from handle_errors, assuming register_before_send callbacks would fire. They don't — Plug.ErrorHandler uses the original conn (before plugs ran) in its catch clause, so the callbacks registered by check_admission are lost and permits leak on every error. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Claude Code Review\n\n### Summary\n\nIteration 3 is a single commit (
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #4130 +/- ##
=======================================
Coverage 89.20% 89.20%
=======================================
Files 25 25
Lines 2520 2520
Branches 633 641 +8
=======================================
Hits 2248 2248
Misses 270 270
Partials 2 2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
- Guard ensure_admission_control_release against nil stack_id to prevent spurious ETS entries when config is unavailable on the conn - Extract admission_kind/1 helper to DRY up the offset check between check_admission and ensure_admission_control_release - Extract build_plug_opts/1 and call_plug_expecting_crash/1 test helpers to reduce duplication across admission control tests - Add test verifying nil stack_id doesn't pollute ETS Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
✅ Deploy Preview for electric-next ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
Replace direct ETS table lookup with refute Repatch.called? to avoid coupling the plug test to AdmissionControl's internal table name. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Opened #4131 as well to supplement this - currently after a redeploy, all requests coming in would be marked as "existing" because they are not -1, but they would be for inexistent shapes so they would get a 409 and trigger a shape creation This effectively means that the existing cap is what would be used for the initial onslaught of requests, which is not great for the semantics of the admission control. |
…4131) ## Summary > Stacked on #4130 Admission control currently classifies requests as `:initial` (strict limit, default 300) or `:existing` (generous limit, default 10,000) based solely on whether `offset == "-1"`. This is a proxy signal that breaks down in two critical scenarios: ### Problem 1: Restarts and redeploys — stale handles bypass initial limits After a restart or redeploy with fresh storage, **all connected clients reconnect with handles pointing to shapes that no longer exist**. These requests carry a handle and a non-`-1` offset, so they are classified as `:existing` — even though the 409 redirect flow will create a brand new shape for each one. Under load, this means hundreds or thousands of shape creations happening under the permissive `:existing` bucket, which is exactly the overload scenario admission control exists to prevent. ### Problem 2: Shared shapes — initial requests penalised unnecessarily When many clients subscribe to the same shape (e.g. a shared table), the first client creates it. Every subsequent client sends `offset=-1` (no handle yet) and gets classified as `:initial` — counted against the strict 300 limit — despite the shape already existing and the request being a cheap lookup. Under a burst of new subscribers this can trigger 503s for requests that would have been trivially served. ## Approach Add a `resolve_existing_shape` plug between `validate_request` and `check_admission` that checks whether the shape actually exists, then classify based on reality rather than the offset string: | Case | Check | Cost | Classification | |---|---|---|---| | No handle, shape exists | `fetch_handle_by_shape` (ETS + SQLite) | **Replaces** same lookup in `load_shape` — net zero | `:existing` | | No handle, shape doesn't exist | `fetch_handle_by_shape` → `:error` | One read on the creation path (already expensive) | `:initial` | | Handle present, shape alive | `fetch_handle_by_shape` finds shape | Same lookup, classifies accurately | `:existing` | | Handle present, shape dead | `fetch_handle_by_shape` → `:error` | One read, correctly gates creation | `:initial` | ### Why the cost is worth it - **No-handle case saves work**: `fetch_handle_by_shape` in `resolve_existing_shape` replaces the identical call inside `ShapeCache.get_or_create_shape_handle` that `load_shape` would do. The handle is set on the request params, so `load_shape` takes the cheap `resolve_shape_handle` path instead. - **Handle-present case classifies accurately**: The same `fetch_handle_by_shape` call checks shape existence by definition. The alternative — not checking — means a restart/redeploy sends all reconnecting clients through the `:existing` bucket while they all trigger shape creation. - **Handle is kept for dead shapes**: We don't clear the handle when the shape is dead — `load_shape` still needs it for the 409 redirect flow. We only use the existence check for admission classification. ## Admission release on error — approaches tried Because `check_admission` now runs before `load_shape` (it needs to know whether the shape exists, which `resolve_existing_shape` determines up-front), any exception in `load_shape` or `serve_shape_response` must still release the permit that was acquired earlier. We iterated through three designs to get this right: ### Attempt 1 — `register_before_send` callback on the conn `check_admission` attached a `register_before_send` callback that released the permit before `send_resp`. This fails for two reasons: 1. **Streaming**: `before_send` fires when `send_chunked` *starts* streaming, not when it finishes. The permit would be returned while chunks were still in flight, and if the ErrorHandler also ended the telemetry span via `before_send`, we'd lose `streaming_bytes_sent` and `duration` metrics. 2. **Error path**: `Plug.ErrorHandler` captures the *original* conn passed into `call/2`, not the accumulated conn. So the before_send callback attached by `check_admission` was simply absent from the conn seen by `handle_errors`, meaning exceptions in `load_shape` leaked the permit entirely. ### Attempt 2 — outer/inner `Plug.Builder` split (commit 6971b8a, now reverted) Following José Valim's suggestion in plug#486: split into an outer pipeline (`ServeShapePlug`) and inner pipeline (`ServeShapePlug.Inner`). The inner module's `Plug.ErrorHandler` sees the conn with full outer state — including `conn.private[:admission_kind]` and the before_send callback. Exceptions in `load_shape` now released correctly. Costs of this approach: - Two modules, with five `@doc false` public callbacks so the inner module can delegate back. - Custom `halt/1` override (can't use before_send for telemetry span end because of the streaming problem in Attempt 1 — so `halt` had to end the span explicitly). - Four-branch `admission_kind/1` fallback chain including an offset-based last-resort for the outer ErrorHandler where neither `check_admission` nor `resolve_existing_shape` had run. - Duplicated `@impl Plug.ErrorHandler` in both modules. ### Attempt 3 — `try/catch` + `try/after` in `call/2` (current, credit to @alco) [@alco](https://github.com/alco) [proposed](be1008b) collapsing back to a single module by overriding `call/2` directly: ```elixir def call(conn, opts) do conn = conn |> assign(:config, opts) |> fetch_query_params() |> start_telemetry_span() try do conn = try do super(conn, opts) catch kind, reason -> handle_caught(conn, kind, reason, __STACKTRACE__) end emit_shape_telemetry(conn) conn after OpentelemetryTelemetry.end_telemetry_span(OpenTelemetry, %{}) release_admission_permit() end end ``` `check_admission` stashes the acquired permit in the process dictionary; the `after` clause releases it unconditionally on every exit path (success, halt, uncaught exception). The OTEL span is popped in the same `after` block. **Why this is strictly better than Attempt 2:** - Fewer moving parts: single module, no cross-module callbacks, no `halt/1` override, `admission_kind/1` collapses to two branches (`:shape_exists` true/false). - Streaming is correct by construction: `super(conn, opts)` only returns once `Api.Response.send_stream/2` has synchronously drained the body, so `emit_shape_telemetry` captures the accurate `streaming_bytes_sent` and `duration` *before* the span is popped. - Every defensive fallback we'd accumulated in Attempt 2 existed because `Plug.ErrorHandler` sees the wrong conn — Attempt 3 removes the root cause, so the fallbacks can go with it. Net −56 lines versus Attempt 2, and strictly more correct on the streaming path. ## Test plan - [x] New test: request for non-existing shape with `offset=-1` is classified as `:initial` (rejected when slots full) - [x] New test: request for existing shape with `offset=-1` is classified as `:existing` (succeeds when initial slots full) - [x] New test: request with dead handle is classified as `:initial` (rejected when initial slots full) - [x] New test: permit released when `load_shape` raises `RuntimeError` - [x] New test: permit released when `load_shape` raises `DBConnection.ConnectionError` - [x] New test: correct `:existing` permit released when shape exists and offset is `-1` - [x] New test: correct `:initial` permit released when shape does not exist - [x] New test: release not called when exception occurs before `check_admission` runs - [x] Full router test suite passes (87 tests) - [x] Full serve_shape_plug test suite passes (40 tests) - [x] Full admission-control test suite passes 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Oleksii Sholik <oleksii@sholik.dev>
Summary
ensure_admission_control_releasecalls fromhandle_errors, assumingregister_before_sendcallbacks would handle it — butPlug.ErrorHandlerpasses the original conn (before plugs ran) tohandle_errors, so callbacks registered bycheck_admissionare lost and permits leak on every error.ensure_admission_control_releasecalls in bothhandle_errorsclauses with a nil guard for safety and comments explaining whyregister_before_senddoesn't work here.admission_kind/1helper to share offset logic betweencheck_admissionandensure_admission_control_release.RuntimeError,DBConnection.ConnectionError, and nilstack_iderror paths.Test plan
RuntimeError— fails without fix, passes with fixDBConnection.ConnectionError— covers the firsthandle_errorsclausestack_idguard — verifies no spurious ETS entry when config is unavailableserve_shape_plug_test.exssuite passes (38 tests)🤖 Generated with Claude Code