Problem
`Retry-After` is currently calculated in several separate places in `packages/sync-service`, mostly as fixed values. This makes retry behavior inconsistent and prevents the server from adapting during overload, redeploy, or stack transition events.
This is especially risky because the TypeScript client treats the server's `Retry-After` as a floor and combines it with client-side jitter. A fixed server value like `10` can dominate early client backoff attempts and cause many clients to retry together.
Current call sites
`lib/electric/plug/serve_shape_plug.ex`
- DB connection errors set `retry-after: 10` directly.
- admission-control overload calls a private `calculate_retry_after/2`, currently random `6..10` seconds.
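For reference, the behavior described amounts to something like the following (a sketch; the actual private helper may differ):

```elixir
# Sketch of the current behavior: a fixed random window, unrelated to
# actual load or rejection history.
defp calculate_retry_after(_conn, _opts), do: Enum.random(6..10)
```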
`lib/electric/shapes/api.ex`
- shape creation failure: `retry_after: 1`
- stack not ready after timeout: `retry_after: 5`
- snapshot creation errors: `retry_after: 10`
- long-poll timeout while the stack is not up/read-only: `retry_after: 10`
`lib/electric/shapes/api/params.ex`
- DB unavailable while validating shape params: `retry_after: 1`
`lib/electric/shapes/api/response.ex`
- only serializes the value into the `retry-after` header and exposes it via CORS; this should likely remain a dumb response-layer concern.
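Keeping it dumb means the response layer only serializes whatever the policy handed down. A minimal sketch using `Plug` (the real module's shape may differ):

```elixir
# Response layer stays policy-free: it only writes the header and
# exposes it to browser clients via CORS.
defp put_retry_after(conn, nil), do: conn

defp put_retry_after(conn, seconds) when is_integer(seconds) do
  conn
  |> Plug.Conn.put_resp_header("retry-after", Integer.to_string(seconds))
  |> Plug.Conn.put_resp_header("access-control-expose-headers", "retry-after")
end
```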
`AdmissionControl` already owns the core overload decision, per-stack/per-kind counters, and reject telemetry, but it currently returns only `{:error, :overloaded}`. The caller then invents the retry delay locally.
Proposal
Centralize `Retry-After` calculation in logic rooted in `Electric.AdmissionControl`.
For admission overloads, `try_acquire/3` should either return retry metadata, e.g. `{:error, {:overloaded, retry_after}}`, or expose a central helper that callers use immediately after rejection.
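A minimal sketch of the first option, assuming the existing arity; `check_capacity/3` and `RetryPolicy` are illustrative names, not existing code:

```elixir
# Hypothetical: the rejection carries its retry hint, so callers never
# invent a delay locally.
def try_acquire(stack_id, kind, opts) do
  case check_capacity(stack_id, kind, opts) do
    :ok ->
      :ok

    {:error, :overloaded} ->
      retry_after = RetryPolicy.retry_after(stack_id, kind, :overloaded)
      {:error, {:overloaded, retry_after}}
  end
end
```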
The retry policy should use exponential backoff with jitter, keyed by overload context such as:
- `stack_id`
- request kind: `:initial` or `:existing`
- reason: admission overload, stack not ready, DB unavailable, snapshot pressure, etc.
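The policy key could then be an explicit tuple over those dimensions (illustrative types only):

```elixir
# Hypothetical context key: one backoff streak per stack, request kind,
# and rejection reason, so unrelated overloads do not inflate each other.
@type reason :: :overloaded | :stack_not_ready | :db_unavailable | :snapshot_pressure
@type kind :: :initial | :existing
@type key :: {stack_id :: String.t(), kind(), reason()}
```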
Suggested behavior:
- use jitter for every transient `503`
- increase delays under sustained rejection/overload
- reset or decay the backoff when capacity recovers
- cap the initial implementation, e.g. at `120s`
- keep `Cache-Control: no-store` and `Surrogate-Control: no-store` on transient overload responses
- emit telemetry including `retry_after`, reason, kind, current count, limit, and rejection/backoff state
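Put together, a minimal sketch of the calculation, assuming a per-key consecutive-rejection counter is tracked elsewhere; the module and names are illustrative:

```elixir
defmodule Electric.AdmissionControl.RetryPolicy do
  @moduledoc """
  Illustrative sketch only: exponential backoff with full jitter,
  capped, keyed by {stack_id, kind, reason}.
  """

  @base_s 1
  @cap_s 120

  @doc "Retry-After for the given overload context; streak lookup is elided."
  def retry_after(stack_id, kind, reason) do
    {stack_id, kind, reason}
    |> current_streak()
    |> backoff()
  end

  @doc "Pure backoff: seconds for the nth consecutive rejection (n >= 0)."
  def backoff(attempt) when is_integer(attempt) and attempt >= 0 do
    # Exponential ceiling, capped: 1, 2, 4, 8, ... up to 120s.
    ceiling = min(@cap_s, @base_s * Integer.pow(2, attempt))

    # Full jitter: pick uniformly in 1..ceiling so rejected clients
    # spread out instead of retrying in lockstep.
    :rand.uniform(ceiling)
  end

  # Consecutive-rejection count per key; the storage (ETS, the
  # AdmissionControl process state, ...) is out of scope for this sketch.
  defp current_streak(_key), do: 0
end
```

Reset/decay would then just zero or decrement the stored streak counter whenever acquisitions for that key start succeeding again.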
Other transient `503` paths should call the same policy instead of hard-coding `1`, `5`, or `10`.
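For example, the stack-not-ready path would swap its literal for a policy call; `build_503/1` here stands in for whatever error-response constructor the call site already uses:

```elixir
# Hypothetical call site: previously `retry_after: 5`, now policy-driven.
defp stack_not_ready_response(stack_id, kind) do
  retry_after =
    Electric.AdmissionControl.RetryPolicy.retry_after(stack_id, kind, :stack_not_ready)

  build_503(retry_after: retry_after)
end
```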
Acceptance criteria
- No hard-coded transient `Retry-After` values remain in sync-service request handling paths.
- Admission-control rejection responses get `Retry-After` from the central policy.
- The policy has unit coverage for exponential growth, jitter bounds, cap behavior, and reset/decay (see the test sketch after this list).
- Router/API tests assert policy behavior rather than specific static values.
- Docs that mention fixed `Retry-After` values are updated if the public behavior changes.
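As a starting point for that coverage, an ExUnit sketch against the hypothetical `backoff/1` from the policy sketch above (reset/decay tests would need the streak storage, which the sketch elides):

```elixir
defmodule Electric.AdmissionControl.RetryPolicyTest do
  use ExUnit.Case, async: true

  alias Electric.AdmissionControl.RetryPolicy

  test "backoff stays within the jittered ceiling and the 120s cap" do
    for attempt <- 0..20, _run <- 1..100 do
      value = RetryPolicy.backoff(attempt)
      assert value >= 1
      assert value <= min(120, Integer.pow(2, attempt))
    end
  end

  test "the ceiling grows exponentially until the cap" do
    # Only the upper bound is deterministic under full jitter, so we
    # sample and check the observed maximum never exceeds the ceiling.
    max_of = fn attempt ->
      1..500 |> Enum.map(fn _ -> RetryPolicy.backoff(attempt) end) |> Enum.max()
    end

    assert max_of.(0) == 1
    assert max_of.(3) <= 8
    assert max_of.(20) <= 120
  end
end
```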
Open questions
- Should the exponential “attempt” be server-side overload streak state, client-provided attempt metadata, or a hybrid?
- Should DB unavailable, stack not ready, and snapshot pressure share one policy with different baselines, or use separate reason-specific profiles?