Multiple static Retry-After values throughout the codebase #4295

@alco

Description

Problem

Retry-After is currently calculated in several separate places in packages/sync-service, mostly as fixed values. This makes retry behavior inconsistent and prevents the server from adapting during overload, redeploy, or stack transition events.

This is especially risky because the TypeScript client treats the server's Retry-After as a floor and combines it with client-side jitter. A fixed server value such as 10 seconds can dominate the client's early backoff attempts and cause many clients to retry at roughly the same time.

Current call sites

  • lib/electric/plug/serve_shape_plug.ex
    • DB connection errors set retry-after: 10 directly.
    • admission-control overload calls a private calculate_retry_after/2, which currently returns a random value in the 6..10 second range.
  • lib/electric/shapes/api.ex
    • shape creation failure: retry_after: 1
    • stack not ready after timeout: retry_after: 5
    • snapshot creation errors: retry_after: 10
    • long-poll timeout while stack is not up/read-only: retry_after: 10
  • lib/electric/shapes/api/params.ex
    • DB unavailable while validating shape params: retry_after: 1
  • lib/electric/shapes/api/response.ex
    • only serializes the value into the retry-after header and exposes it via CORS; this should likely remain a dumb response-layer concern.

AdmissionControl already owns the core overload decision, per-stack/per-kind counters, and reject telemetry, but it currently returns only {:error, :overloaded}. The caller then invents the retry delay locally.

Proposal

Centralize Retry-After calculation in a single policy rooted in Electric.AdmissionControl.

For admission overloads, try_acquire/3 should either return retry metadata, e.g. {:error, {:overloaded, retry_after}}, or the module should expose a central helper that callers use immediately after rejection.
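
A minimal sketch of both options at a call site; `Electric.AdmissionControl.retry_after/2` is the proposed central helper (not an existing function) and the surrounding names are illustrative:

```elixir
defmodule AdmissionCallSiteSketch do
  import Plug.Conn

  # Option A: the rejection tuple already carries the computed delay.
  def handle_result({:error, {:overloaded, retry_after}}, conn),
    do: reject(conn, retry_after)

  # Option B: the caller asks the central helper right after rejection.
  def handle_result({:error, :overloaded}, conn) do
    retry_after =
      Electric.AdmissionControl.retry_after(conn.assigns.stack_id, kind: :initial)

    reject(conn, retry_after)
  end

  def handle_result(:ok, conn), do: conn

  defp reject(conn, retry_after) do
    conn
    |> put_resp_header("retry-after", Integer.to_string(retry_after))
    |> send_resp(503, "Service temporarily overloaded")
  end
end
```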

The retry policy should use exponential backoff with jitter, keyed by overload context such as the following (a delay-calculation sketch appears after this list):

  • stack_id
  • request kind: :initial or :existing
  • reason: admission overload, stack not ready, DB unavailable, snapshot pressure, etc.
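
A rough sketch of the delay calculation only, assuming full jitter over an exponentially growing ceiling; the module name, the 1 s base, and the 120 s cap are placeholders rather than decided values:

```elixir
defmodule RetryPolicySketch do
  @moduledoc """
  Sketch of the per-context delay calculation. The real policy would live
  under Electric.AdmissionControl and keep its streak state keyed by
  {stack_id, kind, reason}.
  """

  @base_ms 1_000
  @cap_ms 120_000

  # Exponential growth on the overload streak, with full jitter, capped.
  def retry_after_ms(streak) when is_integer(streak) and streak >= 0 do
    ceiling = min(@cap_ms, @base_ms * Integer.pow(2, streak))
    # Full jitter: pick uniformly between the base and the current ceiling.
    @base_ms + :rand.uniform(max(ceiling - @base_ms, 1))
  end
end
```

The millisecond result would be rounded up to whole seconds before being written into the Retry-After header.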

Suggested behavior (a response/telemetry sketch follows this list):

  • use jitter for every transient 503
  • increase delays under sustained rejection/overload
  • reset or decay the backoff when capacity recovers
  • cap the delay in the initial implementation, e.g. at 120s
  • keep Cache-Control: no-store and Surrogate-Control: no-store on transient overload responses
  • emit telemetry including retry_after, reason, kind, current count, limit, and rejection/backoff state
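
A sketch of the response and telemetry side under the behavior above; the telemetry event name and metadata keys are assumptions, not the existing ones:

```elixir
defmodule OverloadResponseSketch do
  import Plug.Conn

  # Sketch only: header and telemetry shape for a transient overload 503.
  def send_overloaded(conn, %{retry_after: retry_after} = ctx) do
    :telemetry.execute(
      [:electric, :admission_control, :reject],
      %{retry_after: retry_after, count: ctx.count},
      %{stack_id: ctx.stack_id, reason: ctx.reason, kind: ctx.kind, limit: ctx.limit}
    )

    conn
    |> put_resp_header("retry-after", Integer.to_string(retry_after))
    |> put_resp_header("cache-control", "no-store")
    |> put_resp_header("surrogate-control", "no-store")
    |> send_resp(503, "")
  end
end
```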

Other transient 503 paths should call the same policy instead of hard-coding 1, 5, or 10.
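
For illustration, one such path could change roughly like this; `transient_503/2`, `request_kind`, and the `:snapshot_error` reason are stand-ins rather than existing code:

```elixir
# Before (paraphrased): the call site picks its own constant.
transient_503(request, retry_after: 10)

# After: the shared policy decides, keyed by stack, kind, and reason.
retry_after =
  Electric.AdmissionControl.retry_after(stack_id,
    kind: request_kind,
    reason: :snapshot_error
  )

transient_503(request, retry_after: retry_after)
```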

Acceptance criteria

  • No hard-coded sync-service transient Retry-After values remain in request handling paths.
  • Admission-control rejection responses get Retry-After from the central policy.
  • The policy has unit coverage for exponential growth, jitter bounds, cap behavior, and reset/decay (a test sketch follows this list).
  • Router/API tests assert policy behavior rather than specific static values.
  • Docs that mention fixed Retry-After values are updated if the public behavior changes.
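
As a starting point, the unit coverage could look like this ExUnit sketch against `RetryPolicySketch` above; the bounds assume the 1 s base and 120 s cap used there:

```elixir
defmodule RetryPolicySketchTest do
  use ExUnit.Case, async: true

  # Checks jitter bounds and the cap against the sketch above; real tests
  # would target the actual policy module and also cover reset/decay.
  test "delay stays within the jitter bounds and never exceeds the cap" do
    for streak <- 0..20, _run <- 1..100 do
      delay = RetryPolicySketch.retry_after_ms(streak)
      assert delay >= 1_000
      assert delay <= 120_000
    end
  end
end
```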

Open questions

  • Should the exponential “attempt” be server-side overload streak state, client-provided attempt metadata, or a hybrid?
  • Should DB unavailable, stack not ready, and snapshot pressure share one policy with different baselines, or use separate reason-specific profiles?
