Multiple static Retry-After values throughout the codebase #4295

@alco

Description

Problem

Retry-After is currently calculated in several separate places in packages/sync-service, mostly as fixed values. This makes retry behavior inconsistent and prevents the server from adapting during overload, redeploy, or stack transition events.

This is especially risky because the TypeScript client treats the server's Retry-After as a floor and combines it with client-side jitter. A fixed server value such as 10 seconds can dominate the client's early backoff attempts and cause many clients to retry at roughly the same time.

Current call sites

  • lib/electric/plug/serve_shape_plug.ex
    • DB connection errors set retry-after: 10 directly.
    • admission-control overload calls a private calculate_retry_after/2, which currently returns a random value in the 6..10 second range.
  • lib/electric/shapes/api.ex
    • shape creation failure: retry_after: 1
    • stack not ready after timeout: retry_after: 5
    • snapshot creation errors: retry_after: 10
    • long-poll timeout while stack is not up/read-only: retry_after: 10
  • lib/electric/shapes/api/params.ex
    • DB unavailable while validating shape params: retry_after: 1
  • lib/electric/shapes/api/response.ex
    • only serializes the value into the retry-after header and exposes it via CORS; this should likely remain a dumb response-layer concern.

AdmissionControl already owns the core overload decision, per-stack/per-kind counters, and reject telemetry, but it currently returns only {:error, :overloaded}. The caller then invents the retry delay locally.

Proposal

Centralize Retry-After calculation in a single policy rooted in Electric.AdmissionControl.

For admission overloads, try_acquire/3 should either return retry metadata, e.g. {:error, {:overloaded, retry_after}}, or the module should expose a central helper that callers use immediately after rejection.
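
A minimal sketch of both options at a call site; `Electric.AdmissionControl.retry_after/2` is the proposed central helper (not an existing function) and the surrounding names are illustrative:

```elixir
defmodule AdmissionCallSiteSketch do
  import Plug.Conn

  # Option A: the rejection tuple already carries the computed delay.
  def handle_result({:error, {:overloaded, retry_after}}, conn),
    do: reject(conn, retry_after)

  # Option B: the caller asks the central helper right after rejection.
  def handle_result({:error, :overloaded}, conn) do
    retry_after =
      Electric.AdmissionControl.retry_after(conn.assigns.stack_id, kind: :initial)

    reject(conn, retry_after)
  end

  def handle_result(:ok, conn), do: conn

  defp reject(conn, retry_after) do
    conn
    |> put_resp_header("retry-after", Integer.to_string(retry_after))
    |> send_resp(503, "Service temporarily overloaded")
  end
end
```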

The retry policy should use exponential backoff with jitter, keyed by overload context such as the following (a delay-calculation sketch appears after this list):

  • stack_id
  • request kind: :initial or :existing
  • reason: admission overload, stack not ready, DB unavailable, snapshot pressure, etc.
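
A rough sketch of the delay calculation only, assuming full jitter over an exponentially growing ceiling; the module name, the 1 s base, and the 120 s cap are placeholders rather than decided values:

```elixir
defmodule RetryPolicySketch do
  @moduledoc """
  Sketch of the per-context delay calculation. The real policy would live
  under Electric.AdmissionControl and keep its streak state keyed by
  {stack_id, kind, reason}.
  """

  @base_ms 1_000
  @cap_ms 120_000

  # Exponential growth on the overload streak, with full jitter, capped.
  def retry_after_ms(streak) when is_integer(streak) and streak >= 0 do
    ceiling = min(@cap_ms, @base_ms * Integer.pow(2, streak))
    # Full jitter: pick uniformly between the base and the current ceiling.
    @base_ms + :rand.uniform(max(ceiling - @base_ms, 1))
  end
end
```

The millisecond result would be rounded up to whole seconds before being written into the Retry-After header.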

Suggested behavior (a response/telemetry sketch follows this list):

  • use jitter for every transient 503
  • increase delays under sustained rejection/overload
  • reset or decay the backoff when capacity recovers
  • cap the delay in the initial implementation, e.g. at 120s
  • keep Cache-Control: no-store and Surrogate-Control: no-store on transient overload responses
  • emit telemetry including retry_after, reason, kind, current count, limit, and rejection/backoff state
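
A sketch of the response and telemetry side under the behavior above; the telemetry event name and metadata keys are assumptions, not the existing ones:

```elixir
defmodule OverloadResponseSketch do
  import Plug.Conn

  # Sketch only: header and telemetry shape for a transient overload 503.
  def send_overloaded(conn, %{retry_after: retry_after} = ctx) do
    :telemetry.execute(
      [:electric, :admission_control, :reject],
      %{retry_after: retry_after, count: ctx.count},
      %{stack_id: ctx.stack_id, reason: ctx.reason, kind: ctx.kind, limit: ctx.limit}
    )

    conn
    |> put_resp_header("retry-after", Integer.to_string(retry_after))
    |> put_resp_header("cache-control", "no-store")
    |> put_resp_header("surrogate-control", "no-store")
    |> send_resp(503, "")
  end
end
```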

Other transient 503 paths should call the same policy instead of hard-coding 1, 5, or 10.
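
For illustration, one such path could change roughly like this; `transient_503/2`, `request_kind`, and the `:snapshot_error` reason are stand-ins rather than existing code:

```elixir
# Before (paraphrased): the call site picks its own constant.
transient_503(request, retry_after: 10)

# After: the shared policy decides, keyed by stack, kind, and reason.
retry_after =
  Electric.AdmissionControl.retry_after(stack_id,
    kind: request_kind,
    reason: :snapshot_error
  )

transient_503(request, retry_after: retry_after)
```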

Acceptance criteria

  • No hard-coded sync-service transient Retry-After values remain in request handling paths.
  • Admission-control rejection responses get Retry-After from the central policy.
  • The policy has unit coverage for exponential growth, jitter bounds, cap behavior, and reset/decay (a test sketch follows this list).
  • Router/API tests assert policy behavior rather than specific static values.
  • Docs that mention fixed Retry-After values are updated if the public behavior changes.
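
As a starting point, the unit coverage could look like this ExUnit sketch against `RetryPolicySketch` above; the bounds assume the 1 s base and 120 s cap used there:

```elixir
defmodule RetryPolicySketchTest do
  use ExUnit.Case, async: true

  # Checks jitter bounds and the cap against the sketch above; real tests
  # would target the actual policy module and also cover reset/decay.
  test "delay stays within the jitter bounds and never exceeds the cap" do
    for streak <- 0..20, _run <- 1..100 do
      delay = RetryPolicySketch.retry_after_ms(streak)
      assert delay >= 1_000
      assert delay <= 120_000
    end
  end
end
```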

Open questions

  • Should the exponential “attempt” be server-side overload streak state, client-provided attempt metadata, or a hybrid?
  • Should DB unavailable, stack not ready, and snapshot pressure share one policy with different baselines, or use separate reason-specific profiles?
