-
-
Notifications
You must be signed in to change notification settings - Fork 96
Description
Summary
Add a circuit breaker mechanism to Fedify's activity delivery pipeline. When a remote server repeatedly fails to receive activities, Fedify should stop hammering it with retries and instead hold outbound activities until the server shows signs of recovery.
Problem
Fedify's current retry logic treats every delivery failure the same way: it retries with exponential backoff until the retry limit is exhausted, regardless of whether the remote server is temporarily slow or has been unreachable for weeks. This has a few consequences:
- Worker queues accumulate a long tail of hopeless retry attempts to dead instances, consuming resources and obscuring genuinely actionable failures.
- A server with many followers on a struggling instance generates a disproportionate amount of noise.
- There is no way to distinguish “this delivery failed once” from “this server has not responded in two weeks.”
Proposed solution
Circuit breaker states
The circuit breaker for each remote host transitions between three states:
- Closed — normal operation; deliveries proceed as usual.
- Open — the host is considered unreachable; deliveries are held rather than attempted.
- Half-open — a recovery probe is sent; if it succeeds, the circuit closes, otherwise it reopens.
Default transition conditions
| Transition | Condition |
|---|---|
| Closed → Open | 5 consecutive failures within a 10-minute window |
| Open → Half-open | 30 minutes have elapsed since the circuit opened |
| Half-open → Closed | Probe delivery succeeds |
| Half-open → Open | Probe delivery fails |
These defaults are intentionally conservative. A short-lived outage should not trip the circuit.
Configuration
All parameters are overridable via createFederation():
const federation = createFederation<void>({
kv: ...,
queue: ...,
circuitBreaker: {
failureThreshold: 5,
failureWindow: Temporal.Duration.from({ minutes: 10 }),
recoveryDelay: Temporal.Duration.from({ minutes: 30 }),
},
});Passing circuitBreaker: false disables the feature entirely for users who prefer to manage retry behavior themselves.
Activity handling when the circuit is open
Activities destined for an open-circuit host are not discarded. Instead, they are requeued with a deferred delivery time corresponding to the next half-open probe window. If the circuit remains open past a configurable TTL (default: 7 days), held activities are dropped and the permanent failure handler is invoked, consistent with how exhausted retries are handled today.
State persistence
Circuit breaker state is stored per remote host in the KvStore. This has two benefits:
- State survives process restarts. Without this, a server restart would reset all open circuits, immediately resuming delivery attempts against hosts that were already known to be unreachable.
- In deployments with multiple worker nodes, all nodes share the same circuit state, preventing one node from opening a circuit while another continues attempting delivery.
The KV key structure follows the existing Fedify conventions, e.g. ["_fedify", "circuit", "mastodon.social"].
Observability
When the OpenTelemetry metrics support (tracked in #619) is in place, circuit breaker state transitions will emit span events on the active outbox span:
activitypub.circuit_breaker.open— with attributesactivitypub.remote.hostandactivitypub.circuit_breaker.failure_countactivitypub.circuit_breaker.half_open— with attributeactivitypub.remote.hostactivitypub.circuit_breaker.closed— with attributesactivitypub.remote.hostandactivitypub.circuit_breaker.recovery_duration_ms
A counter metric activitypub.circuit_breaker.state_change with a activitypub.circuit_breaker.state attribute (open, half_open, closed) will also be recorded, making it straightforward to alert on a sudden spike in circuit openings.
onCircuitBreakerStateChange callback
For users not using OpenTelemetry, a callback hook provides an integration point for custom logging or alerting:
const federation = createFederation<void>({
kv: ...,
queue: ...,
circuitBreaker: {
onStateChange(remoteHost, previousState, newState) {
logger.warn(
`Circuit breaker for ${remoteHost}: ${previousState} → ${newState}`
);
},
},
});Dependency
This feature depends on #619 (OpenTelemetry metrics and span events) for the observability layer, though the core circuit breaker logic can be implemented independently.
Scope
Changes are limited to @fedify/fedify. The KvStore interface requires no modification; the circuit breaker uses the existing API.