Circuit breaker for outbound activity delivery #620

@dahlia

Summary

Add a circuit breaker mechanism to Fedify's activity delivery pipeline. When a remote server repeatedly fails to receive activities, Fedify should stop hammering it with retries and instead hold outbound activities until the server shows signs of recovery.

Problem

Fedify's current retry logic treats every delivery failure the same way: it retries with exponential backoff until the retry limit is exhausted, regardless of whether the remote server is temporarily slow or has been unreachable for weeks. This has a few consequences:

  • Worker queues accumulate a long tail of hopeless retry attempts to dead instances, consuming resources and obscuring genuinely actionable failures.
  • A server with many followers on a single struggling instance generates a disproportionate share of failed deliveries and retries.
  • There is no way to distinguish “this delivery failed once” from “this server has not responded in two weeks.”

Proposed solution

Circuit breaker states

The circuit breaker for each remote host transitions between three states:

  • Closed — normal operation; deliveries proceed as usual.
  • Open — the host is considered unreachable; deliveries are held rather than attempted.
  • Half-open — a recovery probe is sent; if it succeeds, the circuit closes, otherwise it reopens.

Default transition conditions

| Transition | Condition |
| --- | --- |
| Closed → Open | 5 consecutive failures within a 10-minute window |
| Open → Half-open | 30 minutes have elapsed since the circuit opened |
| Half-open → Closed | Probe delivery succeeds |
| Half-open → Open | Probe delivery fails |

These defaults are intentionally conservative. A short-lived outage should not trip the circuit.
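
The transitions above could be modeled as a small pure state machine. The following is only a sketch under stated assumptions, not Fedify's implementation: the names `Breaker`, `recordResult`, and `tick` are hypothetical, timings are plain milliseconds rather than `Temporal.Duration`, and "5 consecutive failures within a 10-minute window" is read as an unbroken failure streak whose timestamps all fall inside the window.

```typescript
// Hypothetical names throughout; this is an illustration, not Fedify's API.
type CircuitState = "closed" | "open" | "half-open";

interface BreakerOptions {
  failureThreshold: number; // failures needed to trip the circuit
  failureWindowMs: number; // window the failure streak must fit in
  recoveryDelayMs: number; // time spent open before a probe is allowed
}

interface Breaker {
  state: CircuitState;
  failures: number[]; // timestamps (ms) of the current failure streak
  openedAt: number | null; // when the circuit last opened
}

const defaults: BreakerOptions = {
  failureThreshold: 5,
  failureWindowMs: 10 * 60_000,
  recoveryDelayMs: 30 * 60_000,
};

function newBreaker(): Breaker {
  return { state: "closed", failures: [], openedAt: null };
}

// Record the outcome of one delivery attempt (or probe) at time `now`.
function recordResult(
  b: Breaker,
  ok: boolean,
  now: number,
  o: BreakerOptions = defaults,
): Breaker {
  // Any success closes the circuit and ends the failure streak.
  if (ok) return newBreaker();
  // A failed probe reopens the circuit and restarts the recovery delay.
  if (b.state === "half-open") {
    return { state: "open", failures: [], openedAt: now };
  }
  // Nothing is attempted while open, so there is nothing to record.
  if (b.state === "open") return b;
  // Closed: extend the streak, keeping only failures inside the window.
  const failures = [...b.failures, now].filter(
    (t) => now - t <= o.failureWindowMs,
  );
  if (failures.length >= o.failureThreshold) {
    return { state: "open", failures: [], openedAt: now };
  }
  return { ...b, failures };
}

// Promote an open circuit to half-open once the recovery delay has elapsed.
function tick(b: Breaker, now: number, o: BreakerOptions = defaults): Breaker {
  if (
    b.state === "open" && b.openedAt !== null &&
    now - b.openedAt >= o.recoveryDelayMs
  ) {
    return { ...b, state: "half-open" };
  }
  return b;
}
```

Keeping the transitions pure (state in, state out) makes them easy to unit-test and leaves storage and scheduling to the caller.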

Configuration

All parameters are overridable via createFederation():

const federation = createFederation<void>({
  kv: ...,
  queue: ...,
  circuitBreaker: {
    failureThreshold: 5,
    failureWindow: Temporal.Duration.from({ minutes: 10 }),
    recoveryDelay: Temporal.Duration.from({ minutes: 30 }),
  },
});

Passing circuitBreaker: false disables the feature entirely for users who prefer to manage retry behavior themselves.

Activity handling when the circuit is open

Activities destined for an open-circuit host are not discarded. Instead, they are requeued with a deferred delivery time corresponding to the next half-open probe window. If the circuit remains open past a configurable TTL (default: 7 days), held activities are dropped and the permanent failure handler is invoked, consistent with how exhausted retries are handled today.
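
The hold-or-drop rule described above can be sketched as a single decision function. This is a rough illustration under assumptions the proposal does not settle: the function name `decideHeldActivity` and its parameters are hypothetical, times are epoch milliseconds, and the TTL is measured from the moment the circuit first opened in the current outage.

```typescript
// Hypothetical helper; not part of Fedify's API.
type HoldDecision =
  | { action: "defer"; deliverAt: number } // requeue for the next probe window
  | { action: "drop" }; // TTL exceeded: hand off to the permanent failure handler

const RECOVERY_DELAY_MS = 30 * 60_000; // default: 30 minutes
const HELD_TTL_MS = 7 * 24 * 60 * 60_000; // default: 7 days

function decideHeldActivity(
  firstOpenedAt: number, // when the current outage began (assumed TTL origin)
  lastOpenedAt: number, // when the circuit most recently (re)opened
  now: number,
  recoveryDelayMs: number = RECOVERY_DELAY_MS,
  ttlMs: number = HELD_TTL_MS,
): HoldDecision {
  // The circuit has been open too long: give up on held activities.
  if (now - firstOpenedAt >= ttlMs) return { action: "drop" };
  // Otherwise defer until the next half-open probe window, never into the past.
  const nextProbeAt = lastOpenedAt + recoveryDelayMs;
  return { action: "defer", deliverAt: Math.max(nextProbeAt, now) };
}
```

A caller would requeue the activity with the returned `deliverAt` on `"defer"`, and invoke the permanent failure handler on `"drop"`.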

State persistence

Circuit breaker state is stored per remote host in the KvStore. This has two benefits:

  • State survives process restarts. Without this, a server restart would reset all open circuits, immediately resuming delivery attempts against hosts that were already known to be unreachable.
  • In deployments with multiple worker nodes, all nodes share the same circuit state, preventing one node from opening a circuit while another continues attempting delivery.

The KV key structure follows the existing Fedify conventions, e.g. ["_fedify", "circuit", "mastodon.social"].
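
As a sketch of how per-host state might be keyed and shared, the snippet below derives the key from an inbox URL and round-trips a record through a store. The record shape (`CircuitRecord`), the `circuitKey` helper, and the synchronous `MemoryStore` are all assumptions made for illustration; the real KvStore is asynchronous, and the local `KvKey` alias here is only a stand-in for its key type.

```typescript
// Local stand-in types; names and shapes are assumptions, not Fedify's API.
type KvKey = readonly string[];

interface CircuitRecord {
  state: "closed" | "open" | "half-open";
  failureCount: number;
  openedAt: string | null; // ISO 8601 timestamp, or null while closed
}

// Build the per-host key from a recipient inbox URL.
function circuitKey(inbox: string): KvKey {
  return ["_fedify", "circuit", new URL(inbox).host];
}

// Synchronous in-memory stand-in for a shared key-value store.
class MemoryStore {
  private data = new Map<string, string>();

  get(key: KvKey): CircuitRecord | undefined {
    const raw = this.data.get(JSON.stringify(key));
    return raw === undefined ? undefined : (JSON.parse(raw) as CircuitRecord);
  }

  set(key: KvKey, value: CircuitRecord): void {
    this.data.set(JSON.stringify(key), JSON.stringify(value));
  }
}

// Two workers sharing one store observe the same circuit state.
const sharedStore = new MemoryStore();
const key = circuitKey("https://mastodon.social/inbox");
sharedStore.set(key, {
  state: "open",
  failureCount: 5,
  openedAt: "2025-01-01T00:00:00Z",
});
const seenByOtherWorker = sharedStore.get(
  circuitKey("https://mastodon.social/users/alice/inbox"),
);
```

Keying on the host rather than the full inbox URL means every inbox on the same server shares one circuit, which matches the per-host design described above.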

Observability

Once OpenTelemetry metrics support (tracked in #619) is in place, circuit breaker state transitions will emit span events on the active outbox span:

  • activitypub.circuit_breaker.open — with attributes activitypub.remote.host and activitypub.circuit_breaker.failure_count
  • activitypub.circuit_breaker.half_open — with attribute activitypub.remote.host
  • activitypub.circuit_breaker.closed — with attributes activitypub.remote.host and activitypub.circuit_breaker.recovery_duration_ms

A counter metric activitypub.circuit_breaker.state_change with an activitypub.circuit_breaker.state attribute (open, half_open, closed) will also be recorded, making it straightforward to alert on a sudden spike in circuit openings.
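
To make the event names and attributes listed above concrete, the sketch below maps a transition to an event name and attribute set. The `Transition` and `SpanEvent` types and the `toSpanEvent` function are hypothetical; only the event and attribute names come from this proposal.

```typescript
// Hypothetical types; only the event and attribute names come from the proposal.
type Transition =
  | { to: "open"; host: string; failureCount: number }
  | { to: "half_open"; host: string }
  | { to: "closed"; host: string; recoveryDurationMs: number };

interface SpanEvent {
  name: string;
  attributes: Record<string, string | number>;
}

function toSpanEvent(t: Transition): SpanEvent {
  // Every event carries the remote host.
  const attributes: Record<string, string | number> = {
    "activitypub.remote.host": t.host,
  };
  if (t.to === "open") {
    attributes["activitypub.circuit_breaker.failure_count"] = t.failureCount;
  }
  if (t.to === "closed") {
    attributes["activitypub.circuit_breaker.recovery_duration_ms"] =
      t.recoveryDurationMs;
  }
  return { name: `activitypub.circuit_breaker.${t.to}`, attributes };
}
```

With the OpenTelemetry API, the result could then be passed to `span.addEvent(event.name, event.attributes)` on the active span.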

onStateChange callback

For users not using OpenTelemetry, a callback hook provides an integration point for custom logging or alerting:

const federation = createFederation<void>({
  kv: ...,
  queue: ...,
  circuitBreaker: {
    onStateChange(remoteHost, previousState, newState) {
      logger.warn(
        `Circuit breaker for ${remoteHost}: ${previousState} → ${newState}`
      );
    },
  },
});

Dependency

This feature depends on #619 (OpenTelemetry metrics and span events) for the observability layer, though the core circuit breaker logic can be implemented independently.

Scope

Changes are limited to @fedify/fedify. The KvStore interface requires no modification; the circuit breaker uses the existing API.
