46 commits
3f9a49d
Add circuit breaker state tracking
dahlia May 25, 2026
4c3e9b3
Hold queued delivery when hosts fail
dahlia May 25, 2026
fce3f14
Document circuit breaker observability
dahlia May 25, 2026
9ea93ea
Limit circuit breaker setup to outbox queues
dahlia May 25, 2026
04621b7
Keep permanent failures out of circuits
dahlia May 26, 2026
3619eec
Honor retry give-up for Retry-After
dahlia May 26, 2026
b8b08e2
Recover stale half-open circuits
dahlia May 26, 2026
3c66361
Ignore invalid Retry-After delays
dahlia May 26, 2026
71c623f
Prune stale circuit failures
dahlia May 26, 2026
c9c5e6e
Close circuits on permanent 4xx
dahlia May 26, 2026
0bae148
Include ports in remote host metrics
dahlia May 26, 2026
893f904
Keep delivered activities from retrying
dahlia May 26, 2026
542f34d
Drop expired held probes after failure
dahlia May 26, 2026
c3b507f
Clear circuits for permanent failures
dahlia May 26, 2026
ebfcf82
Preserve retries on circuit state errors
dahlia May 26, 2026
5dea048
Avoid dropping held messages after recovery
dahlia May 26, 2026
5220d1c
Honor Retry-After without circuit breaker
dahlia May 26, 2026
31457b6
Keep half-open probes single-flight
dahlia May 27, 2026
c4481c0
Cap held circuit delays at TTL
dahlia May 27, 2026
a283db9
Report half-open circuit holds
dahlia May 27, 2026
35fb904
Validate Retry-After and probe telemetry
dahlia May 27, 2026
2a40f71
Record held span after circuit opens
dahlia May 27, 2026
dfac216
Preserve Retry-After ordering
dahlia May 27, 2026
bfb253a
Bound circuit probe CAS retries
dahlia May 27, 2026
6f8cac6
Normalize held circuit state spans
dahlia May 27, 2026
e740311
Add a PR link to the changelog
dahlia May 27, 2026
49b524b
Harden circuit breaker state handling
dahlia May 27, 2026
ed6b000
Cover circuit breaker ordering metadata
dahlia May 27, 2026
5589400
Simplify circuit failure branching
dahlia May 27, 2026
81e53e3
Require positive circuit durations
dahlia May 27, 2026
1a62d44
Honor retry give-up before holding
dahlia May 27, 2026
0b09904
Clamp negative retry delays by sign
dahlia May 27, 2026
1fd8072
Honor Retry-After on unavailable inboxes
dahlia May 27, 2026
8be20da
Ignore malformed held timestamps
dahlia May 27, 2026
0ee2da0
Reject empty failure windows
dahlia May 27, 2026
91b4cae
Keep local errors out of circuits
dahlia May 27, 2026
e6b5e3d
Parse asctime Retry-After dates as UTC
dahlia May 27, 2026
ad62414
Fail open before circuit probes
dahlia May 27, 2026
4d0f011
Simplify circuit failure bookkeeping
dahlia May 27, 2026
e2a32c2
Normalize send transport errors
dahlia May 27, 2026
4522794
Cover negative calendar retry delays
dahlia May 27, 2026
a25476c
Cover 503 Retry-After ordering
dahlia May 27, 2026
d293300
Bound custom failure history
dahlia May 27, 2026
9f97b9b
Honor Retry-After circuit holds
dahlia May 27, 2026
7aaea0c
Skip open circuit failure writes
dahlia May 27, 2026
180b09a
Memoize queued actor IDs
dahlia May 27, 2026
32 changes: 26 additions & 6 deletions CHANGES.md
@@ -110,6 +110,22 @@ To be released.
operators distinguish a slow-draining queue from a queue that sees
less traffic. [[#316], [#740], [#759]]

- Added an outbound delivery circuit breaker for queued outbox delivery.
Fedify now tracks consecutive network and HTTP 5xx delivery failures
per remote host (including any non-default port), stores the state in
the configured `KvStore`, and requeues messages held by an open circuit
instead of repeatedly sending to an unreachable server. The circuit
breaker is enabled by default for queued outbox delivery and can be
disabled with
`circuitBreaker: false`; applications can customize the failure policy,
recovery delay, held activity TTL, release interval, and state/drop
callbacks. HTTP 429 responses do not count as circuit failures and
`Retry-After` is respected when present. State changes are exposed
through `activitypub.circuit_breaker.state_change` metrics and
`activitypub.circuit_breaker.state_change` span events, and expired
held activities call the outbox permanent failure handler with
`reason: "circuit-breaker-ttl"`. [[#620], [#778]]

- Added OpenTelemetry metrics for ActivityPub fanout and activity
lifecycle events, complementing the per-recipient
`activitypub.delivery.*` counters and the per-task
@@ -155,10 +171,11 @@ To be released.
Instruments share an `activitypub.lookup.kind` and (where
applicable) `activitypub.lookup.result` attribute drawn from small,
spec-bounded enumerations. `activitypub.remote.host` records the
- URL hostname only; `http.response.status_code` is recorded when an
- HTTP response was observed; `activitypub.cache.enabled` is
- recorded on the key and document fetch metrics whenever Fedify can
- confidently report the cache layer's presence. Key IDs, actor
+ URL host, including any non-default port; `http.response.status_code`
+ is recorded when an HTTP response was observed;
+ `activitypub.cache.enabled` is recorded on the key and document
+ fetch metrics whenever Fedify can confidently report the cache
+ layer's presence. Key IDs, actor
IDs, object IDs, JSON-LD context URLs, full URLs, and fediverse
handles are deliberately excluded so attacker-controlled remotes
cannot inflate metric cardinality. The existing
@@ -193,8 +210,9 @@ To be released.
`webfinger.resource.scheme` is bucketed to a small allow list
(`acct`, `http`, `https`, `mailto`, or `other`) so an
attacker-controlled query string cannot inflate metric
- cardinality; `activitypub.remote.host` records the URL hostname
- only. Full resource URIs, lookup URLs, and handle strings are
+ cardinality; `activitypub.remote.host` records the URL host,
+ including any non-default port. Full resource URIs, lookup URLs,
+ and handle strings are
deliberately excluded; they remain on the corresponding spans
(`webfinger.lookup`, `webfinger.handle`,
`activitypub.get_actor_handle`) for trace-level investigation.
@@ -221,6 +239,7 @@ To be released.
[#316]: https://github.com/fedify-dev/fedify/issues/316
[#418]: https://github.com/fedify-dev/fedify/issues/418
[#619]: https://github.com/fedify-dev/fedify/issues/619
[#620]: https://github.com/fedify-dev/fedify/issues/620
[#735]: https://github.com/fedify-dev/fedify/issues/735
[#736]: https://github.com/fedify-dev/fedify/issues/736
[#737]: https://github.com/fedify-dev/fedify/issues/737
@@ -241,6 +260,7 @@ To be released.
[#771]: https://github.com/fedify-dev/fedify/pull/771
[#772]: https://github.com/fedify-dev/fedify/pull/772
[#777]: https://github.com/fedify-dev/fedify/pull/777
[#778]: https://github.com/fedify-dev/fedify/pull/778

### @fedify/fixture

1 change: 1 addition & 0 deletions docs/.vitepress/config.mts
@@ -145,6 +145,7 @@ const MANUAL = {
{ text: "Pragmatics", link: "/manual/pragmatics.md" },
{ text: "Key–value store", link: "/manual/kv.md" },
{ text: "Message queue", link: "/manual/mq.md" },
{ text: "Circuit breaker", link: "/manual/circuit-breaker.md" },
{ text: "Integration", link: "/manual/integration.md" },
{ text: "Migration", link: "/manual/migrate.md" },
{ text: "Relay", link: "/manual/relay.md" },
175 changes: 175 additions & 0 deletions docs/manual/circuit-breaker.md
@@ -0,0 +1,175 @@
Circuit breaker
===============

*This API is available since Fedify 2.3.0.*

Fedify's outbound delivery circuit breaker keeps queued ActivityPub delivery
from repeatedly hammering a remote server that is down or returning server
errors. It applies to queued outbox delivery: activities delivered through
a configured `MessageQueue` are tracked per remote inbox host, and deliveries
to an unhealthy host are temporarily held until a recovery probe is due.


Enabling and disabling
----------------------

The circuit breaker is enabled by default for queued outbox delivery. To
disable it, pass `circuitBreaker: false` to `createFederation()`:

~~~~ typescript
import { createFederation } from "@fedify/fedify";

const federation = createFederation<void>({
  kv,
  queue,
  circuitBreaker: false,
});
~~~~

To customize the defaults, pass a `CircuitBreakerOptions` object:

~~~~ typescript
import { createFederation } from "@fedify/fedify";

const federation = createFederation<void>({
  kv,
  queue,
  circuitBreaker: {
    failureThreshold: 5,
    failureWindow: { minutes: 10 },
    recoveryDelay: { minutes: 30 },
    heldActivityTtl: { days: 7 },
    releaseInterval: { seconds: 1 },
  },
});
~~~~

The default policy opens a remote host's circuit after five consecutive
counted failures within ten minutes. When the circuit is open, Fedify
requeues affected outbox messages instead of sending them. After the
`recoveryDelay`, one message is allowed through as a half-open probe. If it
succeeds, the circuit closes; if it fails, the circuit opens again.
While the probe is in flight, other held messages continue to be requeued at
`releaseInterval`. If the worker running the probe stops before recording a
success or failure, Fedify treats the half-open probe as stale after another
`recoveryDelay` and allows a replacement probe.
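
The lifecycle described above can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not Fedify's internal implementation: the `Circuit` and `Policy` types and the `onFailure`, `onSuccess`, and `mayAttempt` functions are hypothetical names, timestamps are plain epoch milliseconds, and the stale-probe rule assumes the probe start time is kept in `opened`.

```typescript
// Hypothetical sketch; names and shapes are illustrative, not Fedify's API.
type CircuitState = "closed" | "open" | "half-open";

interface Circuit {
  state: CircuitState;
  failures: number[]; // epoch ms of consecutive counted failures
  opened?: number; // epoch ms when the circuit opened or a probe started
}

interface Policy {
  failureThreshold: number;
  failureWindowMs: number;
  recoveryDelayMs: number;
}

// Record a counted failure and decide whether the circuit opens.
function onFailure(c: Circuit, p: Policy, now: number): Circuit {
  if (c.state !== "closed") {
    // A failure while open or half-open (for example a failed probe)
    // reopens the circuit and restarts the recovery delay.
    return { state: "open", failures: c.failures, opened: now };
  }
  // Keep only failures that fall inside the trailing failure window.
  const failures = [...c.failures, now].filter(
    (t) => now - t <= p.failureWindowMs,
  );
  return failures.length >= p.failureThreshold
    ? { state: "open", failures, opened: now }
    : { state: "closed", failures };
}

// A successful delivery, including a successful probe, closes the circuit.
function onSuccess(): Circuit {
  return { state: "closed", failures: [] };
}

// Decide whether a message may be sent now or must be held and requeued.
function mayAttempt(
  c: Circuit,
  p: Policy,
  now: number,
): { allowed: boolean; next: Circuit } {
  if (c.state === "closed") return { allowed: true, next: c };
  // Open: wait for the recovery delay. Half-open: the in-flight probe is
  // treated as stale only after another recovery delay has passed.
  const due = c.opened === undefined || now - c.opened >= p.recoveryDelayMs;
  if (!due) return { allowed: false, next: c };
  return { allowed: true, next: { ...c, state: "half-open", opened: now } };
}
```

In this sketch a single `opened` timestamp serves both the open and half-open states, which is why one comparison covers the initial probe and the replacement of a stale probe.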


What counts as a failure
------------------------

Fedify counts these delivery failures toward the circuit:

- network errors, including failed `fetch()` calls
- HTTP 5xx responses from the remote inbox

Fedify does not count these responses as circuit failures:

- HTTP 429 responses; the `Retry-After` header is respected when present
- HTTP 4xx responses that are not configured as permanent delivery failures
- configured permanent delivery failures, such as `404` or `410` by default

Any HTTP 4xx response clears the consecutive failure history for that host
because it proves the remote server is reachable.
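
The rules above can be summarized as a small classifier. The sketch below is an illustration only: `DeliveryResult`, `Classification`, `classify`, and `parseRetryAfter` are hypothetical names, a missing `status` stands in for a network error, and the handling of 1xx and 3xx responses is an assumption because the text does not cover them.

```typescript
// Hypothetical sketch of the failure classification described above.
interface DeliveryResult {
  status?: number; // undefined models a network error (no HTTP response)
  retryAfter?: string | null; // raw Retry-After header value, if any
}

interface Classification {
  countsTowardCircuit: boolean;
  clearsFailureHistory: boolean;
  retryAfterMs: number | undefined;
}

// Parse Retry-After as delta-seconds or as an HTTP date.
// Invalid values are ignored; dates in the past clamp to zero.
function parseRetryAfter(
  value: string | null | undefined,
  now: number,
): number | undefined {
  if (value == null) return undefined;
  const trimmed = value.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const date = Date.parse(trimmed);
  if (Number.isNaN(date)) return undefined;
  return Math.max(0, date - now);
}

function classify(result: DeliveryResult, now: number): Classification {
  const { status } = result;
  if (status === undefined || (status >= 500 && status <= 599)) {
    // Network errors and HTTP 5xx count toward opening the circuit.
    return {
      countsTowardCircuit: true,
      clearsFailureHistory: false,
      retryAfterMs: undefined,
    };
  }
  if (status >= 400 && status <= 499) {
    // Any 4xx proves the host is reachable; 429 may carry Retry-After.
    return {
      countsTowardCircuit: false,
      clearsFailureHistory: true,
      retryAfterMs: status === 429
        ? parseRetryAfter(result.retryAfter, now)
        : undefined,
    };
  }
  // 2xx ends the consecutive run; other codes are left neutral (assumption).
  return {
    countsTowardCircuit: false,
    clearsFailureHistory: status >= 200 && status <= 299,
    retryAfterMs: undefined,
  };
}
```

Separating "counts toward the circuit" from "clears the failure history" keeps the two decisions independent, so a 429 can delay the next attempt without pushing the host closer to an open circuit.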


Custom failure policy
---------------------

You can replace the numeric threshold/window policy with a callback. The
callback receives the full consecutive failure timestamp list for the remote
host and returns whether the circuit should open:

~~~~ typescript
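import { createFederation } from "@fedify/fedify";

const federation = createFederation<void>({
  kv,
  queue,
  circuitBreaker: {
    // Open the circuit once ten consecutive failures have been recorded.
    failure(timestamps) {
      return timestamps.length >= 10;
    },
  },
});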
~~~~

The callback form is mutually exclusive with `failureThreshold` and
`failureWindow`.
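
A callback is useful when one threshold and one window are not expressive enough. As a hedged sketch, the hypothetical helper below opens on either a short burst or a long sustained run of failures. It assumes timestamps are epoch milliseconds and takes the current time as a parameter; the type Fedify actually passes to `failure()` may differ, so a real callback would need to convert first.

```typescript
// Hypothetical policy helper; the thresholds are illustrative values.
function burstOrSustained(
  timestampsMs: readonly number[],
  nowMs: number,
): boolean {
  // Burst rule: at least three failures within the last minute.
  const recent = timestampsMs.filter((t) => nowMs - t <= 60_000).length;
  // Sustained rule: at least ten consecutive failures of any age.
  return recent >= 3 || timestampsMs.length >= 10;
}
```

Keeping the decision in a pure function like this makes the policy easy to unit-test separately from the federation setup.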


Held activity expiry
--------------------

Activities held by an open circuit are requeued until the remote host recovers
or the held activity exceeds `heldActivityTtl`, which defaults to seven days.
When a held activity expires, Fedify drops it, records it as an abandoned
outbox activity, calls `circuitBreaker.onActivityDrop` when configured, and
calls the outbox permanent failure handler with
`reason: "circuit-breaker-ttl"`.

~~~~ typescript
import { createFederation } from "@fedify/fedify";

const federation = createFederation<void>({
  kv,
  queue,
  circuitBreaker: {
    // Called when a held activity outlives `heldActivityTtl`.
    onActivityDrop(remoteHost, details) {
      console.warn("Dropped held activity", {
        remoteHost,
        inbox: details.inbox.href,
        activityId: details.activityId,
        heldSince: details.heldSince.toString(),
      });
    },
  },
});

federation.setOutboxPermanentFailureHandler((_ctx, failure) => {
  if (failure.reason === "circuit-breaker-ttl") {
    // The remote host did not recover before the held activity expired.
    return;
  }

  // Existing HTTP permanent-failure handling, such as 404 or 410 cleanup.
});
~~~~
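
The hold-or-drop decision for a single held message can be sketched as follows. This is an assumption-laden illustration rather than Fedify's code: `decideHeld` and `HeldDecision` are hypothetical names, times are epoch milliseconds, and capping the requeue delay at the remaining TTL is a design choice made for this sketch.

```typescript
// Hypothetical sketch of the held-activity expiry decision.
type HeldDecision =
  | { action: "drop" }
  | { action: "requeue"; delayMs: number };

function decideHeld(
  heldSinceMs: number,
  nowMs: number,
  ttlMs: number,
  releaseIntervalMs: number,
): HeldDecision {
  const age = nowMs - heldSinceMs;
  // Drop once the activity has been held for longer than the TTL.
  if (age > ttlMs) return { action: "drop" };
  // Otherwise requeue, but never sleep past the point where the TTL ends.
  return {
    action: "requeue",
    delayMs: Math.min(releaseIntervalMs, ttlMs - age),
  };
}
```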


Storage and concurrency
-----------------------

Circuit state is stored in the configured `KvStore` under the
`["_fedify", "circuit", remoteHost]` key prefix by default. The stored value
has this shape:

~~~~ typescript
{
  state: "closed" | "open" | "half-open",
  failures: string[],
  opened?: string,
}
~~~~

For multi-worker deployments, use a `KvStore` implementation that supports
`cas()` so competing workers do not overwrite each other's state transitions.
Fedify still works without CAS, but it logs a warning because concurrent
workers can race when opening or closing the same host's circuit.
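
To illustrate why compare-and-swap matters here, the sketch below shows a bounded read-modify-write loop against an in-memory store. Everything in it is hypothetical: `CasStore`, `MemoryCasStore`, and `updateWithCas` are illustrative names with a simplified string key, not Fedify's `KvStore` interface, and the retry limit of five is an arbitrary choice.

```typescript
// Hypothetical minimal store with compare-and-swap semantics.
interface CasStore<T> {
  get(key: string): Promise<T | undefined>;
  cas(key: string, expected: T | undefined, next: T): Promise<boolean>;
}

class MemoryCasStore<T> implements CasStore<T> {
  private data = new Map<string, string>();

  async get(key: string): Promise<T | undefined> {
    const raw = this.data.get(key);
    return raw === undefined ? undefined : (JSON.parse(raw) as T);
  }

  // Write `next` only if the stored value still equals `expected`.
  async cas(key: string, expected: T | undefined, next: T): Promise<boolean> {
    const current = this.data.get(key);
    const expectedRaw = expected === undefined
      ? undefined
      : JSON.stringify(expected);
    if (current !== expectedRaw) return false;
    this.data.set(key, JSON.stringify(next));
    return true;
  }
}

// Apply `update` atomically, retrying a bounded number of times when
// another worker changed the value between the read and the write.
async function updateWithCas<T>(
  store: CasStore<T>,
  key: string,
  update: (current: T | undefined) => T,
  maxAttempts = 5,
): Promise<T | undefined> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const current = await store.get(key);
    const next = update(current);
    if (await store.cas(key, current, next)) return next;
  }
  return undefined; // Gave up; the caller decides how to proceed.
}
```

Without the `cas()` check, two workers that read the same state could both write, and the later write would silently discard the earlier transition.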


Observability
-------------

State changes are emitted through the `onStateChange` callback and through
OpenTelemetry:

- `activitypub.circuit_breaker.state_change` counter with
`activitypub.remote.host` and `activitypub.circuit_breaker.state`
- `activitypub.circuit_breaker.state_change` span event on the queued
outbox worker span with the previous and new state
- `activitypub.circuit_breaker.held` span event on the queued outbox worker
span when an open circuit holds a delivery

The circuit breaker deliberately records only the remote host, not full inbox
URLs, actor IDs, or activity IDs, to keep metric cardinality bounded. For the
full metric and span attribute lists, see the [OpenTelemetry] manual.
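
As a small illustration of a host value that includes only non-default ports, the standard `URL.host` property behaves this way. The `remoteHostOf` helper is a hypothetical name used for this sketch; whether Fedify derives the attribute exactly like this is an assumption.

```typescript
// Hypothetical helper: reduce an inbox URL to a low-cardinality host label.
function remoteHostOf(inbox: URL | string): string {
  // `URL.host` keeps an explicit non-default port and omits default ones.
  return new URL(inbox).host;
}

console.log(remoteHostOf("https://example.com:8443/users/alice/inbox"));
// prints "example.com:8443"
console.log(remoteHostOf("https://example.com:443/users/alice/inbox"));
// prints "example.com"
```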

[OpenTelemetry]: ./opentelemetry.md