Create a production monitoring guide that shows how to turn Fedify's OpenTelemetry metrics into dashboard panels and alert rules.
Fedify already documents OpenTelemetry tracing in docs/manual/opentelemetry.md, and the deployment guide already lists the federation signals operators should monitor in docs/manual/deploy.md. Once the metrics work under #316 lands, operators will need examples that connect those metrics to production questions they actually ask: is the queue falling behind, are inbox handlers slow, are outbound deliveries failing, and are signatures failing more often than normal?
Problem
The current documentation stops before the dashboard and alerting layer. It explains tracing and names important production signals, but it does not show how to build an initial dashboard for a Fedify service.
Without that guidance, each operator has to infer which metric belongs in which panel, which labels are safe to aggregate by, and which failures should page someone instead of only appearing in traces or logs. This is easy to get wrong in federation code because raw actor IDs, object IDs, inbox URLs, and remote URLs can create unbounded metric cardinality.
Proposed solution
Document this in a new manual page at docs/manual/monitoring.md.
The guide should cover:
A minimal dashboard layout for federation health: queue backlog, inbox processing latency, outbound delivery attempts, outbound delivery failure rate, permanent delivery failures, and signature verification latency.
Alert examples for growing queue backlog, outbound failure spikes, sustained inbox latency, sustained increases in 404/410 responses from remote inbox delivery attempts, and sudden or sustained increases in signature verification failures.
Guidance that 404/410 spikes should usually be investigation alerts rather than paging alerts, since remote account deletion and instance churn are normal fediverse behavior.
Notes that alert thresholds are starting points, not defaults suitable for every deployment.
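The alert examples above could be sketched as Prometheus alerting rules. Every metric name below is a placeholder, since the real names depend on the metrics work under #316; the thresholds are likewise illustrative starting points, not recommendations.

```yaml
# Sketch of Prometheus alerting rules for a Fedify service.
# Metric names (fedify_queue_backlog_size, fedify_outbound_delivery_*) are
# hypothetical placeholders pending the #316 metrics work.
groups:
  - name: fedify-federation
    rules:
      - alert: FedifyQueueBacklogGrowing
        expr: deriv(fedify_queue_backlog_size[15m]) > 0 and fedify_queue_backlog_size > 100
        for: 15m
        labels:
          severity: page
        annotations:
          summary: "Activity queue backlog has been growing for 15 minutes"
      - alert: FedifyOutboundFailureSpike
        expr: >
          sum(rate(fedify_outbound_delivery_failures_total[5m]))
            / sum(rate(fedify_outbound_delivery_attempts_total[5m])) > 0.2
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "More than 20% of outbound deliveries are failing"
      - alert: FedifyRemoteGoneIncrease
        expr: sum(rate(fedify_outbound_delivery_failures_total{status=~"404|410"}[30m])) > 1
        for: 30m
        labels:
          severity: ticket  # investigate rather than page: instance churn is normal
        annotations:
          summary: "Sustained 404/410 responses from remote inbox deliveries"
```

Note how the 404/410 rule routes to a non-paging severity, matching the guidance that remote account deletion and instance churn are normal fediverse behavior.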
One OpenTelemetry Collector example that exposes metrics to Prometheus or forwards them to a backend through OTLP.
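A collector example of that kind might look like the following sketch, which receives OTLP from the Fedify service and either exposes a Prometheus scrape endpoint or forwards to a hosted backend. The endpoints and backend URL are assumptions to be replaced per deployment.

```yaml
# Sketch of an OpenTelemetry Collector config: receive OTLP from the Fedify
# service, then expose metrics to Prometheus or forward them via OTLP.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889  # scraped by your Prometheus server
  otlphttp:
    endpoint: https://otlp.example.com  # hypothetical hosted backend
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]  # swap in otlphttp to forward instead
```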
A short section drawing the boundary between Fedify metrics and metrics owned by the runtime, database, queue backend, or host platform.
Cardinality rules for dashboard and alert authors.
Raw URLs, actor IDs, object IDs, and inbox URLs should not be used as metric labels. Hostnames should only be used where the metric design explicitly calls for a normalized, bounded remote-host attribute.
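For the guide's dashboard and alert authors, the cardinality rule above could be illustrated with a small helper. This is a hypothetical sketch, not a Fedify API: it shows one way to reduce a raw remote URL to a bounded label value where the metric design allows a remote-host attribute.

```typescript
// Hypothetical helper: collapse a raw remote URL into a bounded metric label.
// Raw URLs, actor IDs, object IDs, and inbox URLs must never become labels;
// even hostnames belong only where the metric design explicitly allows them.
function normalizeRemoteHost(rawUrl: string): string {
  let host: string;
  try {
    host = new URL(rawUrl).hostname.toLowerCase();
  } catch {
    return "invalid"; // never let unparsable input leak into a label
  }
  // Cap the label set to hosts you explicitly track, collapsing the long
  // tail of small instances into a single "other" bucket. The tracked set
  // here is an illustrative assumption.
  const tracked = new Set(["mastodon.social", "misskey.io"]);
  return tracked.has(host) ? host : "other";
}
```

The fixed bucket keeps worst-case cardinality at `tracked.size + 2` regardless of how many remote instances the service federates with.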
The guide should distinguish Fedify metrics from process, runtime, database, queue-backend, and host-platform metrics. Examples include CPU, RSS, event-loop lag, GC pauses, database connection pool saturation, Redis latency, PostgreSQL query latency, and platform queue depth.
Update docs/.vitepress/config.mts so the page appears in the manual navigation.
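The navigation change could be sketched as follows. The sibling entries and the exact sidebar shape in Fedify's actual config.mts may differ; only the new entry is the point.

```typescript
// Sketch of the sidebar addition in docs/.vitepress/config.mts.
// Surrounding defineConfig({ themeConfig: { sidebar: ... } }) is omitted,
// and the sibling entries shown are assumptions.
const manualSidebar = [
  // ...existing manual pages...
  { text: "OpenTelemetry", link: "/manual/opentelemetry" },
  { text: "Deploying", link: "/manual/deploy" },
  // New entry so the monitoring guide appears in the manual navigation:
  { text: "Monitoring", link: "/manual/monitoring" },
];
```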
Link to the new page from the "Observability in production" section in docs/manual/deploy.md, and from docs/manual/opentelemetry.md once metrics are documented there.
Scope
This is a documentation issue. It should not add new metrics, change metric names, or define behavior that belongs in the implementation issues under #316.
The guide should assume the relevant Fedify metrics already exist or are being finalized separately. If a required metric is missing, leave that section blocked on the relevant #316 sub-issue instead of inventing a metric name.
The guide should stay vendor-neutral. Prometheus and OpenTelemetry Collector examples are useful because they are common integration points, but vendor-specific setup should stop at the collector boundary and link out for full setup instructions.
Acceptance criteria
docs/manual/monitoring.md exists and has enough detail for an operator to create a first dashboard without reading Fedify source code.
The page is listed in the VitePress manual sidebar.
docs/manual/deploy.md links to the new guide from the production observability section.
docs/manual/opentelemetry.md links to the new guide after the Fedify metrics reference is available.
The guide distinguishes Fedify metrics from process, runtime, database, queue-backend, and host-platform metrics.
Open questions