Production monitoring dashboards and alerting guide #743

@dahlia

Summary

Create a production monitoring guide that shows how to turn Fedify's OpenTelemetry metrics into dashboard panels and alert rules.

Fedify already documents OpenTelemetry tracing in docs/manual/opentelemetry.md, and the deployment guide already lists the federation signals operators should monitor in docs/manual/deploy.md. Once the metrics work under #316 lands, operators will need examples that connect those metrics to production questions they actually ask: is the queue falling behind, are inbox handlers slow, are outbound deliveries failing, and are signatures failing more often than normal?

Problem

The current documentation stops before the dashboard and alerting layer. It explains tracing and names important production signals, but it does not show how to build an initial dashboard for a Fedify service.

Without that guidance, each operator has to infer which metric belongs in which panel, which labels are safe to aggregate by, and which failures should page someone instead of only appearing in traces or logs. This is easy to get wrong in federation code because raw actor IDs, object IDs, inbox URLs, and remote URLs can create unbounded metric cardinality.

Proposed solution

Document this in a new manual page at docs/manual/monitoring.md.

The guide should cover:

  • A minimal dashboard layout for federation health: queue backlog, inbox processing latency, outbound delivery attempts, outbound delivery failure rate, permanent delivery failures, and signature verification latency.
  • Prometheus or PromQL examples using the metric names finalized under #316 (Comprehensive OpenTelemetry metrics collection).
  • Alert examples for growing queue backlog, outbound failure spikes, sustained inbox latency, sustained increases in 404/410 responses from remote inbox delivery attempts, and sudden or sustained increases in signature verification failures.
  • Guidance that 404/410 spikes should usually be investigation alerts rather than paging alerts, since remote account deletion and instance churn are normal fediverse behavior.
  • Notes that alert thresholds are starting points, not defaults suitable for every deployment.
  • One OpenTelemetry Collector example that exposes metrics to Prometheus or forwards them to a backend through OTLP.
  • A short section marking the boundary between Fedify metrics and metrics owned by the runtime, database, queue backend, or host platform.
  • Cardinality rules for dashboard and alert authors.
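As a rough illustration of the PromQL and alert bullets above, the sketch below builds a failure-ratio expression and a Prometheus-style alert rule object. Everything in it is an assumption: the metric names are placeholders supplied by the caller (the real names are pending #316), the helper names are hypothetical, and the 5m window, 15m hold time, and 0.2 threshold are arbitrary starting points, not recommendations.

```typescript
// Sketch only. Metric names are passed in by the caller because the final
// Fedify metric names are still being settled under #316.

interface AlertRule {
  alert: string;
  expr: string;
  for: string;
  labels: { severity: "page" | "investigate" };
  annotations: { summary: string };
}

// Builds a PromQL ratio of failed deliveries to all delivery attempts.
// sum() aggregates away every label, so no per-URL or per-actor series
// can leak into the alert expression.
function failureRatioExpr(
  failuresMetric: string,
  attemptsMetric: string,
  window: string,
): string {
  return `sum(rate(${failuresMetric}[${window}])) / sum(rate(${attemptsMetric}[${window}]))`;
}

// Wraps the ratio in an alert rule. The threshold and severity are left to
// the caller because they are deployment-specific.
function outboundFailureAlert(
  failuresMetric: string,
  attemptsMetric: string,
  threshold: number,
  severity: "page" | "investigate",
): AlertRule {
  const ratio = failureRatioExpr(failuresMetric, attemptsMetric, "5m");
  return {
    alert: "OutboundDeliveryFailureSpike",
    expr: `${ratio} > ${threshold}`,
    for: "15m",
    labels: { severity },
    annotations: {
      summary: "Outbound delivery failure ratio stayed above the threshold",
    },
  };
}

// Placeholder names, clearly not real Fedify metrics.
const exampleRule = outboundFailureAlert(
  "PLACEHOLDER_delivery_failures_total",
  "PLACEHOLDER_delivery_attempts_total",
  0.2,
  "investigate",
);
console.log(exampleRule.expr);
```

Keeping the metric names as parameters means the same expression builder could be reused once #316 settles the names, without the guide committing to any name early.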

Raw URLs, actor IDs, object IDs, and inbox URLs should not be used as metric labels. Hostnames should only be used where the metric design explicitly calls for a normalized, bounded remote-host attribute.
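One way to picture a normalized, bounded remote-host attribute is the sketch below. It assumes an operator-maintained allowlist of hosts and collapses everything else into a single `other` value. The helper name, the allowlist approach, and the `other` bucket are all hypothetical and are not part of Fedify's metric design.

```typescript
// Hypothetical helper: reduce a remote URL to a bounded host label value.
// Only hosts on the allowlist keep their name; every other input, including
// unparseable strings, maps to one shared bucket so cardinality stays fixed.
const OTHER_HOST = "other";

function boundedRemoteHost(
  rawUrl: string,
  knownHosts: ReadonlySet<string>,
): string {
  let host: string;
  try {
    // hostname excludes the port, path, and query, so inbox paths and
    // actor IDs never reach the label value.
    host = new URL(rawUrl).hostname.toLowerCase();
  } catch {
    return OTHER_HOST;
  }
  return knownHosts.has(host) ? host : OTHER_HOST;
}

const allowlist = new Set(["example.com"]);
console.log(boundedRemoteHost("https://example.com/users/alice/inbox", allowlist));
console.log(boundedRemoteHost("https://unlisted.example/inbox", allowlist));
```

With this shape, the number of distinct label values can never exceed the allowlist size plus one, whatever remote servers the instance talks to.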

The guide should distinguish Fedify metrics from process, runtime, database, queue-backend, and host-platform metrics. Examples include CPU, RSS, event-loop lag, GC pauses, database connection pool saturation, Redis latency, PostgreSQL query latency, and platform queue depth.

Update docs/.vitepress/config.mts so the page appears in the manual navigation.
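The navigation change might look roughly like the sketch below, which uses the `{ text, link }` shape of VitePress sidebar items. Only the "Monitoring" entry reflects this proposal. The variable name, the neighbouring entries, and their order are assumptions and are not copied from the actual docs/.vitepress/config.mts.

```typescript
// Sketch of a VitePress-style sidebar item list. The surrounding entries
// are invented for illustration; only the last entry is the proposed change.
interface SidebarItem {
  text: string;
  link: string;
}

const manualSidebarItems: SidebarItem[] = [
  { text: "Deployment", link: "/manual/deploy.md" },
  { text: "OpenTelemetry", link: "/manual/opentelemetry.md" },
  // New entry for the page proposed in this issue.
  { text: "Monitoring", link: "/manual/monitoring.md" },
];

console.log(manualSidebarItems.map((item) => item.link).join("\n"));
```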

Link to the new page from the Observability in production section in docs/manual/deploy.md and from docs/manual/opentelemetry.md once metrics are documented there.

Scope

This is a documentation issue. It should not add new metrics, change metric names, or define behavior that belongs in the implementation issues under #316.

The guide should assume the relevant Fedify metrics already exist or are being finalized separately. If a required metric is missing, leave that section blocked on the relevant #316 sub-issue instead of inventing a metric name.

The guide should stay vendor-neutral. Prometheus and OpenTelemetry Collector examples are useful because they are common integration points, but vendor-specific setup should stop at the collector boundary and link out for full setup instructions.

Acceptance criteria

  • docs/manual/monitoring.md exists and has enough detail for an operator to create a first dashboard without reading Fedify source code.
  • The page is listed in the VitePress manual sidebar.
  • docs/manual/deploy.md links to the new guide from the production observability section.
  • docs/manual/opentelemetry.md links to the new guide after the Fedify metrics reference is available.
  • The guide distinguishes Fedify metrics from process, runtime, database, queue-backend, and host-platform metrics.
  • The examples use stable metric names and bounded attributes from the metrics work tracked under #316 (Comprehensive OpenTelemetry metrics collection).
  • Alert examples explain the failure condition they are meant to catch and state that thresholds are deployment-specific.
  • The guide does not require a specific observability vendor.

Open questions

  • Should this issue include a small Grafana dashboard JSON file, or should exported dashboard files be tracked separately after the written guide lands?
  • Should the first version use only PromQL examples, and leave non-Prometheus query examples for follow-up issues?
  • Should dashboard screenshots be part of this issue, or should the first version stay text-only?

Metadata

Labels

  • component/otel: OpenTelemetry integration
  • type/documentation: Improvements or additions to documentation
  • type/enhancement: Improvements to existing features

Relationships

None yet

Development

No branches or pull requests
