Summary
Add runnable monitoring examples for Fedify's OpenTelemetry metrics, including a Grafana dashboard, Prometheus alert rules, and a minimal OpenTelemetry Collector setup.
The production monitoring guide in #743 should explain how operators think about Fedify metrics. This issue should provide files that users can copy, run, and adapt. The examples should make the Milestone 6 dashboard deliverable concrete by showing panels for activity delivery success rates, peer connectivity health, queue backlog, signature verification latency, inbox processing latency, and resource utilization patterns.
A likely location is examples/monitoring/, since these files are meant to be executable examples rather than prose-only documentation.
Problem
Fedify's planned metrics work will expose the raw signals needed for production observability, but raw metric names are not enough for many users. Operators still need to know how those metrics fit together in a dashboard, which alert rules are safe starting points, and how to wire OpenTelemetry metrics into a common Prometheus/Grafana setup.
The documentation issue #743 covers the written guide, but it should not have to carry exported dashboard files and runnable configuration. Keeping the example files in a separate issue gives us a clearer deliverable and makes it easier to test that the example setup still starts.
Without these examples, each deployment has to rebuild the same baseline monitoring setup from scratch, including choices that are easy to get wrong:
Which labels are safe to group by without creating unbounded cardinality.
Which delivery failures should page someone and which should only create investigation alerts.
How to distinguish Fedify metrics from runtime, database, queue backend, and host metrics.
How to connect OpenTelemetry Collector, Prometheus, and Grafana in a minimal local stack.
How to visualize queue backlog and delivery failures without exposing actor IDs, inbox URLs, object IDs, or raw remote URLs as labels.
Proposed solution
Add a monitoring example under examples/monitoring/.
The example should run a local observability stack that can receive or scrape metrics from a Fedify application:
docker compose -f examples/monitoring/docker-compose.yml up
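As a sketch of what that compose file could contain (service names, image tags, and mounted config file names are illustrative, not decided):

```yaml
# Sketch of examples/monitoring/docker-compose.yml.
# File names like otel-collector.yaml and alerts.yml are placeholders.
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otelcol/config.yaml"]
    volumes:
      - ./otel-collector.yaml:/etc/otelcol/config.yaml:ro
    ports:
      - "4317:4317"   # OTLP gRPC input
      - "4318:4318"   # OTLP HTTP input
  prometheus:
    image: prom/prometheus:latest
    volumes:
      # prometheus.yml must list alerts.yml under rule_files.
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./alerts.yml:/etc/prometheus/alerts.yml:ro
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana-oss:latest
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true   # convenient for a local example only
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    ports:
      - "3000:3000"
```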
The OpenTelemetry Collector config should cover the common local-development path: OTLP input on localhost:4317 (gRPC) and localhost:4318 (HTTP), with a Prometheus-compatible endpoint for Prometheus to scrape. The exact setup can change if the metrics implementation chooses a different recommended path, but the example should stay vendor-neutral and avoid managed-service-specific configuration.
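A minimal collector config matching that path might look like the following; the exposition port 8889 is an arbitrary choice, and Prometheus would scrape the collector on that port:

```yaml
# Sketch of a minimal OpenTelemetry Collector config (OTLP in, Prometheus out).
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889   # Prometheus scrapes this port
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```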
The Grafana dashboard should include a baseline Fedify overview with panels such as:
HTTP request rate and latency by Fedify endpoint.
Inbox processing latency and error rate.
Outbound delivery attempts, success rate, retry rate, and permanent failures.
Queue depth by queue role and state.
Queue task processing duration and failure rate.
Signature verification duration and failure rate.
WebFinger and actor discovery latency and failure rate.
Remote document/key lookup latency and failure rate.
Peer connectivity health, grouped by a bounded remote-host attribute where the metric design supports it.
Resource utilization context from runtime, process, database, or queue backend metrics, clearly separated from Fedify-owned metrics.
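To make the panel intent concrete, here are two hedged PromQL sketches. The metric names fedify_outbox_delivery_total and fedify_inbox_processing_duration_seconds are placeholders, not finalized names; substitute whatever the metrics implementation ships:

```promql
# Outbound delivery success rate over 5 minutes (placeholder metric name):
sum(rate(fedify_outbox_delivery_total{status="success"}[5m]))
  / sum(rate(fedify_outbox_delivery_total[5m]))

# Inbox processing p95 latency from a histogram (placeholder metric name):
histogram_quantile(0.95,
  sum by (le) (rate(fedify_inbox_processing_duration_seconds_bucket[5m])))
```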
The Prometheus alert rules should provide starting points for common production symptoms:
Growing queue backlog.
Sustained inbox processing latency.
Outbound delivery failure spikes.
Permanent delivery failure spikes.
Sudden increase in signature verification failures.
Sustained discovery or key lookup failures.
Sustained 404/410 increases from remote delivery attempts, marked as investigation alerts rather than paging alerts by default.
Missing Fedify metrics from an expected target.
The alert thresholds should be explicitly labeled as examples, not defaults suitable for every deployment.
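A couple of the rules above could be sketched as follows. The metric names, thresholds, and durations are all examples pending the finalized metric names:

```yaml
# Sketch of examples/monitoring/alerts.yml; names and thresholds are examples only.
groups:
  - name: fedify-examples
    rules:
      - alert: FedifyQueueBacklogGrowing
        # Backlog is both nontrivial and trending upward for 15 minutes.
        expr: deriv(fedify_queue_depth[15m]) > 0 and fedify_queue_depth > 100
        for: 15m
        labels:
          severity: investigate
        annotations:
          summary: "Fedify queue backlog has been growing for 15 minutes."
      - alert: FedifyMetricsAbsent
        # An expected target has stopped reporting Fedify metrics at all.
        expr: absent(fedify_queue_depth)
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Fedify metrics are missing from an expected target."
```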
The example README.md should explain how to run the stack, how to point a local Fedify app at it, and how the example relates to #743. It should also explain that the dashboard and rules depend on the metric names finalized under #316 and #619, plus the follow-up metrics issues #735 through #742.
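Pointing a local Fedify app at the stack could be as simple as setting the standard OpenTelemetry SDK environment variables before starting the app; the endpoint below assumes the example collector listens for OTLP HTTP on port 4318:

```shell
# Standard OTel SDK env vars; the endpoint assumes the example collector's port.
export OTEL_SERVICE_NAME="my-fedify-app"
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318"
export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"
export OTEL_METRIC_EXPORT_INTERVAL=10000   # export metrics every 10 seconds
```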
Scope
Add runnable monitoring example files, not just documentation prose.
Include a Grafana dashboard JSON file.
Include Prometheus alert rule examples.
Include a minimal OpenTelemetry Collector and Prometheus setup.
Include Grafana provisioning files so the dashboard loads automatically in the local example stack.
Keep the examples vendor-neutral beyond OpenTelemetry Collector, Prometheus, and Grafana.
Use only stable metric names and bounded attributes finalized by the metrics implementation issues.
Make resource utilization panels clearly separate from Fedify-owned metrics.
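The Grafana provisioning item above amounts to a small file-based dashboard provider; the path below is illustrative, and a similar datasources file would point Grafana at the example Prometheus:

```yaml
# Sketch of grafana/provisioning/dashboards/fedify.yaml (path is illustrative).
apiVersion: 1
providers:
  - name: fedify
    type: file
    options:
      # Directory inside the Grafana container holding the dashboard JSON.
      path: /etc/grafana/provisioning/dashboards
```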
Acceptance criteria
Open questions
Should the example stack be validated via mise test:examples, a separate monitoring validation task, or lightweight config validation in CI?