Phase 1 Infrastructure Rollout

Ivan P edited this page Jun 5, 2026 · 6 revisions

Deploy the new telemetry backends following the same ArgoCD + Helm pattern established for Loki. This is a one-time cost and unblocks all service onboarding.

Total estimated effort: ~2–2.5 days
Prerequisite for: Phase 2 (Service Onboarding)


Issue: Deploy Grafana Tempo

Labels: phase-1 infrastructure tracing
Estimated Effort: ~1 day

Description

Deploy Grafana Tempo as the distributed trace storage backend. Tempo is the core requirement for end-to-end request visibility — without it, no traces from instrumented services can be stored or queried.

Requirements

  • ArgoCD Application + Helm chart for Tempo following the same pattern as the Loki deployment
  • New SeaweedFS bucket: telemetry-tempo
  • Tempo configured to use SeaweedFS S3 gateway as object storage backend
  • Grafana datasource added pointing to Tempo
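
The first three requirements could be sketched as a single Argo CD Application. This is a minimal illustration, not the team's actual manifest: the Application name, namespaces, chart version placeholder, and SeaweedFS service address are all assumptions, and the values layout shown is the one used by the single-binary `tempo` chart (the `tempo-distributed` chart nests these keys differently).

```yaml
# Hypothetical Argo CD Application for Tempo. Names, namespaces, the chart
# version and the SeaweedFS endpoint are placeholders, not values from this wiki.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: tempo
  namespace: argocd                 # assumed Argo CD namespace
spec:
  project: default                  # assumed project
  source:
    repoURL: https://grafana.github.io/helm-charts
    chart: tempo
    targetRevision: "CHART_VERSION" # placeholder: pin to the version you have tested
    helm:
      values: |
        tempo:
          storage:
            trace:
              backend: s3
              s3:
                bucket: telemetry-tempo
                # Assumed service name; 8333 is the SeaweedFS S3 gateway default port.
                endpoint: seaweedfs-s3.seaweedfs.svc:8333
                insecure: true        # plain HTTP inside the cluster (assumption)
                forcepathstyle: true  # path-style bucket addressing for S3-compatible stores
                # access_key / secret_key: supply from a Secret, do not commit them
  destination:
    server: https://kubernetes.default.svc
    namespace: telemetry            # assumed target namespace
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```

If the Loki deployment keeps its Helm values in a separate file in Git rather than inline, the same storage keys would go in that file instead.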

Expected Outcome

Grafana shows a working Tempo datasource. A test trace (e.g. from a manually instrumented script) is queryable in the Explore view.

Acceptance Criteria

  • ArgoCD Application for Tempo is healthy and synced
  • telemetry-tempo bucket exists in SeaweedFS
  • Tempo is storing trace data to SeaweedFS (verify via bucket contents after test trace)
  • Grafana Tempo datasource is configured and returns results

Issue: Deploy Grafana Mimir

Labels: phase-1 infrastructure metrics
Estimated Effort: ~1 day

Description

Deploy Grafana Mimir as a unified metrics backend. Mimir consolidates the current pattern of one Grafana datasource per namespace into a single backend, and allows alert rules to be written once across all namespaces. It also enables OTel Exemplars — Trace IDs embedded in metric data points that link directly to the causative trace in Tempo.
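
To illustrate "written once across all namespaces", here is a sketch of a Prometheus-style rule group as it might be loaded into the Mimir ruler. The metric name `http_requests_total`, the `status` label, and the 5% threshold are purely illustrative assumptions; substitute whatever the services actually expose.

```yaml
# Hypothetical rule group; the metric and label names are illustrative only.
groups:
  - name: http-error-rate
    rules:
      - alert: HighHttpErrorRate
        # A single expression evaluated per namespace, instead of one copy
        # of the rule for each per-namespace datasource.
        expr: |
          sum by (namespace) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (namespace) (rate(http_requests_total[5m]))
            > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "5xx ratio above 5% in namespace {{ $labels.namespace }}"
```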

Requirements

  • ArgoCD Application + Helm chart for Mimir following the same pattern as the Loki deployment
  • New SeaweedFS bucket: telemetry-mimir
  • Mimir configured to use SeaweedFS S3 gateway as object storage backend
  • Grafana datasource added pointing to Mimir
  • Alloy remote-write config pointing to Mimir (covered in the "Update Alloy Configuration" issue below)
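
A possible shape for the storage part of these requirements, assuming the `mimir-distributed` chart and its `mimir.structuredConfig` override. The endpoint is a hypothetical service address. Because the requirements call for a single bucket, the sketch gives each Mimir storage component its own prefix, since Mimir does not allow blocks, ruler, and alertmanager data to share the same bucket location.

```yaml
# Hypothetical values for the mimir-distributed chart; endpoint and prefixes are assumptions.
mimir:
  structuredConfig:
    common:
      storage:
        backend: s3
        s3:
          # Assumed service name; 8333 is the SeaweedFS S3 gateway default port.
          endpoint: seaweedfs-s3.seaweedfs.svc:8333
          bucket_name: telemetry-mimir
          insecure: true            # plain HTTP inside the cluster (assumption)
          # access_key_id / secret_access_key: inject from a Secret
    # One bucket, separate prefixes per storage component.
    blocks_storage:
      storage_prefix: blocks
    ruler_storage:
      storage_prefix: ruler
    alertmanager_storage:
      storage_prefix: alertmanager

# The chart bundles MinIO by default; turn it off when using an external S3 endpoint.
minio:
  enabled: false
```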

Expected Outcome

Grafana shows a single Mimir datasource. Metrics from all namespaces are queryable without switching datasources.
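
One way to reach this outcome is Grafana datasource provisioning. The sketch below assumes hypothetical service URLs and datasource UIDs, and it assumes exemplars carry the trace ID in a label called `trace_id`. If Mimir multi-tenancy is enabled, the datasource would additionally need an `X-Scope-OrgID` header.

```yaml
# Hypothetical Grafana datasource provisioning file; URLs and UIDs are assumptions.
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    uid: tempo
    access: proxy
    url: http://tempo.telemetry.svc:3200          # 3200 is Tempo's default HTTP port
  - name: Mimir
    type: prometheus                              # Mimir is queried through its Prometheus-compatible API
    uid: mimir
    access: proxy
    url: http://mimir-nginx.telemetry.svc/prometheus
    isDefault: true
    jsonData:
      # Link exemplars on metric panels to the matching trace in Tempo.
      exemplarTraceIdDestinations:
        - name: trace_id                          # assumed exemplar label name
          datasourceUid: tempo
```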

Acceptance Criteria

  • ArgoCD Application for Mimir is healthy and synced
  • telemetry-mimir bucket exists in SeaweedFS
  • Alloy is remote-writing metrics to Mimir (verify via Mimir query returning namespace metrics)
  • Grafana Mimir datasource is configured and returns results
  • Existing per-namespace Prometheus datasources are either deprecated or have a documented removal plan

Issue: Update Alloy Configuration

Labels: phase-1 infrastructure alloy
Estimated Effort: ~2–3h

Description

Update the shared Alloy Helm values to enable OTLP trace receiving (so instrumented services can push traces to Tempo) and Mimir remote-write (so Alloy forwards scraped metrics to Mimir). This is a small diff to the existing shared values file — the same pattern used for Loki log routing.

Requirements

  • Enable the OTLP receiver (otelcol.receiver.otlp) in the Alloy config (gRPC on port 4317, HTTP on port 4318)
  • Add a prometheus.remote_write block pointing to Mimir
  • Route received OTLP traces to Tempo
  • Verify no disruption to existing Loki log forwarding
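
The diff might look roughly like the following, assuming the shared values file targets the `grafana/alloy` chart and embeds the Alloy config under `alloy.configMap.content`. In Alloy the OTLP receiver component is named `otelcol.receiver.otlp`. The Tempo and Mimir service addresses are hypothetical, and the existing Loki components are assumed to remain untouched elsewhere in the same config.

```yaml
# Hypothetical additions to the shared Alloy Helm values; service addresses are assumptions.
alloy:
  # Expose the OTLP ports on the Alloy Service so workloads can push to it.
  extraPorts:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
      protocol: TCP
    - name: otlp-http
      port: 4318
      targetPort: 4318
      protocol: TCP
  configMap:
    content: |
      // Existing Loki pipeline stays as-is; only the blocks below are new.

      // Accept OTLP over gRPC (4317) and HTTP (4318) and hand traces to the Tempo exporter.
      otelcol.receiver.otlp "default" {
        grpc {
          endpoint = "0.0.0.0:4317"
        }
        http {
          endpoint = "0.0.0.0:4318"
        }
        output {
          traces = [otelcol.exporter.otlp.tempo.input]
        }
      }

      // Forward traces to Tempo's OTLP gRPC endpoint (assumed address, no TLS in-cluster).
      otelcol.exporter.otlp "tempo" {
        client {
          endpoint = "tempo.telemetry.svc:4317"
          tls {
            insecure = true
          }
        }
      }

      // Remote-write target; existing prometheus.scrape components would set
      // forward_to = [prometheus.remote_write.mimir.receiver].
      prometheus.remote_write "mimir" {
        endpoint {
          url = "http://mimir-nginx.telemetry.svc/api/v1/push"
        }
      }
```

Keeping the new components in their own blocks leaves the Loki pipeline's components unreferenced by the change, which makes the "no disruption" check easier to reason about in review.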

Expected Outcome

Alloy accepts OTLP pushes from instrumented services and forwards traces to Tempo. Scraped metrics are remote-written to Mimir.

Acceptance Criteria

  • Alloy config diff reviewed and merged
  • ArgoCD syncs the updated Alloy config to all namespaces without errors
  • A test OTLP push to http://alloy:4318 (OTLP/HTTP) or alloy:4317 (OTLP/gRPC) results in a trace visible in Tempo
  • Mimir receives metrics from Alloy (verify via Mimir query)
  • Loki log forwarding unaffected

Issue: Import Grafana Dashboards

Labels: phase-1 infrastructure dashboards
Estimated Effort: ~½ day

Description

Import and configure pre-built community dashboards for the new telemetry backends. These provide immediate operational value as soon as services begin pushing traces and metrics.

Requirements

  • Trace explorer / service map dashboard (Tempo)
  • Service overview dashboard (RED metrics: Rate, Errors, Duration)
  • Namespace metrics overview dashboard (Mimir)
  • Light customisation to match namespace/label conventions

Expected Outcome

Grafana home shows populated dashboards with trace, service, and metrics views available out of the box once services are onboarded.

Acceptance Criteria

  • Tempo trace explorer dashboard is imported and functional
  • Service overview dashboard populates once at least one service is instrumented
  • Namespace metrics dashboard queries Mimir (not per-namespace Prometheus)
  • Dashboards are provisioned via config (not manually created in UI)
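
For the last criterion, a file-based dashboard provider is one common approach. This is a sketch only: the provider name, folder, and path are assumptions, and the dashboard JSON files would need to be mounted at that path (for example from ConfigMaps managed in Git).

```yaml
# Hypothetical Grafana dashboard provisioning file; names and path are assumptions.
apiVersion: 1
providers:
  - name: telemetry
    orgId: 1
    folder: Telemetry               # Grafana folder the dashboards appear in
    type: file
    disableDeletion: true           # keep dashboards even if a file temporarily disappears
    allowUiUpdates: false           # changes must go through the config in Git, not the UI
    options:
      path: /var/lib/grafana/dashboards/telemetry
```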
