Architecture and Infrastructure

Ivan P edited this page Jun 5, 2026 · 3 revisions

Existing Stack (Already Deployed via ArgoCD)

| Component | Details |
| --- | --- |
| Grafana | Visualization; queries Loki, Mimir, Tempo, Pyroscope |
| Grafana Alloy | Collector deployed as a Deployment to every managed namespace; services push OTLP; metrics scraped via ServiceMonitors |
| Grafana Loki | 3 instances (dev / test / prod), Simple Scalable mode; dev Alloys → dev Loki; test+prod Alloys → prod Loki |
| SeaweedFS | S3-compatible object storage; Dev instance + HA instance; all telemetry backends use S3 gateway only |

None of the above needs redeployment, only configuration updates where noted.


New Backends — Phase 1

| Component | Why | Effort |
| --- | --- | --- |
| Grafana Tempo | Distributed trace storage: end-to-end request visibility across services | ~1 day |
| Grafana Mimir | Unified metrics backend: eliminates per-namespace datasource and alert rule duplication | ~1 day |
| Alloy config update | Enable OTLP receiver for traces + Mimir remote-write for metrics | ~2–3h |
| Grafana Dashboards | Import community dashboards for traces, service overview, namespace metrics | ~½ day |

Pyroscope (profiling backend) is deferred to Phase 2.

All new backends follow the same ArgoCD Helm chart pattern established for Loki, backed by new SeaweedFS buckets (telemetry-tempo, telemetry-mimir).


Data Flow

  1. Edge Collection: Services push traces and metrics via OTLP to the Alloy deployment in their namespace. Profiles take a separate path: language-native profiling SDKs (pyroscope-io / @pyroscope/nodejs) push to Alloy's Pyroscope receiver (Phase 2). Faro browser payloads go to a proxy route on each service's own backend, which forwards them to the local Alloy Faro receiver, so no Alloy endpoint is exposed publicly.
  2. Routing: Alloy routes traces → Tempo, metrics → Mimir (remote-write), profiles → Pyroscope, logs → Loki. OTel exemplars attach trace IDs to metric data points, which enables metric-to-trace drill-down.
  3. Storage Offload: Backends flush immutable blocks to SeaweedFS S3 buckets periodically, decoupling retention from container lifecycles.
  4. Visualization: Grafana queries Mimir and Tempo centrally, bypassing namespace isolation proxies. From a latency spike on a Mimir chart, an exemplar links directly to a Tempo trace recorded during that spike.
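
The routing step can be pictured as a simple lookup. The sketch below is illustrative only: the backend labels (`loki-dev`, `loki-prod`) and the `route` helper are hypothetical names, and the real routing lives in Alloy configuration rather than application code. It encodes the signal-to-backend mapping from step 2 and the Loki fan-in described in the Existing Stack table (dev is isolated; test and prod share the prod instance).

```python
# Hypothetical routing table mirroring step 2 (Routing).
SIGNAL_BACKENDS = {
    "traces": "tempo",
    "metrics": "mimir",
    "profiles": "pyroscope",  # Phase 2
    "logs": "loki",
}

# Loki fan-in: dev Alloys write to dev Loki; test and prod Alloys write to prod Loki.
LOKI_INSTANCE_BY_ENV = {"dev": "dev", "test": "prod", "prod": "prod"}


def route(signal: str, env: str) -> str:
    """Return a label for the backend that receives `signal` from `env`."""
    backend = SIGNAL_BACKENDS[signal]  # raises KeyError for an unknown signal
    if backend == "loki":
        # Only logs depend on the environment in this sketch.
        return f"loki-{LOKI_INSTANCE_BY_ENV[env]}"
    return backend
```

For example, `route("logs", "test")` resolves to the prod Loki label, while traces resolve to Tempo regardless of environment.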

Faro Collection Pattern

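Browsers cannot reach an in-cluster Alloy endpoint directly: it has no public route, and posting to a different origin would also run into cross-origin restrictions. The pattern therefore keeps the collector URL same-origin: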

Browser SDK → https://app.example.com/faro (same origin)
                    ↓
              Backend proxy route (Express / FastAPI)
                    ↓
              http://alloy:12347 (in-cluster, local namespace)
                    ↓
              Alloy faro.receiver
              ├── errors / events / Web Vitals → Loki
              └── browser traces → Tempo
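
A minimal, framework-agnostic sketch of the forwarding step follows, using only the Python standard library. It is not the project's actual proxy route. The `build_upstream_request` helper, the header allow-list, and the `/collect` path on the receiver are assumptions to verify against your Alloy configuration and Faro SDK version.

```python
from urllib.request import Request

# Assumption: only the content type needs to reach Alloy; extend as required.
FORWARDED_HEADERS = ("content-type",)


def build_upstream_request(upstream_url: str, body: bytes, headers: dict) -> Request:
    """Build the POST that relays a browser Faro payload to in-cluster Alloy."""
    # Drop cookies, auth headers, and anything else not explicitly allowed.
    forwarded = {
        name: value
        for name, value in headers.items()
        if name.lower() in FORWARDED_HEADERS
    }
    # Assumption: the Alloy Faro receiver accepts payloads on /collect.
    target = upstream_url.rstrip("/") + "/collect"
    return Request(target, data=body, headers=forwarded, method="POST")
```

An Express or FastAPI handler for `/faro` would do the equivalent: read the raw request body, send the resulting request to the in-cluster URL (for instance with `urllib.request.urlopen`), and return the upstream status code to the browser. Filtering headers through an allow-list keeps session cookies out of the telemetry path.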

Helm Values Design

To keep this non-prescriptive for other deployments, every Faro setting is opt-in and defaults to off:

faro:
  enabled: false       # opt-in — no footprint on existing deployments
  collectorUrl: ""     # URL injected into the browser bundle
  proxy:
    enabled: false     # adds a /faro proxy route to the backend service
    path: "/faro"
    upstreamUrl: ""    # e.g. http://alloy:12347 for in-cluster Alloy

Teams that expose Alloy through an ingress Route keep faro.proxy.enabled: false and point faro.collectorUrl at their public endpoint. Teams not using Faro leave the defaults unchanged.
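
The combinations above could be sanity-checked before rendering the chart. The following is a sketch under assumptions: `check_faro_values` is a hypothetical helper, and the rules it enforces (a collector URL is required when Faro is enabled, an upstream URL is required when the proxy is enabled) are inferred from the field comments rather than taken from the chart itself.

```python
def check_faro_values(faro: dict) -> list:
    """Return human-readable problems; an empty list means the values are consistent."""
    problems = []
    if not faro.get("enabled", False):
        # Faro is off, so none of the other fields are read.
        return problems
    if not faro.get("collectorUrl"):
        problems.append("faro.collectorUrl must be set when faro.enabled is true")
    proxy = faro.get("proxy", {})
    if proxy.get("enabled", False):
        if not proxy.get("upstreamUrl"):
            problems.append("faro.proxy.upstreamUrl must be set when the proxy is enabled")
        if not proxy.get("path", "").startswith("/"):
            problems.append("faro.proxy.path must start with '/'")
    return problems
```

With the defaults shown in the values block, the function returns an empty list, which matches the intent that Faro leaves no footprint unless a team opts in.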


Environment Variable Conventions

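All services use standard OpenTelemetry environment variable names, with no custom naming. The one exception is PYROSCOPE_SERVER_ADDRESS, which is not an OpenTelemetry variable and is reserved for Phase 2 profiling. Values are set per-deployment via Helm chart values to point at the local namespace Alloy service: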

env:
  - name: OTEL_SERVICE_NAME
    value: "<service-name>"
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://alloy:4317"          # gRPC; use :4318 for HTTP
  - name: PYROSCOPE_SERVER_ADDRESS
    value: "http://alloy:12347"         # Phase 2
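
OpenTelemetry SDKs read these variables on their own, so services normally need no code for them. For a start-up sanity check or debug log line, something like the sketch below could be used. The `otlp_settings` helper is hypothetical, and inferring the protocol from the port relies only on the conventional 4317 (gRPC) and 4318 (HTTP) assignments.

```python
import os
from urllib.parse import urlparse

# Conventional OTLP ports: 4317 for gRPC, 4318 for HTTP.
PROTOCOL_BY_PORT = {4317: "grpc", 4318: "http/protobuf"}


def otlp_settings(environ=os.environ) -> dict:
    """Summarise the OTLP-related environment for logging at start-up."""
    endpoint = environ["OTEL_EXPORTER_OTLP_ENDPOINT"]  # KeyError if unset
    parsed = urlparse(endpoint)
    return {
        # Assumption: fall back to a placeholder when the service name is unset.
        "service_name": environ.get("OTEL_SERVICE_NAME", "unknown_service"),
        "host": parsed.hostname,
        "port": parsed.port,
        "protocol": PROTOCOL_BY_PORT.get(parsed.port, "unknown"),
    }
```

Passing a plain dict instead of `os.environ` makes the helper easy to exercise in tests without touching the process environment.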
