Architecture and Infrastructure
| Component | Details |
|---|---|
| Grafana | Visualization — queries Loki, Mimir, Tempo, Pyroscope |
| Grafana Alloy | Collector deployed as a Deployment to every managed namespace; services push OTLP; metrics scraped via ServiceMonitors |
| Grafana Loki | 3 instances (dev / test / prod), Simple Scalable mode; dev Alloys → dev Loki; test+prod Alloys → prod Loki |
| SeaweedFS | S3-compatible object storage; Dev instance + HA instance; all telemetry backends use S3 gateway only |
The components above need no redeployment, only configuration updates where noted.
| Component | Why | Effort |
|---|---|---|
| Grafana Tempo | Distributed trace storage — end-to-end request visibility across services | ~1 day |
| Grafana Mimir | Unified metrics backend — eliminates per-namespace datasource and alert rule duplication | ~1 day |
| Alloy config update | Enable OTLP receiver for traces + Mimir remote-write for metrics | ~2–3h |
| Grafana Dashboards | Import community dashboards for traces, service overview, namespace metrics | ~½ day |
Pyroscope (profiling backend) is deferred to Phase 2.
All new backends follow the same ArgoCD Helm chart pattern established for Loki, backed by new SeaweedFS buckets (telemetry-tempo, telemetry-mimir).
- Edge Collection: Services push traces, metrics, and profiles via OTLP to the Alloy deployment in their namespace. Language-native profiling SDKs (pyroscope-io / @pyroscope/nodejs) push profiling data to Alloy's Pyroscope receiver. Faro browser payloads are sent to a proxy route on each service's own backend, which forwards to the local Alloy Faro receiver without exposing any Alloy endpoint publicly.
- Routing: Alloy routes traces → Tempo, metrics → Mimir (remote-write), profiles → Pyroscope, logs → Loki. OTel Exemplars map Trace IDs onto metric data points for metric-to-trace drill-down.
- Storage Offload: Backends flush immutable blocks to SeaweedFS S3 buckets periodically, decoupling retention from container lifecycles.
- Visualization: Grafana queries Mimir and Tempo centrally, bypassing namespace isolation proxies. A latency spike on a Mimir chart can be clicked through directly to the causative Tempo trace.
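Browser-side telemetry cannot be sent directly to an in-cluster Alloy endpoint: the endpoint is not publicly exposed, and cross-origin requests from the page are subject to the browser's same-origin restrictions. The pattern is: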
Browser SDK → https://app.example.com/faro (same origin)
↓
Backend proxy route (Express / FastAPI)
↓
http://alloy:12347 (in-cluster, local namespace)
↓
Alloy faro.receive_http
├── errors / events / Web Vitals → Loki
└── browser traces → Tempo
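
The backend proxy hop in the diagram can be sketched roughly as follows. This is a minimal, stdlib-only illustration rather than the project's actual route. A real service would wire the same logic into its Express or FastAPI handler. The function names, the `subpath` parameter, and the header allow-list are assumptions made for the example. Only the `http://alloy:12347` address comes from the diagram.

```python
import urllib.request

# Address of the in-cluster Alloy Faro receiver, as shown in the diagram.
ALLOY_FARO_URL = "http://alloy:12347"

# Assumption: only the content type is relayed. Cookies and auth headers
# belong to the backend and are deliberately not forwarded to Alloy.
FORWARDED_HEADERS = ("content-type",)


def build_upstream_request(
    body: bytes,
    headers: dict[str, str],
    upstream_url: str = ALLOY_FARO_URL,
    subpath: str = "",
) -> urllib.request.Request:
    """Build the POST that relays one browser payload to Alloy.

    `subpath` is a hypothetical knob for receivers mounted below the root.
    """
    kept = {k: v for k, v in headers.items() if k.lower() in FORWARDED_HEADERS}
    url = upstream_url.rstrip("/")
    if subpath:
        url = url + "/" + subpath.lstrip("/")
    return urllib.request.Request(url, data=body, headers=kept, method="POST")


def forward_to_alloy(body: bytes, headers: dict[str, str], timeout: float = 5.0) -> int:
    """Send the payload upstream and return Alloy's HTTP status code."""
    request = build_upstream_request(body, headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

The route handler for `/faro` would call `forward_to_alloy(request_body, request_headers)` and return the resulting status to the browser, so the page only ever talks to its own origin.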
To keep this non-prescriptive for other deployments:
faro:
  enabled: false        # opt-in; no footprint on existing deployments
  collectorUrl: ""      # URL injected into the browser bundle
  proxy:
    enabled: false      # adds a /faro proxy route to the backend service
    path: "/faro"
    upstreamUrl: ""     # e.g. http://alloy:12347 for in-cluster Alloy

Teams that expose Alloy via an ingress Route set faro.proxy.enabled: false and point faro.collectorUrl at their public endpoint. Teams not using Faro leave everything false.
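
The three ways of using these values (Faro off, proxied through the backend, or sent straight to a public Alloy endpoint) could be checked at startup along these lines. This is a sketch under assumptions: the `FaroSettings` class and `faro_mode` function are hypothetical names, and the validation rules are one reasonable reading of the values above, not behaviour the chart is known to enforce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaroSettings:
    """Mirror of the `faro` values block (field names adapted to Python)."""
    enabled: bool = False
    collector_url: str = ""        # faro.collectorUrl
    proxy_enabled: bool = False    # faro.proxy.enabled
    proxy_path: str = "/faro"      # faro.proxy.path
    proxy_upstream_url: str = ""   # faro.proxy.upstreamUrl


def faro_mode(settings: FaroSettings) -> str:
    """Return 'disabled', 'proxy', or 'direct'.

    Raises ValueError when the values are inconsistent (assumed rules).
    """
    if not settings.enabled:
        return "disabled"
    if not settings.collector_url:
        raise ValueError("faro.enabled is true but faro.collectorUrl is empty")
    if settings.proxy_enabled:
        if not settings.proxy_upstream_url:
            raise ValueError("faro.proxy.enabled is true but faro.proxy.upstreamUrl is empty")
        return "proxy"
    return "direct"
```

With all defaults the function returns `"disabled"`, which matches the opt-in intent: a deployment that never sets these values gets no Faro footprint.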
All services use standard OpenTelemetry environment variable names — no custom naming. Values are set per-deployment via Helm chart values to point at the local namespace Alloy service:
env:
  - name: OTEL_SERVICE_NAME
    value: "<service-name>"
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://alloy:4317"    # gRPC; use :4318 for HTTP
  - name: PYROSCOPE_SERVER_ADDRESS
    value: "http://alloy:12347"   # Phase 2
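
OpenTelemetry SDKs normally read these variables themselves, so services need no extra code. Where a fail-fast check at startup is wanted, something like the following could catch a missing variable or an unexpected port. It is only a sketch: `otlp_settings` and the `protocol_hint` field are hypothetical, and the port-to-protocol mapping relies on the conventional OTLP ports noted in the comment above (4317 for gRPC, 4318 for HTTP).

```python
import os
from collections.abc import Mapping
from urllib.parse import urlsplit

# Conventional OTLP ports and the matching OTEL_EXPORTER_OTLP_PROTOCOL values.
PORT_PROTOCOL_HINTS = {4317: "grpc", 4318: "http/protobuf"}


def otlp_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Read the standard OTel variables and fail fast if one is missing."""
    service_name = env.get("OTEL_SERVICE_NAME", "")
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "")
    if not service_name:
        raise ValueError("OTEL_SERVICE_NAME is not set")
    if not endpoint:
        raise ValueError("OTEL_EXPORTER_OTLP_ENDPOINT is not set")

    # Infer the likely transport from the port; anything else is 'unknown'.
    port = urlsplit(endpoint).port
    return {
        "service_name": service_name,
        "endpoint": endpoint,
        "protocol_hint": PORT_PROTOCOL_HINTS.get(port, "unknown"),
    }
```

Calling `otlp_settings()` with no arguments inspects the process environment; passing a plain dict makes the check easy to exercise in tests.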