Phase 1 Infrastructure Rollout

Deploy the new telemetry backends following the same ArgoCD + Helm pattern established for Loki. This is a one-time cost and unblocks all service onboarding.

Total estimated effort: ~3.5–4 days
Prerequisite for: Phase 2: Service Onboarding

Deployment Architecture

Chart Selection

Component	Dev	Test	Prod
Tempo	`grafana-community/tempo` (monolithic)	`grafana-community/tempo` (monolithic)	`grafana-community/tempo-distributed`
Mimir	`mimir-distributed` (replicas: 1)	`mimir-distributed` (replicas: 1)	`mimir-distributed` (HA)
Pyroscope	deferred to Phase 3	deferred to Phase 3	deferred to Phase 3

Tempo chart rationale: The monolithic chart runs all components in a single binary — sufficient for dev/test at low traffic. tempo-distributed splits into independent components (distributor, ingester, querier, query-frontend, compactor, store-gateway) so each can scale independently in prod. Mimir has no monolithic chart; replicas: 1 on all components gives a minimal functional deployment for dev/test.

Environment Rollout Order

dev → test → prod

Deploy to dev first to validate Helm values and SeaweedFS S3 config. Promote proven values to test (different bucket names, same chart). Deploy prod last with distributed chart and HA replica counts.
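The chart selection and rollout order can be captured in a small lookup, sketched here in Python. The chart names and environment order come from the Chart Selection table; the names `ROLLOUT_ORDER`, `CHARTS`, and `rollout_plan` are illustrative and not part of any existing tooling.

```python
# Rollout order and chart choices, taken from the Chart Selection table.
ROLLOUT_ORDER = ["dev", "test", "prod"]

CHARTS = {
    "tempo": {
        "dev": "grafana-community/tempo",
        "test": "grafana-community/tempo",
        "prod": "grafana-community/tempo-distributed",
    },
    "mimir": {
        "dev": "mimir-distributed",
        "test": "mimir-distributed",
        "prod": "mimir-distributed",
    },
}


def rollout_plan(component: str) -> list[tuple[str, str]]:
    """Return (environment, chart) pairs in the order they should be deployed."""
    return [(env, CHARTS[component][env]) for env in ROLLOUT_ORDER]


for env, chart in rollout_plan("tempo"):
    print(f"{env}: {chart}")
```

Keeping the order in one list makes it harder to promote values to prod before they have been proven in dev and test.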

Resource Policy

All deployments follow this policy across all three backends:

Component type	CPU request	CPU limit	Memory request	Memory limit
Ingesters	✓	—	✓	✓ (4–8Gi)
Store-gateways	✓	—	✓	✓ (2–4Gi)
Distributors	✓	—	✓	—
Queriers	✓	—	✓	—
Query-frontend	✓	—	✓	—
Compactors	✓	—	✓	—

CPU limits are not set to avoid CFS quota throttling — write-path components (ingesters, distributors) experience latency spikes when throttled even when node capacity is available. CPU requests are retained for scheduling and HPA. Memory limits are set only on ingesters and store-gateways where unbounded growth would risk a node-level OOM; a single replica OOM kill is recoverable, a node OOM is not.
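As a rough illustration of the policy, the helper below builds a Kubernetes-style `resources` mapping that never sets a CPU limit and only allows a memory limit on ingesters and store-gateways. The function name `build_resources` and the per-replica values in the usage lines are assumptions for the sketch, not values taken from the Helm charts.

```python
# Component types that receive a memory limit, per the Resource Policy table.
MEMORY_LIMITED = {"ingester", "store-gateway"}


def build_resources(component_type, cpu_request, memory_request, memory_limit=None):
    """Build a resources mapping that follows the policy: no CPU limit ever,
    memory limit only on ingesters and store-gateways."""
    resources = {"requests": {"cpu": cpu_request, "memory": memory_request}}
    if component_type in MEMORY_LIMITED:
        if memory_limit is None:
            raise ValueError(f"{component_type} requires a memory limit")
        resources["limits"] = {"memory": memory_limit}
    elif memory_limit is not None:
        raise ValueError(f"{component_type} must not set a memory limit")
    return resources


# Illustrative per-replica values only.
print(build_resources("ingester", "200m", "1Gi", memory_limit="4Gi"))
print(build_resources("distributor", "100m", "256Mi"))
```

Raising on a policy violation, rather than silently dropping the limit, is one way to surface mistakes in environment-specific values during review.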

Storage Buckets

Following the dts-{type}-{env}-{purpose} convention established for Loki:

seaweedfs-dev:

dts-traces-dev-blocks
dts-metrics-dev-blocks, dts-metrics-dev-ruler, dts-metrics-dev-alertmanager

seaweedfs-ha (test + prod):

dts-traces-test-blocks, dts-traces-prod-blocks
dts-metrics-test-blocks, dts-metrics-test-ruler, dts-metrics-test-alertmanager
dts-metrics-prod-blocks, dts-metrics-prod-ruler, dts-metrics-prod-alertmanager

Phase 3 (Pyroscope) will add: dts-profiles-{dev,test,prod}-blocks
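The bucket lists above follow mechanically from the naming convention, so they can be generated rather than typed by hand. This is a minimal sketch; the helper names are hypothetical, and the mapping of environments to SeaweedFS instances is read off the lists above.

```python
# Purposes per telemetry type, as listed for Phase 1.
PURPOSES = {
    "traces": ["blocks"],
    "metrics": ["blocks", "ruler", "alertmanager"],
}

# dev has its own SeaweedFS instance; test and prod share seaweedfs-ha.
INSTANCE_FOR_ENV = {
    "dev": "seaweedfs-dev",
    "test": "seaweedfs-ha",
    "prod": "seaweedfs-ha",
}


def bucket_names(env: str) -> list[str]:
    """Return all Phase 1 bucket names for one environment."""
    return [
        f"dts-{kind}-{env}-{purpose}"
        for kind, purposes in PURPOSES.items()
        for purpose in purposes
    ]


def buckets_by_instance() -> dict[str, list[str]]:
    """Group the bucket names by the SeaweedFS instance that hosts them."""
    grouped: dict[str, list[str]] = {}
    for env, instance in INSTANCE_FOR_ENV.items():
        grouped.setdefault(instance, []).extend(bucket_names(env))
    return grouped


for instance, names in buckets_by_instance().items():
    print(instance, names)
```

Adding `"profiles": ["blocks"]` to `PURPOSES` would cover the Phase 3 buckets under the same convention.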

Grafana Datasources

Each environment gets a matched set of datasources in the shared Grafana instance:

Tempo (dev) / Tempo (test) / Tempo (prod)
Mimir (dev) / Mimir (test) / Mimir (prod)

Namespace Resource Analysis

Available Headroom (pre-deployment)

Namespace	CPU available	Memory available	Storage available
ca7f8f-dev	2530m	~11.3Gi	~33Gi
ca7f8f-test	5340m	~24.8Gi	~30Gi
ca7f8f-prod	1370m	~7.5Gi	~9Gi

Estimated New Resource Requests

Environment	CPU req	Memory req	Storage PVCs
dev (Tempo monolithic + Mimir ×1)	~400m	~2Gi	~4Gi
test (Tempo monolithic + Mimir ×1)	~900m	~4Gi	~8Gi
prod (Tempo-distributed + Mimir HA)	~2900m	~13.5Gi	~16Gi

Phase 3 addition to prod (Pyroscope HA): ~950m CPU, ~3.3Gi memory, ~2Gi storage.

Dev and test fit within existing quotas with comfortable headroom. Prod does not: the Phase 1 requests alone exceed the available CPU, memory, and storage. The quota increase request below covers both Phase 1 and Phase 3 to avoid a follow-up request.
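The comparison between available headroom and Phase 1 requests can be checked with a few lines of arithmetic. The figures are the approximate values from the two tables above (CPU in millicores, memory and storage in Gi); the `shortfall` helper is illustrative.

```python
# (cpu_m, memory_gi, storage_gi) from the headroom and estimated-request tables.
HEADROOM = {"dev": (2530, 11.3, 33), "test": (5340, 24.8, 30), "prod": (1370, 7.5, 9)}
REQUESTS = {"dev": (400, 2, 4), "test": (900, 4, 8), "prod": (2900, 13.5, 16)}

RESOURCES = ("cpu_m", "memory_gi", "storage_gi")


def shortfall(env: str) -> dict:
    """Return how far Phase 1 requests exceed headroom per resource (0 if they fit)."""
    return {
        name: max(0, round(requested - available, 1))
        for name, available, requested in zip(RESOURCES, HEADROOM[env], REQUESTS[env])
    }


for env in ("dev", "test", "prod"):
    print(env, shortfall(env))
```

Under these numbers dev and test show no shortfall, while prod is short on all three resources before Phase 3 is even considered.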

Prod Component Breakdown (Phase 1)

Component	Replicas	CPU req	Memory req	Storage PVC
Tempo distributor	×2	200m	512Mi	—
Tempo ingester	×3	600m	3Gi	6Gi (2Gi×3)
Tempo querier	×2	200m	1Gi	—
Tempo query-frontend	×2	100m	512Mi	—
Tempo compactor	×1	100m	512Mi	—
Tempo store-gateway	×2	200m	1Gi	—
Mimir distributor	×2	200m	512Mi	—
Mimir ingester	×3	600m	3Gi	6Gi
Mimir querier	×2	200m	1Gi	—
Mimir query-frontend	×2	100m	512Mi	—
Mimir compactor	×1	100m	512Mi	2Gi
Mimir store-gateway	×2	200m	1Gi	—
Mimir alertmanager	×2	100m	512Mi	2Gi
Phase 1 total		~2900m	~13.5Gi	~16Gi
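The Phase 1 total can be reproduced by summing the rows. This sketch assumes each row is already the total across replicas, which the "6Gi (2Gi×3)" note on the Tempo ingester row suggests; memory is expressed in Mi so the 512Mi rows add up cleanly.

```python
# (component, cpu_m, memory_mi, storage_gi), totals across replicas per row.
COMPONENTS = [
    ("tempo-distributor", 200, 512, 0),
    ("tempo-ingester", 600, 3072, 6),
    ("tempo-querier", 200, 1024, 0),
    ("tempo-query-frontend", 100, 512, 0),
    ("tempo-compactor", 100, 512, 0),
    ("tempo-store-gateway", 200, 1024, 0),
    ("mimir-distributor", 200, 512, 0),
    ("mimir-ingester", 600, 3072, 6),
    ("mimir-querier", 200, 1024, 0),
    ("mimir-query-frontend", 100, 512, 0),
    ("mimir-compactor", 100, 512, 2),
    ("mimir-store-gateway", 200, 1024, 0),
    ("mimir-alertmanager", 100, 512, 2),
]

cpu_m = sum(row[1] for row in COMPONENTS)
memory_gi = sum(row[2] for row in COMPONENTS) / 1024
storage_gi = sum(row[3] for row in COMPONENTS)

print(f"{cpu_m}m CPU, {memory_gi}Gi memory, {storage_gi}Gi storage")
```

The sums come to 2900m, 13.5Gi, and 16Gi, matching the total row.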

Prod Quota Increase Request

Resource	Current quota	Currently used	Phase 1 new	Phase 3 new	Total required	Requested	Headroom
CPU	4000m	2630m	~2900m	~950m	~6480m	10000m	~35%
Memory	16Gi	~8.5Gi	~13.5Gi	~3.3Gi	~25.3Gi	32Gi	~21%
Storage	64Gi	~55Gi	~16Gi	~2Gi	~73Gi	96Gi	~24%
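The "Total required" and "Headroom" columns follow from the other columns: total is current usage plus the Phase 1 and Phase 3 additions, and headroom is the unused share of the requested quota. A short sketch of that arithmetic, with a hypothetical `quota_row` helper:

```python
def quota_row(used, phase1, phase3, requested):
    """Return (total required, headroom as a whole percent of the requested quota)."""
    total = round(used + phase1 + phase3, 1)
    headroom_pct = round((requested - total) / requested * 100)
    return total, headroom_pct


rows = {
    "CPU (m)": quota_row(2630, 2900, 950, 10000),
    "Memory (Gi)": quota_row(8.5, 13.5, 3.3, 32),
    "Storage (Gi)": quota_row(55, 16, 2, 96),
}

for name, (total, headroom) in rows.items():
    print(f"{name}: total {total}, headroom {headroom}%")
```

Because the inputs are approximate, the percentages are rounded to whole numbers, as they are in the table.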

Justification

We are deploying Grafana Tempo (distributed tracing) and Grafana Mimir (long-term metrics) as high-availability backends to ca7f8f-prod in Phase 1, with Grafana Pyroscope (continuous profiling) to follow in Phase 3. These are infrastructure-tier services that underpin monitoring, alerting, and observability for all production workloads.

HA requirement: A minimum of 3 ingester replicas per backend (replication factor 2) is required to survive a single pod failure without write-path data loss. Stateless components (distributors, queriers, query-frontends, store-gateways) run at 2 replicas to allow rolling updates without downtime. Ingesters require local WAL PVCs — data is flushed to SeaweedFS S3 on graceful shutdown, but local PVC ensures no loss during pod restarts or rolling deployments.

No CPU limits: CPU limits are intentionally not set. The Linux CFS scheduler throttles pods that exceed their CPU quota within a 100ms window, causing latency spikes in write-path components even when node capacity is available. CPU requests are retained and are sufficient for scheduling and HPA-based autoscaling.
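To make the throttling effect concrete, the sketch below models a burst of CPU work under a CFS quota. It is a deliberately simplified model: it assumes the default 100ms period, work spread evenly across threads, and no competing load on the node. The function names and the example numbers are illustrative.

```python
import math

CFS_PERIOD_MS = 100  # default CFS period


def cfs_quota_ms(cpu_limit_millicores, period_ms=CFS_PERIOD_MS):
    """CPU time a container may consume per period under the given limit."""
    return cpu_limit_millicores / 1000 * period_ms


def burst_wall_time_ms(cpu_work_ms, threads, cpu_limit_millicores=None,
                       period_ms=CFS_PERIOD_MS):
    """Approximate wall time to finish a burst of CPU work split across threads."""
    unthrottled = cpu_work_ms / threads
    if cpu_limit_millicores is None:
        return unthrottled
    quota = cfs_quota_ms(cpu_limit_millicores, period_ms)
    if quota >= threads * period_ms:
        # The threads cannot burn the quota within one period: no throttling.
        return unthrottled
    # Periods in which the quota runs out before the work is finished.
    exhausted_periods = math.ceil(cpu_work_ms / quota) - 1
    remaining_ms = cpu_work_ms - exhausted_periods * quota
    return exhausted_periods * period_ms + remaining_ms / threads


# 200ms of CPU work on 4 threads, with and without a 1000m limit.
print(burst_wall_time_ms(200, threads=4))
print(burst_wall_time_ms(200, threads=4, cpu_limit_millicores=1000))
```

In this model the burst takes 50ms without a limit and 125ms with a 1000m limit, because four threads exhaust the 100ms quota after 25ms and then wait for the next period, regardless of idle capacity on the node.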

Single request for Phases 1 and 3: Pyroscope prod resource requirements (~950m CPU, ~3.3Gi memory, ~2Gi storage) are included in this request to avoid a follow-up quota increase when Phase 3 is deployed.

Requested increases: CPU 4000m → 10000m, Memory 16Gi → 32Gi, Storage 64Gi → 96Gi.

Issue: Deploy Grafana Tempo

Labels: phase-1, infrastructure, tracing
Estimated Effort: ~1.5 days

Description

Deploy Grafana Tempo as the distributed trace storage backend across all three environments. Tempo is the core requirement for end-to-end request visibility — without it, no traces from instrumented services can be stored or queried.

Requirements

ArgoCD ApplicationSet (or three Applications) for Tempo following the same pattern as Loki
grafana-community/tempo for dev and test; grafana-community/tempo-distributed for prod
SeaweedFS buckets created per the naming convention above
Tempo configured to use SeaweedFS S3 gateway as object storage backend
Resource policy applied (no CPU limits; memory limits on ingesters and store-gateways only)
Grafana datasources added: Tempo (dev), Tempo (test), Tempo (prod)

Prod replica baseline

distributor×2, ingester×3 (RF=2), querier×2, query-frontend×2, compactor×1, store-gateway×2

Expected Outcome

All three Grafana Tempo datasources are functional. A test trace (e.g. from a manually instrumented script or telemetrygen) is queryable in the Explore view for each environment.

Acceptance Criteria

ArgoCD Application(s) for Tempo are healthy and synced in all three namespaces
dts-traces-{dev,test,prod}-blocks buckets exist in the appropriate SeaweedFS instance
Tempo is storing trace data to SeaweedFS (verify via bucket contents after test trace)
Tempo (dev), Tempo (test), Tempo (prod) datasources configured in Grafana and returning results
No CPU limits set; memory limits applied to ingesters and store-gateways only

Issue: Deploy Grafana Mimir

Labels: phase-1, infrastructure, metrics
Estimated Effort: ~1.5 days

Description

Deploy Grafana Mimir as a unified long-term metrics backend across all three environments. Mimir consolidates the current pattern of one Grafana datasource per namespace into a single queryable backend, enables OTel Exemplars (trace IDs embedded in metric datapoints linking directly to Tempo), and supports cross-namespace alerting rules written once.

Requirements

ArgoCD ApplicationSet (or three Applications) for mimir-distributed with environment-specific values
Dev and test: all component replicas set to 1 (minimal deployment)
Prod: HA replica counts per breakdown above; PodDisruptionBudgets enabled
SeaweedFS buckets created: dts-metrics-{env}-{blocks,ruler,alertmanager}
Resource policy applied (no CPU limits; memory limits on ingesters and store-gateways only)
Grafana datasources added: Mimir (dev), Mimir (test), Mimir (prod)
Alloy remote-write config pointing to Mimir (covered in Alloy config issue below)

Expected Outcome

A single Mimir datasource per environment replaces the per-namespace Prometheus datasources. Metrics from all namespaces are queryable without switching datasources.

Acceptance Criteria

ArgoCD Application(s) for Mimir are healthy and synced in all three namespaces
dts-metrics-{dev,test,prod}-{blocks,ruler,alertmanager} buckets exist in the appropriate SeaweedFS instance
Alloy is remote-writing metrics to Mimir (verify via Mimir query returning namespace metrics)
Mimir (dev), Mimir (test), Mimir (prod) datasources configured in Grafana and returning results
No CPU limits set; memory limits applied to ingesters and store-gateways only
Existing per-namespace Prometheus datasources documented for deprecation

Issue: Update Alloy Configuration

Labels: phase-1, infrastructure, alloy
Estimated Effort: ~2–3h

Description

Update the shared Alloy Helm values to enable OTLP trace receiving (so instrumented services can push traces to Tempo) and Mimir remote-write (so Alloy forwards scraped metrics to Mimir). This is a small diff to the existing shared values file — the same pattern used for Loki log routing.

Requirements

Enable an otelcol.receiver.otlp component in the Alloy config (gRPC on port 4317, HTTP on port 4318)
Add prometheus.remote_write blocks pointing to the appropriate Mimir instance per namespace (dev → Mimir dev, test → Mimir test, prod → Mimir prod)
Route received OTLP traces to the appropriate Tempo instance per namespace
Verify no disruption to existing Loki log forwarding

Expected Outcome

Alloy accepts OTLP pushes from instrumented services and routes traces to the correct Tempo instance. Scraped metrics are remote-written to the correct Mimir instance per environment.

Acceptance Criteria

Alloy config diff reviewed and merged
ArgoCD syncs the updated Alloy config to all namespaces without errors
A test OTLP push to alloy:4317 (gRPC) or http://alloy:4318 (HTTP) in each namespace results in a trace visible in the matching Tempo datasource
Mimir receives metrics from Alloy in each environment (verify via Mimir query)
Loki log forwarding unaffected
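For the test OTLP push, one option that needs no SDK is to post a single span as OTLP/HTTP JSON using only the Python standard library. This is a sketch under assumptions: the `alloy` hostname comes from the acceptance criterion, port 4318 is the HTTP receiver port from the requirements, `/v1/traces` is the standard OTLP/HTTP traces path, and the service and span names are made up for the smoke test.

```python
import json
import secrets
import time
import urllib.request


def build_test_trace(service_name: str, span_name: str = "smoke-test") -> dict:
    """Build a minimal OTLP/JSON payload containing one one-millisecond span."""
    now = time.time_ns()
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeSpans": [{
                "scope": {"name": "manual-smoke-test"},
                "spans": [{
                    "traceId": secrets.token_hex(16),  # 32 hex characters
                    "spanId": secrets.token_hex(8),    # 16 hex characters
                    "name": span_name,
                    "kind": 1,  # SPAN_KIND_INTERNAL
                    "startTimeUnixNano": str(now - 1_000_000),
                    "endTimeUnixNano": str(now),
                }],
            }],
        }]
    }


def push_trace(base_url: str, payload: dict) -> int:
    """POST the payload to an OTLP/HTTP endpoint and return the HTTP status."""
    request = urllib.request.Request(
        f"{base_url}/v1/traces",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

From a pod in the target namespace, `push_trace("http://alloy:4318", build_test_trace("otlp-smoke-test"))` would send the span; the generated `traceId` can then be searched for in the matching Tempo datasource.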

Issue: Import Grafana Dashboards

Labels: phase-1, infrastructure, dashboards
Estimated Effort: ~½ day

Description

Import and configure pre-built community dashboards for the new telemetry backends. These provide immediate operational value as soon as services begin pushing traces and metrics.

Requirements

Trace explorer / service map dashboard (Tempo)
Service overview dashboard (RED metrics: Rate, Errors, Duration)
Namespace metrics overview dashboard (Mimir)
Dashboards parameterised by datasource variable so a single dashboard covers dev/test/prod via the environment selector
Light customisation to match namespace/label conventions
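The datasource-variable requirement could look roughly like the following when the dashboard JSON is generated for provisioning. The field names follow the Grafana dashboard JSON model as commonly exported and should be checked against the Grafana version in use; the variable names, dashboard title, and name-prefix regex are assumptions based on the "Tempo (dev)" / "Mimir (dev)" naming above. Mimir is queried through Grafana's Prometheus datasource type, hence the `prometheus` plugin id.

```python
import json


def datasource_variable(name: str, plugin_id: str, name_prefix: str) -> dict:
    """Template variable listing datasources of one type whose name starts with a prefix."""
    return {
        "name": name,
        "label": name_prefix,
        "type": "datasource",
        "query": plugin_id,
        # Matches names such as "Mimir (dev)", "Mimir (test)", "Mimir (prod)".
        "regex": f"/^{name_prefix} \\(/",
    }


def dashboard_skeleton(title: str) -> dict:
    """Dashboard with environment selection via datasource variables and no panels yet."""
    return {
        "title": title,
        "templating": {"list": [
            datasource_variable("metrics_ds", "prometheus", "Mimir"),
            datasource_variable("traces_ds", "tempo", "Tempo"),
        ]},
        # Panels would reference the variable, e.g.
        # {"datasource": {"type": "prometheus", "uid": "${metrics_ds}"}}.
        "panels": [],
    }


print(json.dumps(dashboard_skeleton("Namespace metrics overview"), indent=2))
```

Writing the output to a file picked up by Grafana's dashboard provisioning keeps the dashboards defined in config rather than created in the UI.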

Expected Outcome

Grafana home shows populated dashboards with trace, service, and metrics views available out of the box once services are onboarded in Phase 2.

Acceptance Criteria

Tempo trace explorer dashboard imported and functional
Service overview dashboard populates once at least one service is instrumented
Namespace metrics dashboard queries Mimir (not per-namespace Prometheus)
Dashboards are provisioned via config (not manually created in UI)
Datasource variable allows switching between dev/test/prod environments
