Phase 1 Infrastructure Rollout

Deploy the new telemetry backends following the same ArgoCD + Helm pattern established for Loki. This is a one-time cost and unblocks all service onboarding.

Total estimated effort: ~3.5–4 days

Prerequisite for: Phase 2 — Service Onboarding

Deployment Architecture

Chart Selection

| Component | Dev | Test | Prod |
| --- | --- | --- | --- |
| Tempo | `grafana-community/tempo` (monolithic) | `grafana-community/tempo` (monolithic) | `grafana-community/tempo-distributed` |
| Mimir | `mimir-distributed` (replicas: 1) | `mimir-distributed` (replicas: 1) | `mimir-distributed` (HA) |
| Pyroscope | deferred to Phase 3 | deferred to Phase 3 | deferred to Phase 3 |

Tempo chart rationale: The monolithic chart runs all components in a single binary, which is sufficient for dev/test at low traffic. `tempo-distributed` splits them into independent components (distributor, ingester, querier, query-frontend, compactor, store-gateway) so each can scale independently in prod. Mimir has no monolithic chart; setting `replicas: 1` on all components gives a minimal functional deployment for dev/test.

Environment Rollout Order

dev → test → prod

Deploy to dev first to validate the Helm values and the SeaweedFS S3 config. Promote the proven values to test (same chart, different bucket names). Deploy prod last, using `tempo-distributed` for Tempo and HA replica counts for both backends.

Resource Policy

All deployments follow this policy across all three backends:

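| Component type | CPU request | CPU limit | Memory request | Memory limit |
| --- | --- | --- | --- | --- |
| Ingesters | ✓ | - | ✓ | ✓ (4–8Gi) |
| Store-gateways | ✓ | - | ✓ | ✓ (2–4Gi) |
| Distributors | ✓ | - | ✓ | - |
| Queriers | ✓ | - | ✓ | - |
| Query-frontend | ✓ | - | ✓ | - |
| Compactors | ✓ | - | ✓ | - |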

CPU limits are not set to avoid CFS quota throttling — write-path components (ingesters, distributors) experience latency spikes when throttled even when node capacity is available. CPU requests are retained for scheduling and HPA. Memory limits are set only on ingesters and store-gateways where unbounded growth would risk a node-level OOM; a single replica OOM kill is recoverable, a node OOM is not.
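
The policy can be sketched as a small helper that builds a Kubernetes-style `resources` block per component type. This is an illustration only: the helper name `resources_for` and the request values in the usage lines are hypothetical. Only the rules themselves (never a CPU limit, a memory limit on ingesters and store-gateways only) come from the table above.

```python
# Component types that receive a memory limit under the policy above.
MEMORY_LIMITED = {"ingester", "store-gateway"}


def resources_for(component_type, cpu_request, memory_request, memory_limit=None):
    """Build a Kubernetes-style resources dict that follows the policy.

    Hypothetical helper: the name and signature are illustrative only.
    """
    resources = {"requests": {"cpu": cpu_request, "memory": memory_request}}
    if component_type in MEMORY_LIMITED:
        if memory_limit is None:
            raise ValueError(f"{component_type} requires a memory limit")
        resources["limits"] = {"memory": memory_limit}
    elif memory_limit is not None:
        raise ValueError(f"{component_type} must not set a memory limit")
    # A CPU limit is never emitted, so no CFS quota is applied to the container.
    return resources


# Illustrative values only; real requests belong in the per-environment Helm values.
ingester = resources_for("ingester", "200m", "1Gi", memory_limit="4Gi")
distributor = resources_for("distributor", "100m", "256Mi")
print(ingester)
print(distributor)
```

Rejecting a memory limit on other component types keeps a values file from drifting away from the policy unnoticed.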

Storage Buckets

Following the dts-{type}-{env}-{purpose} convention established for Loki:

seaweedfs-dev:

dts-traces-dev-blocks
dts-metrics-dev-blocks, dts-metrics-dev-ruler, dts-metrics-dev-alertmanager

seaweedfs-ha (test + prod):

dts-traces-test-blocks, dts-traces-prod-blocks
dts-metrics-test-blocks, dts-metrics-test-ruler, dts-metrics-test-alertmanager
dts-metrics-prod-blocks, dts-metrics-prod-ruler, dts-metrics-prod-alertmanager

Phase 3 (Pyroscope) will add: dts-profiles-{dev,test,prod}-blocks
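
As a quick cross-check of the naming convention, the Phase 1 bucket lists above can be generated rather than typed by hand. The sketch below is illustrative; the function and variable names are hypothetical, and the type/purpose combinations are copied from the lists above.

```python
def bucket_name(signal_type, env, purpose):
    """Return a bucket name following dts-{type}-{env}-{purpose}."""
    return f"dts-{signal_type}-{env}-{purpose}"


# Purposes per signal type, taken from the Phase 1 bucket lists.
PURPOSES = {
    "traces": ["blocks"],
    "metrics": ["blocks", "ruler", "alertmanager"],
}

# SeaweedFS instance that hosts each environment's buckets.
INSTANCE_FOR_ENV = {
    "dev": "seaweedfs-dev",
    "test": "seaweedfs-ha",
    "prod": "seaweedfs-ha",
}


def phase1_buckets():
    """Group all Phase 1 bucket names by SeaweedFS instance."""
    buckets = {}
    for env, instance in INSTANCE_FOR_ENV.items():
        for signal_type, purposes in PURPOSES.items():
            for purpose in purposes:
                name = bucket_name(signal_type, env, purpose)
                buckets.setdefault(instance, []).append(name)
    return buckets


for instance, names in phase1_buckets().items():
    print(instance, names)
```

This yields four buckets on seaweedfs-dev and eight on seaweedfs-ha, matching the lists above.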

Grafana Datasources

Each environment gets a matched set of datasources in the shared Grafana instance:

Tempo (dev) / Tempo (test) / Tempo (prod)
Mimir (dev) / Mimir (test) / Mimir (prod)

Namespace Resource Analysis

Available Headroom (pre-deployment)

| Namespace | CPU available | Memory available | Storage available |
| --- | --- | --- | --- |
| ca7f8f-dev | 2530m | ~11.3Gi | ~33Gi |
| ca7f8f-test | 5340m | ~24.8Gi | ~30Gi |
| ca7f8f-prod | 1370m | ~7.5Gi | ~9Gi |

Estimated New Resource Requests

| Environment | CPU req | Memory req | Storage PVCs |
| --- | --- | --- | --- |
| dev (Tempo monolithic + Mimir ×1) | ~400m | ~2Gi | ~4Gi |
| test (Tempo monolithic + Mimir ×1) | ~900m | ~4Gi | ~8Gi |
| prod (Tempo-distributed + Mimir HA) | ~2900m | ~13.5Gi | ~16Gi |

Phase 3 addition to prod (Pyroscope HA): ~950m CPU, ~3.3Gi memory, ~2Gi storage.

Dev and test fit within existing quotas with comfortable headroom. Prod does not: the Phase 1 requests alone exceed the available CPU, memory, and storage, so a quota increase is required. The request below covers both Phase 1 and Phase 3 to avoid a follow-up request.
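
The fit check behind this conclusion is simple subtraction. The sketch below copies the figures from the two tables above (CPU in millicores, memory and storage in Gi); the function and variable names are hypothetical.

```python
# Available headroom and estimated new requests, copied from the tables above.
AVAILABLE = {
    "dev": {"cpu": 2530, "memory": 11.3, "storage": 33},
    "test": {"cpu": 5340, "memory": 24.8, "storage": 30},
    "prod": {"cpu": 1370, "memory": 7.5, "storage": 9},
}
REQUESTED = {
    "dev": {"cpu": 400, "memory": 2, "storage": 4},
    "test": {"cpu": 900, "memory": 4, "storage": 8},
    "prod": {"cpu": 2900, "memory": 13.5, "storage": 16},
}


def shortfalls(env):
    """Return each resource whose new requests exceed the available headroom."""
    return {
        resource: round(REQUESTED[env][resource] - available, 1)
        for resource, available in AVAILABLE[env].items()
        if REQUESTED[env][resource] > available
    }


for env in AVAILABLE:
    print(env, shortfalls(env) or "fits")
```

With these numbers, dev and test report no shortfall, while prod is short by roughly 1530m CPU, 6Gi memory, and 7Gi storage for Phase 1 alone.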

Prod Component Breakdown (Phase 1)

| Component | Replicas | CPU req | Memory req | Storage PVC |
| --- | --- | --- | --- | --- |
| Tempo distributor | ×2 | 200m | 512Mi | - |
| Tempo ingester | ×3 | 600m | 3Gi | 6Gi (2Gi×3) |
| Tempo querier | ×2 | 200m | 1Gi | - |
| Tempo query-frontend | ×2 | 100m | 512Mi | - |
| Tempo compactor | ×1 | 100m | 512Mi | - |
| Tempo store-gateway | ×2 | 200m | 1Gi | - |
| Mimir distributor | ×2 | 200m | 512Mi | - |
| Mimir ingester | ×3 | 600m | 3Gi | 6Gi |
| Mimir querier | ×2 | 200m | 1Gi | - |
| Mimir query-frontend | ×2 | 100m | 512Mi | - |
| Mimir compactor | ×1 | 100m | 512Mi | 2Gi |
| Mimir store-gateway | ×2 | 200m | 1Gi | - |
| Mimir alertmanager | ×2 | 100m | 512Mi | 2Gi |
| Phase 1 total | | ~2900m | ~13.5Gi | ~16Gi |
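
The total row can be re-derived from the component rows. The sketch below assumes each row is already the total across its replicas, which is the reading under which the Tempo ingester PVC entry (6Gi = 2Gi×3) and the total row are consistent. The short component keys are hypothetical labels, not chart names.

```python
# (component, cpu_millicores, memory_Mi, storage_Gi), one tuple per table row.
# Values are assumed to be totals across replicas, not per-replica figures.
ROWS = [
    ("tempo-distributor", 200, 512, 0),
    ("tempo-ingester", 600, 3072, 6),
    ("tempo-querier", 200, 1024, 0),
    ("tempo-query-frontend", 100, 512, 0),
    ("tempo-compactor", 100, 512, 0),
    ("tempo-store-gateway", 200, 1024, 0),
    ("mimir-distributor", 200, 512, 0),
    ("mimir-ingester", 600, 3072, 6),
    ("mimir-querier", 200, 1024, 0),
    ("mimir-query-frontend", 100, 512, 0),
    ("mimir-compactor", 100, 512, 2),
    ("mimir-store-gateway", 200, 1024, 0),
    ("mimir-alertmanager", 100, 512, 2),
]

total_cpu_m = sum(row[1] for row in ROWS)
total_memory_gi = sum(row[2] for row in ROWS) / 1024  # Mi -> Gi
total_storage_gi = sum(row[3] for row in ROWS)

print(f"CPU: {total_cpu_m}m, memory: {total_memory_gi}Gi, storage: {total_storage_gi}Gi")
```

Under that assumption the sums come to 2900m CPU, 13.5Gi memory, and 16Gi storage, in line with the total row.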

Prod Quota Increase Request

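| Resource | Current quota | Currently used | Phase 1 new | Phase 3 new | Total required | Requested | Headroom (of requested) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CPU | 4000m | 2630m | ~2900m | ~950m | ~6480m | 10000m | ~35% |
| Memory | 16Gi | ~8.5Gi | ~13.5Gi | ~3.3Gi | ~25.3Gi | 32Gi | ~21% |
| Storage | 64Gi | ~55Gi | ~16Gi | ~2Gi | ~73Gi | 96Gi | ~24% |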
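
The derived columns follow from two formulas: total required = currently used + Phase 1 new + Phase 3 new, and headroom = (requested − total required) / requested. A minimal sketch that recomputes them from the table's inputs (the names `quota_row` and `QUOTA_INPUTS` are hypothetical):

```python
def quota_row(used, phase1, phase3, requested):
    """Return (total required, headroom as a fraction of the requested quota)."""
    total = used + phase1 + phase3
    headroom = (requested - total) / requested
    return total, headroom


# (currently used, Phase 1 new, Phase 3 new, requested), copied from the table.
QUOTA_INPUTS = {
    "cpu_m": (2630, 2900, 950, 10000),
    "memory_gi": (8.5, 13.5, 3.3, 32),
    "storage_gi": (55, 16, 2, 96),
}

for name, inputs in QUOTA_INPUTS.items():
    total, headroom = quota_row(*inputs)
    print(f"{name}: total={total:g}, headroom={headroom:.0%}")
```

Rerunning this after any change to a component's requests shows at once whether the requested quota still leaves the intended margin.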

Justification

We are deploying Grafana Tempo (distributed tracing) and Grafana Mimir (long-term metrics) as high-availability backends to ca7f8f-prod in Phase 1, with Grafana Pyroscope (continuous profiling) to follow in Phase 3. These are infrastructure-tier services that underpin monitoring, alerting, and observability for all production workloads.

HA requirement: A minimum of 3 ingester replicas per backend (replication factor 2) is required to survive a single pod failure without write-path data loss. Stateless components (distributors, queriers, query-frontends, store-gateways) run at 2 replicas to allow rolling updates without downtime. Ingesters require local WAL PVCs: data is flushed to SeaweedFS S3 on graceful shutdown, but the local PVC ensures no data is lost during pod restarts or rolling deployments.

No CPU limits: CPU limits are intentionally not set. The Linux CFS scheduler throttles pods that exceed their CPU quota within a 100ms window, causing latency spikes in write-path components even when node capacity is available. CPU requests are retained and are sufficient for scheduling and HPA-based autoscaling.

Single request for Phases 1 and 3: Pyroscope prod resource requirements (~950m CPU, ~3.3Gi memory, ~2Gi storage) are included in this request to avoid a follow-up quota increase when Phase 3 is deployed.

Requested increases: CPU 4000m → 10000m, Memory 16Gi → 32Gi, Storage 64Gi → 96Gi.

Issue: Deploy Grafana Tempo

Estimated Effort: ~1.5 days

https://github.com/bcgov/DITP-DevOps/issues/331

Issue: Deploy Grafana Mimir

Estimated Effort: ~1.5 days

https://github.com/bcgov/DITP-DevOps/issues/332

Issue: Update Alloy Configuration

Estimated Effort: ~2–3h

https://github.com/bcgov/DITP-DevOps/issues/333

Issue: Import Grafana Dashboards

Estimated Effort: ~½ day

https://github.com/bcgov/DITP-DevOps/issues/334
