Parent Issue
Tracked by datum-cloud/enhancements#682 (Launch Workload Compute Service — "UFOs")
Summary
The compute service has no observability design. Operators have no structured way to monitor instance health, resource utilization, or service-level indicators across the fleet. Customers have no way to see their workload's stdout/stderr or resource consumption. This work defines and implements the observability stack for compute, covering both the operator and consumer layers.
Goals
Operator observability
- Define and emit metrics covering instance lifecycle (create, start, stop, terminate rates), resource utilization (vCPU, memory), scheduling latency, and error rates
- Establish structured logging conventions for compute controllers (instance controller, workload deployment controller)
- Define alerting thresholds for fleet-level health indicators
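To make the operator goals concrete, here is a minimal sketch of how lifecycle counters, scheduling latency, and an error-rate alert threshold could fit together. It is illustrative only: the `FleetMetrics` class, the `breaches_threshold` helper, and the 5% threshold are hypothetical names and values, and the in-memory store stands in for whichever metrics backend is eventually chosen (see Open Questions). Python is used purely for brevity; it says nothing about the controllers' implementation language.

```python
from collections import Counter
from dataclasses import dataclass, field

# Lifecycle transitions named in the goals above.
LIFECYCLE_EVENTS = ("create", "start", "stop", "terminate")


@dataclass
class FleetMetrics:
    """In-memory stand-in for a real metrics backend (hypothetical)."""

    lifecycle_total: Counter = field(default_factory=Counter)
    errors_total: Counter = field(default_factory=Counter)
    scheduling_latency_seconds: list = field(default_factory=list)

    def record_lifecycle(self, event: str, ok: bool = True) -> None:
        """Count one lifecycle transition, and one error if it failed."""
        if event not in LIFECYCLE_EVENTS:
            raise ValueError(f"unknown lifecycle event: {event}")
        self.lifecycle_total[event] += 1
        if not ok:
            self.errors_total[event] += 1

    def record_scheduling_latency(self, seconds: float) -> None:
        """Store one scheduling latency observation."""
        self.scheduling_latency_seconds.append(seconds)

    def error_rate(self, event: str) -> float:
        """Fraction of transitions of this kind that failed (0.0 if none seen)."""
        total = self.lifecycle_total[event]
        return self.errors_total[event] / total if total else 0.0


def breaches_threshold(metrics: FleetMetrics, event: str, max_error_rate: float) -> bool:
    """Return True when the observed error rate exceeds the alert threshold."""
    return metrics.error_rate(event) > max_error_rate


if __name__ == "__main__":
    metrics = FleetMetrics()
    for ok in (True, True, True, False):
        metrics.record_lifecycle("start", ok=ok)
    metrics.record_scheduling_latency(0.8)
    # Assumed threshold of 5%; the real value is part of this issue's scope.
    print(breaches_threshold(metrics, "start", max_error_rate=0.05))
```

In a real deployment the counters would more likely be labeled series in the chosen backend, and the threshold would live in alerting rules rather than in application code.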
Consumer observability
- Define how compute workload stdout/stderr is collected and surfaced to the customer (relates to enhancements#714 for the broader log platform design)
- Define which instance lifecycle events are visible to the consumer (start, stop, crash, OOM)
- Ensure consumer-visible metrics (instance uptime, resource allocation) are queryable
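One possible shape for the consumer-facing event stream is sketched below, assuming internal events are filtered down to the four kinds listed above before being surfaced. The `InstanceEvent` fields, the `consumer_view` filter, and the `scheduled` internal kind are hypothetical; the actual schema and delivery mechanism are still to be defined.

```python
import json
from dataclasses import asdict, dataclass

# Event kinds the goals list as consumer-visible.
CONSUMER_VISIBLE = frozenset({"start", "stop", "crash", "oom"})


@dataclass(frozen=True)
class InstanceEvent:
    """Hypothetical lifecycle event record for a single instance."""

    instance: str
    kind: str
    timestamp: str  # assumed RFC 3339, UTC
    message: str = ""


def consumer_view(events):
    """Keep only the events a consumer is allowed to see, in original order."""
    return [e for e in events if e.kind in CONSUMER_VISIBLE]


def to_json(event: InstanceEvent) -> str:
    """Serialize an event with stable key ordering."""
    return json.dumps(asdict(event), sort_keys=True)


if __name__ == "__main__":
    events = [
        InstanceEvent("web-0", "scheduled", "2025-01-01T00:00:00Z"),  # internal only
        InstanceEvent("web-0", "start", "2025-01-01T00:00:02Z"),
        InstanceEvent("web-0", "oom", "2025-01-01T00:05:00Z", "memory limit exceeded"),
    ]
    for event in consumer_view(events):
        print(to_json(event))
```

Using an explicit allowlist keeps internal scheduling and controller events out of the consumer view by default, so adding a new internal event kind never exposes it accidentally.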
Non-Goals
- Billing/metering telemetry (tracked in unikraft-provider#5)
- Quota utilization metrics (tracked in compute#90)
- The customer log platform itself (tracked in enhancements#714) — this issue defines the compute-specific requirements and integration points
Open Questions
- Which metrics backend do compute controllers emit to (same Prometheus stack as other Datum services)?
- Is consumer log access via the portal, API, or a Grafana datasource endpoint?