# Observability design for the compute service #99

## Parent Issue

Tracked by https://github.com/datum-cloud/enhancements/issues/682 (Launch Workload Compute Service — "UFOs")

## Summary

The compute service has no observability design. Operators have no structured way to monitor instance health, resource utilization, or service-level indicators across the fleet, and customers cannot see their workload's stdout/stderr or resource consumption. This work defines and implements the observability stack for compute, covering both the operator and consumer layers.

## Goals

**Operator observability**
- Define and emit metrics covering instance lifecycle (create, start, stop, terminate rates), resource utilization (vCPU, memory), scheduling latency, and error rates
- Establish structured logging conventions for compute controllers (instance controller, workload deployment controller)
- Define alerting thresholds for fleet-level health indicators
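
As a starting point for discussion, the operator goals above could be captured as a small metric catalog plus a fleet-level error-rate check. This is only a sketch: the metric names, label sets, and the 5% threshold are hypothetical placeholders, not decisions made in this issue, and the metrics backend is still an open question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricSpec:
    """Backend-neutral description of one metric (hypothetical schema)."""
    name: str
    kind: str                # "counter", "gauge", or "histogram"
    labels: tuple[str, ...]
    description: str


# Hypothetical catalog covering the areas listed in the goals:
# lifecycle rates, resource utilization, and scheduling latency.
OPERATOR_METRICS = (
    MetricSpec("compute_instance_lifecycle_total", "counter",
               ("operation", "result"),
               "Lifecycle operations: create, start, stop, terminate."),
    MetricSpec("compute_instance_vcpu_utilization_ratio", "gauge",
               ("instance",),
               "vCPU in use divided by vCPU allocated."),
    MetricSpec("compute_instance_memory_utilization_ratio", "gauge",
               ("instance",),
               "Memory in use divided by memory allocated."),
    MetricSpec("compute_scheduling_latency_seconds", "histogram",
               ("result",),
               "Time from instance creation request to placement."),
)


def error_rate(succeeded: int, failed: int) -> float:
    """Fraction of operations that failed; 0.0 when there were none."""
    total = succeeded + failed
    return failed / total if total else 0.0


def breaches_threshold(succeeded: int, failed: int,
                       max_error_rate: float = 0.05) -> bool:
    """True when the error rate is strictly above the assumed threshold."""
    return error_rate(succeeded, failed) > max_error_rate
```

Keeping the catalog as data makes it easy to review names and labels before any controller emits them, and the threshold function gives alerting rules a single definition to agree on.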

**Consumer observability**
- Define how compute workload stdout/stderr is collected and surfaced to the customer (relates to enhancements#714 for the broader log platform design)
- Define which instance lifecycle events are visible to the consumer (start, stop, crash, OOM)
- Ensure consumer-visible metrics (instance uptime, resource allocation) are queryable
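
One way to reason about consumer visibility is an explicit allowlist of event kinds, so that anything not listed stays operator-only by default. The sketch below assumes a hypothetical `InstanceEvent` record and a hypothetical operator-only `SCHEDULED` event; only start, stop, crash, and OOM come from the goals above.

```python
from dataclasses import dataclass
from enum import Enum


class EventKind(Enum):
    STARTED = "start"
    STOPPED = "stop"
    CRASHED = "crash"
    OOM_KILLED = "oom"
    SCHEDULED = "scheduled"  # hypothetical operator-only event


# Allowlist: only these kinds are ever shown to the consumer.
CONSUMER_VISIBLE = frozenset({
    EventKind.STARTED,
    EventKind.STOPPED,
    EventKind.CRASHED,
    EventKind.OOM_KILLED,
})


@dataclass(frozen=True)
class InstanceEvent:
    """Hypothetical lifecycle event record."""
    instance: str
    kind: EventKind
    message: str


def consumer_view(events: list[InstanceEvent]) -> list[InstanceEvent]:
    """Return only consumer-visible events, preserving order."""
    return [e for e in events if e.kind in CONSUMER_VISIBLE]
```

An allowlist fails closed: a new internal event kind is hidden from consumers until someone deliberately adds it to `CONSUMER_VISIBLE`.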

## Non-Goals

- Billing/metering telemetry (tracked in unikraft-provider#5)
- Quota utilization metrics (tracked in compute#90)
- The customer log platform itself (tracked in enhancements#714); this issue defines only the compute-specific requirements and integration points

## Open Questions

- Which metrics backend do compute controllers emit to? Is it the same Prometheus stack used by other Datum services?
- Is consumer log access provided via the portal, the API, or a Grafana datasource endpoint?
