Observability design for the compute service #99

@scotwells

Parent Issue

Tracked by datum-cloud/enhancements#682 (Launch Workload Compute Service — "UFOs")

Summary

The compute service has no observability design. Operators have no structured way to monitor instance health, resource utilization, or service-level indicators across the fleet. Customers have no way to see their workload's stdout/stderr or resource consumption. This work defines and implements the observability stack for compute, covering both the operator and consumer layers.

Goals

Operator observability

  • Define and emit metrics covering instance lifecycle (create, start, stop, terminate rates), resource utilization (vCPU, memory), scheduling latency, and error rates
  • Establish structured logging conventions for compute controllers (instance controller, workload deployment controller)
  • Define alerting thresholds for fleet-level health indicators
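
As a rough illustration of the metrics and alerting goals above, the sketch below counts lifecycle events per action and outcome, renders them in the Prometheus text exposition format, and checks an error-rate threshold. The metric name `compute_instance_lifecycle_events_total`, the label names, and the 5% threshold are placeholders invented for this example, not decisions made in this issue. A real controller would more likely use a Prometheus client library than hand-rolled counters.

```python
from collections import Counter

# Hypothetical metric name and label set; the issue has not settled on either.
METRIC = "compute_instance_lifecycle_events_total"
ACTIONS = ("create", "start", "stop", "terminate")


class LifecycleMetrics:
    """In-memory counters keyed by (action, outcome)."""

    def __init__(self):
        self._counts = Counter()

    def record(self, action, ok=True):
        if action not in ACTIONS:
            raise ValueError(f"unknown lifecycle action: {action}")
        self._counts[(action, "success" if ok else "error")] += 1

    def error_rate(self, action):
        """Fraction of recorded events for `action` that failed (0.0 if none)."""
        errors = self._counts[(action, "error")]
        total = errors + self._counts[(action, "success")]
        return errors / total if total else 0.0

    def render(self):
        """Render the counters in Prometheus text exposition format."""
        lines = [
            f"# HELP {METRIC} Instance lifecycle events by action and outcome.",
            f"# TYPE {METRIC} counter",
        ]
        for (action, outcome), value in sorted(self._counts.items()):
            lines.append(f'{METRIC}{{action="{action}",outcome="{outcome}"}} {value}')
        return "\n".join(lines) + "\n"


def breaches_threshold(metrics, action, max_error_rate=0.05):
    """Placeholder alert rule: fire when the error rate exceeds the threshold."""
    return metrics.error_rate(action) > max_error_rate
```

Keeping the outcome as a label on a single counter, rather than separate success and error metrics, lets the error rate be derived per action from one series family. Whether that matches the conventions of the existing Datum metrics stack is part of the open question on the metrics backend.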

Consumer observability

  • Define how compute workload stdout/stderr is collected and surfaced to the customer (relates to enhancements#714 for the broader log platform design)
  • Define which instance lifecycle events are visible to the consumer (start, stop, crash, OOM)
  • Ensure consumer-visible metrics (instance uptime, resource allocation) are queryable
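
One way to read the lifecycle-event goal above is as a mapping from whatever the runtime reports to a small, fixed consumer vocabulary (start, stop, crash, OOM). The sketch below assumes a hypothetical `RuntimeStatus` record with `running`, `exit_code`, and `oom_killed` fields. The signals actually available from the compute runtime are not specified in this issue, so treat the field names and the classification order as assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ConsumerEvent(Enum):
    START = "start"
    STOP = "stop"
    CRASH = "crash"
    OOM = "oom"


@dataclass(frozen=True)
class RuntimeStatus:
    """Hypothetical shape of what the runtime reports for an instance."""
    running: bool
    exit_code: Optional[int] = None  # None while the instance is running
    oom_killed: bool = False


def to_consumer_event(status: RuntimeStatus) -> ConsumerEvent:
    """Collapse a runtime status into one consumer-visible event."""
    if status.running:
        return ConsumerEvent.START
    if status.oom_killed:
        # Check OOM before the exit code: an OOM kill typically also
        # exits non-zero and would otherwise be reported as a crash.
        return ConsumerEvent.OOM
    if status.exit_code == 0:
        return ConsumerEvent.STOP
    return ConsumerEvent.CRASH
```

For example, `to_consumer_event(RuntimeStatus(running=False, exit_code=137, oom_killed=True))` yields `ConsumerEvent.OOM` rather than `ConsumerEvent.CRASH`, which keeps the consumer-facing reason specific when the runtime can tell the two apart.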

Non-Goals

  • Billing/metering telemetry (tracked in unikraft-provider#5)
  • Quota utilization metrics (tracked in compute#90)
  • The customer log platform itself (tracked in enhancements#714) — this issue defines the compute-specific requirements and integration points

Open Questions

  • Which metrics backend do compute controllers emit to? Is it the same Prometheus stack used by other Datum services?
  • Is consumer log access provided through the portal, an API, or a Grafana datasource endpoint?
