telemetry: fix cloud-init race + wire 3 unset dashboard gauges by motatoes · Pull Request #249 · diggerhq/opencomputer

motatoes · 2026-05-14T20:31:20Z

Two related fixes that together get the prod dashboard's worker panels populating:

1. `fix(vector): wait for cloud-final.service so worker.env is written`

On freshly-booted Azure workers, `populate-vector-env.service` was racing cloud-init:

cloud-init's final stage base64-decodes the worker.env payload baked by the control plane (`internal/compute/azure.go:651-655`) and writes `/etc/opensandbox/worker.env` — including `OPENSANDBOX_AZURE_KEY_VAULT_NAME=opencomputer-prod-kv`.
The populator only declared `After=network-online.target`, so systemd was free to start it before cloud-final. Its `EnvironmentFile=-/etc/opensandbox/worker.env` (leading `-`) silently skipped the missing file.
The script saw `OPENSANDBOX_AZURE_KEY_VAULT_NAME` unset, logged `"not set — skipping"`, and exited 0.
Exit 0 means `Restart=on-failure` never fires → populator wedged for the rest of boot.
`vector.service` then failed to start because `/etc/opensandbox/vector.env` was never written and its `${AXIOM_PLATFORM_TOKEN}` / `${OPENCOMPUTER_CELL_ID}` substitutions had no values.

Fix: add `After=cloud-final.service` + `Wants=cloud-final.service` so the populator only runs once cloud-init has finished writing worker.env.

2. `metrics: wire three dashboard gauges that were registered but never set`

`opensandbox_workers_total`, `opensandbox_auth_attempts_total`, and `opensandbox_sandboxes_active` were defined and registered in `internal/metrics/metrics.go` but had zero `.Set()` / `.Inc()` call sites. The corresponding dashboard panels showed either "No data" or the misleading `field 'tags.X' not found` APL error, which is what APL returns on an empty result set when the query groups by a tag column.

| Metric | Producer | Wiring location |
|---|---|---|
| `opensandbox_workers_total` | control plane | `internal/controlplane/redis_registry.go`, end of `reconcileAndPrune()` (10s tick), grouped by (region, status) |
| `opensandbox_auth_attempts_total` | control plane | `internal/auth/oauth_handlers.go`, each return path in `HandleCallback`, type=workos, result=success/failure |
| `opensandbox_sandboxes_active` | worker | `internal/worker/resource_metrics.go` (existing 30s tick), driven by a new `SandboxCounter` interface satisfied by `*qemu.Manager.ActiveSandboxesByTemplate()` |

Design notes

- Gauge drift. `WorkersTotal` and `SandboxesActive` both call `.Reset()` before emitting. Without it, a region that drained to zero (or a template that just had its last sandbox stop) would keep reporting its last non-zero value forever. The Reset → re-emit window is sub-microsecond; Vector scrapes on a 15-30s cadence so the race is negligible.
- Heartbeat for sandboxes_active. When no sandboxes are running, emit `{template="", value=0}` so the dashboard's `summarize by tags.worker_id` group-by has at least one row to chew on (renders `0` instead of "field not found").
- Auth label cardinality. Failure sub-reasons (missing_code, invalid_state, upstream_auth_failed, provision_failed) stay in logs rather than as metric label values, which keeps `result` to a tight 2-value set.
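The gauge-drift and heartbeat notes can be sketched together. This is a minimal illustration under stated assumptions: `gaugeVec` is a hypothetical stdlib-only stand-in for a labelled gauge, `publishActive` is an invented name for one tick of the emit loop, and the template names in `main` are made up. It is not the code in `internal/worker/resource_metrics.go`.

```go
package main

import (
	"fmt"
	"sync"
)

// gaugeVec is a hypothetical stand-in for a labelled gauge keyed by template.
type gaugeVec struct {
	mu     sync.Mutex
	values map[string]float64
}

func newGaugeVec() *gaugeVec {
	return &gaugeVec{values: make(map[string]float64)}
}

// Reset drops every series so labels that disappeared stop being reported.
func (g *gaugeVec) Reset() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.values = make(map[string]float64)
}

// Set writes the value for one template label.
func (g *gaugeVec) Set(template string, v float64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.values[template] = v
}

// Snapshot returns a copy of the current series, as a scrape would see them.
func (g *gaugeVec) Snapshot() map[string]float64 {
	g.mu.Lock()
	defer g.mu.Unlock()
	out := make(map[string]float64, len(g.values))
	for k, v := range g.values {
		out[k] = v
	}
	return out
}

// publishActive performs one tick: reset, then re-emit the live counts.
// With nothing running it emits a template="" heartbeat with value 0.
func publishActive(g *gaugeVec, active map[string]int) {
	g.Reset()
	if len(active) == 0 {
		g.Set("", 0)
		return
	}
	for template, n := range active {
		g.Set(template, float64(n))
	}
}

func main() {
	g := newGaugeVec()
	publishActive(g, map[string]int{"python": 2, "node": 1}) // hypothetical templates
	fmt.Println(g.Snapshot())
	publishActive(g, map[string]int{"python": 1}) // "node" stopped: its series is gone
	fmt.Println(g.Snapshot())
	publishActive(g, nil) // nothing running: heartbeat only
	fmt.Println(g.Snapshot())
}
```

Because `Reset` and `Set` take the lock separately, a scrape could in principle land between them, which is the short window the gauge-drift note refers to.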

Test plan

- `go build ./...` + `go vet ./...` clean on touched packages (verified; pre-existing firecracker errors on main are unrelated).
- On a freshly-spawned worker built from this PR: `journalctl -u populate-vector-env` shows the populator ran AFTER cloud-init wrote worker.env and successfully fetched tokens from KV. `systemctl is-active vector.service` → `active`.
- On a control plane built from this PR: `curl :9092/metrics | grep opensandbox_workers_total` shows `{region, status}` rows matching live worker count, including `draining` if any.
- Successful + failed login → `opensandbox_auth_attempts_total{type="workos",result="success"}` and `...{result="failure"}` both visible.
- Worker with no sandboxes: `curl :9091/metrics | grep opensandbox_sandboxes_active` shows a single `{template=""}` row with value `0`. Start a sandbox → a row appears for that template.
- Dashboard panels (Active sandboxes, Workers, Auth attempts) populate after rollout instead of showing "No data" / "field not found".

Rollout

Worker AMI rebuild via `build-worker-ami.yml`, then fleet rollover. Control plane redeploy via the existing scp+systemctl-restart pattern in `deploy/azure/create-opencomputer-prod.sh`.
For currently-running workers stuck in the populator-race state, the manual unblock is `systemctl restart populate-vector-env.service vector.service` (worker.env is already populated by now, so the populator will succeed on retry) — this is independent of the AMI/binary rebuild.

🤖 Generated with Claude Code

On freshly-booted Azure workers, populate-vector-env.service was racing cloud-init: cloud-init's final stage is what base64-decodes the worker.env payload baked by the control plane (internal/compute/azure.go) and writes it to /etc/opensandbox/worker.env. The populator's `EnvironmentFile=-/etc/opensandbox/worker.env` uses a leading `-` (skip-if-absent), so when it ran ahead of cloud-final it saw an empty environ, logged "OPENSANDBOX_AZURE_KEY_VAULT_NAME not set — skipping", and exited 0. Since exit was 0, `Restart=on-failure` never fired and the populator was wedged for the rest of the boot. vector.service then couldn't start because /etc/opensandbox/vector.env was never written and its substitutions (AXIOM_PLATFORM_TOKEN, etc.) had no values. Result: no logs and no metrics from the worker fleet, even though worker.env was correct by the time anyone looked. Add `After=cloud-final.service` + `Wants=cloud-final.service` so the populator only runs once cloud-init has finished writing worker.env. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

breardon2011

approved

opensandbox_workers_total, opensandbox_auth_attempts_total, and opensandbox_sandboxes_active were defined and registered in internal/metrics/metrics.go but had zero call sites; every dashboard panel that queried them sat empty (or rendered the "field not found" APL quirk on empty result sets). Wire each from the natural producer.

- opensandbox_workers_total (control plane). Emitted at the end of RedisWorkerRegistry.reconcileAndPrune, keyed by (region, status) where status is active/draining. Reset()s first so a region that drained to zero stops reporting its last non-zero value forever.
- opensandbox_auth_attempts_total (control plane). Incremented at each return path in OAuthHandlers.HandleCallback, with type=workos and result=success|failure. Sub-reasons stay in logs rather than metric labels to keep cardinality bounded.
- opensandbox_sandboxes_active (worker). Driven from the existing resource-stats tick in internal/worker/resource_metrics.go. Adds a SandboxCounter interface (parallel to MemoryAllocator) that the QEMU manager satisfies via a new ActiveSandboxesByTemplate method. Per-tick Reset() handles the gauge-drift case (template ended → its label would otherwise stay at its last value forever); a template="" heartbeat with value 0 is emitted when no sandboxes are running so the dashboard panel doesn't error on the empty group-by tags.worker_id case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


…ng cycle (#254)

* vector: drop Wants=cloud-final from populator to break systemd cycle

#249 added After= AND Wants= cloud-final.service to the populator unit. The Wants= half pulled cloud-final into the dep graph and created a cycle:

- vector.service Wants populate-vector-env.service Wants cloud-final.service
- cloud-final.service Before multi-user.target Wants vector.service

At boot, systemd resolves this by silently deleting vector.service/start. Vector never starts, no log, no error. Observed on a prod worker after #249 merged: load=10, vector inactive, journal: "cloud-final.service: Job vector.service/start deleted to break ordering cycle starting with cloud-final.service/start"

Drop cloud-final from Wants=. Keep it in After=, since that alone is what fixes the original race and avoids forcing cloud-final into our dep graph.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* vector: revert #249's cloud-final ordering; retry in script instead

#249 added `After=cloud-final.service` + `Wants=cloud-final.service` to populate-vector-env.service to fix a race where the populator ran before cloud-init wrote /etc/opensandbox/worker.env. Symptoms: empty env → no KV fetch → empty vector.env → Vector with no Axiom creds.

#254-v1 tried to break the resulting systemd cycle by dropping just the Wants=. Tested on dev: cycle still fires, vector still inactive.

Real root cause: on this Azure image, BOTH cloud-final.service and cloud-init.target declare `After=multi-user.target`. So ANY ordering dependency on a cloud-init unit from a unit WantedBy=multi-user.target (which populate-vector-env is) creates a cycle. systemd resolves it by silently deleting vector.service/start.

This commit:

1. Reverts the unit-file changes from #249. Back to After=/Wants= network-online.target only, same as before #249, no cycle.
2. Fixes the original race at the script level. When neither /etc/opensandbox/worker.env nor server.env exists, the script now exits 1 instead of 0, so Restart=on-failure on the unit retries. With RestartSec=10s and StartLimitBurst=5 / IntervalSec=120, that's a ~50s retry budget, plenty for cloud-init to land worker.env on Azure. Once worker.env exists but VAULT_NAME is still unset, the script exits 0 (treating this as "host genuinely doesn't have KV configured", e.g. dev VMs without managed identity).

Validated on dev (opensandbox-dev-tf-worker):

- before patch: reboot → vector inactive, "ordering cycle" in journal
- after patch: reboot → vector active, populator active, no cycle

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous iteration (since reverted in #259) shipped a 600s synchronous in-script wait on worker.env. On Azure that deadlocked the boot: cloud-final.service is ordered After=multi-user.target on Ubuntu Azure images, and writing /etc/opensandbox/worker.env is what cloud-final does. multi-user.target couldn't reach active while the populator was waiting (vector.service wants populator, multi-user wants vector). Every new Azure worker was reaped at exactly 600s by scaler.go's pendingWorkerTTL=10min.

This change makes the populator exit fast in *all* boot paths:

- If /etc/opensandbox/{worker,server}.env exists at populator-run time (dev hosts, image bake, reboot of a healthy VM), the populator pulls real creds from Key Vault and writes vector.env synchronously. This behavior is unchanged.
- If neither role env exists (Azure first boot, cloud-final hasn't run yet), the populator:
  1. writes a stub vector.env with all expected variables defined but empty, so `vector validate` passes and the service can start (the axiom sink fails its healthcheck and buffers to disk),
  2. starts a new companion unit populate-vector-env-wait.service (not WantedBy=multi-user.target, so it doesn't block boot),
  3. exits 0 in ~1s.

The wait unit polls /etc/opensandbox/{worker,server}.env every 5s for up to 30 min (past Azure cloud-init's worst-case ~5 min), then re-runs the main populator (which now finds the role env file and goes through the synchronous path) and does `systemctl reset-failed + restart vector.service` so the disk buffer flushes into Axiom with the real token.

Why prior approaches failed (full history in populate-vector-env.sh header):

- #249 After=cloud-final → systemd cycle, vector dropped silently.
- #254 exit 1 + Restart=on-failure → vector's restart-burst burnt the StartLimitBurst budget in <2s.
- #256 internal 90s poll → multi-user blocked 90s, populator gave up before cloud-final arrived at ~4 min anyway.
- #257 internal 600s poll → boot deadlock, every Azure worker reaped.

What we explored but didn't ship:

- systemd .path unit watching the specific worker.env file (not the dir): would work, but adds a third unit and still needs the same decoupling between vector.service and the populator at boot time that this approach already achieves more directly.
- Type=forking + setsid + disown in one unit: the detached child can be killed by systemd on unit stop unless KillMode=process, which has subtler semantics than a clean separate unit.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

breardon2011 approved these changes May 14, 2026


motatoes mentioned this pull request May 14, 2026

metrics: wire three dashboard gauges that were registered but never set #250

Closed

4 tasks

motatoes changed the title ~~fix(vector): wait for cloud-final.service so worker.env is written~~ telemetry: fix cloud-init race + wire 3 unset dashboard gauges May 14, 2026

motatoes merged commit ad95b44 into main May 14, 2026
1 check passed

motatoes mentioned this pull request May 15, 2026

vector: drop Wants=cloud-final from populator to break systemd ordering cycle #254

Merged

4 tasks

motatoes mentioned this pull request May 18, 2026

vector: detach populator from boot when role env is missing #260

Merged

12 tasks

motatoes merged 2 commits into `main` from `fix/populator-wait-cloud-init`
