telemetry: fix cloud-init race + wire 3 unset dashboard gauges #249
Merged
On freshly-booted Azure workers, populate-vector-env.service was racing cloud-init: cloud-init's final stage is what base64-decodes the worker.env payload baked by the control plane (internal/compute/azure.go) and writes it to /etc/opensandbox/worker.env.

The populator's `EnvironmentFile=-/etc/opensandbox/worker.env` uses a leading `-` (skip-if-absent), so when it ran ahead of cloud-final it saw an empty environ, logged "OPENSANDBOX_AZURE_KEY_VAULT_NAME not set — skipping", and exited 0. Since exit was 0, `Restart=on-failure` never fired and the populator was wedged for the rest of the boot.

vector.service then couldn't start because /etc/opensandbox/vector.env was never written and its substitutions (AXIOM_PLATFORM_TOKEN, etc.) had no values. Result: no logs and no metrics from the worker fleet, even though worker.env was correct by the time anyone looked.

Add `After=cloud-final.service` + `Wants=cloud-final.service` so the populator only runs once cloud-init has finished writing worker.env.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
opensandbox_workers_total, opensandbox_auth_attempts_total, and opensandbox_sandboxes_active were defined and registered in internal/metrics/metrics.go but had zero call sites, so every dashboard panel that queried them sat empty (or rendered the "field not found" APL quirk on empty result sets). Wire each from the natural producer.

- opensandbox_workers_total (control plane). Emitted at the end of RedisWorkerRegistry.reconcileAndPrune, keyed by (region, status) where status is active/draining. Reset()s first so a region that drained to zero stops reporting its last non-zero value forever.
- opensandbox_auth_attempts_total (control plane). Incremented at each return path in OAuthHandlers.HandleCallback, with type=workos and result=success|failure. Sub-reasons stay in logs rather than metric labels to keep cardinality bounded.
- opensandbox_sandboxes_active (worker). Driven from the existing resource-stats tick in internal/worker/resource_metrics.go. Adds a SandboxCounter interface (parallel to MemoryAllocator) that the QEMU manager satisfies via a new ActiveSandboxesByTemplate method. Per-tick Reset() handles the gauge-drift case (template ended → its label would otherwise stay at its last value forever); a template="" heartbeat with value 0 is emitted when no sandboxes are running so the dashboard panel doesn't error on the empty group-by tags.worker_id case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
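The Reset-then-Set pattern described for opensandbox_sandboxes_active can be sketched as follows. This is a minimal, stdlib-only illustration, not the code from internal/worker/resource_metrics.go: `gaugeVec` is a hypothetical stand-in for the real labelled gauge, the `ActiveSandboxesByTemplate() map[string]int` signature is an assumption (the commit names the method but not its shape), and the template names `base` and `gpu` are invented.

```go
package main

import (
	"fmt"
	"sort"
)

// gaugeVec is a hypothetical stand-in for a labelled gauge: one value per
// template label. It is not the real metrics type.
type gaugeVec struct {
	values map[string]float64
}

func newGaugeVec() *gaugeVec { return &gaugeVec{values: map[string]float64{}} }

// Reset drops every label so stale series stop being reported.
func (g *gaugeVec) Reset() { g.values = map[string]float64{} }

// Set records the current value for one label.
func (g *gaugeVec) Set(label string, v float64) { g.values[label] = v }

// SandboxCounter mirrors the interface named in the commit message.
// The method signature is an assumption made for this sketch.
type SandboxCounter interface {
	ActiveSandboxesByTemplate() map[string]int
}

// fakeManager is a test double standing in for the QEMU manager.
type fakeManager struct{ counts map[string]int }

func (f fakeManager) ActiveSandboxesByTemplate() map[string]int { return f.counts }

// publishActiveSandboxes performs one tick: reset, then set every live
// template, or emit a template="" heartbeat of 0 when nothing is running.
func publishActiveSandboxes(g *gaugeVec, c SandboxCounter) {
	g.Reset()
	counts := c.ActiveSandboxesByTemplate()
	if len(counts) == 0 {
		g.Set("", 0)
		return
	}
	for tmpl, n := range counts {
		g.Set(tmpl, float64(n))
	}
}

// snapshot renders the gauge deterministically for printing.
func snapshot(g *gaugeVec) []string {
	out := make([]string, 0, len(g.values))
	for label, v := range g.values {
		out = append(out, fmt.Sprintf("template=%q value=%g", label, v))
	}
	sort.Strings(out)
	return out
}

func main() {
	g := newGaugeVec()
	publishActiveSandboxes(g, fakeManager{counts: map[string]int{"base": 2, "gpu": 1}})
	fmt.Println(snapshot(g))
	publishActiveSandboxes(g, fakeManager{counts: map[string]int{"base": 1}})
	fmt.Println(snapshot(g)) // the "gpu" label is gone rather than stuck at 1
	publishActiveSandboxes(g, fakeManager{})
	fmt.Println(snapshot(g)) // only the empty-template heartbeat remains
}
```

Without the per-tick Reset, the second tick would leave `gpu` at its last value, which is the drift case the commit describes.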
motatoes added a commit that referenced this pull request on May 15, 2026
#249 added `After=cloud-final.service` + `Wants=cloud-final.service` to populate-vector-env.service to fix a race where the populator ran before cloud-init wrote /etc/opensandbox/worker.env. Symptoms: empty env → no KV fetch → empty vector.env → Vector with no Axiom creds.

#254-v1 tried to break the resulting systemd cycle by dropping just the Wants=. Tested on dev: cycle still fires, vector still inactive.

Real root cause: on this Azure image, BOTH cloud-final.service and cloud-init.target declare `After=multi-user.target`. So ANY ordering dependency on a cloud-init unit from a unit WantedBy=multi-user.target (which populate-vector-env is) creates a cycle. systemd resolves it by silently deleting vector.service/start.

This commit:

1. Reverts the unit-file changes from #249. Back to After=/Wants= network-online.target only, the same as before #249, with no cycle.
2. Fixes the original race at the script level. When neither /etc/opensandbox/worker.env nor server.env exists, the script now exits 1 instead of 0, so Restart=on-failure on the unit retries. With RestartSec=10s and StartLimitBurst=5 / StartLimitIntervalSec=120, that's a ~50s retry budget, which is plenty for cloud-init to land worker.env on Azure. Once worker.env exists but VAULT_NAME is still unset, the script exits 0 (treating this as "host genuinely doesn't have KV configured", e.g. dev VMs without managed identity).

Validated on dev (opensandbox-dev-tf-worker):

before patch: reboot → vector inactive, "ordering cycle" in journal
after patch: reboot → vector active, populator active, no cycle

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
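The ordering cycle described above can be reproduced with a small depth-first search over "runs after" edges. This is a sketch for intuition only, not how systemd implements job ordering. The edge from multi-user.target back to the populator is an assumption based on systemd's default behavior of ordering a target after the units it wants; `unitGraph` and `findCycle` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"sort"
)

// unitGraph returns "X must start after Y" edges for the units named in the
// commit message. The multi-user.target -> populator edge is assumed from
// systemd's default target dependencies, not read from the real unit files.
func unitGraph(orderAfterCloudFinal bool) map[string][]string {
	populator := []string{"network-online.target"}
	if orderAfterCloudFinal {
		populator = append(populator, "cloud-final.service") // the #249 change
	}
	return map[string][]string{
		"populate-vector-env.service": populator,
		"cloud-final.service":         {"multi-user.target"},
		"multi-user.target":           {"populate-vector-env.service"},
	}
}

// findCycle walks the graph depth-first and returns the first ordering cycle
// it meets (first unit repeated at the end), or nil if there is none.
func findCycle(after map[string][]string) []string {
	const (
		unvisited = iota
		inProgress
		done
	)
	state := map[string]int{}
	var stack []string

	var visit func(u string) []string
	visit = func(u string) []string {
		state[u] = inProgress
		stack = append(stack, u)
		for _, dep := range after[u] {
			switch state[dep] {
			case inProgress:
				// Back edge: everything on the stack from dep onward is the cycle.
				for i, s := range stack {
					if s == dep {
						cycle := append([]string{}, stack[i:]...)
						return append(cycle, dep)
					}
				}
			case unvisited:
				if c := visit(dep); c != nil {
					return c
				}
			}
		}
		stack = stack[:len(stack)-1]
		state[u] = done
		return nil
	}

	// Visit units in sorted order so the result is deterministic.
	units := make([]string, 0, len(after))
	for u := range after {
		units = append(units, u)
	}
	sort.Strings(units)
	for _, u := range units {
		if state[u] == unvisited {
			if c := visit(u); c != nil {
				return c
			}
		}
	}
	return nil
}

func main() {
	fmt.Println("with After=cloud-final.service:", findCycle(unitGraph(true)))
	fmt.Println("network-online.target only:   ", findCycle(unitGraph(false)))
}
```

Dropping `Wants=` alone leaves the `After=` edge in place, so the graph passed to `findCycle` is unchanged, which matches the observation that #254-v1 did not break the cycle.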
motatoes added a commit that referenced this pull request on May 15, 2026
…ng cycle (#254)

* vector: drop Wants=cloud-final from populator to break systemd cycle

#249 added After= AND Wants= cloud-final.service to the populator unit. The Wants= half pulled cloud-final into the dep graph and created a cycle:

    vector.service Wants populate-vector-env.service Wants cloud-final.service
    cloud-final.service Before multi-user.target Wants vector.service

At boot, systemd resolves this by silently deleting vector.service/start. Vector never starts, no log, no error. Observed on a prod worker after #249 merged: load=10, vector inactive, journal:

    "cloud-final.service: Job vector.service/start deleted to break ordering cycle starting with cloud-final.service/start"

Drop cloud-final from Wants=. Keep it in After=, since that alone is what fixes the original race and avoids forcing cloud-final into our dep graph.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* vector: revert #249's cloud-final ordering; retry in script instead

#249 added `After=cloud-final.service` + `Wants=cloud-final.service` to populate-vector-env.service to fix a race where the populator ran before cloud-init wrote /etc/opensandbox/worker.env. Symptoms: empty env → no KV fetch → empty vector.env → Vector with no Axiom creds.

#254-v1 tried to break the resulting systemd cycle by dropping just the Wants=. Tested on dev: cycle still fires, vector still inactive.

Real root cause: on this Azure image, BOTH cloud-final.service and cloud-init.target declare `After=multi-user.target`. So ANY ordering dependency on a cloud-init unit from a unit WantedBy=multi-user.target (which populate-vector-env is) creates a cycle. systemd resolves it by silently deleting vector.service/start.

This commit:

1. Reverts the unit-file changes from #249. Back to After=/Wants= network-online.target only, the same as before #249, with no cycle.
2. Fixes the original race at the script level. When neither /etc/opensandbox/worker.env nor server.env exists, the script now exits 1 instead of 0, so Restart=on-failure on the unit retries. With RestartSec=10s and StartLimitBurst=5 / StartLimitIntervalSec=120, that's a ~50s retry budget, which is plenty for cloud-init to land worker.env on Azure. Once worker.env exists but VAULT_NAME is still unset, the script exits 0 (treating this as "host genuinely doesn't have KV configured", e.g. dev VMs without managed identity).

Validated on dev (opensandbox-dev-tf-worker):

before patch: reboot → vector inactive, "ordering cycle" in journal
after patch: reboot → vector active, populator active, no cycle

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
motatoes added a commit that referenced this pull request on May 19, 2026
The previous iteration (since reverted in #259) shipped a 600s synchronous in-script wait on worker.env. On Azure that deadlocked the boot: cloud-final.service is ordered After=multi-user.target on Ubuntu Azure images, and writing /etc/opensandbox/worker.env is what cloud-final does. multi-user.target couldn't reach active while the populator was waiting (vector.service wants populator, multi-user wants vector). Every new Azure worker was reaped at exactly 600s by scaler.go's pendingWorkerTTL=10min.

This change makes the populator exit fast in *all* boot paths:

- If /etc/opensandbox/{worker,server}.env exists at populator-run time (dev hosts, image bake, reboot of a healthy VM), the populator pulls real creds from Key Vault and writes vector.env synchronously (unchanged behavior).
- If neither role env exists (Azure first boot, cloud-final hasn't run yet), the populator:
  1. writes a stub vector.env with all expected variables defined but empty, so `vector validate` passes and the service can start (the axiom sink fails its healthcheck and buffers to disk),
  2. starts a new companion unit populate-vector-env-wait.service (not WantedBy=multi-user.target, so it doesn't block boot),
  3. exits 0 in ~1s.

The wait unit polls /etc/opensandbox/{worker,server}.env every 5s for up to 30 min (past Azure cloud-init's worst-case ~5 min), then re-runs the main populator (which now finds the role env file and goes through the synchronous path) and does `systemctl reset-failed + restart vector.service` so the disk buffer flushes into Axiom with the real token.

Why prior approaches failed (full history in populate-vector-env.sh header):

- #249 After=cloud-final → systemd cycle, vector dropped silently.
- #254 exit 1 + Restart=on-failure → vector's restart-burst burnt the StartLimitBurst budget in <2s.
- #256 internal 90s poll → multi-user blocked 90s, populator gave up before cloud-final arrived at ~4 min anyway.
- #257 internal 600s poll → boot deadlock, every Azure worker reaped.
What we explored but didn't ship:

- systemd .path unit watching the specific worker.env file (not the dir): would work, but adds a third unit and still needs the same decoupling between vector.service and the populator at boot time that this approach already achieves more directly.
- Type=forking + setsid + disown in one unit: the detached child can be killed by systemd on unit stop unless KillMode=process, which has subtler semantics than a clean separate unit.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related fixes that together get the prod dashboard's worker panels populating:
1. `fix(vector): wait for cloud-final.service so worker.env is written`
On freshly-booted Azure workers, `populate-vector-env.service` was racing cloud-init:
Fix: add `After=cloud-final.service` + `Wants=cloud-final.service` so the populator only runs once cloud-init has finished writing worker.env.
2. `metrics: wire three dashboard gauges that were registered but never set`
`opensandbox_workers_total`, `opensandbox_auth_attempts_total`, and `opensandbox_sandboxes_active` were defined and registered in `internal/metrics/metrics.go` but had zero `.Set()` / `.Inc()` call sites — the corresponding dashboard panels showed either "No data" or the misleading `field 'tags.X' not found` APL error (which is what APL returns on empty result sets when the query group-bys a tag column).
Design notes
Test plan
Rollout
🤖 Generated with Claude Code