telemetry: fix cloud-init race + wire 3 unset dashboard gauges #249

Merged
motatoes merged 2 commits into main from fix/populator-wait-cloud-init
May 14, 2026

Contributor

@motatoes motatoes commented May 14, 2026

Two related fixes that together get the prod dashboard's worker panels populating:

1. `fix(vector): wait for cloud-final.service so worker.env is written`

On freshly-booted Azure workers, `populate-vector-env.service` was racing cloud-init:

  1. cloud-init's final stage base64-decodes the worker.env payload baked by the control plane (`internal/compute/azure.go:651-655`) and writes `/etc/opensandbox/worker.env` — including `OPENSANDBOX_AZURE_KEY_VAULT_NAME=opencomputer-prod-kv`.
  2. The populator only declared `After=network-online.target`, so systemd was free to start it before cloud-final. Its `EnvironmentFile=-/etc/opensandbox/worker.env` (leading `-`) silently skipped the missing file.
  3. The script saw `OPENSANDBOX_AZURE_KEY_VAULT_NAME` unset, logged `"not set — skipping"`, and exited 0.
  4. Exit 0 means `Restart=on-failure` never fires → populator wedged for the rest of boot.
  5. `vector.service` then failed to start because `/etc/opensandbox/vector.env` was never written and its `${AXIOM_PLATFORM_TOKEN}` / `${OPENCOMPUTER_CELL_ID}` substitutions had no values.

Fix: add `After=cloud-final.service` + `Wants=cloud-final.service` so the populator only runs once cloud-init has finished writing worker.env.

2. `metrics: wire three dashboard gauges that were registered but never set`

`opensandbox_workers_total`, `opensandbox_auth_attempts_total`, and `opensandbox_sandboxes_active` were defined and registered in `internal/metrics/metrics.go` but had zero `.Set()` / `.Inc()` call sites, so the corresponding dashboard panels showed either "No data" or the misleading `field 'tags.X' not found` APL error (which is what APL returns on an empty result set when the query groups by a tag column).

| Metric | Producer | Wiring location |
|---|---|---|
| `opensandbox_workers_total` | control plane | `internal/controlplane/redis_registry.go`, end of `reconcileAndPrune()` (10s tick), grouped by (region, status) |
| `opensandbox_auth_attempts_total` | control plane | `internal/auth/oauth_handlers.go`, each return path in `HandleCallback`, type=workos, result=success/failure |
| `opensandbox_sandboxes_active` | worker | `internal/worker/resource_metrics.go` (existing 30s tick), driven by a new `SandboxCounter` interface satisfied by `*qemu.Manager.ActiveSandboxesByTemplate()` |

Design notes

  • Gauge drift. `WorkersTotal` and `SandboxesActive` both call `.Reset()` before emitting. Without it, a region that drained to zero (or a template that just had its last sandbox stop) would keep reporting its last non-zero value forever. The Reset → re-emit window is sub-microsecond; Vector scrapes on a 15-30s cadence so the race is negligible.
  • Heartbeat for sandboxes_active. When no sandboxes are running, emit a `template=""` series with value `0` so the dashboard's `summarize by tags.worker_id` group-by has at least one row to work with (renders `0` instead of "field not found").
  • Auth label cardinality. Failure sub-reasons (missing_code, invalid_state, upstream_auth_failed, provision_failed) stay in logs rather than as metric label values — keeps `result` to a tight 2-value set.
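
The reset-then-re-emit pattern and the heartbeat from the notes above can be sketched as follows. This is a minimal, stdlib-only illustration: `gaugeVec` is a hypothetical stand-in for a labelled gauge such as a Prometheus `GaugeVec`, and `emitSandboxesActive` and the template names are invented for the example rather than taken from the PR.

```go
package main

import (
	"fmt"
	"sort"
)

// gaugeVec is a hypothetical, stdlib-only stand-in for a labelled gauge.
// It keeps one value per template label.
type gaugeVec struct {
	values map[string]float64
}

func newGaugeVec() *gaugeVec { return &gaugeVec{values: map[string]float64{}} }

// Reset drops every series, so labels that are not re-emitted disappear.
func (g *gaugeVec) Reset() { g.values = map[string]float64{} }

// Set records the current value for one label.
func (g *gaugeVec) Set(label string, v float64) { g.values[label] = v }

// emitSandboxesActive performs one tick: reset, then re-emit the live counts.
// When nothing is running it emits a template="" heartbeat with value 0.
func emitSandboxesActive(g *gaugeVec, byTemplate map[string]int) {
	g.Reset()
	if len(byTemplate) == 0 {
		g.Set("", 0)
		return
	}
	for tmpl, n := range byTemplate {
		g.Set(tmpl, float64(n))
	}
}

// printSeries prints the current series in a stable order.
func printSeries(g *gaugeVec) {
	labels := make([]string, 0, len(g.values))
	for l := range g.values {
		labels = append(labels, l)
	}
	sort.Strings(labels)
	for _, l := range labels {
		fmt.Printf("template=%q value=%v\n", l, g.values[l])
	}
}

func main() {
	g := newGaugeVec()
	emitSandboxesActive(g, map[string]int{"python": 2, "node": 1})
	emitSandboxesActive(g, map[string]int{"node": 1}) // last python sandbox stopped
	printSeries(g)
	emitSandboxesActive(g, nil) // worker is idle: only the heartbeat remains
	printSeries(g)
}
```

Without the `Reset()` call, the `python` series in this sketch would keep its old value of 2 after the second tick, which is the drift case the note describes.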

Test plan

  • `go build ./...` + `go vet ./...` clean on touched packages (verified — pre-existing firecracker errors on main are unrelated).
  • On a freshly-spawned worker built from this PR: `journalctl -u populate-vector-env` shows the populator ran AFTER cloud-init wrote worker.env and successfully fetched tokens from KV. `systemctl is-active vector.service` → `active`.
  • On a control plane built from this PR: `curl :9092/metrics | grep opensandbox_workers_total` shows `{region, status}` rows matching live worker count, including `draining` if any.
  • Successful + failed login → `opensandbox_auth_attempts_total{type="workos",result="success"}` and `...{result="failure"}` both visible.
  • Worker with no sandboxes: `curl :9091/metrics | grep opensandbox_sandboxes_active` shows a `template=""` series with value `0`. Start a sandbox → row appears for that template.
  • Dashboard panels (Active sandboxes, Workers, Auth attempts) populate after rollout instead of showing "No data" / "field not found".

Rollout

  • Worker AMI rebuild via `build-worker-ami.yml`, then fleet rollover. Control plane redeploy via the existing scp+systemctl-restart pattern in `deploy/azure/create-opencomputer-prod.sh`.
  • For currently-running workers stuck in the populator-race state, the manual unblock is `systemctl restart populate-vector-env.service vector.service` (worker.env is already populated by now, so the populator will succeed on retry) — this is independent of the AMI/binary rebuild.

🤖 Generated with Claude Code

On freshly-booted Azure workers, populate-vector-env.service was racing
cloud-init: cloud-init's final stage is what base64-decodes the
worker.env payload baked by the control plane (internal/compute/azure.go)
and writes it to /etc/opensandbox/worker.env. The populator's
`EnvironmentFile=-/etc/opensandbox/worker.env` uses a leading `-`
(skip-if-absent), so when it ran ahead of cloud-final it saw an empty
environ, logged "OPENSANDBOX_AZURE_KEY_VAULT_NAME not set — skipping",
and exited 0. Since exit was 0, `Restart=on-failure` never fired and
the populator was wedged for the rest of the boot. vector.service
then couldn't start because /etc/opensandbox/vector.env was never
written and its substitutions (AXIOM_PLATFORM_TOKEN, etc.) had no
values. Result: no logs and no metrics from the worker fleet, even
though worker.env was correct by the time anyone looked.

Add `After=cloud-final.service` + `Wants=cloud-final.service` so the
populator only runs once cloud-init has finished writing worker.env.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

@breardon2011 breardon2011 left a comment

approved

opensandbox_workers_total, opensandbox_auth_attempts_total, and
opensandbox_sandboxes_active were defined and registered in
internal/metrics/metrics.go but had zero call sites — every dashboard
panel that queried them sat empty (or rendered the "field not found"
APL quirk on empty result sets). Wire each from the natural producer.

- opensandbox_workers_total (control plane). Emitted at the end of
  RedisWorkerRegistry.reconcileAndPrune, keyed by (region, status)
  where status is active/draining. Reset()s first so a region that
  drained to zero stops reporting its last non-zero value forever.

- opensandbox_auth_attempts_total (control plane). Incremented at
  each return path in OAuthHandlers.HandleCallback, with type=workos
  and result=success|failure. Sub-reasons stay in logs rather than
  metric labels to keep cardinality bounded.

- opensandbox_sandboxes_active (worker). Driven from the existing
  resource-stats tick in internal/worker/resource_metrics.go. Adds a
  SandboxCounter interface (parallel to MemoryAllocator) that the
  QEMU manager satisfies via a new ActiveSandboxesByTemplate method.
  Per-tick Reset() handles the gauge-drift case (template ended →
  its label would otherwise stay at its last value forever); a
  template="" heartbeat with value 0 is emitted when no sandboxes
  are running so the dashboard panel doesn't error on the empty
  group-by tags.worker_id case.
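
As a rough sketch of the two control-plane pieces described above, the grouping by (region, status) and the collapsing of failure sub-reasons into a two-value `result` label might look like this. The `worker` type, its fields, the function names, and the region names are assumptions made for illustration; the real registry and handler types are not shown in this PR.

```go
package main

import "fmt"

// worker is a hypothetical registry entry; the field names are assumptions.
type worker struct {
	Region   string
	Draining bool
}

// workerKey is the (region, status) label pair used for the gauge.
type workerKey struct {
	Region string
	Status string // "active" or "draining"
}

// countWorkers groups live workers by (region, status).
func countWorkers(ws []worker) map[workerKey]int {
	counts := make(map[workerKey]int)
	for _, w := range ws {
		status := "active"
		if w.Draining {
			status = "draining"
		}
		counts[workerKey{Region: w.Region, Status: status}]++
	}
	return counts
}

// authResult collapses a failure sub-reason into the two-value result label.
// An empty reason means the callback succeeded; the reason itself is only logged.
func authResult(failureReason string) string {
	if failureReason == "" {
		return "success"
	}
	return "failure"
}

func main() {
	counts := countWorkers([]worker{
		{Region: "eastus"},
		{Region: "eastus", Draining: true},
		{Region: "westeurope"},
	})
	for k, n := range counts {
		fmt.Printf("region=%s status=%s count=%d\n", k.Region, k.Status, n)
	}
	fmt.Println(authResult(""), authResult("invalid_state"))
}
```

Keeping the sub-reason out of the label set means the counter has at most two series per `type`, however many failure paths the handler grows.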

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@motatoes motatoes changed the title fix(vector): wait for cloud-final.service so worker.env is written telemetry: fix cloud-init race + wire 3 unset dashboard gauges May 14, 2026
@motatoes motatoes merged commit ad95b44 into main May 14, 2026
1 check passed
motatoes added a commit that referenced this pull request May 15, 2026
…ng cycle (#254)

* vector: drop Wants=cloud-final from populator to break systemd cycle

#249 added After= AND Wants= cloud-final.service to the populator unit.
The Wants= half pulled cloud-final into the dep graph and created a
cycle:

  vector.service Wants populate-vector-env.service Wants cloud-final.service
  cloud-final.service Before multi-user.target Wants vector.service

At boot, systemd resolves this by silently deleting vector.service/start.
Vector never starts, no log, no error. Observed on a prod worker after
#249 merged: load=10, vector inactive, journal:
  "cloud-final.service: Job vector.service/start deleted to break
   ordering cycle starting with cloud-final.service/start"

Drop cloud-final from Wants=. Keep it in After= — that alone is what
fixes the original race and avoids forcing cloud-final into our dep
graph.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* vector: revert #249's cloud-final ordering; retry in script instead

#249 added `After=cloud-final.service` + `Wants=cloud-final.service` to
populate-vector-env.service to fix a race where the populator ran before
cloud-init wrote /etc/opensandbox/worker.env. Symptoms: empty env → no
KV fetch → empty vector.env → Vector with no Axiom creds.

#254-v1 tried to break the resulting systemd cycle by dropping just the
Wants=. Tested on dev: cycle still fires, vector still inactive.

Real root cause: on this Azure image, BOTH cloud-final.service and
cloud-init.target declare `After=multi-user.target`. So ANY ordering
dependency on a cloud-init unit from a unit WantedBy=multi-user.target
(which populate-vector-env is) creates a cycle. systemd resolves it by
silently deleting vector.service/start.
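
To make the cycle concrete, here is a small sketch that runs a depth-first search over an "is ordered after" graph built from the description above. The edges are a simplified model, not a dump of the real units: the target's ordering after the units it wants follows systemd's default target behaviour, and the edge from vector.service to the populator is an assumption. `findCycle` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
)

// findCycle searches an "X is ordered after Y" graph depth-first and
// returns one cycle if it finds any, or nil otherwise.
func findCycle(after map[string][]string) []string {
	const (
		unvisited = iota
		inProgress
		done
	)
	state := map[string]int{}
	var stack []string
	var visit func(u string) []string
	visit = func(u string) []string {
		state[u] = inProgress
		stack = append(stack, u)
		for _, dep := range after[u] {
			switch state[dep] {
			case inProgress:
				// dep is on the current path, so the path from dep back to dep is a cycle.
				for i, s := range stack {
					if s == dep {
						return append(append([]string{}, stack[i:]...), dep)
					}
				}
			case unvisited:
				if c := visit(dep); c != nil {
					return c
				}
			}
		}
		stack = stack[:len(stack)-1]
		state[u] = done
		return nil
	}
	// Visit units in sorted order so the reported cycle is deterministic.
	units := make([]string, 0, len(after))
	for u := range after {
		units = append(units, u)
	}
	sort.Strings(units)
	for _, u := range units {
		if state[u] == unvisited {
			if c := visit(u); c != nil {
				return c
			}
		}
	}
	return nil
}

func main() {
	after := map[string][]string{
		// Ordering added by #249.
		"populate-vector-env.service": {"network-online.target", "cloud-final.service"},
		// Declared by the Azure image, per the commit message.
		"cloud-final.service": {"multi-user.target"},
		// Targets are ordered after the units they want by default.
		"multi-user.target": {"populate-vector-env.service", "vector.service"},
		// Assumed: vector starts after the populator it wants.
		"vector.service": {"populate-vector-env.service"},
	}
	fmt.Println("with #249 ordering:", findCycle(after))

	// Reverting to network-online.target only removes the back edge.
	after["populate-vector-env.service"] = []string{"network-online.target"}
	fmt.Println("after revert:", findCycle(after))
}
```

In this model the cycle exists whether or not `Wants=cloud-final.service` is present, because only the ordering edges matter, which matches the observation that dropping `Wants=` alone did not help.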

This commit:

1. Reverts the unit-file changes from #249. Back to After=/Wants=
   network-online.target only — same as before #249, no cycle.

2. Fixes the original race at the script level. When neither
   /etc/opensandbox/worker.env nor server.env exists, the script now
   exits 1 instead of 0, so Restart=on-failure on the unit retries.
   With RestartSec=10s and StartLimitBurst=5 / StartLimitIntervalSec=120, that's a
   ~50s retry budget — plenty for cloud-init to land worker.env on
   Azure.

   Once worker.env exists but VAULT_NAME is still unset, the script
   exits 0 (treating this as "host genuinely doesn't have KV
   configured", e.g. dev VMs without managed identity).

Validated on dev (opensandbox-dev-tf-worker):
  before patch: reboot → vector inactive, "ordering cycle" in journal
  after patch:  reboot → vector active, populator active, no cycle

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
motatoes added a commit that referenced this pull request May 19, 2026
The previous iteration (since reverted in #259) shipped a 600s
synchronous in-script wait on worker.env. On Azure that deadlocked the
boot: cloud-final.service is ordered After=multi-user.target on Ubuntu
Azure images, and writing /etc/opensandbox/worker.env is what
cloud-final does. multi-user.target couldn't reach active while the
populator was waiting (vector.service wants populator, multi-user
wants vector). Every new Azure worker was reaped at exactly 600s by
scaler.go's pendingWorkerTTL=10min.

This change makes the populator exit fast in *all* boot paths:

- If /etc/opensandbox/{worker,server}.env exists at populator-run time
  (dev hosts, image bake, reboot of a healthy VM), the populator pulls
  real creds from Key Vault and writes vector.env synchronously —
  unchanged behavior.

- If neither role env exists (Azure first boot, cloud-final hasn't
  run yet), the populator:
    1. writes a stub vector.env with all expected variables defined
       but empty, so `vector validate` passes and the service can
       start (the axiom sink fails its healthcheck and buffers to
       disk),
    2. starts a new companion unit populate-vector-env-wait.service
       (not WantedBy=multi-user.target, so it doesn't block boot),
    3. exits 0 in ~1s.

  The wait unit polls /etc/opensandbox/{worker,server}.env every 5s
  for up to 30 min (past Azure cloud-init's worst-case ~5 min), then
  re-runs the main populator (which now finds the role env file and
  goes through the synchronous path) and runs `systemctl reset-failed`
  followed by `systemctl restart` on vector.service so the disk
  buffer flushes into Axiom with the real token.

Why prior approaches failed (full history in populate-vector-env.sh
header):
  #249  After=cloud-final → systemd cycle, vector dropped silently.
  #254  exit 1 + Restart=on-failure → vector's restart-burst burnt
        the StartLimitBurst budget in <2s.
  #256  internal 90s poll → multi-user blocked 90s, populator gave up
        before cloud-final arrived at ~4 min anyway.
  #257  internal 600s poll → boot deadlock, every Azure worker reaped.

What we explored but didn't ship:
  - systemd .path unit watching the specific worker.env file (not the
    dir): would work, but adds a third unit and still needs the same
    decoupling between vector.service and the populator at boot time
    that this approach already achieves more directly.
  - Type=forking + setsid + disown in one unit: the detached child
    can be killed by systemd on unit stop unless KillMode=process,
    which has subtler semantics than a clean separate unit.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>