vector: drop Wants=cloud-final from populator to break systemd ordering cycle by motatoes · Pull Request #254 · diggerhq/opencomputer

motatoes · 2026-05-15T22:55:27Z

Summary

PR #249 added both `After=cloud-final.service` AND `Wants=cloud-final.service` to `populate-vector-env.service`. The `Wants=` half pulled cloud-final into Vector's dependency graph and created an ordering cycle that systemd resolves by deleting the `vector.service/start` job. Vector never starts and writes no log or error of its own; the only trace is the cycle message in the early boot journal.

Reproduction on prod

After #249 merged and a fresh worker rolled, observed on one prod worker:

- `uptime`: 1h 50m (fresh boot on the new AMI)
- `/etc/opensandbox/vector.env`: fully populated (populator ran correctly)
- `opensandbox-worker.service`: active
- `vector.service`: inactive (dead); never tried to start, zero journal entries this boot

In the early boot journal:
```
cloud-final.service: Found dependency on vector.service/start
cloud-final.service: Job vector.service/start deleted to break ordering cycle
starting with cloud-final.service/start
```
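A small helper can pull the dropped job out of journal text like the excerpt above. This is a minimal sketch, assuming the message format matches the lines quoted in this PR; the function name `deleted_cycle_jobs` is hypothetical and not part of the repository.

```sh
# deleted_cycle_jobs: read journal text on stdin and print each job that
# systemd deleted to break an ordering cycle, one per line.
# Assumes the message format quoted above ("Job <unit>/<type> deleted to
# break ordering cycle"); the function name is illustrative only.
deleted_cycle_jobs() {
  sed -n 's/.*Job \([^ ]*\) deleted to break ordering cycle.*/\1/p'
}

# Example on a live host (journalctl -b limits output to the current boot):
#   journalctl -b --no-pager | deleted_cycle_jobs
```

An empty result means no job was dropped this boot; any output names the unit that silently failed to start.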

Why `After=` is enough

`After=cloud-final.service` already gives the ordering needed to fix the original cloud-init race that #249 was solving. `Wants=` adds a "pull into the dependency graph" semantic we don't actually need: cloud-final.service is a stock cloud-init unit that's always present, so there is no need to "want" it.

Test plan

- Stage on a dev worker: replace `/etc/systemd/system/populate-vector-env.service` with this version, run `systemctl daemon-reload`, reboot
- After reboot: `systemctl is-active vector.service` returns `active`
- Boot journal contains zero "ordering cycle" entries
- Metric events from worker show up in Axiom within 1–2 minutes
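The two post-reboot checks in the test plan can be scripted. The following is a sketch only: the function name `verify_vector_boot` is hypothetical, and the Axiom check is left manual because this PR does not describe how to query it.

```sh
# verify_vector_boot: post-reboot checks from the test plan.
# Returns 0 only if vector.service is active and the current boot's journal
# contains no "ordering cycle" lines. Function name is illustrative only.
# Variables are global because POSIX sh has no `local`.
verify_vector_boot() {
  # is-active prints the state and exits non-zero when not active.
  state="$(systemctl is-active vector.service 2>/dev/null || true)"
  if [ "$state" != "active" ]; then
    echo "FAIL: vector.service is ${state:-unknown}"
    return 1
  fi
  # grep -c prints 0 and exits 1 when nothing matches, hence the || true.
  cycles="$(journalctl -b --no-pager 2>/dev/null | grep -c 'ordering cycle' || true)"
  if [ "$cycles" -ne 0 ]; then
    echo "FAIL: $cycles ordering-cycle line(s) in boot journal"
    return 1
  fi
  echo "OK: vector active, no ordering cycle"
}
```

Run `verify_vector_boot` on the dev worker after the reboot; a non-zero return means one of the two checks failed.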

🤖 Generated with Claude Code

#249 added After= AND Wants= cloud-final.service to the populator unit. The Wants= half pulled cloud-final into the dep graph and created a cycle:

    vector.service Wants populate-vector-env.service Wants cloud-final.service
    cloud-final.service Before multi-user.target Wants vector.service

At boot, systemd resolves this by silently deleting vector.service/start. Vector never starts, no log, no error.

Observed on a prod worker after #249 merged: load=10, vector inactive, journal: "cloud-final.service: Job vector.service/start deleted to break ordering cycle starting with cloud-final.service/start"

Drop cloud-final from Wants=. Keep it in After=; that alone is what fixes the original race and avoids forcing cloud-final into our dep graph.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

#249 added `After=cloud-final.service` + `Wants=cloud-final.service` to populate-vector-env.service to fix a race where the populator ran before cloud-init wrote /etc/opensandbox/worker.env. Symptoms: empty env → no KV fetch → empty vector.env → Vector with no Axiom creds.

#254-v1 tried to break the resulting systemd cycle by dropping just the Wants=. Tested on dev: cycle still fires, vector still inactive.

Real root cause: on this Azure image, BOTH cloud-final.service and cloud-init.target declare `After=multi-user.target`. So ANY ordering dependency on a cloud-init unit from a unit WantedBy=multi-user.target (which populate-vector-env is) creates a cycle. systemd resolves it by silently deleting vector.service/start.

This commit:

1. Reverts the unit-file changes from #249. Back to After=/Wants= network-online.target only, the same as before #249, so no cycle.
2. Fixes the original race at the script level. When neither /etc/opensandbox/worker.env nor server.env exists, the script now exits 1 instead of 0, so Restart=on-failure on the unit retries. With RestartSec=10s and StartLimitBurst=5 / IntervalSec=120, that's a ~50s retry budget, which is plenty for cloud-init to land worker.env on Azure. Once worker.env exists but VAULT_NAME is still unset, the script exits 0 (treating this as "host genuinely doesn't have KV configured", e.g. dev VMs without managed identity).

Validated on dev (opensandbox-dev-tf-worker):

- before patch: reboot → vector inactive, "ordering cycle" in journal
- after patch: reboot → vector active, populator active, no cycle

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
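The script-level exit-code rule in this commit message can be sketched as below. This is not the actual populate-vector-env.sh: the function name `populator_precheck`, the directory argument (standing in for /etc/opensandbox), the printed messages, and the preference for worker.env over server.env are all assumptions made for illustration. It also assumes the role env file is plain `KEY=value` shell syntax that can be sourced.

```sh
# populator_precheck DIR
# Sketch of the exit-code rule described above (names are illustrative):
#   return 1 -> no role env file yet, so Restart=on-failure retries the unit
#   return 0 -> role env file present, with or without VAULT_NAME
# Variables are global because POSIX sh has no `local`.
populator_precheck() {
  dir="$1"
  if [ -f "$dir/worker.env" ]; then
    role_env="$dir/worker.env"
  elif [ -f "$dir/server.env" ]; then
    role_env="$dir/server.env"
  else
    echo "retry: no role env file yet"
    return 1
  fi
  VAULT_NAME=""
  # Assumes KEY=value lines that are valid shell assignments.
  . "$role_env"
  if [ -z "$VAULT_NAME" ]; then
    echo "skip: VAULT_NAME unset, host has no Key Vault configured"
    return 0
  fi
  echo "fetch: would pull secrets from $VAULT_NAME"
  return 0
}
```

The retry budget quoted in the message follows from the unit settings: five start attempts spaced by RestartSec=10s give roughly 50 seconds before StartLimitBurst=5 is exhausted.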

breardon2011

approve

…#256)

#254 made the populator exit 1 when the role env file was missing, so systemd's Restart=on-failure could retry.

Hit a real bug on prod (osb-worker-c0741893): vector.service has Restart=always. Each restart re-requests the populator unit. systemd counts these as start attempts against the populator's StartLimitBurst=5 / IntervalSec=120, but they all land in <2 seconds (faster than RestartSec=10s can pace them). Burst tripped, populator enters `failed`, vector also enters `failed`.

Journal:

    00:43:08 populator: Start request repeated too quickly
    00:43:08 populator: Failed ... 5 more in 2 seconds
    00:44:00 worker.env written by cloud-init (too late, populator dead)

The systemd retry mechanism doesn't compose well when other units re-request you faster than your RestartSec= can pace.

Fix: poll inside the script. Single systemd invocation, internal wait up to 90s, source the env file when it appears. No restart-budget interaction. Behaves identically on dev (env file already exists → break out immediately on first iteration).

Why the test in #254 didn't catch this: Dev's bootstrap.sh writes /etc/opensandbox/worker.env BEFORE Vector's install step. So at boot, worker.env always exists for the populator. The dev test confirmed the cycle was gone, not that the retry mechanism worked under cloud-init delay. Reproducing this on dev would have needed an artificial delay (e.g. systemd-run --on-active=60s touch worker.env); that would catch this in future.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
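The in-script poll described in this message might look like the sketch below. The function name `wait_for_role_env` and its arguments are hypothetical; the 90-second default comes from the message, while the 5-second default interval is an assumption (the message does not state the interval used in #256).

```sh
# wait_for_role_env DIR [TIMEOUT_S] [INTERVAL_S]
# Polls for DIR/worker.env or DIR/server.env. Prints the first path found
# and returns 0, or returns 1 once TIMEOUT_S seconds of waiting have elapsed.
# Names and the default interval are illustrative assumptions.
wait_for_role_env() {
  dir="$1"
  timeout="${2:-90}"
  interval="${3:-5}"
  waited=0
  while :; do
    for f in "$dir/worker.env" "$dir/server.env"; do
      if [ -f "$f" ]; then
        echo "$f"
        return 0
      fi
    done
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

When the file already exists, as on dev, the loop returns on its first iteration without sleeping, which matches the "behaves identically on dev" note.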

The previous iteration (since reverted in #259) shipped a 600s synchronous in-script wait on worker.env. On Azure that deadlocked the boot: cloud-final.service is ordered After=multi-user.target on Ubuntu Azure images, and writing /etc/opensandbox/worker.env is what cloud-final does. multi-user.target couldn't reach active while the populator was waiting (vector.service wants populator, multi-user wants vector). Every new Azure worker was reaped at exactly 600s by scaler.go's pendingWorkerTTL=10min.

This change makes the populator exit fast in *all* boot paths:

- If /etc/opensandbox/{worker,server}.env exists at populator-run time (dev hosts, image bake, reboot of a healthy VM), the populator pulls real creds from Key Vault and writes vector.env synchronously (unchanged behavior).
- If neither role env exists (Azure first boot, cloud-final hasn't run yet), the populator:
  1. writes a stub vector.env with all expected variables defined but empty, so `vector validate` passes and the service can start (the axiom sink fails its healthcheck and buffers to disk),
  2. starts a new companion unit populate-vector-env-wait.service (not WantedBy=multi-user.target, so it doesn't block boot),
  3. exits 0 in ~1s.

The wait unit polls /etc/opensandbox/{worker,server}.env every 5s for up to 30 min (past Azure cloud-init's worst-case ~5 min), then re-runs the main populator (which now finds the role env file and goes through the synchronous path) and does `systemctl reset-failed + restart vector.service` so the disk buffer flushes into Axiom with the real token.

Why prior approaches failed (full history in populate-vector-env.sh header):

- #249 After=cloud-final → systemd cycle, vector dropped silently.
- #254 exit 1 + Restart=on-failure → vector's restart-burst burnt the StartLimitBurst budget in <2s.
- #256 internal 90s poll → multi-user blocked 90s, populator gave up before cloud-final arrived at ~4 min anyway.
- #257 internal 600s poll → boot deadlock, every Azure worker reaped.

What we explored but didn't ship:

- systemd .path unit watching the specific worker.env file (not the dir): would work, but adds a third unit and still needs the same decoupling between vector.service and the populator at boot time that this approach already achieves more directly.
- Type=forking + setsid + disown in one unit: the detached child can be killed by systemd on unit stop unless KillMode=process, which has subtler semantics than a clean separate unit.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
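The fast-exit path described above (stub file, detached wait unit, immediate exit) can be sketched as follows. This is an illustration, not the shipped script: `write_stub_vector_env` and `populate_fast` are hypothetical names, `AXIOM_TOKEN` and `AXIOM_DATASET` are placeholder variable names (the real list lives in the populator script), and starting the companion unit with `systemctl start --no-block` is an assumption about how the start is issued.

```sh
# write_stub_vector_env PATH
# Writes every expected variable as empty so the env file is well-formed.
# AXIOM_TOKEN and AXIOM_DATASET are placeholder names, not the real list.
write_stub_vector_env() {
  out="$1"
  : > "$out"
  for var in AXIOM_TOKEN AXIOM_DATASET; do
    printf '%s=\n' "$var" >> "$out"
  done
}

# populate_fast DIR OUT
# DIR stands in for /etc/opensandbox, OUT for the vector.env path.
# Synchronous path if a role env file exists; otherwise stub + detached wait.
populate_fast() {
  dir="$1"
  out="$2"
  if [ -f "$dir/worker.env" ] || [ -f "$dir/server.env" ]; then
    echo "sync"   # the real Key Vault fetch would run here
    return 0
  fi
  write_stub_vector_env "$out"
  # --no-block enqueues the start job and returns without waiting for it,
  # so the populator itself can exit right away (assumed invocation).
  systemctl start --no-block populate-vector-env-wait.service
  echo "deferred"
}
```

Keeping the wait in a separate unit means the populator never holds multi-user.target open, which is the decoupling the message credits for avoiding the earlier deadlock.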

motatoes and others added 2 commits May 15, 2026 15:54

motatoes marked this pull request as ready for review May 15, 2026 23:10

breardon2011 approved these changes May 15, 2026


motatoes merged commit 614413a into main May 15, 2026
2 checks passed

motatoes mentioned this pull request May 16, 2026

vector: poll for worker.env inside populator instead of systemd retry #256

Merged

3 tasks

motatoes mentioned this pull request May 18, 2026

vector: detach populator from boot when role env is missing #260

Merged

12 tasks

motatoes merged 2 commits into main from fix/populator-systemd-cycle
