vector: bump populator poll to 600s + ExecStartPost vector restart by motatoes · Pull Request #257 · diggerhq/opencomputer

motatoes · 2026-05-16T03:28:28Z

Summary

Two-line behavioral fix to populator that addresses the prod symptom from #256: cloud-init takes longer than 90s on Azure prod workers, populator exits before env file arrives, vector enters failed state and never recovers.

Changes

Poll deadline 90s → 600s in `populate-vector-env.sh`. Observed on `osb-worker-0b42c8be`: cloud-init wrote worker.env at +4 minutes into boot; #256's 90s budget expired at +1.5 minutes. Azure cloud-init on Standard_D-series VMs takes 3–5 min in practice; 10 minutes covers the long tail with margin.
`ExecStartPost` on the service that does `systemctl --no-block reset-failed vector.service` + `systemctl --no-block restart vector.service`. When populator finally writes vector.env, vector may already be in failed state from earlier restart-loops. reset-failed clears that, restart picks up the new env. `--no-block` avoids deadlock with vector's `After=populator` dep.
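
The bounded poll described in the first change can be sketched as below. This is a minimal illustration, not the code in `populate-vector-env.sh`: `wait_for_file` is a hypothetical helper, and the 5-second default interval is an assumption.

```shell
#!/bin/sh
# Sketch of a bounded poll: wait for a file to appear, give up at a deadline.
# wait_for_file is a hypothetical name; the real script may be structured differently.

# Usage: wait_for_file PATH DEADLINE_SECONDS [INTERVAL_SECONDS]
# Returns 0 as soon as PATH exists, 1 if the deadline passes first.
wait_for_file() {
    path=$1
    deadline=$2
    interval=${3:-5}   # assumed default; not taken from the PR
    waited=0           # counts sleep time only, so it approximates elapsed time
    while [ "$waited" -lt "$deadline" ]; do
        if [ -e "$path" ]; then
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    # Final check so a file that lands during the last sleep still counts.
    [ -e "$path" ]
}
```

With the numbers from this PR the call would look roughly like `wait_for_file /etc/opensandbox/worker.env 600`. The failure mode is bounded by the deadline, but the caller stays blocked for up to that long, which matters if other units are ordered after it.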

What we explored but didn't ship

systemd Path unit (`populate-vector-env.path` watching either worker.env or cloud-init's boot-finished marker). Eight dev-test reboots surfaced four distinct subtle bugs in succession:

`After=cloud-final.service` creates a systemd ordering cycle (cloud-init declares After=multi-user.target on this Azure image)
`RemainAfterExit=yes` makes path-unit's `systemctl start` a no-op
Vector's `Wants=populator` cascades on vector restart, burns populator's StartLimit in <1s
Dir-level inotify storms: 50–250 populator starts per second when cloud-init writes any file in the watched directory (even watching boot-finished in /var/lib/cloud/instance/ tripped this)

Each fix surfaced the next bug. Concluded path-unit-on-shared-dir interaction is the wrong tool for this; the poll approach is simpler with a known-bounded failure mode (timeout at 10 min).

Test plan

Dev reboot with fake-cloud-init writing worker.env at +240s: populator polls correctly, finds the file inside its budget, exits cleanly
AMI rebake + prod worker rotation: confirm vector reaches `active` on freshly-booted prod workers (worker.env arriving at +3-5min from cloud-init)
In-place patch on existing prod workers: run new script + service file via az run-command, restart populate-vector-env.service, confirm vector starts

🤖 Generated with Claude Code

#256 introduced a 90s internal poll for worker.env. Hit a follow-up issue on prod (osb-worker-0b42c8be): cloud-init wrote worker.env at +4 minutes into boot, our 90s poll gave up at +1.5 minutes. Populator exited 0 with "no KV configured", vector ran without env file, failed, restart-looped into a failed state, and the late env arrival had no effect.

Two changes:

1. Bump the poll deadline from 90s to 600s. Azure cloud-init on Standard_D-series VMs takes 3-5 minutes in observed cases; 10 minutes covers the long tail with margin.

2. Add ExecStartPost on the service that does

       systemctl --no-block reset-failed vector.service
       systemctl --no-block restart vector.service

   so when populator finally writes vector.env (potentially after vector has already exhausted its restart budget), vector is reset-failed and restarted. --no-block avoids the deadlock with vector's After=populator dep.

What we explored but didn't ship: systemd Path units (populate-vector-env.path watching worker.env or boot-finished). Eight dev-test reboots surfaced: ordering cycles, RemainAfterExit no-op on path triggers, Wants= cascade re-triggers, and dir-level inotify storms (50-250 starts/sec when cloud-init writes any file in the watched directory). Concluded the path-unit-on-shared-dir interaction is the wrong tool for this.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
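
The two ExecStartPost commands can be exercised outside systemd with a small wrapper. This is only a sketch: `kick_vector` and the `SYSTEMCTL` override are hypothetical names added here so the sequence can be dry-run; the PR itself puts the commands directly in the unit file.

```shell
#!/bin/sh
# Hypothetical wrapper around the two commands named in the commit message.
# SYSTEMCTL is an assumed override (e.g. SYSTEMCTL=echo) for dry runs.
kick_vector() {
    ctl=${SYSTEMCTL:-systemctl}
    # Clear the failed state left behind by earlier restart loops.
    "$ctl" --no-block reset-failed vector.service
    # Queue the restart without waiting for the job to finish, so the caller
    # does not block on vector's After=populator ordering.
    "$ctl" --no-block restart vector.service
}
```

Setting `SYSTEMCTL=echo` before calling `kick_vector` prints the two command lines instead of running them, which is a cheap way to check the order on a dev host.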

breardon2011

approve

The previous iteration (since reverted in #259) shipped a 600s synchronous in-script wait on worker.env. On Azure that deadlocked the boot: cloud-final.service is ordered After=multi-user.target on Ubuntu Azure images, and writing /etc/opensandbox/worker.env is what cloud-final does. multi-user.target couldn't reach active while the populator was waiting (vector.service wants populator, multi-user wants vector). Every new Azure worker was reaped at exactly 600s by scaler.go's pendingWorkerTTL=10min.

This change makes the populator exit fast in *all* boot paths:

- If /etc/opensandbox/{worker,server}.env exists at populator-run time (dev hosts, image bake, reboot of a healthy VM), the populator pulls real creds from Key Vault and writes vector.env synchronously. This is unchanged behavior.
- If neither role env exists (Azure first boot, cloud-final hasn't run yet), the populator:
  1. writes a stub vector.env with all expected variables defined but empty, so `vector validate` passes and the service can start (the axiom sink fails its healthcheck and buffers to disk),
  2. starts a new companion unit populate-vector-env-wait.service (not WantedBy=multi-user.target, so it doesn't block boot),
  3. exits 0 in ~1s.

The wait unit polls /etc/opensandbox/{worker,server}.env every 5s for up to 30 min (past Azure cloud-init's worst-case ~5 min), then re-runs the main populator (which now finds the role env file and goes through the synchronous path) and does `systemctl reset-failed + restart vector.service` so the disk buffer flushes into Axiom with the real token.

Why prior approaches failed (full history in populate-vector-env.sh header):

- #249 After=cloud-final → systemd cycle, vector dropped silently.
- #254 exit 1 + Restart=on-failure → vector's restart-burst burnt the StartLimitBurst budget in <2s.
- #256 internal 90s poll → multi-user blocked 90s, populator gave up before cloud-final arrived at ~4 min anyway.
- #257 internal 600s poll → boot deadlock, every Azure worker reaped.

What we explored but didn't ship:

- systemd .path unit watching the specific worker.env file (not the dir): would work, but adds a third unit and still needs the same decoupling between vector.service and the populator at boot time that this approach already achieves more directly.
- Type=forking + setsid + disown in one unit: the detached child can be killed by systemd on unit stop unless KillMode=process, which has subtler semantics than a clean separate unit.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
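
The fast-exit decision described above can be sketched as follows. Everything named here is hypothetical and for illustration only: the function names, the `ENV_DIR` override, and the placeholder variable names `EXAMPLE_TOKEN` and `EXAMPLE_DATASET` are not taken from the real script, and the Key Vault fetch and the start of the wait unit are left as comments.

```shell
#!/bin/sh
# Sketch of the two boot paths. All names below are assumptions.
ENV_DIR=${ENV_DIR:-/etc/opensandbox}

# Succeeds if either role env file is already present.
role_env_present() {
    [ -e "$ENV_DIR/worker.env" ] || [ -e "$ENV_DIR/server.env" ]
}

# Usage: write_stub_env OUTFILE NAME...
# Writes each NAME as "NAME=" so every expected variable is defined but empty.
write_stub_env() {
    out=$1
    shift
    : > "$out"
    for name in "$@"; do
        printf '%s=\n' "$name" >> "$out"
    done
}

# Usage: populate OUTFILE
# Prints which path was taken; the real work on each path is only indicated.
populate() {
    out=$1
    if role_env_present; then
        # The real populator would pull creds from Key Vault and write OUTFILE here.
        echo "sync"
    else
        # Placeholder variable names; the real list lives in the populator script.
        write_stub_env "$out" EXAMPLE_TOKEN EXAMPLE_DATASET
        # The real populator would start populate-vector-env-wait.service here.
        echo "deferred"
    fi
}
```

The point of the split is that neither branch waits: the synchronous path runs only when the role env file is already there, and the deferred path hands the waiting to a unit that boot does not depend on.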

breardon2011 approved these changes May 18, 2026

motatoes merged commit 213e8e6 into main May 18, 2026
1 check passed

This was referenced May 18, 2026

Revert "vector: bump populator poll to 600s + ExecStartPost vector restart" #259

Merged

vector: detach populator from boot when role env is missing #260

Merged

motatoes merged 1 commit into main from fix/populator-path-unit