Revert "vector: bump populator poll to 600s + ExecStartPost vector restart"#259

Merged
motatoes merged 1 commit into main from revert-257-fix/populator-path-unit on May 18, 2026
@motatoes
Contributor

Reverts #257

@motatoes motatoes merged commit 9f6b8c0 into main May 18, 2026
1 check passed
motatoes added a commit that referenced this pull request May 19, 2026
The previous iteration (since reverted in #259) shipped a 600s
synchronous in-script wait on worker.env. On Azure that deadlocked the
boot: cloud-final.service, which writes /etc/opensandbox/worker.env,
is ordered After=multi-user.target on Ubuntu Azure images, but
multi-user.target couldn't reach active while the populator was
waiting (vector.service wants the populator, multi-user.target wants
vector.service). Every new Azure worker was reaped at exactly 600s by
scaler.go's pendingWorkerTTL=10min.
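
The deadlock can be illustrated as a loop in "must happen before"
constraints. The sketch below is only a model of the paragraph above,
not a readout of live systemd state: the event names are made up for
illustration, and it assumes a `tsort` that exits non-zero on a loop
(as GNU coreutils does).

```shell
#!/bin/sh
# Happens-before edges taken from the description above.
# "A B" means A must complete before B. Event names are illustrative.
edges_with_wait() {
    cat <<'EOF'
populator-exits vector-starts
vector-starts multi-user-active
multi-user-active cloud-final-runs
cloud-final-runs worker.env-written
worker.env-written populator-exits
EOF
}

# Same chain, minus the edge added by the in-script wait on worker.env.
edges_fast_exit() {
    edges_with_wait | grep -v '^worker.env-written populator-exits$'
}

# Succeeds when the edges on stdin contain a loop.
# Relies on tsort exiting non-zero for cyclic input.
has_cycle() {
    ! tsort >/dev/null 2>&1
}

if edges_with_wait | has_cycle; then
    echo "in-script wait: loop, boot cannot make progress"
fi
if ! edges_fast_exit | has_cycle; then
    echo "fast exit: no loop"
fi
```

Dropping the single edge "worker.env is written before the populator
exits" is enough to break the loop, which is what the fast-exit
populator does.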

This change makes the populator exit fast in *all* boot paths:

- If /etc/opensandbox/{worker,server}.env exists at populator-run time
  (dev hosts, image bake, reboot of a healthy VM), the populator pulls
  real creds from Key Vault and writes vector.env synchronously —
  unchanged behavior.

- If neither role env exists (Azure first boot, cloud-final hasn't
  run yet), the populator:
    1. writes a stub vector.env with all expected variables defined
       but empty, so `vector validate` passes and the service can
       start (the axiom sink fails its healthcheck and buffers to
       disk),
    2. starts a new companion unit populate-vector-env-wait.service
       (not WantedBy=multi-user.target, so it doesn't block boot),
    3. exits 0 in ~1s.

  The wait unit polls /etc/opensandbox/{worker,server}.env every 5s
  for up to 30 min (past Azure cloud-init's worst-case ~5 min), then
  re-runs the main populator (which now finds the role env file and
  goes through the synchronous path) and runs
  `systemctl reset-failed vector.service` followed by
  `systemctl restart vector.service`, so the disk buffer flushes into
  Axiom with the real token.
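
A minimal sketch of the two paths and the wait loop, under stated
assumptions: the vector.env location, the variable names AXIOM_TOKEN
and AXIOM_DATASET, the `fetch_real_creds` placeholder, and the use of
`--no-block` are all illustrative and not taken from
populate-vector-env.sh.

```shell
#!/bin/sh
# Sketch only. Paths are overridable so the functions can be exercised in a
# scratch directory; the vector.env location and variable names are assumed.
ENV_DIR="${ENV_DIR:-/etc/opensandbox}"
VECTOR_ENV="${VECTOR_ENV:-$ENV_DIR/vector.env}"

# True if either role env file exists.
role_env_present() {
    [ -f "$ENV_DIR/worker.env" ] || [ -f "$ENV_DIR/server.env" ]
}

# Placeholder for the Key Vault pull (the real logic is not shown here).
fetch_real_creds() {
    echo "would fetch real credentials and write $VECTOR_ENV"
}

# Define every expected variable with an empty value.
# AXIOM_TOKEN and AXIOM_DATASET are illustrative names.
write_stub_env() {
    for name in AXIOM_TOKEN AXIOM_DATASET; do
        printf '%s=\n' "$name"
    done > "$VECTOR_ENV"
}

# Hand the waiting to the companion unit without blocking on it.
start_wait_unit() {
    systemctl start --no-block populate-vector-env-wait.service
}

# Main populator: synchronous path if a role env exists, else stub and return.
populate() {
    if role_env_present; then
        fetch_real_creds
    else
        write_stub_env
        start_wait_unit
    fi
}

# Body of the wait unit: poll every $1 seconds for at most $2 seconds.
wait_for_role_env() {
    interval="${1:-5}"
    limit="${2:-1800}"
    waited=0
    while [ "$waited" -lt "$limit" ]; do
        role_env_present && return 0
        sleep "$interval"
        waited=$((waited + interval))
    done
    role_env_present
}

# After the file appears: re-run the populator, then restart vector.
repopulate_and_restart() {
    wait_for_role_env 5 1800 || return 1
    populate
    systemctl reset-failed vector.service
    systemctl restart vector.service
}
```

In this shape the main unit would call `populate` and the companion
unit would call `repopulate_and_restart`. With the defaults, the loop
makes at most 1800 / 5 = 360 checks before giving up.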

Why prior approaches failed (full history in populate-vector-env.sh
header):
  #249  After=cloud-final → systemd cycle, vector dropped silently.
  #254  exit 1 + Restart=on-failure → vector's restart-burst burnt
        the StartLimitBurst budget in <2s.
  #256  internal 90s poll → multi-user blocked 90s, populator gave up
        before cloud-final arrived at ~4 min anyway.
  #257  internal 600s poll → boot deadlock, every Azure worker reaped.

What we explored but didn't ship:
  - systemd .path unit watching the specific worker.env file (not the
    dir): would work, but adds a third unit and still needs the same
    decoupling between vector.service and the populator at boot time
    that this approach already achieves more directly.
  - Type=forking + setsid + disown in one unit: the detached child
    can be killed by systemd on unit stop unless KillMode=process,
    which has subtler semantics than a clean separate unit.
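
For contrast with the alternatives above, the separate-unit shape
might look roughly like the following. This is a hypothetical sketch:
the ExecStart path, the Description text, and the choice of
Type=simple are assumptions, and only the unit name and the absence of
an install target come from the description above.

```shell
#!/bin/sh
# Hypothetical: writes a sketch of the companion unit into UNIT_DIR.
UNIT_DIR="${UNIT_DIR:-/etc/systemd/system}"

write_wait_unit() {
    cat > "$UNIT_DIR/populate-vector-env-wait.service" <<'EOF'
[Unit]
Description=Wait for the role env file, then re-run the vector env populator

[Service]
# Type=simple: "systemctl start" returns once the process is spawned.
Type=simple
# Script path is a placeholder, not taken from the repository.
ExecStart=/usr/local/bin/populate-vector-env-wait.sh

# Deliberately no install section, so no boot target pulls this unit in.
EOF
}
```

After writing the file, `systemctl daemon-reload` would be needed
before the populator can start the unit. Because the waiting process
is the unit's own main process, stopping the unit stops the wait, with
no KillMode tuning required.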

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>