
vector: drop Wants=cloud-final from populator to break systemd ordering cycle #254

Merged
motatoes merged 2 commits into main from fix/populator-systemd-cycle
May 15, 2026

Conversation

@motatoes
Contributor

Summary

PR #249 added both `After=cloud-final.service` AND `Wants=cloud-final.service` to `populate-vector-env.service`. The `Wants=` half pulled cloud-final into Vector's dependency graph and created an ordering cycle, which systemd resolves by deleting the `vector.service/start` job. Vector never starts, and `vector.service` itself logs nothing; the only trace is the cycle message in the early boot journal.

Reproduction on prod

After #249 merged and a fresh worker rolled, observed on one prod worker:

  • `uptime`: 1h 50m — fresh boot on the new AMI
  • `/etc/opensandbox/vector.env`: fully populated (populator ran correctly)
  • `opensandbox-worker.service`: active
  • `vector.service`: inactive (dead) — never tried to start, zero journal entries this boot

In the early boot journal:
```
cloud-final.service: Found dependency on vector.service/start
cloud-final.service: Job vector.service/start deleted to break ordering cycle
starting with cloud-final.service/start
```

Why `After=` is enough

`After=cloud-final.service` already provides the ordering needed to fix the cloud-init race that #249 was solving. `Wants=` additionally pulls cloud-final into our dependency graph, which we don't need: cloud-final is a stock cloud-init service that is always present, so there is no reason to "want" it.

Test plan

  • Stage on a dev worker — replace `/etc/systemd/system/populate-vector-env.service` with this version, `systemctl daemon-reload`, reboot
  • After reboot: `systemctl is-active vector.service` returns `active`
  • Boot journal contains zero "ordering cycle" entries
  • Metric events from worker show up in Axiom within 1–2 minutes
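The journal and unit-state checks in the test plan can be scripted. The following is a minimal sketch, not part of this PR: `count_cycle_entries` is a hypothetical helper that counts "ordering cycle" lines in journal text read from stdin, so it can be exercised without a running systemd.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): count journal lines that mention an
# ordering cycle. Reads journal text on stdin.
count_cycle_entries() {
    # grep -c prints the count even when it is 0, but exits 1 in that case;
    # "|| true" keeps the helper usable under "set -e".
    grep -c 'ordering cycle' || true
}

# Intended use on a freshly rebooted worker (not executed here):
#   journalctl -b --no-pager | count_cycle_entries   # expect 0
#   systemctl is-active vector.service               # expect "active"
```

Reading from stdin keeps the check independent of how the journal is collected, so the same helper works on a saved boot log.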

🤖 Generated with Claude Code

motatoes and others added 2 commits May 15, 2026 15:54
#249 added After= AND Wants= cloud-final.service to the populator unit.
The Wants= half pulled cloud-final into the dep graph and created a
cycle:

  vector.service Wants populate-vector-env.service Wants cloud-final.service
  cloud-final.service Before multi-user.target Wants vector.service

At boot, systemd resolves this by silently deleting vector.service/start.
Vector never starts, no log, no error. Observed on a prod worker after
#249 merged: load=10, vector inactive, journal:
  "cloud-final.service: Job vector.service/start deleted to break
   ordering cycle starting with cloud-final.service/start"

Drop cloud-final from Wants=. Keep it in After= — that alone is what
fixes the original race and avoids forcing cloud-final into our dep
graph.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#249 added `After=cloud-final.service` + `Wants=cloud-final.service` to
populate-vector-env.service to fix a race where the populator ran before
cloud-init wrote /etc/opensandbox/worker.env. Symptoms: empty env → no
KV fetch → empty vector.env → Vector with no Axiom creds.

#254-v1 tried to break the resulting systemd cycle by dropping just the
Wants=. Tested on dev: cycle still fires, vector still inactive.

Real root cause: on this Azure image, BOTH cloud-final.service and
cloud-init.target declare `After=multi-user.target`. So ANY ordering
dependency on a cloud-init unit from a unit WantedBy=multi-user.target
(which populate-vector-env is) creates a cycle. systemd resolves it by
silently deleting vector.service/start.
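
Whether a given image has this property can be checked before adding an ordering dependency. A minimal sketch, assuming the unit's After= list is obtained with `systemctl show -p After --value`; the helper name `orders_after_multi_user` is hypothetical and not part of this commit:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a space-separated After= list contains
# multi-user.target. Takes the list as a string so it can be tested offline.
orders_after_multi_user() {
    case " $1 " in
        *" multi-user.target "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Intended use on a worker (not executed here):
#   after_list=$(systemctl show -p After --value cloud-final.service)
#   if orders_after_multi_user "$after_list"; then
#       echo "cloud-final is ordered after multi-user.target: cycle risk"
#   fi
```

If the check succeeds, any unit that is WantedBy=multi-user.target and also ordered After= that unit is a candidate for the cycle described above.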

This commit:

1. Reverts the unit-file changes from #249. Back to After=/Wants=
   network-online.target only — same as before #249, no cycle.

2. Fixes the original race at the script level. When neither
   /etc/opensandbox/worker.env nor server.env exists, the script now
   exits 1 instead of 0, so Restart=on-failure on the unit retries.
   With RestartSec=10s and StartLimitBurst=5 / IntervalSec=120, that's a
   ~50s retry budget — plenty for cloud-init to land worker.env on
   Azure.

   Once worker.env exists but VAULT_NAME is still unset, the script
   exits 0 (treating this as "host genuinely doesn't have KV
   configured", e.g. dev VMs without managed identity).
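
The exit-code rules in item 2 can be sketched as a small decision function. This is an illustration under assumptions, not the actual populate-vector-env.sh: the helper name, the preference for worker.env over server.env, and the printed labels are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify what the populator should do for ENV_DIR.
#   retry -> neither role env file exists yet; the real script exits 1
#   skip  -> role env exists but VAULT_NAME is unset; the real script exits 0
#   fetch -> role env exists and names a vault; continue to the KV fetch
classify_populator_state() {
    local env_dir="$1" env_file
    # Assumption: worker.env is preferred when both files exist.
    if [ -f "$env_dir/worker.env" ]; then
        env_file="$env_dir/worker.env"
    elif [ -f "$env_dir/server.env" ]; then
        env_file="$env_dir/server.env"
    else
        echo retry
        return 0
    fi
    # Source in a subshell so the file's variables do not leak to the caller,
    # and clear any inherited VAULT_NAME so only the file's value counts.
    if ( unset VAULT_NAME; . "$env_file"; [ -n "${VAULT_NAME:-}" ] ); then
        echo fetch
    else
        echo skip
    fi
}
```

A caller would map `retry` to `exit 1` (so Restart=on-failure fires) and `skip` to `exit 0`.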

Validated on dev (opensandbox-dev-tf-worker):
  before patch: reboot → vector inactive, "ordering cycle" in journal
  after patch:  reboot → vector active, populator active, no cycle

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@motatoes motatoes marked this pull request as ready for review May 15, 2026 23:10
Contributor

@breardon2011 breardon2011 left a comment

approve

@motatoes motatoes merged commit 614413a into main May 15, 2026
2 checks passed
motatoes added a commit that referenced this pull request May 16, 2026
…#256)

#254 made the populator exit 1 when the role env file was missing, so
systemd's Restart=on-failure could retry. Hit a real bug on prod
(osb-worker-c0741893):

  vector.service has Restart=always. Each restart re-requests the
  populator unit. systemd counts these as start attempts against the
  populator's StartLimitBurst=5 / IntervalSec=120 — but they all land
  in <2 seconds (faster than RestartSec=10s can pace them). Burst
  tripped, populator enters `failed`, vector also enters `failed`.

  Journal:
    00:43:08 populator: Start request repeated too quickly
    00:43:08 populator: Failed
    ... 5 more in 2 seconds
    00:44:00 worker.env written by cloud-init (too late, populator dead)

The systemd retry mechanism doesn't compose well when other units
re-request you faster than your RestartSec= can pace.

Fix: poll inside the script. Single systemd invocation, internal wait
up to 90s, source the env file when it appears. No restart-budget
interaction. Behaves identically on dev (env file already exists →
break out immediately on first iteration).
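
The in-script poll might look roughly like the sketch below. It is not the code from #256: the function name, its arguments, and the 1s default interval are assumptions; only the 90s budget and the "return immediately if the file already exists" behavior come from the description above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: wait until ENV_DIR/worker.env or ENV_DIR/server.env
# exists. Prints the path found and returns 0, or returns 1 after TIMEOUT
# seconds. INTERVAL (seconds between checks) is an assumed parameter.
wait_for_role_env() {
    local env_dir="$1" timeout="${2:-90}" interval="${3:-1}"
    local waited=0 name
    while :; do
        for name in worker.env server.env; do
            if [ -f "$env_dir/$name" ]; then
                printf '%s\n' "$env_dir/$name"
                return 0
            fi
        done
        # The check runs before the first sleep, so an existing file costs
        # no waiting at all (the dev case described above).
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
}
```

Because the whole wait happens inside one invocation, systemd sees a single start attempt and the StartLimitBurst budget is never touched.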

Why the test in #254 didn't catch this:
  Dev's bootstrap.sh writes /etc/opensandbox/worker.env BEFORE Vector's
  install step. So at boot, worker.env always exists for the populator.
  The dev test confirmed the cycle was gone, not that the retry
  mechanism worked under cloud-init delay. Reproducing that on dev would
  have needed an artificial delay (e.g. systemd-run --on-active=60s
  touch worker.env); adding one would catch this in future.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
motatoes added a commit that referenced this pull request May 19, 2026
The previous iteration (since reverted in #259) shipped a 600s
synchronous in-script wait on worker.env. On Azure that deadlocked the
boot: cloud-final.service is ordered After=multi-user.target on Ubuntu
Azure images, and writing /etc/opensandbox/worker.env is what
cloud-final does. multi-user.target couldn't reach active while the
populator was waiting (vector.service wants populator, multi-user
wants vector). Every new Azure worker was reaped at exactly 600s by
scaler.go's pendingWorkerTTL=10min.

This change makes the populator exit fast in *all* boot paths:

- If /etc/opensandbox/{worker,server}.env exists at populator-run time
  (dev hosts, image bake, reboot of a healthy VM), the populator pulls
  real creds from Key Vault and writes vector.env synchronously —
  unchanged behavior.

- If neither role env exists (Azure first boot, cloud-final hasn't
  run yet), the populator:
    1. writes a stub vector.env with all expected variables defined
       but empty, so `vector validate` passes and the service can
       start (the axiom sink fails its healthcheck and buffers to
       disk),
    2. starts a new companion unit populate-vector-env-wait.service
       (not WantedBy=multi-user.target, so it doesn't block boot),
    3. exits 0 in ~1s.

  The wait unit polls /etc/opensandbox/{worker,server}.env every 5s
  for up to 30 min (past Azure cloud-init's worst-case ~5 min), then
  re-runs the main populator (which now finds the role env file and
  goes through the synchronous path) and does
  `systemctl reset-failed + restart vector.service` so the disk
  buffer flushes into Axiom with the real token.
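
The fast-exit path can be sketched as follows. This is a simplified illustration under assumptions, not the shipped script: the function names and the variable names AXIOM_TOKEN and AXIOM_DATASET are hypothetical, the Key Vault fetch is reduced to a placeholder, and the use of `--no-block` when starting the companion unit is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write FILE with every listed variable defined but
# empty, so a config that references them still resolves.
write_stub_env() {
    local out="$1" var
    shift
    : > "$out"
    for var in "$@"; do
        printf '%s=\n' "$var" >> "$out"
    done
}

# Hypothetical helper: choose the populator path without ever blocking.
# Prints "sync" when a role env file exists, otherwise writes the stub and
# prints "stub".
populate_fast() {
    local env_dir="$1" out="$2"
    if [ -f "$env_dir/worker.env" ] || [ -f "$env_dir/server.env" ]; then
        # Real script: fetch creds from Key Vault and write vector.env here.
        echo sync
    else
        # Variable names below are placeholders, not the real list.
        write_stub_env "$out" AXIOM_TOKEN AXIOM_DATASET
        # Real script would then hand off to the companion unit, e.g.:
        #   systemctl start --no-block populate-vector-env-wait.service
        echo stub
    fi
}
```

Both branches return immediately, which is the property that keeps multi-user.target from waiting on the populator.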

Why prior approaches failed (full history in populate-vector-env.sh
header):
  #249  After=cloud-final → systemd cycle, vector dropped silently.
  #254  exit 1 + Restart=on-failure → vector's restart-burst burnt
        the StartLimitBurst budget in <2s.
  #256  internal 90s poll → multi-user blocked 90s, populator gave up
        before cloud-final arrived at ~4 min anyway.
  #257  internal 600s poll → boot deadlock, every Azure worker reaped.

What we explored but didn't ship:
  - systemd .path unit watching the specific worker.env file (not the
    dir): would work, but adds a third unit and still needs the same
    decoupling between vector.service and the populator at boot time
    that this approach already achieves more directly.
  - Type=forking + setsid + disown in one unit: the detached child
    can be killed by systemd on unit stop unless KillMode=process,
    which has subtler semantics than a clean separate unit.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2 participants