release: v4.7.1 — liveness alarm actually opens issues (checkout + threshold) by askalf · Pull Request #321 · askalf/dario

askalf · 2026-05-18T13:06:19Z

What does this PR do?

Closes two latent bugs in the v4.4.2 liveness alarm that overnight observation exposed. The alarm had been firing every 2 hours and correctly detecting that the class-B watcher was past threshold — but failing before it could open the `cc-watcher-liveness` issue. Net result: the watcher-of-the-watcher was a no-op for the entire window from v4.4.2 (2026-05-17) through v4.7.0. If the class-B watcher had actually gone offline, no alert would have surfaced.

Bug 1 — missing `actions/checkout`

The workflow shelled out to `gh issue list` / `gh issue create` without first checking out the repo. `gh` resolves the target repository from the git remotes recorded in `.git/config` of the working directory; without a git context, it exits with `fatal: not a git repository`. The workflow logged `Last successful watcher run: 2026-05-18T06:35:47Z (4 hours ago, threshold 3h)` and then crashed.

Fix: `actions/checkout@v6.0.2` at the top of the job. We don't actually need any files — just the `.git` directory `gh` needs.
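A pre-flight check along these lines would turn the crash into an explicit message. This is a minimal sketch, not the workflow's actual code: `gh_can_resolve_repo` is a hypothetical helper name. It assumes two things: that `gh` accepts the `GH_REPO` environment variable as an explicit target, and that `git rev-parse --is-inside-work-tree` succeeds only inside a checkout.

```bash
# Hypothetical helper: succeed if gh has some way to pick a target repository.
gh_can_resolve_repo() {
  # An explicit GH_REPO (OWNER/REPO) means gh does not need a git context.
  if [ -n "${GH_REPO:-}" ]; then
    return 0
  fi
  # Otherwise gh needs a working tree whose remotes it can inspect.
  git rev-parse --is-inside-work-tree >/dev/null 2>&1
}

if gh_can_resolve_repo; then
  echo "gh can resolve a target repository"
else
  echo "no git context and GH_REPO is unset; gh cannot infer the repository" >&2
fi
```

The PR takes the checkout route, so this guard is only an illustration of the failure mode rather than a replacement for the fix.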

Bug 2 — threshold set against fictional cadence

I sized the v4.4.2 threshold (3h) against the declared `*/30 * * * *` cron (= 6 missed cycles). But GitHub Actions' free-tier cron scheduler is best-effort, not guaranteed. The observed cadence of the class-B watcher on this repo overnight was every 2-4 hours:

```
23:46:53 → 02:08:34 (2h22m)
02:08:34 → 06:35:47 (4h27m)
06:35:47 → 10:56:08 (4h21m)
```

So healthy watcher state would trip the 3h threshold ~half the time — but never actually surface as an alert because of Bug 1 above. Two bugs masking each other.

Fix: `THRESHOLD_HOURS` 3 → 8. Absorbs the observed skew while still catching real outages (anything past 8h of silence is signal, not noise).
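The threshold reasoning can be checked with a little arithmetic. The sketch below recomputes the three observed gaps from the timestamps above and counts how many exceed a given threshold. The helper names are illustrative and not part of the workflow, and the gap calculation assumes consecutive runs are less than 24 hours apart.

```bash
# Convert HH:MM:SS to seconds since midnight.
to_seconds() {
  local h m s rest
  h=${1%%:*}
  rest=${1#*:}
  m=${rest%%:*}
  s=${rest#*:}
  # Strip one leading zero so values like "08" are not parsed as octal.
  echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

# Gap in seconds between two times of day, wrapping past midnight.
gap_seconds() {
  local start end
  start=$(to_seconds "$1")
  end=$(to_seconds "$2")
  echo $(( (end - start + 86400) % 86400 ))
}

# Count how many gaps (in seconds) exceed a threshold given in hours.
count_over_threshold() {
  local limit count gap
  limit=$(( $1 * 3600 ))
  shift
  count=0
  for gap in "$@"; do
    if [ "$gap" -gt "$limit" ]; then
      count=$(( count + 1 ))
    fi
  done
  echo "$count"
}

# The three observed intervals from the overnight log.
gaps="$(gap_seconds 23:46:53 02:08:34) $(gap_seconds 02:08:34 06:35:47) $(gap_seconds 06:35:47 10:56:08)"
echo "gaps over 3h: $(count_over_threshold 3 $gaps)"
echo "gaps over 8h: $(count_over_threshold 8 $gaps)"
```

On this sample, two of the three healthy intervals exceed 3h and none exceed 8h, which is the case for the new threshold.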

Alert body text updated

The alert body now describes both the declared and observed cadence so an investigator reading the issue understands the threshold rationale instead of asking "why 8 hours?"

Documented

`docs/drift-monitor.md` gains an explicit "Observed cadence" column distinguishing declared from real-world cron. Plus a paragraph stating: GitHub Actions free-tier cron is best-effort; if you need sub-hour SLA, self-host the runner and the cron driver (e.g. a cron entry on the same Hetzner box calling `gh workflow run cc-drift-template-watch.yml --ref master`).
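As a rough sketch of that self-hosted cron driver, a small wrapper script could be called from crontab. Everything except the `gh workflow run` command itself is an assumption: the script path, the `trigger_watcher` name, and the `GH_BIN` and `WATCHER_REPO` variables are hypothetical. It also assumes `gh` is already authenticated on the box, and it passes `--repo` explicitly because such a box may have no checkout for `gh` to infer the repository from.

```bash
#!/bin/sh
# Hypothetical crontab entry (path is illustrative):
#   */30 * * * * /usr/local/bin/trigger-watcher.sh --run

# Trigger the class-B watcher workflow.
# GH_BIN is a hypothetical override that allows dry runs without calling gh.
trigger_watcher() {
  "${GH_BIN:-gh}" workflow run cc-drift-template-watch.yml \
    --ref master \
    --repo "${WATCHER_REPO:-askalf/dario}"
}

if [ "${1:-}" = "--run" ]; then
  trigger_watcher
else
  # Dry run: print the arguments gh would receive instead of calling it.
  GH_BIN=echo trigger_watcher
fi
```

Running the script without `--run` only prints the command line, which makes it easy to check the wiring before handing it to cron.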

How to test

```bash
git fetch origin fix/v4.7.1-liveness-alarm
git checkout fix/v4.7.1-liveness-alarm
npm run build && npm test # 75/75 (no src/ changes)

# Manual trigger after merge. Should now exit 0 if watcher last
# succeeded within 8h, or actually open a labeled issue if past:
gh workflow run cc-drift-watcher-liveness.yml --ref master
```

Checklist

- `npm run build` passes
- `npm test` passes (offline regression test, no credentials required): 75/75
- For changes that touch `proxy.ts`, `cc-template.ts`, or streaming behavior: tested with `dario proxy --verbose` + `node test/compat.mjs` (requires credentials). N/A: workflow + docs only
- No new runtime dependencies added
- No tokens/secrets in code or logs

Overnight surfaced two latent v4.4.2 bugs. Alarm had been firing every 2h, correctly detecting that the class-B watcher was past threshold, but failing before opening the cc-watcher-liveness issue. Effectively a no-op for the entire window from v4.4.2 (2026-05-17) through v4.7.0.

Bug 1: missing actions/checkout. gh resolves the target repo by reading .git/config from cwd; without a git context, exits with "fatal: not a git repository". Workflow exited 1 immediately after correctly logging the hours-since reading.

Bug 2: threshold set against fictional cadence. I sized 3h against the declared `*/30 * * * *` cron (6 missed cycles), but GitHub Actions free-tier cron is best-effort. Observed cadence on this repo is every 2-4h, not 30 min. Even healthy state would trip 3h half the time.

Fix:
- Add actions/checkout@v6.0.2 at job start (supplies .git)
- Bump THRESHOLD_HOURS 3 → 8 (absorbs observed 2-4h skew, still catches real outages)
- Update alert body text to describe both declared and observed cadence so investigators understand the threshold rationale

docs/drift-monitor.md gains explicit "Observed cadence" column distinguishing declared cron from real-world, plus a paragraph on the scheduler reality (sub-hour SLA requires self-hosting the cron driver too).

75/75 default suite green. No src/ changes.

askalf enabled auto-merge (squash) May 18, 2026 13:06

askalf merged commit 30595b9 into master May 18, 2026
9 checks passed

askalf deleted the fix/v4.7.1-liveness-alarm branch May 18, 2026 13:07
