release: v4.7.1 — liveness alarm actually opens issues (checkout + threshold) #321

Merged
askalf merged 1 commit into master from fix/v4.7.1-liveness-alarm
May 18, 2026

@askalf askalf commented May 18, 2026

What does this PR do?

Fixes two latent bugs in the v4.4.2 liveness alarm that overnight observation exposed. The alarm had been firing every 2 hours and correctly detecting that the class-B watcher was past threshold, but it failed before it could open the `cc-watcher-liveness` issue. Net result: the watcher-of-the-watcher was a no-op for the entire window from v4.4.2 (2026-05-17) through v4.7.0. If the class-B watcher had actually gone offline, no alert would have surfaced.

Bug 1 — missing `actions/checkout`

The workflow shelled out to `gh issue list` / `gh issue create` without first checking out the repo. Unless told otherwise (`--repo` or `GH_REPO`), `gh` infers the target repository from the git remotes of the working directory; without a git context, it exits with `fatal: not a git repository`. The workflow logged `Last successful watcher run: 2026-05-18T06:35:47Z (4 hours ago, threshold 3h)` and then exited 1.

Fix: run `actions/checkout@v6.0.2` at the top of the job. The job needs none of the checked-out files, only the `.git` directory that gives `gh` its repository context.
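A step script can also fail fast with a clearer message when that context is missing. The sketch below is a hypothetical guard, not the workflow's actual code: `have_repo_context` is an invented helper name, and it assumes `gh` takes its target either from the `GH_REPO` environment variable or from a git work tree in the current directory.

```bash
#!/usr/bin/env bash
# Hypothetical guard (not from the workflow): succeed only when gh has a way
# to work out which repository to target.
have_repo_context() {
  # GH_REPO (OWNER/REPO) names the repository explicitly.
  if [ -n "${GH_REPO:-}" ]; then
    return 0
  fi
  # Otherwise gh falls back to the git remotes of the working directory,
  # which only exist after a checkout.
  git rev-parse --is-inside-work-tree >/dev/null 2>&1
}
```

Calling `have_repo_context || { echo "no repo context: add actions/checkout or set GH_REPO" >&2; exit 1; }` before the first `gh issue` call would replace the opaque git error with an actionable one. On a GitHub-hosted runner, exporting `GH_REPO="$GITHUB_REPOSITORY"` might serve as an alternative to a checkout, though this PR takes the checkout route.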

Bug 2 — threshold set against fictional cadence

I sized the v4.4.2 threshold (3h) against the declared `*/30 * * * *` cron (= 6 missed cycles). But GitHub Actions' free-tier cron scheduler is best-effort, not guaranteed. The observed cadence of the class-B watcher on this repo overnight was one run every 2 to 4.5 hours:

```
23:46:53 → 02:08:34 (2h22m)
02:08:34 → 06:35:47 (4h27m)
06:35:47 → 10:56:08 (4h20m)
```

So a healthy watcher would trip the 3h threshold about half the time, yet never surface as an alert because of Bug 1 above. Two bugs masking each other.
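The gaps can be recomputed from the run timestamps. This is a minimal sketch assuming GNU `date`; `gap_hm` and `gap_exceeds` are hypothetical helper names, and the calendar dates attached to the times are assumptions (only `2026-05-18T06:35:47Z` appears in the log).

```bash
#!/usr/bin/env bash
# Sketch assuming GNU date. Prints the gap between two ISO 8601 UTC
# timestamps as "<hours>h<minutes>m", rounded to the nearest minute.
gap_hm() {
  local start end mins
  start=$(date -u -d "$1" +%s)
  end=$(date -u -d "$2" +%s)
  mins=$(( (end - start + 30) / 60 ))
  printf '%dh%02dm\n' $(( mins / 60 )) $(( mins % 60 ))
}

# Succeeds when the gap between $1 and $2 is longer than $3 hours.
gap_exceeds() {
  local start end
  start=$(date -u -d "$1" +%s)
  end=$(date -u -d "$2" +%s)
  [ $(( end - start )) -gt $(( $3 * 3600 )) ]
}

# Overnight runs in order; the dates are assumed, the times are from the PR.
set -- 2026-05-17T23:46:53Z 2026-05-18T02:08:34Z \
       2026-05-18T06:35:47Z 2026-05-18T10:56:08Z
prev=$1
shift
for cur in "$@"; do
  if gap_exceeds "$prev" "$cur" 3; then flag="over 3h"; else flag="within 3h"; fi
  echo "$(gap_hm "$prev" "$cur") $flag"
  prev=$cur
done
```

Two of the three gaps come out longer than 3h, so an alarm polling every 2 hours would often sample a healthy watcher in a state that looks stale.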

Fix: `THRESHOLD_HOURS` 3 → 8. Absorbs the observed skew while still catching real outages (anything past 8h of silence is signal, not noise).
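For illustration, the staleness check might look roughly like the following. This is a sketch under assumptions, not the workflow's actual script: it assumes GNU `date`, whole-hour granularity, and a "greater than or equal" comparison; `hours_since` and `watcher_is_stale` are hypothetical names. Only `THRESHOLD_HOURS` comes from the PR.

```bash
#!/usr/bin/env bash
# Sketch assuming GNU date; not the workflow's actual script.
THRESHOLD_HOURS=8

# Whole hours elapsed between the last successful run ($1) and a reference
# time ($2, defaulting to the current time).
hours_since() {
  local last ref
  last=$(date -u -d "$1" +%s)
  ref=$(date -u -d "${2:-now}" +%s)
  echo $(( (ref - last) / 3600 ))
}

# Succeeds (exit 0) when the watcher has been silent for THRESHOLD_HOURS
# or more, i.e. when an issue should be opened.
watcher_is_stale() {
  [ "$(hours_since "$1" "${2:-now}")" -ge "$THRESHOLD_HOURS" ]
}
```

With the run from the log, `hours_since 2026-05-18T06:35:47Z 2026-05-18T10:56:08Z` gives 4: past the old 3h threshold, but well inside the new 8h one, so `watcher_is_stale` returns non-zero and no issue would be opened.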

Alert body text updated

The alert body now describes both the declared and observed cadence so an investigator reading the issue understands the threshold rationale instead of asking "why 8 hours?"

Documented

`docs/drift-monitor.md` gains an explicit "Observed cadence" column distinguishing declared from real-world cron. Plus a paragraph stating: GitHub Actions free-tier cron is best-effort; if you need sub-hour SLA, self-host the runner and the cron driver (e.g. a cron entry on the same Hetzner box calling `gh workflow run cc-drift-template-watch.yml --ref master`).
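A self-hosted cron driver of that kind could be as small as the wrapper below. Treat it as a hypothetical sketch: the function name, the script path in the comment, the `OWNER/REPO` placeholder, and the `GH_BIN` dry-run override are all assumptions; only the `gh workflow run cc-drift-template-watch.yml --ref master` command comes from the docs paragraph.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper for a self-hosted cron driver. It could be called from
# a crontab entry such as (script path and OWNER/REPO are placeholders):
#   */30 * * * * /usr/local/bin/trigger-watcher.sh OWNER/REPO
# gh must already be authenticated on the box, for example through GH_TOKEN.
trigger_watcher() {
  local repo="$1"
  # GH_BIN is an invented override so the command can be dry-run with echo.
  "${GH_BIN:-gh}" workflow run cc-drift-template-watch.yml --ref master --repo "$repo"
}
```

`--repo` is passed explicitly because a cron job typically runs outside any checkout, which is the same missing-context situation as Bug 1. Setting `GH_BIN=echo` before calling `trigger_watcher OWNER/REPO` prints the command instead of dispatching the workflow.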

How to test

```bash
git fetch origin fix/v4.7.1-liveness-alarm
git checkout fix/v4.7.1-liveness-alarm
npm run build && npm test # 75/75 (no src/ changes)

# Manual trigger after merge: should now exit 0 if the watcher last
# succeeded within 8h, or actually open a labeled issue if past:
gh workflow run cc-drift-watcher-liveness.yml --ref master
```

Checklist

  • `npm run build` passes
  • `npm test` passes (offline regression test, no credentials required) — 75/75
  • For changes that touch `proxy.ts`, `cc-template.ts`, or streaming behavior: tested with `dario proxy --verbose` + `node test/compat.mjs` (requires credentials) — N/A: workflow + docs only
  • No new runtime dependencies added
  • No tokens/secrets in code or logs

Overnight observation surfaced two latent v4.4.2 bugs. The alarm had
been firing every 2h, correctly detecting that the class-B watcher
was past threshold, but failing before opening the
cc-watcher-liveness issue. Effectively a no-op for the entire window
from v4.4.2 (2026-05-17) through v4.7.0.

Bug 1: missing actions/checkout. gh resolves the target repo by
reading .git/config from cwd; without a git context, exits with
"fatal: not a git repository". Workflow exited 1 immediately
after correctly logging the hours-since reading.

Bug 2: threshold set against fictional cadence. I sized 3h
against the declared `*/30 * * * *` cron (6 missed cycles), but
GitHub Actions free-tier cron is best-effort. Observed cadence
on this repo is every 2-4h, not 30 min. Even healthy state
would trip 3h half the time.

Fix:
- Add actions/checkout@v6.0.2 at job start (supplies .git)
- Bump THRESHOLD_HOURS 3 → 8 (absorbs observed 2-4h skew,
  still catches real outages)
- Update alert body text to describe both declared and observed
  cadence so investigators understand the threshold rationale

docs/drift-monitor.md gains explicit "Observed cadence" column
distinguishing declared cron from real-world, plus a paragraph
on the scheduler reality (sub-hour SLA requires self-hosting
the cron driver too).

75/75 default suite green. No src/ changes.
@askalf askalf enabled auto-merge (squash) May 18, 2026 13:06
@askalf askalf merged commit 30595b9 into master May 18, 2026
9 checks passed
@askalf askalf deleted the fix/v4.7.1-liveness-alarm branch May 18, 2026 13:07