release: v4.7.1 — liveness alarm actually opens issues (checkout + threshold) #321
Merged
Overnight surfaced two latent v4.4.2 bugs. Alarm had been firing every 2h, correctly detecting that the class-B watcher was past threshold, but failing before opening the cc-watcher-liveness issue. Effectively a no-op for the entire window from v4.4.2 (2026-05-17) through v4.7.0.

Bug 1: missing actions/checkout. gh resolves the target repo by reading .git/config from cwd; without a git context, exits with "fatal: not a git repository". Workflow exited 1 immediately after correctly logging the hours-since reading.

Bug 2: threshold set against fictional cadence. I sized 3h against the declared `*/30 * * * *` cron (6 missed cycles), but GitHub Actions free-tier cron is best-effort. Observed cadence on this repo is every 2-4h, not 30 min. Even healthy state would trip 3h half the time.

Fix:
- Add actions/checkout@v6.0.2 at job start (supplies .git)
- Bump THRESHOLD_HOURS 3 → 8 (absorbs observed 2-4h skew, still catches real outages)
- Update alert body text to describe both declared and observed cadence so investigators understand the threshold rationale

docs/drift-monitor.md gains explicit "Observed cadence" column distinguishing declared cron from real-world, plus a paragraph on the scheduler reality (sub-hour SLA requires self-hosting the cron driver too).

75/75 default suite green. No src/ changes.
What does this PR do?
Closes two latent bugs in the v4.4.2 liveness alarm that overnight observation exposed. The alarm had been firing every 2 hours and correctly detecting that the class-B watcher was past threshold — but failing before it could open the `cc-watcher-liveness` issue. Net result: the watcher-of-the-watcher was a no-op for the entire window from v4.4.2 (2026-05-17) through v4.7.0. If the class-B watcher had actually gone offline, no alert would have surfaced.
Bug 1 — missing `actions/checkout`
The workflow shelled out to `gh issue list` / `gh issue create` without first checking out the repo. `gh` resolves the target repository by reading `.git/config` from the working directory; without a git context, it exits with `fatal: not a git repository`. The workflow logged `Last successful watcher run: 2026-05-18T06:35:47Z (4 hours ago, threshold 3h)` and then crashed.
Fix: `actions/checkout@v6.0.2` at the top of the job. We don't actually need any files — just the `.git` directory `gh` needs.
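To make this failure mode explicit rather than a bare `fatal:` line, a step could check for repository context before calling `gh`. The following is a minimal sketch, not part of this PR: `require_repo_context` is a hypothetical helper name. It relies on `GH_REPO`, the environment variable `gh` accepts as an explicit repository, and on `git rev-parse --is-inside-work-tree` to detect a checkout.

```shell
#!/usr/bin/env bash
# Hypothetical guard; the PR itself fixes the problem by adding actions/checkout.

# Succeeds when gh has some way to resolve the target repository:
# either GH_REPO is set, or the current directory is inside a git work tree.
require_repo_context() {
  if [ -n "${GH_REPO:-}" ]; then
    return 0
  fi
  if git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
    return 0
  fi
  echo "no repository context: run actions/checkout first or set GH_REPO" >&2
  return 1
}

# Example use inside a workflow step (commented out: needs an authenticated gh):
# require_repo_context && gh issue list --label cc-watcher-liveness --state open
```

With the checkout step in place the guard passes on the second branch; without it, the step fails with a message that names the missing prerequisite.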
Bug 2 — threshold set against fictional cadence
I sized the v4.4.2 threshold (3h) against the declared `*/30 * * * *` cron (= 6 missed cycles). But GitHub Actions' free-tier cron scheduler is best-effort, not guaranteed. The observed cadence of the class-B watcher on this repo overnight was every 2-4 hours:
```
23:46:53 → 02:08:34 (2h22m)
02:08:34 → 06:35:47 (4h27m)
06:35:47 → 10:56:08 (4h20m)
```
So healthy watcher state would trip the 3h threshold ~half the time — but never actually surface as an alert because of Bug 1 above. Two bugs masking each other.
Fix: `THRESHOLD_HOURS` 3 → 8. Absorbs the observed skew while still catching real outages (anything past 8h of silence is signal, not noise).
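The effect of the change can be checked against the three observed gaps. This is a minimal sketch, not the workflow's actual script: `hours_since` and `is_stale` are hypothetical helper names, and both the truncation to whole hours and the strict greater-than comparison are assumptions about how the check is written.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the real workflow step may compute this differently.

# Whole hours elapsed between two epoch timestamps (in seconds), truncated.
hours_since() {
  local last_epoch=$1 now_epoch=$2
  echo $(( (now_epoch - last_epoch) / 3600 ))
}

# Exit status 0 when the elapsed whole hours exceed the threshold.
is_stale() {
  local last_epoch=$1 now_epoch=$2 threshold_hours=$3
  [ "$(hours_since "$last_epoch" "$now_epoch")" -gt "$threshold_hours" ]
}

# Observed overnight gaps in seconds, derived from the run timestamps above.
for gap in 8501 16033 15621; do
  for threshold in 3 8; do
    if is_stale 0 "$gap" "$threshold"; then verdict=alarm; else verdict=ok; fi
    echo "gap=${gap}s threshold=${threshold}h -> ${verdict}"
  done
done
```

Under these assumptions the two gaps longer than four hours trip a 3h threshold, while none of the three trips an 8h threshold.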
Alert body text updated
The alert body now describes both the declared and observed cadence so an investigator reading the issue understands the threshold rationale instead of asking "why 8 hours?"
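As an illustration of what such a body could look like, here is a sketch with hypothetical wording. `alert_body` is an assumed helper name, the text is not copied from the workflow, and treating `cc-watcher-liveness` as the issue label is also an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical wording; the real alert body in the workflow may differ.

# Print an alert body that states both the declared and the observed cadence.
alert_body() {
  local hours=$1 threshold=$2
  cat <<EOF
Class-B watcher last succeeded ${hours}h ago (threshold ${threshold}h).
Declared cadence: */30 * * * * (every 30 minutes).
Observed cadence: every 2-4 hours, because GitHub Actions cron is best-effort.
The threshold is sized against the observed cadence, not the declared one.
EOF
}

# Example call (commented out: needs an authenticated gh and a repo context):
# gh issue create --title "Watcher liveness alarm" \
#   --label cc-watcher-liveness --body "$(alert_body 9 8)"
```

Keeping the rationale in the issue itself means the reader does not have to find this PR to learn why the threshold is 8 hours.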
Documented
`docs/drift-monitor.md` gains an explicit "Observed cadence" column distinguishing declared from real-world cron. Plus a paragraph stating: GitHub Actions free-tier cron is best-effort; if you need sub-hour SLA, self-host the runner and the cron driver (e.g. a cron entry on the same Hetzner box calling `gh workflow run cc-drift-template-watch.yml --ref master`).
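A self-hosted cron driver along those lines could be a small wrapper script. The sketch below is an assumption-laden example, not something shipped in this PR: `trigger_watcher`, the `DRY_RUN` switch, the script path in the crontab comment, and the `OWNER/REPO` placeholder are all hypothetical. It also assumes the target workflow declares a `workflow_dispatch` trigger, which `gh workflow run` requires, and that cron's environment provides a token such as `GH_TOKEN`.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper for a crontab entry such as:
#   */30 * * * * /usr/local/bin/trigger-watcher.sh
# REPO and the script path are placeholders, not values from this repository.
set -eu

REPO="${REPO:-OWNER/REPO}"
WORKFLOW="cc-drift-template-watch.yml"

# Dispatch the watcher workflow; with DRY_RUN=1 only print the command.
trigger_watcher() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "gh workflow run $WORKFLOW --ref master --repo $REPO"
  else
    gh workflow run "$WORKFLOW" --ref master --repo "$REPO"
  fi
}

# Demonstration only: show the command without contacting GitHub.
# A real cron script would call trigger_watcher without DRY_RUN.
DRY_RUN=1 trigger_watcher
```

Passing `--repo` explicitly avoids depending on a local checkout, which is the same class of problem as Bug 1.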
How to test
```bash
git fetch origin fix/v4.7.1-liveness-alarm
git checkout fix/v4.7.1-liveness-alarm
npm run build && npm test # 75/75 (no src/ changes)
# Manual trigger after merge: should now exit 0 if watcher last
# succeeded within 8h, or actually open a labeled issue if past:
gh workflow run cc-drift-watcher-liveness.yml --ref master
```
Checklist