Night Watch Retrospective
Purpose: Measure whether Night Watch comments are providing value, and automatically adapt behavior based on evidence.
Agent comments on PRs have a cost: they consume human attention. Without measurement, an agent can't distinguish between "helpful diagnosis that unblocked someone" and "noise that trained people to ignore bot comments." The Retrospective Engine closes this loop.
When Night Watch posts a comment, record it in state:
```json
{
  "commentId": 4009239925,
  "prNumber": 459,
  "prAuthor": "mellyeliu",
  "authorType": "external",
  "commentType": "ci_diagnosis",
  "postedAt": "2026-03-06T01:09:22Z",
  "shiftId": "2026-03-05-evening",
  "scored": false
}
```

At the end of each shift (last QA run before business hours), revisit every comment posted during the shift and score it:
Positive signals:

| Signal | Points | How to Detect |
|---|---|---|
| Human replied to the comment | +3 | Check comment thread for non-bot replies posted after ours |
| Human reacted (👍, 🎉, ❤️, 🚀) | +2 | Check comment reactions, exclude bot reactions |
| PR author pushed a fix within 4h of comment | +2 | Compare comment timestamp to next commit on the PR |
| PR was merged within 24h of comment | +1 | Check PR merge status and timestamp |

Negative signals:

| Signal | Points | How to Detect |
|---|---|---|
| Comment on maintainer's own PR (rule violation) | -3 | Check prAuthor against maintainer list |
| No human engagement after 24h | -1 | No replies, no reactions, no commits after comment |
| Duplicate of previous Night Watch comment on same PR | -3 | Check if same error type was already diagnosed this shift |
| PR was reopened after Night Watch closed it | -5 | Check PR timeline events for reopen after our close |
| Comment restates info visible in GitHub UI | -1 | Heuristic: merge_conflict type always gets this penalty |
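
As a rough sketch, the two point tables can be reduced to a pure function over signals that have already been detected. The `Signals` class, its field names, and `score_comment` are hypothetical names introduced here for illustration; only the point values come from the tables. Detection itself (GitHub API calls, bot filtering, time windows) is assumed to happen upstream.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Detected signals for one comment. Field names are illustrative only."""
    human_replied: bool = False
    human_reacted: bool = False
    fix_pushed_within_4h: bool = False
    merged_within_24h: bool = False
    on_maintainer_pr: bool = False
    no_engagement_after_24h: bool = False
    duplicate_diagnosis: bool = False
    reopened_after_close: bool = False
    restates_github_ui: bool = False


# Point values copied from the positive and negative signal tables.
POINTS = {
    "human_replied": 3,
    "human_reacted": 2,
    "fix_pushed_within_4h": 2,
    "merged_within_24h": 1,
    "on_maintainer_pr": -3,
    "no_engagement_after_24h": -1,
    "duplicate_diagnosis": -3,
    "reopened_after_close": -5,
    "restates_github_ui": -1,
}


def score_comment(signals: Signals) -> int:
    """Sum the points of every signal that fired for this comment."""
    return sum(points for name, points in POINTS.items() if getattr(signals, name))


# A reply plus a quick fix from the author scores 3 + 2.
print(score_comment(Signals(human_replied=True, fix_pushed_within_4h=True)))
```

Keeping the point table as data rather than a chain of `if` statements means a human can retune a weight in one place without touching the detection logic.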
```
# Pseudocode: the gh api calls and filter() are shorthand, not runnable as written
for comment in shift_comments:
    score = 0

    # Positive signals
    human_replies = gh api /repos/.../issues/{pr}/comments \
        | filter(created_after=comment.postedAt, user.type != "Bot")
    if human_replies: score += 3

    human_reactions = gh api /repos/.../issues/comments/{id}/reactions \
        | filter(user.type != "Bot")
    if human_reactions: score += 2

    fix_commits = gh api /repos/.../pulls/{pr}/commits \
        | filter(date > comment.postedAt, date < comment.postedAt + 4h)
    if fix_commits and comment.authorType == "external": score += 2

    pr = gh api /repos/.../pulls/{pr}
    if pr.merged and within_24h(pr.merged_at, comment.postedAt): score += 1

    # Negative signals
    if comment.authorType == "maintainer": score -= 3
    if comment.commentType == "merge_conflict": score -= 1
    if no_human_engagement_after_24h(comment): score -= 1
    if duplicate_of_earlier_diagnosis(comment): score -= 3    # from the table above
    if reopened_after_night_watch_close(pr): score -= 5       # from the table above

    comment.score = score
    comment.scored = true
```

Track a rolling score across shifts:
```json
{
  "retrospective": {
    "shifts": [
      {
        "shiftId": "2026-03-05-evening",
        "commentsPosted": 3,
        "totalScore": -2,
        "avgScore": -0.67,
        "breakdown": {
          "ci_diagnosis_external": { "count": 1, "score": 2 },
          "ci_diagnosis_maintainer": { "count": 2, "score": -4 }
        }
      }
    ],
    "rollingAvgScore": -0.67,
    "adaptations": []
  }
}
```

Adaptations trigger when the rolling average (over the last 5 shifts) stays past a threshold for the number of shifts listed:
| Rolling Avg Score | Adaptation |
|---|---|
| < -1.0 for 3 shifts | Disable all comments except Navi-authored PR fixes |
| < -2.0 for 3 shifts | Pause QA commenting entirely, alert human |
| > 1.0 for 5 shifts | Log success — current rules are working |
| > 2.0 for 5 shifts | Consider expanding scope (e.g., suggest fixes, not just diagnose) |
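
One way to read the table is sketched below. It rests on two assumptions that the page does not spell out: the rolling average is the mean of per-shift `avgScore` over the last 5 shifts, and "for N shifts" means the N most recent shifts in a row. The function names (`rolling_averages`, `held_for`, `pick_adaptation`) are hypothetical.

```python
from typing import Callable, List, Optional

WINDOW = 5  # "last 5 shifts"


def rolling_averages(avg_scores: List[float], window: int = WINDOW) -> List[float]:
    """Rolling mean of per-shift avgScore, one value per shift, oldest first."""
    out = []
    for i in range(len(avg_scores)):
        recent = avg_scores[max(0, i - window + 1): i + 1]
        out.append(sum(recent) / len(recent))
    return out


def held_for(values: List[float], n: int, predicate: Callable[[float], bool]) -> bool:
    """True if the last n values all satisfy the predicate (assumed: consecutive)."""
    return len(values) >= n and all(predicate(v) for v in values[-n:])


def pick_adaptation(avg_scores: List[float]) -> Optional[str]:
    """Return the adaptation from the table that applies, most severe first."""
    rolling = rolling_averages(avg_scores)
    if held_for(rolling, 3, lambda v: v < -2.0):
        return "pause QA commenting entirely, alert human"
    if held_for(rolling, 3, lambda v: v < -1.0):
        return "disable all comments except Navi-authored PR fixes"
    if held_for(rolling, 5, lambda v: v > 2.0):
        return "consider expanding scope"
    if held_for(rolling, 5, lambda v: v > 1.0):
        return "log success"
    return None


print(pick_adaptation([-1.5, -1.2, -1.4]))
```

Checking the stricter threshold first matters because any run below -2.0 is also below -1.0; the same ordering applies on the positive side.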
When an adaptation triggers, log it:
```json
{
  "adaptations": [
    {
      "date": "2026-03-08",
      "trigger": "rollingAvg < -1.0 for 3 shifts",
      "action": "disabled all comments except Navi PR fixes",
      "rollingAvgAtTime": -1.3
    }
  ]
}
```

The human can override any adaptation by editing the state file or updating this wiki page.
Once per week (Sunday evening), compose a retrospective summary:
```markdown
## Night Watch QA - Weekly Retrospective

**Period:** Mar 1–7, 2026
**Comments posted:** 12
**Signal score:** +3 (avg +0.25/comment)

### What worked:
- CI diagnosis on PR #459 (mellyeliu): author pushed fix within 1h (+5)

### What didn't:
- 3 merge conflict pings: zero engagement (-3)
- 2 diagnoses on cixzhang's PRs: they fixed independently (-4)

### Adaptations applied:
- None this week (score above threshold)

### Trend:
Week 1: -2.1 → Week 2: +0.25 (improving after QA v2 rules)
```
Post to the Night Watch shift log. Optionally post to Workplace if the human has configured it.
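
The headline numbers in the digest can be derived from the per-shift entries in the state file. The following is a minimal sketch, assuming each shift entry carries the `commentsPosted` and `totalScore` fields shown in the rolling-score state; `digest_stats` is a hypothetical helper name, and the period string is passed in rather than computed.

```python
from typing import Dict, List


def digest_stats(period: str, shifts: List[Dict]) -> str:
    """Render the three summary lines of the weekly digest from shift entries."""
    posted = sum(s["commentsPosted"] for s in shifts)
    total = sum(s["totalScore"] for s in shifts)
    # Avoid dividing by zero in a week with no comments.
    avg = total / posted if posted else 0.0
    return "\n".join([
        f"**Period:** {period}",
        f"**Comments posted:** {posted}",
        f"**Signal score:** {total:+d} (avg {avg:+.2f}/comment)",
    ])


week = [
    {"commentsPosted": 5, "totalScore": 4},
    {"commentsPosted": 7, "totalScore": -1},
]
print(digest_stats("Mar 1-7, 2026", week))
```

The "What worked" and "What didn't" sections need per-comment context and are left out of this sketch.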
Most agent automation has no feedback loop. The agent does a thing, nobody measures whether it helped, and the behavior persists forever — even if it's net-negative.
The Retrospective Engine treats agent behavior like a product feature: measure adoption (did humans engage?), measure impact (did it unblock work?), and deprecate features that don't carry their weight.
This is the difference between an agent that does things and an agent that learns which things are worth doing.
- Scoring runs at end-of-shift, not real-time. This avoids excessive GitHub API calls.
- Adaptations are conservative — they only trigger after 3+ shifts of consistent signal.
- All scores and adaptations are logged in state for auditability.
- The weekly digest is the human-readable summary; the state file is the machine-readable truth.
- The Retrospective Engine applies to QA only. Quartermaster and Bob produce artifacts (diffs, state tracking), not comments — they don't need engagement scoring.