
Night Watch Retrospective

Cindy Zhang edited this page Jun 23, 2026 · 1 revision

Night Watch: Retrospective Engine

Purpose: Measure whether Night Watch comments are providing value, and automatically adapt behavior based on evidence.


The Problem

Agent comments on PRs have a cost: they consume human attention. Without measurement, an agent can't distinguish between "helpful diagnosis that unblocked someone" and "noise that trained people to ignore bot comments." The Retrospective Engine closes this loop.


How It Works

1. Track Every Outbound Comment

When Night Watch posts a comment, record it in state:

{
  "commentId": 4009239925,
  "prNumber": 459,
  "prAuthor": "mellyeliu",
  "authorType": "external",
  "commentType": "ci_diagnosis",
  "postedAt": "2026-03-06T01:09:22Z",
  "shiftId": "2026-03-05-evening",
  "scored": false
}
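A minimal sketch of this recording step, assuming the state lives in a single JSON file and that records are kept under a top-level `comments` list. The file layout, the `comments` key, and the `record_comment` helper are assumptions for illustration only; the field names mirror the example record above.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class CommentRecord:
    commentId: int
    prNumber: int
    prAuthor: str
    authorType: str   # "external" or "maintainer"
    commentType: str  # e.g. "ci_diagnosis", "merge_conflict"
    postedAt: str     # ISO 8601 timestamp, UTC
    shiftId: str
    scored: bool = False  # flipped once the end-of-shift scoring has run


def record_comment(state_path: Path, record: CommentRecord) -> dict:
    """Append one outbound comment to the state file and return the new state."""
    state = json.loads(state_path.read_text()) if state_path.exists() else {}
    # "comments" is a hypothetical key; the real state layout may differ.
    state.setdefault("comments", []).append(asdict(record))
    state_path.write_text(json.dumps(state, indent=2))
    return state
```

Keeping `scored` as a default field means a freshly posted comment is always picked up by the next scoring pass without extra bookkeeping.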

2. Score Comments at End of Shift

At the end of each shift (last QA run before business hours), revisit every comment posted during the shift and score it:

Positive Signals (comment was useful)

| Signal | Points | How to Detect |
| --- | --- | --- |
| Human replied to the comment | +3 | Check comment thread for non-bot replies posted after ours |
| Human reacted (👍, 🎉, ❤️, 🚀) | +2 | Check comment reactions, exclude bot reactions |
| PR author pushed a fix within 4h of comment | +2 | Compare comment timestamp to next commit on the PR |
| PR was merged within 24h of comment | +1 | Check PR merge status and timestamp |

Negative Signals (comment was noise)

| Signal | Points | How to Detect |
| --- | --- | --- |
| Comment on maintainer's own PR (rule violation) | -3 | Check prAuthor against maintainer list |
| No human engagement after 24h | -1 | No replies, no reactions, no commits after comment |
| Duplicate of previous Night Watch comment on same PR | -3 | Check if same error type was already diagnosed this shift |
| PR was reopened after Night Watch closed it | -5 | Check PR timeline events for reopen after our close |
| Comment restates info visible in GitHub UI | -1 | Heuristic: merge_conflict type always gets this penalty |

Scoring Implementation

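for comment in shift_comments:
  score = 0

  # Positive signals
  # Replies from humans (not bots) posted after our comment
  human_replies = gh api /repos/.../issues/{prNumber}/comments \
    | filter(created_at > comment.postedAt, user.type != "Bot")
  if human_replies: score += 3

  # Reactions from humans on our comment
  human_reactions = gh api /repos/.../issues/comments/{commentId}/reactions \
    | filter(user.type != "Bot")
  if human_reactions: score += 2

  # Commits pushed within 4h of our comment (counted for external authors only)
  fix_commits = gh api /repos/.../pulls/{prNumber}/commits \
    | filter(date > comment.postedAt, date < comment.postedAt + 4h)
  if fix_commits and comment.authorType == "external": score += 2

  # Merge within 24h of our comment
  pr = gh api /repos/.../pulls/{prNumber}
  if pr.merged and within_24h(pr.merged_at, comment.postedAt): score += 1

  # Negative signals
  if comment.authorType == "maintainer": score -= 3
  if comment.commentType == "merge_conflict": score -= 1
  if no_human_engagement_after_24h(comment): score -= 1
  # Remaining table rows; both helper names are placeholders
  if duplicate_diagnosis_this_shift(comment): score -= 3
  if reopened_after_our_close(comment): score -= 5

  comment.score = score
  comment.scored = true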
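
The pseudocode above mixes shell and Python. As a rough runnable counterpart, the sketch below scores one comment from data that has already been fetched, so it needs no network access. The `Evidence` shape and its field names are assumptions made for this example, not part of the Night Watch state format, and the caller is assumed to run it only once the 24h window has elapsed.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Evidence:
    """Pre-fetched facts about one comment (hypothetical shape)."""
    posted_at: datetime
    author_type: str                      # "external" or "maintainer"
    comment_type: str                     # e.g. "ci_diagnosis"
    human_reply_times: List[datetime] = field(default_factory=list)
    human_reaction_count: int = 0
    commit_times: List[datetime] = field(default_factory=list)
    merged_at: Optional[datetime] = None
    is_duplicate: bool = False            # same error type already diagnosed
    reopened_after_close: bool = False    # PR reopened after our close


def score_comment(ev: Evidence) -> int:
    """Apply the positive and negative signal tables to one comment."""
    replies = [t for t in ev.human_reply_times if t > ev.posted_at]
    later_commits = [t for t in ev.commit_times if t > ev.posted_at]
    four_hours = ev.posted_at + timedelta(hours=4)

    score = 0
    # Positive signals
    if replies:
        score += 3
    if ev.human_reaction_count > 0:
        score += 2
    if ev.author_type == "external" and any(t < four_hours for t in later_commits):
        score += 2
    if ev.merged_at is not None:
        delta = ev.merged_at - ev.posted_at
        if timedelta(0) <= delta <= timedelta(hours=24):
            score += 1

    # Negative signals
    if ev.author_type == "maintainer":
        score -= 3
    if not replies and ev.human_reaction_count == 0 and not later_commits:
        score -= 1
    if ev.is_duplicate:
        score -= 3
    if ev.reopened_after_close:
        score -= 5
    if ev.comment_type == "merge_conflict":
        score -= 1
    return score
```

Separating fetching from scoring keeps the point arithmetic testable on its own; the GitHub calls can then be swapped or mocked without touching the rules.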

3. Aggregate and Adapt

Track a rolling score across shifts:

{
  "retrospective": {
    "shifts": [
      {
        "shiftId": "2026-03-05-evening",
        "commentsPosted": 3,
        "totalScore": -2,
        "avgScore": -0.67,
        "breakdown": {
          "ci_diagnosis_external": { "count": 1, "score": 2 },
          "ci_diagnosis_maintainer": { "count": 2, "score": -4 }
        }
      }
    ],
    "rollingAvgScore": -0.67,
    "adaptations": []
  }
}
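
One way the per-shift entry and the rolling score could be computed is sketched below. It assumes each scored comment is a dict with `commentType`, `authorType`, and `score`, that breakdown keys are formed as `commentType_authorType` (as in the example above), and that the rolling score is the mean of per-shift `avgScore` values over the last five shifts. The document does not pin down that last definition, so treat it as an assumption.

```python
from collections import defaultdict
from typing import Dict, List


def summarize_shift(shift_id: str, comments: List[dict]) -> dict:
    """Build one entry for the "shifts" list from scored comments."""
    breakdown: Dict[str, dict] = defaultdict(lambda: {"count": 0, "score": 0})
    for c in comments:
        key = f'{c["commentType"]}_{c["authorType"]}'
        breakdown[key]["count"] += 1
        breakdown[key]["score"] += c["score"]

    total = sum(c["score"] for c in comments)
    avg = round(total / len(comments), 2) if comments else 0.0
    return {
        "shiftId": shift_id,
        "commentsPosted": len(comments),
        "totalScore": total,
        "avgScore": avg,
        "breakdown": dict(breakdown),
    }


def rolling_avg(shifts: List[dict], window: int = 5) -> float:
    """Mean of avgScore over the most recent `window` shifts (assumed definition)."""
    recent = shifts[-window:]
    if not recent:
        return 0.0
    return round(sum(s["avgScore"] for s in recent) / len(recent), 2)
```

Fed three comments scored 2, -2, and -2, this reproduces the example entry: `totalScore` of -2 and `avgScore` of -0.67.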

Automatic Adaptations

These trigger when the rolling average (last 5 shifts) stays beyond a threshold for the number of consecutive shifts shown:

| Rolling Avg Score | Adaptation |
| --- | --- |
| < -1.0 for 3 shifts | Disable all comments except Navi-authored PR fixes |
| < -2.0 for 3 shifts | Pause QA commenting entirely, alert human |
| > 1.0 for 5 shifts | Log success: current rules are working |
| > 2.0 for 5 shifts | Consider expanding scope (e.g., suggest fixes, not just diagnose) |

When an adaptation triggers, log it:

{
  "adaptations": [
    {
      "date": "2026-03-08",
      "trigger": "rollingAvg < -1.0 for 3 shifts",
      "action": "disabled all comments except Navi PR fixes",
      "rollingAvgAtTime": -1.3
    }
  ]
}

The human can override any adaptation by editing the state file or updating this wiki page.
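
The threshold check could look roughly like the sketch below. It assumes the rolling average is recorded once per shift and that "for N shifts" means the last N recorded values all sit beyond the threshold. Rules are ordered most severe first so that only the strongest matching adaptation fires. `RULES` and `pick_adaptation` are hypothetical names.

```python
from typing import List, Optional

# (comparison, threshold, consecutive shifts required, action)
# Ordered most severe first so a -2.0 breach is not masked by the -1.0 rule.
RULES = [
    ("<", -2.0, 3, "paused QA commenting entirely, alerted human"),
    ("<", -1.0, 3, "disabled all comments except Navi PR fixes"),
    (">", 2.0, 5, "flagged for possible scope expansion"),
    (">", 1.0, 5, "logged success, current rules are working"),
]


def pick_adaptation(rolling_history: List[float], date: str) -> Optional[dict]:
    """Return an adaptation log entry, or None if no rule applies.

    rolling_history holds one rolling-average value per shift, oldest first.
    """
    for op, threshold, needed, action in RULES:
        recent = rolling_history[-needed:]
        if len(recent) < needed:
            continue  # not enough shifts of evidence yet
        if op == "<":
            hit = all(v < threshold for v in recent)
        else:
            hit = all(v > threshold for v in recent)
        if hit:
            return {
                "date": date,
                "trigger": f"rollingAvg {op} {threshold} for {needed} shifts",
                "action": action,
                "rollingAvgAtTime": recent[-1],
            }
    return None
```

Returning `None` when fewer than the required shifts are available keeps the behavior conservative during the first few shifts after a reset.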

4. Weekly Digest

Once per week (Sunday evening), compose a retrospective summary:

## Night Watch QA — Weekly Retrospective

**Period:** Mar 1–7, 2026
**Comments posted:** 12
**Signal score:** +3 (avg +0.25/comment)

### What worked:
- CI diagnosis on PR #459 (mellyeliu) — author pushed fix within 1h (+5)

### What didn't:
- 3 merge conflict pings — zero engagement (-3)
- 2 diagnoses on cixzhang's PRs — they fixed independently (-4)

### Adaptations applied:
- None this week (score above threshold)

### Trend:
Week 1: -2.1 → Week 2: +0.25 (improving after QA v2 rules)

Post to the Night Watch shift log. Optionally post to Workplace if the human has configured it.
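
As an illustration of how the summary figures at the top of the digest might be derived, the sketch below renders the three statistics lines from a list of per-comment scores. The `digest_stats` helper is hypothetical; the narrative sections ("What worked", "What didn't") are left to the agent.

```python
from typing import List


def digest_stats(period: str, scores: List[int]) -> str:
    """Render the period, comment count, and signal score lines."""
    total = sum(scores)
    avg = total / len(scores) if scores else 0.0
    return "\n".join([
        f"**Period:** {period}",
        f"**Comments posted:** {len(scores)}",
        # Explicit sign formatting so positive weeks show a leading "+"
        f"**Signal score:** {total:+d} (avg {avg:+.2f}/comment)",
    ])
```

With twelve comments totalling +3, the last line reads `**Signal score:** +3 (avg +0.25/comment)`, matching the sample digest.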


Why This Matters

Most agent automation has no feedback loop. The agent does a thing, nobody measures whether it helped, and the behavior persists forever — even if it's net-negative.

The Retrospective Engine treats agent behavior like a product feature: measure adoption (did humans engage?), measure impact (did it unblock work?), and deprecate features that don't carry their weight.

This is the difference between an agent that does things and an agent that learns which things are worth doing.


Implementation Notes

  • Scoring runs at end-of-shift, not real-time. This avoids excessive GitHub API calls.
  • Adaptations are conservative — they only trigger after 3+ shifts of consistent signal.
  • All scores and adaptations are logged in state for auditability.
  • The weekly digest is the human-readable summary; the state file is the machine-readable truth.
  • The Retrospective Engine applies to QA only. Quartermaster and Bob produce artifacts (diffs, state tracking), not comments — they don't need engagement scoring.
