Implement safe-output outcome evaluation based on observable repository state rather than workflow self-assessment or artifact survival.
Current outcome evaluation is too weak when it treats object existence as acceptance. That inflates acceptance rates and makes workflow effectiveness metrics misleading.
Current outcome evaluation has a mix of dedicated evaluators, placeholder evaluators, and fallback behavior.
Use this issue as a decomposition and sequencing source, not as a single implementation task.
Full Detailed Design Reference
This section preserves the full implementation guidance and is intentionally more detailed than the main issue body.
Core Principle
For implementation, accepted outcomes should be defined as human-checkable evidence and explicitly separated from weaker "artifact still exists" signals.
Core implementation principle:
A safe output is the agent's proposed action. An accepted outcome is the later repository state that a human reviewer would reasonably inspect to decide whether that action was useful, correct, or intentionally retained.
This fits the underlying safe-output design: safe outputs are validated operations executed outside the agent's write-permission context, while outcomes are meant to evaluate what happened after those operations based on repository state rather than the workflow's own self-assessment.
1. Implementation Goal
Build an outcome evaluator that answers, for every safe output item:
- What did the agent safely output?
- What GitHub object did it affect or create?
- What happened to that object afterward?
- Would a human checking the repo treat that as accepted, rejected, pending, ignored, or unmeasured?
- How strong is that evidence?
The most important implementation decision is to avoid treating all accepted statuses as equally strong. A PR merged by a maintainer is much stronger evidence than "the PR still exists."
Two layers should be implemented:
outcome_status:
accepted | rejected | pending | ignored | skipped | unknown
evidence_strength:
strong | medium | weak | none
This preserves compatibility with outcome reporting while keeping the measurement honest.
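The two layers can be modeled as separate enumerations so that an accepted status never implies a particular strength. This is a minimal sketch: the class names `OutcomeStatus` and `EvidenceStrength` and the helper `is_stronger` are assumptions for illustration, while the values come from the lists above.

```python
from enum import Enum


class OutcomeStatus(str, Enum):
    ACCEPTED = "accepted"
    REJECTED = "rejected"
    PENDING = "pending"
    IGNORED = "ignored"
    SKIPPED = "skipped"
    UNKNOWN = "unknown"


class EvidenceStrength(str, Enum):
    STRONG = "strong"
    MEDIUM = "medium"
    WEAK = "weak"
    NONE = "none"


# Hypothetical ranking used only to compare strengths; higher means stronger.
_RANK = {
    EvidenceStrength.NONE: 0,
    EvidenceStrength.WEAK: 1,
    EvidenceStrength.MEDIUM: 2,
    EvidenceStrength.STRONG: 3,
}


def is_stronger(a: EvidenceStrength, b: EvidenceStrength) -> bool:
    """Return True when evidence `a` outranks evidence `b`."""
    return _RANK[a] > _RANK[b]
```

Keeping the two enumerations independent means a report can still show a single accepted count while the strength breakdown stays available.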
2. Recommended Outcome States
Use these states consistently across all safe output types.
| State | Meaning |
| --- | --- |
| accepted | Observable repository state suggests the output was useful, correct, or intentionally retained. |
| rejected | Observable repository state suggests the output was undone, removed, closed as invalid, reverted, or contradicted. |
| pending | The output is still in flight; there has not been enough time or activity to judge. |
| ignored | The output exists but received no meaningful human or repository response within the evaluation window. |
| skipped | No human-facing outcome should be evaluated, for example noop, missing_tool, or cancelled outputs. |
| unknown | Required target object or audit data could not be fetched. |
Attach evidence strength separately:
| Evidence strength | Meaning |
| --- | --- |
| strong | Direct human or repository acceptance signal: merged PR, resolved review, workflow success, human reply, retained metadata after triage. |
| medium | Indirect but meaningful signal: issue triaged, label retained, milestone retained, linked issue retained. |
| weak | Artifact merely still exists or target still exists. |
| none | No measurable evidence. |
3. Universal Evaluation Schema
Each evaluated output should produce a normalized record like this:
    {
      "safe_output_id": "run-id:item-index",
      "safe_output_type": "create_pull_request",
      "target": {
        "repo": "owner/repo",
        "kind": "pull_request",
        "number": 123,
        "node_id": "..."
      },
      "created_at": "2026-05-26T10:00:00Z",
      "evaluated_at": "2026-05-27T10:00:00Z",
      "evaluation_window_hours": 24,
      "outcome_status": "accepted",
      "evidence_strength": "strong",
      "human_check_signal": "pull_request_merged",
      "bot_aware": true,
      "actor_summary": {
        "visible_non_bot_actor_count": 2,
        "bot_actor_count": 1
      },
      "details": {
        "merged": true,
        "merged_by_type": "User",
        "closed": true,
        "reverted": false
      },
      "confidence": "high",
      "notes": "PR was merged by a visible non-bot actor."
    }
The key field is human_check_signal. That is the concrete thing a human would check.
4. Data Collection Requirements
Enough metadata must be persisted at safe-output execution time to evaluate outcomes later.
For every safe output, store a record like:
    {
      "type": "add_labels",
      "run_id": "...",
      "workflow_name": "...",
      "repo": "owner/repo",
      "actor": "github-actions[bot]",
      "created_at": "...",
      "target_url": "...",
      "target_node_id": "...",
      "target_number": 123,
      "payload_hash": "...",
      "expected_state": {
        "labels_added": ["bug", "needs-triage"]
      }
    }
For update operations, store the before and after values. Without them, later evaluation cannot determine whether a subsequent change retained, reverted, or replaced the intended result.
Example for update_issue:
    {
      "type": "update_issue",
      "target": {
        "repo": "owner/repo",
        "issue_number": 42
      },
      "before": {
        "title": "Old title",
        "body_hash": "abc123"
      },
      "after": {
        "title": "New title",
        "body_hash": "def456"
      }
    }
For PR and code-changing operations, store commit SHAs, branch name, patch hash, PR number, and base/head refs.
5. Evaluation Window
Use multiple windows rather than one hard cutoff.
Recommended:
T+1h: early signal
T+24h: primary signal
T+7d: durable signal
T+30d: long-term or revert signal for code changes
Why: some outputs are accepted quickly, like comments or labels. Others, especially PRs, may take days. A PR open after 24 hours should usually be pending, not ignored or rejected.
For research reporting, include all three:
accepted_at_24h
accepted_at_7d
accepted_durable_at_30d
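The multi-window approach can be sketched as a lookup of which windows have fully elapsed for a given output. The window labels and offsets come from the list above; the names `WINDOWS` and `elapsed_windows` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Window labels and offsets from the recommended list, shortest first.
WINDOWS = {
    "T+1h": timedelta(hours=1),
    "T+24h": timedelta(hours=24),
    "T+7d": timedelta(days=7),
    "T+30d": timedelta(days=30),
}


def elapsed_windows(created_at: datetime, now: datetime) -> list[str]:
    """Return the labels of every window that has fully elapsed."""
    age = now - created_at
    return [label for label, offset in WINDOWS.items() if age >= offset]


created = datetime(2026, 5, 26, 10, 0, tzinfo=timezone.utc)
print(elapsed_windows(created, created + timedelta(hours=25)))  # ['T+1h', 'T+24h']
```

An evaluator could use the result to decide, for example, that an open PR with only `T+24h` elapsed is still pending rather than ignored.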
6. Bot-Aware Human Check
Do not assume every visible action is human. Visible actor identity is not perfect provenance: a non-bot actor may still be AI-assisted, and hidden authorship cannot be fully observed. Implement this as visible non-bot activity, not definitely human activity.
Recommended actor categories:
bot_actor
visible_non_bot_actor
same_workflow_actor
unknown_actor
system_actor
A signal is stronger when it involves a visible non-bot actor other than the original workflow actor.
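One way to apply these categories is a small classifier over the visible actor's login and type. This is a sketch under stated assumptions: `actor_type` is assumed to carry values such as "User" or "Bot" (as in the `merged_by_type` example earlier), the "[bot]" suffix check is a heuristic, and `system_actor` detection is omitted because this document does not define how to recognize it.

```python
from typing import Optional


def classify_actor(
    login: Optional[str],
    actor_type: Optional[str],
    workflow_actor: str,
) -> str:
    """Map a visible actor to one of the recommended actor categories."""
    if login is None:
        return "unknown_actor"
    if login == workflow_actor:
        # The actor that produced the safe output cannot confirm its own work.
        return "same_workflow_actor"
    if actor_type == "Bot" or login.endswith("[bot]"):
        return "bot_actor"
    if actor_type == "User":
        # Visible non-bot activity, not proof of human authorship.
        return "visible_non_bot_actor"
    return "unknown_actor"


def strengthens_signal(category: str) -> bool:
    """Only visible non-bot actors other than the workflow actor strengthen a signal."""
    return category == "visible_non_bot_actor"
```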
7. Per-Safe-Output Implementation Rules
Issues and Discussions
create_issue
Human check: did the issue receive meaningful triage, assignment, linkage, closure, or completion?
| Repository state | Outcome | Strength |
| --- | --- | --- |
| Closed as completed or resolved | accepted | strong |
| Assigned, labeled by non-bot, milestone added, linked to PR, referenced in commit or PR | accepted | medium |
| Open with non-bot comment, reaction, or triage | pending or accepted_medium, depending on policy | medium |
| Closed as duplicate, not planned, invalid, or spam | rejected | strong |
| Open with no non-bot interaction after window | ignored | weak |
| Issue deleted or inaccessible | unknown or rejected, depending on audit availability | none |
Implementation notes:
- Store issue number and initial body hash
- Query issue state, timeline events, comments, labels, assignees, milestone, linked PRs, and closing reason
- Do not treat "issue still exists" as accepted
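The create_issue rules above could be applied roughly as follows. This is an illustrative sketch, not the implementation: the dictionary keys (`assignees`, `linked_prs`, `non_bot_labels`, `non_bot_comments`), the signal names other than `target_exists_only`, and the `state_reason` values are assumptions, and the "open with non-bot comment" row is resolved to pending as one possible policy.

```python
def evaluate_create_issue(issue: dict, window_elapsed: bool) -> tuple[str, str, str]:
    """Return (outcome_status, evidence_strength, human_check_signal)."""
    if issue.get("state") == "closed":
        reason = issue.get("state_reason")
        if reason == "completed":
            return ("accepted", "strong", "issue_closed_completed")
        if reason in {"not_planned", "duplicate"}:
            return ("rejected", "strong", "issue_closed_not_accepted")
        # Closed without a usable reason: do not guess.
        return ("unknown", "none", "close_reason_unavailable")
    triaged = (
        issue.get("assignees")
        or issue.get("milestone")
        or issue.get("linked_prs")
        or issue.get("non_bot_labels")
    )
    if triaged:
        return ("accepted", "medium", "issue_triaged")
    if issue.get("non_bot_comments", 0) > 0:
        return ("pending", "medium", "issue_non_bot_interaction")
    if window_elapsed:
        return ("ignored", "weak", "no_non_bot_interaction")
    # The issue exists but nothing else is known yet.
    return ("pending", "weak", "target_exists_only")
```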
update_issue
Human check: did the agent's edit stick?
| Repository state | Outcome | Strength |
| --- | --- | --- |
| Edited fields still equal intended values after window | accepted | medium |
| Later visible non-bot edit preserves core change | accepted | medium |
| Later visible non-bot reverts or replaces the change | rejected | strong |
| Issue deleted or inaccessible | unknown | none |
| No way to compare before and after | unknown | none |
Implementation notes:
- Store field-level before and after values
- For body fields, compare normalized body hash rather than raw Markdown only
- Track title, body, state, labels, changed-by, and changed-at where available
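A normalized body hash could be computed as in the sketch below, so that whitespace-only differences do not register as a reverted edit. The normalization steps (line endings, trailing spaces, runs of blank lines) and the names `normalized_body_hash` and `edit_stuck` are assumptions; a real implementation may need different rules.

```python
import hashlib
import re


def normalized_body_hash(body: str) -> str:
    """Hash a Markdown body after light whitespace normalization."""
    text = body.replace("\r\n", "\n").replace("\r", "\n")
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    # Collapse three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text).strip()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def edit_stuck(after: dict, current: dict) -> bool:
    """True when every field the agent set still has its intended value."""
    return all(current.get(field) == value for field, value in after.items())
```

For example, `edit_stuck({"title": "New title"}, current_issue_fields)` checks only the fields recorded in the stored `after` values and ignores unrelated fields.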
close_issue
Human check: did the issue stay closed, or was it reopened?
| Repository state | Outcome | Strength |
| --- | --- | --- |
| Still closed after evaluation window | accepted | medium |
| Reopened by visible non-bot | rejected | strong |
| Closed then referenced as mistakenly closed | rejected | medium |
| Closed by stale or lifecycle bot only | accepted_weak or ignored | weak |
| Target missing | unknown | none |
Implementation nuance: a close action can be harmful even if the issue remains closed. When possible, inspect timeline comments after closure for reopen attempts or challenge comments.
link_sub_issue
Human check: did the sub-issue relationship stick?
| Repository state | Outcome | Strength |
| --- | --- | --- |
| Parent/sub-issue link still present | accepted | medium |
| Link removed by visible non-bot | rejected | strong |
| Target issue closed as duplicate or invalid | rejected or ignored, depending on context | medium |
| Parent or child inaccessible | unknown | none |
create_discussion
Human check: did the discussion receive meaningful engagement or an accepted answer?
| Repository state | Outcome | Strength |
| --- | --- | --- |
| Answer marked or accepted answer present | accepted | strong |
| Non-bot replies or meaningful reactions | accepted | medium |
| Discussion converted or linked to issue or PR | accepted | medium |
| Closed as duplicate, spam, or off-topic | rejected | strong |
| Exists with no engagement after window | ignored | weak |
update_discussion
Human check mirrors update_issue.
close_discussion
Human check: did the discussion stay closed?
| Repository state | Outcome | Strength |
| --- | --- | --- |
| Still closed after window | accepted | medium |
| Reopened by visible non-bot | rejected | strong |
| Non-bot comments indicate close was wrong | rejected | medium |
| Deleted or inaccessible | unknown | none |
Pull Requests
create_pull_request
Human check: was the PR merged, closed unmerged, reviewed, or still pending?
| Repository state | Outcome | Strength |
| --- | --- | --- |
| Merged | accepted | strong |
| Merged then reverted within durability window | rejected or accepted_then_reverted | strong |
| Approved by visible non-bot but not merged yet | pending with positive signal | medium |
| Open with review activity | pending | medium |
| Closed unmerged | rejected | strong |
| Open with no activity after long window | ignored | weak |
| PR creation fell back to issue | evaluate as create_issue with fallback subtype | varies |
Implementation notes:
- Store PR number, branch, commit SHAs, patch hash, labels, reviewers, and fallback behavior
- For durable research, check for revert commits after merge
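The create_pull_request table translates into a short decision chain. The sketch below assumes a pre-fetched dictionary with hypothetical keys (`merged`, `reverted`, `state`, `approved_by_non_bot`, `review_activity`); the signal names other than `pull_request_merged` and `target_exists_only` are also illustrative, and the issue-fallback row is left out.

```python
def evaluate_create_pull_request(pr: dict, long_window_elapsed: bool) -> tuple[str, str, str]:
    """Return (outcome_status, evidence_strength, human_check_signal)."""
    if pr.get("merged"):
        if pr.get("reverted"):
            # A revert inside the durability window overrides the merge.
            return ("rejected", "strong", "pull_request_merged_then_reverted")
        return ("accepted", "strong", "pull_request_merged")
    if pr.get("state") == "closed":
        return ("rejected", "strong", "pull_request_closed_unmerged")
    if pr.get("approved_by_non_bot"):
        return ("pending", "medium", "pull_request_approved_not_merged")
    if pr.get("review_activity"):
        return ("pending", "medium", "pull_request_under_review")
    if long_window_elapsed:
        return ("ignored", "weak", "no_activity_after_long_window")
    # Existence alone is never acceptance.
    return ("pending", "weak", "target_exists_only")
```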
update_pull_request
Human check mirrors update_issue but stronger when the PR later merges.
| Repository state | Outcome | Strength |
| --- | --- | --- |
| Updated fields remain unchanged | accepted | medium |
| PR merged after update and update remained relevant | accepted | strong |
| Fields reverted or replaced by non-bot | rejected | strong |
| PR closed unmerged after update | often rejected | medium |
| Cannot compare | unknown | none |
close_pull_request
Human check: did the PR stay closed, or was the close undone?
| Repository state | Outcome | Strength |
| --- | --- | --- |
| Still closed unmerged after window | accepted | medium |
| Reopened by visible non-bot | rejected | strong |
| Later merged after reopening | rejected | strong |
| Comment indicates premature close | rejected | medium |
| Target missing | unknown | none |
create_pull_request_review_comment
Human check: did the inline comment lead to resolution, reply, code change, or review action?
| Repository state | Outcome | Strength |
| --- | --- | --- |
| Thread resolved by visible non-bot | accepted | strong |
| Comment replied to by visible non-bot | accepted | medium |
| Follow-up commit touches commented lines or files | accepted | medium |
| Comment marked outdated due to changes and PR merged | accepted | medium |
| Comment deleted or minimized as abuse/off-topic | rejected | strong |
| No reply or no resolution after PR closed or merged | ignored | weak |
| PR still open and thread unresolved | pending | weak or medium |
submit_pull_request_review
Human check: did the submitted review affect the PR lifecycle?
| Repository state | Outcome | Strength |
| --- | --- | --- |
| Review approved and PR merged | accepted | strong |
| Changes requested and later addressed by commits | accepted | medium |
| Review dismissed by visible non-bot | rejected | strong |
| Review contradicted by later human review | rejected or mixed | medium |
| PR closed without addressing review | contextual | medium |
| PR still open with review pending | pending | medium |
reply_to_pull_request_review_comment
Human check: did the reply advance or resolve the thread?
| Repository state | Outcome | Strength |
| --- | --- | --- |
| Thread resolved after reply | accepted | strong |
| Visible non-bot replies positively or continues constructively | accepted | medium |
| Reply is deleted or minimized | rejected | strong |
| Thread remains unresolved and PR closes without action | ignored or rejected, depending on context | weak or medium |
| PR still open | pending | weak |
resolve_pull_request_review_thread
Human check: did the thread stay resolved?
| Repository state | Outcome | Strength |
| --- | --- | --- |
| Thread still resolved after window | accepted | medium |
| Thread reopened by visible non-bot | rejected | strong |
| PR merged after resolution | accepted | strong |
| PR closed unmerged after resolution | contextual | medium |
| Thread inaccessible | unknown | none |
push_to_pull_request_branch
Human check: were the pushed commits accepted through the PR lifecycle?
| Repository state | Outcome | Strength |
| --- | --- | --- |
| PR merged with pushed commits included | accepted | strong |
| Pushed commits later modified but PR merged | accepted or mixed | medium |
| Pushed commits reverted or dropped | rejected | strong |
| PR closed unmerged | rejected | strong |
| PR open with review or CI activity | pending | medium |
| PR open with no activity after long window | ignored | weak |
mark_pull_request_as_ready_for_review
Human check: did readiness lead to actual review or merge activity?
| Repository state | Outcome | Strength |
| --- | --- | --- |
| Reviewed by visible non-bot after mark-ready | accepted | strong |
| Merged after mark-ready | accepted | strong |
| Converted back to draft by visible non-bot | rejected | strong |
| Closed unmerged with no review | rejected or ignored | medium |
| Open with no review after window | pending or ignored | weak |
add_reviewer
Human check: did the requested reviewer review, comment, approve, or remain meaningfully assigned?
| Repository state | Outcome | Strength |
| --- | --- | --- |
| Requested reviewer submitted review | accepted | strong |
| Requested reviewer commented | accepted | medium |
| Reviewer request still pending while PR open | pending | weak |
| Reviewer removed by visible non-bot | rejected | medium or strong |
| PR merged without that reviewer acting | ignored or accepted_weak, depending on policy | weak |
| PR closed unmerged with no reviewer action | ignored | weak |
Implementation notes:
- Store exact requested reviewers and teams
- Team review requests need special handling because any member review may satisfy the team request
- Do not treat PR existence as accepted
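The team-request note above can be handled by expanding each requested team to its members before matching reviews. This sketch assumes team membership has already been fetched into a plain mapping; `reviewer_request_satisfied` and its parameter names are hypothetical.

```python
def reviewer_request_satisfied(
    requested_users: set[str],
    requested_teams: set[str],
    team_members: dict[str, set[str]],
    reviewers_who_acted: set[str],
) -> bool:
    """True when a requested user, or any member of a requested team, reviewed."""
    if requested_users & reviewers_who_acted:
        return True
    # A review from any member is treated as satisfying the team request.
    return any(
        team_members.get(team, set()) & reviewers_who_acted
        for team in requested_teams
    )
```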
assign_to_agent
Human check: did the assigned agent produce useful downstream work?
| Repository state | Outcome | Strength |
| --- | --- | --- |
| Agent-created PR merged | accepted | strong |
| Agent-created PR reviewed positively | pending with positive signal | medium |
| Issue closed as completed due to agent work | accepted | strong |
| Agent PR closed unmerged | rejected | strong |
| Agent assigned but no PR or activity after window | ignored | weak |
| Issue solved by someone else | accepted_other_actor or neutral | medium |
Labels, Metadata, and Planning
add_labels
Human check: did the label classification stick?
| Repository state | Outcome | Strength |
| --- | --- | --- |
| All labels retained after triage window | accepted | medium |
| Some labels retained, some removed | mixed | medium |
| All labels removed by visible non-bot | rejected | strong |
| Target closed or merged while labels retained | accepted | medium |
| No permission to read labels | unknown | none |
Use partial scoring where useful.
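Partial scoring for labels could be as simple as the fraction of intended labels still present. The sketch below is illustrative: `score_label_retention` is a hypothetical helper, and it does not check who removed a label, which the table requires before treating removal as a strong rejection.

```python
def score_label_retention(labels_added: list[str], current_labels: list[str]) -> tuple[str, float]:
    """Return (outcome, retained_fraction) for an add_labels output."""
    intended = set(labels_added)
    if not intended:
        return ("unknown", 0.0)
    fraction = len(intended & set(current_labels)) / len(intended)
    if fraction == 1.0:
        return ("accepted", fraction)
    if fraction == 0.0:
        return ("rejected", fraction)
    return ("mixed", fraction)


print(score_label_retention(["bug", "needs-triage"], ["bug", "docs"]))  # ('mixed', 0.5)
```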
assign_milestone
Human check: did the milestone remain assigned?
| Repository state | Outcome | Strength |
| --- | --- | --- |
| Same milestone still set after window | accepted | medium |
| Milestone changed by visible non-bot | rejected or mixed | medium |
| Milestone removed | rejected | strong |
| Issue or PR completed under milestone | accepted | strong |
| Target missing | unknown | none |
update_project
Human check: did the specific project field update remain?
| Repository state | Outcome | Strength |
| --- | --- | --- |
| Field value still equals intended value | accepted | medium |
| Field later moved forward in workflow, for example Todo to In Progress to Done | accepted | strong |
| Field reverted or changed away by visible non-bot | rejected or mixed | medium |
| Project item removed | rejected | strong |
| Project inaccessible | unknown | none |
set_issue_type
Human check: did the issue type remain set?
| Repository state | Outcome | Strength |
| --- | --- | --- |
| Type still equals intended type | accepted | medium |
| Type changed by visible non-bot | rejected or mixed | medium |
| Type cleared by visible non-bot | rejected | strong |
| Issue completed with type retained | accepted | strong |
| Cannot read issue type | unknown | none |
set_issue_field
Human check: did the specific field value remain?
| Repository state | Outcome | Strength |
| --- | --- | --- |
| Field still equals intended value | accepted | medium |
| Field advanced to later valid workflow state | accepted | strong |
| Field changed away by visible non-bot | rejected or mixed | medium |
| Field removed or item removed | rejected | strong |
| Cannot read field | unknown | none |
Workflows, Security, Releases, Code Scanning
dispatch_workflow
Human check: did the dispatched workflow run successfully?
| Repository state | Outcome | Strength |
| --- | --- | --- |
| Dispatched run completed successfully | accepted | strong |
| Run failed, cancelled, or timed out | rejected | strong |
| Run still in progress | pending | medium |
| No matching run found | unknown or rejected, depending on dispatch API result | none or medium |
| Run succeeded but produced no expected artifact | accepted_weak or mixed | weak |
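The dispatch_workflow rules map naturally onto a workflow run's status and conclusion. This is a sketch under assumptions: the `status` and `conclusion` keys and their values are assumed to follow GitHub's workflow run fields, the signal names are illustrative, and the "no expected artifact" row is not covered.

```python
from typing import Optional


def evaluate_dispatch_workflow(run: Optional[dict]) -> tuple[str, str, str]:
    """Return (outcome_status, evidence_strength, human_check_signal)."""
    if run is None:
        return ("unknown", "none", "no_matching_run")
    if run.get("status") != "completed":
        return ("pending", "medium", "workflow_run_in_progress")
    conclusion = run.get("conclusion")
    if conclusion == "success":
        return ("accepted", "strong", "workflow_run_succeeded")
    if conclusion in {"failure", "cancelled", "timed_out"}:
        return ("rejected", "strong", "workflow_run_failed")
    # Any other conclusion is left unclassified rather than guessed.
    return ("unknown", "none", "workflow_conclusion_unclassified")
```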
autofix_code_scanning_alert
Human check: was the alert actually fixed?
| Repository state | Outcome | Strength |
| --- | --- | --- |
| Alert fixed or closed by code change | accepted | strong |
| Linked PR merged and alert fixed | accepted | strong |
| Alert dismissed as false positive or won't fix by visible non-bot | contextual acceptance for triage, not fix | medium |
| Alert remains open after window | pending or rejected, depending on SLA | weak or medium |
| Autofix PR closed unmerged | rejected | strong |
| Alert disappeared or inaccessible | unknown | none |
Differentiate fixed, triaged, rejected, and pending states rather than flattening them.
create_code_scanning_alert
Human check: was the alert triaged, fixed, or dismissed with reason?
| Repository state | Outcome | Strength |
| --- | --- | --- |
| Alert acknowledged or triaged | accepted | medium |
| Alert fixed | accepted | strong |
| Alert dismissed with reason by visible non-bot | accepted or rejected, depending on expected semantics | medium |
| Alert deleted or invalid | rejected | medium |
| Alert open with no activity | pending or ignored | weak |
update_release
Human check: did the release edit remain?
| Repository state | Outcome | Strength |
| --- | --- | --- |
| Release fields still match intended update | accepted | medium |
| Release published after draft update | accepted | strong |
| Release edited again to revert agent change | rejected | strong |
| Release deleted | rejected | strong |
| Cannot compare fields | unknown | none |
Comment Moderation
hide_comment
Human check: did the hidden or minimized state persist?
| Repository state | Outcome | Strength |
| --- | --- | --- |
| Comment still hidden or minimized | accepted | medium |
| Comment unhidden by visible non-bot | rejected | strong |
| Comment deleted after hiding | often accepted for moderation | medium |
| Target missing | unknown | none |
User Assignment
assign_to_user
Human check: did the assignment stick or result in user action?
| Repository state | Outcome | Strength |
| --- | --- | --- |
| Assigned user comments, reviews, commits, or closes item | accepted | strong |
| Assignment remains while item open | pending | weak |
| User removed by visible non-bot | rejected | medium |
| Item completed while user assigned | accepted | medium |
| Item closed without assignee action | ignored | weak |
unassign_from_user
Human check: did the user remain unassigned?
| Repository state | Outcome | Strength |
| --- | --- | --- |
| User remains unassigned after window | accepted | medium |
| User re-assigned by visible non-bot | rejected | strong |
| Item closed after unassignment | contextual | medium |
| Cannot read assignees | unknown | none |
System Outputs
noop
No human-facing outcome should be evaluated.
    {
      "outcome_status": "skipped",
      "evidence_strength": "none",
      "human_check_signal": "no_action_requested"
    }
missing_tool
No human-facing outcome should be evaluated.
    {
      "outcome_status": "skipped",
      "evidence_strength": "none",
      "human_check_signal": "tool_unavailable"
    }
8. Avoiding The Biggest Measurement Bug
The major measurement bug is:
target object exists -> accepted
That is not a human check. It is only a weak liveness check.
Use this instead:
target object exists -> target_resolved = true
type-specific acceptance signal exists -> accepted
otherwise -> pending / ignored / weak evidence
Example:
    {
      "target_resolved": true,
      "outcome_status": "pending",
      "evidence_strength": "weak",
      "human_check_signal": "target_exists_only"
    }
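A fallback that keeps target existence separate from acceptance might look like the sketch below. The function name `fallback_outcome` is hypothetical; the field names and the `target_exists_only` signal come from the example above.

```python
def fallback_outcome(target_resolved: bool) -> dict:
    """Classify an output when no type-specific acceptance signal was found."""
    if not target_resolved:
        return {
            "target_resolved": False,
            "outcome_status": "unknown",
            "evidence_strength": "none",
            "human_check_signal": "target_not_found_or_inaccessible",
        }
    # The target exists, which is liveness only, never acceptance.
    return {
        "target_resolved": True,
        "outcome_status": "pending",
        "evidence_strength": "weak",
        "human_check_signal": "target_exists_only",
    }
```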
9. Suggested Scoring Model
For research dashboards, compute three acceptance rates.
Strict Acceptance Rate
Only strong evidence.
strict_acceptance_rate =
strong_accepted / evaluable_outputs
Human-Check Acceptance Rate
Strong plus medium evidence.
human_check_acceptance_rate =
(strong_accepted + medium_accepted) / evaluable_outputs
Sticky Artifact Rate
Weak survival evidence only.
sticky_artifact_rate =
weak_accepted_or_target_exists / evaluable_outputs
Never mix these into one unlabeled number.
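The three rates can be computed side by side from normalized outcome records. This sketch makes one assumption that the text does not state: `evaluable_outputs` is taken to exclude `skipped` and `unknown` records. The function name `acceptance_rates` is hypothetical.

```python
def acceptance_rates(outcomes: list[dict]) -> dict[str, float]:
    """Compute strict, human-check, and sticky-artifact rates as separate numbers."""
    # Assumption: skipped and unknown records are not evaluable.
    evaluable = [o for o in outcomes if o["outcome_status"] not in {"skipped", "unknown"}]
    names = ("strict_acceptance_rate", "human_check_acceptance_rate", "sticky_artifact_rate")
    total = len(evaluable)
    if total == 0:
        return dict.fromkeys(names, 0.0)

    def accepted_with(strengths: set[str]) -> int:
        return sum(
            1 for o in evaluable
            if o["outcome_status"] == "accepted" and o["evidence_strength"] in strengths
        )

    sticky = sum(
        1 for o in evaluable
        if o["evidence_strength"] == "weak"
        and (o["outcome_status"] == "accepted"
             or o.get("human_check_signal") == "target_exists_only")
    )
    return {
        "strict_acceptance_rate": accepted_with({"strong"}) / total,
        "human_check_acceptance_rate": accepted_with({"strong", "medium"}) / total,
        "sticky_artifact_rate": sticky / total,
    }
```

Returning the rates under their own keys makes it harder to collapse them into one unlabeled number downstream.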
10. Implementation Architecture
Recommended pipeline:
safe-output execution
↓
write normalized safe_output_event records
↓
outcome collector scheduled job
↓
fetch current GitHub state
↓
apply type-specific evaluator
↓
write outcome records
↓
aggregate dashboard/report
Pseudo-code:
    def evaluate_safe_output(event, now):
        # System outputs have no target, so skip them before any fetch.
        if event.type in {"noop", "missing_tool"}:
            return skipped(event)

        target = fetch_target(event)
        if target is None:
            return unknown(event, reason="target_not_found_or_inaccessible")

        # Without a type-specific rule, report unknown rather than accepted.
        evaluator = EVALUATORS.get(event.type)
        if evaluator is None:
            return unknown(event, reason="no_type_specific_evaluator")

        return evaluator(event, target, now)
Each evaluator should return something like:
    Outcome(
        status="accepted",
        strength="strong",
        signal="pull_request_merged",
        confidence="high",
        details={...}
    )
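A runnable variant of the dispatch step, using a decorator-based registry, could look like this. It is a sketch only: events and targets are plain dictionaries, the `register` decorator and the single registered evaluator are hypothetical, and the `now` argument from the pseudo-code is dropped for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Outcome:
    status: str
    strength: str
    signal: str
    confidence: str = "low"
    details: dict = field(default_factory=dict)


Evaluator = Callable[[dict, dict], Outcome]
EVALUATORS: dict[str, Evaluator] = {}

# Signals for system outputs, as listed under System Outputs.
SKIPPED_SIGNALS = {"noop": "no_action_requested", "missing_tool": "tool_unavailable"}


def register(safe_output_type: str) -> Callable[[Evaluator], Evaluator]:
    """Register a type-specific evaluator under its safe output type."""
    def decorator(fn: Evaluator) -> Evaluator:
        EVALUATORS[safe_output_type] = fn
        return fn
    return decorator


@register("create_pull_request")
def _evaluate_pull_request(event: dict, target: dict) -> Outcome:
    if target.get("merged"):
        return Outcome("accepted", "strong", "pull_request_merged", "high")
    return Outcome("pending", "weak", "target_exists_only")


def evaluate_safe_output(event: dict, target: Optional[dict]) -> Outcome:
    if event["type"] in SKIPPED_SIGNALS:
        return Outcome("skipped", "none", SKIPPED_SIGNALS[event["type"]])
    if target is None:
        return Outcome("unknown", "none", "target_not_found_or_inaccessible")
    evaluator = EVALUATORS.get(event["type"])
    if evaluator is None:
        return Outcome("unknown", "none", "no_type_specific_evaluator")
    return evaluator(event, target)
```

Outputs without a registered evaluator surface as unknown, which would feed a counter such as `missing_type_specific_rule_count` instead of inflating acceptance.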
11. Minimal Evaluator Interface
    class OutcomeEvaluator:
        # The safe output type this evaluator handles, for example "create_pull_request".
        safe_output_type: str

        def fetch(self, event: SafeOutputEvent) -> TargetState:
            ...

        def evaluate(
            self,
            event: SafeOutputEvent,
            state: TargetState,
            window: EvaluationWindow,
        ) -> Outcome:
            ...

        def required_event_fields(self) -> list[str]:
            ...

        def required_github_scopes(self) -> list[str]:
            ...
12. Dashboard Fields
For implementation and research, report:
total_safe_outputs
evaluable_outputs
accepted_strong
accepted_medium
accepted_weak
rejected
pending
ignored
unknown
skipped
fallback_exists_only_count
missing_type_specific_rule_count
durable_reversal_count
And by type:
safe_output_type
count
strict_acceptance_rate
human_check_acceptance_rate
sticky_artifact_rate
rejection_rate
pending_rate
unknown_rate
median_time_to_acceptance
13. Recommended Implementation Priority
Implement in this order:
1. create_pull_request
2. push_to_pull_request_branch
3. create_issue
4. add_comment
5. add_labels
6. update_issue
7. update_pull_request
8. add_reviewer
9. submit_pull_request_review
10. dispatch_workflow
11. autofix_code_scanning_alert
12. project, milestone, and field metadata outputs
13. discussions
14. moderation outputs
15. system outputs
Why: PRs, issues, comments, labels, and workflows give the highest research value and the clearest human-check signals.
14. Final Implementation Definition
Use this definition in the implementation spec:
Accepted Outcome: A type-specific, post-output repository state that provides observable evidence a human repository observer would reasonably interpret as the safe output being useful, correct, acted on, or intentionally retained.
And this warning:
Generic target existence must not be counted as accepted outcome evidence except as weak evidence under a separate target_exists_only signal.
That provides both operational compatibility and research rigor.
Implement Safe Output Outcome Evaluation with evidence-strength classification
Implement safe-output outcome evaluation based on observable repository state rather than workflow self-assessment or artifact survival.
Problem
Current outcome evaluation is too weak when it treats object existence as acceptance. That inflates acceptance rates and makes workflow effectiveness metrics misleading.
We need outcome evaluation that answers:
Goals
outcome_status and evidence_strength
Non-Goals
Current State
Current outcome evaluation has a mix of dedicated evaluators, placeholder evaluators, and fallback behavior.
target_exists_only fallback behavior
Deliverables
Success Metrics
Dependencies
Planner Guidance
Use this issue as a decomposition and sequencing source, not as a single implementation task.
Recommended Execution Order
Recommended PR Boundaries
Safe Deferrals
Acceptance Criteria
Tracking Checklist
Normalized Outcome Model
Outcome Status
accepted | rejected | pending | ignored | skipped | unknown
Evidence Strength
strong | medium | weak | none
Normalized Shape
    {
      "safe_output_type": "create_pull_request",
      "outcome_status": "accepted",
      "evidence_strength": "strong",
      "human_check_signal": "pull_request_merged",
      "target_resolved": true,
      "confidence": "high"
    }
Key Rule
Target existence alone must not count as accepted outcome evidence.
If the only observable fact is that the target still exists, classify it separately:
    {
      "outcome_status": "pending",
      "evidence_strength": "weak",
      "human_check_signal": "target_exists_only",
      "target_resolved": true
    }
Execution-Time Metadata Requirements
Mutable operations must persist enough state at execution time for later comparison.
Required Examples
- update_issue
- update_pull_request
- add_reviewer
Why This Matters
Without execution-time metadata, mutable operations cannot be evaluated as retained, reverted, replaced, or acted on. They fall back to weak survival evidence, which is not good enough for meaningful acceptance metrics.
Implementation Plan
Rollout Order
Ordering Constraints
Suggested Split Into Follow-On Issues Or PRs
Full Detailed Design Reference
This section preserves the full implementation guidance and is intentionally more detailed than the main issue body.
Core Principle
For implementation, accepted outcomes should be defined as human-checkable evidence and explicitly separated from weaker "artifact still exists" signals.
Core implementation principle:
This fits the underlying safe-output design: safe outputs are validated operations executed outside the agent's write-permission context, while outcomes are meant to evaluate what happened after those operations based on repository state rather than the workflow's own self-assessment.
1. Implementation Goal
Build an outcome evaluator that answers, for every safe output item:
The most important implementation decision is to avoid treating all accepted statuses as equally strong. A PR merged by a maintainer is much stronger evidence than "the PR still exists."
Two layers should be implemented:
This preserves compatibility with outcome reporting while keeping the measurement honest.
2. Recommended Outcome States
Use these states consistently across all safe output types.
acceptedrejectedpendingignoredskippednoop,missing_tool, or cancelled outputs.unknownAttach evidence strength separately:
strongmediumweaknone3. Universal Evaluation Schema
Each evaluated output should produce a normalized record like this:
{ "safe_output_id": "run-id:item-index", "safe_output_type": "create_pull_request", "target": { "repo": "owner/repo", "kind": "pull_request", "number": 123, "node_id": "..." }, "created_at": "2026-05-26T10:00:00Z", "evaluated_at": "2026-05-27T10:00:00Z", "evaluation_window_hours": 24, "outcome_status": "accepted", "evidence_strength": "strong", "human_check_signal": "pull_request_merged", "bot_aware": true, "actor_summary": { "visible_non_bot_actor_count": 2, "bot_actor_count": 1 }, "details": { "merged": true, "merged_by_type": "User", "closed": true, "reverted": false }, "confidence": "high", "notes": "PR was merged by a visible non-bot actor." }The key field is
human_check_signal. That is the concrete thing a human would check.4. Data Collection Requirements
Enough metadata must be persisted at safe-output execution time to evaluate outcomes later.
For every safe output, store a record like:
{ "type": "add_labels", "run_id": "...", "workflow_name": "...", "repo": "owner/repo", "actor": "github-actions[bot]", "created_at": "...", "target_url": "...", "target_node_id": "...", "target_number": 123, "payload_hash": "...", "expected_state": { "labels_added": ["bug", "needs-triage"] } }For update operations, store the before and after values. Without that, later evaluation cannot determine whether a change retained or replaced the intended result.
Example for
update_issue:{ "type": "update_issue", "target": { "repo": "owner/repo", "issue_number": 42 }, "before": { "title": "Old title", "body_hash": "abc123" }, "after": { "title": "New title", "body_hash": "def456" } }For PR and code-changing operations, store commit SHAs, branch name, patch hash, PR number, and base/head refs.
5. Evaluation Window
Use multiple windows rather than one hard cutoff.
Recommended:
Why: some outputs are accepted quickly, like comments or labels. Others, especially PRs, may take days. A PR open after 24 hours should usually be
pending, notignoredorrejected.For research reporting, include both:
6. Bot-Aware Human Check
Do not assume every visible action is human. Visible actor identity is not perfect provenance: a non-bot actor may still be AI-assisted, and hidden authorship cannot be fully observed. Implement this as visible non-bot activity, not definitely human activity.
Recommended actor categories:
A signal is stronger when it involves a visible non-bot actor other than the original workflow actor.
7. Per-Safe-Output Implementation Rules
Issues and Discussions
create_issueHuman check: did the issue receive meaningful triage, assignment, linkage, closure, or completion?
acceptedacceptedpendingoraccepted_medium, depending policyrejectedignoredunknownorrejected, depending audit availabilityImplementation notes:
update_issueHuman check: did the agent's edit stick?
acceptedacceptedrejectedunknownunknownImplementation notes:
close_issueHuman check: did the issue stay closed, or was it reopened?
acceptedrejectedrejectedaccepted_weakorignoredunknownImplementation nuance: a close action can be harmful even if the issue remains closed. When possible, inspect timeline comments after closure for reopen attempts or challenge comments.
link_sub_issueHuman check: did the sub-issue relationship stick?
acceptedrejectedrejectedorignored, depending contextunknowncreate_discussionHuman check: did the discussion receive meaningful engagement or an accepted answer?
acceptedacceptedacceptedrejectedignoredupdate_discussionHuman check mirrors
update_issue.close_discussionHuman check: did the discussion stay closed?
acceptedrejectedrejectedunknownPull Requests
create_pull_request
Human check: was the PR merged, closed unmerged, reviewed, or still pending?
Outcomes: accepted | rejected or accepted_then_reverted | pending with positive signal | pending | rejected | ignored | create_issue with fallback subtype
Implementation notes:
update_pull_request
Human check mirrors update_issue but is stronger when the PR later merges.
Outcomes: accepted | accepted | rejected | rejected | unknown
close_pull_request
Human check: did the PR stay closed, or was the close undone?
Outcomes: accepted | rejected | rejected | rejected | unknown
create_pull_request_review_comment
Human check: did the inline comment lead to resolution, reply, code change, or review action?
Outcomes: accepted | accepted | accepted | accepted | rejected | ignored | pending
submit_pull_request_review
Human check: did the submitted review affect the PR lifecycle?
Outcomes: accepted | accepted | rejected | rejected or mixed | pending
reply_to_pull_request_review_comment
Human check: did the reply advance or resolve the thread?
Outcomes: accepted | accepted | rejected | ignored or rejected, depending on context | pending
resolve_pull_request_review_thread
Human check: did the thread stay resolved?
Outcomes: accepted | rejected | accepted | unknown
push_to_pull_request_branch
Human check: were the pushed commits accepted through the PR lifecycle?
Outcomes: accepted | accepted or mixed | rejected | rejected | pending | ignored
mark_pull_request_as_ready_for_review
Human check: did readiness lead to actual review or merge activity?
Outcomes: accepted | accepted | rejected | rejected or ignored | pending or ignored
add_reviewer
Human check: did the requested reviewer review, comment, approve, or remain meaningfully assigned?
Outcomes: accepted | accepted | pending | rejected | ignored or accepted_weak, depending on policy | ignored
Implementation notes:
assign_to_agent
Human check: did the assigned agent produce useful downstream work?
Outcomes: accepted | pending with positive signal | accepted | rejected | ignored | accepted_other_actor or neutral
Labels, Metadata, and Planning
add_labels
Human check: did the label classification stick?
Outcomes: accepted | mixed | rejected | accepted | unknown
Use partial scoring where useful.
assign_milestone
Human check: did the milestone remain assigned?
Outcomes: accepted | rejected or mixed | rejected | accepted | unknown
update_project
Human check: did the specific project field update remain?
Outcomes: accepted | accepted | rejected or mixed | rejected | unknown
set_issue_type
Human check: did the issue type remain set?
Outcomes: accepted | rejected or mixed | rejected | accepted | unknown
set_issue_field
Human check: did the specific field value remain?
Outcomes: accepted | accepted | rejected or mixed | rejected | unknown
Workflows, Security, Releases, Code Scanning
dispatch_workflow
Human check: did the dispatched workflow run successfully?
Outcomes: accepted | rejected | pending | unknown or rejected, depending on dispatch API result | accepted_weak or mixed
autofix_code_scanning_alert
Human check: was the alert actually fixed?
Outcomes: accepted | accepted | pending or rejected, depending on SLA | rejected | unknown
Differentiate fixed, triaged, rejected, and pending states rather than flattening them.
create_code_scanning_alert
Human check: was the alert triaged, fixed, or dismissed with reason?
Outcomes: accepted | accepted | accepted or rejected, depending on expected semantics | rejected | pending or ignored
update_release
Human check: did the release edit remain?
Outcomes: accepted | accepted | rejected | rejected | unknown
Comment Moderation
hide_comment
Human check: did the hidden or minimized state persist?
Outcomes: accepted | rejected | accepted for moderation | unknown
User Assignment
assign_to_user
Human check: did the assignment stick or result in user action?
Outcomes: accepted | pending | rejected | accepted | ignored
unassign_from_user
Human check: did the user remain unassigned?
Outcomes: accepted | rejected | unknown
System Outputs
noop
No human-facing outcome should be evaluated.
{ "outcome_status": "skipped", "evidence_strength": "none", "human_check_signal": "no_action_requested" }
missing_tool
No human-facing outcome should be evaluated.
{ "outcome_status": "skipped", "evidence_strength": "none", "human_check_signal": "tool_unavailable" }
8. Avoiding The Biggest Measurement Bug
The major measurement bug is:
That is not a human check. It is only a weak liveness check.
Use this instead:
Example:
{ "target_resolved": true, "outcome_status": "pending", "evidence_strength": "weak", "human_check_signal": "target_exists_only" }
9. Suggested Scoring Model
For research dashboards, compute three acceptance rates.
Strict Acceptance Rate
Only strong evidence.
Human-Check Acceptance Rate
Strong plus medium evidence.
Sticky Artifact Rate
Weak survival evidence only.
Never mix these into one unlabeled number.
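A minimal sketch of computing the three rates separately. The source does not pin down the denominator or exact membership rules, so this assumes the following: `skipped` items are excluded from the denominator, the strict and human-check rates count only `accepted` items, and the sticky artifact rate counts any item whose evidence is `weak`. The function name and the input shape are hypothetical.

```python
from typing import Iterable, Mapping, Optional


def acceptance_rates(results: Iterable[Mapping[str, str]]) -> dict[str, Optional[float]]:
    """Return the three rates as separately labeled numbers.

    Each result is assumed to carry `outcome_status` and `evidence_strength`.
    """
    # Assumption: skipped items (noop, missing_tool) are not measured at all.
    measured = [r for r in results if r["outcome_status"] != "skipped"]
    total = len(measured)
    if total == 0:
        return {"strict": None, "human_check": None, "sticky_artifact": None}

    def accepted_with(strengths: set[str]) -> int:
        return sum(
            1
            for r in measured
            if r["outcome_status"] == "accepted" and r["evidence_strength"] in strengths
        )

    # Assumption: "sticky" means weak survival evidence, whatever the status.
    sticky = sum(1 for r in measured if r["evidence_strength"] == "weak")

    return {
        "strict": accepted_with({"strong"}) / total,
        "human_check": accepted_with({"strong", "medium"}) / total,
        "sticky_artifact": sticky / total,
    }
```

Returning a dictionary with three named keys, rather than one float, makes it harder to report a blended number by accident.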
10. Implementation Architecture
Recommended pipeline:
Pseudo-code:
Each evaluator should return something like:
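One possible shape for that return value, as a hedged sketch. The field names `target_resolved`, `outcome_status`, `evidence_strength`, and `human_check_signal` come from the JSON examples in this document. The `safe_output_type` field, the class name, and the set of evidence strengths (`strong`, `medium`, `weak`, `none`) are assumptions inferred for illustration.

```python
from dataclasses import asdict, dataclass

# Status values as listed for outcome_status in this document.
OUTCOME_STATUSES = {"accepted", "rejected", "pending", "ignored", "skipped", "unknown"}
# Assumed set, inferred from the values used in the examples and scoring model.
EVIDENCE_STRENGTHS = {"strong", "medium", "weak", "none"}


@dataclass(frozen=True)
class OutcomeResult:
    safe_output_type: str    # hypothetical field, e.g. "create_issue"
    target_resolved: bool    # whether the affected object could be located
    outcome_status: str
    evidence_strength: str
    human_check_signal: str  # short machine-readable reason

    def __post_init__(self) -> None:
        # Reject values outside the two layers so typos fail loudly.
        if self.outcome_status not in OUTCOME_STATUSES:
            raise ValueError(f"unknown outcome_status: {self.outcome_status!r}")
        if self.evidence_strength not in EVIDENCE_STRENGTHS:
            raise ValueError(f"unknown evidence_strength: {self.evidence_strength!r}")


result = OutcomeResult("create_issue", True, "pending", "weak", "target_exists_only")
record = asdict(result)  # plain dict, ready for JSON serialization
```

Keeping status and strength as separate validated fields stops a weak survival signal from being recorded as a bare `accepted`.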
11. Minimal Evaluator Interface
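As a rough sketch of what such an interface could look like, the following uses a `typing.Protocol` with a single `evaluate` method, plus a fallback that treats object existence as a weak liveness signal rather than acceptance. All names here (`OutcomeEvaluator`, `EVALUATORS`, `evaluate_item`, the `type` and `target_exists` keys, and the `target_not_found` signal) are hypothetical and chosen for illustration.

```python
from typing import Any, Mapping, Protocol


class OutcomeEvaluator(Protocol):
    def evaluate(self, item: Mapping[str, Any], repo_state: Mapping[str, Any]) -> dict:
        """Return outcome fields for one safe output item."""
        ...


class NoopEvaluator:
    def evaluate(self, item: Mapping[str, Any], repo_state: Mapping[str, Any]) -> dict:
        # System outputs have no human-facing outcome to evaluate.
        return {
            "outcome_status": "skipped",
            "evidence_strength": "none",
            "human_check_signal": "no_action_requested",
        }


class FallbackEvaluator:
    """Used when no dedicated evaluator exists for a safe output type."""

    def evaluate(self, item: Mapping[str, Any], repo_state: Mapping[str, Any]) -> dict:
        if repo_state.get("target_exists"):
            # Existence alone is liveness, so report pending with weak evidence.
            return {
                "target_resolved": True,
                "outcome_status": "pending",
                "evidence_strength": "weak",
                "human_check_signal": "target_exists_only",
            }
        return {
            "target_resolved": False,
            "outcome_status": "unknown",
            "evidence_strength": "none",
            "human_check_signal": "target_not_found",  # hypothetical signal name
        }


# Registry of dedicated evaluators, keyed by safe output type.
EVALUATORS: dict[str, OutcomeEvaluator] = {"noop": NoopEvaluator()}


def evaluate_item(item: Mapping[str, Any], repo_state: Mapping[str, Any]) -> dict:
    evaluator = EVALUATORS.get(item["type"], FallbackEvaluator())
    return evaluator.evaluate(item, repo_state)
```

Dedicated evaluators for specific safe output types would be registered in `EVALUATORS`; anything unregistered falls through to the weak-evidence fallback instead of being counted as accepted.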
12. Dashboard Fields
For implementation and research, report:
And by type:
13. Recommended Implementation Priority
Implement in this order:
- create_pull_request
- push_to_pull_request_branch
- create_issue
- add_comment
- add_labels
- update_issue
- update_pull_request
- add_reviewer
- submit_pull_request_review
- dispatch_workflow
- autofix_code_scanning_alert
Why: PRs, issues, comments, labels, and workflows give the highest research value and the clearest human-check signals.
14. Final Implementation Definition
Use this definition in the implementation spec:
And this warning:
That provides both operational compatibility and research rigor.