Improve Safe Output Outcome Evaluation

# Implement Safe Output Outcome Evaluation with evidence-strength classification

Implement safe-output outcome evaluation based on observable repository state rather than workflow self-assessment or artifact survival.

## Problem

Current outcome evaluation is too weak when it treats object existence as acceptance. That inflates acceptance rates and makes workflow effectiveness metrics misleading.

We need outcome evaluation that answers:

1. What safe output was produced?
2. What GitHub object did it affect?
3. What happened to that object afterward?
4. Would a human observer classify that as accepted, rejected, pending, ignored, skipped, or unknown?
5. How strong is the evidence?

## Goals

- Define a normalized outcome model with `outcome_status` and `evidence_strength`
- Evaluate outcomes from repository state, not workflow self-reporting
- Persist enough execution-time metadata to evaluate mutable operations later
- Implement dedicated evaluators for major safe output types
- Keep weak survival evidence separate from real acceptance
- Emit consistent JSONL and telemetry fields for downstream reporting

## Non-Goals

- Do not build a weighted business-value model
- Do not try to infer hidden human vs AI provenance
- Do not collapse strong, medium, and weak acceptance into one unlabeled metric
- Do not require perfect coverage before shipping the framework

## Current State

Current outcome evaluation has a mix of dedicated evaluators, placeholder evaluators, and fallback behavior.

- Some output types already have dedicated outcome logic
- Some mutable operations require execution-time metadata capture before they can be evaluated correctly
- Several safe output types still rely on weak `target_exists_only` fallback behavior
- Some discussion, review-thread, workflow, and moderation types are still effectively unimplemented from a human-check perspective

## Deliverables

- a normalized outcome model in reports and JSONL output
- explicit weak-evidence fallback behavior that does not count as accepted
- execution-time metadata persistence for mutable operations
- dedicated evaluators for the targeted rollout slices
- updated documentation and formal outcome-evaluation specification
- focused tests for each implemented evaluator family

## Success Metrics

- no existence-only fallback is counted as accepted
- the number of types with dedicated evaluators increases
- weak fallback outcomes are visibly separated from accepted outcomes
- mutable operations can be evaluated as retained, reverted, replaced, or acted on
- focused tests exist for each delivered evaluator slice

## Dependencies

- mutable update evaluators depend on before/after metadata persistence
- review lifecycle evaluators depend on requested-state or thread metadata capture
- workflow and code-scanning evaluators depend on correlation metadata
- documentation and reporting should be updated after each evaluator slice lands

## Planner Guidance

Use this issue as a decomposition and sequencing source, not as a single implementation task.

### Recommended Execution Order

1. shared outcome model and fallback semantics
2. execution-time metadata persistence
3. mutable update evaluators
4. review lifecycle evaluators
5. issue and comment evaluators
6. workflow and code-scanning evaluators
7. metadata, discussion, project, and moderation evaluators

### Recommended PR Boundaries

1. shared model and fallback semantics
2. metadata persistence for mutable operations
3. retained-update evaluators
4. review lifecycle evaluators
5. issue and comment evaluators
6. workflow and code-scanning evaluators
7. metadata, discussion, project, and moderation evaluators

### Safe Deferrals

- scoring-model rollups
- richer dashboard aggregation
- strong/medium/weak telemetry summaries beyond core item output
- long-tail evaluator families after the first rollout slices land

## Acceptance Criteria

- Dedicated evaluators exist for the first rollout slice
- Generic fallback does not classify existence-only as accepted
- Mutable operations persist comparison metadata at execution time
- JSONL and outcome reports include normalized outcome fields
- Docs reflect the new semantics
- Focused tests exist for each implemented evaluator slice

## Tracking Checklist

- [ ] Shared outcome model and fallback semantics
- [ ] Manifest and artifact metadata for mutable operations
- [ ] Mutable update evaluators
- [ ] Review lifecycle evaluators
- [ ] Issue and comment interaction evaluators
- [ ] Workflow and code scanning evaluators
- [ ] Metadata, project, discussion, and moderation evaluators
- [ ] Docs updated after each slice
- [ ] Telemetry fields aligned with evidence strength

<details>
<summary>Normalized Outcome Model</summary>

## Outcome Status

- `accepted`
- `rejected`
- `pending`
- `ignored`
- `skipped`
- `unknown`

## Evidence Strength

- `strong`
- `medium`
- `weak`
- `none`

## Normalized Shape

```json
{
  "safe_output_type": "create_pull_request",
  "outcome_status": "accepted",
  "evidence_strength": "strong",
  "human_check_signal": "pull_request_merged",
  "target_resolved": true,
  "confidence": "high"
}
```

## Key Rule

Target existence alone must not count as accepted outcome evidence.

If the only observable fact is that the target still exists, classify it separately:

```json
{
  "outcome_status": "pending",
  "evidence_strength": "weak",
  "human_check_signal": "target_exists_only",
  "target_resolved": true
}
```

</details>

<details>
<summary>Execution-Time Metadata Requirements</summary>

Mutable operations must persist enough state at execution time for later comparison.

## Required Examples

- `update_issue`
  - before/after title
  - normalized body hash
  - labels
  - assignees
  - state
- `update_pull_request`
  - before/after title
  - normalized body hash
  - base
  - draft
  - head SHA when relevant
- `add_reviewer`
  - requested reviewers
  - requested teams
- workflow and code-scanning operations
  - correlation keys
  - expected target state

## Why This Matters

Without execution-time metadata, mutable operations cannot be evaluated as retained, reverted, replaced, or acted on. They fall back to weak survival evidence, which is not good enough for meaningful acceptance metrics.

</details>

<details>
<summary>Implementation Plan</summary>

## Rollout Order

1. Shared outcome model and fallback semantics
2. Mutable update evaluators
3. Review lifecycle evaluators
4. Issue/comment evaluators
5. Workflow and code-scanning evaluators
6. Metadata/project evaluators
7. Discussions and moderation evaluators

## Ordering Constraints

- Do not implement mutable update evaluators without first persisting the metadata they need
- Do not treat target existence as acceptance while evaluator coverage is incomplete
- Prefer implementing evaluator families where the runtime data dependencies are already available
- Keep docs and output schema aligned with the implementation after each slice

## Suggested Split Into Follow-On Issues Or PRs

1. Shared outcome model and fallback semantics
2. Persist execution-time metadata for mutable safe outputs
3. Implement retained-update evaluators for mutable outputs
4. Implement review lifecycle evaluators
5. Implement issue and comment interaction evaluators
6. Implement workflow and code scanning evaluators
7. Implement metadata, project, discussion, and moderation evaluators

</details>

<details>
<summary>Full Detailed Design Reference</summary>

This section preserves the full implementation guidance and is intentionally more detailed than the main issue body.

---

### Core Principle

For implementation, accepted outcomes should be defined as human-checkable evidence and explicitly separated from weaker "artifact still exists" signals.

Core implementation principle:

> A safe output is the agent's proposed action. An accepted outcome is the later repository state that a human reviewer would reasonably inspect to decide whether that action was useful, correct, or intentionally retained.

This fits the underlying safe-output design: safe outputs are validated operations executed outside the agent's write-permission context, while outcomes are meant to evaluate what happened after those operations based on repository state rather than the workflow's own self-assessment.

---

### 1. Implementation Goal

Build an outcome evaluator that answers, for every safe output item:

1. What did the agent safely output?
2. What GitHub object did it affect or create?
3. What happened to that object afterward?
4. Would a human checking the repo treat that as accepted, rejected, pending, ignored, skipped, or unknown?
5. How strong is that evidence?

The most important implementation decision is to avoid treating all accepted statuses as equally strong. A PR merged by a maintainer is much stronger evidence than "the PR still exists."

Two layers should be implemented:

```text
outcome_status:
  accepted | rejected | pending | ignored | skipped | unknown

evidence_strength:
  strong | medium | weak | none
```

This preserves compatibility with outcome reporting while keeping the measurement honest.
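
The two layers can be sketched as a pair of enums plus a small record type. This is a minimal illustration, not the project's actual model: the class names `OutcomeStatus`, `EvidenceStrength`, and `Outcome`, and the `to_record` helper, are assumptions introduced here. Only the string values and the three output field names come from this issue.

```python
from dataclasses import dataclass
from enum import Enum


class OutcomeStatus(str, Enum):
    ACCEPTED = "accepted"
    REJECTED = "rejected"
    PENDING = "pending"
    IGNORED = "ignored"
    SKIPPED = "skipped"
    UNKNOWN = "unknown"


class EvidenceStrength(str, Enum):
    STRONG = "strong"
    MEDIUM = "medium"
    WEAK = "weak"
    NONE = "none"


@dataclass(frozen=True)
class Outcome:
    """Hypothetical record type: status and strength stay separate fields."""

    status: OutcomeStatus
    strength: EvidenceStrength
    signal: str

    def to_record(self) -> dict:
        # Emit the normalized field names used in the JSON examples.
        return {
            "outcome_status": self.status.value,
            "evidence_strength": self.strength.value,
            "human_check_signal": self.signal,
        }
```

Keeping status and strength as separate typed fields makes it harder for a report to collapse "accepted with weak evidence" into plain "accepted".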

---

### 2. Recommended Outcome States

Use these states consistently across all safe output types.

| State | Meaning |
| --- | --- |
| `accepted` | Observable repository state suggests the output was useful, correct, or intentionally retained. |
| `rejected` | Observable repository state suggests the output was undone, removed, closed as invalid, reverted, or contradicted. |
| `pending` | The output is still in flight; there has not been enough time or activity to judge. |
| `ignored` | The output exists but received no meaningful human or repository response within the evaluation window. |
| `skipped` | No human-facing outcome should be evaluated, for example `noop`, `missing_tool`, or cancelled outputs. |
| `unknown` | Required target object or audit data could not be fetched. |

Attach evidence strength separately:

| Evidence strength | Meaning |
| --- | --- |
| `strong` | Direct human or repository acceptance signal: merged PR, resolved review, workflow success, human reply, retained metadata after triage. |
| `medium` | Indirect but meaningful signal: issue triaged, label retained, milestone retained, linked issue retained. |
| `weak` | Artifact merely still exists or target still exists. |
| `none` | No measurable evidence. |

---

### 3. Universal Evaluation Schema

Each evaluated output should produce a normalized record like this:

```json
{
  "safe_output_id": "run-id:item-index",
  "safe_output_type": "create_pull_request",
  "target": {
    "repo": "owner/repo",
    "kind": "pull_request",
    "number": 123,
    "node_id": "..."
  },
  "created_at": "2026-05-26T10:00:00Z",
  "evaluated_at": "2026-05-27T10:00:00Z",
  "evaluation_window_hours": 24,
  "outcome_status": "accepted",
  "evidence_strength": "strong",
  "human_check_signal": "pull_request_merged",
  "bot_aware": true,
  "actor_summary": {
    "visible_non_bot_actor_count": 2,
    "bot_actor_count": 1
  },
  "details": {
    "merged": true,
    "merged_by_type": "User",
    "closed": true,
    "reverted": false
  },
  "confidence": "high",
  "notes": "PR was merged by a visible non-bot actor."
}
```

The key field is `human_check_signal`. That is the concrete thing a human would check.

---

### 4. Data Collection Requirements

Enough metadata must be persisted at safe-output execution time to evaluate outcomes later.

For every safe output, store a record like:

```json
{
  "type": "add_labels",
  "run_id": "...",
  "workflow_name": "...",
  "repo": "owner/repo",
  "actor": "github-actions[bot]",
  "created_at": "...",
  "target_url": "...",
  "target_node_id": "...",
  "target_number": 123,
  "payload_hash": "...",
  "expected_state": {
    "labels_added": ["bug", "needs-triage"]
  }
}
```

For update operations, store the before and after values. Without them, later evaluation cannot determine whether the intended change was retained, reverted, or replaced.

Example for `update_issue`:

```json
{
  "type": "update_issue",
  "target": {
    "repo": "owner/repo",
    "issue_number": 42
  },
  "before": {
    "title": "Old title",
    "body_hash": "abc123"
  },
  "after": {
    "title": "New title",
    "body_hash": "def456"
  }
}
```
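
One way to use such a record later is sketched below, under stated assumptions. The normalization policy (unify line endings, strip trailing whitespace, trim the ends) is an illustrative choice, not a rule defined in this issue, and `normalized_body_hash` and `classify_field` are hypothetical helper names.

```python
import hashlib


def normalized_body_hash(body: str) -> str:
    """Hash a Markdown body after light normalization (assumed policy)."""
    # Unify line endings, drop trailing whitespace per line, trim both ends.
    lines = body.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    normalized = "\n".join(line.rstrip() for line in lines).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def classify_field(before, after, current) -> str:
    """Compare the current value with the persisted before/after values."""
    if current == after:
        return "retained"   # the agent's value is still in place
    if current == before:
        return "reverted"   # someone restored the previous value
    return "replaced"       # a third value superseded both
```

For body fields, pass the stored `body_hash` values as `before` and `after`, and the hash of the freshly fetched body as `current`.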

For PR and code-changing operations, store commit SHAs, branch name, patch hash, PR number, and base/head refs.

---

### 5. Evaluation Window

Use multiple windows rather than one hard cutoff.

Recommended:

```text
T+1h: early signal
T+24h: primary signal
T+7d: durable signal
T+30d: long-term or revert signal for code changes
```

Why: some outputs are accepted quickly, like comments or labels. Others, especially PRs, may take days. A PR that is still open after 24 hours should usually be `pending`, not `ignored` or `rejected`.

For research reporting, include all three:

```text
accepted_at_24h
accepted_at_7d
accepted_durable_at_30d
```
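
A minimal sketch of the window check follows. The window lengths come from the list above; the dictionary keys and the `elapsed_windows` helper are names assumed for illustration.

```python
from datetime import datetime, timedelta, timezone

# Window lengths from the recommendation above; the key names are assumed.
WINDOWS = {
    "early": timedelta(hours=1),
    "primary": timedelta(hours=24),
    "durable": timedelta(days=7),
    "long_term": timedelta(days=30),
}


def elapsed_windows(created_at: datetime, now: datetime) -> list[str]:
    """Return the names of all windows that have fully elapsed."""
    age = now - created_at
    return [name for name, length in WINDOWS.items() if age >= length]


created = datetime(2026, 5, 26, 10, 0, tzinfo=timezone.utc)
print(elapsed_windows(created, created + timedelta(days=8)))
```

An evaluator can then emit one outcome per elapsed window, so that a PR judged `pending` at the primary window can still become `accepted` at the durable one.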

---

### 6. Bot-Aware Human Check

Do not assume every visible action is human. Visible actor identity is not perfect provenance: a non-bot actor may still be AI-assisted, and hidden authorship cannot be fully observed. Implement this as visible non-bot activity, not definitely human activity.

Recommended actor categories:

```text
bot_actor
visible_non_bot_actor
same_workflow_actor
unknown_actor
system_actor
```

A signal is stronger when it involves a visible non-bot actor other than the original workflow actor.
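
The categories could be assigned with a small classifier like the one below. This is a sketch under assumptions: it relies on the `type` value GitHub reports for an account (`"User"` or `"Bot"`) and on the `[bot]` login suffix, and it treats the set of system logins as caller-supplied configuration. The function name and parameters are hypothetical.

```python
from typing import Optional


def classify_actor(
    login: Optional[str],
    user_type: Optional[str],
    workflow_actor: str,
    system_logins: frozenset = frozenset(),
) -> str:
    """Map a visible actor to one of the recommended categories."""
    if not login:
        return "unknown_actor"
    if login in system_logins:
        return "system_actor"
    # Check the originating workflow identity before the generic bot test,
    # because the workflow actor is usually a bot account itself.
    if login == workflow_actor:
        return "same_workflow_actor"
    if user_type == "Bot" or login.endswith("[bot]"):
        return "bot_actor"
    if user_type == "User":
        return "visible_non_bot_actor"
    return "unknown_actor"
```

Note that `visible_non_bot_actor` only describes what the API exposes. It makes no claim about whether the account's activity was AI-assisted.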

---

### 7. Per-Safe-Output Implementation Rules

#### Issues and Discussions

#### `create_issue`

Human check: did the issue receive meaningful triage, assignment, linkage, closure, or completion?

| Repository state | Outcome | Strength |
| --- | ---: | ---: |
| Closed as completed or resolved | `accepted` | strong |
| Assigned, labeled by non-bot, milestone added, linked to PR, referenced in commit or PR | `accepted` | medium |
| Open with non-bot comment, reaction, or triage | `pending` or `accepted`, depending on policy | medium |
| Closed as duplicate, not planned, invalid, or spam | `rejected` | strong |
| Open with no non-bot interaction after window | `ignored` | weak |
| Issue deleted or inaccessible | `unknown` or `rejected`, depending on audit availability | none |

Implementation notes:

- Store issue number and initial body hash
- Query issue state, timeline events, comments, labels, assignees, milestone, linked PRs, and closing reason
- Do not treat "issue still exists" as accepted

#### `update_issue`

Human check: did the agent's edit stick?

| Repository state | Outcome | Strength |
| --- | ---: | ---: |
| Edited fields still equal intended values after window | `accepted` | medium |
| Later visible non-bot edit preserves core change | `accepted` | medium |
| Later visible non-bot edit reverts or replaces the change | `rejected` | strong |
| Issue deleted or inaccessible | `unknown` | none |
| No way to compare before and after | `unknown` | none |

Implementation notes:

- Store field-level before and after values
- For body fields, compare normalized body hash rather than raw Markdown only
- Track title, body, state, labels, changed-by, and changed-at where available

#### `close_issue`

Human check: did the issue stay closed, or was it reopened?

| Repository state | Outcome | Strength |
| --- | ---: | ---: |
| Still closed after evaluation window | `accepted` | medium |
| Reopened by visible non-bot | `rejected` | strong |
| Closed then referenced as mistakenly closed | `rejected` | medium |
| Closed by stale or lifecycle bot only | `accepted_weak` or `ignored` | weak |
| Target missing | `unknown` | none |

Implementation nuance: a close action can be harmful even if the issue remains closed. When possible, inspect timeline comments after closure for reopen attempts or challenge comments.

#### `link_sub_issue`

Human check: did the sub-issue relationship stick?

| Repository state | Outcome | Strength |
| --- | ---: | ---: |
| Parent/sub-issue link still present | `accepted` | medium |
| Link removed by visible non-bot | `rejected` | strong |
| Target issue closed as duplicate or invalid | `rejected` or `ignored`, depending on context | medium |
| Parent or child inaccessible | `unknown` | none |

#### `create_discussion`

Human check: did the discussion receive meaningful engagement or an accepted answer?

| Repository state | Outcome | Strength |
| --- | ---: | ---: |
| Answer marked or accepted answer present | `accepted` | strong |
| Non-bot replies or meaningful reactions | `accepted` | medium |
| Discussion converted or linked to issue or PR | `accepted` | medium |
| Closed as duplicate, spam, or off-topic | `rejected` | strong |
| Exists with no engagement after window | `ignored` | weak |

#### `update_discussion`

Human check mirrors `update_issue`.

#### `close_discussion`

Human check: did the discussion stay closed?

| Repository state | Outcome | Strength |
| --- | ---: | ---: |
| Still closed after window | `accepted` | medium |
| Reopened by visible non-bot | `rejected` | strong |
| Non-bot comments indicate close was wrong | `rejected` | medium |
| Deleted or inaccessible | `unknown` | none |

#### Pull Requests

#### `create_pull_request`

Human check: was the PR merged, closed unmerged, reviewed, or still pending?

| Repository state | Outcome | Strength |
| --- | ---: | ---: |
| Merged | `accepted` | strong |
| Merged then reverted within durability window | `rejected` or `accepted_then_reverted` | strong |
| Approved by visible non-bot but not merged yet | `pending` with positive signal | medium |
| Open with review activity | `pending` | medium |
| Closed unmerged | `rejected` | strong |
| Open with no activity after long window | `ignored` | weak |
| PR creation fell back to issue | evaluate as `create_issue` with fallback subtype | varies |

Implementation notes:

- Store PR number, branch, commit SHAs, patch hash, labels, reviewers, and fallback behavior
- For durable research, check for revert commits after merge
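
The `create_pull_request` table can be sketched as an ordered series of checks. The input keys (`merged`, `reverted`, `state`, `approved_by_non_bot`, `has_review_activity`) are a hypothetical pre-fetched summary, not a real API shape, and every signal name except `pull_request_merged` and `target_exists_only` is invented for illustration. Where the table allows `rejected` or `accepted_then_reverted`, this sketch picks `rejected`.

```python
def evaluate_create_pull_request(pr: dict, long_window_elapsed: bool) -> tuple:
    """Return (outcome_status, evidence_strength, human_check_signal)."""
    if pr.get("merged"):
        if pr.get("reverted"):
            return ("rejected", "strong", "pull_request_reverted")
        return ("accepted", "strong", "pull_request_merged")
    if pr.get("state") == "closed":
        return ("rejected", "strong", "pull_request_closed_unmerged")
    if pr.get("approved_by_non_bot"):
        # Positive signal, but the PR has not landed yet.
        return ("pending", "medium", "pull_request_approved")
    if pr.get("has_review_activity"):
        return ("pending", "medium", "pull_request_under_review")
    if long_window_elapsed:
        return ("ignored", "weak", "no_activity_after_window")
    # Existence alone is never acceptance.
    return ("pending", "weak", "target_exists_only")
```

The ordering matters: terminal states (merged, closed) are checked before in-flight signals, and the existence-only fallback comes last.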

#### `update_pull_request`

Human check mirrors `update_issue`, but the evidence is stronger when the PR later merges.

| Repository state | Outcome | Strength |
| --- | ---: | ---: |
| Updated fields remain unchanged | `accepted` | medium |
| PR merged after update and update remained relevant | `accepted` | strong |
| Fields reverted or replaced by non-bot | `rejected` | strong |
| PR closed unmerged after update | often `rejected` | medium |
| Cannot compare | `unknown` | none |

#### `close_pull_request`

Human check: did the PR stay closed, or was the close undone?

| Repository state | Outcome | Strength |
| --- | ---: | ---: |
| Still closed unmerged after window | `accepted` | medium |
| Reopened by visible non-bot | `rejected` | strong |
| Later merged after reopening | `rejected` | strong |
| Comment indicates premature close | `rejected` | medium |
| Target missing | `unknown` | none |

#### `create_pull_request_review_comment`

Human check: did the inline comment lead to resolution, reply, code change, or review action?

| Repository state | Outcome | Strength |
| --- | ---: | ---: |
| Thread resolved by visible non-bot | `accepted` | strong |
| Comment replied to by visible non-bot | `accepted` | medium |
| Follow-up commit touches commented lines or files | `accepted` | medium |
| Comment marked outdated due to changes and PR merged | `accepted` | medium |
| Comment deleted or minimized as abuse/off-topic | `rejected` | strong |
| No reply or no resolution after PR closed or merged | `ignored` | weak |
| PR still open and thread unresolved | `pending` | weak or medium |

#### `submit_pull_request_review`

Human check: did the submitted review affect the PR lifecycle?

| Repository state | Outcome | Strength |
| --- | ---: | ---: |
| Review approved and PR merged | `accepted` | strong |
| Changes requested and later addressed by commits | `accepted` | medium |
| Review dismissed by visible non-bot | `rejected` | strong |
| Review contradicted by later human review | `rejected` or `mixed` | medium |
| PR closed without addressing review | contextual | medium |
| PR still open with review pending | `pending` | medium |

#### `reply_to_pull_request_review_comment`

Human check: did the reply advance or resolve the thread?

| Repository state | Outcome | Strength |
| --- | ---: | ---: |
| Thread resolved after reply | `accepted` | strong |
| Visible non-bot replies positively or continues constructively | `accepted` | medium |
| Reply is deleted or minimized | `rejected` | strong |
| Thread remains unresolved and PR closes without action | `ignored` or `rejected`, depending on context | weak or medium |
| PR still open | `pending` | weak |

#### `resolve_pull_request_review_thread`

Human check: did the thread stay resolved?

| Repository state | Outcome | Strength |
| --- | ---: | ---: |
| Thread still resolved after window | `accepted` | medium |
| Thread reopened by visible non-bot | `rejected` | strong |
| PR merged after resolution | `accepted` | strong |
| PR closed unmerged after resolution | contextual | medium |
| Thread inaccessible | `unknown` | none |

#### `push_to_pull_request_branch`

Human check: were the pushed commits accepted through the PR lifecycle?

| Repository state | Outcome | Strength |
| --- | ---: | ---: |
| PR merged with pushed commits included | `accepted` | strong |
| Pushed commits later modified but PR merged | `accepted` or `mixed` | medium |
| Pushed commits reverted or dropped | `rejected` | strong |
| PR closed unmerged | `rejected` | strong |
| PR open with review or CI activity | `pending` | medium |
| PR open with no activity after long window | `ignored` | weak |

#### `mark_pull_request_as_ready_for_review`

Human check: did readiness lead to actual review or merge activity?

| Repository state | Outcome | Strength |
| --- | ---: | ---: |
| Reviewed by visible non-bot after mark-ready | `accepted` | strong |
| Merged after mark-ready | `accepted` | strong |
| Converted back to draft by visible non-bot | `rejected` | strong |
| Closed unmerged with no review | `rejected` or `ignored` | medium |
| Open with no review after window | `pending` or `ignored` | weak |

#### `add_reviewer`

Human check: did the requested reviewer review, comment, approve, or remain meaningfully assigned?

| Repository state | Outcome | Strength |
| --- | ---: | ---: |
| Requested reviewer submitted review | `accepted` | strong |
| Requested reviewer commented | `accepted` | medium |
| Reviewer request still pending while PR open | `pending` | weak |
| Reviewer removed by visible non-bot | `rejected` | medium or strong |
| PR merged without that reviewer acting | `ignored` or `accepted_weak`, depending on policy | weak |
| PR closed unmerged with no reviewer action | `ignored` | weak |

Implementation notes:

- Store exact requested reviewers and teams
- Team review requests need special handling because any member review may satisfy the team request
- Do not treat PR existence as accepted
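
The team-handling note can be illustrated with a small check. It assumes team membership was captured or fetched separately and is passed in as a mapping from team name to member logins. The function name and the `team:` key prefix are hypothetical.

```python
def reviewer_requests_satisfied(
    requested_users: set,
    requested_teams: dict,
    review_authors: set,
) -> dict:
    """Report, per requested reviewer or team, whether a review arrived."""
    result = {user: user in review_authors for user in requested_users}
    for team, members in requested_teams.items():
        # A team request counts as satisfied if any member submitted a review.
        result[f"team:{team}"] = bool(set(members) & review_authors)
    return result


print(reviewer_requests_satisfied(
    {"alice", "bob"},
    {"core": {"carol", "dave"}},
    {"alice", "dave"},
))
```

Each `True` entry corresponds to the "requested reviewer submitted review" row. Entries that stay `False` fall through to the pending, removed, or ignored rows depending on PR state.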

#### `assign_to_agent`

Human check: did the assigned agent produce useful downstream work?

| Repository state | Outcome | Strength |
| --- | ---: | ---: |
| Agent-created PR merged | `accepted` | strong |
| Agent-created PR reviewed positively | `pending` with positive signal | medium |
| Issue closed as completed due to agent work | `accepted` | strong |
| Agent PR closed unmerged | `rejected` | strong |
| Agent assigned but no PR or activity after window | `ignored` | weak |
| Issue solved by someone else | `accepted_other_actor` or `neutral` | medium |

#### Labels, Metadata, and Planning

#### `add_labels`

Human check: did the label classification stick?

| Repository state | Outcome | Strength |
| --- | ---: | ---: |
| All labels retained after triage window | `accepted` | medium |
| Some labels retained, some removed | `mixed` | medium |
| All labels removed by visible non-bot | `rejected` | strong |
| Target closed or merged while labels retained | `accepted` | medium |
| No permission to read labels | `unknown` | none |

Use partial scoring where useful.
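
A partial score could be as simple as the retained fraction of the labels the agent added. The sketch below is an assumption about how such scoring might look. It does not check who removed a label, so the strength of a `rejected` result still has to come from timeline data. `label_retention` is a hypothetical helper name.

```python
def label_retention(labels_added: list, current_labels: list) -> tuple:
    """Return (status, retained_fraction) for an add_labels output."""
    added = set(labels_added)
    if not added:
        # Nothing was recorded at execution time, so nothing can be compared.
        return ("unknown", 0.0)
    retained = added & set(current_labels)
    score = len(retained) / len(added)
    if score == 1.0:
        return ("accepted", score)
    if score == 0.0:
        return ("rejected", score)
    return ("mixed", score)


print(label_retention(["bug", "needs-triage"], ["bug", "p1"]))
```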

#### `assign_milestone`

Human check: did the milestone remain assigned?

| Repository state | Outcome | Strength |
| --- | ---: | ---: |
| Same milestone still set after window | `accepted` | medium |
| Milestone changed by visible non-bot | `rejected` or `mixed` | medium |
| Milestone removed | `rejected` | strong |
| Issue or PR completed under milestone | `accepted` | strong |
| Target missing | `unknown` | none |

#### `update_project`

Human check: did the specific project field update remain?

| Repository state | Outcome | Strength |
| --- | ---: | ---: |
| Field value still equals intended value | `accepted` | medium |
| Field later moved forward in workflow, for example Todo to In Progress to Done | `accepted` | strong |
| Field reverted or changed away by visible non-bot | `rejected` or `mixed` | medium |
| Project item removed | `rejected` | strong |
| Project inaccessible | `unknown` | none |

#### `set_issue_type`

Human check: did the issue type remain set?

| Repository state | Outcome | Strength |
| --- | ---: | ---: |
| Type still equals intended type | `accepted` | medium |
| Type changed by visible non-bot | `rejected` or `mixed` | medium |
| Type cleared by visible non-bot | `rejected` | strong |
| Issue completed with type retained | `accepted` | strong |
| Cannot read issue type | `unknown` | none |

#### `set_issue_field`

Human check: did the specific field value remain?

| Repository state | Outcome | Strength |
| --- | ---: | ---: |
| Field still equals intended value | `accepted` | medium |
| Field advanced to later valid workflow state | `accepted` | strong |
| Field changed away by visible non-bot | `rejected` or `mixed` | medium |
| Field removed or item removed | `rejected` | strong |
| Cannot read field | `unknown` | none |

#### Workflows, Security, Releases, Code Scanning

#### `dispatch_workflow`

Human check: did the dispatched workflow run successfully?

| Repository state | Outcome | Strength |
| --- | ---: | ---: |
| Dispatched run completed successfully | `accepted` | strong |
| Run failed, cancelled, or timed out | `rejected` | strong |
| Run still in progress | `pending` | medium |
| No matching run found | `unknown` or `rejected`, depending on dispatch API result | none or medium |
| Run succeeded but produced no expected artifact | `accepted_weak` or `mixed` | weak |

#### `autofix_code_scanning_alert`

Human check: was the alert actually fixed?

| Repository state | Outcome | Strength |
| --- | ---: | ---: |
| Alert fixed or closed by code change | `accepted` | strong |
| Linked PR merged and alert fixed | `accepted` | strong |
| Alert dismissed as false positive or won't fix by visible non-bot | contextual acceptance for triage, not fix | medium |
| Alert remains open after window | `pending` or `rejected`, depending on SLA | weak or medium |
| Autofix PR closed unmerged | `rejected` | strong |
| Alert disappeared or inaccessible | `unknown` | none |

Differentiate fixed, triaged, rejected, and pending states rather than flattening them.

#### `create_code_scanning_alert`

Human check: was the alert triaged, fixed, or dismissed with reason?

| Repository state | Outcome | Strength |
| --- | ---: | ---: |
| Alert acknowledged or triaged | `accepted` | medium |
| Alert fixed | `accepted` | strong |
| Alert dismissed with reason by visible non-bot | `accepted` or `rejected`, depending on expected semantics | medium |
| Alert deleted or invalid | `rejected` | medium |
| Alert open with no activity | `pending` or `ignored` | weak |

#### `update_release`

Human check: did the release edit remain?

| Repository state | Outcome | Strength |
| --- | ---: | ---: |
| Release fields still match intended update | `accepted` | medium |
| Release published after draft update | `accepted` | strong |
| Release edited again to revert agent change | `rejected` | strong |
| Release deleted | `rejected` | strong |
| Cannot compare fields | `unknown` | none |

#### Comment Moderation

#### `hide_comment`

Human check: did the hidden or minimized state persist?

| Repository state | Outcome | Strength |
| --- | ---: | ---: |
| Comment still hidden or minimized | `accepted` | medium |
| Comment unhidden by visible non-bot | `rejected` | strong |
| Comment deleted after hiding | often `accepted` for moderation | medium |
| Target missing | `unknown` | none |

#### User Assignment

#### `assign_to_user`

Human check: did the assignment stick or result in user action?

| Repository state | Outcome | Strength |
| --- | ---: | ---: |
| Assigned user comments, reviews, commits, or closes item | `accepted` | strong |
| Assignment remains while item open | `pending` | weak |
| User removed by visible non-bot | `rejected` | medium |
| Item completed while user assigned | `accepted` | medium |
| Item closed without assignee action | `ignored` | weak |

#### `unassign_from_user`

Human check: did the user remain unassigned?

| Repository state | Outcome | Strength |
| --- | ---: | ---: |
| User remains unassigned after window | `accepted` | medium |
| User re-assigned by visible non-bot | `rejected` | strong |
| Item closed after unassignment | contextual | medium |
| Cannot read assignees | `unknown` | none |

#### System Outputs

#### `noop`

No human-facing outcome should be evaluated.

```json
{
  "outcome_status": "skipped",
  "evidence_strength": "none",
  "human_check_signal": "no_action_requested"
}
```

#### `missing_tool`

No human-facing outcome should be evaluated.

```json
{
  "outcome_status": "skipped",
  "evidence_strength": "none",
  "human_check_signal": "tool_unavailable"
}
```

---

### 8. Avoiding The Biggest Measurement Bug

The major measurement bug is:

```text
target object exists -> accepted
```

That is not a human check. It is only a weak liveness check.

Use this instead:

```text
target object exists -> target_resolved = true
type-specific acceptance signal exists -> accepted
otherwise -> pending / ignored / weak evidence
```

Example:

```json
{
  "target_resolved": true,
  "outcome_status": "pending",
  "evidence_strength": "weak",
  "human_check_signal": "target_exists_only"
}
```
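
A generic fallback along these lines might look like the sketch below. The function name is hypothetical, and reusing `target_not_found_or_inaccessible` as a signal value is an assumption made here for illustration.

```python
def generic_fallback(target_resolved: bool) -> dict:
    """Fallback for types without a dedicated evaluator; never 'accepted'."""
    if not target_resolved:
        return {
            "target_resolved": False,
            "outcome_status": "unknown",
            "evidence_strength": "none",
            "human_check_signal": "target_not_found_or_inaccessible",
        }
    # The target exists, but existence alone is only weak liveness evidence.
    return {
        "target_resolved": True,
        "outcome_status": "pending",
        "evidence_strength": "weak",
        "human_check_signal": "target_exists_only",
    }
```

Because the function has no code path that returns `accepted`, a missing type-specific evaluator cannot inflate acceptance rates.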

---

### 9. Suggested Scoring Model

For research dashboards, compute three acceptance rates.

### Strict Acceptance Rate

Only strong evidence.

```text
strict_acceptance_rate =
  strong_accepted / evaluable_outputs
```

### Human-Check Acceptance Rate

Strong plus medium evidence.

```text
human_check_acceptance_rate =
  (strong_accepted + medium_accepted) / evaluable_outputs
```

### Sticky Artifact Rate

Weak survival evidence only.

```text
sticky_artifact_rate =
  weak_accepted_or_target_exists / evaluable_outputs
```

Never mix these into one unlabeled number.
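
The three rates can be computed side by side, as in the sketch below. This issue does not define `evaluable_outputs` precisely, so the sketch assumes it means every outcome whose status is neither `skipped` nor `unknown`. The function name and the tuple layout are also assumptions.

```python
NON_EVALUABLE = {"skipped", "unknown"}  # assumed definition of "not evaluable"


def acceptance_rates(outcomes: list) -> dict:
    """outcomes: (outcome_status, evidence_strength, human_check_signal) tuples."""
    evaluable = [o for o in outcomes if o[0] not in NON_EVALUABLE]
    n = len(evaluable)
    if n == 0:
        return {"strict": 0.0, "human_check": 0.0, "sticky_artifact": 0.0}
    strong = sum(1 for s, e, _ in evaluable if s == "accepted" and e == "strong")
    medium = sum(1 for s, e, _ in evaluable if s == "accepted" and e == "medium")
    # Weak survival evidence: weak "accepted" or the existence-only signal.
    weak = sum(
        1 for s, e, sig in evaluable
        if (s == "accepted" and e == "weak") or sig == "target_exists_only"
    )
    return {
        "strict": strong / n,
        "human_check": (strong + medium) / n,
        "sticky_artifact": weak / n,
    }
```

Returning the three rates under separate keys keeps them labeled all the way into the report.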

---

### 10. Implementation Architecture

Recommended pipeline:

```text
safe-output execution
  ↓
write normalized safe_output_event records
  ↓
outcome collector scheduled job
  ↓
fetch current GitHub state
  ↓
apply type-specific evaluator
  ↓
write outcome records
  ↓
aggregate dashboard/report
```

Pseudo-code:

```python
def evaluate_safe_output(event, now):
    # System outputs have no target, so skip them before any fetch.
    if event.type in {"noop", "missing_tool"}:
        return skipped(event)

    target = fetch_target(event)
    if target is None:
        return unknown(event, reason="target_not_found_or_inaccessible")

    evaluator = EVALUATORS.get(event.type)
    if evaluator is None:
        # No dedicated rule yet: report that fact, never default to accepted.
        return unknown(event, reason="no_type_specific_evaluator")

    return evaluator(event, target, now)
```

Each evaluator should return something like:

```python
Outcome(
    status="accepted",
    strength="strong",
    signal="pull_request_merged",
    confidence="high",
    details={...}
)
```

---

### 11. Minimal Evaluator Interface

```python
class OutcomeEvaluator:
    safe_output_type: str

    def fetch(self, event: SafeOutputEvent) -> TargetState:
        ...

    def evaluate(
        self,
        event: SafeOutputEvent,
        state: TargetState,
        window: EvaluationWindow
    ) -> Outcome:
        ...

    def required_event_fields(self) -> list[str]:
        ...

    def required_github_scopes(self) -> list[str]:
        ...
```
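
To make the interface concrete, here is a sketch of one evaluator for `close_issue`. Everything beyond the method names is an assumption for illustration: the dataclass fields, the `reopened_by_non_bot` flag (which presumes bot filtering happened upstream), the in-memory dict standing in for a GitHub client, the signal strings, and the `issues:read` placeholder scope.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class SafeOutputEvent:
    type: str
    target_number: int
    executed_at: datetime


@dataclass
class TargetState:
    state: str  # "open" or "closed"
    reopened_by_non_bot: bool = False


@dataclass
class EvaluationWindow:
    length: timedelta
    now: datetime

    def elapsed(self, event: SafeOutputEvent) -> bool:
        return self.now - event.executed_at >= self.length


@dataclass
class Outcome:
    status: str
    strength: str
    signal: str


class CloseIssueEvaluator:
    safe_output_type = "close_issue"

    def __init__(self, states: dict):
        # Hypothetical stand-in for a GitHub API client: issue number -> state.
        self._states = states

    def fetch(self, event: SafeOutputEvent) -> Optional[TargetState]:
        return self._states.get(event.target_number)

    def evaluate(self, event, state, window) -> Outcome:
        if state is None:
            return Outcome("unknown", "none", "target_missing")
        if state.reopened_by_non_bot:
            return Outcome("rejected", "strong", "reopened_by_non_bot")
        if state.state == "closed" and window.elapsed(event):
            return Outcome("accepted", "medium", "still_closed_after_window")
        # Assumption: anything else is not yet judgeable.
        return Outcome("pending", "weak", "window_not_elapsed")

    def required_event_fields(self) -> list:
        return ["target_number", "executed_at"]

    def required_github_scopes(self) -> list:
        return ["issues:read"]  # placeholder permission name
```

Unlike the interface above, `fetch` here returns `None` for a missing target so that `evaluate` can map it to `unknown`.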

---

### 12. Dashboard Fields

For implementation and research, report:

```text
total_safe_outputs
evaluable_outputs
accepted_strong
accepted_medium
accepted_weak
rejected
pending
ignored
unknown
skipped
fallback_exists_only_count
missing_type_specific_rule_count
durable_reversal_count
```

And by type:

```text
safe_output_type
count
strict_acceptance_rate
human_check_acceptance_rate
sticky_artifact_rate
rejection_rate
pending_rate
unknown_rate
median_time_to_acceptance
```
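
A sketch of the per-type rollup for the non-acceptance columns. It assumes each record carries `safe_output_type`, `outcome_status`, and an optional `time_to_acceptance_hours` value. That last field name, the use of the per-type record count as the denominator, and the function name `rollup_by_type` are all assumptions.

```python
from collections import defaultdict
from statistics import median


def rollup_by_type(records):
    """Group outcome records by safe output type and summarize each group."""
    groups = defaultdict(list)
    for r in records:
        groups[r["safe_output_type"]].append(r)

    rows = []
    for output_type, group in sorted(groups.items()):
        n = len(group)

        def rate(status):
            return sum(r["outcome_status"] == status for r in group) / n

        # Only accepted records with a recorded duration feed the median.
        times = [
            r["time_to_acceptance_hours"]
            for r in group
            if r["outcome_status"] == "accepted"
            and r.get("time_to_acceptance_hours") is not None
        ]
        rows.append({
            "safe_output_type": output_type,
            "count": n,
            "rejection_rate": rate("rejected"),
            "pending_rate": rate("pending"),
            "unknown_rate": rate("unknown"),
            "median_time_to_acceptance": median(times) if times else None,
        })
    return rows
```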

---

### 13. Recommended Implementation Priority

Implement in this order:

1. `create_pull_request`
2. `push_to_pull_request_branch`
3. `create_issue`
4. `add_comment`
5. `add_labels`
6. `update_issue`
7. `update_pull_request`
8. `add_reviewer`
9. `submit_pull_request_review`
10. `dispatch_workflow`
11. `autofix_code_scanning_alert`
12. project, milestone, and field metadata outputs
13. discussions
14. moderation outputs
15. system outputs

Why: PRs, issues, comments, labels, and workflows give the highest research value and the clearest human-check signals.

---

### 14. Final Implementation Definition

Use this definition in the implementation spec:

> **Accepted Outcome:** A type-specific, post-output repository state that provides observable evidence a human repository observer would reasonably interpret as the safe output being useful, correct, acted on, or intentionally retained.

And this warning:

> Generic target existence must not be counted as accepted outcome evidence except as `weak` evidence under a separate `target_exists_only` signal.

Together, the definition and the warning provide both operational compatibility and research rigor.

</details>


| State | Meaning |
| --- | --- |
| `accepted` | Observable repository state suggests the output was useful, correct, or intentionally retained. |
| `rejected` | Observable repository state suggests the output was undone, removed, closed as invalid, reverted, or contradicted. |
| `pending` | The output is still in flight; there has not been enough time or activity to judge. |
| `ignored` | The output exists but received no meaningful human or repository response within the evaluation window. |
| `skipped` | No human-facing outcome should be evaluated, for example `noop`, `missing_tool`, or cancelled outputs. |
| `unknown` | Required target object or audit data could not be fetched. |

| Evidence strength | Meaning |
| --- | --- |
| `strong` | Direct human or repository acceptance signal: merged PR, resolved review, workflow success, human reply, retained metadata after triage. |
| `medium` | Indirect but meaningful signal: issue triaged, label retained, milestone retained, linked issue retained. |
| `weak` | Artifact merely still exists or target still exists. |
| `none` | No measurable evidence. |
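
The two vocabularies above can be pinned down as enumerations so that malformed records fail early. A minimal sketch follows; the class and function names are hypothetical.

```python
from enum import Enum


class OutcomeStatus(str, Enum):
    ACCEPTED = "accepted"
    REJECTED = "rejected"
    PENDING = "pending"
    IGNORED = "ignored"
    SKIPPED = "skipped"
    UNKNOWN = "unknown"


class EvidenceStrength(str, Enum):
    STRONG = "strong"
    MEDIUM = "medium"
    WEAK = "weak"
    NONE = "none"


def parse_outcome(record: dict):
    """Validate a record's status and strength; raises ValueError if invalid."""
    return (
        OutcomeStatus(record["outcome_status"]),
        EvidenceStrength(record["evidence_strength"]),
    )
```

Some per-type tables in this issue use compound labels such as `mixed`, `accepted_weak`, or `accepted_then_reverted`. Under this sketch those would be rejected, so an implementation would need either to extend the enumeration or to map each compound label onto a status plus a strength.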

| Repository state | Outcome | Strength |
| --- | --- | --- |
| Closed as completed or resolved | `accepted` | strong |
| Assigned, labeled by non-bot, milestone added, linked to PR, referenced in commit or PR | `accepted` | medium |
| Open with non-bot comment, reaction, or triage | `pending` or `accepted_medium`, depending on policy | medium |
| Closed as duplicate, not planned, invalid, or spam | `rejected` | strong |
| Open with no non-bot interaction after window | `ignored` | weak |
| Issue deleted or inaccessible | `unknown` or `rejected`, depending on audit availability | none |

| Repository state | Outcome | Strength |
| --- | --- | --- |
| Edited fields still equal intended values after window | `accepted` | medium |
| Later visible non-bot edit preserves core change | `accepted` | medium |
| Later visible non-bot edit reverts or replaces the change | `rejected` | strong |
| Issue deleted or inaccessible | `unknown` | none |
| No way to compare before and after | `unknown` | none |

| Repository state | Outcome | Strength |
| --- | --- | --- |
| Still closed after evaluation window | `accepted` | medium |
| Reopened by visible non-bot | `rejected` | strong |
| Closed then referenced as mistakenly closed | `rejected` | medium |
| Closed by stale or lifecycle bot only | `accepted_weak` or `ignored` | weak |
| Target missing | `unknown` | none |

| Repository state | Outcome | Strength |
| --- | --- | --- |
| Parent/sub-issue link still present | `accepted` | medium |
| Link removed by visible non-bot | `rejected` | strong |
| Target issue closed as duplicate or invalid | `rejected` or `ignored`, depending on context | medium |
| Parent or child inaccessible | `unknown` | none |

| Repository state | Outcome | Strength |
| --- | --- | --- |
| Answer marked or accepted answer present | `accepted` | strong |
| Non-bot replies or meaningful reactions | `accepted` | medium |
| Discussion converted or linked to issue or PR | `accepted` | medium |
| Closed as duplicate, spam, or off-topic | `rejected` | strong |
| Exists with no engagement after window | `ignored` | weak |

| Repository state | Outcome | Strength |
| --- | --- | --- |
| Still closed after window | `accepted` | medium |
| Reopened by visible non-bot | `rejected` | strong |
| Non-bot comments indicate close was wrong | `rejected` | medium |
| Deleted or inaccessible | `unknown` | none |

| Repository state | Outcome | Strength |
| --- | --- | --- |
| Merged | `accepted` | strong |
| Merged then reverted within durability window | `rejected` or `accepted_then_reverted` | strong |
| Approved by visible non-bot but not merged yet | `pending` with positive signal | medium |
| Open with review activity | `pending` | medium |
| Closed unmerged | `rejected` | strong |
| Open with no activity after long window | `ignored` | weak |
| PR creation fell back to issue | evaluate as `create_issue` with fallback subtype | varies |
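
The pull request rows above translate into an ordered rule chain. The sketch below is one possible encoding, not the project's evaluator: `merged` and `state` mirror fields GitHub exposes on a pull request, while `reverted_within_durability_window`, `approved_by_non_bot`, and `has_review_activity` are hypothetical flags an implementation would have to derive. It picks `rejected` for the reverted case, treats an open PR inside the window as `pending` with weak evidence (an assumption), and does not handle the issue-fallback row. Apart from `pull_request_merged`, the signal strings are invented for illustration.

```python
def evaluate_created_pr(pr: dict, window_elapsed: bool):
    """Return (status, strength, signal) for a PR created by a safe output."""
    if pr.get("merged"):
        if pr.get("reverted_within_durability_window"):
            return ("rejected", "strong", "pull_request_reverted")
        return ("accepted", "strong", "pull_request_merged")
    if pr.get("state") == "closed":
        return ("rejected", "strong", "pull_request_closed_unmerged")
    # From here on the PR is open and unmerged.
    if pr.get("approved_by_non_bot"):
        return ("pending", "medium", "approved_not_merged")
    if pr.get("has_review_activity"):
        return ("pending", "medium", "review_activity")
    if window_elapsed:
        return ("ignored", "weak", "no_activity_after_window")
    # Assumption: too early to judge an untouched open PR.
    return ("pending", "weak", "window_not_elapsed")
```

Rule order matters: merge and close are checked first because they are terminal, so review activity on a PR that was later closed unmerged does not mask the rejection.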

| Repository state | Outcome | Strength |
| --- | --- | --- |
| Updated fields remain unchanged | `accepted` | medium |
| PR merged after update and update remained relevant | `accepted` | strong |
| Fields reverted or replaced by non-bot | `rejected` | strong |
| PR closed unmerged after update | often `rejected` | medium |
| Cannot compare | `unknown` | none |

| Repository state | Outcome | Strength |
| --- | --- | --- |
| Still closed unmerged after window | `accepted` | medium |
| Reopened by visible non-bot | `rejected` | strong |
| Later merged after reopening | `rejected` | strong |
| Comment indicates premature close | `rejected` | medium |
| Target missing | `unknown` | none |

| Repository state | Outcome | Strength |
| --- | --- | --- |
| Thread resolved by visible non-bot | `accepted` | strong |
| Comment replied to by visible non-bot | `accepted` | medium |
| Follow-up commit touches commented lines or files | `accepted` | medium |
| Comment marked outdated due to changes and PR merged | `accepted` | medium |
| Comment deleted or minimized as abuse/off-topic | `rejected` | strong |
| No reply or no resolution after PR closed or merged | `ignored` | weak |
| PR still open and thread unresolved | `pending` | weak or medium |

| Repository state | Outcome | Strength |
| --- | --- | --- |
| Review approved and PR merged | `accepted` | strong |
| Changes requested and later addressed by commits | `accepted` | medium |
| Review dismissed by visible non-bot | `rejected` | strong |
| Review contradicted by later human review | `rejected` or `mixed` | medium |
| PR closed without addressing review | contextual | medium |
| PR still open with review pending | `pending` | medium |

| Repository state | Outcome | Strength |
| --- | --- | --- |
| Thread resolved after reply | `accepted` | strong |
| Visible non-bot replies positively or continues constructively | `accepted` | medium |
| Reply is deleted or minimized | `rejected` | strong |
| Thread remains unresolved and PR closes without action | `ignored` or `rejected`, depending on context | weak or medium |
| PR still open | `pending` | weak |

| Repository state | Outcome | Strength |
| --- | --- | --- |
| Thread still resolved after window | `accepted` | medium |
| Thread reopened by visible non-bot | `rejected` | strong |
| PR merged after resolution | `accepted` | strong |
| PR closed unmerged after resolution | contextual | medium |
| Thread inaccessible | `unknown` | none |

| Repository state | Outcome | Strength |
| --- | --- | --- |
| PR merged with pushed commits included | `accepted` | strong |
| Pushed commits later modified but PR merged | `accepted` or `mixed` | medium |
| Pushed commits reverted or dropped | `rejected` | strong |
| PR closed unmerged | `rejected` | strong |
| PR open with review or CI activity | `pending` | medium |
| PR open with no activity after long window | `ignored` | weak |

| Repository state | Outcome | Strength |
| --- | --- | --- |
| Reviewed by visible non-bot after mark-ready | `accepted` | strong |
| Merged after mark-ready | `accepted` | strong |
| Converted back to draft by visible non-bot | `rejected` | strong |
| Closed unmerged with no review | `rejected` or `ignored` | medium |
| Open with no review after window | `pending` or `ignored` | weak |

| Repository state | Outcome | Strength |
| --- | --- | --- |
| Requested reviewer submitted review | `accepted` | strong |
| Requested reviewer commented | `accepted` | medium |
| Reviewer request still pending while PR open | `pending` | weak |
| Reviewer removed by visible non-bot | `rejected` | medium or strong |
| PR merged without that reviewer acting | `ignored` or `accepted_weak`, depending on policy | weak |
| PR closed unmerged with no reviewer action | `ignored` | weak |

| Repository state | Outcome | Strength |
| --- | --- | --- |
| Agent-created PR merged | `accepted` | strong |
| Agent-created PR reviewed positively | `pending` with positive signal | medium |
| Issue closed as completed due to agent work | `accepted` | strong |
| Agent PR closed unmerged | `rejected` | strong |
| Agent assigned but no PR or activity after window | `ignored` | weak |
| Issue solved by someone else | `accepted_other_actor` or `neutral` | medium |

| Repository state | Outcome | Strength |
| --- | --- | --- |
| All labels retained after triage window | `accepted` | medium |
| Some labels retained, some removed | `mixed` | medium |
| All labels removed by visible non-bot | `rejected` | strong |
| Target closed or merged while labels retained | `accepted` | medium |
| No permission to read labels | `unknown` | none |
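
Label retention reduces to a set comparison between the labels the safe output intended to add and the labels currently on the target. A minimal sketch follows, assuming the intended labels were captured at execution time, that `current` is `None` when labels cannot be read, and that removals by bots have already been filtered out upstream. The function name and signal strings are hypothetical.

```python
def evaluate_added_labels(intended, current):
    """Return (status, strength, signal) based on which labels survived."""
    if current is None or not intended:
        # Nothing to compare: unreadable labels or no recorded intent.
        return ("unknown", "none", "labels_unreadable")
    wanted = set(intended)
    retained = wanted & set(current)
    if retained == wanted:
        return ("accepted", "medium", "labels_retained")
    if not retained:
        # Assumption: every removal was made by a visible non-bot actor.
        return ("rejected", "strong", "labels_removed")
    return ("mixed", "medium", "labels_partially_retained")
```

Labels that humans added on top of the intended set do not affect the result, because only the intersection with `intended` is inspected.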

Improve Safe Output Outcome Evaluation #35033

| Repository state | Outcome | Strength |
| --- | --- | --- |
| Same milestone still set after window | `accepted` | medium |
| Milestone changed by visible non-bot | `rejected` or `mixed` | medium |
| Milestone removed | `rejected` | strong |
| Issue or PR completed under milestone | `accepted` | strong |
| Target missing | `unknown` | none |

| Repository state | Outcome | Strength |
| --- | --- | --- |
| Field value still equals intended value | `accepted` | medium |
| Field later moved forward in workflow, for example Todo to In Progress to Done | `accepted` | strong |
| Field reverted or changed away by visible non-bot | `rejected` or `mixed` | medium |
| Project item removed | `rejected` | strong |
| Project inaccessible | `unknown` | none |

| Repository state | Outcome | Strength |
| --- | --- | --- |
| Type still equals intended type | `accepted` | medium |
| Type changed by visible non-bot | `rejected` or `mixed` | medium |
| Type cleared by visible non-bot | `rejected` | strong |
| Issue completed with type retained | `accepted` | strong |
| Cannot read issue type | `unknown` | none |

| Repository state | Outcome | Strength |
| --- | --- | --- |
| Field still equals intended value | `accepted` | medium |
| Field advanced to later valid workflow state | `accepted` | strong |
| Field changed away by visible non-bot | `rejected` or `mixed` | medium |
| Field removed or item removed | `rejected` | strong |
| Cannot read field | `unknown` | none |

| Repository state | Outcome | Strength |
| --- | --- | --- |
| Dispatched run completed successfully | `accepted` | strong |
| Run failed, cancelled, or timed out | `rejected` | strong |
| Run still in progress | `pending` | medium |
| No matching run found | `unknown` or `rejected`, depending on dispatch API result | none or medium |
| Run succeeded but produced no expected artifact | `accepted_weak` or `mixed` | weak |
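
The workflow rows above map onto the `status` and `conclusion` fields that GitHub reports for a workflow run. The sketch below assumes the caller has already located the run triggered by the dispatch (passing `None` when no match was found) and knows whether the expected artifact exists. It resolves the two ambiguous rows to `unknown` and to `accepted` with weak evidence respectively, which is a policy choice, and the signal strings are hypothetical.

```python
FAILED_CONCLUSIONS = {"failure", "cancelled", "timed_out"}


def evaluate_dispatched_run(run, expected_artifact_present: bool = True):
    """Return (status, strength, signal) for a dispatched workflow run."""
    if run is None:
        return ("unknown", "none", "no_matching_run")
    if run["status"] != "completed":
        # Queued or in-progress runs are still in flight.
        return ("pending", "medium", "run_in_progress")
    conclusion = run.get("conclusion")
    if conclusion == "success":
        if not expected_artifact_present:
            return ("accepted", "weak", "run_succeeded_without_expected_artifact")
        return ("accepted", "strong", "run_succeeded")
    if conclusion in FAILED_CONCLUSIONS:
        return ("rejected", "strong", f"run_{conclusion}")
    # Other conclusions (for example neutral or skipped) are left undecided.
    return ("unknown", "none", f"unhandled_conclusion_{conclusion}")
```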

| Repository state | Outcome | Strength |
| --- | --- | --- |
| Alert fixed or closed by code change | `accepted` | strong |
| Linked PR merged and alert fixed | `accepted` | strong |
| Alert dismissed as false positive or won't fix by visible non-bot | contextual acceptance for triage, not fix | medium |
| Alert remains open after window | `pending` or `rejected`, depending on SLA | weak or medium |
| Autofix PR closed unmerged | `rejected` | strong |
| Alert disappeared or inaccessible | `unknown` | none |

| Repository state | Outcome | Strength |
| --- | --- | --- |
| Alert acknowledged or triaged | `accepted` | medium |
| Alert fixed | `accepted` | strong |
| Alert dismissed with reason by visible non-bot | `accepted` or `rejected`, depending on expected semantics | medium |
| Alert deleted or invalid | `rejected` | medium |
| Alert open with no activity | `pending` or `ignored` | weak |

| Repository state | Outcome | Strength |
| --- | --- | --- |
| Release fields still match intended update | `accepted` | medium |
| Release published after draft update | `accepted` | strong |
| Release edited again to revert agent change | `rejected` | strong |
| Release deleted | `rejected` | strong |
| Cannot compare fields | `unknown` | none |

| Repository state | Outcome | Strength |
| --- | --- | --- |
| Comment still hidden or minimized | `accepted` | medium |
| Comment unhidden by visible non-bot | `rejected` | strong |
| Comment deleted after hiding | often `accepted` for moderation | medium |
| Target missing | `unknown` | none |

| Repository state | Outcome | Strength |
| --- | --- | --- |
| Assigned user comments, reviews, commits, or closes item | `accepted` | strong |
| Assignment remains while item open | `pending` | weak |
| User removed by visible non-bot | `rejected` | medium |
| Item completed while user assigned | `accepted` | medium |
| Item closed without assignee action | `ignored` | weak |