
Improve Safe Output Outcome Evaluation #35033

@mnkiefer

Implement Safe Output Outcome Evaluation with evidence-strength classification

Implement safe-output outcome evaluation based on observable repository state rather than workflow self-assessment or artifact survival.

Problem

Current outcome evaluation is too weak when it treats object existence as acceptance. That inflates acceptance rates and makes workflow effectiveness metrics misleading.

We need outcome evaluation that answers:

  1. What safe output was produced?
  2. What GitHub object did it affect?
  3. What happened to that object afterward?
  4. Would a human observer classify that as accepted, rejected, pending, ignored, skipped, or unknown?
  5. How strong is the evidence?

Goals

  • Define a normalized outcome model with outcome_status and evidence_strength
  • Evaluate outcomes from repository state, not workflow self-reporting
  • Persist enough execution-time metadata to evaluate mutable operations later
  • Implement dedicated evaluators for major safe output types
  • Keep weak survival evidence separate from real acceptance
  • Emit consistent JSONL and telemetry fields for downstream reporting

Non-Goals

  • Do not build a weighted business-value model
  • Do not try to infer hidden human vs AI provenance
  • Do not collapse strong, medium, and weak acceptance into one unlabeled metric
  • Do not require perfect coverage before shipping the framework

Current State

Current outcome evaluation has a mix of dedicated evaluators, placeholder evaluators, and fallback behavior.

  • Some output types already have dedicated outcome logic
  • Some mutable operations require execution-time metadata capture before they can be evaluated correctly
  • Several safe output types still rely on weak target_exists_only fallback behavior
  • Some discussion, review-thread, workflow, and moderation types are still effectively unimplemented from a human-check perspective

Deliverables

  • a normalized outcome model in reports and JSONL output
  • explicit weak-evidence fallback behavior that does not count as accepted
  • execution-time metadata persistence for mutable operations
  • dedicated evaluators for the targeted rollout slices
  • updated documentation and formal outcome-evaluation specification
  • focused tests for each implemented evaluator family

Success Metrics

  • no existence-only fallback is counted as accepted
  • the number of types with dedicated evaluators increases
  • weak fallback outcomes are visibly separated from accepted outcomes
  • mutable operations can be evaluated as retained, reverted, replaced, or acted on
  • focused tests exist for each delivered evaluator slice

Dependencies

  • mutable update evaluators depend on before/after metadata persistence
  • review lifecycle evaluators depend on requested-state or thread metadata capture
  • workflow and code-scanning evaluators depend on correlation metadata
  • documentation and reporting should be updated after each evaluator slice lands

Planner Guidance

Use this issue as a decomposition and sequencing source, not as a single implementation task.

Recommended Execution Order

  1. shared outcome model and fallback semantics
  2. execution-time metadata persistence
  3. mutable update evaluators
  4. review lifecycle evaluators
  5. issue and comment evaluators
  6. workflow and code-scanning evaluators
  7. metadata, discussion, project, and moderation evaluators

Recommended PR Boundaries

  1. shared model and fallback semantics
  2. metadata persistence for mutable operations
  3. retained-update evaluators
  4. review lifecycle evaluators
  5. issue and comment evaluators
  6. workflow and code-scanning evaluators
  7. metadata, discussion, project, and moderation evaluators

Safe Deferrals

  • scoring-model rollups
  • richer dashboard aggregation
  • strong/medium/weak telemetry summaries beyond core item output
  • long-tail evaluator families after the first rollout slices land

Acceptance Criteria

  • Dedicated evaluators exist for the first rollout slice
  • Generic fallback does not classify existence-only as accepted
  • Mutable operations persist comparison metadata at execution time
  • JSONL and outcome reports include normalized outcome fields
  • Docs reflect the new semantics
  • Focused tests exist for each implemented evaluator slice

Tracking Checklist

  • Shared outcome model and fallback semantics
  • Manifest and artifact metadata for mutable operations
  • Mutable update evaluators
  • Review lifecycle evaluators
  • Issue and comment interaction evaluators
  • Workflow and code scanning evaluators
  • Metadata, project, discussion, and moderation evaluators
  • Docs updated after each slice
  • Telemetry fields aligned with evidence strength

Normalized Outcome Model

Outcome Status

  • accepted
  • rejected
  • pending
  • ignored
  • skipped
  • unknown

Evidence Strength

  • strong
  • medium
  • weak
  • none

Normalized Shape

{
  "safe_output_type": "create_pull_request",
  "outcome_status": "accepted",
  "evidence_strength": "strong",
  "human_check_signal": "pull_request_merged",
  "target_resolved": true,
  "confidence": "high"
}

Key Rule

Target existence alone must not count as accepted outcome evidence.

If the only observable fact is that the target still exists, classify it separately:

{
  "outcome_status": "pending",
  "evidence_strength": "weak",
  "human_check_signal": "target_exists_only",
  "target_resolved": true
}
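
The normalized shape and the key rule above can be sketched as a small typed model. This is a minimal illustration, not the project's implementation: the class names `OutcomeStatus`, `EvidenceStrength`, and `NormalizedOutcome` and the `to_jsonl` helper are assumptions introduced here; only the field names and allowed values come from this issue.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class OutcomeStatus(str, Enum):
    ACCEPTED = "accepted"
    REJECTED = "rejected"
    PENDING = "pending"
    IGNORED = "ignored"
    SKIPPED = "skipped"
    UNKNOWN = "unknown"


class EvidenceStrength(str, Enum):
    STRONG = "strong"
    MEDIUM = "medium"
    WEAK = "weak"
    NONE = "none"


@dataclass(frozen=True)
class NormalizedOutcome:
    """Hypothetical record type mirroring the normalized shape."""
    safe_output_type: str
    outcome_status: OutcomeStatus
    evidence_strength: EvidenceStrength
    human_check_signal: str
    target_resolved: bool

    def __post_init__(self):
        # Key rule: existence-only evidence must never be recorded as accepted.
        if (self.human_check_signal == "target_exists_only"
                and self.outcome_status is OutcomeStatus.ACCEPTED):
            raise ValueError("target_exists_only must not be classified as accepted")

    def to_jsonl(self) -> str:
        """Serialize to one JSON line with plain string enum values."""
        record = asdict(self)
        record["outcome_status"] = self.outcome_status.value
        record["evidence_strength"] = self.evidence_strength.value
        return json.dumps(record, sort_keys=True)
```

Rejecting the invalid combination at construction time is one way to make the key rule hard to violate by accident; a validation step at report time would serve the same purpose.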

Execution-Time Metadata Requirements

Mutable operations must persist enough state at execution time for later comparison.

Required Examples

  • update_issue
    • before/after title
    • normalized body hash
    • labels
    • assignees
    • state
  • update_pull_request
    • before/after title
    • normalized body hash
    • base
    • draft
    • head SHA when relevant
  • add_reviewer
    • requested reviewers
    • requested teams
  • workflow and code-scanning operations
    • correlation keys
    • expected target state

Why This Matters

Without execution-time metadata, mutable operations cannot be evaluated as retained, reverted, replaced, or acted on. They fall back to weak survival evidence, which is not good enough for meaningful acceptance metrics.
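
As a rough sketch of why the before/after values matter, the comparison below classifies each field of a mutable update as retained, reverted, or replaced. The function names and the dictionary layout are assumptions made for illustration; "acted on" is not covered because it needs timeline data rather than a field comparison.

```python
def classify_field(before, intended, current):
    """Compare a field's current value with the values captured at execution time."""
    if current == intended:
        return "retained"   # the agent's value is still in place
    if current == before:
        return "reverted"   # someone restored the original value
    return "replaced"       # a third value superseded the agent's change


def classify_update(before, after, current):
    """Classify every field the agent changed (keys of `after`)."""
    return {
        field: classify_field(before.get(field), intended, current.get(field))
        for field, intended in after.items()
    }


# Example with the update_issue metadata listed above (values are illustrative).
result = classify_update(
    before={"title": "Old title", "state": "open"},
    after={"title": "New title", "state": "closed"},
    current={"title": "New title", "state": "open"},
)
```

Without the persisted `before` and `after` dictionaries, none of the three labels can be computed, which is the situation where evaluation falls back to weak survival evidence.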

Implementation Plan

Rollout Order

  1. Shared outcome model and fallback semantics
  2. Mutable update evaluators
  3. Review lifecycle evaluators
  4. Issue/comment evaluators
  5. Workflow and code-scanning evaluators
  6. Metadata/project evaluators
  7. Discussions and moderation evaluators

Ordering Constraints

  • Do not implement mutable update evaluators without first persisting the metadata they need
  • Do not treat target existence as acceptance while evaluator coverage is incomplete
  • Prefer implementing evaluator families where the runtime data dependencies are already available
  • Keep docs and output schema aligned with the implementation after each slice

Suggested Split Into Follow-On Issues Or PRs

  1. Shared outcome model and fallback semantics
  2. Persist execution-time metadata for mutable safe outputs
  3. Implement retained-update evaluators for mutable outputs
  4. Implement review lifecycle evaluators
  5. Implement issue and comment interaction evaluators
  6. Implement workflow and code scanning evaluators
  7. Implement metadata, project, discussion, and moderation evaluators

Full Detailed Design Reference

This section preserves the full implementation guidance and is intentionally more detailed than the main issue body.


Core Principle

For implementation, accepted outcomes should be defined as human-checkable evidence and explicitly separated from weaker "artifact still exists" signals.

Core implementation principle:

A safe output is the agent's proposed action. An accepted outcome is the later repository state that a human reviewer would reasonably inspect to decide whether that action was useful, correct, or intentionally retained.

This fits the underlying safe-output design: safe outputs are validated operations executed outside the agent's write-permission context, while outcomes are meant to evaluate what happened after those operations based on repository state rather than the workflow's own self-assessment.


1. Implementation Goal

Build an outcome evaluator that answers, for every safe output item:

  1. What did the agent safely output?
  2. What GitHub object did it affect or create?
  3. What happened to that object afterward?
  4. Would a human checking the repo treat that as accepted, rejected, pending, ignored, skipped, or unknown?
  5. How strong is that evidence?

The most important implementation decision is to avoid treating all accepted statuses as equally strong. A PR merged by a maintainer is much stronger evidence than "the PR still exists."

Two layers should be implemented:

outcome_status:
  accepted | rejected | pending | ignored | skipped | unknown

evidence_strength:
  strong | medium | weak | none

This preserves compatibility with outcome reporting while keeping the measurement honest.


2. Recommended Outcome States

Use these states consistently across all safe output types.

State | Meaning
accepted | Observable repository state suggests the output was useful, correct, or intentionally retained.
rejected | Observable repository state suggests the output was undone, removed, closed as invalid, reverted, or contradicted.
pending | The output is still in flight; there has not been enough time or activity to judge.
ignored | The output exists but received no meaningful human or repository response within the evaluation window.
skipped | No human-facing outcome should be evaluated, for example noop, missing_tool, or cancelled outputs.
unknown | Required target object or audit data could not be fetched.

Attach evidence strength separately:

Evidence strength | Meaning
strong | Direct human or repository acceptance signal: merged PR, resolved review, workflow success, human reply, retained metadata after triage.
medium | Indirect but meaningful signal: issue triaged, label retained, milestone retained, linked issue retained.
weak | Artifact merely still exists or target still exists.
none | No measurable evidence.

3. Universal Evaluation Schema

Each evaluated output should produce a normalized record like this:

{
  "safe_output_id": "run-id:item-index",
  "safe_output_type": "create_pull_request",
  "target": {
    "repo": "owner/repo",
    "kind": "pull_request",
    "number": 123,
    "node_id": "..."
  },
  "created_at": "2026-05-26T10:00:00Z",
  "evaluated_at": "2026-05-27T10:00:00Z",
  "evaluation_window_hours": 24,
  "outcome_status": "accepted",
  "evidence_strength": "strong",
  "human_check_signal": "pull_request_merged",
  "bot_aware": true,
  "actor_summary": {
    "visible_non_bot_actor_count": 2,
    "bot_actor_count": 1
  },
  "details": {
    "merged": true,
    "merged_by_type": "User",
    "closed": true,
    "reverted": false
  },
  "confidence": "high",
  "notes": "PR was merged by a visible non-bot actor."
}

The key field is human_check_signal. That is the concrete thing a human would check.


4. Data Collection Requirements

Enough metadata must be persisted at safe-output execution time to evaluate outcomes later.

For every safe output, store a record like:

{
  "type": "add_labels",
  "run_id": "...",
  "workflow_name": "...",
  "repo": "owner/repo",
  "actor": "github-actions[bot]",
  "created_at": "...",
  "target_url": "...",
  "target_node_id": "...",
  "target_number": 123,
  "payload_hash": "...",
  "expected_state": {
    "labels_added": ["bug", "needs-triage"]
  }
}

For update operations, store the before and after values. Without them, later evaluation cannot determine whether the intended change was retained, reverted, or replaced.

Example for update_issue:

{
  "type": "update_issue",
  "target": {
    "repo": "owner/repo",
    "issue_number": 42
  },
  "before": {
    "title": "Old title",
    "body_hash": "abc123"
  },
  "after": {
    "title": "New title",
    "body_hash": "def456"
  }
}

For PR and code-changing operations, store commit SHAs, branch name, patch hash, PR number, and base/head refs.


5. Evaluation Window

Use multiple windows rather than one hard cutoff.

Recommended:

T+1h: early signal
T+24h: primary signal
T+7d: durable signal
T+30d: long-term or revert signal for code changes

Why: some outputs are accepted quickly, like comments or labels. Others, especially PRs, may take days. A PR open after 24 hours should usually be pending, not ignored or rejected.

For research reporting, include both:

accepted_at_24h
accepted_at_7d
accepted_durable_at_30d
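
One possible reading of the windows and reporting flags above is sketched here. The helper names and the exact semantics are assumptions: in this sketch "durable at 30d" means accepted within 30 days of creation and not reverted within those 30 days.

```python
from datetime import datetime, timedelta, timezone

# Window names and spans follow the recommended list above.
WINDOWS = {
    "1h": timedelta(hours=1),
    "24h": timedelta(hours=24),
    "7d": timedelta(days=7),
    "30d": timedelta(days=30),
}


def elapsed_windows(created_at, now):
    """Return the names of all windows that have fully elapsed."""
    age = now - created_at
    return [name for name, span in WINDOWS.items() if age >= span]


def accepted_flags(created_at, accepted_at, reverted_at=None):
    """Compute the three reporting flags from observed timestamps."""
    def accepted_within(span):
        return accepted_at is not None and accepted_at - created_at <= span

    reverted_in_30d = (
        reverted_at is not None and reverted_at - created_at <= WINDOWS["30d"]
    )
    return {
        "accepted_at_24h": accepted_within(WINDOWS["24h"]),
        "accepted_at_7d": accepted_within(WINDOWS["7d"]),
        "accepted_durable_at_30d": accepted_within(WINDOWS["30d"]) and not reverted_in_30d,
    }


created = datetime(2026, 5, 26, 10, 0, tzinfo=timezone.utc)
flags = accepted_flags(created, created + timedelta(days=3))
```

A PR merged on day three is therefore not accepted at 24h but is accepted at 7d, which matches the guidance that an open PR at 24 hours should usually be pending.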

6. Bot-Aware Human Check

Do not assume every visible action is human. Visible actor identity is not perfect provenance: a non-bot actor may still be AI-assisted, and hidden authorship cannot be fully observed. Model this signal as "visible non-bot activity", not as "definitely human activity".

Recommended actor categories:

bot_actor
visible_non_bot_actor
same_workflow_actor
unknown_actor
system_actor

A signal is stronger when it involves a visible non-bot actor other than the original workflow actor.
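
A minimal sketch of the actor categories follows. It assumes the evaluator receives a login and an account type string such as "User" or "Bot" for each actor; the function name, the `system_logins` parameter, and the order of the checks are illustrative choices, not part of this proposal.

```python
def classify_actor(login, actor_type, workflow_actor, system_logins=frozenset()):
    """Map one observed actor to an actor category.

    `system_logins` is a hypothetical, caller-supplied set of logins that
    should be treated as platform or system activity.
    """
    if not login or not actor_type:
        return "unknown_actor"
    if login in system_logins:
        return "system_actor"
    if login == workflow_actor:
        return "same_workflow_actor"
    if actor_type == "Bot" or login.endswith("[bot]"):
        return "bot_actor"
    if actor_type == "User":
        return "visible_non_bot_actor"
    return "unknown_actor"


def strengthens_signal(category):
    """Only a visible non-bot actor other than the workflow actor strengthens a signal."""
    return category == "visible_non_bot_actor"
```

Checking the workflow actor before the bot check keeps the agent's own follow-up activity from being counted as independent confirmation.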


7. Per-Safe-Output Implementation Rules

Issues and Discussions

create_issue

Human check: did the issue receive meaningful triage, assignment, linkage, closure, or completion?

Repository state | Outcome | Strength
Closed as completed or resolved | accepted | strong
Assigned, labeled by non-bot, milestone added, linked to PR, referenced in commit or PR | accepted | medium
Open with non-bot comment, reaction, or triage | pending or accepted_medium, depending on policy | medium
Closed as duplicate, not planned, invalid, or spam | rejected | strong
Open with no non-bot interaction after window | ignored | weak
Issue deleted or inaccessible | unknown or rejected, depending on audit availability | none

Implementation notes:

  • Store issue number and initial body hash
  • Query issue state, timeline events, comments, labels, assignees, milestone, linked PRs, and closing reason
  • Do not treat "issue still exists" as accepted

update_issue

Human check: did the agent's edit stick?

Repository state | Outcome | Strength
Edited fields still equal intended values after window | accepted | medium
Later visible non-bot edit preserves core change | accepted | medium
Later visible non-bot reverts or replaces the change | rejected | strong
Issue deleted or inaccessible | unknown | none
No way to compare before and after | unknown | none

Implementation notes:

  • Store field-level before and after values
  • For body fields, compare normalized body hash rather than raw Markdown only
  • Track title, body, state, labels, changed-by, and changed-at where available
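
The normalized body hash could be computed along these lines. The normalization steps (line endings, trailing whitespace, runs of blank lines) and the choice of SHA-256 are assumptions made for this sketch; the issue only requires that the comparison not depend on raw Markdown bytes.

```python
import hashlib
import re


def normalized_body_hash(body):
    """Hash a Markdown body after removing formatting-only differences."""
    text = (body or "").replace("\r\n", "\n").replace("\r", "\n")
    # Drop trailing whitespace on each line, then surrounding blank space.
    lines = [line.rstrip() for line in text.split("\n")]
    normalized = "\n".join(lines).strip()
    # Collapse runs of blank lines so spacing-only edits do not change the hash.
    normalized = re.sub(r"\n{3,}", "\n\n", normalized)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

With this approach an editor that rewrites line endings or trims whitespace does not look like a reverted or replaced body, while any change to the visible text still changes the hash.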

close_issue

Human check: did the issue stay closed, or was it reopened?

Repository state | Outcome | Strength
Still closed after evaluation window | accepted | medium
Reopened by visible non-bot | rejected | strong
Closed then referenced as mistakenly closed | rejected | medium
Closed by stale or lifecycle bot only | accepted_weak or ignored | weak
Target missing | unknown | none

Implementation nuance: a close action can be harmful even if the issue remains closed. When possible, inspect timeline comments after closure for reopen attempts or challenge comments.

link_sub_issue

Human check: did the sub-issue relationship stick?

Repository state | Outcome | Strength
Parent/sub-issue link still present | accepted | medium
Link removed by visible non-bot | rejected | strong
Target issue closed as duplicate or invalid | rejected or ignored, depending on context | medium
Parent or child inaccessible | unknown | none

create_discussion

Human check: did the discussion receive meaningful engagement or an accepted answer?

Repository state | Outcome | Strength
Answer marked or accepted answer present | accepted | strong
Non-bot replies or meaningful reactions | accepted | medium
Discussion converted or linked to issue or PR | accepted | medium
Closed as duplicate, spam, or off-topic | rejected | strong
Exists with no engagement after window | ignored | weak

update_discussion

Human check mirrors update_issue.

close_discussion

Human check: did the discussion stay closed?

Repository state | Outcome | Strength
Still closed after window | accepted | medium
Reopened by visible non-bot | rejected | strong
Non-bot comments indicate close was wrong | rejected | medium
Deleted or inaccessible | unknown | none

Pull Requests

create_pull_request

Human check: was the PR merged, closed unmerged, reviewed, or still pending?

Repository state | Outcome | Strength
Merged | accepted | strong
Merged then reverted within durability window | rejected or accepted_then_reverted | strong
Approved by visible non-bot but not merged yet | pending with positive signal | medium
Open with review activity | pending | medium
Closed unmerged | rejected | strong
Open with no activity after long window | ignored | weak
PR creation fell back to issue | evaluate as create_issue with fallback subtype | varies

Implementation notes:

  • Store PR number, branch, commit SHAs, patch hash, labels, reviewers, and fallback behavior
  • For durable research, check for revert commits after merge
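
The create_pull_request rules above could translate into an evaluator roughly like the one below. The input dictionary keys, the function name, and every signal name other than `pull_request_merged` and `target_exists_only` are hypothetical. Where the rules allow two outcomes (merged then reverted), this sketch picks `rejected`; the issue-fallback case is left out.

```python
def evaluate_create_pull_request(pr, long_window_elapsed):
    """Return (outcome_status, evidence_strength, human_check_signal).

    `pr` is an assumed, pre-fetched summary of the PR state, for example:
    {"state": "open", "merged": False, "reverted": False,
     "approved_by_non_bot": False, "has_review_activity": False}
    """
    if pr.get("merged"):
        if pr.get("reverted"):
            return ("rejected", "strong", "pull_request_merged_then_reverted")
        return ("accepted", "strong", "pull_request_merged")
    if pr.get("state") == "closed":
        return ("rejected", "strong", "pull_request_closed_unmerged")
    if pr.get("approved_by_non_bot"):
        return ("pending", "medium", "pull_request_approved_not_merged")
    if pr.get("has_review_activity"):
        return ("pending", "medium", "pull_request_under_review")
    if long_window_elapsed:
        return ("ignored", "weak", "pull_request_no_activity")
    # Open, no activity, window not yet elapsed: existence is all we know.
    return ("pending", "weak", "target_exists_only")
```

The final branch is the important one: an open PR with no activity never reaches `accepted`, no matter how long it survives.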

update_pull_request

Human check mirrors update_issue, but the evidence is stronger when the PR later merges.

Repository state | Outcome | Strength
Updated fields remain unchanged | accepted | medium
PR merged after update and update remained relevant | accepted | strong
Fields reverted or replaced by non-bot | rejected | strong
PR closed unmerged after update | often rejected | medium
Cannot compare | unknown | none

close_pull_request

Human check: did the PR stay closed, or was the close undone?

Repository state | Outcome | Strength
Still closed unmerged after window | accepted | medium
Reopened by visible non-bot | rejected | strong
Later merged after reopening | rejected | strong
Comment indicates premature close | rejected | medium
Target missing | unknown | none

create_pull_request_review_comment

Human check: did the inline comment lead to resolution, reply, code change, or review action?

Repository state | Outcome | Strength
Thread resolved by visible non-bot | accepted | strong
Comment replied to by visible non-bot | accepted | medium
Follow-up commit touches commented lines or files | accepted | medium
Comment marked outdated due to changes and PR merged | accepted | medium
Comment deleted or minimized as abuse/off-topic | rejected | strong
No reply or no resolution after PR closed or merged | ignored | weak
PR still open and thread unresolved | pending | weak or medium

submit_pull_request_review

Human check: did the submitted review affect the PR lifecycle?

Repository state | Outcome | Strength
Review approved and PR merged | accepted | strong
Changes requested and later addressed by commits | accepted | medium
Review dismissed by visible non-bot | rejected | strong
Review contradicted by later human review | rejected or mixed | medium
PR closed without addressing review | contextual | medium
PR still open with review pending | pending | medium

reply_to_pull_request_review_comment

Human check: did the reply advance or resolve the thread?

Repository state | Outcome | Strength
Thread resolved after reply | accepted | strong
Visible non-bot replies positively or continues constructively | accepted | medium
Reply is deleted or minimized | rejected | strong
Thread remains unresolved and PR closes without action | ignored or rejected, depending on context | weak or medium
PR still open | pending | weak

resolve_pull_request_review_thread

Human check: did the thread stay resolved?

Repository state | Outcome | Strength
Thread still resolved after window | accepted | medium
Thread reopened by visible non-bot | rejected | strong
PR merged after resolution | accepted | strong
PR closed unmerged after resolution | contextual | medium
Thread inaccessible | unknown | none

push_to_pull_request_branch

Human check: were the pushed commits accepted through the PR lifecycle?

Repository state | Outcome | Strength
PR merged with pushed commits included | accepted | strong
Pushed commits later modified but PR merged | accepted or mixed | medium
Pushed commits reverted or dropped | rejected | strong
PR closed unmerged | rejected | strong
PR open with review or CI activity | pending | medium
PR open with no activity after long window | ignored | weak

mark_pull_request_as_ready_for_review

Human check: did readiness lead to actual review or merge activity?

Repository state | Outcome | Strength
Reviewed by visible non-bot after mark-ready | accepted | strong
Merged after mark-ready | accepted | strong
Converted back to draft by visible non-bot | rejected | strong
Closed unmerged with no review | rejected or ignored | medium
Open with no review after window | pending or ignored | weak

add_reviewer

Human check: did the requested reviewer review, comment, approve, or remain meaningfully assigned?

Repository state | Outcome | Strength
Requested reviewer submitted review | accepted | strong
Requested reviewer commented | accepted | medium
Reviewer request still pending while PR open | pending | weak
Reviewer removed by visible non-bot | rejected | medium or strong
PR merged without that reviewer acting | ignored or accepted_weak, depending on policy | weak
PR closed unmerged with no reviewer action | ignored | weak

Implementation notes:

  • Store exact requested reviewers and teams
  • Team review requests need special handling because any member review may satisfy the team request
  • Do not treat PR existence as accepted
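
The team-request nuance above can be sketched as a set comparison. The function name and the `team_members` mapping (team slug to member logins) are assumptions; how team membership is resolved is left to the caller.

```python
def satisfied_review_requests(requested_users, requested_teams,
                              team_members, reviewers_who_acted):
    """Return (satisfied_users, satisfied_teams) as two sets.

    A user request is satisfied when that user reviewed; a team request is
    satisfied when any member of the team reviewed.
    """
    acted = set(reviewers_who_acted)
    satisfied_users = {user for user in requested_users if user in acted}
    satisfied_teams = {
        team for team in requested_teams
        if acted & set(team_members.get(team, ()))
    }
    return satisfied_users, satisfied_teams


# Illustrative data: "carol" reviews on behalf of the hypothetical "core" team.
users, teams = satisfied_review_requests(
    requested_users=["alice"],
    requested_teams=["core"],
    team_members={"core": ["bob", "carol"]},
    reviewers_who_acted=["carol"],
)
```

Here the request to `alice` is still pending while the `core` team request counts as satisfied, which is why the exact requested users and teams have to be stored at execution time.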

assign_to_agent

Human check: did the assigned agent produce useful downstream work?

Repository state | Outcome | Strength
Agent-created PR merged | accepted | strong
Agent-created PR reviewed positively | pending with positive signal | medium
Issue closed as completed due to agent work | accepted | strong
Agent PR closed unmerged | rejected | strong
Agent assigned but no PR or activity after window | ignored | weak
Issue solved by someone else | accepted_other_actor or neutral | medium

Labels, Metadata, and Planning

add_labels

Human check: did the label classification stick?

Repository state | Outcome | Strength
All labels retained after triage window | accepted | medium
Some labels retained, some removed | mixed | medium
All labels removed by visible non-bot | rejected | strong
Target closed or merged while labels retained | accepted | medium
No permission to read labels | unknown | none

Use partial scoring where useful.
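
One way to express partial scoring for add_labels is the fraction of added labels that are still present. This is a sketch under stated assumptions: the function name and return shape are hypothetical, and removals are assumed to have already been attributed to a visible non-bot actor by an earlier step.

```python
def evaluate_add_labels(labels_added, current_labels):
    """Return (outcome_status, evidence_strength, retained_fraction)."""
    added = set(labels_added)
    if not added:
        # Nothing was recorded at execution time, so nothing can be compared.
        return ("unknown", "none", 0.0)
    retained = added & set(current_labels)
    fraction = len(retained) / len(added)
    if fraction == 1.0:
        return ("accepted", "medium", fraction)
    if fraction == 0.0:
        return ("rejected", "strong", fraction)
    return ("mixed", "medium", fraction)
```

For example, if the agent added `bug` and `needs-triage` and only `bug` remains, the result is `("mixed", "medium", 0.5)`, which keeps the partial signal visible instead of forcing it into accepted or rejected.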

assign_milestone

Human check: did the milestone remain assigned?

Repository state | Outcome | Strength
Same milestone still set after window | accepted | medium
Milestone changed by visible non-bot | rejected or mixed | medium
Milestone removed | rejected | strong
Issue or PR completed under milestone | accepted | strong
Target missing | unknown | none

update_project

Human check: did the specific project field update remain?

Repository state | Outcome | Strength
Field value still equals intended value | accepted | medium
Field later moved forward in workflow, for example Todo to In Progress to Done | accepted | strong
Field reverted or changed away by visible non-bot | rejected or mixed | medium
Project item removed | rejected | strong
Project inaccessible | unknown | none

set_issue_type

Human check: did the issue type remain set?

Repository state | Outcome | Strength
Type still equals intended type | accepted | medium
Type changed by visible non-bot | rejected or mixed | medium
Type cleared by visible non-bot | rejected | strong
Issue completed with type retained | accepted | strong
Cannot read issue type | unknown | none

set_issue_field

Human check: did the specific field value remain?

Repository state | Outcome | Strength
Field still equals intended value | accepted | medium
Field advanced to later valid workflow state | accepted | strong
Field changed away by visible non-bot | rejected or mixed | medium
Field removed or item removed | rejected | strong
Cannot read field | unknown | none

Workflows, Security, Releases, Code Scanning

dispatch_workflow

Human check: did the dispatched workflow run successfully?

Repository state | Outcome | Strength
Dispatched run completed successfully | accepted | strong
Run failed, cancelled, or timed out | rejected | strong
Run still in progress | pending | medium
No matching run found | unknown or rejected, depending on dispatch API result | none or medium
Run succeeded but produced no expected artifact | accepted_weak or mixed | weak

autofix_code_scanning_alert

Human check: was the alert actually fixed?

Repository state | Outcome | Strength
Alert fixed or closed by code change | accepted | strong
Linked PR merged and alert fixed | accepted | strong
Alert dismissed as false positive or won't fix by visible non-bot | contextual acceptance for triage, not fix | medium
Alert remains open after window | pending or rejected, depending on SLA | weak or medium
Autofix PR closed unmerged | rejected | strong
Alert disappeared or inaccessible | unknown | none

Differentiate fixed, triaged, rejected, and pending states rather than flattening them.

create_code_scanning_alert

Human check: was the alert triaged, fixed, or dismissed with reason?

Repository state | Outcome | Strength
Alert acknowledged or triaged | accepted | medium
Alert fixed | accepted | strong
Alert dismissed with reason by visible non-bot | accepted or rejected, depending on expected semantics | medium
Alert deleted or invalid | rejected | medium
Alert open with no activity | pending or ignored | weak

update_release

Human check: did the release edit remain?

Repository state | Outcome | Strength
Release fields still match intended update | accepted | medium
Release published after draft update | accepted | strong
Release edited again to revert agent change | rejected | strong
Release deleted | rejected | strong
Cannot compare fields | unknown | none

Comment Moderation

hide_comment

Human check: did the hidden or minimized state persist?

Repository state | Outcome | Strength
Comment still hidden or minimized | accepted | medium
Comment unhidden by visible non-bot | rejected | strong
Comment deleted after hiding | often accepted for moderation | medium
Target missing | unknown | none

User Assignment

assign_to_user

Human check: did the assignment stick or result in user action?

Repository state | Outcome | Strength
Assigned user comments, reviews, commits, or closes item | accepted | strong
Assignment remains while item open | pending | weak
User removed by visible non-bot | rejected | medium
Item completed while user assigned | accepted | medium
Item closed without assignee action | ignored | weak

unassign_from_user

Human check: did the user remain unassigned?

Repository state | Outcome | Strength
User remains unassigned after window | accepted | medium
User re-assigned by visible non-bot | rejected | strong
Item closed after unassignment | contextual | medium
Cannot read assignees | unknown | none

System Outputs

noop

No human-facing outcome should be evaluated.

{
  "outcome_status": "skipped",
  "evidence_strength": "none",
  "human_check_signal": "no_action_requested"
}

missing_tool

No human-facing outcome should be evaluated.

{
  "outcome_status": "skipped",
  "evidence_strength": "none",
  "human_check_signal": "tool_unavailable"
}

8. Avoiding The Biggest Measurement Bug

The major measurement bug is:

target object exists -> accepted

That is not a human check. It is only a weak liveness check.

Use this instead:

target object exists -> target_resolved = true
type-specific acceptance signal exists -> accepted
otherwise -> pending / ignored / weak evidence

Example:

{
  "target_resolved": true,
  "outcome_status": "pending",
  "evidence_strength": "weak",
  "human_check_signal": "target_exists_only"
}
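
The replacement rule above can be written as a single decision function. This sketch works on plain dictionaries; the function name is hypothetical, and reusing `target_not_found_or_inaccessible` as the signal for an unresolved target is an assumption.

```python
def classify_with_fallback(target_resolved, type_specific_outcome=None):
    """Apply the rule: existence resolves the target, it does not accept it.

    `type_specific_outcome` is the dict returned by a dedicated evaluator,
    or None when no type-specific acceptance signal was found.
    """
    if not target_resolved:
        return {
            "target_resolved": False,
            "outcome_status": "unknown",
            "evidence_strength": "none",
            "human_check_signal": "target_not_found_or_inaccessible",
        }
    if type_specific_outcome is not None:
        return {"target_resolved": True, **type_specific_outcome}
    # Only liveness is known: report weak evidence, never accepted.
    return {
        "target_resolved": True,
        "outcome_status": "pending",
        "evidence_strength": "weak",
        "human_check_signal": "target_exists_only",
    }
```

Keeping `target_resolved` as its own field lets reports count how many outputs reached a live target without that count leaking into any acceptance rate.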

9. Suggested Scoring Model

For research dashboards, compute three acceptance rates.

Strict Acceptance Rate

Only strong evidence.

strict_acceptance_rate =
  strong_accepted / evaluable_outputs

Human-Check Acceptance Rate

Strong plus medium evidence.

human_check_acceptance_rate =
  (strong_accepted + medium_accepted) / evaluable_outputs

Sticky Artifact Rate

Weak survival evidence only.

sticky_artifact_rate =
  weak_accepted_or_target_exists / evaluable_outputs

Never mix these into one unlabeled number.
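
A minimal sketch of the three rates, computed from normalized outcome records, is shown below. The issue does not define `evaluable_outputs`; this sketch assumes it means every record whose status is neither `skipped` nor `unknown`. The function name is hypothetical.

```python
def acceptance_rates(outcomes):
    """Compute the strict, human-check, and sticky-artifact rates separately."""
    # Assumption: skipped and unknown records are not evaluable.
    evaluable = [
        o for o in outcomes
        if o["outcome_status"] not in {"skipped", "unknown"}
    ]
    total = len(evaluable)
    if total == 0:
        return {
            "strict_acceptance_rate": 0.0,
            "human_check_acceptance_rate": 0.0,
            "sticky_artifact_rate": 0.0,
        }

    def accepted_with(strength):
        return sum(
            1 for o in evaluable
            if o["outcome_status"] == "accepted" and o["evidence_strength"] == strength
        )

    strong = accepted_with("strong")
    medium = accepted_with("medium")
    # Weak survival evidence: weak acceptances plus existence-only fallbacks.
    weak = sum(
        1 for o in evaluable
        if (o["outcome_status"] == "accepted" and o["evidence_strength"] == "weak")
        or o.get("human_check_signal") == "target_exists_only"
    )
    return {
        "strict_acceptance_rate": strong / total,
        "human_check_acceptance_rate": (strong + medium) / total,
        "sticky_artifact_rate": weak / total,
    }
```

Returning the three values under separate keys, rather than a single score, follows the rule that they must never be merged into one unlabeled number.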


10. Implementation Architecture

Recommended pipeline:

safe-output execution
  ↓
write normalized safe_output_event records
  ↓
outcome collector scheduled job
  ↓
fetch current GitHub state
  ↓
apply type-specific evaluator
  ↓
write outcome records
  ↓
aggregate dashboard/report

Pseudo-code:

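def evaluate_safe_output(event, now):
    # fetch_target, skipped, unknown, and EVALUATORS are assumed to be defined elsewhere.
    # System outputs have no target, so skip them before attempting any fetch.
    if event.type in {"noop", "missing_tool"}:
        return skipped(event)

    target = fetch_target(event)
    if target is None:
        return unknown(event, reason="target_not_found_or_inaccessible")

    evaluator = EVALUATORS.get(event.type)
    if evaluator is None:
        # No dedicated rule yet: report unknown rather than inferring acceptance.
        return unknown(event, reason="no_type_specific_evaluator")

    return evaluator(event, target, now)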

Each evaluator should return something like:

Outcome(
    status="accepted",
    strength="strong",
    signal="pull_request_merged",
    confidence="high",
    details={...}
)

11. Minimal Evaluator Interface

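from __future__ import annotations  # lets the placeholder type names below go unresolved


class OutcomeEvaluator:
    # Safe output type this evaluator handles, for example "create_pull_request".
    safe_output_type: str

    def fetch(self, event: SafeOutputEvent) -> TargetState:
        # Fetch the current repository state of the event's target.
        ...

    def evaluate(
        self,
        event: SafeOutputEvent,
        state: TargetState,
        window: EvaluationWindow
    ) -> Outcome:
        # Map the fetched state to an outcome status and evidence strength.
        ...

    def required_event_fields(self) -> list[str]:
        # Execution-time metadata fields this evaluator needs on the event.
        ...

    def required_github_scopes(self) -> list[str]:
        # Token scopes needed to read the target state.
        ...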

12. Dashboard Fields

For implementation and research, report:

total_safe_outputs
evaluable_outputs
accepted_strong
accepted_medium
accepted_weak
rejected
pending
ignored
unknown
skipped
fallback_exists_only_count
missing_type_specific_rule_count
durable_reversal_count

And by type:

safe_output_type
count
strict_acceptance_rate
human_check_acceptance_rate
sticky_artifact_rate
rejection_rate
pending_rate
unknown_rate
median_time_to_acceptance

13. Recommended Implementation Priority

Implement in this order:

  1. create_pull_request
  2. push_to_pull_request_branch
  3. create_issue
  4. add_comment
  5. add_labels
  6. update_issue
  7. update_pull_request
  8. add_reviewer
  9. submit_pull_request_review
  10. dispatch_workflow
  11. autofix_code_scanning_alert
  12. project, milestone, and field metadata outputs
  13. discussions
  14. moderation outputs
  15. system outputs

Why: PRs, issues, comments, labels, and workflows give the highest research value and the clearest human-check signals.


14. Final Implementation Definition

Use this definition in the implementation spec:

Accepted Outcome: A type-specific, post-output repository state that provides observable evidence a human repository observer would reasonably interpret as the safe output being useful, correct, acted on, or intentionally retained.

And this warning:

Generic target existence must not be counted as accepted outcome evidence except as weak evidence under a separate target_exists_only signal.

That provides both operational compatibility and research rigor.
