docs: document raki report --diff workflow for before/after comparison #258

@decko

Parent

Child of #257 (cohort comparison).

Problem

raki report --diff baseline.json compare.json already produces a full comparison report (CLI + HTML) with metric deltas, direction indicators, session matching, and regression detection. But this workflow is undocumented — users don't know they can:

  1. Run raki run with a manifest scoped to pre-change sessions → get report-before.json
  2. Run raki run with a manifest scoped to post-change sessions → get report-after.json
  3. Run raki report --diff report-before.json report-after.json --html diff.html

Deliverable

Add a "Comparing Runs" section to the docs covering:

  • The --diff option of raki report, with a worked example
  • How to scope manifests for before/after session sets
  • What the diff report shows (metric deltas, ▲/▼ direction, matched/new/dropped sessions, judge config mismatch warnings)
  • --fail-on-regression for CI gating
  • When to use --diff vs raki trends (point-in-time comparison vs trajectory)

Scope

Docs-only. No code changes.

Acceptance criteria

  • New doc section with worked example
  • Cross-reference from raki trends docs
  • Towncrier fragment (+diff-docs.doc or similar)

Spec

Context

RAKI already has a complete diff comparison pipeline (raki report --diff baseline.json compare.json) that produces:

  • CLI output: metric deltas with ▲/▼ direction indicators, matched/new/dropped session counts, judge config mismatch warnings, per-session verdict transitions (improvement/regression)
  • HTML output: self-contained dark-theme diff report via write_diff_html_report
  • CI gating: --fail-on-regression exits non-zero when metrics regress beyond a 2% noise margin
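For readers of the new doc, the noise-margin gate could be illustrated with a small sketch like the one below. This is not RAKI's implementation: the function name `regressed`, the `NOISE_MARGIN` constant, and the choice to treat the 2% margin as an absolute difference on the metric's own scale are assumptions made for illustration only.

```python
NOISE_MARGIN = 0.02  # assumption: the 2% margin is applied as an absolute delta


def regressed(baseline: float, compare: float, higher_is_better: bool,
              margin: float = NOISE_MARGIN) -> bool:
    """Return True when a metric moved in the bad direction by more than the margin."""
    delta = compare - baseline
    # Flip the sign so that a positive value always means "got worse".
    worsening = -delta if higher_is_better else delta
    return worsening > margin


# A drop from 0.67 to 0.60 on a higher-is-better metric exceeds the margin.
print(regressed(0.67, 0.60, higher_is_better=True))
# A drop from 0.67 to 0.66 stays inside the margin and is treated as noise.
print(regressed(0.67, 0.66, higher_is_better=True))
```

If the real margin turns out to be relative to the baseline value, only the comparison on the last line of `regressed` would need to change.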

This feature is mentioned in passing in two places:

  • getting-started.md — 3-line example under "Understanding the output"
  • ci-integration.md — brief mention under "Regression detection with --fail-on-regression"

Neither doc explains the full workflow, what the diff report shows, or when to use --diff vs raki trends.

What to document

Create docs/comparing-runs.md with these sections (ordered per doc writer review — procedure first, concept last):

1. Quick start (worked example)

Get users to a result in 60 seconds. Introduce each step with a lead-in sentence and put the command in its own block (no inline # Step N: comments, per Red Hat code-block conventions).

To evaluate the pre-change sessions, run the following command:

uv run raki run -m manifests/before.yaml -o results/ --include-sessions

To evaluate the post-change sessions, run the following command:

uv run raki run -m manifests/after.yaml -o results/ --include-sessions

To compare the two reports, run the following command:

uv run raki report --diff results/before.json results/after.json

To generate an HTML diff report, add --html:

uv run raki report --diff results/before.json results/after.json --html diff-report.html

To fail CI when metrics regress, add --fail-on-regression:

uv run raki report --diff results/baseline.json results/current.json --fail-on-regression

Include realistic synthetic CLI output so field descriptions are anchored to something concrete:

Comparing raki-20260501T100000 → raki-20260512T140000
Matched: 12/14 sessions (2 new)

  First-pass success rate  0.36 → 0.67  (+0.31)  ▲
  Rework cycles            0.72 → 0.33  (-0.39)  ▲
  Cost / session           $12.30 → $8.50  (-$3.80)  ▲
  Severity score           0.45 → 0.22  (-0.23)  ▲

Improvements: 4 sessions (fail→pass: 3, rework→pass: 1)
Regressions:  1 session (pass→rework: 1)

2. Reading the diff output

Document what each section of the CLI diff shows:

  • Header: baseline run ID → compare run ID
  • Judge config warnings: when provider/model differ between runs, or when only one used a judge
  • Agent model warnings: when the session agent models differ
  • Coverage line: matched/total sessions, new/dropped counts
  • Metric deltas: each metric with baseline → compare value, delta, and direction (▲ improved / ▼ regressed / = flat). Direction respects higher_is_better per metric
  • Session transitions: sessions whose verdict changed (pass/rework/fail), grouped into improvements and regressions
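The way direction follows higher_is_better could be shown with a sketch along these lines. The helper names `direction` and `format_row`, the `epsilon` tolerance for "flat", and the row layout are hypothetical; they mirror the synthetic output above and are not taken from RAKI's code.

```python
def direction(baseline: float, compare: float, higher_is_better: bool,
              epsilon: float = 1e-9) -> str:
    """Map a metric change to an indicator: improved, regressed, or flat."""
    delta = compare - baseline
    if abs(delta) <= epsilon:  # assumption: tiny differences count as flat
        return "="
    # An increase is an improvement only when higher values are better.
    improved = (delta > 0) == higher_is_better
    return "▲" if improved else "▼"


def format_row(name: str, baseline: float, compare: float,
               higher_is_better: bool) -> str:
    """Render one metric row in the style of the synthetic CLI output."""
    delta = compare - baseline
    arrow = direction(baseline, compare, higher_is_better)
    return f"{name}  {baseline:.2f} → {compare:.2f}  ({delta:+.2f})  {arrow}"


print(format_row("First-pass success rate", 0.36, 0.67, higher_is_better=True))
print(format_row("Rework cycles", 0.72, 0.33, higher_is_better=False))
```

Both rows end in ▲ even though the deltas have opposite signs, which is the point the doc should make explicit: the arrow reports improvement, not the sign of the delta.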

3. Producing reports to compare

  • Manifest scoping: one manifest per session set, or one manifest with all sessions and two runs
  • --include-sessions is required for per-session verdict transitions (improvements/regressions). Without it, only aggregate metric deltas are shown
  • Both reports should use the same judge config (provider + model) for analytical metrics; RAKI warns when they differ
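The judge-config check in the last bullet could be explained with a sketch like this one. The `JudgeConfig` class, the `judge_warnings` function, and the warning wording are hypothetical; the sketch only encodes the two conditions named in this spec (provider or model differ, or only one run used a judge).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JudgeConfig:
    """Hypothetical shape of the judge settings recorded in a report."""
    provider: str
    model: str


def judge_warnings(baseline: Optional[JudgeConfig],
                   compare: Optional[JudgeConfig]) -> list[str]:
    """Return warnings for judge mismatches; never block the comparison."""
    if baseline is None and compare is None:
        return []  # neither run used a judge, nothing to compare
    if baseline is None or compare is None:
        return ["Only one run used a judge."]
    warnings = []
    if baseline.provider != compare.provider:
        warnings.append(
            f"Judge provider differs: {baseline.provider} vs {compare.provider}")
    if baseline.model != compare.model:
        warnings.append(
            f"Judge model differs: {baseline.model} vs {compare.model}")
    return warnings


# Placeholder provider and model names, not real configuration values.
before = JudgeConfig(provider="provider-a", model="model-1")
after = JudgeConfig(provider="provider-a", model="model-2")
for warning in judge_warnings(before, after):
    print(warning)
```

The function returns a list instead of raising so that, as described in this spec, a mismatch is reported without stopping the diff.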

4. When to use --diff vs raki trends

  • raki report --diff: point-in-time comparison between two specific evaluation runs — "did this change improve things?"
  • raki trends: trajectory over many runs — "are we getting better over time?"
  • Decision table: use --diff for before/after a specific change, trends for ongoing monitoring

5. Tips and caveats

  • Reports without --include-sessions still show aggregate deltas but no per-session transitions
  • New/dropped sessions (session IDs present in one report but not the other) are tracked and displayed; aggregate deltas use matched sessions only
  • Judge config mismatches are warned but do not block comparison — analytical metric deltas may not be meaningful if the judge changed
  • Exit codes: 0 (no regression), 3 (regression detected), 4 (regression + threshold violation)
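The matched/new/dropped bookkeeping and the per-session transitions could be illustrated as follows. The function names, the dictionary input (session ID mapped to verdict), and the ordering fail < rework < pass are assumptions inferred from the synthetic output in the quick start, not RAKI's actual data model.

```python
# Assumption: verdicts are ordered fail < rework < pass.
VERDICT_RANK = {"fail": 0, "rework": 1, "pass": 2}


def partition_sessions(baseline: dict[str, str], compare: dict[str, str]):
    """Split session IDs into matched, new, and dropped sets."""
    matched = sorted(baseline.keys() & compare.keys())
    new = sorted(compare.keys() - baseline.keys())
    dropped = sorted(baseline.keys() - compare.keys())
    return matched, new, dropped


def transitions(baseline: dict[str, str], compare: dict[str, str]):
    """Classify verdict changes among matched sessions only."""
    matched, _, _ = partition_sessions(baseline, compare)
    improvements, regressions = [], []
    for session_id in matched:
        before, after = baseline[session_id], compare[session_id]
        if VERDICT_RANK[after] > VERDICT_RANK[before]:
            improvements.append((session_id, f"{before}→{after}"))
        elif VERDICT_RANK[after] < VERDICT_RANK[before]:
            regressions.append((session_id, f"{before}→{after}"))
    return improvements, regressions


baseline = {"s1": "fail", "s2": "pass", "s3": "rework", "s4": "pass"}
compare = {"s1": "pass", "s2": "rework", "s3": "rework", "s5": "pass"}

print(partition_sessions(baseline, compare))
print(transitions(baseline, compare))
```

In this sample, s4 is dropped and s5 is new, so neither contributes to the transitions; only s1 (improvement) and s2 (regression) are reported, and s3 is unchanged.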

Where it lives

  • New file: docs/comparing-runs.md
  • Add entry in docs/index.md — group guides and metric references separately:
## Guides

- [Getting Started](getting-started.md)
- [Interpreting Results](interpreting-results.md)
- [Results Interpretation Reference](interpretation-reference.md)
- [Comparing Runs](comparing-runs.md)
- [CI Integration](ci-integration.md)
- [Ground Truth Curation Guide](curation-guide.md)
- [Adapter Guide](adapter-guide.md)
- [Session Schema Reference](session-schema.md)

## Metric References

- [Operational Metrics](metrics/operational.md)
- [Knowledge Metrics](metrics/knowledge.md)
- [Analytical Metrics](metrics/analytical.md)
- [Rationale and Interpretation](metrics/rationale-and-interpretation.md)
  • Cross-reference from docs/getting-started.md (expand the 3-line mention)
  • Cross-reference from docs/ci-integration.md (expand the regression detection section)

Out of scope

Style requirements (from doc writer review)

  • No contractions (cannot, not can't; do not, not don't)
  • Active voice throughout (RAKI warns when configs differ, not a warning is shown)
  • Include "that" in subordinate clauses (Verify that both reports use the same judge config)
  • Demonstrative pronouns followed by nouns (this command, not this)
  • Code blocks introduced with a complete lead-in sentence ending in a colon
  • No # comment explanations inside code blocks — move them to prose above
  • Descriptive link text (see Comparing Runs, not click here)

Plan

Tasks

  1. Create docs/comparing-runs.md

    • Write all five sections in the order specified above (quick start first, concept last)
    • Include worked CLI examples with realistic synthetic output
    • Include a "when to use what" decision table (--diff vs trends vs future cohort)
  2. Update docs/index.md

    • Restructure into "Guides" and "Metric References" groups
    • Add "Comparing Runs" entry
  3. Update docs/getting-started.md

    • Expand the "Compare two runs" mention to a short paragraph with a link to comparing-runs.md
  4. Update docs/ci-integration.md

    • Add a cross-reference link from the "Regression detection" section to comparing-runs.md for the full workflow
  5. Create towncrier fragment

    • changes/258.doc — "Document the raki report --diff workflow for comparing evaluation runs."
  6. Tests — N/A (docs-only, no code changes)

Dependencies

  • None — this documents existing functionality

Estimated scope

  • Small (~200 lines of markdown, 5 files touched including the towncrier fragment)

Labels: enhancement