Parent
Child of #257 (cohort comparison).
Problem
raki report --diff baseline.json compare.json already produces a full comparison report (CLI + HTML) with metric deltas, direction indicators, session matching, and regression detection. But this workflow is undocumented — users don't know they can:
- Run raki run with a manifest scoped to pre-change sessions → get report-before.json
- Run raki run with a manifest scoped to post-change sessions → get report-after.json
- Run raki report --diff report-before.json report-after.json --html diff.html
Deliverable
Add a "Comparing Runs" section to the docs covering:
- The --diff option of raki report, with a worked example
- How to scope manifests for before/after session sets
- What the diff report shows (metric deltas, ▲/▼ direction, matched/new/dropped sessions, judge config mismatch warnings)
- --fail-on-regression for CI gating
- When to use --diff vs raki trends (point-in-time comparison vs trajectory)
Scope
Docs-only. No code changes.
Acceptance criteria
Spec
Context
RAKI already has a complete diff comparison pipeline (raki report --diff baseline.json compare.json) that produces:
- CLI output: metric deltas with ▲/▼ direction indicators, matched/new/dropped session counts, judge config mismatch warnings, per-session verdict transitions (improvement/regression)
- HTML output: self-contained dark-theme diff report via write_diff_html_report
- CI gating: --fail-on-regression exits non-zero when metrics regress beyond a 2% noise margin
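The noise-margin idea behind --fail-on-regression can be sketched as follows. This is an illustration only, not RAKI's implementation: the function name is hypothetical, and it assumes the 2% margin is a relative change measured against the baseline value, which this issue does not confirm.

```python
NOISE_MARGIN = 0.02  # the 2% margin named above, treated as a relative change (assumption)


def is_regression(baseline: float, compare: float, higher_is_better: bool,
                  margin: float = NOISE_MARGIN) -> bool:
    """Return True when a metric worsened by more than the noise margin.

    Hypothetical helper for illustration; RAKI's real check may differ.
    """
    if baseline == 0:
        # A relative change is undefined here, so fall back to the absolute change (assumption).
        change = compare - baseline
    else:
        change = (compare - baseline) / abs(baseline)
    # For higher-is-better metrics a drop is a worsening; for lower-is-better a rise is.
    worsening = -change if higher_is_better else change
    return worsening > margin
```

Under these assumptions, a first-pass success rate that moves from 1.00 to 0.99 stays inside the margin, while a cost per session that rises from 10.0 to 10.5 counts as a regression.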
This feature is mentioned in passing in two places:
- getting-started.md: 3-line example under "Understanding the output"
- ci-integration.md: brief mention under "Regression detection with --fail-on-regression"
Neither doc explains the full workflow, what the diff report shows, or when to use --diff vs raki trends.
What to document
Create docs/comparing-runs.md with these sections (ordered per doc writer review — procedure first, concept last):
1. Quick start (worked example)
Get users to a result in 60 seconds. Each step introduced with a lead-in sentence, command in its own block (no inline # Step N: comments per Red Hat code-block conventions).
To evaluate the pre-change sessions, run the following command:
uv run raki run -m manifests/before.yaml -o results/ --include-sessions
To evaluate the post-change sessions, run the following command:
uv run raki run -m manifests/after.yaml -o results/ --include-sessions
To compare the two reports, run the following command:
uv run raki report --diff results/before.json results/after.json
To generate an HTML diff report, add --html:
uv run raki report --diff results/before.json results/after.json --html diff-report.html
To fail CI when metrics regress, add --fail-on-regression:
uv run raki report --diff results/baseline.json results/current.json --fail-on-regression
Include realistic synthetic CLI output so field descriptions are anchored to something concrete:
Comparing raki-20260501T100000 → raki-20260512T140000
Matched: 12/14 sessions (2 new)
First-pass success rate 0.36 → 0.67 (+0.31) ▲
Rework cycles 0.72 → 0.33 (-0.39) ▲
Cost / session $12.30 → $8.50 (-$3.80) ▲
Severity score 0.45 → 0.22 (-0.23) ▲
Improvements: 4 sessions (fail→pass: 3, rework→pass: 1)
Regressions: 1 session (pass→rework: 1)
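For whoever writes the field descriptions, the coverage and transition lines in the sample output can be thought of along these lines. This is a minimal sketch with hypothetical names: it assumes sessions are matched by session ID and that verdicts rank fail < rework < pass. The sample only shows that fail→pass and rework→pass are improvements and pass→rework is a regression, so the relative order of fail and rework is a guess.

```python
# Verdict ordering from worst to best; an assumption made for this illustration.
VERDICT_RANK = {"fail": 0, "rework": 1, "pass": 2}


def compare_sessions(baseline: dict, compare: dict) -> dict:
    """Match sessions by ID and classify verdict changes.

    Both arguments map a session ID to its verdict string.
    """
    matched = sorted(baseline.keys() & compare.keys())
    new = sorted(compare.keys() - baseline.keys())
    dropped = sorted(baseline.keys() - compare.keys())

    improvements, regressions = [], []
    for session_id in matched:
        before, after = baseline[session_id], compare[session_id]
        if VERDICT_RANK[after] > VERDICT_RANK[before]:
            improvements.append((session_id, f"{before}→{after}"))
        elif VERDICT_RANK[after] < VERDICT_RANK[before]:
            regressions.append((session_id, f"{before}→{after}"))

    return {
        "matched": matched,
        "new": new,
        "dropped": dropped,
        "improvements": improvements,
        "regressions": regressions,
    }
```

Sessions whose verdict is unchanged appear in neither list, which matches the sample output listing only improvements and regressions.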
2. Reading the diff output
Document what each section of the CLI diff shows:
- Header: baseline run ID → compare run ID
- Judge config warnings: when provider/model differ between runs, or when only one used a judge
- Agent model warnings: when the session agent models differ
- Coverage line: matched/total sessions, new/dropped counts
- Metric deltas: each metric with baseline → compare value, delta, and direction (▲ improved / ▼ regressed / = flat). Direction respects higher_is_better per metric
- Session transitions: sessions whose verdict changed (pass/rework/fail), grouped into improvements and regressions
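The doc could explain the direction indicator with a small sketch like the one below. The function names are hypothetical and this is not RAKI's code; it only shows how a higher_is_better flag turns a signed delta into ▲, ▼, or =. It treats exactly equal values as flat, whereas the real tool may apply a tolerance.

```python
def direction(baseline: float, compare: float, higher_is_better: bool) -> str:
    """Return the indicator for a metric change: improved, regressed, or flat."""
    delta = compare - baseline
    if delta == 0:
        return "="
    # A rise is good only when higher is better; a drop is good only when lower is better.
    improved = (delta > 0) == higher_is_better
    return "▲" if improved else "▼"


def format_delta(name: str, baseline: float, compare: float,
                 higher_is_better: bool) -> str:
    """Format one metric line in the style of the sample output."""
    delta = compare - baseline
    arrow = direction(baseline, compare, higher_is_better)
    return f"{name} {baseline:.2f} → {compare:.2f} ({delta:+.2f}) {arrow}"
```

This is why a negative delta on rework cycles still shows ▲ in the sample output: for a lower-is-better metric, a drop is an improvement.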
3. Producing reports to compare
- Manifest scoping: one manifest per session set, or one manifest with all sessions and two runs
- --include-sessions is required for per-session verdict transitions (improvements/regressions). Without it, only aggregate metric deltas are shown
- Both reports should use the same judge config (provider + model) for analytical metrics; RAKI warns when they differ
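A sketch of the judge config check may help when describing the warning. The dictionary keys provider and model follow the wording above, but the function name, the report layout, and the message text are assumptions for illustration rather than RAKI's actual structures.

```python
from typing import Optional


def judge_config_warning(baseline: Optional[dict],
                         compare: Optional[dict]) -> Optional[str]:
    """Return a warning string when judge configs are not comparable, else None.

    Each argument is a dict with "provider" and "model" keys, or None when
    the run did not use a judge (hypothetical representation).
    """
    if baseline is None and compare is None:
        return None
    if baseline is None or compare is None:
        return "only one run used a judge"
    differing = [key for key in ("provider", "model")
                 if baseline.get(key) != compare.get(key)]
    if differing:
        return "judge config differs: " + ", ".join(differing)
    return None
```

The function returns a message instead of raising, in line with the caveat that a mismatch is warned about but does not block the comparison.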
4. When to use --diff vs raki trends
- raki report --diff: point-in-time comparison between two specific evaluation runs ("did this change improve things?")
- raki trends: trajectory over many runs ("are we getting better over time?")
- Decision table: use --diff for before/after a specific change, trends for ongoing monitoring
5. Tips and caveats
- Reports without --include-sessions still show aggregate deltas but no per-session transitions
- New/dropped sessions (session IDs present in one report but not the other) are tracked and displayed; aggregate deltas use matched sessions only
- Judge config mismatches are warned but do not block comparison — analytical metric deltas may not be meaningful if the judge changed
- Exit codes: 0 (no regression), 3 (regression detected), 4 (regression + threshold violation)
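For the CI angle, a wrapper script could interpret the exit codes listed above roughly as follows. The command line and the three exit codes come from this issue; the helper names are hypothetical, and any other exit code is reported as unexpected because its meaning is not specified here.

```python
import subprocess

# Exit codes as listed in the caveats above.
EXIT_MEANINGS = {
    0: "no regression",
    3: "regression detected",
    4: "regression + threshold violation",
}


def describe_exit_code(code: int) -> str:
    """Translate an exit code from the diff command into a readable message."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def build_diff_command(baseline: str, compare: str,
                       fail_on_regression: bool = True) -> list:
    """Build the argument list for the diff command shown in the quick start."""
    command = ["uv", "run", "raki", "report", "--diff", baseline, compare]
    if fail_on_regression:
        command.append("--fail-on-regression")
    return command


def run_diff(baseline: str, compare: str) -> str:
    """Run the diff and return a message describing the outcome."""
    result = subprocess.run(build_diff_command(baseline, compare))
    return describe_exit_code(result.returncode)
```

A CI job would call run_diff("results/baseline.json", "results/current.json") and fail the build on any message other than "no regression".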
Where it lives
- New file: docs/comparing-runs.md
- Add entry in docs/index.md, grouping guides and metric references separately:
## Guides
- [Getting Started](getting-started.md)
- [Interpreting Results](interpreting-results.md)
- [Results Interpretation Reference](interpretation-reference.md)
- [Comparing Runs](comparing-runs.md)
- [CI Integration](ci-integration.md)
- [Ground Truth Curation Guide](curation-guide.md)
- [Adapter Guide](adapter-guide.md)
- [Session Schema Reference](session-schema.md)
## Metric References
- [Operational Metrics](metrics/operational.md)
- [Knowledge Metrics](metrics/knowledge.md)
- [Analytical Metrics](metrics/analytical.md)
- [Rationale and Interpretation](metrics/rationale-and-interpretation.md)
- Cross-reference from docs/getting-started.md (expand the 3-line mention)
- Cross-reference from docs/ci-integration.md (expand the regression detection section)
Out of scope
- raki trends deep-dive: that is its own doc if needed
Style requirements (from doc writer review)
- No contractions (cannot, not can't; do not, not don't)
- Active voice throughout (RAKI warns when configs differ, not a warning is shown)
- that included in subordinate clauses (Verify that both reports use the same judge config)
- Demonstrative pronouns followed by nouns (this command, not this)
- Code blocks introduced with a complete lead-in sentence ending in a colon
- No # comment explanations inside code blocks; move them to prose above
- Descriptive link text (see Comparing Runs, not click here)
Plan
Tasks
- Create docs/comparing-runs.md
  - Write all five sections in the order specified above (quick start first, concept last)
  - Include worked CLI examples with realistic synthetic output
  - Include a "when to use what" decision table (--diff vs trends vs future cohort)
- Update docs/index.md
  - Restructure into "Guides" and "Metric References" groups
  - Add "Comparing Runs" entry
- Update docs/getting-started.md
  - Expand the "Compare two runs" mention to a short paragraph with a link to comparing-runs.md
- Update docs/ci-integration.md
  - Add a cross-reference link from the "Regression detection" section to comparing-runs.md for the full workflow
- Create towncrier fragment changes/258.doc: "Document the raki report --diff workflow for comparing evaluation runs."
- Tests: N/A (docs-only, no code changes)
Dependencies
- None — this documents existing functionality
Estimated scope
- Small (~200 lines of markdown, 4 files touched)