docs: document raki report --diff workflow for before/after comparison

## Parent

Child of #257 (cohort comparison).

## Problem

`raki report --diff baseline.json compare.json` already produces a full comparison report (CLI + HTML) with metric deltas, direction indicators, session matching, and regression detection. But this workflow is undocumented — users don't know they can:

1. Run `raki run` with a manifest scoped to pre-change sessions → get `report-before.json`
2. Run `raki run` with a manifest scoped to post-change sessions → get `report-after.json`
3. Run `raki report --diff report-before.json report-after.json --html diff.html`

## Deliverable

Add a **"Comparing Runs"** section to the docs covering:

- The `--diff` option of `raki report`, with a worked example
- How to scope manifests for before/after session sets
- What the diff report shows (metric deltas, ▲/▼ direction, matched/new/dropped sessions, judge config mismatch warnings)
- `--fail-on-regression` for CI gating
- When to use `--diff` vs `raki trends` (point-in-time comparison vs trajectory)

## Scope

Docs-only. No code changes.

## Acceptance criteria

- [ ] New doc section with worked example
- [ ] Cross-reference from `raki trends` docs
- [ ] Towncrier fragment (`+diff-docs.doc` or similar)

---

## Spec

### Context

RAKI already has a complete diff comparison pipeline (`raki report --diff baseline.json compare.json`) that produces:
- **CLI output**: metric deltas with ▲/▼ direction indicators, matched/new/dropped session counts, judge config mismatch warnings, per-session verdict transitions (improvement/regression)
- **HTML output**: self-contained dark-theme diff report via `write_diff_html_report`
- **CI gating**: `--fail-on-regression` exits non-zero when metrics regress beyond a 2% noise margin
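
To make the gating rule concrete for whoever writes the doc, here is a minimal Python sketch of a noise-margin check. It is an illustration only, not RAKI's implementation: `MetricDelta`, `is_regression`, and the reading of the 2% margin as a relative change against the baseline are all assumptions and should be checked against the diff code before being documented.

```python
from dataclasses import dataclass

# Assumption: the 2% noise margin is relative to the baseline value.
NOISE_MARGIN = 0.02


@dataclass(frozen=True)
class MetricDelta:
    """Hypothetical container for one metric in two reports."""
    name: str
    baseline: float
    compare: float
    higher_is_better: bool


def is_regression(delta: MetricDelta, margin: float = NOISE_MARGIN) -> bool:
    """Return True when the metric moved in the bad direction by more than the margin."""
    if delta.baseline == 0:
        # A relative change is undefined here; fall back to the raw difference.
        change = delta.compare - delta.baseline
    else:
        change = (delta.compare - delta.baseline) / abs(delta.baseline)
    # Flip the sign for lower-is-better metrics so that "worse" is always negative.
    signed = change if delta.higher_is_better else -change
    return signed < -margin


# A drop in a higher-is-better metric is a regression.
print(is_regression(MetricDelta("first_pass_success_rate", 0.67, 0.36, True)))
# A drop in a lower-is-better metric is an improvement, not a regression.
print(is_regression(MetricDelta("rework_cycles", 0.72, 0.33, False)))
```

Under this reading, a 1% dip stays inside the margin and does not trip the gate, which is the behavior the doc should describe once the actual margin semantics are confirmed.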

This feature is mentioned in passing in two places:
- `getting-started.md` — 3-line example under "Understanding the output"
- `ci-integration.md` — brief mention under "Regression detection with --fail-on-regression"

Neither doc explains the full workflow, what the diff report shows, or when to use `--diff` vs `raki trends`.

### What to document

Create `docs/comparing-runs.md` with these sections (ordered per doc writer review — procedure first, concept last):

#### 1. Quick start (worked example)

Get users to a result in 60 seconds. Introduce each step with a lead-in sentence and put each command in its own block (no inline `# Step N:` comments, per Red Hat code-block conventions).

To evaluate the pre-change sessions, run the following command:

```bash
uv run raki run -m manifests/before.yaml -o results/ --include-sessions
```

To evaluate the post-change sessions, run the following command:

```bash
uv run raki run -m manifests/after.yaml -o results/ --include-sessions
```

To compare the two reports, run the following command:

```bash
uv run raki report --diff results/before.json results/after.json
```

To generate an HTML diff report, add `--html`:

```bash
uv run raki report --diff results/before.json results/after.json --html diff-report.html
```

To fail CI when metrics regress, add `--fail-on-regression`:

```bash
uv run raki report --diff results/before.json results/after.json --fail-on-regression
```

Include realistic synthetic CLI output so field descriptions are anchored to something concrete:

```
Comparing raki-20260501T100000 → raki-20260512T140000
Matched: 12/14 sessions (2 new)

  First-pass success rate  0.36 → 0.67  (+0.31)  ▲
  Rework cycles            0.72 → 0.33  (-0.39)  ▲
  Cost / session           $12.30 → $8.50  (-$3.80)  ▲
  Severity score           0.45 → 0.22  (-0.23)  ▲

Improvements: 4 sessions (fail→pass: 3, rework→pass: 1)
Regressions:  1 session (pass→rework: 1)
```

#### 2. Reading the diff output

Document what each section of the CLI diff shows:
- **Header**: baseline run ID → compare run ID
- **Judge config warnings**: when provider/model differ between runs, or when only one used a judge
- **Agent model warnings**: when the session agent models differ
- **Coverage line**: matched/total sessions, new/dropped counts
- **Metric deltas**: each metric with baseline → compare value, delta, and direction (▲ improved / ▼ regressed / = flat). Direction respects `higher_is_better` per metric
- **Session transitions**: sessions whose verdict changed (pass/rework/fail), grouped into improvements and regressions
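
The direction and transition rules above can be sketched in Python as follows. This is a hedged illustration for the doc author, not the tool's code: the function names, the exact-equality test for "flat", and the `fail < rework < pass` ordering are assumptions.

```python
# Assumed ordering of verdicts, from worst to best.
VERDICT_RANK = {"fail": 0, "rework": 1, "pass": 2}


def direction(baseline: float, compare: float, higher_is_better: bool) -> str:
    """Return the indicator for a metric delta: improved, regressed, or flat."""
    if compare == baseline:
        return "="
    # A rise is good only when higher is better; a fall is good only when lower is better.
    improved = (compare > baseline) == higher_is_better
    return "▲" if improved else "▼"


def classify_transition(before: str, after: str) -> str:
    """Classify a per-session verdict change as improvement, regression, or unchanged."""
    rank_before, rank_after = VERDICT_RANK[before], VERDICT_RANK[after]
    if rank_after > rank_before:
        return "improvement"
    if rank_after < rank_before:
        return "regression"
    return "unchanged"


# Values taken from the synthetic output in the quick start.
print(direction(0.36, 0.67, higher_is_better=True))    # first-pass success rate rose
print(direction(0.72, 0.33, higher_is_better=False))   # rework cycles fell
print(classify_transition("fail", "pass"))
print(classify_transition("pass", "rework"))
```

The sketch shows why every metric in the sample output carries ▲ even though three of the four deltas are negative: those three are lower-is-better metrics.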

#### 3. Producing reports to compare
- Manifest scoping: one manifest per session set, or one manifest with all sessions and two runs
- `--include-sessions` is required for per-session verdict transitions (improvements/regressions). Without it, only aggregate metric deltas are shown
- Both reports should use the same judge config (provider + model) for analytical metrics; RAKI warns when they differ
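
As a reference for this section, the following Python sketch shows one way session matching and the judge config check could work. It is an assumption-laden illustration: `match_sessions`, `judge_warning`, the `(provider, model)` tuple shape, and the warning strings are hypothetical and are not taken from the RAKI code.

```python
from typing import Iterable, Optional, Tuple

JudgeConfig = Optional[Tuple[str, str]]  # (provider, model), or None when no judge ran


def match_sessions(baseline_ids: Iterable[str], compare_ids: Iterable[str]) -> dict:
    """Split session IDs into matched, new (compare only), and dropped (baseline only)."""
    baseline, compare = set(baseline_ids), set(compare_ids)
    return {
        "matched": sorted(baseline & compare),
        "new": sorted(compare - baseline),
        "dropped": sorted(baseline - compare),
    }


def judge_warning(baseline_judge: JudgeConfig, compare_judge: JudgeConfig) -> Optional[str]:
    """Return a warning string when the judge configs differ, otherwise None."""
    if baseline_judge == compare_judge:
        return None
    if baseline_judge is None or compare_judge is None:
        return "only one run used a judge"
    return f"judge config differs: {baseline_judge} vs {compare_judge}"


# Mirrors the synthetic coverage line: 12 matched sessions, 2 new, none dropped.
result = match_sessions(
    [f"s{i:02d}" for i in range(12)],
    [f"s{i:02d}" for i in range(14)],
)
print(len(result["matched"]), result["new"], result["dropped"])
print(judge_warning(("provider-a", "model-1"), ("provider-a", "model-2")))
```

Session IDs present in both reports are the only ones that can contribute per-session verdict transitions, which is why both runs need `--include-sessions`.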

#### 4. When to use `--diff` vs `raki trends`
- **`raki report --diff`**: point-in-time comparison between two specific evaluation runs — "did this change improve things?"
- **`raki trends`**: trajectory over many runs — "are we getting better over time?"
- Decision table: use `--diff` for before/after a specific change, `trends` for ongoing monitoring

#### 5. Tips and caveats
- Reports without `--include-sessions` still show aggregate deltas but no per-session transitions
- New/dropped sessions (session IDs present in one report but not the other) are tracked and displayed; aggregate deltas use matched sessions only
- Judge config mismatches are warned but do not block comparison — analytical metric deltas may not be meaningful if the judge changed
- Exit codes: 0 (no regression), 3 (regression detected), 4 (regression + threshold violation)
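
The exit codes listed above could be consumed by a small CI wrapper such as the following Python sketch. The wrapper itself is hypothetical: `describe_exit` and `run_diff_gate` are illustrative names, and only the command line and the codes 0, 3, and 4 come from this spec.

```python
import subprocess

# Exit codes as listed in this spec; any other value is treated as unexpected.
EXIT_MEANINGS = {
    0: "no regression",
    3: "regression detected",
    4: "regression and threshold violation",
}


def describe_exit(code: int) -> str:
    """Translate an exit code from the diff command into a readable message."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_diff_gate(baseline: str, compare: str) -> int:
    """Run the diff with CI gating and report the outcome (hypothetical wrapper)."""
    cmd = [
        "uv", "run", "raki", "report",
        "--diff", baseline, compare,
        "--fail-on-regression",
    ]
    # check=False so that a non-zero exit code is returned instead of raising.
    result = subprocess.run(cmd, check=False)
    print(describe_exit(result.returncode))
    return result.returncode


print(describe_exit(3))
```

A CI job could call `run_diff_gate("results/before.json", "results/after.json")` and pass the returned code to `sys.exit` so that the pipeline status mirrors the diff result.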

### Where it lives

- New file: `docs/comparing-runs.md`
- Add entry in `docs/index.md` — group guides and metric references separately:

```markdown
## Guides

- [Getting Started](getting-started.md)
- [Interpreting Results](interpreting-results.md)
- [Results Interpretation Reference](interpretation-reference.md)
- [Comparing Runs](comparing-runs.md)
- [CI Integration](ci-integration.md)
- [Ground Truth Curation Guide](curation-guide.md)
- [Adapter Guide](adapter-guide.md)
- [Session Schema Reference](session-schema.md)

## Metric References

- [Operational Metrics](metrics/operational.md)
- [Knowledge Metrics](metrics/knowledge.md)
- [Analytical Metrics](metrics/analytical.md)
- [Rationale and Interpretation](metrics/rationale-and-interpretation.md)
```

- Cross-reference from `docs/getting-started.md` (expand the 3-line mention)
- Cross-reference from `docs/ci-integration.md` (expand the regression detection section)

### Out of scope

- Cohort comparison within a single run (#260) — documented separately when that feature ships
- `raki trends` deep-dive — that is its own doc if needed
- Changes to diff code or behavior — this is docs-only

### Style requirements (from doc writer review)

- No contractions (*cannot*, not *can't*; *do not*, not *don't*)
- Active voice throughout (*RAKI warns when configs differ*, not *a warning is shown*)
- *that* included in subordinate clauses (*Verify that both reports use the same judge config*)
- Demonstrative pronouns followed by nouns (*this command*, not *this*)
- Code blocks introduced with a complete lead-in sentence ending in a colon
- No `# comment` explanations inside code blocks — move them to prose above
- Descriptive link text (*see [Comparing Runs](comparing-runs.md)*, not *click here*)

---

## Plan

### Tasks

1. **Create `docs/comparing-runs.md`**
   - Write all five sections in the order specified above (quick start first, concept last)
   - Include worked CLI examples with realistic synthetic output
   - Include a "when to use what" decision table (`--diff` vs `trends` vs future `cohort`)

2. **Update `docs/index.md`**
   - Restructure into "Guides" and "Metric References" groups
   - Add "Comparing Runs" entry

3. **Update `docs/getting-started.md`**
   - Expand the "Compare two runs" mention to a short paragraph with a link to `comparing-runs.md`

4. **Update `docs/ci-integration.md`**
   - Add a cross-reference link from the "Regression detection" section to `comparing-runs.md` for the full workflow

5. **Create towncrier fragment**
   - `changes/258.doc` — "Document the ``raki report --diff`` workflow for comparing evaluation runs."

6. **Tests** — N/A (docs-only, no code changes)

### Dependencies
- None — this documents existing functionality

### Estimated scope
- Small (~200 lines of markdown; 4 docs files touched, plus 1 towncrier fragment)
