feat: raki cohort command — date-based split and diff within a single report #260

@decko

Parent

Child of #257 (cohort comparison). Depends on #259 (reaggregation helper).

Problem

After shipping pipeline improvements, there is no way to answer "did things get better?" from a single raki report. Sessions from before and after a change are aggregated together, diluting improvements (see #257 for the SODA v0.5.0 case: 67% first-pass improvement masked by 70 older sessions).

Solution

New raki cohort subcommand that splits a saved JSON report by date and produces a diff using existing infrastructure.

How it works

  1. Load saved JSON report via load_json_report()
  2. Split report.sample_results by sample.session.started_at into "before" and "after" lists
  3. Call reaggregate_scores() (from #259) on each list to get per-cohort aggregate scores
  4. Feed both aggregate dicts into existing compute_deltas() from report/diff.py
  5. Build a DiffReport with cohort labels instead of run IDs
  6. Render via existing print_diff_summary() (CLI) and write_diff_html_report() (HTML)

CLI design

raki cohort REPORT_JSON --since DATE [--until DATE] [--html PATH] [--fail-on-regression] [--json] [-q]

Cohort labeling

  • --since 2026-05-12 → "Before 2026-05-12" vs "Since 2026-05-12"
  • --since 2026-05-01 --until 2026-05-15 → "Before 2026-05-01" vs "2026-05-01 to 2026-05-15"
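
The labeling rule above could be sketched as a small helper. This is an illustration only: the name cohort_labels is hypothetical and not part of raki; it just reproduces the two label formats listed above.

```python
from __future__ import annotations

from datetime import datetime


def cohort_labels(since: datetime, until: datetime | None = None) -> tuple[str, str]:
    """Return (before_label, after_label) for a date-based cohort split.

    Hypothetical helper; the formats follow the two examples in this issue.
    """
    day = "%Y-%m-%d"
    before_label = f"Before {since.strftime(day)}"
    if until is None:
        after_label = f"Since {since.strftime(day)}"
    else:
        after_label = f"{since.strftime(day)} to {until.strftime(day)}"
    return before_label, after_label
```

Formatting with strftime rather than isoformat() keeps the time-of-day component out of the label when the value is a datetime, which is what click.DateTime produces.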

Acceptance criteria

  • raki cohort subcommand registered in cli.py under main
  • --since (required) splits sessions by started_at date
  • --until (optional) caps the "after" cohort
  • CLI output reuses print_diff_summary() with cohort labels instead of run IDs
  • --html produces diff HTML report using write_diff_html_report()
  • --fail-on-regression exits non-zero on regression (reuses gates/regression.py)
  • --json outputs machine-readable diff data
  • -q quiet mode for CI
  • Small-n warning when either cohort has fewer than 10 sessions
  • Error when either cohort is empty ("No sessions found in the 'after' cohort")
  • Error when report has no sample_results (stripped or empty)
  • Tests with synthetic report data covering: normal split, empty cohort, single-session cohort, small-n warning
  • Towncrier fragment
  • Update docs/comparing-runs.md with cohort command section

Implementation Plan

Task 1: Create cohort splitting helper

File: src/raki/report/cohort.py (new)

Write failing tests first in tests/test_cohort.py:

  • test_split_by_date_divides_sessions — 5 sessions, split at midpoint, verify 2 groups
  • test_split_by_date_empty_before — all sessions after the date → error
  • test_split_by_date_empty_after — all sessions before the date → error
  • test_split_by_date_with_until — sessions outside the until range go to "before"
  • test_split_with_no_sample_results — empty list → error

Implement:

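from dataclasses import dataclass
from datetime import datetime

# SampleResult is raki's existing result model (import path not shown here).


@dataclass
class CohortSplit:
    before: list[SampleResult]
    after: list[SampleResult]
    before_label: str
    after_label: str


def split_by_date(
    sample_results: list[SampleResult],
    since: datetime,
    until: datetime | None = None,
) -> CohortSplit:
    """Split sample results into before/after cohorts by session start date."""
    ...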

Logic:

  1. For each SampleResult, check sample.session.started_at
  2. Sessions with started_at >= since (and <= until if set) go to after
  3. Everything else goes to before
  4. Raise click.UsageError if either cohort is empty
  5. Generate labels: "Before {since}" / "Since {since}" (or "{since} to {until}" when until is set)
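
A minimal sketch of the logic above, under stated assumptions. Session, Sample, and SampleResult are hypothetical stand-ins for raki's real models, and the sketch assumes the start time is reachable as result.sample.session.started_at. It raises ValueError so it runs without Click; the plan calls for click.UsageError.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime


# Hypothetical stand-ins for raki's models; only the fields used here.
@dataclass
class Session:
    started_at: datetime


@dataclass
class Sample:
    session: Session


@dataclass
class SampleResult:
    sample: Sample


@dataclass
class CohortSplit:
    before: list[SampleResult]
    after: list[SampleResult]
    before_label: str
    after_label: str


def split_by_date(
    sample_results: list[SampleResult],
    since: datetime,
    until: datetime | None = None,
) -> CohortSplit:
    if not sample_results:
        raise ValueError("Report has no sample_results")

    before: list[SampleResult] = []
    after: list[SampleResult] = []
    for result in sample_results:
        started = result.sample.session.started_at
        # "after" means since <= started_at, and started_at <= until if set.
        in_after = started >= since and (until is None or started <= until)
        (after if in_after else before).append(result)

    if not before:
        raise ValueError("No sessions found in the 'before' cohort")
    if not after:
        raise ValueError("No sessions found in the 'after' cohort")

    day = "%Y-%m-%d"
    before_label = f"Before {since.strftime(day)}"
    if until is None:
        after_label = f"Since {since.strftime(day)}"
    else:
        after_label = f"{since.strftime(day)} to {until.strftime(day)}"
    return CohortSplit(before, after, before_label, after_label)
```

Two details may be worth settling during implementation. A date parsed with the "%Y-%m-%d" format is a datetime at midnight, so started_at <= until as written excludes sessions that start later on the until day. Also, click.DateTime returns naive datetimes; if started_at turns out to be timezone-aware, the comparison raises TypeError unless one side is normalized first.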

Task 2: Create cohort diff builder

File: src/raki/report/cohort.py

Write failing tests:

  • test_build_cohort_diff_produces_diff_report — verify output is a DiffReport
  • test_build_cohort_diff_uses_reaggregate — verify scores come from reaggregate_scores()
  • test_build_cohort_diff_labels — verify baseline_run_id and compare_run_id are the cohort labels

Implement:

def build_cohort_diff(split: CohortSplit) -> DiffReport:

Logic:

  1. Call reaggregate_scores(split.before) and reaggregate_scores(split.after)
  2. Call compute_deltas(before_scores, after_scores)
  3. Build DiffReport with:
    • baseline_run_id = split.before_label
    • compare_run_id = split.after_label
    • match_result with baseline_total=len(before), compare_total=len(after), no matching (different cohorts)
    • has_session_data = False (no per-session transitions for cohort comparison)
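
The shape of this step can be illustrated with stand-ins. The real signatures of reaggregate_scores(), compute_deltas(), and DiffReport are not shown in this issue, so mean_scores, score_deltas, and the returned dict below are hypothetical substitutes; the sketch also assumes each sample carries a metric-to-score dict and that aggregation is a plain mean, which may not match raki's actual aggregation.

```python
from __future__ import annotations

from statistics import fmean


def mean_scores(per_sample: list[dict[str, float]]) -> dict[str, float]:
    """Stand-in for reaggregate_scores(): mean of each metric over the samples that report it."""
    collected: dict[str, list[float]] = {}
    for scores in per_sample:
        for metric, value in scores.items():
            collected.setdefault(metric, []).append(value)
    return {metric: fmean(values) for metric, values in collected.items()}


def score_deltas(before: dict[str, float], after: dict[str, float]) -> dict[str, float]:
    """Stand-in for compute_deltas(): after minus before, for metrics present in both."""
    return {metric: after[metric] - before[metric] for metric in before.keys() & after.keys()}


def cohort_diff(
    before_samples: list[dict[str, float]],
    after_samples: list[dict[str, float]],
    before_label: str,
    after_label: str,
) -> dict:
    """Plain-dict stand-in for the DiffReport fields listed above."""
    before_scores = mean_scores(before_samples)
    after_scores = mean_scores(after_samples)
    return {
        "baseline_run_id": before_label,   # cohort label instead of a run ID
        "compare_run_id": after_label,
        "baseline_total": len(before_samples),
        "compare_total": len(after_samples),
        "has_session_data": False,         # no per-session transitions across cohorts
        "deltas": score_deltas(before_scores, after_scores),
    }
```

The point of the sketch is the ordering: aggregate each cohort independently, then diff the two aggregates, so older sessions cannot dilute the newer cohort's scores.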

Task 3: Add raki cohort CLI command

File: src/raki/cli.py

Add a new Click command under main:

@main.command()
@click.argument("input_path")
@click.option("--since", required=True, type=click.DateTime(formats=["%Y-%m-%d"]))
@click.option("--until", type=click.DateTime(formats=["%Y-%m-%d"]), default=None)
@click.option("--html", "html_path", default=None)
@click.option("--fail-on-regression", is_flag=True)
@click.option("--json", "json_output", is_flag=True)
@click.option("-q", "--quiet", is_flag=True)
def cohort(input_path, since, until, html_path, fail_on_regression, json_output, quiet):

Logic:

  1. Load report via load_json_report()
  2. Validate sample_results is not empty
  3. Call split_by_date(report.sample_results, since, until)
  4. Print small-n warning if either cohort has < 10 sessions
  5. Call build_cohort_diff(split)
  6. If not quiet: print_diff_summary(diff_report)
  7. If html_path: write_diff_html_report(diff_report, html_path)
  8. If json_output: serialize diff to stdout
  9. If fail_on_regression: reuse _handle_diff regression logic
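
Step 4 could be factored into a pure function so it is easy to test without invoking the CLI. This is a sketch: small_n_warnings and SMALL_N_THRESHOLD are hypothetical names, and the message wording is illustrative; only the threshold of 10 comes from the acceptance criteria.

```python
from __future__ import annotations

SMALL_N_THRESHOLD = 10  # from the acceptance criteria: warn below 10 sessions


def small_n_warnings(
    cohort_sizes: dict[str, int],
    threshold: int = SMALL_N_THRESHOLD,
) -> list[str]:
    """Return one warning message per cohort with fewer than `threshold` sessions."""
    return [
        f"Warning: cohort '{label}' has only {count} session(s); "
        f"deltas from fewer than {threshold} sessions may be unreliable."
        for label, count in cohort_sizes.items()
        if count < threshold
    ]
```

In the command, the sizes would come from len(split.before) and len(split.after). Emitting the messages with click.echo(message, err=True) would keep them on stderr, so that --json output on stdout stays machine-readable.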

Task 4: Update comparing-runs docs

File: docs/comparing-runs.md

Add a "Cohort comparison" section after the existing diff workflow:

## Cohort comparison within a single report

Split sessions by date to compare before/after a pipeline change:

    raki cohort results/report.json --since 2026-05-12

This splits sessions into "Before 2026-05-12" and "Since 2026-05-12"
cohorts and produces the same diff output as `raki report --diff`.

Task 5: Towncrier fragment

changes/260.feature:

Add ``raki cohort`` command for date-based before/after comparison within a
single report. Splits sessions by ``--since`` date, reaggregates metrics per
cohort, and produces the same diff output as ``raki report --diff``.

Verification

uv run pytest tests/test_cohort.py -v
uv run pytest tests/ -v -m "not slow"
uv run ruff check src/ tests/
uv run ty check src/raki/

Labels: enhancement
