Child of #257 (cohort comparison). Depends on #259 (reaggregation helper).
Problem
After shipping pipeline improvements, there is no way to answer "did things get better?" from a single raki report. Sessions from before and after a change are aggregated together, diluting improvements (see #257 for the SODA v0.5.0 case: 67% first-pass improvement masked by 70 older sessions).
Solution
New raki cohort subcommand that splits a saved JSON report by date and produces a diff using existing infrastructure.
How it works
Load saved JSON report via load_json_report()
Split report.sample_results by sample.session.started_at into "before" and "after" lists
Print small-n warning if either cohort has < 10 sessions
Call build_cohort_diff(split)
If not quiet: print_diff_summary(diff_report)
If html_path: write_diff_html_report(diff_report, html_path)
If json_output: serialize diff to stdout
If fail_on_regression: reuse _handle_diff regression logic
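The split and small-n steps above can be sketched in a few lines. This is a minimal illustration, not raki's implementation: `Session` and `SampleResult` are simplified stand-ins for raki's models, the `split_by_date` signature is assumed, `started_at` is assumed to be a `datetime`, and the sketch raises `ValueError` where the plan calls for `click.UsageError` so that it has no third-party dependency.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass
class Session:
    """Stand-in for raki's session model (assumed shape)."""
    started_at: datetime


@dataclass
class SampleResult:
    """Stand-in for raki's per-sample result (assumed shape)."""
    session: Session


SMALL_N_THRESHOLD = 10  # from the plan: warn if a cohort has < 10 sessions


def split_by_date(samples, since, until=None):
    """Partition samples into (before, after) lists by session start date."""
    before, after = [], []
    for sample in samples:
        started = sample.session.started_at.date()
        # "after" means on or after `since`, and on or before `until` if given
        in_after = started >= since and (until is None or started <= until)
        (after if in_after else before).append(sample)
    if not before or not after:
        raise ValueError("both cohorts must contain at least one session")
    return before, after


def small_n_warnings(before, after, threshold=SMALL_N_THRESHOLD):
    """Return one warning string per cohort that is below the threshold."""
    return [
        f"warning: '{name}' cohort has only {len(cohort)} sessions"
        for name, cohort in (("before", before), ("after", after))
        if len(cohort) < threshold
    ]
```

Returning the warnings as strings instead of printing them keeps the helper easy to test; the CLI layer can decide whether to print them or suppress them in quiet mode.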
Task 4: Update comparing-runs docs
File: docs/comparing-runs.md
Add a "Cohort comparison" section after the existing diff workflow:
## Cohort comparison within a single report
Split sessions by date to compare before/after a pipeline change:
raki cohort results/report.json --since 2026-05-12
This splits sessions into "Before 2026-05-12" and "Since 2026-05-12"
cohorts and produces the same diff output as `raki report --diff`.
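The cohort labels mentioned above, together with the `--until` variant listed under "Cohort labeling", follow a simple rule. A sketch only: `cohort_labels` is a hypothetical helper name, and the label strings are copied from the examples in this issue.

```python
from datetime import date
from typing import Optional, Tuple


def cohort_labels(since: date, until: Optional[date] = None) -> Tuple[str, str]:
    """Return (before_label, after_label) for a --since/--until split."""
    before_label = f"Before {since.isoformat()}"
    if until is None:
        after_label = f"Since {since.isoformat()}"
    else:
        # With an upper bound, the "after" cohort is labeled as a closed range
        after_label = f"{since.isoformat()} to {until.isoformat()}"
    return before_label, after_label
```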
Task 5: Towncrier fragment
changes/260.feature:
Add ``raki cohort`` command for date-based before/after comparison within a
single report. Splits sessions by ``--since`` date, reaggregates metrics per
cohort, and produces the same diff output as ``raki report --diff``.
Verification
uv run pytest tests/test_cohort.py -v
uv run pytest tests/ -v -m "not slow"
uv run ruff check src/ tests/
uv run ty check src/raki/
How it works
Load saved JSON report via load_json_report()
Split report.sample_results by sample.session.started_at into "before" and "after" lists
Call reaggregate_scores() (feat: persist per-sample metric scores in JSON report #259) on each list to get per-cohort aggregate scores
Compute deltas with compute_deltas() from report/diff.py
Build a DiffReport with cohort labels instead of run IDs
Render via print_diff_summary() (CLI) and write_diff_html_report() (HTML)
CLI design
Cohort labeling
--since 2026-05-12 → "Before 2026-05-12" vs "Since 2026-05-12"
--since 2026-05-01 --until 2026-05-15 → "Before 2026-05-01" vs "2026-05-01 to 2026-05-15"
Acceptance criteria
raki cohort subcommand registered in cli.py under main
--since (required) splits sessions by started_at date
--until (optional) caps the "after" cohort
CLI output uses print_diff_summary() with cohort labels instead of run IDs
--html produces diff HTML report using write_diff_html_report()
--fail-on-regression exits non-zero on regression (reuses gates/regression.py)
--json outputs machine-readable diff data
-q quiet mode for CI
Clear error when the report has no sample_results (stripped or empty)
Update docs/comparing-runs.md with cohort command section
Implementation Plan
Task 1: Create cohort splitting helper
File: src/raki/report/cohort.py (new)
Write failing tests first in tests/test_cohort.py:
test_split_by_date_divides_sessions: 5 sessions, split at midpoint, verify 2 groups
test_split_by_date_empty_before: all sessions after the date → error
test_split_by_date_empty_after: all sessions before the date → error
test_split_by_date_with_until: sessions outside the until range go to "before"
test_split_with_no_sample_results: empty list → error
Implement:
Logic:
For each SampleResult, check sample.session.started_at
Sessions with started_at >= since (and <= until if set) go to after
All other sessions go to before
Raise click.UsageError if either cohort is empty
Task 2: Create cohort diff builder
File: src/raki/report/cohort.py
Write failing tests:
test_build_cohort_diff_produces_diff_report: verify output is a DiffReport
test_build_cohort_diff_uses_reaggregate: verify scores come from reaggregate_scores()
test_build_cohort_diff_labels: verify baseline_run_id and compare_run_id are the cohort labels
Implement:
Logic:
Call reaggregate_scores(split.before) and reaggregate_scores(split.after)
Call compute_deltas(before_scores, after_scores)
Build a DiffReport with:
  baseline_run_id = split.before_label
  compare_run_id = split.after_label
  match_result with baseline_total=len(before), compare_total=len(after), no matching (different cohorts)
  has_session_data = False (no per-session transitions for cohort comparison)
Task 3: Add raki cohort CLI command
File: src/raki/cli.py
Add a new Click command under main:
Logic:
Load saved JSON report via load_json_report()
Validate that sample_results is not empty
Call split_by_date(report.sample_results, since, until)
Call build_cohort_diff(split)
If not quiet: print_diff_summary(diff_report)
If html_path: write_diff_html_report(diff_report, html_path)
If fail_on_regression: reuse _handle_diff regression logic
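To make the build_cohort_diff(split) step concrete, here is a rough, self-contained sketch of the reaggregate-then-diff idea. Everything in it is an assumption for illustration: it treats each sample's scores as a dict of metric name to float and aggregates by arithmetic mean, which may not match what reaggregate_scores() and compute_deltas() actually do in raki, and `CohortDiff` is a hypothetical stand-in for DiffReport that only mirrors the fields named in this plan.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Dict, List

Scores = Dict[str, float]


@dataclass
class CohortDiff:
    """Hypothetical stand-in for DiffReport (fields taken from the plan)."""
    baseline_run_id: str
    compare_run_id: str
    baseline_total: int
    compare_total: int
    deltas: Scores
    has_session_data: bool = False  # no per-session transitions for cohorts


def mean_scores(samples: List[Scores]) -> Scores:
    """Average each metric over the samples that report it (assumed aggregation)."""
    metrics = sorted({name for sample in samples for name in sample})
    return {
        name: fmean(sample[name] for sample in samples if name in sample)
        for name in metrics
    }


def score_deltas(before: Scores, after: Scores) -> Scores:
    """Compute after minus before for every metric present in both cohorts."""
    return {name: after[name] - before[name] for name in before if name in after}


def build_cohort_diff_sketch(
    before: List[Scores], after: List[Scores], before_label: str, after_label: str
) -> CohortDiff:
    """Aggregate each cohort, diff the aggregates, and attach cohort labels."""
    return CohortDiff(
        baseline_run_id=before_label,
        compare_run_id=after_label,
        baseline_total=len(before),
        compare_total=len(after),
        deltas=score_deltas(mean_scores(before), mean_scores(after)),
    )
```

Reusing the run-ID fields for cohort labels is what lets the existing summary and HTML renderers work unchanged, which is the point of the plan's "existing infrastructure" approach.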