feat(cohort): manifest cohort tags and --group-by for multi-cohort comparison #314

Merged: decko merged 8 commits into main from soda/261 on May 21, 2026

@decko (Owner) commented May 20, 2026

Summary

Adds multi-cohort comparison capability to raki cohort via a new --group-by FIELD option. Sessions can be grouped by any SessionMeta field (orchestrator, model_id, provider, cohort) to produce side-by-side metric comparisons. Two groups use the existing diff summary; three or more groups render a Rich Table and a multi-cohort HTML report.

SessionMeta gains a cohort: str | None field for explicit cohort tagging, and EvalManifest gains cohort_tags: list[str] for manifest-level cohort label configuration. --group-by and --since are mutually exclusive.
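
As a rough illustration of the two new fields, here is a minimal sketch that assumes plain dataclasses. The real SessionMeta and EvalManifest may be a different kind of model and carry more fields. The empty-list default for cohort_tags and the example values are assumptions, not taken from the PR.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SessionMeta:
    # Only the groupable fields named in this PR; the real model has more.
    orchestrator: str | None = None
    model_id: str | None = None
    provider: str | None = None
    # New in this PR: explicit cohort label, None when the session is untagged.
    cohort: str | None = None


@dataclass
class EvalManifest:
    # New in this PR: manifest-level cohort labels.
    # Assumption: defaults to an empty list when not configured.
    cohort_tags: list[str] = field(default_factory=list)


# Hypothetical values, for illustration only.
meta = SessionMeta(orchestrator="orchestrator-a", cohort="baseline")
manifest = EvalManifest(cohort_tags=["baseline", "candidate"])
```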

Acceptance Criteria

  • raki cohort sessions/ --group-by orchestrator groups sessions by orchestrator and renders a side-by-side table
  • Two groups produced by --group-by use the existing diff summary (identical to --since output)
  • Three or more groups render a Rich Table and (optionally) a multi-cohort HTML report
  • --group-by and --since are mutually exclusive — using both raises a UsageError
  • --until combined with --group-by raises a UsageError (not silently ignored)
  • --group-by cohort with all-None cohort fields exits with a helpful error message
  • --fail-on-regression with N>2 groups prints a notice (not silently ignored)
  • SessionMeta.cohort and EvalManifest.cohort_tags fields are present and documented
  • Integration tests cover 2, 3, and 4-cohort CLI scenarios
  • HTML report renders correctly for multi-cohort output
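
The option rules in the criteria above can be sketched as a single validation function. This is a hedged sketch, not the code in cli.py: the function name, the error messages, and the local UsageError class (a stand-in for click.UsageError so the example runs without click installed) are all hypothetical.

```python
from __future__ import annotations


class UsageError(Exception):
    """Stand-in for click.UsageError so the sketch has no dependencies."""


def validate_cohort_options(
    since: str | None, until: str | None, group_by: str | None
) -> None:
    """Raise UsageError for option combinations the PR rejects."""
    if since is not None and group_by is not None:
        raise UsageError("--since and --group-by are mutually exclusive")
    if since is None and group_by is None:
        raise UsageError("one of --since or --group-by is required")
    if until is not None and group_by is not None:
        # Rejected rather than silently ignored (finding 5 in the review).
        raise UsageError("--until cannot be combined with --group-by")


# Valid combinations pass without raising.
validate_cohort_options(since=None, until=None, group_by="orchestrator")
validate_cohort_options(since="2026-05-01", until="2026-05-10", group_by=None)
```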

Review Results

Findings addressed in this PR

| # | Severity | File | Issue | Resolution |
|---|----------|------|-------|------------|
| 1 | IMPORTANT | cli.py:2043 | --fail-on-regression warning never printed for N>2 groups due to wrong condition (group_count == 2 is always False in that branch) | Changed condition to if fail_on_regression: |
| 2 | MINOR | cli.py:2018 | Unused import dataclasses in the N>2 JSON output block | Removed the unused import |
| 3 | MINOR | docs/comparing-runs.md:379 | Duplicate '### Filtering the compare cohort with --until' section | Removed the duplicate section at lines 379–387 |
| 4 | MINOR | cli.py:1911 / html_report.py:923 | Bare list type annotations lose type information | Updated to list[EvalSample], dict[str, list[EvalSample]], and list[CohortSummary] with TYPE_CHECKING guards |
| 5 | IMPORTANT | cli.py:1768 | --until silently ignored when combined with --group-by (data correctness concern) | Added click.UsageError when --until is passed with --group-by |
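
Finding 1 is a dead-branch bug: a check for group_count == 2 placed inside the code path that only runs when there are more than two groups. The sketch below reproduces the shape of that bug and the fix. The function names and the notice text are hypothetical, and the exact original expression in cli.py is not shown in this PR, so the "before" version is an assumption about its form.

```python
from __future__ import annotations

# Hypothetical notice text, for illustration only.
NOTICE = "--fail-on-regression is not evaluated for more than two groups"


def notice_before_fix(group_count: int, fail_on_regression: bool) -> str | None:
    """Shape of the bug: the inner check can never be true when N > 2."""
    if group_count > 2:
        if fail_on_regression and group_count == 2:  # always False here
            return NOTICE
    return None


def notice_after_fix(group_count: int, fail_on_regression: bool) -> str | None:
    """Shape of the fix: inside the N > 2 branch, test only the flag."""
    if group_count > 2:
        if fail_on_regression:
            return NOTICE
    return None
```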

Refs #261

Assisted-by: Claude Opus 4.6 (1M context) noreply@anthropic.com
Assigned-by: decko

decko added 8 commits May 20, 2026 19:28
…ifest

- Add cohort: str | None = None to SessionMeta for tagging sessions with a cohort label
- Add cohort_tags: list[str] to EvalManifest for manifest-level cohort configuration
- Extend make_sample() factory in conftest.py with orchestrator, provider, cohort params
…to report/cohort.py

- group_by_field(samples, field_name) groups EvalSamples by any SessionMeta field
- inject_cohort_tags(groups, cohort_tags) relabels groups using manifest-defined tags
- build_multi_cohort_summaries(groups) runs metrics on each group and returns CohortSummary list
- CohortSummary dataclass holds label, samples, and aggregate scores per cohort
- print_multi_cohort_summary(summaries, group_by, console) renders a Rich Table
  with one column per cohort and one row per metric
- Color coding follows same rules as format_metric_line()
- Shows session count per cohort in column header (n=N)
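
The grouping and summary steps listed above could look roughly like the following. This is a sketch under stated assumptions, not the code in report/cohort.py. The EvalSample shape (a meta attribute plus a scores dict) is hypothetical, skipping None values is an assumption, and the aggregate here is a plain mean of precomputed scores, whereas the real build_multi_cohort_summaries() runs the project's metrics on each group.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class SessionMeta:
    orchestrator: str | None = None
    cohort: str | None = None


@dataclass
class EvalSample:
    # Hypothetical shape: metadata plus per-metric scores.
    meta: SessionMeta
    scores: dict[str, float] = field(default_factory=dict)


@dataclass
class CohortSummary:
    label: str
    samples: list[EvalSample]
    aggregate_scores: dict[str, float]


def group_by_field(
    samples: list[EvalSample], field_name: str
) -> dict[str, list[EvalSample]]:
    """Group samples by a SessionMeta field, skipping None (assumption)."""
    groups: dict[str, list[EvalSample]] = defaultdict(list)
    for sample in samples:
        value = getattr(sample.meta, field_name)
        if value is not None:
            groups[str(value)].append(sample)
    return dict(groups)


def build_multi_cohort_summaries(
    groups: dict[str, list[EvalSample]],
) -> list[CohortSummary]:
    """Build one summary per group, averaging each metric (assumption)."""
    summaries = []
    for label, members in sorted(groups.items()):
        metric_names = sorted({name for s in members for name in s.scores})
        aggregates = {
            name: mean(s.scores[name] for s in members if name in s.scores)
            for name in metric_names
        }
        summaries.append(CohortSummary(label, members, aggregates))
    return summaries
```

With this sketch, grouping by a field whose value is None for every sample yields an empty dict, which is the situation where the CLI exits with its error for --group-by cohort.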
…_report()

- Create templates/multi-cohort.html.j2 with dark-themed side-by-side comparison table
- Add write_multi_cohort_html_report(summaries, output, group_by) to html_report.py
- Template shows one column per cohort (with n=N count) and one row per metric
- Color coding uses same html_color_for_score() logic as existing reports
…ty and 2-vs-multi branching

- Add --group-by FIELD option (cohort, orchestrator, provider, model_id) to raki cohort
- --since and --group-by are mutually exclusive; exactly one is required
- 2-group path: reuses existing DiffReport + print_diff_summary() infrastructure
- N>2 group path: uses new build_multi_cohort_summaries() + print_multi_cohort_summary()
- HTML output: write_diff_html_report() for 2 groups, write_multi_cohort_html_report() for N>2
- --group-by cohort with all-None values: exits with helpful error suggesting manifest
- Extract _check_regression_from_diff() helper to avoid duplication
- Add TestGroupByField: tests for group_by_field() with orchestrator, model_id, cohort fields
- Add TestInjectCohortTags: tests for inject_cohort_tags() relabeling logic
- Add TestMultiCohortCli: integration tests covering:
  - Mutual exclusivity (--since and --group-by)
  - Neither flag error (exit 2)
  - 2-cohort --group-by (DiffReport output and JSON)
  - 3-cohort --group-by (multi-cohort table and JSON)
  - 4-cohort --group-by (multi-cohort table and JSON)
  - HTML output for 2 and 3 cohort cases
  - Single-group error (exit 1)
  - cohort all-None error (exit 1)
  - model_id grouping
- Fix _write_session_dir to use orchestrator as branch prefix (adapter infers from branch)
- Fix --json output: suppress informational 'Groups by' line when --json is active
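
The 2-vs-multi branching described in the list above reduces to a dispatch on the number of groups. The sketch below returns the names of the terminal and HTML renderers mentioned in this PR as strings so that it runs on its own. The function choose_renderers is hypothetical, and raising SystemExit(1) models the single-group error (exit 1) from the test list rather than the real CLI's error handling.

```python
def choose_renderers(group_count: int) -> tuple[str, str]:
    """Return (terminal renderer, HTML writer) names for a group count."""
    if group_count < 2:
        # A comparison needs at least two groups; the tests expect exit 1.
        raise SystemExit(1)
    if group_count == 2:
        # Two groups reuse the existing DiffReport path.
        return ("print_diff_summary", "write_diff_html_report")
    # Three or more groups use the new multi-cohort path.
    return ("print_multi_cohort_summary", "write_multi_cohort_html_report")
```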
…crier fragment

- Update comparing-runs.md Cohort comparison section:
  - Document --group-by FIELD option with examples
  - Add section on multi-cohort field grouping (3+ groups → side-by-side table)
  - Document cohort_tags manifest field and --group-by cohort usage
  - Expand Options table to include --group-by
  - Update Error conditions table with --group-by-specific errors
  - Update Differences table to show N-way comparison capability
- Create changes/261.feature towncrier fragment
… remove unused import, improve types, validate --until with --group-by, remove duplicate docs
@decko decko added the ai-assisted Implemented with AI assistance label May 20, 2026
@decko decko merged commit 12b2d3d into main May 21, 2026
4 checks passed
@decko decko deleted the soda/261 branch May 21, 2026 10:55