…ifest
- Add cohort: str | None = None to SessionMeta for tagging sessions with a cohort label
- Add cohort_tags: list[str] to EvalManifest for manifest-level cohort configuration
- Extend make_sample() factory in conftest.py with orchestrator, provider, cohort params

…to report/cohort.py
- group_by_field(samples, field_name) groups EvalSamples by any SessionMeta field
- inject_cohort_tags(groups, cohort_tags) relabels groups using manifest-defined tags
- build_multi_cohort_summaries(groups) runs metrics on each group and returns CohortSummary list
- CohortSummary dataclass holds label, samples, and aggregate scores per cohort

- print_multi_cohort_summary(summaries, group_by, console) renders a Rich Table with one column per cohort and one row per metric
- Color coding follows same rules as format_metric_line()
- Shows session count per cohort in column header (n=N)

…_report()
- Create templates/multi-cohort.html.j2 with dark-themed side-by-side comparison table
- Add write_multi_cohort_html_report(summaries, output, group_by) to html_report.py
- Template shows one column per cohort (with n=N count) and one row per metric
- Color coding uses same html_color_for_score() logic as existing reports

…ty and 2-vs-multi branching
- Add --group-by FIELD option (cohort, orchestrator, provider, model_id) to raki cohort
- --since and --group-by are mutually exclusive; exactly one is required
- 2-group path: reuses existing DiffReport + print_diff_summary() infrastructure
- N>2 group path: uses new build_multi_cohort_summaries() + print_multi_cohort_summary()
- HTML output: write_diff_html_report() for 2 groups, write_multi_cohort_html_report() for N>2
- --group-by cohort with all-None values: exits with helpful error suggesting manifest
- Extract _check_regression_from_diff() helper to avoid duplication

- Add TestGroupByField: tests for group_by_field() with orchestrator, model_id, cohort fields
- Add TestInjectCohortTags: tests for inject_cohort_tags() relabeling logic
- Add TestMultiCohortCli: integration tests covering:
  - Mutual exclusivity (--since and --group-by)
  - Neither flag error (exit 2)
  - 2-cohort --group-by (DiffReport output and JSON)
  - 3-cohort --group-by (multi-cohort table and JSON)
  - 4-cohort --group-by (multi-cohort table and JSON)
  - HTML output for 2 and 3 cohort cases
  - Single-group error (exit 1)
  - cohort all-None error (exit 1)
  - model_id grouping
- Fix _write_session_dir to use orchestrator as branch prefix (adapter infers from branch)
- Fix --json output: suppress informational 'Groups by' line when --json is active

…crier fragment
- Update comparing-runs.md Cohort comparison section:
  - Document --group-by FIELD option with examples
  - Add section on multi-cohort field grouping (3+ groups → side-by-side table)
  - Document cohort_tags manifest field and --group-by cohort usage
  - Expand Options table to include --group-by
  - Update Error conditions table with --group-by-specific errors
  - Update Differences table to show N-way comparison capability
- Create changes/261.feature towncrier fragment
… remove unused import, improve types, validate --until with --group-by, remove duplicate docs
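The grouping step named in the commit notes above can be sketched roughly as follows. This is an illustration only, not the code from the PR: the `SessionMeta` and `EvalSample` classes here are hypothetical stand-ins that carry just the fields mentioned in the commit messages, and skipping sessions whose field value is `None` is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionMeta:
    # Hypothetical stand-in: only the groupable fields named in the PR.
    orchestrator: Optional[str] = None
    provider: Optional[str] = None
    model_id: Optional[str] = None
    cohort: Optional[str] = None


@dataclass
class EvalSample:
    # Hypothetical stand-in for the real sample type.
    meta: SessionMeta


def group_by_field(samples: list[EvalSample], field_name: str) -> dict[str, list[EvalSample]]:
    """Group samples by one SessionMeta field.

    Assumption: samples whose value for the field is None are left out,
    so an all-None field produces an empty mapping.
    """
    groups: dict[str, list[EvalSample]] = defaultdict(list)
    for sample in samples:
        value = getattr(sample.meta, field_name)
        if value is not None:
            groups[str(value)].append(sample)
    return dict(groups)


samples = [
    EvalSample(SessionMeta(orchestrator="alpha")),
    EvalSample(SessionMeta(orchestrator="beta")),
    EvalSample(SessionMeta(orchestrator="alpha")),
]
groups = group_by_field(samples, "orchestrator")
```

Under these assumptions, grouping the same samples by `cohort` returns an empty mapping, which is the situation the "all-None" error path would have to detect.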
Summary
Adds multi-cohort comparison capability to `raki cohort` via a new `--group-by FIELD` option. Sessions can be grouped by any `SessionMeta` field (`orchestrator`, `model_id`, `provider`, `cohort`) to produce side-by-side metric comparisons. Two groups use the existing diff summary; three or more groups render a Rich Table and a multi-cohort HTML report. `SessionMeta` gains a `cohort: str | None` field for explicit cohort tagging, and `EvalManifest` gains `cohort_tags: list[str]` for manifest-level cohort label configuration. `--group-by` and `--since` are mutually exclusive.

Acceptance Criteria
- `raki cohort sessions/ --group-by orchestrator` groups sessions by orchestrator and renders a side-by-side table
- `--group-by` runs that yield exactly two groups use the existing diff summary (identical to `--since` output)
- `--group-by` and `--since` are mutually exclusive; using both raises a `UsageError`
- `--until` combined with `--group-by` raises a `UsageError` (not silently ignored)
- `--group-by cohort` with all-None cohort fields exits with a helpful error message
- `--fail-on-regression` with N>2 groups prints a notice (not silently ignored)
- `SessionMeta.cohort` and `EvalManifest.cohort_tags` fields are present and documented

Review Results
Findings addressed in this PR
| Location | Finding | Fix |
| --- | --- | --- |
| `cli.py:2043` | `--fail-on-regression` warning never printed for N>2 groups due to wrong condition (`group_count == 2` always False in that branch) | `if fail_on_regression:` |
| `cli.py:2018` | `import dataclasses` in the N>2 JSON output block | |
| `docs/comparing-runs.md:379` | | |
| `cli.py:1911` / `html_report.py:923` | `list` type annotations lose type information | `list[EvalSample]`, `dict[str, list[EvalSample]]`, and `list[CohortSummary]` with `TYPE_CHECKING` guards |
| `cli.py:1768` | `--until` silently ignored when combined with `--group-by` (data correctness concern) | `click.UsageError` when `--until` is passed with `--group-by` |

Refs #261
Assisted-by: Claude Opus 4.6 (1M context) noreply@anthropic.com
Assigned-by: decko
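
For reviewers, the option rules and the 2-vs-N branching described in this PR can be sketched as below. This is a stdlib-only approximation under stated assumptions: `UsageError` stands in for `click.UsageError`, and `validate_options`, `choose_renderer`, the error messages, and the notice wording are hypothetical names and text, not taken from `cli.py`.

```python
from typing import Optional


class UsageError(Exception):
    """Stand-in for click.UsageError so the sketch needs only the stdlib."""


def validate_options(since: Optional[str], until: Optional[str], group_by: Optional[str]) -> None:
    """Reject the option combinations the acceptance criteria rule out."""
    if since and group_by:
        raise UsageError("--since and --group-by are mutually exclusive")
    if not since and not group_by:
        raise UsageError("one of --since or --group-by is required")
    if until and group_by:
        # Raise instead of silently ignoring --until.
        raise UsageError("--until cannot be combined with --group-by")


def choose_renderer(group_count: int, fail_on_regression: bool) -> tuple[str, list[str]]:
    """Pick the output path and collect any notices to print."""
    notices: list[str] = []
    if group_count < 2:
        raise ValueError("need at least two groups to compare")
    if group_count == 2:
        return "diff", notices
    # N>2 branch: group_count == 2 can never hold here, so the notice
    # has to be keyed on the flag alone.
    if fail_on_regression:
        notices.append("--fail-on-regression is not applied to more than two groups")
    return "multi", notices
```

Keying the notice on `fail_on_regression` alone mirrors the fix recorded for `cli.py:2043`: inside the N>2 branch, a check that also requires `group_count == 2` is dead code.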