…prediction_accuracy
- Add .metric-context CSS class for muted sub-text in score cards
- Show static threshold hint (>=80% = calibrated) for triage_calibration
- Show micro precision/recall breakdown for file_prediction_accuracy when metric_details are available, with null guards
- 2 new tests covering both metric-context blocks
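The null-guard behaviour described in this commit can be sketched in plain Python. This is only an illustration of the logic, not the template code from the PR: the function name `metric_context`, the shape of `details` (a dict keyed by metric name), and the keys `micro_precision` and `micro_recall` are all assumptions made for the example.

```python
from typing import Optional


def metric_context(metric: str, details: Optional[dict]) -> Optional[str]:
    """Return muted sub-text for a score card, or None when nothing should render."""
    if metric == "triage_calibration":
        # Static hint; the threshold wording is taken from the commit message.
        return ">=80% = calibrated"
    if metric == "file_prediction_accuracy":
        # Null guards: details, the per-metric entry, or either value may be missing.
        # The key names below are hypothetical.
        entry = (details or {}).get(metric) or {}
        precision = entry.get("micro_precision")
        recall = entry.get("micro_recall")
        if precision is None or recall is None:
            return None
        return f"micro precision {precision:.0%}, recall {recall:.0%}"
    return None
```

Returning None rather than an empty string lets the caller skip the .metric-context element entirely when the breakdown is unavailable.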
…drill-down
- Add title= attribute to .score-chip <span> sourcing the metric description from metric_metadata
- Tooltips are visible on hover in all browsers natively
- 1 new test verifying title= attribute and description text
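A minimal sketch of how a chip might pick up its tooltip, written in plain Python instead of the template. The helper name `score_chip` and the assumed layout of `metric_metadata` (metric name mapped to a dict with a `description` key) are hypothetical; only the `score-chip` class and the `title=` attribute come from the commit message.

```python
from html import escape


def score_chip(metric: str, score_text: str, metric_metadata: dict) -> str:
    """Render a score chip, adding title= only when a description is known."""
    # Assumed registry shape: {"metric_name": {"description": "..."}}.
    description = (metric_metadata.get(metric) or {}).get("description")
    # Escape quotes so the description cannot break out of the attribute.
    title_attr = f' title="{escape(description, quote=True)}"' if description else ""
    return f'<span class="score-chip"{title_attr}>{escape(score_text)}</span>'
```

Omitting the attribute when no description exists avoids an empty tooltip on hover.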
…ly note
- Merge the two fallback {% elif %} / {% else %} branches into a single
{% else %} that renders only a <p class="footnote"> guidance note
- No category-section div is rendered when retrieval metrics are absent,
regardless of whether has_retrieval is True or False
- Update test_no_retrieval_metrics_shows_empty_state: assert footnote class
and 'Retrieval Quality' absent (was: assert empty-state and old text)
- Update test_footnote_when_no_retrieval: assert footnote class (was: assert
one of two legacy text strings)
- 2 new tests verifying no category-section div in both fallback cases
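The merged fallback can be illustrated in plain Python. This is a sketch of the branching only, under stated assumptions: the real change is in a template, the function name `render_retrieval_block` is invented, the list markup inside the section is made up, and the footnote wording is shortened to a placeholder.

```python
from html import escape

# Placeholder wording; the actual note text lives in the template.
FOOTNOTE_TEXT = "Retrieval metrics omitted"


def render_retrieval_block(retrieval_metrics: dict, has_retrieval: bool) -> str:
    """Render the retrieval section, or a single footnote when metrics are absent."""
    if retrieval_metrics:
        rows = "".join(
            f"<li>{escape(name)}: {value:.0%}</li>"
            for name, value in retrieval_metrics.items()
        )
        return (
            '<div class="category-section"><h3>Retrieval Quality</h3>'
            f"<ul>{rows}</ul></div>"
        )
    # Single fallback branch: has_retrieval deliberately does not change the output.
    return f'<p class="footnote">{escape(FOOTNOTE_TEXT)}</p>'
```

Because the fallback ignores `has_retrieval`, both empty cases produce identical markup, which is what the two new tests in this commit check for.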
Summary
Adds metric context to score cards in the HTML report so that users can interpret scores without consulting an external reference.
Changes
- file_prediction_accuracy cards show live micro precision and recall values read from report.metric_details with null guards.
- <span class="score-chip"> elements now carry title= attributes populated from the metric metadata registry, so hovering reveals the metric description.
- category-section div is dropped and only a single footnote line is rendered: Retrieval metrics omitted — run with --judge to enable LLM-backed evaluation.
- changes/289.feature added.
Acceptance Criteria
- triage_calibration score card shows threshold hint text
- file_prediction_accuracy score card shows micro precision and recall when available
- … category-section div rendered)
- changes/289.feature
Review Results
Refs #289
Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: decko