
[evals] Fix tool calls score rendering #1613

Merged: dgageot merged 1 commit into docker:main from dgageot:improve-evals-4 (Feb 5, 2026)

Conversation


@dgageot (Member) commented on Feb 5, 2026:

No description provided.

Signed-off-by: David Gageot <david.gageot@docker.com>
@dgageot requested a review from a team as a code owner on February 5, 2026 at 14:12
@dgageot changed the title from "Fix tool calls score rendering" to "[evals] Fix tool calls score rendering" on Feb 5, 2026
@github-actions bot (Contributor) left a comment:


Review Summary

Found 1 issue: precision loss in the Relevance metric display. Casting from float64 to int truncates fractional values, which is inconsistent with how these values are stored and accumulated.

printMetric(out, "Sizes", summary.SizesPassed, summary.SizesTotal)
printMetricFloat(out, "Tool Calls", summary.ToolsPassed, summary.ToolsTotal)
printF1Score(out, "Tool Calls", summary.ToolsF1Sum, summary.ToolsCount)
printMetric(out, "Handoffs", summary.HandoffsPassed, summary.HandoffsTotal)
@github-actions bot (Contributor) commented on the diff:

Loss of precision when casting Relevance scores from float64 to int

The RelevancePassed and RelevanceTotal fields in the Summary struct are defined as float64 (types.go lines 99-100) and can contain fractional values from accumulated relevance scores. Casting them to int here truncates decimal portions, losing precision.

This is inconsistent with:

  • How the Tool Calls metric preserves float precision (line 120 uses printF1Score)
  • How the underlying values are stored (float64 in the Summary struct)
  • How relevance scores are accumulated (can be fractional)

For example, a RelevancePassed value of 8.5 would display as 8, making the displayed metric inconsistent with the actual accumulated scores.

Suggestion: If you want to maintain float precision for Relevance like Tool Calls, consider creating a similar display function or using the previous printMetricFloat approach. If integer display is intentional, consider documenting why relevance is treated differently than other float-based metrics.
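To make the truncation concrete, here is a self-contained sketch; the printMetric and printMetricFloat bodies below are assumed shapes inferred from their call sites, not the actual evals code:

```go
// Illustration of the truncation the review describes. Both helper bodies
// are hypothetical; only their names and the float64 fields come from the PR.
package main

import (
	"fmt"
	"io"
	"os"
)

// printMetric renders integer pass counts; the int cast at its call site is
// where fractional relevance scores lose their decimal portion.
func printMetric(out io.Writer, name string, passed, total int) {
	fmt.Fprintf(out, "%s: %d/%d\n", name, passed, total)
}

// printMetricFloat keeps fractional accumulated scores intact.
func printMetricFloat(out io.Writer, name string, passed, total float64) {
	fmt.Fprintf(out, "%s: %.1f/%.1f\n", name, passed, total)
}

func main() {
	relevancePassed, relevanceTotal := 8.5, 10.0

	// The pattern the review flags: int(8.5) truncates to 8.
	printMetric(os.Stdout, "Relevance", int(relevancePassed), int(relevanceTotal)) // Relevance: 8/10

	// A float-preserving alternative, as the review suggests.
	printMetricFloat(os.Stdout, "Relevance", relevancePassed, relevanceTotal) // Relevance: 8.5/10.0
}
```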

dgageot merged commit 69fa161 into docker:main on Feb 5, 2026
8 checks passed