[evals] Fix tool calls score rendering #1613
Signed-off-by: David Gageot <david.gageot@docker.com>
```diff
 printMetric(out, "Sizes", summary.SizesPassed, summary.SizesTotal)
-printMetricFloat(out, "Tool Calls", summary.ToolsPassed, summary.ToolsTotal)
+printF1Score(out, "Tool Calls", summary.ToolsF1Sum, summary.ToolsCount)
 printMetric(out, "Handoffs", summary.HandoffsPassed, summary.HandoffsTotal)
```
Loss of precision when casting Relevance scores from `float64` to `int`
The `RelevancePassed` and `RelevanceTotal` fields in the `Summary` struct are defined as `float64` (types.go lines 99-100) and can hold fractional values from accumulated relevance scores. Casting them to `int` here truncates the decimal portion, losing precision.
This is inconsistent with:
- how the Tool Calls metric preserves float precision (line 120 uses `printF1Score`);
- how the underlying values are stored (`float64` in the `Summary` struct);
- how relevance scores are accumulated (they can be fractional).
For example, a RelevancePassed value of 8.5 would display as 8, making the displayed metric inconsistent with the actual accumulated scores.
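The truncation is just Go's numeric conversion rule: converting a `float64` to `int` discards the fractional part (truncation toward zero). A minimal illustration:

```go
package main

import "fmt"

func main() {
	passed := 8.5 // an accumulated relevance score with a fractional part
	// Go's float64-to-int conversion truncates toward zero,
	// so the displayed value silently drops the .5.
	fmt.Println(int(passed)) // prints 8
}
```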
Suggestion: if you want to maintain float precision for Relevance, as is done for Tool Calls, consider adding a similar display function or reverting to the previous `printMetricFloat` approach. If integer display is intentional, consider documenting why Relevance is treated differently from the other float-based metrics.