feat: add StructuredOutputScorer for validating JSON outputs #23
Implements a new scorer for evaluating structured data outputs (JSON) from LLM responses. Supports strict/fuzzy/custom matching modes, partial credit scoring, and flexible field validation.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
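As a rough illustration of partial credit scoring, the sketch below scores an output by the fraction of expected fields that match. The function name `partialCreditScore` and the use of plain strict equality are assumptions made for this example, not the scorer's actual implementation.

```typescript
// Hypothetical sketch: score = matched expected fields / total expected fields.
type Fields = Record<string, unknown>;

function partialCreditScore(expected: Fields, actual: Fields): number {
  const keys = Object.keys(expected);
  // Assumption: an empty expectation counts as a full match.
  if (keys.length === 0) return 1;

  // Strict equality only; nested objects would need a recursive comparison.
  const matched = keys.filter((key) => actual[key] === expected[key]).length;
  return matched / keys.length;
}

const expectedFields: Fields = { city: "Paris", unit: "celsius", days: 3, verbose: false };
const actualFields: Fields = JSON.parse('{"city":"Paris","unit":"fahrenheit","days":3}');

// `city` and `days` match; `unit` differs and `verbose` is missing.
console.log(partialCreditScore(expectedFields, actualFields)); // 0.5
```

A real scorer would likely also report which fields failed, so that the score comes with a rationale.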
Codecov Report

❌ Patch coverage is

Additional details and impacted files

Coverage Diff

| Metric | main | #23 | +/- |
|---|---|---|---|
| Coverage | 84.50% | 91.35% | +6.85% |
| Files | 4 | 6 | +2 |
| Lines | 400 | 706 | +306 |
| Branches | 115 | 197 | +82 |
| Hits | 338 | 645 | +307 |
| Misses | 62 | 61 | -1 |
Added test cases to cover lines 51-52 in structuredOutputScorer.ts, which handle function validators in fuzzy matching mode. This addresses the missing coverage reported by Codecov in PR #23.

- Added test for successful function validator matching
- Added test for failing function validator matching
- Increases patch coverage to 100%

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
PR Feedback Addressed

I've reviewed and addressed the feedback on this PR:

1. ✅ Codecov Coverage Issue (Fixed)
2. ✅ Copilot Comment Review (No action needed)

Summary

All CI checks were passing before the changes, and the only actionable feedback was the missing test coverage, which has now been addressed. The PR is ready for review.
- Revert making 'arguments' field optional in ToolCall type (breaking change)
- Replace loose equality with strict equality for null/undefined checking
- Make type coercion more explicit in fuzzy matching
- Make error field configurable in StructuredOutputScorer

These changes improve code quality and prevent potential breaking changes while maintaining backward compatibility.
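The switch from loose to strict equality for null/undefined checks can be pictured with a small helper. The name `isNullish` is hypothetical and is not taken from the PR; it only shows the explicit form of the check.

```typescript
// Hypothetical helper, not taken from the PR: an explicit nullish check.
function isNullish(value: unknown): value is null | undefined {
  // Same result as `value == null`, but spelled out with strict equality
  // so no implicit coercion rules are involved.
  return value === null || value === undefined;
}

// Falsy values such as 0, "" and false are not nullish.
const samples: unknown[] = [null, undefined, 0, "", false];
const nullishCount = samples.filter(isNullish).length;
console.log(nullishCount); // 2
```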
- Changed expected parameter type from string to any in toEval matcher
- Added @deprecated JSDoc comments to toEval (use describeEval instead)
- This fixes the type mismatch when using StructuredOutputScorer with toEval

The toEval matcher now properly supports scorers that expect non-string expected values, like StructuredOutputScorer, which expects objects.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Create shared utils module with common matching logic
- Extract BaseMatcherConfig interface for consistent config options
- Move strictEquals, fuzzyMatch, and formatting utilities to utils
- Add createMatcher factory for flexible matching strategies
- Update ToolCallScorer and StructuredOutputScorer to use shared utilities
- Export utility functions from scorers index for custom implementations

This refactoring reduces code duplication and provides a consistent foundation for future scorer implementations while maintaining backward compatibility.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Add FuzzyMatchOptions interface for fine-grained control over fuzzy matching behavior
- Support customizable options: case sensitivity, substring matching, numeric tolerance, array ordering, and type coercion
- Update StructuredOutputScorer and ToolCallScorer to accept fuzzyOptions configuration
- Set appropriate defaults: tool calls use substring matching, structured output does not
- Refactor fuzzyMatch function to use options object instead of context string

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
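To make the options listed above concrete, here is a minimal sketch of an options-driven fuzzy comparison for primitive values. The interface name `FuzzyOptionsSketch`, its field names, and the defaults are assumptions for illustration; the PR's actual `FuzzyMatchOptions` may be shaped differently, and array ordering is left out of this sketch.

```typescript
// Hypothetical option names; the PR's actual FuzzyMatchOptions fields may differ.
interface FuzzyOptionsSketch {
  caseInsensitive?: boolean; // compare strings ignoring case
  substring?: boolean; // accept `actual` that merely contains `expected`
  numericTolerance?: number; // maximum absolute difference for numbers
  coerceTypes?: boolean; // allow "42" to match 42 and vice versa
}

function fuzzyPrimitiveMatch(
  expected: string | number,
  actual: unknown,
  options: FuzzyOptionsSketch = {},
): boolean {
  const {
    caseInsensitive = true,
    substring = false,
    numericTolerance = 0,
    coerceTypes = false,
  } = options;

  if (typeof expected === "number") {
    let candidate: number;
    if (typeof actual === "number") candidate = actual;
    else if (coerceTypes && typeof actual === "string" && actual.trim() !== "") candidate = Number(actual);
    else return false;
    if (Number.isNaN(candidate)) return false;
    return Math.abs(candidate - expected) <= numericTolerance;
  }

  // From here on, `expected` is a string.
  let text: string;
  if (typeof actual === "string") text = actual;
  else if (coerceTypes && typeof actual === "number") text = String(actual);
  else return false;

  const left = caseInsensitive ? text.toLowerCase() : text;
  const right = caseInsensitive ? expected.toLowerCase() : expected;
  return substring ? left.includes(right) : left === right;
}

// Per the commit message: tool calls default to substring matching, structured output does not.
const toolCallDefaults: FuzzyOptionsSketch = { substring: true };
const structuredDefaults: FuzzyOptionsSketch = { substring: false };

console.log(fuzzyPrimitiveMatch("paris", "Paris, France", toolCallDefaults)); // true
console.log(fuzzyPrimitiveMatch("paris", "Paris, France", structuredDefaults)); // false
console.log(fuzzyPrimitiveMatch(10, 10.4, { numericTolerance: 0.5 })); // true
console.log(fuzzyPrimitiveMatch(42, "42", { coerceTypes: true })); // true
```

Passing an options object rather than a context string keeps each behavior independently switchable, which is presumably why the refactor went that way.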
- Ensure fuzzy match finds unique matches for duplicate expected array items
- Add comprehensive tests for fuzzy matching with extra fields and duplicates
- Update fuzzyMatch to consume matched items, preventing double-matching

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
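The idea of consuming matched items can be sketched as follows. The helper name `matchWithoutReuse` is hypothetical and the PR's `fuzzyMatch` may be structured differently; the point is only that each actual item can satisfy at most one expected item.

```typescript
// Hypothetical sketch: every expected item must claim its own, distinct actual item.
function matchWithoutReuse<T>(
  expected: T[],
  actual: T[],
  isMatch: (expectedItem: T, actualItem: T) => boolean,
): boolean {
  // Work on a copy so the caller's array is left untouched.
  const remaining = [...actual];
  for (const wanted of expected) {
    const index = remaining.findIndex((candidate) => isMatch(wanted, candidate));
    if (index === -1) return false;
    // Consume the matched item so a duplicate expectation cannot reuse it.
    remaining.splice(index, 1);
  }
  return true;
}

const sameIgnoringCase = (e: string, a: string): boolean => e.toLowerCase() === a.toLowerCase();

// Two expected "a" items need two distinct matches; extra items are tolerated.
console.log(matchWithoutReuse(["a", "a"], ["A", "b"], sameIgnoringCase)); // false
console.log(matchWithoutReuse(["a", "a"], ["A", "b", "a"], sameIgnoringCase)); // true
```

This greedy first-match strategy is simple, but with very permissive matchers it can reject inputs that a full bipartite matching would accept; whether that matters depends on the matcher in use.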
- Change the type of `output` in the `Score` metadata from `string` to `any` to accommodate various data types.
- Improve the `formatScores` function to handle both string and object outputs, ensuring proper formatting for each case.
- Add comments to clarify the output formatting logic.

These changes enhance the flexibility of the scoring system and improve the presentation of results.
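One way to handle both string and object outputs when formatting is sketched below. The helper name `formatOutput` is an assumption for illustration; it is not the PR's `formatScores` function.

```typescript
// Hypothetical helper: render a scorer output for display.
function formatOutput(output: unknown): string {
  // Strings are shown as-is, without surrounding quotes.
  if (typeof output === "string") return output;

  // JSON.stringify returns undefined for `undefined` and for functions,
  // so fall back to String() in those cases. Circular structures would throw.
  const json: string | undefined = JSON.stringify(output, null, 2);
  return json ?? String(output);
}

console.log(formatOutput("plain text")); // plain text
console.log(formatOutput({ city: "Paris" }));
console.log(formatOutput(undefined)); // undefined
```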
…output

- Added weather and calculator tools for demonstration in AI SDK integration tests.
- Updated the real AI task to use the new tools, showcasing end-to-end functionality.
- Implemented structured output task to return JSON formatted responses.
- Enhanced test cases to validate tool usage and structured output, allowing for flexible matching and scoring.

These changes improve the testing framework for AI SDK integrations, ensuring comprehensive coverage and better validation of tool interactions.
…s in StructuredOutputScorer

- Introduced a new test case to verify that expecting undefined matches a missing JSON field.
- Added a contrasting test to ensure that expecting a specific value does not match when the field is absent.
- Enhanced the rationale checks in the test results for clarity.

These additions improve the robustness of the scoring tests by covering edge cases in field expectations.
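The edge case covered by these tests follows from how JavaScript reads absent properties. The sketch below uses a hypothetical `fieldMatches` helper, not the scorer's real code, to show why expecting `undefined` lines up with a missing field.

```typescript
// Hypothetical helper: compare one field of a parsed JSON object with an expectation.
function fieldMatches(
  actual: Record<string, unknown>,
  key: string,
  expectedValue: unknown,
): boolean {
  // Reading a missing property yields undefined, so expecting undefined
  // matches an absent field under strict equality.
  return actual[key] === expectedValue;
}

// JSON has no undefined value, so "missing" is the only way a field can be undefined.
const parsed = JSON.parse('{"name":"Ada","nickname":null}') as Record<string, unknown>;

console.log(fieldMatches(parsed, "age", undefined)); // true
console.log(fieldMatches(parsed, "age", 36)); // false
console.log(fieldMatches(parsed, "nickname", undefined)); // false
```

The last call returns false because an explicit `null` is a present value, which strict equality keeps distinct from a missing field.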
- Removed unnecessary conditional handling for match function parameters in StructuredOutputScorer.
- Updated createMatcher function to accept an optional context parameter for enhanced flexibility.

These changes streamline the matching process and improve the clarity of the scoring logic.
… context parameter

- Updated strictEquals and fuzzyMatch functions to accept an optional context parameter for improved debugging and clarity during comparisons.
- Modified related logic to propagate the context parameter through nested calls, ensuring consistent behavior across matching operations.

These changes enhance the flexibility and traceability of the matching functions, making comparison results easier to understand.
Pull Request Overview
Adds a new StructuredOutputScorer for evaluating structured JSON outputs from language models with support for both strict and fuzzy matching modes. This scorer enables flexible validation of complex data structures with configurable validation options and comprehensive error handling.
- New `StructuredOutputScorer` implementation with strict/fuzzy/custom matching modes
- Shared utilities abstraction for common scorer functionality
- Integration updates to export the new scorer and utilities
Reviewed Changes
Copilot reviewed 9 out of 11 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/scorers/utils.ts | Shared utility functions for strict/fuzzy matching and scorer configuration |
| src/scorers/structuredOutputScorer.ts | Main scorer implementation for validating JSON structured outputs |
| src/scorers/structuredOutputScorer.test.ts | Comprehensive test suite covering all scorer functionality |
| src/scorers/toolCallScorer.ts | Refactored to use shared utilities from utils.ts |
| src/scorers/index.ts | Exports new scorer and shared utilities |
| src/index.ts | Updates main exports and Score type to support any output type |
| src/ai-sdk-integration.test.ts | Updated integration tests demonstrating real AI SDK usage |
| package.json | Updated dependencies for AI SDK integration |
| README.md | Documentation updates about deprecated toEval matcher |
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported

    // Handle regex patterns
    if (expected instanceof RegExp) {
      return typeof actual === "string" && expected.test(actual);
    }

    // Handle functions (custom validators)
    if (typeof expected === "function") {
      return expected(actual);
    }

[nitpick] The regex pattern handling should be moved before null/undefined checks for better performance, as regex checks are less common and more expensive than null checks.

    // Handle regex patterns
    if (expected instanceof RegExp) {
      return typeof actual === "string" && expected.test(actual);
    }
    // Handle functions (custom validators)
    if (typeof expected === "function") {
      return expected(actual);
    }

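For readers who want to try the two branches discussed in this thread, here is a self-contained sketch built around them. The `Validator` and `ExpectedValue` types, the `matchValue` wrapper, and the strict-equality fallback are assumptions added for the example; only the regex and function branches mirror the reviewed code.

```typescript
// Hypothetical types for the sketch; the PR's own types are not shown in this thread.
type Validator = (actual: unknown) => boolean;
type ExpectedValue = RegExp | Validator | string | number | boolean | null;

function matchValue(expected: ExpectedValue, actual: unknown): boolean {
  // Handle regex patterns: only strings can match a pattern.
  if (expected instanceof RegExp) {
    return typeof actual === "string" && expected.test(actual);
  }
  // Handle functions (custom validators): the validator decides.
  if (typeof expected === "function") {
    return expected(actual);
  }
  // Assumed fallback for the sketch: plain strict equality.
  return expected === actual;
}

const emailLike = /^[^@\s]+@[^@\s]+$/;
const isAdult: Validator = (value) => typeof value === "number" && value >= 18;

console.log(matchValue(emailLike, "ada@example.com")); // true
console.log(matchValue(emailLike, 42)); // false
console.log(matchValue(isAdult, 36)); // true
console.log(matchValue(isAdult, "36")); // false
```

One caveat worth keeping in mind: a `RegExp` created with the `g` or `y` flag keeps state in `lastIndex`, so repeated `test` calls on the same pattern object can alternate between true and false. Patterns used as expectations are safer without those flags.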
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Modified the test command in package.json to use 'vitest run' for consistency.
- Added a new 'test:watch' command for easier development testing.
- Updated README to reflect changes in the AI SDK integration, including new response handling and tool usage.
- Refactored AI SDK integration tests to improve clarity and structure, ensuring better validation of tool interactions.

These changes enhance the testing framework and improve documentation for better developer experience.

    name: string;
    arguments: Record<string, any>;

    // Result and timing

this should've never been in here tbqh - LLM hallucinated adding it to our types
Summary

- `StructuredOutputScorer` for evaluating structured JSON outputs from language models

Changes

- New scorer implementation: `src/scorers/structuredOutputScorer.ts` (362 lines)
- Comprehensive test suite: `src/scorers/structuredOutputScorer.test.ts` (483 lines)
- Integration updates: `src/scorers/index.ts`, `src/index.ts`

Features

Matching modes:

- `strict`: Exact equality required (default)
- `fuzzy`: Case-insensitive strings, numeric tolerance, regex patterns, subset matching

Configuration options:

- `requireAll`: Whether all expected fields must be present
- `allowExtras`: Whether to allow additional fields beyond expected
- `debug`: Enable detailed logging for troubleshooting

Type of change

Testing

Related issues
🤖 Generated with Claude Code
Co-Authored-By: Claude noreply@anthropic.com