feat(default_adapter): configurable eval returns score + feedback (#62) #147
Conversation
Pull request overview
This PR adds configurable per-example evaluation functionality to the DefaultAdapter, enabling users to provide custom scoring logic and receive both scores and feedback for richer reflection capabilities.
Key changes:
- Added a `per_example_evaluator` parameter to the `DefaultAdapter` constructor, allowing custom evaluation logic (a usage sketch follows this list)
- Updated the `evaluate()` method to return per-example score and feedback in rollout outputs and trajectories
- Refactored feedback generation into a reusable `_default_score_and_feedback()` method, with backward compatibility
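To make the new option concrete, here is a minimal sketch of how a caller might wire a custom evaluator into `DefaultAdapter`. The `(data, assistant_response) -> (score, feedback)` shape follows the Protocol proposed in this PR, but the `model` argument, the `"answer"` key, and the exact constructor call are assumptions for illustration, not confirmed from this diff.

```python
from gepa.adapters.default_adapter.default_adapter import DefaultAdapter


def exact_match_evaluator(data: dict, assistant_response: str) -> tuple[float, str]:
    # Score 1.0 when the expected answer appears verbatim in the response,
    # otherwise 0.0, and return feedback explaining the decision.
    expected = data["answer"]
    if expected in assistant_response:
        return 1.0, f"Correct: the response includes the expected answer '{expected}'."
    return 0.0, f"Incorrect: the expected answer '{expected}' does not appear in the response."


adapter = DefaultAdapter(
    model="openai/gpt-4o-mini",  # assumed LiteLLM-style model identifier
    per_example_evaluator=exact_match_evaluator,
)
```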
Reviewed changes
Copilot reviewed 1 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| uv.lock | Version bump from 0.0.18 to 0.0.22 and added new greenlet wheel entries for musllinux_1_2 platforms |
| src/gepa/adapters/default_adapter/default_adapter.py | Added DefaultPerExampleEvaluator Protocol, per_example_evaluator parameter, score/feedback fields to TypedDicts, and refactored evaluation logic to support custom evaluators |
    score = 1.0 if data["answer"] in assistant_response else self.failure_score

    if score > 0.0:
        feedback = f"The generated response is correct. The response include the correct answer '{data['answer']}'"
Copilot AI commented on Dec 27, 2025
The feedback message contains a grammatical error: "The response include the correct answer" should be "The response includes the correct answer".
| feedback = f"The generated response is correct. The response include the correct answer '{data['answer']}'" | |
| feedback = f"The generated response is correct. The response includes the correct answer '{data['answer']}'" |
    def __call__(self, messages: Sequence[ChatMessage]) -> str: ...


    class DefaultPerExampleEvaluator(Protocol):
Copilot AI commented on Dec 27, 2025
The new DefaultPerExampleEvaluator Protocol lacks documentation explaining its purpose, parameters, and expected return values. Adding a docstring would help users understand how to implement custom evaluators and what the tuple components (score, feedback) represent.
Suggested change:

    class DefaultPerExampleEvaluator(Protocol):
        """Protocol for per-example evaluators used by :class:`DefaultAdapter`.

        Implementations should compare the generated assistant response against the
        provided data (including the ground-truth answer and any additional
        context) and return a numeric score together with human-readable feedback.

        Args:
            data: The original input example, including the expected ``answer``
                and any ``additional_context`` that may be useful for evaluation.
            assistant_response: The full response generated by the assistant/model
                for this example.

        Returns:
            A tuple ``(score, feedback)`` where:
            * ``score`` is a floating-point evaluation of the response
              (typically higher means better; ``DefaultAdapter`` may treat low
              scores as failures).
            * ``feedback`` is a natural-language explanation of the score that
              can be shown to users or used for further processing.
        """
    failure_score: float = 0.0,
    max_litellm_workers: int = 10,
    litellm_batch_completion_kwargs: dict[str, Any] = {},
    per_example_evaluator: DefaultPerExampleEvaluator | None = None,
Copilot AI commented on Dec 27, 2025
The new per_example_evaluator parameter and its usage lack test coverage. Consider adding tests to verify that custom evaluators are correctly invoked and that their returned score and feedback values are properly integrated into the rollout outputs and trajectories.
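A rough pytest sketch of the coverage this comment asks for. It assumes `DefaultAdapter` accepts a callable model matching the `__call__(messages) -> str` Protocol shown above, that `evaluate()` takes `(batch, candidate, capture_traces)`, and that scores and per-example feedback surface on the returned result and trajectories as the PR summary describes; all of these details should be checked against the actual implementation before adopting the test.

```python
from gepa.adapters.default_adapter.default_adapter import DefaultAdapter


def test_custom_evaluator_score_and_feedback_propagate():
    calls = []

    def spy_evaluator(data, assistant_response):
        # Record the call so we can assert the custom evaluator was invoked.
        calls.append((data, assistant_response))
        return 0.5, "half credit"

    adapter = DefaultAdapter(
        model=lambda messages: "canned answer",  # assumed callable-model support
        per_example_evaluator=spy_evaluator,
    )
    batch = [{"input": "2 + 2 = ?", "answer": "4"}]            # assumed example shape
    candidate = {"system_prompt": "You are a helpful assistant."}  # assumed candidate shape

    result = adapter.evaluate(batch, candidate, capture_traces=True)

    assert calls, "the custom evaluator should be invoked once per example"
    assert result.scores == [0.5]
    assert result.trajectories[0]["feedback"] == "half credit"
```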
Pull request overview
Copilot reviewed 2 out of 3 changed files in this pull request and generated 3 comments.
Thanks a lot for the PR @villurignanesh!
Fixes #62
Hi @LakshyAAAgrawal, could you please take a look?
What changed