feat(default_adapter): configurable eval returns score + feedback (#62) #147
Conversation
Pull request overview
This PR adds configurable per-example evaluation functionality to the DefaultAdapter, enabling users to provide custom scoring logic and receive both scores and feedback for richer reflection capabilities.
Key changes:
- Added a `per_example_evaluator` parameter to the `DefaultAdapter` constructor, allowing custom evaluation logic (a usage sketch follows this list)
- Updated the `evaluate()` method to return per-example score and feedback in rollout outputs and trajectories
- Refactored feedback generation into a reusable `_default_score_and_feedback()` method, with backward compatibility
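To make the new option concrete, here is a minimal sketch of how a caller might wire a custom evaluator into `DefaultAdapter`. The `(data, assistant_response) -> (score, feedback)` shape follows the Protocol proposed in this PR, but the `model` argument, the `"answer"` key, and the exact constructor call are assumptions for illustration, not confirmed from this diff.

```python
from gepa.adapters.default_adapter.default_adapter import DefaultAdapter


def exact_match_evaluator(data: dict, assistant_response: str) -> tuple[float, str]:
    # Score 1.0 when the expected answer appears verbatim in the response,
    # otherwise 0.0, and return feedback explaining the decision.
    expected = data["answer"]
    if expected in assistant_response:
        return 1.0, f"Correct: the response includes the expected answer '{expected}'."
    return 0.0, f"Incorrect: the expected answer '{expected}' does not appear in the response."


adapter = DefaultAdapter(
    model="openai/gpt-4o-mini",  # assumed LiteLLM-style model identifier
    per_example_evaluator=exact_match_evaluator,
)
```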
Reviewed changes
Copilot reviewed 1 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| uv.lock | Version bump from 0.0.18 to 0.0.22 and added new greenlet wheel entries for musllinux_1_2 platforms |
| src/gepa/adapters/default_adapter/default_adapter.py | Added DefaultPerExampleEvaluator Protocol, per_example_evaluator parameter, score/feedback fields to TypedDicts, and refactored evaluation logic to support custom evaluators |
    score = 1.0 if data["answer"] in assistant_response else self.failure_score

    if score > 0.0:
        feedback = f"The generated response is correct. The response include the correct answer '{data['answer']}'"
Copilot AI commented on Dec 27, 2025
The feedback message contains a grammatical error: "The response include the correct answer" should be "The response includes the correct answer".
| feedback = f"The generated response is correct. The response include the correct answer '{data['answer']}'" | |
| feedback = f"The generated response is correct. The response includes the correct answer '{data['answer']}'" |
    def __call__(self, messages: Sequence[ChatMessage]) -> str: ...


    class DefaultPerExampleEvaluator(Protocol):
Copilot AI commented on Dec 27, 2025
The new DefaultPerExampleEvaluator Protocol lacks documentation explaining its purpose, parameters, and expected return values. Adding a docstring would help users understand how to implement custom evaluators and what the tuple components (score, feedback) represent.
Suggested change:

    class DefaultPerExampleEvaluator(Protocol):
        """Protocol for per-example evaluators used by :class:`DefaultAdapter`.

        Implementations should compare the generated assistant response against the
        provided data (including the ground-truth answer and any additional
        context) and return a numeric score together with human-readable feedback.

        Args:
            data: The original input example, including the expected ``answer``
                and any ``additional_context`` that may be useful for evaluation.
            assistant_response: The full response generated by the assistant/model
                for this example.

        Returns:
            A tuple ``(score, feedback)`` where:
            * ``score`` is a floating-point evaluation of the response
              (typically higher means better; ``DefaultAdapter`` may treat low
              scores as failures).
            * ``feedback`` is a natural-language explanation of the score that
              can be shown to users or used for further processing.
        """
    failure_score: float = 0.0,
    max_litellm_workers: int = 10,
    litellm_batch_completion_kwargs: dict[str, Any] = {},
    per_example_evaluator: DefaultPerExampleEvaluator | None = None,
Copilot AI commented on Dec 27, 2025
The new per_example_evaluator parameter and its usage lack test coverage. Consider adding tests to verify that custom evaluators are correctly invoked and that their returned score and feedback values are properly integrated into the rollout outputs and trajectories.
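A rough pytest sketch of the coverage this comment asks for. It assumes `DefaultAdapter` accepts a callable model matching the `__call__(messages) -> str` Protocol shown above, that `evaluate()` takes `(batch, candidate, capture_traces)`, and that scores and per-example feedback surface on the returned result and trajectories as the PR summary describes; all of these details should be checked against the actual implementation before adopting the test.

```python
from gepa.adapters.default_adapter.default_adapter import DefaultAdapter


def test_custom_evaluator_score_and_feedback_propagate():
    calls = []

    def spy_evaluator(data, assistant_response):
        # Record the call so we can assert the custom evaluator was invoked.
        calls.append((data, assistant_response))
        return 0.5, "half credit"

    adapter = DefaultAdapter(
        model=lambda messages: "canned answer",  # assumed callable-model support
        per_example_evaluator=spy_evaluator,
    )
    batch = [{"input": "2 + 2 = ?", "answer": "4"}]            # assumed example shape
    candidate = {"system_prompt": "You are a helpful assistant."}  # assumed candidate shape

    result = adapter.evaluate(batch, candidate, capture_traces=True)

    assert calls, "the custom evaluator should be invoked once per example"
    assert result.scores == [0.5]
    assert result.trajectories[0]["feedback"] == "half credit"
```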
Pull request overview
Copilot reviewed 2 out of 3 changed files in this pull request and generated 3 comments.
Thanks a lot for the PR @villurignanesh!
Fixes #62
Hi @LakshyAAAgrawal, could you please take a look?
What changed