
Conversation

@villurignanesh commented Dec 19, 2025

Fixes #62

Hi @LakshyAAAgrawal, could you please take a look?

What changed

  • Added a configurable per-example evaluation hook to DefaultAdapter so scoring logic is user-defined (a usage sketch follows below).
  • Updated evaluate() to return per-example score and feedback in the rollout output/trace to support richer reflection.
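
To illustrate the intended usage, here is a minimal sketch (not code from this PR: the import path is inferred from the changed file's location and the model argument is illustrative; per_example_evaluator and the (data, assistant_response) -> (score, feedback) signature come from the diff and review comments below):

# Hypothetical usage sketch, not part of the diff.
from gepa.adapters.default_adapter.default_adapter import DefaultAdapter

def exact_match_evaluator(data, assistant_response):
    # Matches the DefaultPerExampleEvaluator Protocol: take the input example and the
    # generated response, return a (score, feedback) tuple.
    if data["answer"].strip() in assistant_response:
        return 1.0, f"The response includes the expected answer '{data['answer']}'."
    return 0.0, f"The response does not include the expected answer '{data['answer']}'."

adapter = DefaultAdapter(
    model="openai/gpt-4o-mini",  # illustrative; use whatever model string you normally pass
    per_example_evaluator=exact_match_evaluator,
)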

semanticdiff-com bot commented Dec 19, 2025

Review changes with  SemanticDiff

Changed Files

  • src/gepa/api.py: 100% smaller
  • src/gepa/adapters/default_adapter/default_adapter.py: 45% smaller
  • uv.lock: unsupported file format

Copilot AI (Contributor) left a comment

Pull request overview

This PR adds configurable per-example evaluation functionality to the DefaultAdapter, enabling users to provide custom scoring logic and receive both scores and feedback for richer reflection capabilities.

Key changes:

  • Added per_example_evaluator parameter to DefaultAdapter constructor allowing custom evaluation logic
  • Updated evaluate() method to return per-example score and feedback in rollout outputs and trajectories
  • Refactored feedback generation into a reusable _default_score_and_feedback() method with backward compatibility (see the dispatch sketch after this list)
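
For readers skimming the diff, the per-example dispatch presumably reduces to something like the sketch below (only per_example_evaluator, _default_score_and_feedback, and the substring-based default are taken from this PR; the surrounding structure is assumed):

# Inside the adapter's per-example scoring path (structure assumed, not from the diff):
if self.per_example_evaluator is not None:
    # User-supplied hook returns (score, feedback) for this example.
    score, feedback = self.per_example_evaluator(data, assistant_response)
else:
    # Backward-compatible default: substring match against data["answer"],
    # with self.failure_score on a miss.
    score, feedback = self._default_score_and_feedback(data, assistant_response)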

Reviewed changes

Copilot reviewed 1 out of 2 changed files in this pull request and generated 3 comments.

File descriptions:

  • uv.lock: Version bump from 0.0.18 to 0.0.22 and new greenlet wheel entries for musllinux_1_2 platforms
  • src/gepa/adapters/default_adapter/default_adapter.py: Added the DefaultPerExampleEvaluator Protocol, the per_example_evaluator parameter, score/feedback fields on the TypedDicts, and refactored the evaluation logic to support custom evaluators


score = 1.0 if data["answer"] in assistant_response else self.failure_score

if score > 0.0:
    feedback = f"The generated response is correct. The response include the correct answer '{data['answer']}'"
Copilot AI Dec 27, 2025

The feedback message contains a grammatical error: "The response include the correct answer" should be "The response includes the correct answer".

Suggested change
feedback = f"The generated response is correct. The response include the correct answer '{data['answer']}'"
feedback = f"The generated response is correct. The response includes the correct answer '{data['answer']}'"

def __call__(self, messages: Sequence[ChatMessage]) -> str: ...


class DefaultPerExampleEvaluator(Protocol):
Copilot AI Dec 27, 2025

The new DefaultPerExampleEvaluator Protocol lacks documentation explaining its purpose, parameters, and expected return values. Adding a docstring would help users understand how to implement custom evaluators and what the tuple components (score, feedback) represent.

Suggested change
class DefaultPerExampleEvaluator(Protocol):
class DefaultPerExampleEvaluator(Protocol):
    """Protocol for per-example evaluators used by :class:`DefaultAdapter`.

    Implementations should compare the generated assistant response against the
    provided data (including the ground-truth answer and any additional
    context) and return a numeric score together with human-readable feedback.

    Args:
        data: The original input example, including the expected ``answer``
            and any ``additional_context`` that may be useful for evaluation.
        assistant_response: The full response generated by the assistant/model
            for this example.

    Returns:
        A tuple ``(score, feedback)`` where:

        * ``score`` is a floating-point evaluation of the response
          (typically higher means better; ``DefaultAdapter`` may treat low
          scores as failures).
        * ``feedback`` is a natural-language explanation of the score that
          can be shown to users or used for further processing.
    """

failure_score: float = 0.0,
max_litellm_workers: int = 10,
litellm_batch_completion_kwargs: dict[str, Any] = {},
per_example_evaluator: DefaultPerExampleEvaluator | None = None,
Copilot AI Dec 27, 2025

The new per_example_evaluator parameter and its usage lack test coverage. Consider adding tests to verify that custom evaluators are correctly invoked and that their returned score and feedback values are properly integrated into the rollout outputs and trajectories.
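
As an editorial illustration only, such a test might look roughly like the sketch below. It assumes evaluate() follows GEPA's adapter interface (batch, candidate, capture_traces) and returns an object exposing scores and trajectories, that examples carry "input"/"answer" fields, and that the adapter's LLM call can be stubbed out; none of these details are confirmed by this PR.

def test_custom_evaluator_is_invoked_and_propagated():
    calls = []

    def fake_evaluator(data, assistant_response):
        # Record the call so we can assert the hook was actually used.
        calls.append((data, assistant_response))
        return 0.5, "half credit"

    adapter = DefaultAdapter(
        model="openai/gpt-4o-mini",  # illustrative; stub or mock the LLM call in practice
        per_example_evaluator=fake_evaluator,
    )
    # ... monkeypatch the adapter's LLM call here so the test is deterministic ...

    batch = [{"input": "What is 2 + 2?", "answer": "4", "additional_context": {}}]
    result = adapter.evaluate(batch, candidate={"system_prompt": "Answer concisely."}, capture_traces=True)

    assert len(calls) == 1                     # custom evaluator invoked once per example
    assert result.scores == [0.5]              # its score is what the rollout reports
    assert result.trajectories[0]["feedback"] == "half credit"  # feedback surfaced in the trace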

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 2 out of 3 changed files in this pull request and generated 3 comments.



@LakshyAAAgrawal merged commit edeee59 into gepa-ai:main Dec 28, 2025
16 checks passed
@LakshyAAAgrawal (Contributor) commented

Thanks a lot for the PR @villurignanesh!



Development

Successfully merging this pull request may close these issues.

DefaultAdapter's eval function should be configurable, it can return score and feedback

3 participants