…182) Add 24 tests covering ScoringState, score_rows, build_max_tokens_result, build_silent_zero_result, and enrich_details_with_failures — all of which will be extracted from the 4 duplicated Ragas metric files. Tests are failing (module does not exist yet) per TDD convention. Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Assigned-by: orchestrator
The four Ragas metric files (faithfulness, precision, recall, relevancy) all duplicated ~60 lines of identical error-handling logic:

- asyncio.Semaphore-bounded parallel scoring
- InstructorSilentZeroError detection and silent-zero failure tracking
- max_tokens error categorisation
- judge_logger calls on success and failure
- Post-loop early-return builders for all-failed scenarios
- details dict enrichment with failure counts/warnings

Extract these into:

- ScoringState: accumulates scores, sample_scores, and two failure lists
- score_rows(): the shared async scoring coroutine
- build_max_tokens_result(): early-return guard for max_tokens failures
- build_silent_zero_result(): early-return guard for silent-zero failures
- enrich_details_with_failures(): adds failure keys to the details dict

All 24 new tests pass. Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Assigned-by: orchestrator
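For reviewers who want a picture of the extracted pattern, here is a minimal sketch of semaphore-bounded scoring with failure tracking. It is not the code in this PR. The names ScoringState, score_rows, mean_score and InstructorSilentZeroError come from the commit messages, but every signature, field name, and default below is an assumption: the exception class is a local stand-in, max_tokens errors are detected by string matching, an empty score list averages to 0.0, and the judge_logger calls are left out.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, Sequence


class InstructorSilentZeroError(Exception):
    """Local stand-in for the project's exception (assumed shape)."""


@dataclass
class ScoringState:
    """Accumulates scores, per-sample scores, and two failure lists."""

    scores: list = field(default_factory=list)
    sample_scores: list = field(default_factory=list)
    max_tokens_failures: list = field(default_factory=list)  # row indices (assumed)
    silent_zero_failures: list = field(default_factory=list)  # row indices (assumed)

    @property
    def mean_score(self) -> float:
        # Assumption: an empty run averages to 0.0 rather than raising.
        return sum(self.scores) / len(self.scores) if self.scores else 0.0


async def score_rows(
    rows: Sequence[object],
    score_fn: Callable[[object], Awaitable[float]],
    max_concurrency: int = 4,  # hypothetical default
) -> ScoringState:
    """Score every row concurrently, at most max_concurrency at a time."""
    state = ScoringState()
    semaphore = asyncio.Semaphore(max_concurrency)

    async def score_one(index: int, row: object) -> None:
        async with semaphore:
            try:
                score = await score_fn(row)
            except InstructorSilentZeroError:
                state.silent_zero_failures.append(index)
                return
            except Exception as exc:
                # Assumption: max_tokens errors are recognised by message text.
                if "max_tokens" in str(exc):
                    state.max_tokens_failures.append(index)
                    return
                raise
            state.scores.append(score)
            state.sample_scores.append({"index": index, "score": score})

    await asyncio.gather(*(score_one(i, row) for i, row in enumerate(rows)))
    return state
```

In this sketch each metric would supply only its own score_fn, and the caller reads state.mean_score and the failure lists once the coroutine returns.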
Replace the duplicated ~60-line score_all/score_one/error-handling block in FaithfulnessMetric.compute() with calls to the shared helpers from _scoring_loop:

- score_rows() for the batched async scoring loop
- build_max_tokens_result() for the all-max_tokens early return
- build_silent_zero_result() for the all-silent-zero early return
- enrich_details_with_failures() for failure counts in details dict
- state.mean_score for the average calculation

Also fix ty: ignore comment style (ty: ignore[unresolved-attribute] instead of type: ignore[union-attr]) in _scoring_loop.py.

All 10 faithfulness tests pass. Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Assigned-by: orchestrator
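The post-loop steps listed above could look roughly like the sketch below. The function names match the commit message, but the parameters, the result shape, and the dict keys (error, failed_rows, max_tokens_failures, silent_zero_failures, warning) are all hypothetical and chosen only for illustration. build_silent_zero_result() would presumably follow the same guard pattern and is omitted.

```python
from typing import Optional


def build_max_tokens_result(n_scored: int, n_max_tokens: int) -> Optional[dict]:
    """Early-return guard: fires only when no row scored and at least one
    row failed with a max_tokens error. Returns None otherwise."""
    if n_scored == 0 and n_max_tokens > 0:
        # Hypothetical result shape; the real helper may return a richer object.
        return {"score": None, "error": "max_tokens", "failed_rows": n_max_tokens}
    return None


def enrich_details_with_failures(
    details: dict, n_max_tokens: int, n_silent_zero: int
) -> dict:
    """Return a copy of details with failure counts and a warning added.
    Keys are only added for failure kinds that actually occurred."""
    enriched = dict(details)
    if n_max_tokens:
        enriched["max_tokens_failures"] = n_max_tokens
    if n_silent_zero:
        enriched["silent_zero_failures"] = n_silent_zero
    failed = n_max_tokens + n_silent_zero
    if failed:
        enriched["warning"] = f"{failed} row(s) failed and were excluded from the mean"
    return enriched
```

A compute() method written against helpers like these would call the guard first, return its result if it is not None, and otherwise pass its metric-specific details through the enrichment step before building the final result.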
Replace duplicated score_all/score_one/error-handling block in ContextPrecisionMetric.compute() with shared _scoring_loop helpers. All 5 context_precision tests pass. Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Assigned-by: orchestrator
Replace duplicated score_all/score_one/error-handling block in ContextRecallMetric.compute() with shared _scoring_loop helpers. All 4 context_recall tests pass. Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Assigned-by: orchestrator
Replace duplicated score_all/score_one/error-handling block in AnswerRelevancyMetric.compute() with shared _scoring_loop helpers. All 8 non-slow answer_relevancy tests pass. The slow integration test (TestAnswerRelevancyIntegration) requires live Google credentials and is excluded from the standard test run (marked @pytest.mark.slow). Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Assigned-by: orchestrator
Summary
- _scoring_loop.py
- run_scoring_loop() handles: silent zero detection, max_tokens errors, JudgeLogger calls, mixed-success aggregation
- score_fn callback and metric-specific detail fields

Test plan
- uv run pytest tests/ -v -m "not slow" (1159 passed)

Refs #182
🤖 Generated with SODA + Claude Code