refactor: extract shared scoring loop from Ragas metrics (#182) by decko · Pull Request #201 · decko/raki

decko · 2026-04-24T20:39:37Z

Summary

- Extracts duplicated error handling (~80 lines per file) from 4 Ragas metric files into a shared _scoring_loop.py module
- New run_scoring_loop() handles silent zero detection, max_tokens errors, JudgeLogger calls, and mixed-success aggregation
- Each metric now provides only a score_fn callback and its metric-specific detail fields
- Reduces ~337 lines of duplication to ~193 lines of shared logic
- Adds 396 lines of new tests for the scoring loop module
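The callback contract described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the run_scoring_loop() signature, the row layout, the result dict, and the toy faithfulness_score() judge are all assumptions made for the example.

```python
import asyncio
from typing import Any, Awaitable, Callable

Row = dict[str, Any]
ScoreFn = Callable[[Row], Awaitable[float]]


async def run_scoring_loop(rows: list[Row], score_fn: ScoreFn) -> dict[str, Any]:
    """Hypothetical signature: score every row and tolerate per-row failures."""
    scores: list[float] = []
    failed_ids: list[str] = []
    for row in rows:
        try:
            scores.append(await score_fn(row))
        except Exception:
            # The real module reportedly separates silent-zero and max_tokens
            # errors; this sketch lumps every failure together.
            failed_ids.append(row["id"])
    # Mixed success: average only the rows that produced a score.
    mean = sum(scores) / len(scores) if scores else None
    return {"score": mean, "scored": len(scores), "failed": failed_ids}


async def faithfulness_score(row: Row) -> float:
    """Toy stand-in for a judge call; not the Ragas implementation."""
    if not row["contexts"]:
        raise ValueError("no contexts to judge against")
    return 1.0 if row["answer"] in row["contexts"] else 0.0


def compute_faithfulness(rows: list[Row]) -> dict[str, Any]:
    """The metric supplies only its callback plus metric-specific fields."""
    result = asyncio.run(run_scoring_loop(rows, faithfulness_score))
    result["metric"] = "faithfulness"
    return result
```

With this shape, a second metric would reuse run_scoring_loop() unchanged and differ only in its callback and the fields it adds to the result.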

Test plan

uv run pytest tests/ -v -m "not slow" — 1159 passed
All 4 Ragas metrics (faithfulness, precision, recall, relevancy) use the shared loop
Error handling behavior unchanged (silent zero, max_tokens, mixed success)

🤖 Generated with SODA + Claude Code

…182)

Add 24 tests covering ScoringState, score_rows, build_max_tokens_result, build_silent_zero_result, and enrich_details_with_failures, all of which will be extracted from the 4 duplicated Ragas metric files. Tests are failing (module does not exist yet) per TDD convention.

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: orchestrator

The four Ragas metric files (faithfulness, precision, recall, relevancy) all duplicated ~60 lines of identical error-handling logic:

- asyncio.Semaphore-bounded parallel scoring
- InstructorSilentZeroError detection and silent-zero failure tracking
- max_tokens error categorisation
- judge_logger calls on success and failure
- Post-loop early-return builders for all-failed scenarios
- details dict enrichment with failure counts/warnings

Extract these into:

- ScoringState: accumulates scores, sample_scores, and two failure lists
- score_rows(): the shared async scoring coroutine
- build_max_tokens_result(): early-return guard for max_tokens failures
- build_silent_zero_result(): early-return guard for silent-zero failures
- enrich_details_with_failures(): adds failure keys to the details dict

All 24 new tests pass.

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: orchestrator
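A rough sketch of how ScoringState and score_rows() might fit together, assuming details the commit message does not give: the field names on ScoringState, the score_rows() parameters, the "id" key on each row, and the string check used to spot max_tokens errors are all guesses. InstructorSilentZeroError is redefined locally as a stand-in, and judge_logger calls are omitted.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Awaitable, Callable, Optional


class InstructorSilentZeroError(Exception):
    """Local stand-in for the project's exception of the same name."""


@dataclass
class ScoringState:
    """Accumulates per-row outcomes; field names are assumptions."""
    scores: list[float] = field(default_factory=list)
    sample_scores: dict[str, float] = field(default_factory=dict)
    silent_zero_failures: list[str] = field(default_factory=list)
    max_tokens_failures: list[str] = field(default_factory=list)

    @property
    def mean_score(self) -> Optional[float]:
        return sum(self.scores) / len(self.scores) if self.scores else None


async def score_rows(
    rows: list[dict[str, Any]],
    score_fn: Callable[[dict[str, Any]], Awaitable[float]],
    max_concurrency: int = 4,
) -> ScoringState:
    """Score rows concurrently, with at most max_concurrency calls in flight."""
    state = ScoringState()
    semaphore = asyncio.Semaphore(max_concurrency)

    async def score_one(row: dict[str, Any]) -> None:
        async with semaphore:
            try:
                score = await score_fn(row)
            except InstructorSilentZeroError:
                state.silent_zero_failures.append(row["id"])
                return
            except Exception as exc:
                # Assumed heuristic: classify by message text.
                if "max_tokens" in str(exc):
                    state.max_tokens_failures.append(row["id"])
                    return
                raise
            state.scores.append(score)
            state.sample_scores[row["id"]] = score

    await asyncio.gather(*(score_one(row) for row in rows))
    return state


async def demo_score_fn(row: dict[str, Any]) -> float:
    """Hypothetical callback that fails for two specially named rows."""
    await asyncio.sleep(0)
    if row["id"] == "zero":
        raise InstructorSilentZeroError("judge returned a silent zero")
    if row["id"] == "long":
        raise RuntimeError("response truncated: max_tokens reached")
    return row["value"]


DEMO_ROWS = [
    {"id": "a", "value": 1.0},
    {"id": "zero", "value": 0.0},
    {"id": "long", "value": 0.0},
    {"id": "b", "value": 0.5},
]
```

Running `asyncio.run(score_rows(DEMO_ROWS, demo_score_fn))` leaves two scores in the state and one row id in each failure list, so the caller can tell a partial result from a clean one.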

Replace the duplicated ~60-line score_all/score_one/error-handling block in FaithfulnessMetric.compute() with calls to the shared helpers from _scoring_loop:

- score_rows() for the batched async scoring loop
- build_max_tokens_result() for the all-max_tokens early return
- build_silent_zero_result() for the all-silent-zero early return
- enrich_details_with_failures() for failure counts in details dict
- state.mean_score for the average calculation

Also fix ty: ignore comment style (ty: ignore[unresolved-attribute] instead of type: ignore[union-attr]) in _scoring_loop.py.

All 10 faithfulness tests pass.

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: orchestrator
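The post-loop sequence listed above (early-return guards first, then details enrichment) might look roughly like this. The helper names come from the commit message, but their parameters, the result dict layout, the details keys, and the finalize() wrapper are assumptions for illustration only.

```python
from typing import Any, Optional


def build_max_tokens_result(
    metric: str, total: int, max_tokens_failures: list[str]
) -> Optional[dict[str, Any]]:
    """Return an early result only when every row failed with max_tokens."""
    if total == 0 or len(max_tokens_failures) < total:
        return None
    return {"metric": metric, "score": None,
            "details": {"error": "max_tokens", "failed_rows": total}}


def build_silent_zero_result(
    metric: str, total: int, silent_zero_failures: list[str]
) -> Optional[dict[str, Any]]:
    """Return an early result only when every row was a silent zero."""
    if total == 0 or len(silent_zero_failures) < total:
        return None
    return {"metric": metric, "score": None,
            "details": {"error": "silent_zero", "failed_rows": total}}


def enrich_details_with_failures(
    details: dict[str, Any],
    silent_zero_failures: list[str],
    max_tokens_failures: list[str],
) -> dict[str, Any]:
    """Copy details and add failure counts plus a warning when rows failed."""
    enriched = dict(details)
    if silent_zero_failures:
        enriched["silent_zero_failures"] = len(silent_zero_failures)
    if max_tokens_failures:
        enriched["max_tokens_failures"] = len(max_tokens_failures)
    if silent_zero_failures or max_tokens_failures:
        enriched["warning"] = "score computed from a partial set of rows"
    return enriched


def finalize(
    metric: str,
    scores: list[float],
    silent_zero_failures: list[str],
    max_tokens_failures: list[str],
) -> dict[str, Any]:
    """Hypothetical tail of a metric's compute(): guards, then mean, then details."""
    total = len(scores) + len(silent_zero_failures) + len(max_tokens_failures)
    early = (build_max_tokens_result(metric, total, max_tokens_failures)
             or build_silent_zero_result(metric, total, silent_zero_failures))
    if early is not None:
        return early
    mean = sum(scores) / len(scores) if scores else None
    details = enrich_details_with_failures(
        {"rows_scored": len(scores)}, silent_zero_failures, max_tokens_failures
    )
    return {"metric": metric, "score": mean, "details": details}
```

Keeping the guards as functions that return None when they do not apply lets each metric chain them without nested conditionals.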

Replace duplicated score_all/score_one/error-handling block in ContextPrecisionMetric.compute() with shared _scoring_loop helpers.

All 5 context_precision tests pass.

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: orchestrator

Replace duplicated score_all/score_one/error-handling block in ContextRecallMetric.compute() with shared _scoring_loop helpers.

All 4 context_recall tests pass.

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: orchestrator

Replace duplicated score_all/score_one/error-handling block in AnswerRelevancyMetric.compute() with shared _scoring_loop helpers.

All 8 non-slow answer_relevancy tests pass. The slow integration test (TestAnswerRelevancyIntegration) requires live Google credentials and is excluded from the standard test run (marked @pytest.mark.slow).

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: orchestrator

decko added 6 commits April 24, 2026 17:20

decko merged commit 7bc699d into main Apr 24, 2026
4 checks passed

decko deleted the soda/182 branch April 24, 2026 20:41

This was referenced Apr 24, 2026

refactor: extract shared error handling from 4 Ragas metric files #182

Closed

feat: track judge cost per report #174

Closed

decko merged 6 commits into main from soda/182
