
refactor: extract shared scoring loop from Ragas metrics (#182) #201

Merged
decko merged 6 commits into main from soda/182 on Apr 24, 2026

@decko (Owner) commented Apr 24, 2026

Summary

  • Extracts duplicated error handling (~80 lines per file) from 4 Ragas metric files into shared _scoring_loop.py
  • New run_scoring_loop() handles: silent zero detection, max_tokens errors, JudgeLogger calls, mixed-success aggregation
  • Each metric now provides only a score_fn callback and metric-specific detail fields
  • Reduces ~337 lines of duplication to ~193 lines of shared logic
  • 396 lines of new tests for the scoring loop module
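
For readers who have not opened `_scoring_loop.py`, here is a minimal sketch of the callback pattern described above. It is not the merged code: the `run_scoring_loop()` signature, the `SilentZeroError` stand-in, the `max_concurrency` parameter, the result keys, and the message-based detection of max_tokens errors are all assumptions made for illustration.

```python
import asyncio
from typing import Any, Awaitable, Callable

Row = dict[str, Any]


class SilentZeroError(Exception):
    """Hypothetical stand-in for the project's silent-zero error type."""


async def run_scoring_loop(
    rows: list[Row],
    score_fn: Callable[[Row], Awaitable[float]],
    max_concurrency: int = 4,  # assumed parameter, not taken from the PR
) -> dict[str, Any]:
    """Score every row with score_fn and aggregate successes and failures."""
    semaphore = asyncio.Semaphore(max_concurrency)
    scores: list[float] = []
    silent_zero: list[int] = []
    max_tokens: list[int] = []

    async def score_one(index: int, row: Row) -> None:
        async with semaphore:  # bound the number of in-flight judge calls
            try:
                scores.append(await score_fn(row))
            except SilentZeroError:
                silent_zero.append(index)
            except Exception as exc:
                # Assumption: max_tokens failures are recognised by message text.
                if "max_tokens" not in str(exc):
                    raise
                max_tokens.append(index)

    await asyncio.gather(*(score_one(i, row) for i, row in enumerate(rows)))
    return {
        # Mixed success: average only the rows that were actually scored.
        "score": sum(scores) / len(scores) if scores else 0.0,
        "scored": len(scores),
        "silent_zero_failures": len(silent_zero),
        "max_tokens_failures": len(max_tokens),
    }


async def fake_score_fn(row: Row) -> float:
    """Toy callback standing in for a metric-specific judge call."""
    if row["answer"] == "":
        raise SilentZeroError("judge returned an empty verdict")
    if row["answer"] == "too long":
        raise RuntimeError("max_tokens limit reached")
    return float(row["expected"])


demo_rows = [
    {"answer": "ok", "expected": 1.0},
    {"answer": "", "expected": 0.0},
    {"answer": "too long", "expected": 0.0},
    {"answer": "ok", "expected": 0.5},
]
result = asyncio.run(run_scoring_loop(demo_rows, fake_score_fn))
```

In this sketch a metric only supplies `fake_score_fn`; the loop owns concurrency and failure bookkeeping, which is the division of labour the summary describes.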

Test plan

  • uv run pytest tests/ -v -m "not slow" — 1159 passed
  • All 4 Ragas metrics (faithfulness, precision, recall, relevancy) use the shared loop
  • Error handling behavior unchanged (silent zero, max_tokens, mixed success)

Refs #182

🤖 Generated with SODA + Claude Code

decko added 6 commits April 24, 2026 17:20
…182)

Add 24 tests covering ScoringState, score_rows, build_max_tokens_result,
build_silent_zero_result, and enrich_details_with_failures — all of which
will be extracted from the 4 duplicated Ragas metric files.

Tests are failing (module does not exist yet) per TDD convention.

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: orchestrator

The four Ragas metric files (faithfulness, precision, recall, relevancy)
all duplicated ~60 lines of identical error-handling logic:
- asyncio.Semaphore-bounded parallel scoring
- InstructorSilentZeroError detection and silent-zero failure tracking
- max_tokens error categorisation
- judge_logger calls on success and failure
- Post-loop early-return builders for all-failed scenarios
- details dict enrichment with failure counts/warnings

Extract these into:
- ScoringState: accumulates scores, sample_scores, and two failure lists
- score_rows(): the shared async scoring coroutine
- build_max_tokens_result(): early-return guard for max_tokens failures
- build_silent_zero_result(): early-return guard for silent-zero failures
- enrich_details_with_failures(): adds failure keys to the details dict
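
A rough sketch of how the state object and the details helper listed above could fit together. The names `ScoringState`, `sample_scores`, `mean_score`, and `enrich_details_with_failures` come from this PR; the field types, the two failure-list names, and the keys written into the details dict are assumptions, not the merged implementation.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ScoringState:
    """Accumulates scores, per-sample scores, and two failure lists."""

    scores: list[float] = field(default_factory=list)
    sample_scores: list[dict[str, Any]] = field(default_factory=list)
    # Assumed names for the "two failure lists" mentioned in the commit.
    silent_zero_failures: list[str] = field(default_factory=list)
    max_tokens_failures: list[str] = field(default_factory=list)

    @property
    def mean_score(self) -> float:
        """Average of the successful scores, or 0.0 when nothing was scored."""
        return sum(self.scores) / len(self.scores) if self.scores else 0.0


def enrich_details_with_failures(
    details: dict[str, Any], state: ScoringState
) -> dict[str, Any]:
    """Return a copy of details with failure counts and warnings added."""
    enriched = dict(details)
    warnings: list[str] = []
    if state.silent_zero_failures:
        count = len(state.silent_zero_failures)
        enriched["silent_zero_failures"] = count  # hypothetical key name
        warnings.append(f"{count} row(s) returned a silent zero")
    if state.max_tokens_failures:
        count = len(state.max_tokens_failures)
        enriched["max_tokens_failures"] = count  # hypothetical key name
        warnings.append(f"{count} row(s) hit the max_tokens limit")
    if warnings:
        enriched["warnings"] = warnings
    return enriched


state = ScoringState()
state.scores.extend([1.0, 0.5])
state.silent_zero_failures.append("row-3")
details = enrich_details_with_failures({"metric": "faithfulness"}, state)
```

Returning a copy rather than mutating `details` in place is a choice made for this sketch; the real helper may do either.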

All 24 new tests pass.

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: orchestrator

Replace the duplicated ~60-line score_all/score_one/error-handling block in
FaithfulnessMetric.compute() with calls to the shared helpers from
_scoring_loop:

- score_rows() for the batched async scoring loop
- build_max_tokens_result() for the all-max_tokens early return
- build_silent_zero_result() for the all-silent-zero early return
- enrich_details_with_failures() for failure counts in details dict
- state.mean_score for the average calculation
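
The early-return flow above can be pictured roughly as follows. This is a simplified, synchronous sketch: the helper names match the commit, but their signatures, the `Optional` return convention, the result shape, and the `finish_compute()` wrapper are hypothetical, and the scoring step itself is omitted.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ScoringState:
    """Trimmed-down state: successful scores plus two failure lists."""

    scores: list[float] = field(default_factory=list)
    max_tokens_failures: list[str] = field(default_factory=list)
    silent_zero_failures: list[str] = field(default_factory=list)

    @property
    def mean_score(self) -> float:
        return sum(self.scores) / len(self.scores) if self.scores else 0.0


def build_max_tokens_result(state: ScoringState) -> Optional[dict[str, Any]]:
    """Return an error result only when nothing scored and max_tokens hit."""
    if state.scores or not state.max_tokens_failures:
        return None
    return {"score": 0.0, "error": "max_tokens",
            "failed": len(state.max_tokens_failures)}


def build_silent_zero_result(state: ScoringState) -> Optional[dict[str, Any]]:
    """Return an error result only when nothing scored and silent zeros hit."""
    if state.scores or not state.silent_zero_failures:
        return None
    return {"score": 0.0, "error": "silent_zero",
            "failed": len(state.silent_zero_failures)}


def finish_compute(state: ScoringState) -> dict[str, Any]:
    """Hypothetical tail of a metric's compute(): guards first, then the mean."""
    for guard in (build_max_tokens_result, build_silent_zero_result):
        early = guard(state)
        if early is not None:
            return early
    return {"score": state.mean_score, "details": {"scored": len(state.scores)}}


all_failed = finish_compute(ScoringState(max_tokens_failures=["r1", "r2"]))
mixed = finish_compute(
    ScoringState(scores=[1.0, 0.0], silent_zero_failures=["r3"])
)
```

Under these assumptions a mixed run still reports a mean over the rows that scored, while a run with no successes short-circuits to an error result.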

Also fix ty: ignore comment style (ty: ignore[unresolved-attribute]
instead of type: ignore[union-attr]) in _scoring_loop.py.

All 10 faithfulness tests pass.

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: orchestrator

Replace duplicated score_all/score_one/error-handling block in
ContextPrecisionMetric.compute() with shared _scoring_loop helpers.

All 5 context_precision tests pass.

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: orchestrator

Replace duplicated score_all/score_one/error-handling block in
ContextRecallMetric.compute() with shared _scoring_loop helpers.

All 4 context_recall tests pass.

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: orchestrator

Replace duplicated score_all/score_one/error-handling block in
AnswerRelevancyMetric.compute() with shared _scoring_loop helpers.

All 8 non-slow answer_relevancy tests pass. The slow integration test
(TestAnswerRelevancyIntegration) requires live Google credentials and
is excluded from the standard test run (marked @pytest.mark.slow).

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: orchestrator

@decko decko merged commit 7bc699d into main Apr 24, 2026
4 checks passed
@decko decko deleted the soda/182 branch April 24, 2026 20:41