Summary
The detect-raise-catch-aggregate pattern for silent-zero and max_tokens errors is duplicated across faithfulness.py, precision.py, recall.py, and relevancy.py: approximately 70-80 lines per file, not ~30 as originally estimated.
Current State
Each metric file duplicates:
- silent_zero_failures / max_tokens_failures list initialization
- InstructorSilentZeroError import and raise inside score_one() loop
- except Exception catch block branching on is_max_tokens_error() and isinstance(exc, InstructorSilentZeroError)
- JudgeLogger instantiation and .log() calls (success + error paths)
- Two all-failures early-return blocks (max_tokens, silent_zero)
- Mixed-success details dict assembly with warning counts
The only differences between files are:
- Which Ragas metric class is imported/instantiated
- Which kwargs are passed to ascore() (response= vs reference=)
- Faithfulness/relevancy include context_source detection and experimental/caveat fields
- Relevancy additionally calls create_ragas_embeddings()
Helpers is_instructor_silent_zero() and is_max_tokens_error() are already centralized in adapter.py. The per-metric usage pattern is what's duplicated.
Proposed
Extract to a shared scoring runner in adapter.py (or new scoring.py):

    class RagasScoringRunner:
        """Handles the score-one/catch/accumulate/aggregate pattern."""

        def __init__(self, metric_name: str, config: MetricConfig): ...

        async def score_all(
            self,
            samples: list[EvalSample],
            score_fn: Callable[[EvalSample], Awaitable[float]],
        ) -> list[MetricResult]:
            """Run score_fn per sample, handle errors, build results."""

Each metric file reduces to:
- Import the Ragas metric class
- Define a score_fn that calls ascore() with the right kwargs
- Call runner.score_all(samples, score_fn)
- Add any metric-specific detail fields (context_source, experimental, etc.)
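The per-metric part could then be as small as the sketch below. It is illustrative only: FakeMetric stands in for a Ragas metric object and mimics nothing beyond an async ascore(), make_score_fn is a hypothetical helper, and the dict-based sample shape is an assumption. The only thing taken from the current code is that metrics differ in whether they pass response= or reference=.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

Sample = Dict[str, str]  # hypothetical sample shape for this sketch


class FakeMetric:
    """Stand-in for a Ragas metric object; it only mimics an async ascore()."""

    def __init__(self) -> None:
        self.last_kwargs: Dict[str, Any] = {}

    async def ascore(self, **kwargs: Any) -> float:
        self.last_kwargs = kwargs  # recorded so the call can be inspected
        return 1.0


def make_score_fn(
    metric: FakeMetric,
    build_kwargs: Callable[[Sample], Dict[str, Any]],
) -> Callable[[Sample], Awaitable[float]]:
    """Hypothetical helper: bind a metric to its metric-specific kwargs."""

    async def score_fn(sample: Sample) -> float:
        return await metric.ascore(**build_kwargs(sample))

    return score_fn


metric_a = FakeMetric()
metric_b = FakeMetric()

# The two variants differ only in which kwarg they forward to ascore().
score_with_response = make_score_fn(metric_a, lambda s: {"response": s["response"]})
score_with_reference = make_score_fn(metric_b, lambda s: {"reference": s["reference"]})

sample = {"response": "Paris", "reference": "Paris is the capital of France"}
response_score = asyncio.run(score_with_response(sample))
reference_score = asyncio.run(score_with_reference(sample))
```

A score_fn built this way is what would be handed to runner.score_all(samples, score_fn); metric-specific detail fields such as context_source would be added to the returned results afterwards.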
Impact
Reduces ~280 lines of duplicated code across 4 files to ~40 lines of shared logic. Future error handling changes become one-file fixes.
Files
src/raki/metrics/ragas/faithfulness.py (184 lines)
src/raki/metrics/ragas/precision.py (155 lines)
src/raki/metrics/ragas/recall.py (155 lines)
src/raki/metrics/ragas/relevancy.py (184 lines)
src/raki/metrics/ragas/adapter.py (target for shared logic)
Flagged By
RAG Specialist review of PR #180.