refactor: extract shared error handling from 4 Ragas metric files #182

@decko

Summary

The detect-raise-catch-aggregate pattern for silent zero and max_tokens errors is duplicated across faithfulness.py, precision.py, recall.py, and relevancy.py. The duplication is approximately 70-80 lines per file, not the ~30 originally estimated.

Current State

Each metric file duplicates:

  1. silent_zero_failures / max_tokens_failures list initialization
  2. InstructorSilentZeroError import and raise inside score_one() loop
  3. except Exception catch block branching on is_max_tokens_error() and isinstance(exc, InstructorSilentZeroError)
  4. JudgeLogger instantiation and .log() calls (success + error paths)
  5. Two all-failures early-return blocks (max_tokens, silent_zero)
  6. Mixed-success details dict assembly with warning counts
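Steps 5 and 6 above (the two all-failures early returns and the mixed-success details dict) can be sketched roughly as follows. This is a minimal illustration, not the code in the metric files: the function name `aggregate_scores`, the dict keys, and the use of a mean as the aggregate are all assumptions made for the example.

```python
def aggregate_scores(scores, silent_zero_failures, max_tokens_failures):
    """Return (score, details) for one metric run.

    Hypothetical helper: names and dict keys are illustrative only.
    """
    total = len(scores) + len(silent_zero_failures) + len(max_tokens_failures)

    # Early return 1: every sample failed with a max_tokens error.
    if total > 0 and len(max_tokens_failures) == total:
        return None, {"error": "all samples hit max_tokens", "samples": total}

    # Early return 2: every sample came back as a silent zero.
    if total > 0 and len(silent_zero_failures) == total:
        return None, {"error": "all samples returned a silent zero", "samples": total}

    # Mixed-success path: report the aggregate plus warning counts.
    details = {
        "samples": total,
        "scored": len(scores),
        "silent_zero_warnings": len(silent_zero_failures),
        "max_tokens_warnings": len(max_tokens_failures),
    }
    # Assumption: the aggregate is a plain mean of the successful scores.
    mean = sum(scores) / len(scores) if scores else None
    return mean, details
```

Having this logic in one place is what would let a change to the warning fields land in a single file rather than four.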

The only differences between files are:

  • Which Ragas metric class is imported/instantiated
  • Which kwargs are passed to ascore() (response= vs reference=)
  • Faithfulness/relevancy include context_source detection and experimental/caveat fields
  • Relevancy additionally calls create_ragas_embeddings()
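The kwargs difference listed above (response= vs reference=) is small enough to capture in a closure. The sketch below shows one way that might look. `FakeMetric`, `make_score_fn`, the `EvalSample` fields, and the `user_input` / `retrieved_contexts` keyword names are assumptions for illustration; only the `response=` / `reference=` distinction and the async `ascore()` call come from this issue.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class EvalSample:
    """Stand-in sample type; the project's real EvalSample may differ."""
    question: str
    response: str
    reference: str
    contexts: list


class FakeMetric:
    """Stand-in for a Ragas metric; only mimics an async ascore(**kwargs)."""

    async def ascore(self, **kwargs) -> float:
        self.last_kwargs = kwargs  # recorded so the example can be inspected
        return 1.0


def make_score_fn(metric, answer_field: str):
    """Build a score_fn that passes either response= or reference= to ascore()."""

    async def score_fn(sample: EvalSample) -> float:
        kwargs = {
            "user_input": sample.question,          # assumed keyword name
            "retrieved_contexts": sample.contexts,  # assumed keyword name
            answer_field: getattr(sample, answer_field),
        }
        return await metric.ascore(**kwargs)

    return score_fn


sample = EvalSample("What is X?", "X is Y.", "X is Y.", ["X is Y."])
metric = FakeMetric()
print(asyncio.run(make_score_fn(metric, "reference")(sample)))
```

With this shape, each metric file would differ only in the metric class it instantiates and the `answer_field` it selects.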

Helpers is_instructor_silent_zero() and is_max_tokens_error() are already centralized in adapter.py. The per-metric usage pattern is what's duplicated.

Proposed

Extract the pattern into a shared scoring runner in adapter.py (or a new scoring.py):

from collections.abc import Awaitable, Callable

# MetricConfig, EvalSample, and MetricResult are assumed to be existing
# project types; their imports are omitted here.


class RagasScoringRunner:
    """Handles the score-one/catch/accumulate/aggregate pattern."""

    def __init__(self, metric_name: str, config: MetricConfig): ...

    async def score_all(
        self,
        samples: list[EvalSample],
        score_fn: Callable[[EvalSample], Awaitable[float]],
    ) -> list[MetricResult]:
        """Run score_fn per sample, handle errors, build results."""

Each metric file reduces to:

  • Import the Ragas metric class
  • Define a score_fn that calls ascore() with the right kwargs
  • Call runner.score_all(samples, score_fn)
  • Add any metric-specific detail fields (context_source, experimental, etc.)
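To make the division of labor concrete, here is a minimal sketch of how the runner's `score_all` might be filled in. It is written under several assumptions: the dataclasses, `InstructorSilentZeroError`, and `is_max_tokens_error` below are stand-ins for the project's real definitions; the `value == 0.0` silent-zero check is a guess at what the real helper does; the `config` parameter and JudgeLogger calls are left out; and returning a single aggregate `MetricResult` in the list is one possible reading of the proposed signature.

```python
import asyncio
from collections.abc import Awaitable, Callable
from dataclasses import dataclass, field
from typing import Optional


class InstructorSilentZeroError(Exception):
    """Stand-in for the project's exception of the same name."""


def is_max_tokens_error(exc: Exception) -> bool:
    """Stand-in for the adapter.py helper; the real detection may differ."""
    return "max_tokens" in str(exc)


@dataclass
class EvalSample:
    """Stand-in sample type with only an identifier."""
    sample_id: str


@dataclass
class MetricResult:
    """Stand-in result type; field names are illustrative."""
    name: str
    score: Optional[float]
    details: dict = field(default_factory=dict)


class RagasScoringRunner:
    """Sketch of the score-one/catch/accumulate/aggregate pattern."""

    def __init__(self, metric_name: str):
        self.metric_name = metric_name

    async def score_all(
        self,
        samples: list[EvalSample],
        score_fn: Callable[[EvalSample], Awaitable[float]],
    ) -> list[MetricResult]:
        scores: list[float] = []
        silent_zero_failures: list[str] = []
        max_tokens_failures: list[str] = []
        for sample in samples:
            try:
                value = await score_fn(sample)
                if value == 0.0:  # assumed silent-zero check
                    raise InstructorSilentZeroError(sample.sample_id)
                scores.append(value)
            except Exception as exc:
                if is_max_tokens_error(exc):
                    max_tokens_failures.append(sample.sample_id)
                elif isinstance(exc, InstructorSilentZeroError):
                    silent_zero_failures.append(sample.sample_id)
                else:
                    raise  # assumption: unrelated errors propagate
        details = {
            "silent_zero_failures": silent_zero_failures,
            "max_tokens_failures": max_tokens_failures,
        }
        # Assumption: the aggregate is a mean over successfully scored samples.
        score = sum(scores) / len(scores) if scores else None
        return [MetricResult(self.metric_name, score, details)]
```

A metric file would then build its own `score_fn`, await `RagasScoringRunner("faithfulness").score_all(samples, score_fn)`, and attach fields such as context_source to the returned details.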

Impact

Replaces roughly 280-320 lines of duplicated code (70-80 lines in each of 4 files) with ~40 lines of shared logic. Future error-handling changes become one-file fixes.

Files

  • src/raki/metrics/ragas/faithfulness.py (184 lines)
  • src/raki/metrics/ragas/precision.py (155 lines)
  • src/raki/metrics/ragas/recall.py (155 lines)
  • src/raki/metrics/ragas/relevancy.py (184 lines)
  • src/raki/metrics/ragas/adapter.py (target for shared logic)

Flagged By

RAG Specialist review of PR #180.

Labels

enhancement
