refactor: extract shared error handling from 4 Ragas metric files #182

## Summary

The detect-raise-catch-aggregate pattern for silent-zero and max_tokens errors is duplicated across `faithfulness.py`, `precision.py`, `recall.py`, and `relevancy.py`. The duplication is approximately **70-80 lines per file**, not the ~30 originally estimated.

## Current State

Each metric file duplicates:
1. `silent_zero_failures` / `max_tokens_failures` list initialization
2. `InstructorSilentZeroError` import and raise inside `score_one()` loop
3. `except Exception` catch block branching on `is_max_tokens_error()` and `isinstance(exc, InstructorSilentZeroError)`
4. `JudgeLogger` instantiation and `.log()` calls (success + error paths)
5. Two all-failures early-return blocks (max_tokens, silent_zero)
6. Mixed-success `details` dict assembly with warning counts

The **only differences** between files are:
- Which Ragas metric class is imported/instantiated
- Which kwargs are passed to `ascore()` (`response=` vs `reference=`)
- Faithfulness/relevancy include `context_source` detection and `experimental`/`caveat` fields
- Relevancy additionally calls `create_ragas_embeddings()`

Helpers `is_instructor_silent_zero()` and `is_max_tokens_error()` are already centralized in `adapter.py`. The per-metric *usage* pattern is what's duplicated.
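
To make the duplicated aggregation concrete, steps 5 and 6 of the list above (the two all-failures early returns and the mixed-success `details` assembly) could look roughly like the sketch below. This is an illustration only: the function name `summarize_scores` and every dict key are hypothetical, not the fields the metric files actually use.

```python
from typing import Any


def summarize_scores(
    scores: list[float],
    silent_zero_failures: list[int],
    max_tokens_failures: list[int],
) -> dict[str, Any]:
    """Aggregate per-sample outcomes into one result dict (hypothetical shape)."""
    total = len(scores) + len(silent_zero_failures) + len(max_tokens_failures)

    # All-failures early returns: no usable score exists, so report the cause
    # rather than a misleading 0.0.
    if total > 0 and len(max_tokens_failures) == total:
        return {"score": None, "error": "max_tokens", "failed": total}
    if total > 0 and len(silent_zero_failures) == total:
        return {"score": None, "error": "silent_zero", "failed": total}

    # Mixed success: average the samples that scored and attach warning counts.
    mean = sum(scores) / len(scores) if scores else None
    return {
        "score": mean,
        "error": None,
        "details": {
            "scored": len(scores),
            "silent_zero_warnings": len(silent_zero_failures),
            "max_tokens_warnings": len(max_tokens_failures),
        },
    }
```

Written once, a function like this would replace four near-identical copies, and only the inputs would vary per metric.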

## Proposed

Extract the pattern into a shared scoring runner in `adapter.py` (or a new `scoring.py`):

```python
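from __future__ import annotations  # annotations stay unevaluated in this sketch

from collections.abc import Awaitable, Callable

# MetricConfig, EvalSample, and MetricResult are assumed to be existing project types.


class RagasScoringRunner:
    """Handles the score-one/catch/accumulate/aggregate pattern."""

    def __init__(self, metric_name: str, config: MetricConfig): ...

    async def score_all(
        self,
        samples: list[EvalSample],
        score_fn: Callable[[EvalSample], Awaitable[float]],
    ) -> list[MetricResult]:
        """Run score_fn per sample, handle errors, build results."""
        ...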
```
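
One possible shape for the body of `score_all`, as a self-contained sketch rather than the final implementation. `InstructorSilentZeroError`, `is_max_tokens_error()`, and `MetricResult` are replaced here by stand-ins (the real ones live in the project and their details are not shown in this issue). The `config` argument and `JudgeLogger` calls are omitted for brevity, and re-raising unrecognized exceptions is an assumption about the desired behavior.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Optional


class InstructorSilentZeroError(Exception):
    """Stand-in for the project's exception of the same name."""


def is_max_tokens_error(exc: Exception) -> bool:
    # Stand-in for the adapter.py helper; the substring check is an assumption.
    return "max_tokens" in str(exc)


@dataclass
class MetricResult:
    # Stand-in: the real MetricResult fields are not shown in this issue.
    metric_name: str
    score: Optional[float]
    error: Optional[str] = None


class RagasScoringRunner:
    """Runs score_fn per sample and classifies the known failure modes."""

    def __init__(self, metric_name: str) -> None:
        self.metric_name = metric_name
        self.silent_zero_failures: list[int] = []
        self.max_tokens_failures: list[int] = []

    async def score_all(
        self,
        samples: list[Any],
        score_fn: Callable[[Any], Awaitable[float]],
    ) -> list[MetricResult]:
        results: list[MetricResult] = []
        for index, sample in enumerate(samples):
            try:
                score = await score_fn(sample)
                results.append(MetricResult(self.metric_name, score))
            except Exception as exc:
                if is_max_tokens_error(exc):
                    self.max_tokens_failures.append(index)
                    kind = "max_tokens"
                elif isinstance(exc, InstructorSilentZeroError):
                    self.silent_zero_failures.append(index)
                    kind = "silent_zero"
                else:
                    raise  # assumption: unknown errors should propagate
                results.append(MetricResult(self.metric_name, None, kind))
        return results


async def demo_score_fn(sample: str) -> float:
    """Fake scorer that triggers each failure mode on demand."""
    if sample == "trunc":
        raise RuntimeError("hit max_tokens limit")
    if sample == "zero":
        raise InstructorSilentZeroError("silent zero")
    return 0.9


if __name__ == "__main__":
    runner = RagasScoringRunner("faithfulness")
    for result in asyncio.run(runner.score_all(["ok", "trunc", "zero"], demo_score_fn)):
        print(result)
```

Keeping the failure index lists on the runner lets the caller build the all-failures and mixed-success summaries after the loop without re-scanning the results.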

Each metric file reduces to:
- Import the Ragas metric class
- Define a `score_fn` that calls `ascore()` with the right kwargs
- Call `runner.score_all(samples, score_fn)`
- Add any metric-specific detail fields (context_source, experimental, etc.)
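
The `score_fn` step above is where the per-metric kwargs difference (`response=` vs `reference=`) would live. A rough sketch, using a fake metric in place of a real Ragas class: `FakeMetric`, `make_score_fn`, and the `EvalSample` fields shown are all hypothetical and only illustrate the closure shape.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable


@dataclass
class EvalSample:
    # Stand-in: the real EvalSample fields are not shown in this issue.
    question: str
    answer: str
    ground_truth: str


class FakeMetric:
    """Stand-in for a Ragas metric class; records the kwargs it receives."""

    def __init__(self) -> None:
        self.calls: list[dict[str, Any]] = []

    async def ascore(self, **kwargs: Any) -> float:
        self.calls.append(kwargs)
        return 1.0


def make_score_fn(
    metric: Any,
    build_kwargs: Callable[[EvalSample], dict[str, Any]],
) -> Callable[[EvalSample], Awaitable[float]]:
    """Bind a metric and its kwargs builder into a per-sample scoring callable."""

    async def score_fn(sample: EvalSample) -> float:
        return await metric.ascore(**build_kwargs(sample))

    return score_fn


if __name__ == "__main__":
    metric = FakeMetric()
    sample = EvalSample("q", "a", "gt")
    # One metric file would pass response=, another reference=.
    by_response = make_score_fn(metric, lambda s: {"response": s.answer})
    by_reference = make_score_fn(metric, lambda s: {"reference": s.ground_truth})
    asyncio.run(by_response(sample))
    asyncio.run(by_reference(sample))
    print(metric.calls)
```

With this shape, each metric file only supplies the kwargs builder and then merges its own detail fields (such as `context_source`) into the result returned by the runner.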

## Impact

Reduces ~280 lines of duplicated code (roughly 70 lines in each of the 4 files) to ~40 lines of shared logic. Future error-handling changes become one-file fixes.

## Files

- `src/raki/metrics/ragas/faithfulness.py` (184 lines)
- `src/raki/metrics/ragas/precision.py` (155 lines)
- `src/raki/metrics/ragas/recall.py` (155 lines)
- `src/raki/metrics/ragas/relevancy.py` (184 lines)
- `src/raki/metrics/ragas/adapter.py` (target for shared logic)

## Flagged By

RAG Specialist review of PR #180.
