fix(ragas): detect and skip instructor#1658 silent-zero scores from Google provider #180

Merged
decko merged 3 commits into main from soda/169 on Apr 24, 2026


decko (Owner) commented Apr 24, 2026

Summary

  • Detects when the instructor library silently returns value=0.0 / reason=None (instead of raising a ValidationError) for Google/Gemini structured output (upstream bug instructor#1658)
  • Adds InstructorSilentZeroError and is_instructor_silent_zero() in src/raki/metrics/ragas/adapter.py; integrates detection in all four Ragas metric scorers (faithfulness, precision, recall, relevancy)
  • Returns score=None (N/A) when every session in a run hits a silent-zero failure, preventing misleading 0.0 averages
  • Removes top_p from Google LLM model_args (mirrors existing Anthropic fix) to prevent API rejection
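
A minimal sketch of the detection described above, assuming the scorer result exposes `value` and `reason` attributes. `FakeResult` is a hypothetical stand-in for the Pydantic model, and treating an empty string as "no reason" is an assumption; the actual code in src/raki/metrics/ragas/adapter.py may differ.

```python
from dataclasses import dataclass
from typing import Optional


class InstructorSilentZeroError(RuntimeError):
    """Raised when a result looks like the instructor#1658 silent zero."""


@dataclass
class FakeResult:
    # Hypothetical stand-in for the structured-output model (not from the PR).
    value: float
    reason: Optional[str] = None


def is_instructor_silent_zero(result: object, provider: str) -> bool:
    """Return True only for provider "google" with value == 0.0 and no reason."""
    if provider != "google":
        return False
    value = getattr(result, "value", None)
    reason = getattr(result, "reason", None)
    # Assumption: None and "" both count as "no reason".
    return value == 0.0 and not reason


silent = FakeResult(value=0.0)
legit_zero = FakeResult(value=0.0, reason="No claims were supported.")

if is_instructor_silent_zero(silent, "google"):
    print("silent zero detected; the scorer would raise InstructorSilentZeroError")
if not is_instructor_silent_zero(legit_zero, "google"):
    print("legitimate 0.0 preserved")
```

Keeping the check behind the provider guard means a 0.0 from any other provider is never reinterpreted, which matches the first acceptance criterion.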

Acceptance Criteria

  • Silent-zero detection fires only for provider=="google" with value==0.0 AND no reason
  • Legitimate 0.0 scores (with a reason) are preserved
  • score=None returned when all sessions hit silent-zero failures
  • top_p removed from Google LLM path
  • 20 new unit tests added and passing
  • All 926 tests pass (4 slow/LLM integration tests deselected per convention)
  • No pre-commit hook failures (ruff check, ruff format, ty check)
  • Merge conflict in tests/test_cli.py resolved (kept ruff-formatted multi-line signature)

Review Results

Verification ✅

926 passed, 4 deselected. All 20 new tests pass. ruff check, ruff format, and ty check are clean. Under raki validate --deep, operational metrics compute successfully. The 12 warning[unused-ignore-comment] diagnostics from ty are all pre-existing on the base branch.

Code Review 🔴 → ✅ (rework applied)

CRITICAL (fixed): Unresolved Git conflict markers in tests/test_cli.py at lines 1623–1629 caused a SyntaxError. Resolved by keeping the ruff-formatted multi-line method signature and removing the conflict markers.

Minor (acknowledged, non-blocking): There is no partial-success test (some sessions succeed, some hit silent zero). The 0.0 fallback used when scores is empty and no recognized failure type was tracked is pre-existing and was not introduced by this PR.

Refs

Refs #169


Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: decko

decko and others added 2 commits April 23, 2026 22:07
…oogle provider

When the instructor library fails to parse Google/Gemini structured output,
it silently returns a Pydantic model with default values (value=0.0, reason=None)
instead of raising a ValidationError. This causes Ragas metrics to record a
misleading 0.0 for affected sessions, pulling the average score down without
any indication of failure.

Fix:
- Add InstructorSilentZeroError (RuntimeError subclass) to adapter.py so the
  per-session error handler can distinguish this failure from other errors.
- Add is_instructor_silent_zero(result, provider) detection function that fires
  only for provider="google" + result.value==0.0 + no reason.
- Apply detection in all four Ragas metrics (faithfulness, precision, recall,
  relevancy): raise InstructorSilentZeroError when detected, track
  silent_zero_failures, and return score=None when ALL failures are silent zeros.
- Pop top_p from Google LLM model_args after llm_factory() call, mirroring the
  existing Anthropic fix -- Google also rejects temperature + top_p together,
  which is one path that triggers the silent-zero bug.
- Add ty: ignore[unresolved-attribute] annotations for model_args.pop() calls
  (the attribute exists at runtime but is not in the type stub).
- Add towncrier fragment changes/169.fix.

Closes #169

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Assigned-by: soda-orchestrator
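
The top_p removal from the commit message above could look roughly like the sketch below. `strip_top_p` is a hypothetical helper name, and the `SimpleNamespace` object only stands in for whatever llm_factory() returns; the one thing taken from the PR is that `model_args` supports `.pop()`.

```python
from types import SimpleNamespace


def strip_top_p(llm: object) -> object:
    """Remove top_p from llm.model_args if present; otherwise do nothing."""
    model_args = getattr(llm, "model_args", None)
    if isinstance(model_args, dict):
        # pop with a default so a missing key is not an error
        model_args.pop("top_p", None)
    return llm


# Hypothetical stand-in for the object returned by llm_factory().
llm = SimpleNamespace(model_args={"temperature": 0.0, "top_p": 0.1})
strip_top_p(llm)
print(llm.model_args)
```

Passing a default to `pop` keeps the call safe if a later library version stops setting top_p at all.
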
Removed stash conflict markers (<<<, ===, >>>) and kept the
ruff-formatted multi-line signature for
test_docs_path_within_cwd_but_outside_manifest_parent_accepted.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
decko added the ai-assisted label (Implemented with AI assistance) Apr 24, 2026
When some sessions score normally and others hit the instructor#1658
silent zero bug, the failure count and warning are now included in
the metric details. Previously silent zero failures were tracked
but not surfaced when valid scores also existed.

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: decko
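
One way to picture the aggregation behaviour across the three commits is the sketch below. The `aggregate` function, the `details` keys, and the warning text are illustrative assumptions rather than the PR's actual code; only the exception name and the score=None / 0.0 outcomes come from the description.

```python
from statistics import mean
from typing import Any, Callable, Dict, Iterable, List, Optional


class InstructorSilentZeroError(RuntimeError):
    """Stand-in for the exception the PR adds to adapter.py."""


def aggregate(
    sessions: Iterable[Any],
    score_one: Callable[[Any], float],
) -> Dict[str, Any]:
    """Average per-session scores, tracking silent-zero failures separately."""
    scores: List[float] = []
    silent_zero_failures = 0
    for session in sessions:
        try:
            scores.append(score_one(session))
        except InstructorSilentZeroError:
            silent_zero_failures += 1

    details: Dict[str, Any] = {}
    if silent_zero_failures:
        # Surfaced even when valid scores also exist (partial success).
        details["silent_zero_failures"] = silent_zero_failures
        details["warning"] = "some sessions hit instructor#1658 and were skipped"

    score: Optional[float]
    if scores:
        score = mean(scores)
    elif silent_zero_failures:
        score = None  # every session failed with a silent zero: report N/A
    else:
        score = 0.0  # pre-existing fallback: no scores, no recognized failure
    return {"score": score, "details": details}


def fake_scorer(session: Optional[float]) -> float:
    # Hypothetical scorer: None simulates a session that hit the silent zero.
    if session is None:
        raise InstructorSilentZeroError("silent zero from Google provider")
    return session


partial = aggregate([0.5, None, 1.0], fake_scorer)
all_failed = aggregate([None, None], fake_scorer)
print(partial["score"], partial["details"]["silent_zero_failures"])
print(all_failed["score"])
```

Skipped sessions are left out of the mean entirely, so a partial run averages only the sessions that produced a real score.
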
decko merged commit f73128d into main Apr 24, 2026
4 checks passed
decko deleted the soda/169 branch April 24, 2026 01:48