…oogle provider

When the instructor library fails to parse Google/Gemini structured output, it silently returns a Pydantic model with default values (value=0.0, reason=None) instead of raising a ValidationError. This causes Ragas metrics to record a misleading 0.0 for affected sessions, pulling the average score down without any indication of failure.

Fix:
- Add InstructorSilentZeroError (RuntimeError subclass) to adapter.py so the per-session error handler can distinguish this failure from other errors.
- Add is_instructor_silent_zero(result, provider) detection function that fires only for provider="google" + result.value==0.0 + no reason.
- Apply detection in all four Ragas metrics (faithfulness, precision, recall, relevancy): raise InstructorSilentZeroError when detected, track silent_zero_failures, and return score=None when ALL failures are silent zeros.
- Pop top_p from Google LLM model_args after the llm_factory() call, mirroring the existing Anthropic fix. Google also rejects temperature + top_p together, which is one path that triggers the silent-zero bug.
- Add ty: ignore[unresolved-attribute] annotations for model_args.pop() calls (the attribute exists at runtime but is not in the type stub).
- Add towncrier fragment changes/169.fix.

Closes #169
Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Assigned-by: soda-orchestrator

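The detection rule described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in adapter.py: MetricResult and checked_value are hypothetical stand-ins, and treating an empty string as "no reason" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


class InstructorSilentZeroError(RuntimeError):
    """Raised when a result looks like an unparsed default, not a real score."""


@dataclass
class MetricResult:
    # Hypothetical stand-in for the Pydantic model that instructor returns.
    value: float = 0.0
    reason: Optional[str] = None


def is_instructor_silent_zero(result, provider: str) -> bool:
    """True only for provider 'google' with value == 0.0 and no reason."""
    if provider != "google":
        return False
    value = getattr(result, "value", None)
    reason = getattr(result, "reason", None)
    # Assumption: None and "" both count as "no reason".
    return value == 0.0 and not reason


def checked_value(result, provider: str) -> float:
    # Hypothetical helper: raise instead of recording a misleading 0.0.
    if is_instructor_silent_zero(result, provider):
        raise InstructorSilentZeroError(
            "structured output came back as defaults (value=0.0, no reason)"
        )
    return result.value
```

A genuine 0.0 that comes with a reason passes through unchanged, so only the default-valued result is treated as a failure.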
Removed stash conflict markers (<<<, ===, >>>) and kept the ruff-formatted multi-line signature for test_docs_path_within_cwd_but_outside_manifest_parent_accepted. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When some sessions score normally and others hit the instructor#1658 silent zero bug, the failure count and warning are now included in the metric details. Previously silent zero failures were tracked but not surfaced when valid scores also existed. Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Assigned-by: decko
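The partial-success behaviour described above might look something like this sketch. The function name aggregate, the details keys, and the warning text are assumptions for illustration. Only InstructorSilentZeroError, silent_zero_failures, score=None, and the pre-existing 0.0 fallback come from the PR text.

```python
class InstructorSilentZeroError(RuntimeError):
    """Stand-in for the error class named in the PR."""


def aggregate(session_scorers) -> dict:
    """Average per-session scores and surface silent-zero failures.

    session_scorers is an iterable of zero-argument callables that return a
    float or raise InstructorSilentZeroError. Other exceptions propagate in
    this sketch.
    """
    scores = []
    silent_zero_failures = 0
    for scorer in session_scorers:
        try:
            scores.append(scorer())
        except InstructorSilentZeroError:
            silent_zero_failures += 1

    details = {"sessions_scored": len(scores)}
    if silent_zero_failures:
        # Reported even when valid scores also exist.
        details["silent_zero_failures"] = silent_zero_failures
        details["warning"] = (
            f"{silent_zero_failures} session(s) hit the instructor#1658 "
            "silent zero bug and were excluded from the average"
        )

    if scores:
        score = sum(scores) / len(scores)
    elif silent_zero_failures:
        score = None  # N/A: every session failed with a silent zero
    else:
        score = 0.0  # pre-existing fallback noted in the review
    return {"score": score, "details": details}
```

Excluding the failed sessions from the mean keeps the silent zeros from dragging the average down, while the count in details keeps the failure visible.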
Summary
- The instructor library silently returns value=0.0 / reason=None (instead of raising a ValidationError) for Google/Gemini structured output (upstream bug instructor#1658).
- Adds InstructorSilentZeroError and is_instructor_silent_zero() in src/raki/metrics/ragas/adapter.py; integrates detection in all four Ragas metric scorers (faithfulness, precision, recall, relevancy).
- Returns score=None (N/A) when every session in a run hits a silent-zero failure, preventing misleading 0.0 averages.
- Pops top_p from Google LLM model_args (mirrors existing Anthropic fix) to prevent API rejection.

Acceptance Criteria
- Detection fires only for provider=="google" with value==0.0 AND no reason
- score=None returned when all sessions hit silent-zero failures
- top_p removed from Google LLM path
- Conflict markers in tests/test_cli.py resolved (kept ruff-formatted multi-line signature)

Review Results
Verification ✅
926 passed, 4 deselected. All 20 new tests pass.
ruff check, ruff format, and ty check clean. raki validate --deep operational metrics compute successfully. The 12 warning[unused-ignore-comment] diagnostics from ty are all pre-existing on the base branch.

Code Review 🔴 → ✅ (rework applied)
CRITICAL (fixed): Unresolved Git conflict markers in tests/test_cli.py at lines 1623–1629 caused a SyntaxError. Resolved by keeping the ruff-formatted multi-line method signature and removing the conflict markers.
Minor (acknowledged, non-blocking): No partial-success test (some sessions succeed, some hit silent zero). Pre-existing 0.0 fallback when scores is empty and no recognized failure type was tracked (not introduced by this PR).

Refs
Refs #169
Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: decko