chore(rag): v0.34.0 - delete LLMJudgeReranker outright #71
Merged
Removes the LLM-as-judge RAG reranker class + factory + registry branch + tests. Per Brian directive 2026-05-25: ``[[preference_llm_calls_confined_to_research_module]]`` strict guardrail + operator-validated no-lift finding on the SEC-filings corpus + tier-5-not-SOTA implementation framing.

**Why now:**

1. **Operator-validated regressor.** 2026-05-12 eval (recorded in ``alpha-engine-config/private-docs/EXPERIMENTS.md``) measured -14.2% recall@10 vs the hybrid w=0.7 baseline. The cross-encoder variant also regressed (-33.3% recall@10) but stays in the lib (no LLM exposure; same protocol surface for future revisits).
2. **Sub-SOTA implementation.** Tier-5 SOTA at best: LLM-as-judge reranking is for cases that need novel rubrics with no training labels (e.g., "rerank by recency-weighted financial materiality"), not for general relevance scoring. Single-integer 1-5 parse with neutral-3 fallback, per-candidate sequential calls (no listwise batching), no rationale capture, hardcoded model snapshot. These are all known anti-patterns vs. modern LLM-rerank surveys.
3. **Architectural exposure.** Strict-rule reading of ``[[preference_llm_calls_confined_to_research_module]]`` flags any LLM call site in the lib as a latent breach: even default-off, a future caller setting ``RAG_RERANK=llm_judge`` outside research would slip a covert LLM call past the guardrail. Outright deletion is structurally compliant.

**Institutional rerank-revisit path** (recorded for the future): domain-finetune ``CrossEncoderReranker`` on operator-labeled (query, doc, relevance) triples mined from production retrieval logs. That's the SOTA pattern for institutional RAG reranking; finetuned CE models lift +5-15% recall@10 on domain corpora vs general-purpose CE. A P2 ROADMAP entry covering scope + gate is filed via ``alpha-engine-config`` companion PR.

**Removed surfaces:**

- ``LLMJudgeReranker`` class
- ``_DEFAULT_LLM_RUBRIC`` constant
- ``_default_llm_judge_factory`` + ``_LLM_JUDGE_FACTORY`` global
- ``"llm_judge"`` branch in ``get_reranker``
- ``Callable`` typing import (unused after factory removal)
- ``os`` import in this module (factory was the only consumer)
- ``TestLLMJudgeReranker`` class + ``_mock_anthropic_client`` helper

**Kept:**

- ``CrossEncoderReranker`` (zero LLM exposure, regressor on SEC but same protocol shape for future domain-finetune retry)
- ``RerankCache`` (still used by CE)
- ``Reranker`` protocol + ``_attach_and_sort`` helper
- ``get_reranker`` with ``"cross_encoder"`` as the sole supported name (and an explicit ``"llm_judge" was removed v0.34.0`` docstring note)

**Consumer updates** (separate PRs this session):

- alpha-engine-research: bump lib pin v0.33→v0.34, drop the ``"hybrid w=0.7 + llm rerank"`` entry from ``evals/rag_retrieval.py`` ``DEFAULT_CONDITIONS``, update ``qual_tools.py`` env-var docstring
- alpha-engine-config: P2 ROADMAP entry + EXPERIMENTS.md update

Suite 741 → 738 (-3: ``test_parses_haiku_integer_response``, ``test_cache_hit_skips_llm_call``, ``test_parse_failure_returns_neutral_three``).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
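For readers who have not opened the diff, the post-removal lookup might look roughly like the sketch below. It is an illustration only: the ``rerank`` signature, the placeholder sort logic, the ``_REMOVED`` table, and the choice to raise ``ValueError`` are all assumptions, since the description above only confirms that ``"cross_encoder"`` is the sole supported name and that the docstring notes the ``"llm_judge"`` removal.

```python
from typing import Protocol


class Reranker(Protocol):
    """Stand-in for the lib's Reranker protocol (method signature assumed)."""

    def rerank(self, query: str, candidates: list[dict]) -> list[dict]: ...


class CrossEncoderReranker:
    """Placeholder only; the real class scores (query, doc) pairs with a CE model."""

    def rerank(self, query: str, candidates: list[dict]) -> list[dict]:
        # Hypothetical behavior: order by a precomputed "score", highest first.
        return sorted(candidates, key=lambda c: c.get("score", 0.0), reverse=True)


# Hypothetical table of retired names, used to give a targeted error message.
_REMOVED = {"llm_judge": "v0.34.0"}


def get_reranker(name: str) -> Reranker:
    """Return a reranker by name.

    "cross_encoder" is the sole supported name; "llm_judge" was removed v0.34.0.
    """
    if name == "cross_encoder":
        return CrossEncoderReranker()
    if name in _REMOVED:
        raise ValueError(f'reranker "{name}" was removed in {_REMOVED[name]}')
    raise ValueError(f"unknown reranker: {name!r}")
```

Failing loudly on a retired name, rather than silently falling back to no reranking, is one way to surface a stale ``RAG_RERANK=llm_judge`` setting in a consumer; whether the real function does this is not stated here.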
Summary
Removes the LLM-as-judge RAG reranker per Brian directive 2026-05-25: `[[preference_llm_calls_confined_to_research_module]]` strict guardrail + operator-validated no-lift finding + tier-5-not-SOTA implementation framing.
Why now
1. Operator-validated regressor. 2026-05-12 eval (EXPERIMENTS.md): -14.2% recall@10 vs hybrid w=0.7 baseline on SEC-filings corpus.
2. Sub-SOTA implementation. Tier-5 SOTA at best — LLM-as-judge reranking is for novel rubrics without training labels, not general relevance scoring. Single-integer 1-5 parse with neutral-3 fallback, sequential per-candidate calls (no listwise batching), no rationale, hardcoded model snapshot.
3. Architectural exposure. Strict-rule reading: any LLM call site in the lib is a latent breach — a future caller setting `RAG_RERANK=llm_judge` outside research would slip a covert LLM call past the guardrail.
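To make the "single-integer 1-5 parse with neutral-3 fallback" point in item 2 concrete, here is a minimal sketch of that pattern. It is not the removed code: the function name, the regex, and the sample replies are hypothetical, chosen only to show why the fallback is fragile.

```python
import re

NEUTRAL_SCORE = 3  # fallback used when the model reply cannot be parsed


def parse_judge_score(reply: str) -> int:
    """Extract a single 1-5 integer from an LLM reply, else return neutral 3."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else NEUTRAL_SCORE


# Hypothetical replies: two parseable, one refusal, one empty response.
replies = ["4", "Score: 2", "I cannot rate this document.", ""]
scores = [parse_judge_score(r) for r in replies]
print(scores)  # [4, 2, 3, 3]
```

The last two scores are indistinguishable from a genuine "3" judgment, so parse failures are silently ranked mid-pack and, with no rationale captured, cannot be audited afterwards.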
Institutional rerank-revisit path (recorded)
Domain-finetune `CrossEncoderReranker` on operator-labeled (query, doc, relevance) triples mined from production retrieval logs. P2 ROADMAP entry filed via companion alpha-engine-config PR.
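As a rough illustration of the data shape this path implies, the sketch below filters retrieval-log records down to labeled (query, doc, relevance) triples. The log field names (`query`, `doc_text`, `operator_label`) and the 0/1 label scale are assumptions made for the example; the actual log schema and labeling scheme are not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledTriple:
    query: str
    doc: str
    relevance: int  # operator-assigned label; the 0/1 scale is assumed


def mine_triples(log_records: list[dict]) -> list[LabeledTriple]:
    """Keep only log records that carry an operator label."""
    triples = []
    for rec in log_records:
        label = rec.get("operator_label")
        if label is None:
            continue  # unlabeled retrievals are not usable as training data
        triples.append(LabeledTriple(rec["query"], rec["doc_text"], int(label)))
    return triples


# Hypothetical log records; the middle one has no label and is skipped.
logs = [
    {"query": "going concern doubt", "doc_text": "10-K excerpt A", "operator_label": 1},
    {"query": "going concern doubt", "doc_text": "10-K excerpt B"},
    {"query": "revenue recognition", "doc_text": "10-Q excerpt C", "operator_label": 0},
]
triples = mine_triples(logs)
print(len(triples))  # 2
```

Triples in this shape are what a cross-encoder finetuning loop would typically consume as (text pair, label) examples.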
Removed
Kept
Test plan
🤖 Generated with Claude Code