Improve BayesianScoreQuery and LogOddsFusionQuery with base rate prior, weighted Log-OP, and parameter estimation by jaepil · Pull Request #15948 · apache/lucene

jaepil · 2026-04-10T04:19:29Z

Summary

Follow-up to #15827. This PR extends BayesianScoreQuery and LogOddsFusionQuery with three improvements:

BayesianScoreEstimator: Auto-estimates sigmoid calibration parameters (alpha, beta) and corpus-level base rate from score distributions via pseudo-query sampling
Base rate prior for BayesianScoreQuery: Optional corpus-level relevance prior that shifts the posterior in log-odds space: sigmoid(alpha * (score - beta) + logit(baseRate)), improving calibration for rare-relevance corpora
Weighted Logarithmic Opinion Pooling for LogOddsFusionQuery: Per-signal weights enabling weighted Log-OP where each signal's log-odds contribution is scaled by its reliability weight, plus optional logit normalization bounds

Algorithm Details

BayesianScoreEstimator

Estimates BayesianScoreQuery parameters from corpus statistics via pseudo-query sampling:

1. Sample N documents randomly from the index (Fisher-Yates partial shuffle)
2. For each document, create a pseudo-query from its first few tokens in the target field
3. Run each pseudo-query via BM25 and collect the score distribution
4. Estimate: beta = median(scores), alpha = 1 / std(scores)
5. Estimate base rate: mean fraction of documents scoring above the 95th percentile, clamped to [1e-6, 0.5]
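
The estimation steps above can be sketched in plain Java, without any Lucene dependency. This is a minimal illustration under stated assumptions, not the BayesianScoreEstimator code from this PR: the class and method names (EstimatorSketch, sampleDocIds, estimate) are hypothetical, the standard deviation is the population form, the percentile uses nearest-rank, and the base rate is read as "per pseudo-query fraction of the corpus scoring above the pooled 95th percentile, averaged over queries". The fallback alpha = 1.0 for a zero standard deviation is also an assumption.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative sketch only; names and details are assumptions, not the PR's code. */
final class EstimatorSketch {

  /** Estimated sigmoid slope (alpha), midpoint (beta), and corpus base rate. */
  static final class Params {
    final double alpha;
    final double beta;
    final double baseRate;

    Params(double alpha, double beta, double baseRate) {
      this.alpha = alpha;
      this.beta = beta;
      this.baseRate = baseRate;
    }
  }

  /** Partial Fisher-Yates shuffle: picks min(n, maxDoc) distinct doc IDs from [0, maxDoc). */
  static int[] sampleDocIds(int maxDoc, int n, Random rnd) {
    int[] ids = new int[maxDoc]; // O(maxDoc) memory; acceptable for a sketch
    for (int i = 0; i < maxDoc; i++) {
      ids[i] = i;
    }
    int k = Math.min(n, maxDoc);
    for (int i = 0; i < k; i++) {
      int j = i + rnd.nextInt(maxDoc - i); // choose from the not-yet-fixed suffix
      int tmp = ids[i];
      ids[i] = ids[j];
      ids[j] = tmp;
    }
    return Arrays.copyOf(ids, k);
  }

  /** scoresPerQuery[q] holds the BM25 scores of the hits of pseudo-query q. */
  static Params estimate(double[][] scoresPerQuery, int numDocs) {
    double[] pooled =
        Arrays.stream(scoresPerQuery).flatMapToDouble(Arrays::stream).sorted().toArray();
    int n = pooled.length;
    if (n == 0 || numDocs <= 0) {
      throw new IllegalArgumentException("need at least one score and a positive numDocs");
    }
    // beta = median of the pooled scores
    double beta = (n % 2 == 1) ? pooled[n / 2] : 0.5 * (pooled[n / 2 - 1] + pooled[n / 2]);
    // alpha = 1 / std (population standard deviation; assumed fallback when std is 0)
    double mean = Arrays.stream(pooled).average().orElse(0.0);
    double variance = Arrays.stream(pooled).map(s -> (s - mean) * (s - mean)).sum() / n;
    double std = Math.sqrt(variance);
    double alpha = std > 0.0 ? 1.0 / std : 1.0;
    // Nearest-rank 95th percentile of the pooled scores
    double p95 = pooled[Math.min(n - 1, (int) Math.ceil(0.95 * n) - 1)];
    // Mean over queries of the corpus fraction scoring strictly above p95
    double fractionSum = 0.0;
    for (double[] scores : scoresPerQuery) {
      long above = Arrays.stream(scores).filter(s -> s > p95).count();
      fractionSum += (double) above / numDocs;
    }
    double baseRate = Math.min(0.5, Math.max(1e-6, fractionSum / scoresPerQuery.length));
    return new Params(alpha, beta, baseRate);
  }
}
```

Passing a seeded Random to sampleDocIds makes the sample reproducible, which is the property the "Reproducibility with same random seed" test checks for the real estimator.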

Base Rate Prior

When a base rate r is set on BayesianScoreQuery, the posterior is computed as:

P = sigmoid(alpha * (score - beta) + logit(r))

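where logit(r) = log(r / (1 - r)). For r < 0.5 the offset is negative, so scores shift down for rare-relevance corpora (e.g., r = 0.01 adds a logit offset of about -4.6), improving calibration. Because the offset is the same for every document and the sigmoid is monotonic, ranking order within a single query is unchanged.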
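
The formula can be checked numerically with a few lines of standalone Java. This is a sketch of the arithmetic only; the class and method names (BaseRatePriorSketch, posterior) are hypothetical and do not mirror the BayesianScoreQuery API.

```java
/** Illustrative sketch of the base rate formula; not the PR's BayesianScoreQuery code. */
final class BaseRatePriorSketch {

  static double sigmoid(double x) {
    return 1.0 / (1.0 + Math.exp(-x));
  }

  static double logit(double p) {
    return Math.log(p / (1.0 - p));
  }

  /** P = sigmoid(alpha * (score - beta) + logit(baseRate)). */
  static double posterior(double score, double alpha, double beta, double baseRate) {
    if (!(baseRate > 0.0 && baseRate < 1.0)) {
      throw new IllegalArgumentException("baseRate must be in (0, 1), got " + baseRate);
    }
    return sigmoid(alpha * (score - beta) + logit(baseRate));
  }
}
```

Two consequences follow directly from the formula: a base rate of 0.5 contributes logit(0.5) = 0 and leaves the score unchanged, and a document whose raw score equals beta receives a posterior equal to the base rate itself.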

Weighted Log-OP

When per-signal weights are provided to LogOddsFusionQuery, the scoring formula changes from a uniform mean to a weighted sum:

uniform:  sigmoid(n^alpha * mean(softplus(logit(p_i))))
weighted: sigmoid(n^alpha * sum(w_i * gated(logit(p_i))))

Weights must be non-negative and sum to 1. Here gated(x) stands for the gating applied to each logit: softplus, or, when optional per-signal logit normalization bounds (logitMin, logitMax) are set, min-max normalization. The normalization variant is useful when learned signal scales differ significantly.
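
The weighted formula with softplus gating can be sketched as standalone Java. The names here (WeightedLogOpSketch, fuse) are hypothetical, the 1e-6 tolerance on the weight sum is an assumption, and the exponent alpha is treated as a plain parameter of the fusion, separate from the sigmoid calibration alpha. The min-max normalization variant is left out because its exact form is not stated in this description.

```java
/** Illustrative sketch of weighted Log-OP with softplus gating; not the PR's scorer code. */
final class WeightedLogOpSketch {

  static double sigmoid(double x) {
    return 1.0 / (1.0 + Math.exp(-x));
  }

  static double logit(double p) {
    return Math.log(p / (1.0 - p));
  }

  /** Numerically stable softplus: log(1 + exp(x)). */
  static double softplus(double x) {
    return Math.max(x, 0.0) + Math.log1p(Math.exp(-Math.abs(x)));
  }

  /** sigmoid(n^alpha * sum_i(w_i * softplus(logit(p_i)))), where n is the number of signals. */
  static double fuse(double[] probs, double[] weights, double alpha) {
    if (probs.length == 0 || probs.length != weights.length) {
      throw new IllegalArgumentException("probs and weights must be non-empty and equal length");
    }
    double weightSum = 0.0;
    for (double w : weights) {
      if (!(w >= 0.0)) { // also rejects NaN
        throw new IllegalArgumentException("weights must be non-negative");
      }
      weightSum += w;
    }
    if (Math.abs(weightSum - 1.0) > 1e-6) { // tolerance is an assumption
      throw new IllegalArgumentException("weights must sum to 1, got " + weightSum);
    }
    double pooled = 0.0;
    for (int i = 0; i < probs.length; i++) {
      pooled += weights[i] * softplus(logit(probs[i]));
    }
    return sigmoid(Math.pow(probs.length, alpha) * pooled);
  }
}
```

With equal weights of 1/n the weighted sum reduces to the uniform mean, so the two formulas above agree in that case.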

New Files

| File | Description |
| --- | --- |
| `BayesianScoreEstimator.java` | Auto-estimates alpha, beta, base rate from corpus score distributions |

Modified Files

| File | Description |
| --- | --- |
| `BayesianScoreQuery.java` | Add base rate prior support with logit-space shifting |
| `LogOddsFusionQuery.java` | Add per-signal weights, logit normalization bounds, and weighted Log-OP |
| `LogOddsFusionScorer.java` | Implement weighted scoring and logit normalization gating |
| `TestBayesianScoreQuery.java` | 11 new tests for base rate and estimator |
| `TestLogOddsFusionQuery.java` | 12 new tests for weighted fusion and normalization |

Test Coverage (23 new tests)

BayesianScoreQuery base rate (7 tests)

Base rate lowers scores compared to no base rate
Scores remain in (0, 1) range with base rate
Max score correctness with WAND optimization
Explanation includes base rate details
QueryUtils.check, equals/hashCode, illegal argument validation

BayesianScoreEstimator (4 tests)

Estimated parameters are finite and valid
Estimated parameters produce valid scores in (0, 1)
Max score correctness with estimated parameters
Reproducibility with same random seed

LogOddsFusionQuery weighted fusion (10 tests)

Weighted fusion produces valid scores
Weights affect ranking order
Explanation correctness for weighted variant
equals/hashCode, toString, rewrite, QueryUtils.check
Illegal weight validation (wrong length, negative, non-unit-sum)
Three-way weighted combination

LogOddsFusionQuery logit normalization (2 tests)

Normalized fusion produces valid scores in (0, 1)
Max score correctness with normalization bounds

Test plan

./gradlew tidy passes (google-java-format via Spotless)
./gradlew :lucene:core:compileJava :lucene:core:compileTestJava passes
All 57 tests pass in TestBayesianScoreQuery and TestLogOddsFusionQuery

…ybrid search
- Add BayesianScoreEstimator for auto-estimating sigmoid calibration parameters
- Add base rate prior support to BayesianScoreQuery for log-odds shifting
- Add per-signal weights to LogOddsFusionQuery for weighted Logarithmic Opinion Pooling
- Add logit normalization support to LogOddsFusionScorer
- Add comprehensive tests for BayesianScoreQuery and LogOddsFusionQuery

github-actions bot added the module:core/search label Apr 10, 2026

github-actions bot added this to the 11.0.0 milestone Apr 10, 2026

Move CHANGES.txt entry from 11.0.0 to 10.5.0 Improvements section

1de219b

github-actions bot modified the milestones: 11.0.0, 10.5.0 Apr 10, 2026

jaepil wants to merge 2 commits into apache:main from jaepil:bayesian-bm25