Release v0.2.0 · cognica-io/bayesian-bm25

What's New

Base Rate Prior for Unsupervised Calibration

Auto-estimating the sigmoid midpoint as beta = median(scores) assigns ~50% relevance probability to median-scoring documents, which are almost never relevant, and so produces systematic overconfidence. The new base rate prior corrects this by decomposing the posterior into three additive log-odds terms:

logit(P) = logit(L) + logit(b_r) + logit(p)

This reduces expected calibration error (ECE) by 68-77% on BEIR datasets (NFCorpus, SciFact) without requiring any relevance labels.
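
To make the decomposition concrete, here is a minimal sketch in plain Python. It assumes L is the score-derived likelihood, b_r the base rate, and p a remaining prior (the release notes do not define the symbols). The function names are illustrative and are not part of the bayesian_bm25 API.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability in (0, 1)."""
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    """Inverse of logit: maps log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def posterior(likelihood: float, base_rate: float, prior: float = 0.5) -> float:
    """Combine the three terms additively in log-odds space.

    Assumed reading of logit(P) = logit(L) + logit(b_r) + logit(p);
    prior=0.5 contributes zero log-odds, i.e. it is neutral.
    """
    return sigmoid(logit(likelihood) + logit(base_rate) + logit(prior))


# A score sitting exactly at the sigmoid midpoint has likelihood 0.5.
# Without a base rate (b_r = 0.5, zero log-odds) it stays at 0.5.
uncorrected = posterior(0.5, base_rate=0.5)

# With an illustrative base rate of 5%, the same score drops to about 0.05.
corrected = posterior(0.5, base_rate=0.05)

print(f"uncorrected={uncorrected:.3f} corrected={corrected:.3f}")
```

Under this reading, a base rate of 0.5 is a no-op, and smaller base rates shift every posterior down by the same amount in log-odds space.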

Usage

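from bayesian_bm25 import BayesianBM25Scorer

# base_rate="auto" enables the base rate prior, estimated from the corpus score distribution.
scorer = BayesianBM25Scorer(k1=1.2, b=0.75, base_rate="auto")
# corpus_tokens is the pre-tokenized corpus, built elsewhere.
scorer.index(corpus_tokens)

# retrieve() takes a batch of tokenized queries; here, one query with two tokens.
doc_ids, probabilities = scorer.retrieve([["machine", "learning"]], k=10)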

The base_rate parameter accepts:

None (default) -- no base rate correction
"auto" -- auto-estimate from corpus score distribution (95th percentile heuristic)
float in (0, 1) -- explicit base rate value
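
The three accepted forms can be summarized as a small validation routine. This is a hypothetical helper written for illustration; `check_base_rate` is not a function in the library, and the library's own validation may differ.

```python
from typing import Union

BaseRate = Union[None, str, float]


def check_base_rate(base_rate: BaseRate) -> BaseRate:
    """Accept None, the string "auto", or a float strictly between 0 and 1."""
    # None disables the correction; "auto" requests auto-estimation.
    if base_rate is None or base_rate == "auto":
        return base_rate
    # Reject other strings, booleans, and non-numeric values.
    if isinstance(base_rate, bool) or not isinstance(base_rate, (int, float)):
        raise ValueError(f"unsupported base_rate: {base_rate!r}")
    # The interval is open: 0 and 1 have infinite log-odds.
    if not 0.0 < base_rate < 1.0:
        raise ValueError("base_rate must lie strictly between 0 and 1")
    return float(base_rate)
```

The open interval matters because the base rate enters the posterior through logit(b_r), which is undefined at 0 and 1.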

Changes

Add base_rate parameter to BayesianProbabilityTransform and BayesianBM25Scorer
Add auto-estimation via 95th percentile pseudo-query heuristic
Add calibration verification benchmark (benchmarks/calibration.py)
Add theorem verification tests for base rate log-odds decomposition
Update Bayesian BM25 paper with Section 4.4 (Base Rate Prior) and Section 11.3 (Calibration Verification)
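
For readers unfamiliar with the metric behind the calibration benchmark, the following is a minimal sketch of expected calibration error using equal-width bins. It is a generic formulation, not a copy of benchmarks/calibration.py, whose binning scheme and bin count may differ. The data at the bottom is synthetic.

```python
from typing import Sequence


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted mean gap between predicted probability and observed relevance rate."""
    n = len(probs)
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Equal-width bins over [0, 1]; p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        confidence = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(confidence - accuracy)
    return ece


# Synthetic example: 1 relevant document out of 20.
labels = [1] + [0] * 19

# Predicting 0.5 everywhere is badly overconfident for a 5% relevance rate.
overconfident = expected_calibration_error([0.5] * 20, labels)

# Predicting 0.05 everywhere matches the observed rate.
calibrated = expected_calibration_error([0.05] * 20, labels)

print(f"overconfident={overconfident:.2f} calibrated={calibrated:.2f}")
```

The synthetic case mirrors the failure mode described above: assigning about 50% to documents that are rarely relevant yields a large ECE, while probabilities near the true relevance rate yield an ECE near zero.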

Full Changelog: v0.1.1...v0.2.0
PyPI: https://pypi.org/project/bayesian-bm25/0.2.0/
