LUCENE-9635: BM25FQuery - Mask encoded norm long value in array lookup #2138

yiluncui · 2020-12-09T21:43:26Z

Description

Through some experimentation with the BM25FQuery on long documents, I've discovered a bug: the encoded norm's long value is not masked before it is used as an array index during scoring. For long documents (or long fields) this may cause an ArrayIndexOutOfBoundsException.

The line where I suspect the bug is exposed is here:
https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/MultiNormsLeafSimScorer.java#L131

Here is a similar use in BM25Similarity with the masking

lucene-solr/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java

Line 233 in c413656

float normInverse = cache[((byte) encodedNorm) & 0xFF];

My experimentation shows that to expose this bug, there must be a match for a token in more than one field (which is what BM25FQuery is for). In addition, one of the fields must be >= 32792 tokens long.

I've provided tests in the pull request to demonstrate this.

Solution

Change the array lookup to norm & 0xff
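To make the failure mode concrete, here is a minimal standalone sketch that does not use Lucene at all. The class name `NormLookupSketch`, the `TABLE` array, and its placeholder values are assumptions for illustration only; they stand in for a 256-entry lookup table such as `LENGTH_TABLE`. It shows that casting a long to `byte` sign-extends, so any value whose low byte has the high bit set becomes a negative index unless it is masked.

```java
public class NormLookupSketch {
    // Hypothetical stand-in for a 256-entry table such as LENGTH_TABLE.
    static final float[] TABLE = new float[256];

    static {
        for (int i = 0; i < TABLE.length; i++) {
            TABLE[i] = i; // placeholder values, not Lucene's real decoded lengths
        }
    }

    // Unmasked lookup: the byte cast sign-extends, so a low byte of
    // 0x80..0xFF becomes an index in the range -128..-1.
    static float unmaskedLookup(long encodedNorm) {
        return TABLE[(byte) encodedNorm];
    }

    // Masked lookup, in the style of the BM25Similarity line quoted above:
    // the mask maps the byte back into 0..255.
    static float maskedLookup(long encodedNorm) {
        return TABLE[((byte) encodedNorm) & 0xFF];
    }

    public static void main(String[] args) {
        long encodedNorm = 0x80L; // any value whose low byte has the high bit set

        System.out.println("masked lookup: " + maskedLookup(encodedNorm));
        try {
            unmaskedLookup(encodedNorm);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("unmasked lookup failed: " + e.getMessage());
        }
    }
}
```

Encoded values with a low byte below 0x80 behave identically in both lookups, which would be consistent with the bug only showing up once a field is long enough.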

Tests

Added tests for single and multiple long documents that expose this problem.

Checklist

Please review the following and check all that apply:

I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
I have created a Jira issue and added the issue ID to my pull request title.
I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
I have developed this patch against the master branch.
I have run ./gradlew check.
I have added tests for my changes.
I have added documentation for the Ref Guide (for Solr changes only).

jpountz

Great catch! The fix looks good but I'm unhappy that the test needs to generate such large strings. Could you change the test to e.g. use a similarity that always encodes the length normalization factor as a negative number at index time so that we wouldn't need giant strings to test this?

mikemccand

Thanks @yiluncui -- I left one small comment about how exactly to do the byte to unsigned int conversion.

mikemccand · 2020-12-14T14:11:55Z

lucene/sandbox/src/java/org/apache/lucene/sandbox/search/MultiNormsLeafSimScorer.java

@@ -128,7 +128,7 @@ public boolean advanceExact(int target) throws IOException {
      for (int i = 0; i < normsArr.length; i++) {
        boolean found = normsArr[i].advanceExact(target);
        assert found;
-        normValue += weightArr[i] * LENGTH_TABLE[(byte) normsArr[i].longValue()];
+        normValue += weightArr[i] * LENGTH_TABLE[(byte) normsArr[i].longValue() & 0xff];


Actually, I see that SimilarityBase.java uses Byte.toUnsignedInt instead -- should we use that here? Is it the same computation?

Or, if you want to stick with & 0xff, could you add parens around the (byte) cast to make the order of operations clear?

I've updated the change using Byte.toUnsignedInt
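As a quick check on the question raised in this review thread, the sketch below compares the two conversions over every possible byte value. The class and method names are hypothetical and exist only for this illustration. It also includes the unparenthesized form from the diff: in Java a cast binds tighter than `&`, so `(byte) v & 0xff` parses as `((byte) v) & 0xff`, and the parentheses only help readability.

```java
public class UnsignedByteSketch {
    // Explicit parentheses around the cast, as suggested in the review.
    static int viaMask(long v) {
        return ((byte) v) & 0xff;
    }

    // Same expression without parentheses; the cast still applies first.
    static int viaMaskNoParens(long v) {
        return (byte) v & 0xff;
    }

    // The JDK helper (available since Java 8).
    static int viaToUnsignedInt(long v) {
        return Byte.toUnsignedInt((byte) v);
    }

    // True if all three forms agree and stay within 0..255 for the given inputs.
    static boolean allAgree(long from, long to) {
        for (long v = from; v <= to; v++) {
            int a = viaMask(v);
            int b = viaMaskNoParens(v);
            int c = viaToUnsignedInt(v);
            if (a != b || b != c || a < 0 || a > 255) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The range covers every low-byte pattern several times over.
        System.out.println("all forms agree: " + allAgree(-512, 512));
    }
}
```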

yiluncui · 2020-12-18T08:31:54Z

Updated the pull request addressing outstanding feedback.

jpountz · 2020-12-18T15:15:38Z

Thank you @yiluncui !

yiluncui force-pushed the fix-bm25f-multinormsimscorer branch 2 times, most recently from ca1b48b to 2a49e15 on December 9, 2020 22:00

jpountz reviewed Dec 14, 2020

mikemccand reviewed Dec 14, 2020

LUCENE-9635: BM25FQuery - Mask encoded norm long value in array looku…

72e88fc

…p to avoid negative norms in long documents

yiluncui force-pushed the fix-bm25f-multinormsimscorer branch from 2a49e15 to 72e88fc on December 18, 2020 08:29

jpountz merged commit 894b6b5 into apache:master Dec 18, 2020

jpountz pushed a commit that referenced this pull request Dec 18, 2020

LUCENE-9635: BM25FQuery - Mask encoded norm long value in array looku…

9a53155

…p to avoid negative norms in long documents (#2138)

ctargett pushed a commit to ctargett/lucene-solr that referenced this pull request Jan 11, 2021

LUCENE-9635: BM25FQuery - Mask encoded norm long value in array looku…

6066e72

…p to avoid negative norms in long documents (apache#2138)

epugh pushed a commit to epugh/lucene-solr-1 that referenced this pull request Jan 15, 2021

LUCENE-9635: BM25FQuery - Mask encoded norm long value in array looku…

a0a31fc

…p to avoid negative norms in long documents (apache#2138)
