-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LUCENE-9635: BM25FQuery - Mask encoded norm long value in array lookup #2138
Conversation
ca1b48b
to
2a49e15
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch! The fix looks good but I'm unhappy that the test needs to generate such large strings. Could you change the test to e.g. use a similarity that always encodes the length normalization factor as a negative number at index time so that we wouldn't need giant strings to test this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @yiluncui -- I left one small comment about how exactly to do the byte to unsigned int conversion.
@@ -128,7 +128,7 @@ public boolean advanceExact(int target) throws IOException { | |||
for (int i = 0; i < normsArr.length; i++) { | |||
boolean found = normsArr[i].advanceExact(target); | |||
assert found; | |||
normValue += weightArr[i] * LENGTH_TABLE[(byte) normsArr[i].longValue()]; | |||
normValue += weightArr[i] * LENGTH_TABLE[(byte) normsArr[i].longValue() & 0xff]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I see that SimilarityBase.java
uses Byte.toUnsignedInt
instead -- should we use that here? It is the same computation?
Or, if you want to stick with & 0xff
, could you add parens around the (byte)
cast to make the order of operations clear?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the change using Byte.toUnsignedInt
…p to avoid negative norms in long documents
2a49e15
to
72e88fc
Compare
Updated the pull request addressing outstanding feedback. |
…p to avoid negative norms in long documents (#2138)
Thank you @yiluncui ! |
…p to avoid negative norms in long documents (apache#2138)
…p to avoid negative norms in long documents (apache#2138)
Description
Through some experimentation with with the BM25FQuery on long documents, I've discovered that there is a bug that doesn't mask the encoded norm's long value during scoring. For long documents (or long fields) this may cause ArrayIndexOutOfBoundsExceptions.
The line where I suspect the bug is being exposed is here
https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/MultiNormsLeafSimScorer.java#L131
Here is a similar use in BM25Similarity with the masking
lucene-solr/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java
Line 233 in c413656
My experimentation shows that to expose this bug, there must be a match for a token in more than one field (which is what BM25FQuery is for). In addition one of the fields must be >= 32792 tokens long.
I've provided tests in the pull request to demonstrate this.
Solution
Change the array lookup to norm & 0xff
Tests
Added tests for single and multiple long documents that exposes this problem.
Checklist
Please review the following and check all that apply:
master
branch../gradlew check
.