LUCENE-9635: BM25FQuery - Mask encoded norm long value in array lookup #2138

Merged
merged 1 commit into from
Dec 18, 2020

yiluncui
Contributor

@yiluncui yiluncui commented Dec 9, 2020

Description

Through some experimentation with BM25FQuery on long documents, I've discovered a bug: the encoded norm's long value is not masked before it is used as an array index during scoring. For long documents (or long fields) this may cause an ArrayIndexOutOfBoundsException.

The line where I suspect the bug is being exposed is here
https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/MultiNormsLeafSimScorer.java#L131

Here is a similar use in BM25Similarity with the masking

float normInverse = cache[((byte) encodedNorm) & 0xFF];

My experimentation shows that to expose this bug, there must be a match for a token in more than one field (which is what BM25FQuery is for). In addition, one of the fields must be >= 32792 tokens long.
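To illustrate the failure mode outside of Lucene, here is a minimal standalone sketch. `NormIndexDemo`, `TABLE`, and the helper methods are hypothetical names made up for this example; `TABLE` only stands in for a 256-entry lookup table such as `LENGTH_TABLE`, and `0x80L` is an arbitrary norm whose low byte has the high bit set. It is not the actual scorer code.

```java
public class NormIndexDemo {
    // Hypothetical 256-entry table standing in for a norm lookup table.
    static final float[] TABLE = new float[256];

    static {
        for (int i = 0; i < TABLE.length; i++) {
            TABLE[i] = i;
        }
    }

    // Cast only: the byte is sign-extended when widened to int,
    // so any low byte >= 0x80 becomes a negative index.
    static int unmaskedIndex(long encodedNorm) {
        return (byte) encodedNorm;
    }

    // Cast then mask: always lands in the range 0..255.
    static int maskedIndex(long encodedNorm) {
        return ((byte) encodedNorm) & 0xFF;
    }

    // Returns true if the unmasked lookup throws.
    static boolean unmaskedLookupFails(long encodedNorm) {
        try {
            float ignored = TABLE[unmaskedIndex(encodedNorm)];
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        long encodedNorm = 0x80L; // assumed example value, low byte has the high bit set
        System.out.println("unmasked index: " + unmaskedIndex(encodedNorm));
        System.out.println("masked index:   " + maskedIndex(encodedNorm));
        System.out.println("unmasked lookup fails: " + unmaskedLookupFails(encodedNorm));
    }
}
```

Norms whose low byte is below 0x80 give the same index either way, which is consistent with the problem only showing up once a field is long enough.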

I've provided tests in the pull request to demonstrate this.

Solution

Change the array lookup to norm & 0xff

Tests

Added tests for single and multiple long documents that expose this problem.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the master branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Ref Guide (for Solr changes only).

@yiluncui yiluncui force-pushed the fix-bm25f-multinormsimscorer branch 2 times, most recently from ca1b48b to 2a49e15 Compare December 9, 2020 22:00
Contributor

@jpountz jpountz left a comment

Great catch! The fix looks good but I'm unhappy that the test needs to generate such large strings. Could you change the test to e.g. use a similarity that always encodes the length normalization factor as a negative number at index time so that we wouldn't need giant strings to test this?

Member

@mikemccand mikemccand left a comment

Thanks @yiluncui -- I left one small comment about how exactly to do the byte to unsigned int conversion.

@@ -128,7 +128,7 @@ public boolean advanceExact(int target) throws IOException {
       for (int i = 0; i < normsArr.length; i++) {
         boolean found = normsArr[i].advanceExact(target);
         assert found;
-        normValue += weightArr[i] * LENGTH_TABLE[(byte) normsArr[i].longValue()];
+        normValue += weightArr[i] * LENGTH_TABLE[(byte) normsArr[i].longValue() & 0xff];
Member

Actually, I see that SimilarityBase.java uses Byte.toUnsignedInt instead -- should we use that here? Is it the same computation?

Or, if you want to stick with & 0xff, could you add parens around the (byte) cast to make the order of operations clear?
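For reference, a small standalone check of both questions. The class and method names below are made up for this sketch. It compares `Byte.toUnsignedInt` (part of the JDK since Java 8) with `& 0xff` for every byte value, and compares the expression with and without the extra parentheses. In Java a cast binds tighter than `&`, so the parentheses only affect readability.

```java
public class UnsignedByteDemo {
    // True if Byte.toUnsignedInt(b) and (b & 0xff) agree for all 256 byte values.
    static boolean sameAsMask() {
        for (int v = Byte.MIN_VALUE; v <= Byte.MAX_VALUE; v++) {
            byte b = (byte) v;
            if (Byte.toUnsignedInt(b) != (b & 0xff)) {
                return false;
            }
        }
        return true;
    }

    // True if "(byte) x & 0xff" and "((byte) x) & 0xff" agree for all low-byte values,
    // i.e. the cast is applied before the mask in both spellings.
    static boolean parensDoNotChangeResult() {
        for (long x = -256; x <= 511; x++) {
            int withoutParens = (byte) x & 0xff;
            int withParens = ((byte) x) & 0xff;
            if (withoutParens != withParens) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("toUnsignedInt matches & 0xff: " + sameAsMask());
        System.out.println("parentheses are cosmetic:     " + parensDoNotChangeResult());
    }
}
```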

Contributor Author

I've updated the change to use Byte.toUnsignedInt.

@yiluncui
Contributor Author

Updated the pull request addressing outstanding feedback.

@jpountz jpountz merged commit 894b6b5 into apache:master Dec 18, 2020
jpountz pushed a commit that referenced this pull request Dec 18, 2020
@jpountz
Contributor

jpountz commented Dec 18, 2020

Thank you @yiluncui !

ctargett pushed a commit to ctargett/lucene-solr that referenced this pull request Jan 11, 2021
epugh pushed a commit to epugh/lucene-solr-1 that referenced this pull request Jan 15, 2021