Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix failing BaseVectorSimilarityQueryTestCase#testApproximate #12922

Merged
merged 3 commits into from
Dec 13, 2023

Conversation

kaivalnp
Copy link
Contributor

Discovered in #12921, and introduced in #12679

The first issue is that we weren't advancing the VectorScorer here -- so it was still un-positioned while trying to compute the similarity score

Earlier in the PR, the underlying delegate of the FilteredDocIdSetIterator was scorer.iterator() (see here) -- so we didn't need to explicitly advance it

Later, we decided to maintain parity to AbstractKnnVectorQuery and introduce filtering in AbstractVectorSimilarityQuery (see this commit) to determine the visitLimit of approximate search -- after which the underlying iterator changed to the accepted docs (see here) and I missed advancing the VectorScorer explicitly..

After doing so, we no longer get the original java.lang.ArrayIndexOutOfBoundsException -- but the BaseVectorSimilarityQueryTestCase#testApproximate starts failing because it falls back to exact search, as the limit of the prefilter is met during graph search

Relaxed the parameters of the test to fix this (making the filter less restrictive, and trying to visit a fewer number of nodes so that approximate search completes without hitting its limit)

Sorry for missing this earlier!

Comment on lines +258 to +262
// Advance the scorer
if (!scorer.advanceExact(doc)) {
return false;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tells me that none of the filtered cases drop to using exact search? If they did they would match on docs that were actually filtered out.

Could you add a test exercising this path?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tells me that none of the filtered cases drop to using exact search?

We do have a test case that covers this: testRandomFilter randomly filters a sub-range of documents and expects all of them to be found -- which always falls back to exact search (because a traversalSimilarity of Float.NEGATIVE_INFINITY here will visit the entire graph and exhaust any limit)

While running this test, I put debug points here to check the scorer, and it was always unpositioned (docid = -1) but doesn't throw an error in many cases

The error is thrown when the underlying values are randomly chosen to SimpleTextFloatVectorValues - try ./gradlew test --tests TestFloatVectorSimilarityQuery.testRandomFilter -Dtests.seed=119135B1F0803918 for a failing case (as mentioned here, thanks @epotyom!)

In other cases (for example when values are DenseOffHeapVectorValues), the scorer does not fail even when unpositioned and returns some (garbage) value - but the count of results is still correct (and asserted)

I wonder if we can put an assert values.docID() != -1 before this and this line to ensure it is positioned before trying to compute scores? -- I tried this offline, and the issue was caught immediately

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a commit to demonstrate this^, please let me know if we can improve this / do it some other way?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding an assert is nice :). In reality it would throw trying to read garbage.

@benwtrent benwtrent merged commit 6c5dcc1 into apache:main Dec 13, 2023
4 checks passed
benwtrent pushed a commit that referenced this pull request Dec 13, 2023
Discovered in #12921, and introduced in #12679

The first issue is that we weren't advancing the `VectorScorer` [here](https://github.com/apache/lucene/blob/cf13a9295052288b748ed8f279f05ee26f3bfd5f/lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java#L257-L262) -- so it was still un-positioned while trying to compute the similarity score

Earlier in the PR, the underlying delegate of the `FilteredDocIdSetIterator` was `scorer.iterator()` (see [here](https://github.com/apache/lucene/blob/cad565439be512ac6e95a698007b1fc971173f00/lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java#L107)) -- so we didn't need to explicitly advance it

Later, we decided to maintain parity to `AbstractKnnVectorQuery` and introduce filtering in `AbstractVectorSimilarityQuery` (see [this commit](5096790)) to determine the `visitLimit` of approximate search -- after which the underlying iterator changed to the accepted docs (see [here](https://github.com/apache/lucene/blob/5096790f281e477c529a7c8311aeb353ccdffdeb/lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java#L255)) and I missed advancing the `VectorScorer` explicitly..

After doing so, we no longer get the original `java.lang.ArrayIndexOutOfBoundsException` -- but the `BaseVectorSimilarityQueryTestCase#testApproximate` starts failing because it falls back to exact search, as the limit of the prefilter is met during graph search

Relaxed the parameters of the test to fix this (making the filter less restrictive, and trying to visit a fewer number of nodes so that approximate search completes without hitting its limit)

Sorry for missing this earlier!
@kaivalnp kaivalnp deleted the fix-scorer branch January 2, 2024 15:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants