-
Notifications
You must be signed in to change notification settings - Fork 985
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix failing BaseVectorSimilarityQueryTestCase#testApproximate #12922
Conversation
// Advance the scorer | ||
if (!scorer.advanceExact(doc)) { | ||
return false; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This tells me that none of the filtered cases drop to using exact search? If they did they would match on docs that were actually filtered out.
Could you add a test exercising this path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This tells me that none of the filtered cases drop to using exact search?
We do have a test case that covers this: testRandomFilter
randomly filters a sub-range of documents and expects all of them to be found -- which always falls back to exact search (because a traversalSimilarity
of Float.NEGATIVE_INFINITY
here will visit the entire graph and exhaust any limit)
While running this test, I put debug points here to check the scorer
, and it was always unpositioned (docid
= -1) but doesn't throw an error in many cases
The error is thrown when the underlying values
are randomly chosen to SimpleTextFloatVectorValues
- try ./gradlew test --tests TestFloatVectorSimilarityQuery.testRandomFilter -Dtests.seed=119135B1F0803918
for a failing case (as mentioned here, thanks @epotyom!)
In other cases (for example when values
are DenseOffHeapVectorValues
), the scorer does not fail even when unpositioned and returns some (garbage) value - but the count of results is still correct (and asserted)
I wonder if we can put an assert values.docID() != -1
before this and this line to ensure it is positioned before trying to compute scores? -- I tried this offline, and the issue was caught immediately
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a commit to demonstrate this^, please let me know if we can improve this / do it some other way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adding an assert is nice :). In reality it would throw trying to read garbage.
Discovered in #12921, and introduced in #12679 The first issue is that we weren't advancing the `VectorScorer` [here](https://github.com/apache/lucene/blob/cf13a9295052288b748ed8f279f05ee26f3bfd5f/lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java#L257-L262) -- so it was still un-positioned while trying to compute the similarity score Earlier in the PR, the underlying delegate of the `FilteredDocIdSetIterator` was `scorer.iterator()` (see [here](https://github.com/apache/lucene/blob/cad565439be512ac6e95a698007b1fc971173f00/lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java#L107)) -- so we didn't need to explicitly advance it Later, we decided to maintain parity to `AbstractKnnVectorQuery` and introduce filtering in `AbstractVectorSimilarityQuery` (see [this commit](5096790)) to determine the `visitLimit` of approximate search -- after which the underlying iterator changed to the accepted docs (see [here](https://github.com/apache/lucene/blob/5096790f281e477c529a7c8311aeb353ccdffdeb/lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java#L255)) and I missed advancing the `VectorScorer` explicitly.. After doing so, we no longer get the original `java.lang.ArrayIndexOutOfBoundsException` -- but the `BaseVectorSimilarityQueryTestCase#testApproximate` starts failing because it falls back to exact search, as the limit of the prefilter is met during graph search Relaxed the parameters of the test to fix this (making the filter less restrictive, and trying to visit a fewer number of nodes so that approximate search completes without hitting its limit) Sorry for missing this earlier!
Discovered in #12921, and introduced in #12679
The first issue is that we weren't advancing the
VectorScorer
here -- so it was still un-positioned while trying to compute the similarity scoreEarlier in the PR, the underlying delegate of the
FilteredDocIdSetIterator
wasscorer.iterator()
(see here) -- so we didn't need to explicitly advance itLater, we decided to maintain parity to
AbstractKnnVectorQuery
and introduce filtering inAbstractVectorSimilarityQuery
(see this commit) to determine thevisitLimit
of approximate search -- after which the underlying iterator changed to the accepted docs (see here) and I missed advancing theVectorScorer
explicitly..After doing so, we no longer get the original
java.lang.ArrayIndexOutOfBoundsException
-- but theBaseVectorSimilarityQueryTestCase#testApproximate
starts failing because it falls back to exact search, as the limit of the prefilter is met during graph searchRelaxed the parameters of the test to fix this (making the filter less restrictive, and trying to visit a fewer number of nodes so that approximate search completes without hitting its limit)
Sorry for missing this earlier!