New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LUCENE-10559: Add Prefilter Option to KnnGraphTester #932
Conversation
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java
Outdated
Show resolved
Hide resolved
@@ -225,6 +225,11 @@ public BitSetIterator getIterator(int contextOrd) { | |||
return new BitSetIterator(bitSets[contextOrd], cost[contextOrd]); | |||
} | |||
|
|||
public void setBitSet(BitSet bitSet, int cost) { | |||
bitSets[ord] = bitSet; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So .. this relies on the fact that BitSetCollector.doSetNextReader
will have been called prior to this, setting the appropriate ord
. This is very clever, but it relies heavily on knowledge of internal implementation details that aren't really part of the published API, and this will tie us to this BitSetCollector implementation. I wonder if we could create a new KnnVectorQuery constructor that would accept a Function<Integer, BitSetIterator>, rather than a Query, that would be responsible for returning a BitSetIterator for a given ord?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we create a new Query
type backed by a BitSet
? For example we have a BitSetQuery
in TestScorerPerf
, maybe we could pull this out as a test class or just duplicate it. That way we wouldn't have to modify KnnVectorQuery
just for these test purposes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So .. this relies on the fact that
BitSetCollector.doSetNextReader
will have been called prior to this, setting the appropriateord
. This is very clever, but it relies heavily on knowledge of internal implementation details that aren't really part of the published API, and this will tie us to this BitSetCollector implementation. I wonder if we could create a new KnnVectorQuery constructor that would accept a Function<Integer, BitSetIterator>, rather than a Query, that would be responsible for returning a BitSetIterator for a given ord?
This is interesting and would solve many problems of wrapping the BitSet in a Query (and associated BulkScorer). Given this function, we could add a new constructor to BitSetCollector (to instantiate the internal BitSets from these BitSetIterators instead), and no indexSearcher.search would be required thereafter
My concern is that it creates different search paths and constructors for this function, which will only be used from our test and might complicate the class (Though this addition would be helpful if we want to delegate the task of populating the filtered docs into BitSets outside of KnnVectorQuery)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we create a new
Query
type backed by aBitSet
? For example we have aBitSetQuery
inTestScorerPerf
, maybe we could pull this out as a test class or just duplicate it. That way we wouldn't have to modifyKnnVectorQuery
just for these test purposes.
The current SelectiveQuery class is in essence a Query backed by FixedBitSet. The problem arises during hit collection, which happens doc by doc (so we have to iterate over the entire BitSet, and call collect on set bits), which adds a large arbitrary time. To prevent this copying of our BitSet into BitSetCollector's internal one (bit by bit), I have overloaded the BulkScorer to directly update it's underlying reference
An alternative to adding these modifiers could be using the Reflection package to manually update the variables. This way we won't need to change defined classes for tests (but it might make the test somewhat hacky?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apparently the BitSet copying is a fairly large cost factor relative to the HNSW search. Also, I do think we would like to eventually have a fully supported (not just for tests) mechanism for this so that we can make more efficient use of cached queries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @jtibshirani @msokolov for your input!
I agree that we shouldn't modify core classes for tests, and we can create a separate issue for possible collection optimizations
However I feel that this test is important as well, for deciding which query would be suitable for pre-filtering (by comparing with post-filtering + over selection time) and we should address this change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, it definitely makes sense to do both changes. @kaivalnp would you like to open a new JIRA issue focused on using cached/ precomputed filters directly? I'm also happy to file the issue and tag you for credit + awareness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank You! I have opened a JIRA issue, hope it is okay
Please feel free to suggest edits / alternate approaches / provide feedback
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would you think of this plan?
- Spin off a separate issue around removing overhead from copying
BitSet
when the query is cached or precomputed. Maybe we'll end up with something similar to your change where we access the iterator directly.- Either hold off on this PR until that overhead is addressed, or merge it but without a special workaround to prevent copying. To unblock any testing you could fork
KnnGraphTester
locally orKnnVectorQuery
to add a workaround?
Now that the issue for the overhead is addressed, should we look into this PR again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good! Feel free to ping me for a review once it's updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few small comments. @msokolov you'd like to take a look too, since you have context on the tests you're running?
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java
Outdated
Show resolved
Hide resolved
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java
Outdated
Show resolved
Hide resolved
private FixedBitSet selectedBits; | ||
private long cost; | ||
|
||
@SuppressForbidden(reason = "Uses Math.random()") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is in the tests module, you could use org.apache.lucene.tests.util.LuceneTestCase.random
to avoid the suppression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.. Makes sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the test is run from a static context (main method), I'm unable to use LuceneTestCase
We can:
- Keep using Math.random and suppress warnings
- Change the code a bit to test using
run
function instead ofmain
(not depend on instantiatingKnnGraphTester
)
The second might be better for reproducibility (ability to run using seed)
Any suggestions on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please go ahead and refactor to make it testable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a nice follow-up if you're up for it! For this PR, it's no problem with me to just use Math.random()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, fine to follow up separately
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a nice follow-up if you're up for it! For this PR, it's no problem with me to just use
Math.random()
.
Hey, I did some follow-up on this, and here's the sample code
Some key points:
- Separate indexing and search operations
- Allow passing index path instead (we can now test HNSW search on any existing index, even those not created by
KnnGraphTest
. This can be beneficial for testing Knn search time on custom indexes) - Reduced redundant arguments (we needed to pass
maxConn
,beamWidth
,dim
etc even for searching, which aren't required/should be inferred from index automatically) - Ability to pass a
seed
(for reproducibility),maxSegments
(if one wants a single segment index; this was enforced earlier, can be optional),knnField
(while searching in custom indexes),cache
(to save on brute force search time using precomputed results) - Considered corner cases where if
selectivity
is high,topK
results may not be found (current implementation requires topK results as far as I know)
Indexing Params:
-Doperation=index
-Ddocs=(path of vec file containing docs)
-Ddim=(dimension of doc vectors)
-DnumDocs=(number of vectors)
-Dindex=(index path to be created)
-DknnField=(knn field name in index; optional, defaults to `knn`)
-DmaxConn=(`maxConn` used for indexing)
-DbeamWidth=(`beamWidth` used for indexing)
-DmaxSegments=(max segments desired; optional, defaults to no merges)
Search Params:
-Doperation=search
-Dindex=(path of index; `dim` will be inferred)
-DknnField=knn field name in index; optional, defaults to `knn`)
-Dqueries=(path of vec file containing queries)
-DnumQueries=(number of queries to run)
-Dcache=(path to cache; read from cache if found, else compute and write new)
-DtopK=(desired `topK`)
-DfilterSelectivity=(selectivity of filter; optional, defaults to 1)
-Dseed=(seed; optional)
Yet to be added:
- Index
info
operation (some segment info?) - pre/post filter (currently pre-filters by default, add option for over-selection + post-filter)
- some output formatting (only basic details now)
Some considerations:
- Not extended
LuceneTestCase
as being a unit test, it can only access limited resources and write to temporary files. However, incorporated aseed
argument for reproducibility - Shifted to JVM arguments for cleaner code (directly access property, no boilerplate required)
Please let me know if this is in the right direction.. (or if I missed something)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jtibshirani @msokolov Any suggestions/thoughts on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Kaival - would you mind opening a new issue or maybe just a PR where we can discuss? I went looking for this comment and had a little trouble figuring out where it was, I guess because this PR is closed it no longer shows up on the list of open PRs, duh.
As for the specifics of the proposal, your ideas seem good, but (1) I am trying to push another PR for handling low-precision vector quantization that will make a number of changes to the existing KnnGraphTester -sorry! but can you think about also incorporating those changes? and (2) maybe a smaller refactor exposing a function that accepts all the needed arguments (that can be called from a main that gets its args somehow without changing any of the other implementation details would be a good first step? And then we could have an actual unit test so that we can ensure that other changes are safe. And then we could start restructuring the internals to make it better? We could also change command-line parsing if it seems better to do it some other way, but can we do these as separate PRs please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will do! This is the link
Sorry for the delay! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me! I just left some tiny comments. Before we merge, could you add an entry to CHANGES.txt
(in the 9.4 section under 'Other')?
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java
Outdated
Show resolved
Hide resolved
Added a `prefilter` and `filterSelectivity` argument to KnnGraphTester to be able to compare pre and post-filtering benchmarks. `filterSelectivity` expresses the selectivity of a filter as proportion of passing docs that are randomly selected. We store these in a FixedBitSet and use this to calculate true KNN as well as in HNSW search. In case of post-filter, we over-select results as `topK / filterSelectivity` to get final hits close to actual requested `topK`. For pre-filter, we wrap the FixedBitSet in a query and pass it as prefilter argument to KnnVectorQuery.
Description
Link to Jira
Solution
Added a
prefilter
andfilterSelectivity
argument to KnnGraphTester to be able to compare pre and post-filtering benchmarksfilterSelectivity
expresses the selectivity of a filter as proportion of passing docs that are randomly selected. We store these in a FixedBitSet and use this to calculate true KNN as well as in HNSW searchIn case of post-filter, we over-select results as
topK / filterSelectivity
to get final hits close to actual requestedtopK
For pre-filter, we wrap the FixedBitSet in a query and pass it as prefilter argument to KnnVectorQuery
Edit 1: I should also add that a custom BulkScorer is provided to the wrapped query that directly updates the reference of the BitSetCollector (to prevent arbitrary hit collection time of pre-filter query, so we can focus on HNSW search time)
Edit 2: We have decided to split the possible collection optimization to a new issue. This PR only adds prefilter testing functionality to KnnGraphTester, and depends on KnnVectorQuery for internal implementations