LUCENE-10559: Add Prefilter Option to KnnGraphTester #932

kaivalnp · 2022-05-31T06:39:59Z

Description

Link to Jira

Solution

Added `prefilter` and `filterSelectivity` arguments to KnnGraphTester so that pre-filtering and post-filtering benchmarks can be compared

filterSelectivity expresses the selectivity of a filter as the proportion of docs that pass it; the passing docs are selected at random. We store them in a FixedBitSet, which is used both to calculate the true KNN and in the HNSW search

For post-filtering, we over-select results as topK / filterSelectivity so that the number of final hits is close to the requested topK
For pre-filtering, we wrap the FixedBitSet in a query and pass it as the prefilter argument to KnnVectorQuery
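As a rough illustration of the two strategies described above (this is not the code in the PR), the sketch below runs an exact top-k over an array of scores. It uses `java.util.BitSet` as a stand-in for Lucene's FixedBitSet, and the class and method names are hypothetical. Rounding the over-selected count up is also an assumption; the description only says `topK / filterSelectivity`.

```java
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical sketch: java.util.BitSet stands in for Lucene's FixedBitSet.
final class FilterStrategySketch {

  /** Hits to request before post-filtering: topK / filterSelectivity (rounded up, an assumption). */
  static int overSelectedTopK(int topK, double filterSelectivity) {
    return (int) Math.ceil(topK / filterSelectivity);
  }

  /** Exact top-k doc ids by descending score; a null filter means "accept every doc". */
  static List<Integer> topK(float[] scores, int k, BitSet filter) {
    return IntStream.range(0, scores.length)
        .filter(doc -> filter == null || filter.get(doc))
        .boxed()
        .sorted(Comparator.comparingDouble((Integer doc) -> -scores[doc]))
        .limit(k)
        .collect(Collectors.toList());
  }

  /** Pre-filter: only docs that pass the filter are candidates during the search. */
  static List<Integer> preFilter(float[] scores, int topK, BitSet filter) {
    return topK(scores, topK, filter);
  }

  /** Post-filter: search unfiltered with an inflated k, then drop docs that fail the filter. */
  static List<Integer> postFilter(float[] scores, int topK, BitSet filter, double selectivity) {
    return topK(scores, overSelectedTopK(topK, selectivity), null).stream()
        .filter(filter::get)
        .limit(topK)
        .collect(Collectors.toList());
  }
}
```

With an exact search both paths return the same hits whenever enough passing docs survive the over-selection; with an approximate HNSW search the post-filter path may return fewer than topK hits, which is why the over-selection factor matters.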

Edit 1: I should also add that a custom BulkScorer is provided to the wrapped query. It directly updates the BitSet reference held by the BitSetCollector, so that hit collection for the pre-filter query does not add an arbitrary amount of time and we can focus on HNSW search time

Edit 2: We have decided to split the possible collection optimization to a new issue. This PR only adds prefilter testing functionality to KnnGraphTester, and depends on KnnVectorQuery for internal implementations

lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java

msokolov · 2022-05-31T14:12:53Z

lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java

@@ -225,6 +225,11 @@ public BitSetIterator getIterator(int contextOrd) {
      return new BitSetIterator(bitSets[contextOrd], cost[contextOrd]);
    }

+    public void setBitSet(BitSet bitSet, int cost) {
+      bitSets[ord] = bitSet;


So .. this relies on the fact that BitSetCollector.doSetNextReader will have been called prior to this, setting the appropriate ord. This is very clever, but it relies heavily on knowledge of internal implementation details that aren't really part of the published API, and this will tie us to this BitSetCollector implementation. I wonder if we could create a new KnnVectorQuery constructor that would accept a Function<Integer, BitSetIterator>, rather than a Query, that would be responsible for returning a BitSetIterator for a given ord?

Could we create a new Query type backed by a BitSet? For example we have a BitSetQuery in TestScorerPerf, maybe we could pull this out as a test class or just duplicate it. That way we wouldn't have to modify KnnVectorQuery just for these test purposes.

> So .. this relies on the fact that BitSetCollector.doSetNextReader will have been called prior to this, setting the appropriate ord. This is very clever, but it relies heavily on knowledge of internal implementation details that aren't really part of the published API, and this will tie us to this BitSetCollector implementation. I wonder if we could create a new KnnVectorQuery constructor that would accept a Function<Integer, BitSetIterator>, rather than a Query, that would be responsible for returning a BitSetIterator for a given ord?

This is interesting and would solve many problems of wrapping the BitSet in a Query (and associated BulkScorer). Given this function, we could add a new constructor to BitSetCollector (to instantiate the internal BitSets from these BitSetIterators instead), and no indexSearcher.search would be required thereafter

My concern is that it creates different search paths and constructors for this function, which will only be used from our test and might complicate the class (Though this addition would be helpful if we want to delegate the task of populating the filtered docs into BitSets outside of KnnVectorQuery)
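To make the shape of that suggestion concrete, here is a minimal sketch of a filter supplied as a per-segment function rather than as a Query. Everything in it is hypothetical: it is not an existing KnnVectorQuery constructor, `java.util.BitSet` stands in for Lucene's BitSet/BitSetIterator, and `IntFunction<BitSet>` stands in for the `Function<Integer, BitSetIterator>` mentioned above.

```java
import java.util.BitSet;
import java.util.function.IntFunction;

// Hypothetical sketch only: no such constructor exists on KnnVectorQuery.
final class PerSegmentFilterSketch {
  private final IntFunction<BitSet> filterForSegment;

  /** The caller supplies the accepted docs for each segment ord up front. */
  PerSegmentFilterSketch(IntFunction<BitSet> filterForSegment) {
    this.filterForSegment = filterForSegment;
  }

  /** Whether a segment-local doc id passes the filter; no searcher round trip is needed. */
  boolean accepts(int segmentOrd, int doc) {
    return filterForSegment.apply(segmentOrd).get(doc);
  }

  /** Number of accepted docs in a segment, usable as the iterator cost. */
  int cost(int segmentOrd) {
    return filterForSegment.apply(segmentOrd).cardinality();
  }
}
```

The trade-off raised above still applies: a second entry point like this would add a separate search path that, for now, only the test would exercise.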

> Could we create a new Query type backed by a BitSet? For example we have a BitSetQuery in TestScorerPerf, maybe we could pull this out as a test class or just duplicate it. That way we wouldn't have to modify KnnVectorQuery just for these test purposes.

The current SelectiveQuery class is in essence a Query backed by a FixedBitSet. The problem arises during hit collection, which happens doc by doc: we have to iterate over the entire BitSet and call collect on every set bit, which adds a large and arbitrary amount of time. To avoid copying our BitSet into BitSetCollector's internal one bit by bit, I have overridden the BulkScorer to directly update its underlying reference
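The difference between the two collection paths can be sketched in isolation. This is only an illustration with `java.util.BitSet` and hypothetical method names, not the BulkScorer or BitSetCollector code from the PR: the first method does work proportional to the number of passing docs, the second just hands over the existing reference.

```java
import java.util.BitSet;

// Hypothetical sketch of the two collection paths; not Lucene code.
final class CollectionCostSketch {

  /** Doc-by-doc collection: visits every set bit and copies it into a fresh BitSet. */
  static BitSet collectDocByDoc(BitSet source) {
    BitSet target = new BitSet(source.length());
    for (int doc = source.nextSetBit(0); doc >= 0; doc = source.nextSetBit(doc + 1)) {
      target.set(doc); // stands in for one collect(doc) call per passing doc
    }
    return target;
  }

  /** Reference handoff: constant time, the consumer reuses the precomputed BitSet. */
  static BitSet shareReference(BitSet source) {
    return source;
  }
}
```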

An alternative to adding these modifiers could be using the Reflection package to manually update the variables. This way we won't need to change defined classes for tests (but it might make the test somewhat hacky?)

Apparently the BitSet copying is a fairly large cost factor relative to the HNSW search. Also, I do think we would like to eventually have a fully supported (not just for tests) mechanism for this so that we can make more efficient use of cached queries.

Thank you @jtibshirani @msokolov for your input!

I agree that we shouldn't modify core classes for tests, and we can create a separate issue for possible collection optimizations

However I feel that this test is important as well, for deciding which query would be suitable for pre-filtering (by comparing with post-filtering + over selection time) and we should address this change

Sounds good, it definitely makes sense to do both changes. @kaivalnp would you like to open a new JIRA issue focused on using cached/precomputed filters directly? I'm also happy to file the issue and tag you for credit + awareness.

Thank You! I have opened a JIRA issue, hope it is okay
Please feel free to suggest edits / alternate approaches / provide feedback

What would you think of this plan?

1. Spin off a separate issue around removing overhead from copying BitSet when the query is cached or precomputed. Maybe we'll end up with something similar to your change where we access the iterator directly.

2. Either hold off on this PR until that overhead is addressed, or merge it but without a special workaround to prevent copying. To unblock any testing you could fork KnnGraphTester or KnnVectorQuery locally to add a workaround?

Now that the issue for the overhead is addressed, should we look into this PR again?

Sounds good! Feel free to ping me for a review once it's updated.

jtibshirani

I left a few small comments. @msokolov would you like to take a look too, since you have context on the tests you're running?

lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java

jtibshirani · 2022-07-01T07:26:45Z

lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java

+    private FixedBitSet selectedBits;
+    private long cost;
+
+    @SuppressForbidden(reason = "Uses Math.random()")


Since this is in the tests module, you could use org.apache.lucene.tests.util.LuceneTestCase.random to avoid the suppression.

Sure.. Makes sense

Since the test is run from a static context (main method), I'm unable to use LuceneTestCase

We can:

1. Keep using Math.random and suppress warnings

2. Change the code a bit to test using a run function instead of main (so as not to depend on instantiating KnnGraphTester)

The second might be better for reproducibility (ability to run using seed)

Any suggestions on this?
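For context on the reproducibility point, this is roughly what a seeded selection could look like. It is a sketch under assumptions: `java.util.BitSet` replaces FixedBitSet, the helper name is made up, and a plain `java.util.Random` is used instead of either Math.random() or LuceneTestCase.random().

```java
import java.util.BitSet;
import java.util.Random;

// Hypothetical helper: picks passing docs at random, driven by a caller-supplied Random.
final class SeededFilterSketch {

  /** Each doc passes independently with probability `selectivity`. */
  static BitSet randomFilter(int numDocs, double selectivity, Random random) {
    BitSet passing = new BitSet(numDocs);
    for (int doc = 0; doc < numDocs; doc++) {
      if (random.nextDouble() < selectivity) {
        passing.set(doc);
      }
    }
    return passing;
  }
}
```

Two runs that construct `new Random(seed)` with the same seed select the same docs, which Math.random() cannot guarantee.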

Please go ahead and refactor to make it testable

This seems like a nice follow-up if you're up for it! For this PR, it's no problem with me to just use Math.random().

yes, fine to follow up separately

> This seems like a nice follow-up if you're up for it! For this PR, it's no problem with me to just use Math.random().

Hey, I did some follow-up on this, and here's the sample code

Some key points:

Separate indexing and search operations

Allow passing an index path instead (we can now test HNSW search on any existing index, even one not created by KnnGraphTester. This can be beneficial for testing KNN search time on custom indexes)

Reduced redundant arguments (we needed to pass maxConn, beamWidth, dim etc even for searching, which aren't required/should be inferred from index automatically)

Ability to pass a seed (for reproducibility), maxSegments (if one wants a single segment index; this was enforced earlier, can be optional), knnField (while searching in custom indexes), cache (to save on brute force search time using precomputed results)

Considered corner cases where the filter is so restrictive that topK results may not be found (the current implementation requires topK results as far as I know)

Indexing Params:

-Doperation=index -Ddocs=(path of vec file containing docs) -Ddim=(dimension of doc vectors) -DnumDocs=(number of vectors) -Dindex=(index path to be created) -DknnField=(knn field name in index; optional, defaults to `knn`) -DmaxConn=(`maxConn` used for indexing) -DbeamWidth=(`beamWidth` used for indexing) -DmaxSegments=(max segments desired; optional, defaults to no merges)

Search Params:

-Doperation=search -Dindex=(path of index; `dim` will be inferred) -DknnField=(knn field name in index; optional, defaults to `knn`) -Dqueries=(path of vec file containing queries) -DnumQueries=(number of queries to run) -Dcache=(path to cache; read from cache if found, else compute and write new) -DtopK=(desired `topK`) -DfilterSelectivity=(selectivity of filter; optional, defaults to 1) -Dseed=(seed; optional)

Yet to be added:

Index info operation (some segment info?)

pre/post filter (currently pre-filters by default, add option for over-selection + post-filter)

some output formatting (only basic details now)

Some considerations:

Did not extend LuceneTestCase: as a unit test, it could only access limited resources and write to temporary files. However, I incorporated a seed argument for reproducibility

Shifted to JVM arguments for cleaner code (directly access property, no boilerplate required)
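As an illustration of reading the search parameters from JVM properties, here is a small sketch. The property names and the defaults for `knnField` and `filterSelectivity` follow the list above; the class name, the choice of which properties are mandatory, and the error handling are assumptions and are not taken from the sample code.

```java
// Hypothetical holder for a few of the search properties listed above.
final class SearchArgsSketch {
  final String operation;
  final String knnField;
  final int topK;
  final double filterSelectivity;
  final Long seed; // null when -Dseed was not supplied

  private SearchArgsSketch(
      String operation, String knnField, int topK, double filterSelectivity, Long seed) {
    this.operation = operation;
    this.knnField = knnField;
    this.topK = topK;
    this.filterSelectivity = filterSelectivity;
    this.seed = seed;
  }

  /** Reads -D properties directly; treating operation and topK as mandatory is an assumption. */
  static SearchArgsSketch fromSystemProperties() {
    String operation = System.getProperty("operation");
    if (operation == null) {
      throw new IllegalArgumentException("-Doperation is required");
    }
    Integer topK = Integer.getInteger("topK");
    if (topK == null) {
      throw new IllegalArgumentException("-DtopK is required and must be an integer");
    }
    String knnField = System.getProperty("knnField", "knn");
    double selectivity = Double.parseDouble(System.getProperty("filterSelectivity", "1"));
    Long seed = Long.getLong("seed");
    return new SearchArgsSketch(operation, knnField, topK, selectivity, seed);
  }
}
```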

Please let me know if this is in the right direction.. (or if I missed something)

@jtibshirani @msokolov Any suggestions/thoughts on this?

Hi Kaival - would you mind opening a new issue or maybe just a PR where we can discuss? I went looking for this comment and had a little trouble figuring out where it was, I guess because this PR is closed it no longer shows up on the list of open PRs, duh.

As for the specifics of the proposal, your ideas seem good, but (1) I am trying to push another PR for handling low-precision vector quantization that will make a number of changes to the existing KnnGraphTester (sorry!), so can you think about also incorporating those changes? And (2) maybe a smaller refactor exposing a function that accepts all the needed arguments (one that can be called from a main that gets its args somehow), without changing any of the other implementation details, would be a good first step? Then we could have an actual unit test so that we can ensure that other changes are safe, and after that we could start restructuring the internals to make it better. We could also change command-line parsing if it seems better to do it some other way, but can we do these as separate PRs please?

Sure, will do! This is the link

kaivalnp · 2022-07-28T07:24:44Z

Sorry for the delay!

jtibshirani

This looks good to me! I just left some tiny comments. Before we merge, could you add an entry to CHANGES.txt (in the 9.4 section under 'Other')?

lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java


mocobeta reviewed May 31, 2022


msokolov reviewed May 31, 2022


kaivalnp mentioned this pull request Jun 23, 2022

LUCENE-10606: Optimize Prefilter Hit Collection #951

Merged

jtibshirani reviewed Jul 1, 2022


jtibshirani approved these changes Jul 28, 2022


LUCENE-10559: Add Prefilter Option to KnnGraphTester

e7d9d89

jtibshirani approved these changes Jul 29, 2022


jtibshirani merged commit 1ad28a3 into apache:main Jul 29, 2022

kaivalnp deleted the graph_tester branch July 31, 2022 11:54

asfimport mentioned this pull request Aug 7, 2022

Add preFilter/postFilter options to KnnGraphTester [LUCENE-10559] #11595

Closed


kaivalnp May 31, 2022 (edited)

msokolov May 31, 2022

jtibshirani Jun 2, 2022

kaivalnp Jun 2, 2022

kaivalnp Jun 2, 2022

msokolov Jun 2, 2022

kaivalnp Jun 8, 2022

jtibshirani Jun 8, 2022

kaivalnp Jun 9, 2022

kaivalnp Jun 28, 2022

jtibshirani Jun 28, 2022

jtibshirani left a comment

jtibshirani Jul 1, 2022

kaivalnp Jul 1, 2022

kaivalnp Jul 28, 2022

msokolov Jul 28, 2022

jtibshirani Jul 28, 2022

msokolov Jul 29, 2022

kaivalnp Aug 2, 2022

kaivalnp Aug 6, 2022

msokolov Aug 7, 2022

kaivalnp Aug 7, 2022
