Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-10559: Add Prefilter Option to KnnGraphTester #932

Merged
merged 1 commit into from Jul 29, 2022
Merged

LUCENE-10559: Add Prefilter Option to KnnGraphTester #932

merged 1 commit into from Jul 29, 2022

Conversation

kaivalnp
Copy link
Contributor

@kaivalnp kaivalnp commented May 31, 2022

Description

Link to Jira

Solution

Added a prefilter and filterSelectivity argument to KnnGraphTester to be able to compare pre and post-filtering benchmarks

filterSelectivity expresses the selectivity of a filter as proportion of passing docs that are randomly selected. We store these in a FixedBitSet and use this to calculate true KNN as well as in HNSW search

In case of post-filter, we over-select results as topK / filterSelectivity to get final hits close to actual requested topK
For pre-filter, we wrap the FixedBitSet in a query and pass it as prefilter argument to KnnVectorQuery

Edit 1: I should also add that a custom BulkScorer is provided to the wrapped query that directly updates the reference of the BitSetCollector (to prevent arbitrary hit collection time of pre-filter query, so we can focus on HNSW search time)

Edit 2: We have decided to split the possible collection optimization to a new issue. This PR only adds prefilter testing functionality to KnnGraphTester, and depends on KnnVectorQuery for internal implementations

@@ -225,6 +225,11 @@ public BitSetIterator getIterator(int contextOrd) {
return new BitSetIterator(bitSets[contextOrd], cost[contextOrd]);
}

public void setBitSet(BitSet bitSet, int cost) {
bitSets[ord] = bitSet;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So .. this relies on the fact that BitSetCollector.doSetNextReader will have been called prior to this, setting the appropriate ord. This is very clever, but it relies heavily on knowledge of internal implementation details that aren't really part of the published API, and this will tie us to this BitSetCollector implementation. I wonder if we could create a new KnnVectorQuery constructor that would accept a Function<Integer, BitSetIterator>, rather than a Query, that would be responsible for returning a BitSetIterator for a given ord?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we create a new Query type backed by a BitSet? For example we have a BitSetQuery in TestScorerPerf, maybe we could pull this out as a test class or just duplicate it. That way we wouldn't have to modify KnnVectorQuery just for these test purposes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So .. this relies on the fact that BitSetCollector.doSetNextReader will have been called prior to this, setting the appropriate ord. This is very clever, but it relies heavily on knowledge of internal implementation details that aren't really part of the published API, and this will tie us to this BitSetCollector implementation. I wonder if we could create a new KnnVectorQuery constructor that would accept a Function<Integer, BitSetIterator>, rather than a Query, that would be responsible for returning a BitSetIterator for a given ord?

This is interesting and would solve many problems of wrapping the BitSet in a Query (and associated BulkScorer). Given this function, we could add a new constructor to BitSetCollector (to instantiate the internal BitSets from these BitSetIterators instead), and no indexSearcher.search would be required thereafter

My concern is that it creates different search paths and constructors for this function, which will only be used from our test and might complicate the class (Though this addition would be helpful if we want to delegate the task of populating the filtered docs into BitSets outside of KnnVectorQuery)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we create a new Query type backed by a BitSet? For example we have a BitSetQuery in TestScorerPerf, maybe we could pull this out as a test class or just duplicate it. That way we wouldn't have to modify KnnVectorQuery just for these test purposes.

The current SelectiveQuery class is in essence a Query backed by FixedBitSet. The problem arises during hit collection, which happens doc by doc (so we have to iterate over the entire BitSet, and call collect on set bits), which adds a large arbitrary time. To prevent this copying of our BitSet into BitSetCollector's internal one (bit by bit), I have overloaded the BulkScorer to directly update it's underlying reference

An alternative to adding these modifiers could be using the Reflection package to manually update the variables. This way we won't need to change defined classes for tests (but it might make the test somewhat hacky?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently the BitSet copying is a fairly large cost factor relative to the HNSW search. Also, I do think we would like to eventually have a fully supported (not just for tests) mechanism for this so that we can make more efficient use of cached queries.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jtibshirani @msokolov for your input!

I agree that we shouldn't modify core classes for tests, and we can create a separate issue for possible collection optimizations

However I feel that this test is important as well, for deciding which query would be suitable for pre-filtering (by comparing with post-filtering + over selection time) and we should address this change

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, it definitely makes sense to do both changes. @kaivalnp would you like to open a new JIRA issue focused on using cached/ precomputed filters directly? I'm also happy to file the issue and tag you for credit + awareness.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank You! I have opened a JIRA issue, hope it is okay
Please feel free to suggest edits / alternate approaches / provide feedback

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you think of this plan?

  • Spin off a separate issue around removing overhead from copying BitSet when the query is cached or precomputed. Maybe we'll end up with something similar to your change where we access the iterator directly.
  • Either hold off on this PR until that overhead is addressed, or merge it but without a special workaround to prevent copying. To unblock any testing you could fork KnnGraphTester locally or KnnVectorQuery to add a workaround?

Now that the issue for the overhead is addressed, should we look into this PR again?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! Feel free to ping me for a review once it's updated.

Copy link
Member

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few small comments. @msokolov you'd like to take a look too, since you have context on the tests you're running?

private FixedBitSet selectedBits;
private long cost;

@SuppressForbidden(reason = "Uses Math.random()")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is in the tests module, you could use org.apache.lucene.tests.util.LuceneTestCase.random to avoid the suppression.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.. Makes sense

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the test is run from a static context (main method), I'm unable to use LuceneTestCase

We can:

  • Keep using Math.random and suppress warnings
  • Change the code a bit to test using run function instead of main (not depend on instantiating KnnGraphTester)

The second might be better for reproducibility (ability to run using seed)

Any suggestions on this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please go ahead and refactor to make it testable

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a nice follow-up if you're up for it! For this PR, it's no problem with me to just use Math.random().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, fine to follow up separately

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a nice follow-up if you're up for it! For this PR, it's no problem with me to just use Math.random().

Hey, I did some follow-up on this, and here's the sample code

Some key points:

  • Separate indexing and search operations
  • Allow passing index path instead (we can now test HNSW search on any existing index, even those not created by KnnGraphTest. This can be beneficial for testing Knn search time on custom indexes)
  • Reduced redundant arguments (we needed to pass maxConn, beamWidth, dim etc even for searching, which aren't required/should be inferred from index automatically)
  • Ability to pass a seed (for reproducibility), maxSegments (if one wants a single segment index; this was enforced earlier, can be optional), knnField (while searching in custom indexes), cache (to save on brute force search time using precomputed results)
  • Considered corner cases where if selectivity is high, topK results may not be found (current implementation requires topK results as far as I know)

Indexing Params:

-Doperation=index
-Ddocs=(path of vec file containing docs)
-Ddim=(dimension of doc vectors)
-DnumDocs=(number of vectors)
-Dindex=(index path to be created)
-DknnField=(knn field name in index; optional, defaults to `knn`)
-DmaxConn=(`maxConn` used for indexing)
-DbeamWidth=(`beamWidth` used for indexing)
-DmaxSegments=(max segments desired; optional, defaults to no merges)

Search Params:

-Doperation=search
-Dindex=(path of index; `dim` will be inferred)
-DknnField=knn field name in index; optional, defaults to `knn`)
-Dqueries=(path of vec file containing queries)
-DnumQueries=(number of queries to run)
-Dcache=(path to cache; read from cache if found, else compute and write new)
-DtopK=(desired `topK`)
-DfilterSelectivity=(selectivity of filter; optional, defaults to 1)
-Dseed=(seed; optional)

Yet to be added:

  • Index info operation (some segment info?)
  • pre/post filter (currently pre-filters by default, add option for over-selection + post-filter)
  • some output formatting (only basic details now)

Some considerations:

  • Not extended LuceneTestCase as being a unit test, it can only access limited resources and write to temporary files. However, incorporated a seed argument for reproducibility
  • Shifted to JVM arguments for cleaner code (directly access property, no boilerplate required)

Please let me know if this is in the right direction.. (or if I missed something)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtibshirani @msokolov Any suggestions/thoughts on this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Kaival - would you mind opening a new issue or maybe just a PR where we can discuss? I went looking for this comment and had a little trouble figuring out where it was, I guess because this PR is closed it no longer shows up on the list of open PRs, duh.

As for the specifics of the proposal, your ideas seem good, but (1) I am trying to push another PR for handling low-precision vector quantization that will make a number of changes to the existing KnnGraphTester -sorry! but can you think about also incorporating those changes? and (2) maybe a smaller refactor exposing a function that accepts all the needed arguments (that can be called from a main that gets its args somehow without changing any of the other implementation details would be a good first step? And then we could have an actual unit test so that we can ensure that other changes are safe. And then we could start restructuring the internals to make it better? We could also change command-line parsing if it seems better to do it some other way, but can we do these as separate PRs please?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will do! This is the link

@kaivalnp
Copy link
Contributor Author

Sorry for the delay!

Copy link
Member

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me! I just left some tiny comments. Before we merge, could you add an entry to CHANGES.txt (in the 9.4 section under 'Other')?

@jtibshirani jtibshirani merged commit 1ad28a3 into apache:main Jul 29, 2022
jtibshirani pushed a commit that referenced this pull request Jul 29, 2022
Added a `prefilter` and `filterSelectivity` argument to KnnGraphTester to be
able to compare pre and post-filtering benchmarks.

`filterSelectivity` expresses the selectivity of a filter as proportion of
passing docs that are randomly selected. We store these in a FixedBitSet and
use this to calculate true KNN as well as in HNSW search.

In case of post-filter, we over-select results as `topK / filterSelectivity` to
get final hits close to actual requested `topK`. For pre-filter, we wrap the
FixedBitSet in a query and pass it as prefilter argument to KnnVectorQuery.
@kaivalnp kaivalnp deleted the graph_tester branch July 31, 2022 11:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants