Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Benchmark KNN search with ann-benchmarks [LUCENE-9625] #10665

Open
Tracked by #84324
asfimport opened this issue Nov 30, 2020 · 18 comments
Open
Tracked by #84324

Benchmark KNN search with ann-benchmarks [LUCENE-9625] #10665

asfimport opened this issue Nov 30, 2020 · 18 comments

Comments

@asfimport
Copy link

In addition to benchmarking with luceneutil, it would be good to be able to make use of ann-benchmarks, which is publishing results from many approximate knn algorithms, including the hnsw implementation from its authors. We don't expect to challenge the performance of these native code libraries, however it would be good to know just how far off we are.

I started looking into this and posted a fork of ann-benchmarks that uses KnnGraphTester class to run these: https://github.com/msokolov/ann-benchmarks. It's still a WIP; you have to manually copy jars and the KnnGraphTester.class to the test host machine rather than downloading from a distribution. KnnGraphTester needs some modifications in order to support this process - this issue is mostly about that.

One thing I noticed is that some of the index builds with higher fanout (efConstruction) settings time out at 2h (on an AWS c5 instance), so this is concerning and I'll open a separate issue for trying to improve that.


Migrated from LUCENE-9625 by Michael Sokolov (@msokolov), updated May 20 2022
Pull requests: apache/lucene-solr#2157

@asfimport
Copy link
Author

Michael Sokolov (@msokolov) (migrated from JIRA)

Hm I used this issue number incorrectly for the attached PR; somehow I got it in my head this issue was about implementing the HNSW diversity, but now I see it is not, and that issue doesn't exist! So I'll open it and see if I can transfer the PR over there, or maybe re-open it if need be

@asfimport
Copy link
Author

asfimport commented May 27, 2021

Julie Tibshirani (@jtibshirani) (migrated from JIRA)

I somehow completely missed this before posting my own ann-benchmarks results to #10976 and #10980! It's great your setup avoids py4j, it looks easier to maintain than the one I posted. Is the plan to keep this open until Lucene 9 is released and we can integrate it into ann-benchmarks repo?

@asfimport
Copy link
Author

Michael Sokolov (@msokolov) (migrated from JIRA)

Yes, once we have a publicly-available release I think we can modify the scripts in that patch to download it instead of expecting it to be present locally already. 

@asfimport
Copy link
Author

Balmukund Mandal (migrated from JIRA)

@msokolov 

As you've mentioned that jars should be copied but not very clear that which all Lucene jars we've to copy?

@asfimport
Copy link
Author

Michael Sokolov (@msokolov) (migrated from JIRA)

> As you've mentioned that jars should be copied but not very clear that which all Lucene jars we've to copy?

I think lucene-core should be enough. What have you tried?

@asfimport
Copy link
Author

Balmukund Mandal (migrated from JIRA)

Thank you Michael for your response. Even, i tried the same but getting below error. Also, i copied jar and KnnGraphTester.class files into lib and classes folder which are created inside algorithms folder.

Error: Unable to initialize main class org.apache.lucene.util.hnsw.KnnGraphTester
Caused by: java.lang.NoClassDefFoundError: org/apache/lucene/util/hnsw/KnnGraphTester$1

@asfimport
Copy link
Author

Michael Sokolov (@msokolov) (migrated from JIRA)

So, clearly it's not finding KnnGraphTester. Probably it needs to go in the proper folder structure (org/apache/lucene/util/KnnGraphTester.class) relative to the classpath root. TBH I don't remember the way files are set up in this ann-benchmarks environment, but I hope that helps you figure it out!

@asfimport
Copy link
Author

Balmukund Mandal (migrated from JIRA)

Thank you Michael, it works. I think its easy to include only two jars i.e. core and test  rather including KnnGraphTester.class which requires proper directories . Hence, little modification in luceneknn.py file and read all from lib/* only.

@asfimport
Copy link
Author

Balmukund Mandal (migrated from JIRA)

I was trying to run the benchmark and has a couple of questions. Indexing takes a long time, so is there a way to configure the benchmark to use an already existing index for search? Also, is there a way to configure the benchmark to use multiple threads for indexing (looks to me that it’s a single-threaded indexing)?

@asfimport
Copy link
Author

Michael Sokolov (@msokolov) (migrated from JIRA)

There's no support for using an existing index; creating the index is an important part of the benchmark, I think? As for threading, no, it would be necessary to modify the test harness. But maybe you should consider contributing to ann-benchmarks?

@asfimport
Copy link
Author

Balmukund Mandal (migrated from JIRA)

Thank you Michael

@asfimport
Copy link
Author

Michael Sokolov (@msokolov) (migrated from JIRA)

I realized maybe this deserves a better explanation: I didn't use multi-threading in the KnnGraphTester that builds the index since my goal at the time was really to evaluate whether our algorithm implementation is correct and how it performs on a single HNSW graph index. If we use multiple threads, this is going to lead to a more fragmented graph due to the way Lucene indexes segments, and while this would be a useful point of comparison, it also creates a different variable to tune in the benchmark evaluation. If you do want to pursue this, I would suggest configuring the IndexWriterConfig with large buffers so that each thread creates a single segment, and exposing the number of threads/segments as a tunable parameter since it is going to impact the recall and latency reported by the benchmark

@benwtrent
Copy link
Member

I opened a PR for ann-benchmarks: erikbern/ann-benchmarks#315

I tested PyLucene locally, comparing it to @msokolov's "batch" methodology (writing to disk and spinning up a Java process). PyLucene provided faster numbers in both Batch and iterative on my m1 mac. This indicates that the overhead experienced is acceptable for now.

There are improvements we can make to PyLucene (native numpy support is a big one). But, it seems prudent to get measurements recorded in ann-benchmarks and iteratively improve our python over time :).

Also, its for Lucene 9.1. I am going to see about getting a newer PyLucene version built hitting the latest release. We can update our benchmark implementation to hit the new version when its available.

@msokolov @jtibshirani IDK if y'all want to review the ann-benchmarks implementation or not. Let me know what you think.

@jtibshirani
Copy link
Member

Thanks @benwtrent, it's great to see that PyLucene works well and has low overhead! It feels more solid than what we were doing before.

+1 to preparing a new version. As I mentioned on the PR, we made a big correction to the algorithm in Lucene 9.2 (https://issues.apache.org/jira/browse/LUCENE-10527). It would be too bad if results came out and Lucene did much worse than hnswlib because it was missing this fix!

It'd also be great to compare the results against hnswlib as part of the submission. We can double-check that recall is the same for a given set of parameters. This would give confidence we've interpreted all the parameters right in the ann-benchmarks runner. I'm happy to help with this since I have hnswlib set up locally.

@ovalhub
Copy link

ovalhub commented Oct 31, 2022 via email

@benwtrent
Copy link
Member

benwtrent commented Oct 31, 2022

@ovalhub numpy collections are already native. To use them, I have to pull them into python collections and then cast them to be native again.

Example:

X = X.tolist()
doc.add(KnnVectorField("knn", JArray('float')(X), fieldType))

The .toList() is weird to me as X is already a natively backed float array, just a numpy object.

Admittedly, I may be misunderstanding how the communication between Python -> native -> Java is done. But it SEEMS that calling tolist for numpy unnecessarily pulls an already native collection into a generic python list.

EDIT:

I am using "native" here to indicate a contiguous space in native memory for a particular data type. In this particular instance, float.

@benwtrent
Copy link
Member

It'd also be great to compare the results against hnswlib as part of the submission. We can double-check that recall is the same for a given set of parameters. This would give confidence we've interpreted all the parameters right in the ann-benchmarks runner. I'm happy to help with this since I have hnswlib set up locally.

Ah, for sure. That would be useful. We would have to wait for a new version of PyLucene to be built for this though as right now we still have the bug in: https://issues.apache.org/jira/browse/LUCENE-10527

@ovalhub
Copy link

ovalhub commented Oct 31, 2022 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants