Benchmark KNN search with ann-benchmarks [LUCENE-9625] #10665
Michael Sokolov (@msokolov) (migrated from JIRA) Hm, I used this issue number incorrectly for the attached PR; somehow I got it in my head that this issue was about implementing HNSW diversity, but now I see it is not, and that issue doesn't exist! So I'll open it and see if I can transfer the PR over there, or maybe re-open it if need be.
Julie Tibshirani (@jtibshirani) (migrated from JIRA) I somehow completely missed this before posting my own ann-benchmarks results to #10976 and #10980! It's great your setup avoids py4j; it looks easier to maintain than the one I posted. Is the plan to keep this open until Lucene 9 is released and we can integrate it into the ann-benchmarks repo?
Michael Sokolov (@msokolov) (migrated from JIRA) Yes, once we have a publicly available release I think we can modify the scripts in that patch to download it instead of expecting it to be present locally already.
Balmukund Mandal (migrated from JIRA) You've mentioned that the jars should be copied, but it's not clear which Lucene jars we have to copy.
Michael Sokolov (@msokolov) (migrated from JIRA) > You've mentioned that the jars should be copied, but it's not clear which Lucene jars we have to copy. I think lucene-core should be enough. What have you tried?
Balmukund Mandal (migrated from JIRA) Thank you Michael for your response. I tried the same, but I am getting the error below. I also copied the jar and KnnGraphTester.class files into the lib and classes folders that are created inside the algorithms folder. Error: Unable to initialize main class org.apache.lucene.util.hnsw.KnnGraphTester
Michael Sokolov (@msokolov) (migrated from JIRA) So, clearly it's not finding KnnGraphTester. It probably needs to go in the proper folder structure (org/apache/lucene/util/hnsw/KnnGraphTester.class, matching its package) relative to the classpath root. TBH I don't remember how the files are set up in this ann-benchmarks environment, but I hope that helps you figure it out!
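To make the folder-structure point concrete, here is a tiny sketch (the `classes/` root is an assumption for illustration): a JVM resolves a class by mapping its fully qualified name onto a matching directory path under the classpath root.

```python
import os

# A JVM resolves a class by turning its fully qualified name into a relative
# path under a classpath root ("classes" is an assumed root for illustration).
def class_file_path(root, fqcn):
    return os.path.join(root, *fqcn.split(".")) + ".class"

print(class_file_path("classes", "org.apache.lucene.util.hnsw.KnnGraphTester"))
# -> classes/org/apache/lucene/util/hnsw/KnnGraphTester.class (on POSIX)
```

So placing KnnGraphTester.class directly in the classpath root, without the `org/apache/lucene/util/hnsw/` directories, would produce exactly the "Unable to initialize main class" error above.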
Balmukund Mandal (migrated from JIRA) Thank you Michael, it works. I think it's easier to include just the two jars, i.e. core and test, rather than including KnnGraphTester.class, which requires the proper directory structure. Hence a small modification in the luceneknn.py file to read everything from lib/* only.
Balmukund Mandal (migrated from JIRA) I was trying to run the benchmark and have a couple of questions. Indexing takes a long time, so is there a way to configure the benchmark to use an already existing index for search? Also, is there a way to configure the benchmark to use multiple threads for indexing? (It looks to me like indexing is single-threaded.)
Michael Sokolov (@msokolov) (migrated from JIRA) There's no support for using an existing index; creating the index is an important part of the benchmark, I think. As for threading, no, it would be necessary to modify the test harness. But maybe you should consider contributing to ann-benchmarks?
Balmukund Mandal (migrated from JIRA) Thank you Michael.
Michael Sokolov (@msokolov) (migrated from JIRA) I realized maybe this deserves a better explanation: I didn't use multi-threading in the KnnGraphTester that builds the index, since my goal at the time was really to evaluate whether our algorithm implementation is correct and how it performs on a single HNSW graph index. If we use multiple threads, this is going to lead to a more fragmented graph due to the way Lucene indexes segments; while that would be a useful point of comparison, it also creates a different variable to tune in the benchmark evaluation. If you do want to pursue this, I would suggest configuring the IndexWriterConfig with large buffers so that each thread creates a single segment, and exposing the number of threads/segments as a tunable parameter, since it is going to impact the recall and latency reported by the benchmark.
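The tradeoff can be sketched without Lucene at all (everything below is a hypothetical stand-in, not Lucene API): each indexing thread fills its own buffer and flushes one segment, so a query must then search one HNSW graph per segment and merge the results, which is what affects the reported recall and latency.

```python
from concurrent.futures import ThreadPoolExecutor

def build_segment(vectors):
    # Stand-in for building one segment's HNSW graph from this thread's buffer.
    return list(vectors)

def index_with_threads(vectors, n_threads):
    # With large per-thread buffers, each thread flushes exactly one segment.
    chunk = (len(vectors) + n_threads - 1) // n_threads
    parts = [vectors[i:i + chunk] for i in range(0, len(vectors), chunk)]
    with ThreadPoolExecutor(max_workers=n_threads) as ex:
        return list(ex.map(build_segment, parts))

segments = index_with_threads(list(range(100)), 4)
print(len(segments))  # 4 segments: each query now searches 4 graphs, not 1
```

Single-threaded indexing with one big flush is the degenerate case of one segment and one graph, which is what KnnGraphTester measures today.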
I opened a PR for ann-benchmarks: erikbern/ann-benchmarks#315 I tested PyLucene locally, comparing it to @msokolov's "batch" methodology (writing to disk and spinning up a Java process). PyLucene provided faster numbers in both batch and iterative modes on my m1 mac. This indicates that the overhead experienced is acceptable for now. There are improvements we can make to PyLucene (native numpy support is a big one). But it seems prudent to get measurements recorded in ann-benchmarks and iteratively improve our python over time :). Also, it's for Lucene 9.1. I am going to see about getting a newer PyLucene version built against the latest release. We can update our benchmark implementation to hit the new version when it's available. @msokolov @jtibshirani IDK if y'all want to review the ann-benchmarks implementation or not. Let me know what you think.
Thanks @benwtrent, it's great to see that PyLucene works well and has low overhead! It feels more solid than what we were doing before. +1 to preparing a new version. As I mentioned on the PR, we made a big correction to the algorithm in Lucene 9.2 (https://issues.apache.org/jira/browse/LUCENE-10527). It would be too bad if results came out and Lucene did much worse than hnswlib because it was missing this fix! It'd also be great to compare the results against hnswlib as part of the submission. We can double-check that recall is the same for a given set of parameters. This would give confidence we've interpreted all the parameters right in the ann-benchmarks runner. I'm happy to help with this since I have hnswlib set up locally. |
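The parity check described above comes down to comparing recall@k for the same parameters against the same ground truth. A minimal sketch (the IDs are made up for illustration):

```python
def recall_at_k(true_ids, found_ids):
    # Fraction of the ground-truth nearest neighbors the ANN search returned.
    return len(set(true_ids) & set(found_ids)) / len(true_ids)

# If Lucene and hnswlib are configured equivalently (same M, efConstruction,
# ef/fanout), their recall against the same ground truth should match closely.
truth = [1, 2, 3, 4]
print(recall_at_k(truth, [2, 3, 4, 7]))  # 0.75
```

A large recall gap at identical settings would suggest the ann-benchmarks runner is mapping one of the parameters incorrectly rather than a genuine algorithmic difference.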
On Mon, 31 Oct 2022, Benjamin Trent wrote:
> I tested PyLucene locally, comparing it to @msokolov's "batch" methodology
> (writing to disk and spinning up a Java process). PyLucene provided faster
> numbers in both batch and iterative on my m1 mac. This indicates that the
> overhead experienced is acceptable for now.
>
> There are improvements we can make to PyLucene (native numpy support is a
> big one). But, it seems prudent to get measurements recorded in
> ann-benchmarks and iteratively improve our python over time :).
What do you mean by "native numpy"? Are you under the impression that
PyLucene is a Python implementation of Lucene? It's not. It's an
auto-generated CPython wrapper around an auto-generated C++ wrapper
that calls Java Lucene via JNI, and it embeds a Java VM into a Python VM.
PyLucene _is_ Java Lucene, unchanged, embedded in a Python VM.
Andi..
@ovalhub `numpy` collections are already native. To use them, I have to pull them into python collections and then cast them to be native again. Example:

```
X = X.tolist()
doc.add(KnnVectorField("knn", JArray('float')(X), fieldType))
```

The `.tolist()` is weird to me as `X` is already a natively backed float array, just a numpy object. Admittedly, I may be misunderstanding how the communication between Python -> `native` -> Java is done. But it SEEMS that calling `tolist` for numpy unnecessarily pulls an already native collection into a generic python list.

EDIT: I am using "native" here to indicate a contiguous space in native memory for a particular data type. In this particular instance,
Ah, for sure. That would be useful. We would have to wait for a new version of PyLucene to be built for this though, as right now we still have the bug in: https://issues.apache.org/jira/browse/LUCENE-10527
On Mon, 31 Oct 2022, Benjamin Trent wrote:
> @ovalhub `numpy` collections are already native. To use them, I have to
> pull them into python collections and then cast them to be native again.
> Example:
>
> ```
> X = X.tolist()
> doc.add(KnnVectorField("knn", JArray('float')(X), fieldType))
> ```
>
> The `.tolist()` is weird to me as `X` is already a natively backed float
> array, just a numpy object. Admittedly, I may be misunderstanding how the
> communication between Python -> `native` -> Java is done. But it SEEMS that
> calling `tolist` for numpy unnecessarily pulls an already native collection
> into a generic python list.
When you make a JArray with "JArray('float')(X), fieldType))" the
JArray<float> code in JCC's JArray.h will iterate 'X' "as a sequence",
calling PySequence_GetItem(sequence, i) and fill a Java float array with the
values returned.
If native numpy collections are exposed to Python as sequences (ie, ordered
things that can be accessed with int indexes), then you should not have to
call .tolist() on it before using it with PyLucene/JCC.
It could also be that numpy's .tolist() just creates a wrapper object around
the native array to expose it as a Python sequence ? (I don't know numpy).
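For what it's worth, both questions above can be checked directly (assuming numpy is installed): numpy arrays do implement the sequence protocol (`len()` and integer indexing, which is what PySequence_GetItem relies on), and `.tolist()` is a full copy into a plain Python list, not a wrapper.

```python
import numpy as np

X = np.arange(4, dtype=np.float32)

# numpy arrays already behave as sequences: len() and int indexing work,
# which is what JCC's JArray iteration uses via PySequence_GetItem.
print(len(X), float(X[2]))      # 4 2.0

# .tolist() is not a wrapper: it copies into a plain Python list of floats.
lst = X.tolist()
print(type(lst).__name__, lst)  # list [0.0, 1.0, 2.0, 3.0]
```

So, if this reasoning holds, passing the numpy array straight to `JArray('float')(X)` should work without the `.tolist()` copy, at the cost of JCC still walking it element by element.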
In addition to benchmarking with luceneutil, it would be good to be able to make use of ann-benchmarks, which publishes results from many approximate KNN algorithms, including the HNSW implementation from its authors. We don't expect to challenge the performance of these native-code libraries; however, it would be good to know just how far off we are.
I started looking into this and posted a fork of ann-benchmarks that uses KnnGraphTester class to run these: https://github.com/msokolov/ann-benchmarks. It's still a WIP; you have to manually copy jars and the KnnGraphTester.class to the test host machine rather than downloading from a distribution. KnnGraphTester needs some modifications in order to support this process - this issue is mostly about that.
One thing I noticed is that some of the index builds with higher fanout (efConstruction) settings time out at 2h (on an AWS c5 instance), so this is concerning and I'll open a separate issue for trying to improve that.
Migrated from LUCENE-9625 by Michael Sokolov (@msokolov), updated May 20 2022
Pull requests: apache/lucene-solr#2157