Benchmark KNN search with ann-benchmarks [LUCENE-9625] #10665
Michael Sokolov (@msokolov) (migrated from JIRA) Hm, I used this issue number incorrectly for the attached PR; somehow I got it in my head that this issue was about implementing HNSW diversity, but now I see it is not, and that issue doesn't exist! So I'll open it and see if I can transfer the PR over there, or maybe re-open it if need be.
Julie Tibshirani (@jtibshirani) (migrated from JIRA) I somehow completely missed this before posting my own ann-benchmarks results to #10976 and #10980! It's great your setup avoids py4j; it looks easier to maintain than the one I posted. Is the plan to keep this open until Lucene 9 is released and we can integrate it into the ann-benchmarks repo?
Michael Sokolov (@msokolov) (migrated from JIRA) Yes, once we have a publicly available release I think we can modify the scripts in that patch to download it instead of expecting it to be present locally already.
Balmukund Mandal (migrated from JIRA) You've mentioned that the jars should be copied, but it's not clear which Lucene jars we have to copy.
Michael Sokolov (@msokolov) (migrated from JIRA) > You've mentioned that the jars should be copied, but it's not clear which Lucene jars we have to copy. I think lucene-core should be enough. What have you tried?
Balmukund Mandal (migrated from JIRA) Thank you Michael for your response. I tried the same, but I am getting the error below. I also copied the jar and KnnGraphTester.class files into the lib and classes folders that are created inside the algorithms folder. Error: Unable to initialize main class org.apache.lucene.util.hnsw.KnnGraphTester
Michael Sokolov (@msokolov) (migrated from JIRA) So, clearly it's not finding KnnGraphTester. It probably needs to go in the proper folder structure (org/apache/lucene/util/hnsw/KnnGraphTester.class, matching its package) relative to the classpath root. TBH I don't remember how the files are set up in this ann-benchmarks environment, but I hope that helps you figure it out!
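To make the folder-structure point concrete, here is a tiny sketch (the `classes/` root is an assumption for illustration): a JVM resolves a class by mapping its fully qualified name onto a matching directory path under the classpath root.

```python
import os

# A JVM resolves a class by turning its fully qualified name into a relative
# path under a classpath root ("classes" is an assumed root for illustration).
def class_file_path(root, fqcn):
    return os.path.join(root, *fqcn.split(".")) + ".class"

print(class_file_path("classes", "org.apache.lucene.util.hnsw.KnnGraphTester"))
# -> classes/org/apache/lucene/util/hnsw/KnnGraphTester.class (on POSIX)
```

So placing KnnGraphTester.class directly in the classpath root, without the `org/apache/lucene/util/hnsw/` directories, would produce exactly the "Unable to initialize main class" error above.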
Balmukund Mandal (migrated from JIRA) Thank you Michael, it works. I think it's easier to include just the two jars, i.e. core and test, rather than including KnnGraphTester.class, which requires the proper directory structure. Hence a small modification in the luceneknn.py file to read everything from lib/* only.
Balmukund Mandal (migrated from JIRA) I was trying to run the benchmark and have a couple of questions. Indexing takes a long time, so is there a way to configure the benchmark to use an already existing index for search? Also, is there a way to configure the benchmark to use multiple threads for indexing? (It looks to me like indexing is single-threaded.)
Michael Sokolov (@msokolov) (migrated from JIRA) There's no support for using an existing index; creating the index is an important part of the benchmark, I think. As for threading, no, it would be necessary to modify the test harness. But maybe you should consider contributing to ann-benchmarks?
Balmukund Mandal (migrated from JIRA) Thank you Michael.
Michael Sokolov (@msokolov) (migrated from JIRA) I realized maybe this deserves a better explanation: I didn't use multi-threading in the KnnGraphTester that builds the index, since my goal at the time was really to evaluate whether our algorithm implementation is correct and how it performs on a single HNSW graph index. If we use multiple threads, this is going to lead to a more fragmented graph due to the way Lucene indexes segments; while that would be a useful point of comparison, it also creates a different variable to tune in the benchmark evaluation. If you do want to pursue this, I would suggest configuring the IndexWriterConfig with large buffers so that each thread creates a single segment, and exposing the number of threads/segments as a tunable parameter, since it is going to impact the recall and latency reported by the benchmark.
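The tradeoff can be sketched without Lucene at all (everything below is a hypothetical stand-in, not Lucene API): each indexing thread fills its own buffer and flushes one segment, so a query must then search one HNSW graph per segment and merge the results, which is what affects the reported recall and latency.

```python
from concurrent.futures import ThreadPoolExecutor

def build_segment(vectors):
    # Stand-in for building one segment's HNSW graph from this thread's buffer.
    return list(vectors)

def index_with_threads(vectors, n_threads):
    # With large per-thread buffers, each thread flushes exactly one segment.
    chunk = (len(vectors) + n_threads - 1) // n_threads
    parts = [vectors[i:i + chunk] for i in range(0, len(vectors), chunk)]
    with ThreadPoolExecutor(max_workers=n_threads) as ex:
        return list(ex.map(build_segment, parts))

segments = index_with_threads(list(range(100)), 4)
print(len(segments))  # 4 segments: each query now searches 4 graphs, not 1
```

Single-threaded indexing with one big flush is the degenerate case of one segment and one graph, which is what KnnGraphTester measures today.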
I opened a PR for ann-benchmarks: erikbern/ann-benchmarks#315 I tested PyLucene locally, comparing it to @msokolov's "batch" methodology (writing to disk and spinning up a Java process). PyLucene provided faster numbers in both batch and iterative modes on my m1 mac. This indicates that the overhead experienced is acceptable for now. There are improvements we can make to PyLucene (native numpy support is a big one). But it seems prudent to get measurements recorded in ann-benchmarks and iteratively improve our python over time :). Also, it's for Lucene 9.1. I am going to see about getting a newer PyLucene version built against the latest release. We can update our benchmark implementation to hit the new version when it's available. @msokolov @jtibshirani IDK if y'all want to review the ann-benchmarks implementation or not. Let me know what you think.
Thanks @benwtrent, it's great to see that PyLucene works well and has low overhead! It feels more solid than what we were doing before. +1 to preparing a new version. As I mentioned on the PR, we made a big correction to the algorithm in Lucene 9.2 (https://issues.apache.org/jira/browse/LUCENE-10527). It would be too bad if results came out and Lucene did much worse than hnswlib because it was missing this fix! It'd also be great to compare the results against hnswlib as part of the submission. We can double-check that recall is the same for a given set of parameters. This would give confidence we've interpreted all the parameters right in the ann-benchmarks runner. I'm happy to help with this since I have hnswlib set up locally. |
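The parity check described above comes down to comparing recall@k for the same parameters against the same ground truth. A minimal sketch (the IDs are made up for illustration):

```python
def recall_at_k(true_ids, found_ids):
    # Fraction of the ground-truth nearest neighbors the ANN search returned.
    return len(set(true_ids) & set(found_ids)) / len(true_ids)

# If Lucene and hnswlib are configured equivalently (same M, efConstruction,
# ef/fanout), their recall against the same ground truth should match closely.
truth = [1, 2, 3, 4]
print(recall_at_k(truth, [2, 3, 4, 7]))  # 0.75
```

A large recall gap at identical settings would suggest the ann-benchmarks runner is mapping one of the parameters incorrectly rather than a genuine algorithmic difference.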
On Mon, 31 Oct 2022, Benjamin Trent wrote:
> I tested PyLucene locally, comparing it to @msokolov's "batch" methodology
> (writing to disk and spinning up a Java process). PyLucene provided faster
> numbers in both batch and iterative on my m1 mac. This indicates that the
> overhead experienced is acceptable for now.
>
> There are improvements we can make to PyLucene (native numpy support is a
> big one). But, it seems prudent to get measurements recorded in
> ann-benchmarks and iteratively improve our python over time :).
What do you mean by "native numpy"? Are you under the impression that
PyLucene is a Python implementation of Lucene? It's not. It's an
auto-generated CPython wrapper around an auto-generated C++ wrapper
that calls Java Lucene via JNI, and it embeds a Java VM into a Python VM.
PyLucene _is_ Java Lucene, unchanged, embedded in a Python VM.
Andi..
@ovalhub `numpy` collections are already native. To use them, I have to pull them into python collections and then cast them to be native again. Example:

```
X = X.tolist()
doc.add(KnnVectorField("knn", JArray('float')(X), fieldType))
```

The `.tolist()` is weird to me as `X` is already a natively backed float array, just a numpy object. Admittedly, I may be misunderstanding how the communication between Python -> `native` -> Java is done. But it SEEMS that calling `tolist` for numpy unnecessarily pulls an already native collection into a generic python list.

EDIT: I am using "native" here to indicate a contiguous space in native memory for a particular data type. In this particular instance,
Ah, for sure. That would be useful. We would have to wait for a new version of PyLucene to be built for this though, as right now we still have the bug in: https://issues.apache.org/jira/browse/LUCENE-10527
On Mon, 31 Oct 2022, Benjamin Trent wrote:
> @ovalhub `numpy` collections are already native. To use them, I have to
> pull them into python collections and then cast them to be native again.
> Example:
>
> ```
> X = X.tolist()
> doc.add(KnnVectorField("knn", JArray('float')(X), fieldType))
> ```
>
> The `.tolist()` is weird to me as `X` is already a natively backed float
> array, just a numpy object. Admittedly, I may be misunderstanding how the
> communication between Python -> `native` -> Java is done. But it SEEMS that
> calling `tolist` for numpy unnecessarily pulls an already native collection
> into a generic python list.
When you make a JArray with "JArray('float')(X), fieldType))" the
JArray<float> code in JCC's JArray.h will iterate 'X' "as a sequence",
calling PySequence_GetItem(sequence, i) and fill a Java float array with the
values returned.
If native numpy collections are exposed to Python as sequences (ie, ordered
things that can be accessed with int indexes), then you should not have to
call .tolist() on it before using it with PyLucene/JCC.
It could also be that numpy's .tolist() just creates a wrapper object around
the native array to expose it as a Python sequence ? (I don't know numpy).
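For what it's worth, both questions above can be checked directly (assuming numpy is installed): numpy arrays do implement the sequence protocol (`len()` and integer indexing, which is what PySequence_GetItem relies on), and `.tolist()` is a full copy into a plain Python list, not a wrapper.

```python
import numpy as np

X = np.arange(4, dtype=np.float32)

# numpy arrays already behave as sequences: len() and int indexing work,
# which is what JCC's JArray iteration uses via PySequence_GetItem.
print(len(X), float(X[2]))      # 4 2.0

# .tolist() is not a wrapper: it copies into a plain Python list of floats.
lst = X.tolist()
print(type(lst).__name__, lst)  # list [0.0, 1.0, 2.0, 3.0]
```

So, if this reasoning holds, passing the numpy array straight to `JArray('float')(X)` should work without the `.tolist()` copy, at the cost of JCC still walking it element by element.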
In addition to benchmarking with luceneutil, it would be good to be able to make use of ann-benchmarks, which publishes results from many approximate KNN algorithms, including the HNSW implementation from its authors. We don't expect to challenge the performance of these native-code libraries; however, it would be good to know just how far off we are.
I started looking into this and posted a fork of ann-benchmarks that uses KnnGraphTester class to run these: https://github.com/msokolov/ann-benchmarks. It's still a WIP; you have to manually copy jars and the KnnGraphTester.class to the test host machine rather than downloading from a distribution. KnnGraphTester needs some modifications in order to support this process - this issue is mostly about that.
One thing I noticed is that some of the index builds with higher fanout (efConstruction) settings time out at 2h (on an AWS c5 instance), so this is concerning and I'll open a separate issue for trying to improve that.
Migrated from LUCENE-9625 by Michael Sokolov (@msokolov), updated May 20 2022
Pull requests: apache/lucene-solr#2157