Add ParentJoin KNN support #12434

benwtrent · 2023-07-11T21:28:01Z

A join within Lucene is built by adding child-docs and parent-docs in order. Since our vector field already supports sparse indexing, it should be able to support parent join indexing.

However, when searching for the closest k, it is still the k nearest children vectors with no way to join back to the parent.

This commit adds this ability through some significant changes:

New leaf reader function that allows a collector for knn results
The knn results can then utilize bit-sets to join back to the parent id

This type of support is critical for nearest passage retrieval over larger documents. Generally, you want the top-k documents and knowledge of the nearest passages over each top-k document. Lucene's join functionality is a nice fit for this.

This does not replace the need for multi-valued vectors, which is important for other ranking methods (e.g. colbert token embeddings). But, it could be used in the case when metadata about the passage embedding must be stored (e.g. the related passage).

…pport-knn

jpountz · 2023-07-11T22:24:50Z

From a quick look, this lower level KNN collection API looks interesting. It has currently a high surface - presumably because extending the queue was easier to have a working prototype, which is cool - I'm curious how much leaner it can be made. It feels like we'd need at least collect(int docID, float similarity), float minSimilarity() and TopDocs topDocs(), would it be enough or is there more?

benwtrent · 2023-07-12T12:21:37Z

@alessandrobenedetti I took some of your ideas on deduplicating vector IDs based on some other id for this PR. If this work continues, I think some of it can transfer to the native multi-vector support in Lucene.

benwtrent · 2023-07-12T13:19:43Z

would it be enough or is there more?

I will dig a bit more on making this cleaner.

My biggest performance concerns are around keeping track of the heap-index -> ID and shuffling those around so often and resolving the docId by vector ordinal on every push.

benwtrent · 2023-07-12T19:02:41Z

@jpountz I took another shot at the KnnResults interface. I restricted the abstract and @Override methods to narrow the API. Additionally, I disconnected it from the queue, but it still has a queue object internally that sub-classes can utilize.

benwtrent · 2023-07-17T16:06:28Z

@jpountz my original benchmarks were flawed. There was a bug in my testing. Nested is actually 80% slower (or 1.8x times) than the current search times.

I am investigating the current possible causes.

benwtrent · 2023-08-06T19:01:28Z

@msokolov let me know if there are further changes required.

msokolov

thanks, I think you addressed my comments and I don't have anything else. I guess my only outstanding question is whether we have any approach to performance testing this -- we don't have any sample documents structured like this or test queries today in luceneutil, but that would be a nice followup

…pport-knn

A `join` within Lucene is built by adding child-docs and parent-docs in order. Since our vector field already supports sparse indexing, it should be able to support parent join indexing. However, when searching for the closest `k`, it is still the k nearest children vectors with no way to join back to the parent. This commit adds this ability through some significant changes: - New leaf reader function that allows a collector for knn results - The knn results can then utilize bit-sets to join back to the parent id This type of support is critical for nearest passage retrieval over larger documents. Generally, you want the top-k documents and knowledge of the nearest passages over each top-k document. Lucene's join functionality is a nice fit for this. This does not replace the need for multi-valued vectors, which is important for other ranking methods (e.g. colbert token embeddings). But, it could be used in the case when metadata about the passage embedding must be stored (e.g. the related passage).

…when parents are missing (#12504) This is a follow up to: #12434 Adds a test for when parents are missing in the index and verifies we return no hits. Previously this would have thrown an NPE

…rn highest score child doc ID by parent id (#12510) The current query is returning parent-id's based off of the nearest child-id score. However, its difficult to invert that relationship (meaning determining what exactly the nearest child was during search). So, I changed the new `ToParentBlockJoin[Byte|Float]KnnVectorQuery` to `DiversifyingChildren[Byte|Float]KnnVectorQuery` and now it returns the nearest child-id instead of just that child's parent id. The results are still diversified by parent-id. Now its easy to determine the nearest child vector as that is what the query is returning. To determine its parent, its as simple as using the previously provided parent bit set. Related to: #12434

alessandrobenedetti · 2023-10-06T15:37:07Z

Thanks @benwtrent for this work! I finally had the chance to take a look.
It's a lot and I see it's already merged, so I don't have any meaningful comment at the moment, but if I have time I'll dive into it in the future! (mostly when and if I resume the work on multi-valued, for which I am still waiting for fundings).
The work here drastically changes the way also my Pull Request should look like right now.

As a side note, do you happen to have any performance benchmark? I am quite curious as I always label nested docs approaches in Lucene to be 'slow', but having some facts (that potentially contradicts my statement) would be super cool!

benwtrent · 2023-10-06T15:56:34Z

The work here drastically changes the way also my Pull Request should look like right now.

Yes, I am sorry about that. But the good news is that the integration for multi-value vectors has some nicer APIs to take advantage of (e.g. KnnCollector) and it could possibly copy/paste the deduplicating nearest neighbor min-heap implementation.

As a side note, do you happen to have any performance benchmark?

The following test was completed over 139004 documents with 768 float32 dimensions.

The statistics for the nested value distributions:

1944 total unique documents
62.0 median number of nested values
71.50411522633745 mean number of nested values
309 max number of nested values
1 min number of nested values
2156.9469722481676 variance of nested values

|                                        50th percentile latency |          knn-search-10-100 |   3.10031     |     ms |
|                                        90th percentile latency |          knn-search-10-100 |   3.5629      |     ms |
|                                        99th percentile latency |          knn-search-10-100 |   4.60912     |     ms |
|                                      99.9th percentile latency |          knn-search-10-100 |  14.322       |     ms |
|                                       100th percentile latency |          knn-search-10-100 |  72.6463      |     ms |
|                                        50th percentile latency |   knn-nested-search-10-100 |   6.2615      |     ms |
|                                        90th percentile latency |   knn-nested-search-10-100 |   6.95849     |     ms |
|                                        99th percentile latency |   knn-nested-search-10-100 |   7.8881      |     ms |
|                                      99.9th percentile latency |   knn-nested-search-10-100 |  12.0871      |     ms |
|                                       100th percentile latency |   knn-nested-search-10-100 |  57.9238      |     ms |
|                                        50th percentile latency |        knn-search-100-1000 |   7.30288     |     ms |
|                                        90th percentile latency |        knn-search-100-1000 |   8.18694     |     ms |
|                                        99th percentile latency |        knn-search-100-1000 |   9.23673     |     ms |
|                                      99.9th percentile latency |        knn-search-100-1000 |  18.7072      |     ms |
|                                       100th percentile latency |        knn-search-100-1000 |  23.8712      |     ms |
|                                        50th percentile latency | knn-search-nested-100-1000 |  26.6446      |     ms |
|                                        90th percentile latency | knn-search-nested-100-1000 |  38.2561      |     ms |
|                                        99th percentile latency | knn-search-nested-100-1000 |  44.3627      |     ms |
|                                      99.9th percentile latency | knn-search-nested-100-1000 |  51.1843      |     ms |
|                                       100th percentile latency | knn-search-nested-100-1000 |  52.0864      |     ms |

GASP! Nested seems 2x to 4x slower!

But, keep in mind, we are eagerly joining! When I dug into the difference, I discovered that eagerly joining on this dataset meant we were visiting 3x to 5x more vectors. Consequently doing 3-5x more vector comparisons and deeper exploration of the graph. This lines up really nicely with the performance difference.

Since HNSW is log(n) the overall constant overhead of nested seems rather minor compared to the need to gather nearest vectors.

I am not sure these numbers are reflective of other nested/block-joining operations (like a term search).

alessandrobenedetti · 2023-10-06T16:18:14Z

The work here drastically changes the way also my Pull Request should look like right now.

Yes, I am sorry about that. But the good news is that the integration for multi-value vectors has some nicer APIs to take advantage of (e.g. KnnCollector) and it could possibly copy/paste the deduplicating nearest neighbor min-heap implementation.

No worries at all! My work is still paused, looking for sponsors, so no harm! When I resume it as you said I may find benefits (and do improvements) to the new data structures added (I admint I got lost in the amount of KnnCollectors and similar classes added, but I'm super curious to explore each of them thoroughfully.

As a side note, do you happen to have any performance benchmark?

The following test was completed over 139004 documents with 768 float32 dimensions.

The statistics for the nested value distributions:

1944 total unique documents 62.0 median number of nested values 71.50411522633745 mean number of nested values 309 max number of nested values 1 min number of nested values 2156.9469722481676 variance of nested values
|                                        50th percentile latency |          knn-search-10-100 |   3.10031     |     ms |
|                                        90th percentile latency |          knn-search-10-100 |   3.5629      |     ms |
|                                        99th percentile latency |          knn-search-10-100 |   4.60912     |     ms |
|                                      99.9th percentile latency |          knn-search-10-100 |  14.322       |     ms |
|                                       100th percentile latency |          knn-search-10-100 |  72.6463      |     ms |
|                                        50th percentile latency |   knn-nested-search-10-100 |   6.2615      |     ms |
|                                        90th percentile latency |   knn-nested-search-10-100 |   6.95849     |     ms |
|                                        99th percentile latency |   knn-nested-search-10-100 |   7.8881      |     ms |
|                                      99.9th percentile latency |   knn-nested-search-10-100 |  12.0871      |     ms |
|                                       100th percentile latency |   knn-nested-search-10-100 |  57.9238      |     ms |
|                                        50th percentile latency |        knn-search-100-1000 |   7.30288     |     ms |
|                                        90th percentile latency |        knn-search-100-1000 |   8.18694     |     ms |
|                                        99th percentile latency |        knn-search-100-1000 |   9.23673     |     ms |
|                                      99.9th percentile latency |        knn-search-100-1000 |  18.7072      |     ms |
|                                       100th percentile latency |        knn-search-100-1000 |  23.8712      |     ms |
|                                        50th percentile latency | knn-search-nested-100-1000 |  26.6446      |     ms |
|                                        90th percentile latency | knn-search-nested-100-1000 |  38.2561      |     ms |
|                                        99th percentile latency | knn-search-nested-100-1000 |  44.3627      |     ms |
|                                      99.9th percentile latency | knn-search-nested-100-1000 |  51.1843      |     ms |
|                                       100th percentile latency | knn-search-nested-100-1000 |  52.0864      |     ms |
GASP! Nested seems 2x to 4x slower!

But, keep in mind, we are eagerly joining! When I dug into the difference, I discovered that eagerly joining on this dataset meant we were visiting 3x to 5x more vectors. Consequently doing 3-5x more vector comparisons and deeper exploration of the graph. This lines up really nicely with the performance difference.

Since HNSW is log(n) the overall constant overhead of nested seems rather minor compared to the need to gather nearest vectors.

I am not sure these numbers are reflective of other nested/block-joining operations (like a term search).
Interesting and thanks for the heads up, I hope to investigate this further as well in the future!

david-sitsky · 2023-12-01T01:07:11Z

@benwtrent - did this really make it into 9.8.0? I downloaded the 9.8.0 release and ToParentBlockJoinFloatKnnVectorQuery does not seem to be present.

lucene-9.8.0/modules$ ls
lucene-analysis-common-9.8.0.jar      lucene-codecs-9.8.0.jar       lucene-queries-9.8.0.jar
lucene-analysis-icu-9.8.0.jar         lucene-core-9.8.0.jar         lucene-queryparser-9.8.0.jar
lucene-analysis-kuromoji-9.8.0.jar    lucene-demo-9.8.0.jar         lucene-replicator-9.8.0.jar
lucene-analysis-morfologik-9.8.0.jar  lucene-expressions-9.8.0.jar  lucene-sandbox-9.8.0.jar
lucene-analysis-nori-9.8.0.jar        lucene-facet-9.8.0.jar        lucene-spatial3d-9.8.0.jar
lucene-analysis-opennlp-9.8.0.jar     lucene-grouping-9.8.0.jar     lucene-spatial-extras-9.8.0.jar
lucene-analysis-phonetic-9.8.0.jar    lucene-highlighter-9.8.0.jar  lucene-suggest-9.8.0.jar
lucene-analysis-smartcn-9.8.0.jar     lucene-join-9.8.0.jar         META-INF
lucene-analysis-stempel-9.8.0.jar     lucene-luke-9.8.0.jar         module-info.class
lucene-backward-codecs-9.8.0.jar      lucene-memory-9.8.0.jar       org
lucene-benchmark-9.8.0.jar            lucene-misc-9.8.0.jar
lucene-classification-9.8.0.jar       lucene-monitor-9.8.0.jar
lucene-9.8.0/modules$ for file in *.jar; do unzip -v $file | grep ToParentBlockJoinFloatKnnVectorQuery; done
lucene-9.8.0/modules$ for file in *.jar; do unzip -v $file | grep ToParentBlockJoin; done
     948  Defl:N      541  43% 2023-09-21 21:59 3e2c2007  org/apache/lucene/search/join/ToParentBlockJoinQuery$1.class
    7806  Defl:N     3472  56% 2023-09-21 21:59 8e8db572  org/apache/lucene/search/join/ToParentBlockJoinQuery$BlockJoinScorer.class
    1814  Defl:N      709  61% 2023-09-21 21:59 d028b861  org/apache/lucene/search/join/ToParentBlockJoinQuery$BlockJoinWeight$1.class
    4114  Defl:N     1462  65% 2023-09-21 21:59 1e07024a  org/apache/lucene/search/join/ToParentBlockJoinQuery$BlockJoinWeight.class
    1589  Defl:N      838  47% 2023-09-21 21:59 cf93f84a  org/apache/lucene/search/join/ToParentBlockJoinQuery$ParentApproximation.class
    1869  Defl:N      836  55% 2023-09-21 21:59 1c0824bd  org/apache/lucene/search/join/ToParentBlockJoinQuery$ParentTwoPhase.class
    4723  Defl:N     1900  60% 2023-09-21 21:59 53b7ab29  org/apache/lucene/search/join/ToParentBlockJoinQuery.class
    2824  Defl:N     1085  62% 2023-09-21 21:59 323f6562  org/apache/lucene/search/join/ToParentBlockJoinSortField$1.class
    3191  Defl:N     1106  65% 2023-09-21 21:59 c9d8a699  org/apache/lucene/search/join/ToParentBlockJoinSortField$2$1.class
    1491  Defl:N      607  59% 2023-09-21 21:59 3a82677f  org/apache/lucene/search/join/ToParentBlockJoinSortField$2.class
    3196  Defl:N     1105  65% 2023-09-21 21:59 ec8017c1  org/apache/lucene/search/join/ToParentBlockJoinSortField$3$1.class
    1484  Defl:N      600  60% 2023-09-21 21:59 5bd0b2df  org/apache/lucene/search/join/ToParentBlockJoinSortField$3.class
    1368  Defl:N      576  58% 2023-09-21 21:59 29e61acb  org/apache/lucene/search/join/ToParentBlockJoinSortField$4$1$1.class
    3413  Defl:N     1152  66% 2023-09-21 21:59 4f73794f  org/apache/lucene/search/join/ToParentBlockJoinSortField$4$1.class
    1489  Defl:N      606  59% 2023-09-21 21:59 5132747c  org/apache/lucene/search/join/ToParentBlockJoinSortField$4.class
    1367  Defl:N      568  58% 2023-09-21 21:59 f6deee3a  org/apache/lucene/search/join/ToParentBlockJoinSortField$5$1$1.class
    3418  Defl:N     1151  66% 2023-09-21 21:59 d09c4733  org/apache/lucene/search/join/ToParentBlockJoinSortField$5$1.class
    1494  Defl:N      607  59% 2023-09-21 21:59 1e11e4cf  org/apache/lucene/search/join/ToParentBlockJoinSortField$5.class
    1266  Defl:N      685  46% 2023-09-21 21:59 18b34568  org/apache/lucene/search/join/ToParentBlockJoinSortField$6.class
    5837  Defl:N     2120  64% 2023-09-21 21:59 bfc259c3  org/apache/lucene/search/join/ToParentBlockJoinSortField.class

benwtrent · 2023-12-01T12:17:02Z

@david-sitsky sorry for the confusion, it was renamed DiversifyingChildren*KnnVectorQuery

david-sitsky · 2023-12-01T22:11:29Z

@david-sitsky sorry for the confusion, it was renamed DiversifyingChildren*KnnVectorQuery

Ah.. no worries, thanks. We should update the changelog https://lucene.apache.org/core/9_8_0/changes/Changes.html#v9.8.0.new_features since it is still referring to the old classnames.

benwtrent added 17 commits June 29, 2023 07:16

Add join support for knn

17491f3

Adding de-duplicating neighborqueue

ab950d0

adding tests

7e3b95f

moving things around

05ac40e

add queue results tests

506c7b7

more changes

d57a0b8

iterating on api design

5b03cd3

Merge branch 'main' into feature/add-join-support-knn

3622492

adding new leaf function

1481f74

adding to parent block vector query

f7aa19f

updating exact match for parent join vector query

1992840

fixing ordinal lookup

e431bd0

fixing api

1d33c4c

Merge remote-tracking branch 'upstream/main' into feature/add-join-su…

147488c

…pport-knn

javadoc

3f7ce5b

fixing tests and formatting

898d400

fixing tests

83ff209

cleaning up interface

b5efd9c

benwtrent added 7 commits July 18, 2023 15:22

formatting and reverting unnecessary changes

f1230df

fixing bad changes

43fec9d

removing gradle update

72b57c8

changing variable declaration

e1109fd

formatting

6b04bca

adding more tests

28c0c88

further adjusting API, need to fix tests

fb826de

cleaning up

9cf92e3

benwtrent requested a review from msokolov August 1, 2023 14:09

msokolov approved these changes Aug 6, 2023

View reviewed changes

Merge remote-tracking branch 'upstream/main' into feature/add-join-su…

a2693ba

…pport-knn

benwtrent merged commit a65cf89 into apache:main Aug 7, 2023
5 checks passed

benwtrent deleted the feature/add-join-support-knn branch August 7, 2023 18:46

benwtrent mentioned this pull request Aug 11, 2023

ToParentBlockJoin[Byte|Float]KnnVectorQuery needs to handle the case when parents are missing #12504

Merged

benwtrent mentioned this pull request Aug 15, 2023

Adjust ToParentBlockJoin[Byte|Float]KnnVectorQuery to return highest score child doc ID by parent id #12510

Merged

vamshin mentioned this pull request Aug 25, 2023

[FEATURE] Improved multi vector support using Nested fields opensearch-project/k-NN#1065

Closed

zhaih added this to the 9.8.0 milestone Sep 20, 2023

heemin32 mentioned this pull request Sep 26, 2023

Pass parent filter to inner query in nested query opensearch-project/OpenSearch#10246

Merged

6 tasks

cpoerschke mentioned this pull request Sep 28, 2023

SOLR-16985 Upgrade Lucene to 9.8.0 apache/solr#1958

Merged

7 tasks

heemin32 mentioned this pull request Oct 10, 2023

[Feature Request]Vector deduplication facebookresearch/faiss#3087

Open

kaivalnp mentioned this pull request Oct 14, 2023

[DISCUSS] Should there be a threshold-based vector search API? #12579

Closed

david-sitsky mentioned this pull request Nov 29, 2023

Multi-value Support for KnnVectorField #12313

Open

sarthakn7 mentioned this pull request Mar 1, 2024

Lucene 9.8 Yelp/nrtsearch#624

Merged

This was referenced Jul 23, 2024

[WIP] Multi-Vector support for HNSW search #13525

Open

Add support for KNN ParentJoin Benchmarks mikemccand/luceneutil#284

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add ParentJoin KNN support #12434

Add ParentJoin KNN support #12434

benwtrent commented Jul 11, 2023 •

edited

Loading

jpountz commented Jul 11, 2023

benwtrent commented Jul 12, 2023

benwtrent commented Jul 12, 2023

benwtrent commented Jul 12, 2023

benwtrent commented Jul 17, 2023

benwtrent commented Aug 6, 2023

msokolov left a comment

alessandrobenedetti commented Oct 6, 2023

benwtrent commented Oct 6, 2023

alessandrobenedetti commented Oct 6, 2023

david-sitsky commented Dec 1, 2023

benwtrent commented Dec 1, 2023

david-sitsky commented Dec 1, 2023

Add ParentJoin KNN support #12434

Add ParentJoin KNN support #12434

Conversation

benwtrent commented Jul 11, 2023 • edited Loading

jpountz commented Jul 11, 2023

benwtrent commented Jul 12, 2023

benwtrent commented Jul 12, 2023

benwtrent commented Jul 12, 2023

benwtrent commented Jul 17, 2023

benwtrent commented Aug 6, 2023

msokolov left a comment

Choose a reason for hiding this comment

alessandrobenedetti commented Oct 6, 2023

benwtrent commented Oct 6, 2023

alessandrobenedetti commented Oct 6, 2023

david-sitsky commented Dec 1, 2023

benwtrent commented Dec 1, 2023

david-sitsky commented Dec 1, 2023

benwtrent commented Jul 11, 2023 •

edited

Loading