SOLR-15880: K Nearest Neighbors Search #476

Merged

alessandrobenedetti merged 80 commits into apache:main from SeaseLtd:feature/index_vector_field

Jan 24, 2022

Contributor

alessandrobenedetti commented Dec 23, 2021 (edited)

https://issues.apache.org/jira/browse/SOLR-15880

Description

This is the first milestone in integrating neural search in Apache Solr.
We are reviewing the Pull Request internally and finishing the documentation.
Other than that, the code is ready for review.
It includes a DenseVectorField that wraps Lucene's Hierarchical Navigable Small World (HNSW) implementation to index dense vectors.
It includes a Knn query parser to run K Nearest Neighbors queries.

Solution

A new Apache Solr field type has been added.
A new Apache Solr Query Parser has been added.

Tests

Dense vector fields and the query parser have tests.

Checklist

Please review the following and check all that apply:

I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
I have created a Jira issue and added the issue ID to my pull request title.
I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
I have developed this patch against the main branch.
I have run ./gradlew check.
I have added tests for my changes.
I have added documentation for the Reference Guide.

eliaporciani and others added 10 commits

December 14, 2021 11:40


          Implementation of

e96d41d

- indexing knn vector field
- custom query parser for knn vector search


          Merge branch 'main' of github.com:apache/solr into feature/index_vector_field

232de04


          Added simple parsing for similarity functions

d05f992


          small fixes


          Merge remote-tracking branch 'upstream/main' into feature/index_vector_field

80aa323


          some testing and refactor

6c32c93


          Dense Vector Field Type draft completed

6eb8875


          First Documentation Draft for field type

188f2f7


          Query parser review + test

35a7d08


          Merge remote-tracking branch 'upstream/main' into feature/index_vector_field

d307b21

alessandrobenedetti marked this pull request as draft

December 23, 2021 17:59

alessandrobenedetti self-assigned this

alessandrobenedetti requested review from mocobeta and msokolov

December 23, 2021 18:08

alessandrobenedetti added 2 commits

December 23, 2021 18:16


          license fix

6065fcb


          documentation fix

561129b

epugh reviewed

solr/solr-ref-guide/src/neural-search.adoc Outdated

alessandrobenedetti changed the title ~~SOLR-14397~~ SOLR-15880


          minor fixes

7d906ad

sonatype-lift bot reviewed

solr/core/src/java/org/apache/solr/schema/DenseVectorField.java Outdated

alessandrobenedetti added 3 commits

December 24, 2021 13:55


          first documentation draft

e724a49


          documentation complete

5198e79


          documentation complete

273e325

alessandrobenedetti marked this pull request as ready for review

December 24, 2021 16:43

alessandrobenedetti requested a review from ctargett

December 24, 2021 16:43

Contributor Author

alessandrobenedetti commented Dec 24, 2021

The Pull Request is now complete and ready for review.
Adding @ctargett to the loop as there are plenty of documentation additions.
Merry Christmas :D

alessandrobenedetti added 2 commits

December 27, 2021 12:45


          minor documentation fix

88b81e4


          minor documentation fix

11bb38a

msokolov reviewed

Contributor

msokolov left a comment

This is great! Very excited to see this getting into Solr. My main comments are about the de/serialization format. I also wonder if you have a plan for providing access to the index-time hyperparameters, like graph fanout (maxConns) etc. For best performance, it is pretty important to be able to optimize these for a particular set of vector embeddings.

solr/core/src/java/org/apache/solr/schema/DenseVectorField.java Outdated

+                  }
+                  private IndexableField createStoredField(SchemaField field, Object value) {
+                      return new StoredField(field.getName(), value.toString());

Contributor

msokolov Dec 27, 2021

This doesn't seem like something very useful for numeric vector fields? You'd end up with strings like [0.3485723998437523,...] which would be a very inefficient way to store the information

Contributor Author

alessandrobenedetti Dec 30, 2021

This should have been resolved with the latest commit.

solr/core/src/java/org/apache/solr/schema/DenseVectorField.java Outdated

+                  @Override
+                  public void write(TextResponseWriter writer, String name, IndexableField f) throws IOException {
+                      writer.writeVal(name, f.stringValue());

Contributor

msokolov Dec 27, 2021

Not sure what stringValue will return here - will it be the Arrays.toString() of the vector? Something else?

Contributor Author

alessandrobenedetti Dec 30, 2021

This should have been resolved with the latest commit.

solr/core/src/java/org/apache/solr/schema/DenseVectorField.java Outdated

+                  /**
+                   * Parses a String vector.
+                   *
+                   * @param value with format: [f1, f2, f3, f4...fn]

Contributor

msokolov Dec 27, 2021

Have you considered an alternate more compressed string format? Maybe b64-encode the bytes of the float array?
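To make the suggestion above concrete, here is a minimal sketch of Base64-encoding the raw bytes of a float array and decoding them back. It is an illustration only, not code from this PR: the class name `VectorBase64Sketch`, the method names, and the little-endian byte order are all assumptions chosen for the example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper, not part of Solr or of this PR.
public class VectorBase64Sketch {

    // Packs each float into 4 bytes (little-endian, an arbitrary choice) and Base64-encodes the result.
    static String encode(float[] vector) {
        ByteBuffer buffer = ByteBuffer.allocate(vector.length * Float.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
        for (float value : vector) {
            buffer.putFloat(value);
        }
        return Base64.getEncoder().encodeToString(buffer.array());
    }

    // Reverses encode(): Base64-decodes, then reads the bytes back as floats.
    static float[] decode(String encoded) {
        ByteBuffer buffer = ByteBuffer.wrap(Base64.getDecoder().decode(encoded))
                .order(ByteOrder.LITTLE_ENDIAN);
        float[] vector = new float[buffer.remaining() / Float.BYTES];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = buffer.getFloat();
        }
        return vector;
    }

    public static void main(String[] args) {
        float[] vector = {0.3485724f, -0.1784f, 0.0096f};
        String asText = Arrays.toString(vector);
        String asBase64 = encode(vector);
        // Compare the size of the two string representations.
        System.out.println(asText + " (" + asText.length() + " chars)");
        System.out.println(asBase64 + " (" + asBase64.length() + " chars)");
        // The round trip is lossless because the exact float bits are preserved.
        System.out.println(Arrays.equals(vector, decode(asBase64)));
    }
}
```

Base64 spends 4 characters per 3 bytes, so each float costs a fixed 5 to 6 characters regardless of how many decimal digits it would print with. The trade-off is that the stored value is no longer human-readable.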

Contributor Author

alessandrobenedetti Dec 30, 2021

This should have been resolved with the latest commit.

solr/core/src/java/org/apache/solr/schema/DenseVectorField.java

Contributor Author

alessandrobenedetti commented Dec 27, 2021

Thanks @msokolov for your initial feedback!
@eliaporciani and I will think about it and design some changes in the next few days!

alessandrobenedetti added 8 commits

January 19, 2022 11:27


          documentation changes in response to feedback

33e5973


          documentation changes in response to feedback

7b35f24


          documentation changes in response to feedback

403564b


          documentation changes in response to feedback

bddc802


          documentation changes in response to feedback

7e0c985


          documentation changes in response to feedback

fdec065


          documentation changes in response to feedback

b197ef0


          documentation changes in response to feedback

297c6a0

Contributor Author

alessandrobenedetti commented Jan 19, 2022

I should have addressed most of the comments regarding the documentation.
Thank you very much for the suggestions @ctargett, they are really valuable, especially for a software engineer like me who has rarely touched the Solr docs :)

Only a few questions remain, which I don't think are critical, so I'll wait for replies on them before committing.

Regarding the "Neural Search" name, as I said in an earlier comment, I think it's fine, but I'll raise the matter with the wider community on Slack for quick feedback!

Cheers and thanks again!


          documentation changes in response to feedback

210804f

Contributor Author

alessandrobenedetti commented Jan 19, 2022 (edited)

I refactored the documentation contribution, changing the name of the section to "Dense Vector Search".
Whenever in the future we add some additional neural search techniques, we can add a parent section that contains the various children.
I'll wait a few more hours to gather further feedback, especially on the documentation (thanks again @ctargett for your time).

If there are no blockers, I'll proceed with the merge tomorrow!

magibney reviewed

solr/solr-ref-guide/src/dense-vector-search.adoc Outdated


          documentation changes in response to feedback

1bf2f3d

magibney reviewed

solr/solr-ref-guide/src/dense-vector-search.adoc Outdated

alessandrobenedetti added 7 commits

January 19, 2022 18:28


          documentation changes in response to feedback

fcb47ee


          documentation changes in response to feedback

8f596b3


          documentation changes in response to feedback

134e0e2


          Merge remote-tracking branch 'upstream/main' into feature/index_vector_field

92794ef


          minor changes pre-commit

6e16615


          minor changes pre-commit

477bfd9


          minor changes pre-commit

47e7983

alessandrobenedetti merged commit c309a8a into apache:main

alessandrobenedetti added a commit that referenced this pull request


          SOLR-15880: K Nearest Neighbors Search (#476)

6d04e8b

This contribution introduces dense vector indexing and searching

alessandrobenedetti added a commit that referenced this pull request


          SOLR-15880: K Nearest Neighbors Search (#476)

ed3f22e

This contribution introduces dense vector indexing and searching

stefan-it commented Feb 3, 2022

Hi guys,

thanks for working on this feature! I would like to index vectors with a dimension > 1024 (the embeddings come from a ResNet). Do you know how this would be possible? 🤔

Many thanks in advance ❤️

Contributor

msokolov commented Feb 3, 2022

We put in a limit of 1024 since we need some limit in order to help people not to shoot themselves in the foot and then blame the software... What dimension in fact are your vectors? Maybe consider adding some dimensionality reduction step? Sorry I don't know what a ResNet is :)
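As one illustration of what a dimensionality reduction step could look like, here is a minimal sketch of a Gaussian random projection from 2048 to 1024 dimensions. Random projection is a technique picked for this example, not one proposed in the thread, and the class name `RandomProjectionSketch`, its methods, and the seed are all hypothetical. The projection only approximately preserves distances, so its effect on result quality would need to be measured on real data.

```java
import java.util.Random;

// Hypothetical helper, not part of Solr or of this PR.
public class RandomProjectionSketch {

    // Builds an outputDim x inputDim matrix of Gaussian entries scaled by 1/sqrt(outputDim).
    // A fixed seed keeps the matrix reproducible across runs.
    static float[][] projectionMatrix(int inputDim, int outputDim, long seed) {
        Random random = new Random(seed);
        float[][] matrix = new float[outputDim][inputDim];
        double scale = 1.0 / Math.sqrt(outputDim);
        for (int row = 0; row < outputDim; row++) {
            for (int col = 0; col < inputDim; col++) {
                matrix[row][col] = (float) (random.nextGaussian() * scale);
            }
        }
        return matrix;
    }

    // Multiplies the matrix by the vector, producing a vector with matrix.length dimensions.
    static float[] project(float[][] matrix, float[] vector) {
        float[] reduced = new float[matrix.length];
        for (int row = 0; row < matrix.length; row++) {
            double sum = 0.0;
            for (int col = 0; col < vector.length; col++) {
                sum += matrix[row][col] * vector[col];
            }
            reduced[row] = (float) sum;
        }
        return reduced;
    }

    public static void main(String[] args) {
        float[][] matrix = projectionMatrix(2048, 1024, 42L);
        float[] embedding = new float[2048];
        for (int i = 0; i < embedding.length; i++) {
            embedding[i] = (float) Math.sin(i); // placeholder values standing in for a real embedding
        }
        float[] reduced = project(matrix, embedding);
        System.out.println(reduced.length);
    }
}
```

The same matrix has to be applied to every indexed vector and to every query vector, otherwise the reduced vectors are not comparable.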

stefan-it commented Feb 3, 2022

I'm currently using 2048, but I can use a smaller model that outputs 1024 :)

ramayer commented Apr 21, 2022

@msokolov: larger embeddings are definitely useful. Oxford's "VGG-Face" outputs a vector of 2622 elements.

Support for larger embeddings would enable Solr features like "boost the results of word docs containing an image with a similar face to George Washington" using such models.

rattle99 commented Feb 29, 2024 (edited)

Trying to query for topK greater than 10 but cannot get more than 10 documents in response.

http://localhost:8983/solr/films/query?q={%21knn%20f=film_vector%20topK=17}[-0.1784,0.0096,-0.1455,0.4167,-0.1148,-0.0053,-0.0651,-0.0415,0.0859,-0.1789]

This is from the film examples in solr docs, the query works fine on the UI with

{!knn f=film_vector topK=12}[-0.1784,0.0096,-0.1455,0.4167,-0.1148,-0.0053,-0.0651,-0.0415,0.0859,-0.1789]

Is there something in the documentation regarding this?
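One possible explanation, offered as an assumption rather than a confirmed diagnosis of this report: Solr's `rows` request parameter defaults to 10, so a response contains at most 10 documents unless `rows` is raised, whatever `topK` is set to. The sketch below builds a request URL that sets `rows` to the same value as `topK`. The class name `KnnRowsSketch` and the method `buildQueryUrl` are hypothetical; the host, collection, and field names are taken from the query above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Solr or SolrJ.
public class KnnRowsSketch {

    // Builds a knn query URL and sets rows=topK so the response is not cut off at the default of 10.
    static String buildQueryUrl(String baseUrl, String field, int topK, String vector) {
        String q = "{!knn f=" + field + " topK=" + topK + "}" + vector;
        return baseUrl + "?q=" + URLEncoder.encode(q, StandardCharsets.UTF_8) + "&rows=" + topK;
    }

    public static void main(String[] args) {
        String vector = "[-0.1784,0.0096,-0.1455,0.4167,-0.1148,-0.0053,-0.0651,-0.0415,0.0859,-0.1789]";
        System.out.println(
                buildQueryUrl("http://localhost:8983/solr/films/query", "film_vector", 17, vector));
    }
}
```

`URLEncoder.encode(String, Charset)` requires Java 10 or later. It also percent-encodes the braces and brackets, which some HTTP clients reject when sent raw.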

Reviewers

epugh left review comments

noblepaul left review comments

ctargett left review comments

risdenk left review comments

fmmoret left review comments

joel-bernstein left review comments

magibney left review comments

cpoerschke left review comments

sonatype-lift left review comments

Awaiting requested review from mocobeta

Awaiting requested review from msokolov

Labels

None yet