Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-15880: K Nearest Neighbors Search #476

Merged
merged 80 commits into from
Jan 24, 2022

Conversation

alessandrobenedetti
Copy link
Contributor

@alessandrobenedetti alessandrobenedetti commented Dec 23, 2021

https://issues.apache.org/jira/browse/SOLR-15880

Description

This is the first milestone in integrating neural search in Apache Solr.
We are reviewing the Pull Request internally and finishing the documentation.
Other than that, the code is ready for review.
It includes a DenseVectorField to wrap the Lucene implementation using Hierarchical Navigable Small World to index the dense vector.
It includes a Knn query parser to run K Nearest Neighbors query.

Solution

A new Apache Solr field Type has been added.
A new Apache Solr Query Parser has been added.

Tests

Dense vector fields and the query parser have tests.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

@alessandrobenedetti alessandrobenedetti changed the title SOLR-14397 SOLR-15880 Dec 24, 2021
@alessandrobenedetti alessandrobenedetti marked this pull request as ready for review December 24, 2021 16:43
@alessandrobenedetti
Copy link
Contributor Author

The Pull Request is now complete and ready for review.
Adding @ctargett to the loop as there are plenty of documentation additions.
Merry Christmas :D

Copy link
Contributor

@msokolov msokolov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! Very excited to see this getting in to Solr. My main comments are about the de/serialization format. I wonder also if you have a plan for providing access to the index-time hyperparameters, like graph fanout (maxConns) etc. It is pretty important, for best performance, to be able to optimize these for a particular set of vectors embeddings.

}

private IndexableField createStoredField(SchemaField field, Object value) {
return new StoredField(field.getName(), value.toString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem like something very useful for numeric vector fields? You'd end up with strings like [0.3485723998437523,...] which would be a very inefficient way to store the information

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should have been resolved with latest commit


@Override
public void write(TextResponseWriter writer, String name, IndexableField f) throws IOException {
writer.writeVal(name, f.stringValue());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what stringValue will return here - will it be the Arrays.toString() of the vector? Something else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should have been resolved with latest commit

/**
* Parses a String vector.
*
* @param value with format: [f1, f2, f3, f4...fn]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered an alternate more compressed string format? Maybe b64-encode the bytes of the float array?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should have been resolved with latest commit

@alessandrobenedetti
Copy link
Contributor Author

thanks @msokolov for your initial feedback!
me and @eliaporciani will think about it and design some changes in the next few days!

@alessandrobenedetti
Copy link
Contributor Author

I should have addressed most of the comments regarding the documentation.
Thank you very much for the suggestions @ctargett , they are really valuable, especially for a software engineer that rarely has touched the solr doc like me :)

Only a few questions remain, that I don't think are really critical, so I'll wait for some reply on them before committing.

In regards to the "Neural Search" name, as I expressed in a comment before, I think it's fine, but I'll raise the matter to the wider community through slack for a quick feedback!

Cheers and thanks again!

@alessandrobenedetti
Copy link
Contributor Author

alessandrobenedetti commented Jan 19, 2022

I refactored the documentation contribution, changing the name of the section to "Dense Vector Search".
Whenever in the future we add some additional neural search techniques, we can add a parent section that contains the various children.
I'll wait some additional hours to gather additional feedback, especially for the documentation (thanks @ctargett again for your time).

If no blockers, I'll proceed tomorrow with the merge!

@alessandrobenedetti alessandrobenedetti merged commit c309a8a into apache:main Jan 24, 2022
alessandrobenedetti added a commit that referenced this pull request Jan 24, 2022
This contribution introduces dense vector indexing and searching
alessandrobenedetti added a commit that referenced this pull request Jan 24, 2022
This contribution introduces dense vector indexing and searching
@stefan-it
Copy link

Hi guys,

thanks for working on this feature! I would like to index vectors with a dimension > 1024 (embeddings come from a ResNet), do you know how this would be possible 🤔

Many thanks in advance ❤️

@msokolov
Copy link
Contributor

msokolov commented Feb 3, 2022

We put in a limit of 1024 since we need some limit in order to help people not to shoot themselves in the foot and then blame the software... What dimension in fact are your vectors? Maybe consider adding some dimensionality reduction step? Sorry I don't know what a ResNet is :)

@stefan-it
Copy link

I'm currently using 2048, but I can use a smaller model that outputs 1024 :)

@ramayer
Copy link

ramayer commented Apr 21, 2022

@msokolov : larger embeddings are definitely useful. Oxford's "VGG-Face" outputs a vector of 2622 elements.

Support for larger embeddings would enable Solr features like "boost the results of word docs containing an image with a similar face to George Washington" using such models .

@rattle99
Copy link

rattle99 commented Feb 29, 2024

Trying to query for topK greater than 10 but cannot get more than 10 documents in response.

http://localhost:8983/solr/films/query?q={%21knn%20f=film_vector%20topK=17}[-0.1784,0.0096,-0.1455,0.4167,-0.1148,-0.0053,-0.0651,-0.0415,0.0859,-0.1789]

This is from the film examples in solr docs, the query works fine on the UI with

{!knn f=film_vector topK=12}[-0.1784,0.0096,-0.1455,0.4167,-0.1148,-0.0053,-0.0651,-0.0415,0.0859,-0.1789]

Is there something in the documentation regarding this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.