SOLR-15880: K Nearest Neighbors Search #476

Merged: 80 commits, Jan 24, 2022
Changes from 50 commits
Commits
e96d41d
Implementation of
eliaporciani Dec 14, 2021
232de04
Merge branch 'main' of github.com:apache/solr into feature/index_vect…
eliaporciani Dec 14, 2021
d05f992
Added simple parsing for similarity functions
eliaporciani Dec 14, 2021
4231213
small fixes
eliaporciani Dec 20, 2021
80aa323
Merge remote-tracking branch 'upstream/main' into feature/index_vecto…
alessandrobenedetti Dec 21, 2021
6c32c93
some testing and refactor
alessandrobenedetti Dec 21, 2021
6eb8875
Dense Vector Field Type draft completed
alessandrobenedetti Dec 22, 2021
188f2f7
First Documentation Draft for field type
alessandrobenedetti Dec 22, 2021
35a7d08
Query parser review + test
alessandrobenedetti Dec 23, 2021
d307b21
Merge remote-tracking branch 'upstream/main' into feature/index_vecto…
alessandrobenedetti Dec 23, 2021
6065fcb
license fix
alessandrobenedetti Dec 23, 2021
561129b
documentation fix
alessandrobenedetti Dec 23, 2021
7d906ad
minor fixes
alessandrobenedetti Dec 24, 2021
e724a49
first documentation draft
alessandrobenedetti Dec 24, 2021
5198e79
documentation complete
alessandrobenedetti Dec 24, 2021
273e325
documentation complete
alessandrobenedetti Dec 24, 2021
88b81e4
minor documentation fix
alessandrobenedetti Dec 27, 2021
11bb38a
minor documentation fix
alessandrobenedetti Dec 27, 2021
6dcf63c
Merge remote-tracking branch 'upstream/main' into feature/index_vecto…
alessandrobenedetti Dec 30, 2021
3c15827
From string approach to numerical vector approach for serialization/d…
alessandrobenedetti Dec 30, 2021
cd8a6ee
minor
alessandrobenedetti Dec 30, 2021
0510def
from String.split to StringUtils.split( as per suggestion
alessandrobenedetti Jan 10, 2022
d3d67b8
introducing support for HNSW max connections and beam width codec cus…
alessandrobenedetti Jan 10, 2022
93053da
introducing support for HNSW max connections and beam width codec cus…
alessandrobenedetti Jan 10, 2022
99910ce
Update solr/solr-ref-guide/src/neural-search.adoc
alessandrobenedetti Jan 12, 2022
4582ef0
Update solr/solr-ref-guide/src/neural-search.adoc
alessandrobenedetti Jan 12, 2022
2fd0444
improved documentation for reranking
alessandrobenedetti Jan 12, 2022
7fb52ac
Update solr/solr-ref-guide/src/neural-search.adoc
alessandrobenedetti Jan 12, 2022
6f2aa9f
improved documentation for reranking
alessandrobenedetti Jan 12, 2022
3ab0084
avoid calling getFieldType twice
alessandrobenedetti Jan 18, 2022
8614d53
private visibility set
alessandrobenedetti Jan 18, 2022
d02e289
avoid for loop continuous arraylist resize for stored fields
alessandrobenedetti Jan 18, 2022
783053b
don't return default float vector in case of not numerical/string input
alessandrobenedetti Jan 18, 2022
db9ef19
Update solr/core/src/java/org/apache/solr/update/DocumentBuilder.java
alessandrobenedetti Jan 18, 2022
27cbfc9
Update solr/core/src/java/org/apache/solr/update/DocumentBuilder.java
alessandrobenedetti Jan 18, 2022
26b47ff
minor revert
alessandrobenedetti Jan 18, 2022
4b6c895
minor revert
alessandrobenedetti Jan 18, 2022
09c68bd
test for copy field dimension problem
alessandrobenedetti Jan 18, 2022
620f3fb
Update solr/core/src/java/org/apache/solr/schema/DenseVectorField.java
alessandrobenedetti Jan 18, 2022
e843b91
Update solr/core/src/java/org/apache/solr/search/neural/KnnQParser.java
alessandrobenedetti Jan 18, 2022
d71d9ab
Update solr/core/src/java/org/apache/solr/search/neural/KnnQParser.java
alessandrobenedetti Jan 18, 2022
a0c9a1b
minor signature change
alessandrobenedetti Jan 18, 2022
d2de75b
Update solr/core/src/test-files/solr/collection1/conf/bad-schema-dens…
alessandrobenedetti Jan 18, 2022
b449890
Update solr/core/src/test-files/solr/collection1/conf/bad-schema-dens…
alessandrobenedetti Jan 18, 2022
b3a8672
Update solr/core/src/test-files/solr/collection1/conf/schema-densevec…
alessandrobenedetti Jan 18, 2022
cc35dc5
Update solr/core/src/test-files/solr/collection1/conf/schema-densevec…
alessandrobenedetti Jan 18, 2022
740d9ed
Update solr/core/src/test-files/solr/collection1/conf/schema.xml
alessandrobenedetti Jan 18, 2022
01d0890
minor schema change
alessandrobenedetti Jan 18, 2022
9ace2d7
Update solr/core/src/test/org/apache/solr/search/QueryEqualityTest.java
alessandrobenedetti Jan 18, 2022
4386aee
minor documentation change
alessandrobenedetti Jan 18, 2022
be60d2b
Update solr/core/src/java/org/apache/solr/schema/DenseVectorField.java
alessandrobenedetti Jan 18, 2022
b1138ad
Update solr/core/src/java/org/apache/solr/schema/DenseVectorField.java
alessandrobenedetti Jan 18, 2022
52fe209
Update solr/core/src/java/org/apache/solr/core/SchemaCodecFactory.java
alessandrobenedetti Jan 18, 2022
e3abe41
documentation changes in response to feedback
alessandrobenedetti Jan 19, 2022
cc06a04
documentation changes in response to feedback
alessandrobenedetti Jan 19, 2022
01015bc
documentation changes in response to feedback
alessandrobenedetti Jan 19, 2022
751a20b
documentation changes in response to feedback
alessandrobenedetti Jan 19, 2022
e92f21d
documentation changes in response to feedback
alessandrobenedetti Jan 19, 2022
495fc83
documentation changes in response to feedback
alessandrobenedetti Jan 19, 2022
af83502
documentation changes in response to feedback
alessandrobenedetti Jan 19, 2022
d81bca7
documentation changes in response to feedback
alessandrobenedetti Jan 19, 2022
c88ba53
documentation changes in response to feedback
alessandrobenedetti Jan 19, 2022
736c2a6
documentation changes in response to feedback
alessandrobenedetti Jan 19, 2022
33e5973
documentation changes in response to feedback
alessandrobenedetti Jan 19, 2022
7b35f24
documentation changes in response to feedback
alessandrobenedetti Jan 19, 2022
403564b
documentation changes in response to feedback
alessandrobenedetti Jan 19, 2022
bddc802
documentation changes in response to feedback
alessandrobenedetti Jan 19, 2022
7e0c985
documentation changes in response to feedback
alessandrobenedetti Jan 19, 2022
fdec065
documentation changes in response to feedback
alessandrobenedetti Jan 19, 2022
b197ef0
documentation changes in response to feedback
alessandrobenedetti Jan 19, 2022
297c6a0
documentation changes in response to feedback
alessandrobenedetti Jan 19, 2022
210804f
documentation changes in response to feedback
alessandrobenedetti Jan 19, 2022
1bf2f3d
documentation changes in response to feedback
alessandrobenedetti Jan 19, 2022
fcb47ee
documentation changes in response to feedback
alessandrobenedetti Jan 19, 2022
8f596b3
documentation changes in response to feedback
alessandrobenedetti Jan 21, 2022
134e0e2
documentation changes in response to feedback
alessandrobenedetti Jan 21, 2022
92794ef
Merge remote-tracking branch 'upstream/main' into feature/index_vecto…
alessandrobenedetti Jan 24, 2022
6e16615
minor changes pre-commit
alessandrobenedetti Jan 24, 2022
477bfd9
minor changes pre-commit
alessandrobenedetti Jan 24, 2022
47e7983
minor changes pre-commit
alessandrobenedetti Jan 24, 2022
32 changes: 28 additions & 4 deletions solr/core/src/java/org/apache/solr/core/SchemaCodecFactory.java
@@ -16,23 +16,27 @@
*/
package org.apache.solr.core;

import java.lang.invoke.MethodHandles;
import java.util.Arrays;
import java.util.Locale;

import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.DocValuesFormat;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene90.Lucene90Codec;
import org.apache.lucene.codecs.lucene90.Lucene90Codec.Mode;
import org.apache.lucene.codecs.lucene90.Lucene90HnswVectorsFormat;
import org.apache.solr.common.SolrException;
import org.apache.solr.common.SolrException.ErrorCode;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.schema.DenseVectorField;
import org.apache.solr.schema.FieldType;
import org.apache.solr.schema.SchemaField;
import org.apache.solr.util.plugin.SolrCoreAware;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
* Per-field CodecFactory implementation, extends Lucene's
* and returns postings format implementations according to the
@@ -114,6 +118,26 @@ public DocValuesFormat getDocValuesFormatForField(String field) {
}
return super.getDocValuesFormatForField(field);
}

@Override
public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
final SchemaField schemaField = core.getLatestSchema().getFieldOrNull(field);
FieldType fieldType = (schemaField == null ? null : schemaField.getType());
if (fieldType != null && fieldType instanceof DenseVectorField) {
DenseVectorField vectorType = (DenseVectorField) fieldType;
String knnVectorFormatName = vectorType.getCodecFormat();
if (knnVectorFormatName != null) {
if (knnVectorFormatName.equals(Lucene90HnswVectorsFormat.class.getSimpleName())) {
int maxConn = vectorType.getHnswMaxConn();
int beamWidth = vectorType.getHnswBeamWidth();
return new Lucene90HnswVectorsFormat(maxConn, beamWidth);
} else {
return KnnVectorsFormat.forName(knnVectorFormatName);
}
}
}
return super.getKnnVectorsFormatForField(field);
}
};
}
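
The per-field selection in `getKnnVectorsFormatForField` can be sketched without Lucene or Solr on the classpath. The class below is a stand-alone illustration, not the PR's code: `KnnFormatSelectionSketch`, `describeFormat`, and the two `ASSUMED_DEFAULT_*` constants are hypothetical names, and the default values are assumptions standing in for `DEFAULT_MAX_CONN` and `DEFAULT_BEAM_WIDTH`. It returns a description of the chosen format rather than a real `KnnVectorsFormat`.

```java
import java.util.Map;
import java.util.Optional;

/** Stand-alone sketch of the three-way format selection; uses no Lucene or Solr classes. */
public class KnnFormatSelectionSketch {

  // Assumed stand-ins for the Lucene defaults; the values are illustrative only.
  static final int ASSUMED_DEFAULT_MAX_CONN = 16;
  static final int ASSUMED_DEFAULT_BEAM_WIDTH = 100;
  static final String HNSW_FORMAT_NAME = "Lucene90HnswVectorsFormat";

  /**
   * Describes which vectors format would be picked for a field type's attributes.
   * A null map models a field that is unknown or is not a dense vector field.
   */
  public static String describeFormat(Map<String, String> fieldTypeArgs) {
    if (fieldTypeArgs == null) {
      return "codec default";
    }
    String codecFormat = fieldTypeArgs.get("codecFormat");
    if (codecFormat == null) {
      // No explicit format configured: fall through to the codec's own default.
      return "codec default";
    }
    if (codecFormat.equals(HNSW_FORMAT_NAME)) {
      // Only the HNSW format takes the two hyper-parameters.
      int maxConn = Optional.ofNullable(fieldTypeArgs.get("hnswMaxConnections"))
          .map(Integer::parseInt)
          .orElse(ASSUMED_DEFAULT_MAX_CONN);
      int beamWidth = Optional.ofNullable(fieldTypeArgs.get("hnswBeamWidth"))
          .map(Integer::parseInt)
          .orElse(ASSUMED_DEFAULT_BEAM_WIDTH);
      return HNSW_FORMAT_NAME + "(maxConn=" + maxConn + ", beamWidth=" + beamWidth + ")";
    }
    // Any other name would be resolved by name lookup.
    return "forName(" + codecFormat + ")";
  }

  public static void main(String[] args) {
    System.out.println(describeFormat(Map.of(
        "codecFormat", HNSW_FORMAT_NAME,
        "hnswMaxConnections", "32")));
  }
}
```

The point of the sketch is the ordering of the checks: a missing `codecFormat` defers to the codec default, the HNSW name gets the custom hyper-parameters, and everything else is looked up by name.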

279 changes: 279 additions & 0 deletions solr/core/src/java/org/apache/solr/schema/DenseVectorField.java
@@ -0,0 +1,279 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.solr.schema;

import org.apache.lucene.codecs.lucene90.Lucene90HnswVectorsFormat;
import org.apache.lucene.document.KnnVectorField;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.queries.function.ValueSource;
import org.apache.lucene.search.KnnVectorQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.SortField;
import org.apache.lucene.util.hnsw.HnswGraph;
import org.apache.solr.common.SolrException;
import org.apache.solr.search.QParser;
import org.apache.solr.uninverting.UninvertingReader;

import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

import static java.util.Optional.ofNullable;
import static org.apache.lucene.codecs.lucene90.Lucene90HnswVectorsFormat.DEFAULT_BEAM_WIDTH;
import static org.apache.lucene.codecs.lucene90.Lucene90HnswVectorsFormat.DEFAULT_MAX_CONN;

/**
* Provides a field type to support Lucene's {@link
* org.apache.lucene.document.KnnVectorField}.
* See {@link org.apache.lucene.search.KnnVectorQuery} for more details.
* It supports a fixed cardinality dimension for the vector and a fixed similarity function.
* The default similarity is EUCLIDEAN (L2).
* The default index codec format is specified in the Lucene Codec constructor.
* For example, for Lucene 9.0 see {@link org.apache.lucene.codecs.lucene90.Lucene90Codec}.
* Currently only {@link org.apache.lucene.codecs.lucene90.Lucene90HnswVectorsFormat} is supported for
* advanced hyper-parameter customisation.
* See {@link org.apache.lucene.util.hnsw.HnswGraph} for more details about the implementation.
*
* <br>
* Only {@code Indexed} and {@code Stored} attributes are supported.
*/
public class DenseVectorField extends FloatPointField {

static final String KNN_VECTOR_DIMENSION = "vectorDimension";
static final String KNN_SIMILARITY_FUNCTION = "similarityFunction";

static final String CODEC_FORMAT = "codecFormat";
static final String HNSW_MAX_CONNECTIONS = "hnswMaxConnections";
static final String HNSW_BEAM_WIDTH = "hnswBeamWidth";

private int dimension;
private VectorSimilarityFunction similarityFunction;
private VectorSimilarityFunction DEFAULT_SIMILARITY = VectorSimilarityFunction.EUCLIDEAN;

private String codecFormat;
/**
* This parameter is coupled with the {@link Lucene90HnswVectorsFormat} format implementation.
* Controls how many of the nearest neighbor candidates are connected to the new node. Defaults to
* {@link Lucene90HnswVectorsFormat#DEFAULT_MAX_CONN}. See {@link HnswGraph} for more details.
*/
private int hnswMaxConn;
/**
* This parameter is coupled with the {@link Lucene90HnswVectorsFormat} format implementation.
* The number of candidate neighbors to track while searching the graph for each newly inserted
* node. Defaults to {@link Lucene90HnswVectorsFormat#DEFAULT_BEAM_WIDTH}. See {@link
* HnswGraph} for details.
*/
private int hnswBeamWidth;

public DenseVectorField() {
super();
}

public DenseVectorField(int dimension) {
super();
this.dimension = dimension;
this.similarityFunction = DEFAULT_SIMILARITY;
}

public DenseVectorField(int dimension, VectorSimilarityFunction similarityFunction) {
super();
this.dimension = dimension;
this.similarityFunction = similarityFunction;
}

@Override
public void init(IndexSchema schema, Map<String, String> args) {
this.dimension = ofNullable(args.get(KNN_VECTOR_DIMENSION))
.map(value -> Integer.parseInt(value))
.orElseThrow(() -> new SolrException(SolrException.ErrorCode.SERVER_ERROR, "the vector dimension is a mandatory parameter"));
args.remove(KNN_VECTOR_DIMENSION);

this.similarityFunction = ofNullable(args.get(KNN_SIMILARITY_FUNCTION))
.map(value -> VectorSimilarityFunction.valueOf(value.toUpperCase(Locale.ROOT)))
.orElse(DEFAULT_SIMILARITY);
args.remove(KNN_SIMILARITY_FUNCTION);

this.codecFormat = args.get(CODEC_FORMAT);
args.remove(CODEC_FORMAT);

this.hnswMaxConn = ofNullable(args.get(HNSW_MAX_CONNECTIONS))
.map(value -> Integer.parseInt(value))
.orElse(DEFAULT_MAX_CONN);
args.remove(HNSW_MAX_CONNECTIONS);

this.hnswBeamWidth = ofNullable(args.get(HNSW_BEAM_WIDTH))
.map(value -> Integer.parseInt(value))
.orElse(DEFAULT_BEAM_WIDTH);
args.remove(HNSW_BEAM_WIDTH);

this.properties &= ~MULTIVALUED;
this.properties &= ~UNINVERTIBLE;

super.init(schema, args);
}

public int getDimension() {
return dimension;
}

public VectorSimilarityFunction getSimilarityFunction() {
return similarityFunction;
}

public String getCodecFormat() {
return codecFormat;
}

public Integer getHnswMaxConn() {
return hnswMaxConn;
}

public Integer getHnswBeamWidth() {
return hnswBeamWidth;
}

@Override
public void checkSchemaField(final SchemaField field) throws SolrException {
super.checkSchemaField(field);
if (field.multiValued()) {
throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
getClass().getSimpleName() + " fields can not be multiValued: " + field.getName());
}

if (field.hasDocValues()) {
throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
getClass().getSimpleName() + " fields can not have docValues: " + field.getName());
}
}

public List<IndexableField> createFields(SchemaField field, Object value) {
ArrayList<IndexableField> fields = new ArrayList<>();
float[] parsedVector;
try {
parsedVector = parseVector(value);
} catch (RuntimeException e) {
throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, "Error while creating field '" + field + "' from value '" + value + "', expected format:'[f1, f2, f3...fn]' e.g. [1.0, 3.4, 5.6]", e);
}

if (field.indexed()) {
fields.add(createField(field, parsedVector));
}
if (field.stored()) {
fields.ensureCapacity(parsedVector.length + 1);
for (float vectorElement : parsedVector) {
fields.add(getStoredField(field, vectorElement));
}
}
return fields;
}

@Override
public IndexableField createField(SchemaField field, Object parsedVector) {
if (parsedVector == null) return null;
float[] typedVector = (float[]) parsedVector;
return new KnnVectorField(field.getName(), typedVector, similarityFunction);
}

/**
* Index Time Parsing
* The inputValue is an ArrayList with a type that depends on the loader used:
* - {@link org.apache.solr.handler.loader.XMLLoader} and {@link org.apache.solr.handler.loader.CSVLoader} produce an ArrayList of String
* - {@link org.apache.solr.handler.loader.JsonLoader} produces an ArrayList of Double
* - {@link org.apache.solr.handler.loader.JavabinLoader} produces an ArrayList of Float
*
* @param inputValue - An {@link ArrayList} containing the elements of the vector
* @return the vector parsed
*/
float[] parseVector(Object inputValue) {
if (!(inputValue instanceof List)) {
throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, "incorrect vector format." +
" The expected format is an array :'[f1,f2..f3]' where each element f is a float");
}
List<?> inputVector = (List<?>) inputValue;
if (inputVector.size() != dimension) {
throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, "incorrect vector dimension." +
" The vector value has size "
+ inputVector.size() + " while it is expected a vector with size " + dimension);
}

float[] vector = new float[dimension];
if (inputVector.get(0) instanceof CharSequence) {
for (int i = 0; i < dimension; i++) {
try {
vector[i] = Float.parseFloat(inputVector.get(i).toString());
} catch (NumberFormatException e) {
throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, "incorrect vector element: '" + inputVector.get(i) +
"'. The expected format is:'[f1,f2..f3]' where each element f is a float");
}
}
} else if (inputVector.get(0) instanceof Number) {
for (int i = 0; i < dimension; i++) {
vector[i] = ((Number) inputVector.get(i)).floatValue();
}
} else {
throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, "incorrect vector format." +
" The expected format is an array :'[f1,f2..f3]' where each element f is a float");
}

return vector;
}

@Override
public UninvertingReader.Type getUninversionType(SchemaField sf) {
return null;
}

@Override
public ValueSource getValueSource(SchemaField field, QParser parser) {
throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
"Function queries are not supported for Dense Vector fields.");
}

public Query getKnnVectorQuery(String fieldName, float[] vectorToSearch, int topK) {
return new KnnVectorQuery(fieldName, vectorToSearch, topK);
}

/**
* Not Supported
*/
@Override
public Query getFieldQuery(QParser parser, SchemaField field, String externalVal) {
throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
"Field Queries are not supported for Dense Vector fields. Please use the {!knn} query parser to run K nearest neighbors search queries.");
}

/**
* Not Supported
*/
@Override
public Query getRangeQuery(QParser parser, SchemaField field, String part1, String part2, boolean minInclusive, boolean maxInclusive) {
throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
"Range Queries are not supported for Dense Vector fields. Please use the {!knn} query parser to run K nearest neighbors search queries.");
}

/**
* Not Supported
*/
@Override
public SortField getSortField(SchemaField field, boolean top) {
throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, "Cannot sort on a Dense Vector field");
}

}
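
The index-time parsing in `parseVector` can be tried in isolation with a minimal stand-alone sketch. It is not the PR's code: the class name `VectorParsingSketch` is hypothetical, `IllegalArgumentException` stands in for `SolrException`, and the dimension is passed as a parameter instead of being read from the field type. One deliberate difference is labeled in the comments: this sketch checks the type of every element, whereas the method above branches on the type of the first element only.

```java
import java.util.List;

/** Stand-alone sketch of dense vector parsing; uses no Solr classes. */
public class VectorParsingSketch {

  /**
   * Converts a list of Strings or Numbers into a float[] of the expected dimension.
   * Assumption: IllegalArgumentException replaces SolrException(BAD_REQUEST).
   */
  public static float[] parseVector(Object inputValue, int dimension) {
    if (!(inputValue instanceof List)) {
      throw new IllegalArgumentException("incorrect vector format: expected a list of floats");
    }
    List<?> inputVector = (List<?>) inputValue;
    if (inputVector.size() != dimension) {
      throw new IllegalArgumentException("incorrect vector dimension: got "
          + inputVector.size() + ", expected " + dimension);
    }
    float[] vector = new float[dimension];
    for (int i = 0; i < dimension; i++) {
      Object element = inputVector.get(i);
      // Difference from the PR: the type is checked per element, not once on element 0.
      if (element instanceof Number) {
        // JSON-style loaders yield Double, javabin-style loaders yield Float.
        vector[i] = ((Number) element).floatValue();
      } else if (element instanceof CharSequence) {
        // XML/CSV-style loaders yield Strings; NumberFormatException is an
        // IllegalArgumentException, so callers see a single exception type.
        vector[i] = Float.parseFloat(element.toString());
      } else {
        throw new IllegalArgumentException("incorrect vector element: '" + element + "'");
      }
    }
    return vector;
  }

  public static void main(String[] args) {
    float[] parsed = parseVector(List.of("1.0", "3.4", "5.6"), 3);
    System.out.println(java.util.Arrays.toString(parsed));
  }
}
```

Checking each element costs one `instanceof` per entry but rejects mixed-type input with a clear message instead of a `ClassCastException`.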
2 changes: 2 additions & 0 deletions solr/core/src/java/org/apache/solr/search/QParserPlugin.java
@@ -30,6 +30,7 @@
import org.apache.solr.search.join.GraphQParserPlugin;
import org.apache.solr.search.join.HashRangeQParserPlugin;
import org.apache.solr.search.mlt.MLTQParserPlugin;
import org.apache.solr.search.neural.KnnQParserPlugin;
import org.apache.solr.util.plugin.NamedListInitializedPlugin;

public abstract class QParserPlugin implements NamedListInitializedPlugin, SolrInfoBean {
@@ -87,6 +88,7 @@ public abstract class QParserPlugin implements NamedListInitializedPlugin, SolrI
map.put(MinHashQParserPlugin.NAME, new MinHashQParserPlugin());
map.put(HashRangeQParserPlugin.NAME, new HashRangeQParserPlugin());
map.put(RankQParserPlugin.NAME, new RankQParserPlugin());
map.put(KnnQParserPlugin.NAME, new KnnQParserPlugin());

standardPlugins = Collections.unmodifiableMap(map);
}
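
Once the parser is registered under `KnnQParserPlugin.NAME`, clients reach it through local-params syntax. The helper below is a hedged sketch of how a client might assemble such a query string. `KnnQueryStringSketch` and `knnQuery` are hypothetical names, and the `f` and `topK` local parameters are assumptions here, since the parser's parameters are not shown in this part of the diff.

```java
/** Stand-alone sketch that formats a {!knn} query string; uses no Solr classes. */
public class KnnQueryStringSketch {

  /**
   * Builds a query of the form {!knn f=FIELD topK=K}[v1, v2, ...].
   * Assumption: "f" and "topK" are the local-param names the parser accepts.
   */
  public static String knnQuery(String field, int topK, float[] vector) {
    StringBuilder query = new StringBuilder("{!knn f=")
        .append(field)
        .append(" topK=")
        .append(topK)
        .append("}[");
    for (int i = 0; i < vector.length; i++) {
      if (i > 0) {
        query.append(", ");
      }
      query.append(vector[i]);
    }
    return query.append(']').toString();
  }

  public static void main(String[] args) {
    System.out.println(knnQuery("vector", 10, new float[] {1.0f, 2.5f}));
  }
}
```

The resulting string would still need URL encoding before being sent as the `q` parameter of an HTTP request.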