Add support for indexing byte-sized knn vectors (#90774)
This change adds an element_type as an optional mapping parameter for dense vector fields as
described in #89784. This also adds a byte element_type for dense vector fields that supports storing
dense vectors using only 8 bits per dimension. This is only supported when the mapping parameter
index is set to true.

The code follows a similar pattern to our NumberFieldMapper where we have an enum for 
ElementType, and it has methods that DenseVectorFieldType and DenseVectorMapper can delegate to 
to support each available type (just float and byte for now).
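
The delegation pattern described above can be sketched roughly as follows. This is an illustrative Python rendering, not the commit's Java code: the class `DenseVectorField` and the methods `from_name`, `parse_value`, `parse`, and `encoded_size` are hypothetical names chosen for the example. The byte range and the per-dimension sizes are taken from the documentation changes in this commit, and accepting integral floats such as `127.0` is an assumption based on the test fixtures.

```python
from enum import Enum


class ElementType(Enum):
    """Hypothetical Python stand-in for the ElementType enum in the commit."""

    FLOAT = ("float", 4)  # 4-byte floating-point value per dimension
    BYTE = ("byte", 1)    # 1-byte integer value per dimension

    def __init__(self, type_name, bytes_per_dim):
        self.type_name = type_name
        self.bytes_per_dim = bytes_per_dim

    @classmethod
    def from_name(cls, type_name):
        # Look up the enum member for a mapping value such as "byte".
        for member in cls:
            if member.type_name == type_name:
                return member
        raise ValueError(f"unsupported element_type [{type_name}]")

    def parse_value(self, value):
        """Type-specific validation and conversion of one dimension."""
        if self is ElementType.FLOAT:
            return float(value)
        # BYTE: only integer values in [-128, 127] are accepted.
        if value != int(value) or not -128 <= value <= 127:
            raise ValueError(f"[{value}] is not a valid byte vector value")
        return int(value)


class DenseVectorField:
    """Hypothetical field type that delegates per-type work to ElementType."""

    def __init__(self, dims, element_type="float"):
        self.dims = dims
        self.element_type = ElementType.from_name(element_type)

    def parse(self, values):
        if len(values) != self.dims:
            raise ValueError(f"expected {self.dims} dimensions, got {len(values)}")
        return [self.element_type.parse_value(v) for v in values]

    def encoded_size(self):
        # Bytes needed for the raw vector values of one document.
        return self.dims * self.element_type.bytes_per_dim
```

Keeping the per-type behavior on the enum means a further element type would only need a new member and a branch in `parse_value`, while the field class stays unchanged.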
jdconrad committed Oct 20, 2022
1 parent 81fd614 commit f28ae4b
Showing 7 changed files with 863 additions and 65 deletions.
5 changes: 5 additions & 0 deletions docs/changelog/90774.yaml
@@ -0,0 +1,5 @@
pr: 90774
summary: Add support for indexing byte-sized knn vectors
area: Vector Search
type: feature
issues: []
33 changes: 28 additions & 5 deletions docs/reference/mapping/types/dense-vector.asciidoc
@@ -5,7 +5,7 @@
<titleabbrev>Dense vector</titleabbrev>
++++

The `dense_vector` field type stores dense vectors of float values. Dense
The `dense_vector` field type stores dense vectors of numeric values. Dense
vector fields can be used in the following ways:

* In <<query-dsl-script-score-query,`script_score`>> queries, to score
@@ -14,8 +14,12 @@ documents matching a filter
vectors to a query vector

The `dense_vector` type does not support aggregations or sorting.
When <<dense-vector-params, `element_type`>> is `byte`,
<<query-dsl-script-score-query,`script_score`>> is not supported.

You add a `dense_vector` field as an array of floats:
You add a `dense_vector` field as an array of numeric values
based on <<dense-vector-params, `element_type`>> with
`float` by default:

[source,console]
--------------------------------------------------
@@ -104,6 +108,16 @@ Dense vector fields cannot be indexed if they are within

The following mapping parameters are accepted:

`element_type`::
(Optional, string)
The data type used to encode vectors. The supported data types are
`float` (default) and `byte`. `float` indexes a 4-byte floating-point
value per dimension. `byte` indexes a 1-byte integer value per dimension.
`byte` requires `index` to be `true`. Using `byte` can result in a
substantially smaller index size with the trade-off of lower
precision. Vectors using `byte` require dimensions with integer values
between -128 and 127, inclusive, for both indexing and searching.

`dims`::
(Required, integer)
Number of vector dimensions. Can't exceed `1024` for indexed vectors
@@ -134,9 +148,18 @@ distance) between the vectors. The document `_score` is computed as
`dot_product`:::
Computes the dot product of two vectors. This option provides an optimized way
to perform cosine similarity. In order to use it, all vectors must be of unit
length, including both document and query vectors. The document `_score` is
computed as `(1 + dot_product(query, vector)) / 2`.
to perform cosine similarity. The constraints and computed score are defined
by `element_type`.
+
When `element_type` is `float`, all vectors must be unit length, including both
document and query vectors. The document `_score` is computed as
`(1 + dot_product(query, vector)) / 2`.
+
When `element_type` is `byte`, all vectors must have the same
length, including both document and query vectors, or results will be inaccurate.
The document `_score` is computed as
`0.5 + (dot_product(query, vector) / (32768 * dims))`
where `dims` is the number of dimensions per vector.
`cosine`:::
Computes the cosine similarity. Note that the most efficient way to perform
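
The two `dot_product` scoring formulas in the hunk above can be checked with a small sketch. The function names are hypothetical and the code only restates the documented arithmetic; it is not the scoring code Elasticsearch or Lucene runs. One observation that follows from the byte range: the largest magnitude a single dimension can contribute is 128 * 128 = 16384, so dividing by `32768 * dims` keeps the byte score within [0, 1].

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def float_dot_product_score(query, vector):
    # Documented for element_type float; assumes unit-length vectors.
    return (1 + dot(query, vector)) / 2


def byte_dot_product_score(query, vector):
    # Documented for element_type byte; dims is the number of dimensions.
    dims = len(vector)
    return 0.5 + dot(query, vector) / (32768 * dims)


# Unit-length float vectors: identical, orthogonal, and opposite directions.
print(float_dot_product_score([1.0, 0.0], [1.0, 0.0]))
print(float_dot_product_score([1.0, 0.0], [0.0, 1.0]))
print(float_dot_product_score([1.0, 0.0], [-1.0, 0.0]))

# Byte vectors at the extremes of the [-128, 127] range.
print(byte_dot_product_score([-128, -128], [-128, -128]))
print(byte_dot_product_score([-128, -128], [127, 127]))
```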
70 changes: 70 additions & 0 deletions docs/reference/search/search-your-data/knn-search.asciidoc
@@ -248,6 +248,76 @@ search has a higher probability of finding the true `k` top nearest neighbors.
Similarly, you can decrease `num_candidates` for faster searches with
potentially less accurate results.

[discrete]
[[approximate-knn-using-byte-vectors]]
==== Approximate kNN using byte vectors

The approximate kNN search API supports `byte` value vectors in
addition to `float` value vectors. Use the <<search-api-knn, `knn` option>>
to search a `dense_vector` field with <<dense-vector-params, `element_type`>> set to
`byte` and indexing enabled.

. Explicitly map one or more `dense_vector` fields with
<<dense-vector-params, `element_type`>> set to `byte` and indexing enabled.
+
[source,console]
----
PUT byte-image-index
{
  "mappings": {
    "properties": {
      "byte-image-vector": {
        "type": "dense_vector",
        "element_type": "byte",
        "dims": 2,
        "index": true,
        "similarity": "cosine"
      },
      "title": {
        "type": "text"
      }
    }
  }
}
----
// TEST[continued]

. Index your data ensuring all vector values
are integers within the range [-128, 127].
+
[source,console]
----
POST byte-image-index/_bulk?refresh=true
{ "index": { "_id": "1" } }
{ "byte-image-vector": [5, -20], "title": "moose family" }
{ "index": { "_id": "2" } }
{ "byte-image-vector": [8, -15], "title": "alpine lake" }
{ "index": { "_id": "3" } }
{ "byte-image-vector": [11, 23], "title": "full moon" }
----
//TEST[continued]

. Run the search using the <<search-api-knn, `knn` option>>
ensuring the `query_vector` values are integers within the
range [-128, 127].
+
[source,console]
----
POST byte-image-index/_search
{
  "knn": {
    "field": "byte-image-vector",
    "query_vector": [-5, 9],
    "k": 10,
    "num_candidates": 100
  },
  "fields": [ "title" ]
}
----
// TEST[continued]
// TEST[s/"k": 10/"k": 3/]
// TEST[s/"num_candidates": 100/"num_candidates": 3/]
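
When the documents in the indexing step come from application code, the range requirement can be checked before the request is sent. The following is a minimal sketch that only builds the newline-delimited `_bulk` body; the helper name `build_bulk_body` is hypothetical, no HTTP client is involved, and accepting integral floats such as `5.0` is an assumption based on the REST tests in this commit.

```python
import json


def build_bulk_body(docs, vector_field="byte-image-vector"):
    """Return an NDJSON _bulk body for (doc_id, source) pairs.

    Raises ValueError if any vector value is not an integer in [-128, 127].
    """
    lines = []
    for doc_id, source in docs:
        for value in source[vector_field]:
            if value != int(value) or not -128 <= value <= 127:
                raise ValueError(
                    f"document [{doc_id}]: [{value}] is outside the byte range"
                )
        # Each document takes two lines: the action, then the source.
        lines.append(json.dumps({"index": {"_id": doc_id}}))
        lines.append(json.dumps(source))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


docs = [
    ("1", {"byte-image-vector": [5, -20], "title": "moose family"}),
    ("2", {"byte-image-vector": [8, -15], "title": "alpine lake"}),
    ("3", {"byte-image-vector": [11, 23], "title": "full moon"}),
]
print(build_bulk_body(docs))
```

The same check applies to `query_vector` in the search step, since the documented range holds for both indexing and searching.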

[discrete]
[[knn-search-filter-example]]
==== Filtered kNN search
@@ -0,0 +1,180 @@
setup:
  - skip:
      version: ' - 8.5.99'
      reason: 'byte-sized kNN search added in 8.6'

  - do:
      indices.create:
        index: test
        body:
          settings:
            number_of_replicas: 0
          mappings:
            properties:
              name:
                type: keyword
              vector:
                type: dense_vector
                element_type: byte
                dims: 5
                index: true
                similarity: cosine

  - do:
      index:
        index: test
        id: "1"
        body:
          name: cow.jpg
          vector: [2, -1, 1, 4, -3]

  - do:
      index:
        index: test
        id: "2"
        body:
          name: moose.jpg
          vector: [127.0, -128.0, 0.0, 1.0, -1.0]

  - do:
      index:
        index: test
        id: "3"
        body:
          name: rabbit.jpg
          vector: [5, 4.0, 3, 2.0, 127]

  - do:
      indices.refresh: {}

---
"kNN search only":
  - do:
      search:
        index: test
        body:
          fields: [ "name" ]
          knn:
            field: vector
            query_vector: [127, 127, -128, -128, 127]
            k: 2
            num_candidates: 3

  - match: {hits.hits.0._id: "3"}
  - match: {hits.hits.0.fields.name.0: "rabbit.jpg"}

  - match: {hits.hits.1._id: "2"}
  - match: {hits.hits.1.fields.name.0: "moose.jpg"}

---
"kNN search plus query":
  - do:
      search:
        index: test
        body:
          fields: [ "name" ]
          knn:
            field: vector
            query_vector: [127.0, -128.0, 0.0, 1.0, -1.0]
            k: 2
            num_candidates: 3
          query:
            term:
              name: rabbit.jpg

  - match: {hits.hits.0._id: "2"}
  - match: {hits.hits.0.fields.name.0: "moose.jpg"}

  - match: {hits.hits.1._id: "3"}
  - match: {hits.hits.1.fields.name.0: "rabbit.jpg"}

  - match: {hits.hits.2._id: "1"}
  - match: {hits.hits.2.fields.name.0: "cow.jpg"}

---
"kNN search with filter":
  - do:
      search:
        index: test
        body:
          fields: [ "name" ]
          knn:
            field: vector
            query_vector: [5.0, 4, 3.0, 2, 127.0]
            k: 2
            num_candidates: 3

            filter:
              term:
                name: "rabbit.jpg"

  - match: {hits.total.value: 1}
  - match: {hits.hits.0._id: "3"}
  - match: {hits.hits.0.fields.name.0: "rabbit.jpg"}

  - do:
      search:
        index: test
        body:
          fields: [ "name" ]
          knn:
            field: vector
            query_vector: [2, -1, 1, 4, -3]
            k: 2
            num_candidates: 3
            filter:
              - term:
                  name: "rabbit.jpg"
              - term:
                  _id: 2

  - match: {hits.total.value: 0}

---
"kNN search with explicit search_type":
  - do:
      catch: bad_request
      search:
        index: test
        search_type: query_then_fetch
        body:
          fields: [ "name" ]
          knn:
            field: vector
            query_vector: [-0.5, 90.0, -10, 14.8, -156.0]
            k: 2
            num_candidates: 3

  - match: { error.root_cause.0.type: "illegal_argument_exception" }
  - match: { error.root_cause.0.reason: "cannot set [search_type] when using [knn] search, since the search type is determined automatically" }

---
"Test nonexistent field":
  - do:
      catch: bad_request
      search:
        index: test
        body:
          fields: [ "name" ]
          knn:
            field: nonexistent
            query_vector: [ 1, 0, 0, 0, -1 ]
            k: 2
            num_candidates: 3
  - match: { error.root_cause.0.type: "query_shard_exception" }
  - match: { error.root_cause.0.reason: "failed to create query: field [nonexistent] does not exist in the mapping" }

---
"Direct kNN queries are disallowed":
  - do:
      catch: bad_request
      search:
        index: test
        body:
          query:
            knn:
              field: vector
              query_vector: [ -1, 0, 1, 2, 3 ]
              num_candidates: 1
  - match: { error.root_cause.0.type: "illegal_argument_exception" }
  - match: { error.root_cause.0.reason: "[knn] queries cannot be provided directly, use the [knn] body parameter instead" }
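
The hit order asserted in the "kNN search only" test can be cross-checked with an exact, brute-force cosine ranking over the three fixture vectors. This is a sketch for reasoning about the expected results, not the approximate search that Elasticsearch performs; `cosine` and `top_k` are hypothetical helper names.

```python
import math

# Fixture vectors from the setup section, keyed by document id.
DOCS = {
    "1": [2, -1, 1, 4, -3],        # cow.jpg
    "2": [127, -128, 0, 1, -1],    # moose.jpg
    "3": [5, 4, 3, 2, 127],        # rabbit.jpg
}


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, k):
    """Exact nearest neighbors by cosine similarity, best first."""
    ranked = sorted(DOCS, key=lambda doc_id: cosine(query, DOCS[doc_id]), reverse=True)
    return ranked[:k]


# Query vector from the "kNN search only" test.
print(top_k([127, 127, -128, -128, 127], 2))  # ['3', '2']
```

With this query only `rabbit.jpg` has a positive dot product (16632), while `moose.jpg` (-382) is closer to orthogonal than `cow.jpg` (-894 against a much shorter vector), which is consistent with the asserted order of ids 3 then 2.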