Make cosine similarity faster by storing magnitude and normalizing vectors #99445

Merged
23 commits
8649522
Make cosine similarity faster through normalization
benwtrent Sep 11, 2023
0ac4be5
adding test coverage
benwtrent Sep 11, 2023
240af15
adding more testing
benwtrent Sep 11, 2023
1bf4c5a
adjusting iterator
benwtrent Sep 11, 2023
2c8ed33
Update docs/changelog/99445.yaml
benwtrent Sep 11, 2023
1724d28
updating uuid
benwtrent Sep 11, 2023
1825930
removing unnecessary commits
benwtrent Sep 11, 2023
b559bb9
Merge remote-tracking branch 'upstream/main' into feature/make-cosine…
benwtrent Oct 23, 2023
a1064d2
updating
benwtrent Oct 23, 2023
e762440
Merge branch 'main' into feature/make-cosine-faster
elasticmachine Oct 23, 2023
4c14eac
Merge remote-tracking branch 'upstream/main' into feature/make-cosine…
benwtrent Nov 15, 2023
9ac56fa
improving performance for scripts
benwtrent Nov 15, 2023
e6f0bf1
Merge branch 'feature/make-cosine-faster' of github.com:benwtrent/ela…
benwtrent Nov 15, 2023
ea0bb1c
fix compilation
benwtrent Nov 15, 2023
3f10623
fixing tests
benwtrent Nov 15, 2023
a9a775f
fixing tests
benwtrent Nov 15, 2023
b655c3b
Merge branch 'main' into feature/make-cosine-faster
benwtrent Nov 28, 2023
c6b8677
fixing tests
benwtrent Nov 28, 2023
143797b
Merge remote-tracking branch 'upstream/main' into feature/make-cosine…
benwtrent Nov 29, 2023
78ff5b2
Moving magnitude decoding to be lazy & addressing PR comments
benwtrent Nov 29, 2023
a8a69ac
addressing pr comments
benwtrent Dec 1, 2023
9825034
Merge remote-tracking branch 'upstream/main' into feature/make-cosine…
benwtrent Dec 1, 2023
a5eb4fb
danged formatting garbage
benwtrent Dec 1, 2023
5 changes: 5 additions & 0 deletions docs/changelog/99445.yaml
@@ -0,0 +1,5 @@
pr: 99445
summary: Make cosine similarity faster by storing magnitude and normalizing vectors
area: Vector Search
type: enhancement
issues: []
@@ -135,3 +135,87 @@ setup:
  - match: {hits.hits.2._id: "1"}
  - gte: {hits.hits.2._score: 0.78}
  - lte: {hits.hits.2._score: 0.791}

---
"L2 similarity with indexed cosine similarity vector":
  - skip:
      features: close_to
  - do:
      headers:
        Content-Type: application/json
      search:
        rest_total_hits_as_int: true
        body:
          query:
            script_score:
              query: {match_all: {} }
              script:
                source: "l2norm(params.query_vector, 'indexed_vector')"
                params:
                  query_vector: [0.5, 111.3, -13.0, 14.8, -156.0]

  - match: {hits.total: 3}

  - match: {hits.hits.0._id: "1"}
  - close_to: {hits.hits.0._score: {value: 301.36, error: 0.01}}

  - match: {hits.hits.1._id: "2"}
  - close_to: {hits.hits.1._score: {value: 11.34, error: 0.01}}

  - match: {hits.hits.2._id: "3"}
  - close_to: {hits.hits.2._score: {value: 0.01, error: 0.01}}
---
"L1 similarity with indexed cosine similarity vector":
  - skip:
      features: close_to
  - do:
      headers:
        Content-Type: application/json
      search:
        rest_total_hits_as_int: true
        body:
          query:
            script_score:
              query: {match_all: {} }
              script:
                source: "l1norm(params.query_vector, 'indexed_vector')"
                params:
                  query_vector: [0.5, 111.3, -13.0, 14.8, -156.0]

  - match: {hits.total: 3}

  - match: {hits.hits.0._id: "1"}
  - close_to: {hits.hits.0._score: {value: 485.18, error: 0.01}}

  - match: {hits.hits.1._id: "2"}
  - close_to: {hits.hits.1._score: {value: 12.30, error: 0.01}}

  - match: {hits.hits.2._id: "3"}
  - close_to: {hits.hits.2._score: {value: 0.01, error: 0.01}}
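Note on the two tests above: if a cosine field stores a unit-length vector plus its magnitude, as the PR title suggests, then `l1norm` and `l2norm` have to work on the rescaled vector to keep returning the same distances as before. The following is a minimal sketch of that idea, not the PR's implementation. The class and method names are hypothetical, and the sample values are borrowed from document "2" (`moose.jpg`) in the kNN test setup, on the assumption that the `indexed_vector` field holds the same data.

```java
// Hypothetical sketch: recover the original vector from a stored unit vector
// and its magnitude, then compute L1/L2 distances against a query vector.
final class DenormalizedNormSketch {

    /** Scales a stored unit-length vector back to its original values. */
    static float[] denormalize(float[] unit, float magnitude) {
        float[] out = new float[unit.length];
        for (int i = 0; i < unit.length; i++) {
            out[i] = unit[i] * magnitude;
        }
        return out;
    }

    /** L1 (Manhattan) distance between two equal-length vectors. */
    static double l1norm(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs((double) a[i] - b[i]);
        }
        return sum;
    }

    /** L2 (Euclidean) distance between two equal-length vectors. */
    static double l2norm(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = (double) a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        float[] query = {0.5f, 111.3f, -13.0f, 14.8f, -156.0f};
        // Assumed stored form of document "2": unit vector plus magnitude.
        float[] storedUnit = {-0.0026832016f, 0.53664035f, -0.06976324f, 0.07942277f, -0.8371589f};
        float storedMagnitude = 186.34454f;

        float[] original = denormalize(storedUnit, storedMagnitude);
        System.out.println("l1norm = " + l1norm(query, original));
        System.out.println("l2norm = " + l2norm(query, original));
    }
}
```

With these inputs the distances come out at roughly 12.30 and 11.34, which is what the tests expect for document "2". Computing the distances on the unit vector directly would give much smaller values.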
---
"Test vector magnitude equality":
  - skip:
      features: close_to

  - do:
      headers:
        Content-Type: application/json
      search:
        rest_total_hits_as_int: true
        body:
          query:
            script_score:
              query: {match_all: {} }
              script:
                source: "doc['vector'].magnitude"

  - match: {hits.total: 3}

  - match: {hits.hits.0._id: "1"}
  - close_to: {hits.hits.0._score: {value: 429.6021, error: 0.01}}

  - match: {hits.hits.1._id: "3"}
  - close_to: {hits.hits.1._score: {value: 192.6447, error: 0.01}}

  - match: {hits.hits.2._id: "2"}
  - close_to: {hits.hits.2._score: {value: 186.34454, error: 0.01}}
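The expected magnitudes in this test can be reproduced by hand. The sketch below is a plain Euclidean-norm helper with a hypothetical class name, not the code behind `doc['vector'].magnitude`. It assumes the `vector` field here holds the same three documents as the kNN test setup.

```java
// Hypothetical sketch: Euclidean magnitude of a float vector.
final class MagnitudeSketch {

    /** Returns sqrt(sum of squares), accumulating in double for accuracy. */
    static double magnitude(float[] v) {
        double sum = 0;
        for (float x : v) {
            sum += (double) x * x;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        float[] cow = {230.0f, 300.33f, -34.8988f, 15.555f, -200.0f};
        float[] moose = {-0.5f, 100.0f, -13f, 14.8f, -156.0f};
        float[] rabbit = {0.5f, 111.3f, -13.0f, 14.8f, -156.0f};
        System.out.println("doc 1: " + magnitude(cow));
        System.out.println("doc 2: " + magnitude(moose));
        System.out.println("doc 3: " + magnitude(rabbit));
    }
}
```

The results land within the test's 0.01 tolerance of 429.6021, 186.34454 and 192.6447. Sorted in descending order they give the hit order 1, 3, 2 that the test asserts.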
@@ -0,0 +1,253 @@
setup:
  - skip:
      version: ' - 7.99.99'
      reason: 'kNN search added in 8.0'
  - do:
      indices.create:
        index: test
        body:
          mappings:
            properties:
              vector:
                type: dense_vector
                dims: 5
                index: true
                similarity: cosine
              normalized_vector:
                type: dense_vector
                dims: 5
                index: true
                similarity: cosine
              end_normalized:
                type: dense_vector
                dims: 5
                index: true
                similarity: cosine
              first_normalized:
                type: dense_vector
                dims: 5
                index: true
                similarity: cosine
              middle_normalized:
                type: dense_vector
                dims: 5
                index: true
                similarity: cosine


  - do:
      index:
        index: test
        id: "1"
        body:
          name: cow.jpg
          vector: [230.0, 300.33, -34.8988, 15.555, -200.0]
          middle_normalized: [230.0, 300.33, -34.8988, 15.555, -200.0]
          normalized_vector: [0.5353791, 0.6990887, -0.08123516, 0.03620792, -0.46554706]
          end_normalized: [230.0, 300.33, -34.8988, 15.555, -200.0]
          first_normalized: [0.5353791, 0.6990887, -0.08123516, 0.03620792, -0.46554706]

  - do:
      index:
        index: test
        id: "2"
        body:
          name: moose.jpg
          vector: [-0.5, 100.0, -13, 14.8, -156.0]
          first_normalized: [-0.5, 100.0, -13, 14.8, -156.0]
          normalized_vector: [-0.0026832016, 0.53664035, -0.06976324, 0.07942277, -0.8371589]
          middle_normalized: [-0.0026832016, 0.53664035, -0.06976324, 0.07942277, -0.8371589]
          end_normalized: [-0.0026832016, 0.53664035, -0.06976324, 0.07942277, -0.8371589]

  - do:
      index:
        index: test
        id: "3"
        body:
          name: rabbit.jpg
          vector: [0.5, 111.3, -13.0, 14.8, -156.0]
          first_normalized: [0.5, 111.3, -13.0, 14.8, -156.0]
          middle_normalized: [0.5, 111.3, -13.0, 14.8, -156.0]
          normalized_vector: [0.0025954517, 0.5777475, -0.06748174, 0.076825365, -0.8097809]
          end_normalized: [0.0025954517, 0.5777475, -0.06748174, 0.076825365, -0.8097809]

  - do:
      indices.refresh: {}
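The setup mixes raw and already-normalized vectors across fields: `first_normalized`, `middle_normalized` and `end_normalized` hold a unit vector for the first document, for the second and third documents, and for the last two documents respectively. The sketch below shows how the unit-length values relate to the raw ones. The class and method names and the tolerance are hypothetical, chosen only for illustration.

```java
// Hypothetical sketch: normalize a vector to unit length and detect
// vectors that are already (approximately) unit length.
final class NormalizeSketch {

    /** Sum of squared components, accumulated in double. */
    static double squaredMagnitude(float[] v) {
        double sum = 0;
        for (float x : v) {
            sum += (double) x * x;
        }
        return sum;
    }

    /** Returns a new vector scaled to unit length; the input is left unchanged. */
    static float[] normalize(float[] v) {
        double magnitude = Math.sqrt(squaredMagnitude(v));
        float[] out = new float[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = (float) (v[i] / magnitude);
        }
        return out;
    }

    /** True when the squared magnitude is within eps of 1. */
    static boolean isUnitLength(float[] v, double eps) {
        return Math.abs(squaredMagnitude(v) - 1.0) <= eps;
    }

    public static void main(String[] args) {
        float[] raw = {230.0f, 300.33f, -34.8988f, 15.555f, -200.0f};
        float[] unit = normalize(raw);
        System.out.println(java.util.Arrays.toString(unit));
        System.out.println("raw is unit length:  " + isUnitLength(raw, 1e-4));
        System.out.println("unit is unit length: " + isUnitLength(unit, 1e-4));
    }
}
```

Normalizing the `cow.jpg` vector this way reproduces the `normalized_vector` values indexed for document "1" to within float rounding.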

---
"kNN search only regular query":
  - skip:
      version: ' - 8.3.99'
      reason: 'kNN added to search endpoint in 8.4'
      features: close_to
  - do:
      search:
        index: test
        body:
          fields: [ "name" ]
          knn:
            field: vector
            query_vector: [-0.5, 90.0, -10, 14.8, -156.0]
            k: 3
            num_candidates: 3

  - match: {hits.hits.0._id: "2"}
  - close_to: {hits.hits.0._score: {value: 0.999405, error: 0.0001}}

  - match: {hits.hits.1._id: "3"}
  - close_to: {hits.hits.1._score: {value: 0.9976501, error: 0.0001}}

  - do:
      search:
        index: test
        body:
          fields: [ "name" ]
          knn:
            field: normalized_vector
            query_vector: [-0.5, 90.0, -10, 14.8, -156.0]
            k: 3
            num_candidates: 3

  - match: {hits.hits.0._id: "2"}
  - close_to: {hits.hits.0._score: {value: 0.999405, error: 0.0001}}

  - match: {hits.hits.1._id: "3"}
  - close_to: {hits.hits.1._score: {value: 0.9976501, error: 0.0001}}

  - do:
      search:
        index: test
        body:
          fields: [ "name" ]
          knn:
            field: first_normalized
            query_vector: [-0.5, 90.0, -10, 14.8, -156.0]
            k: 3
            num_candidates: 3

  - match: {hits.hits.0._id: "2"}
  - close_to: {hits.hits.0._score: {value: 0.999405, error: 0.0001}}

  - match: {hits.hits.1._id: "3"}
  - close_to: {hits.hits.1._score: {value: 0.9976501, error: 0.0001}}

  - do:
      search:
        index: test
        body:
          fields: [ "name" ]
          knn:
            field: middle_normalized
            query_vector: [-0.5, 90.0, -10, 14.8, -156.0]
            k: 3
            num_candidates: 3

  - match: {hits.hits.0._id: "2"}
  - close_to: {hits.hits.0._score: {value: 0.999405, error: 0.0001}}

  - match: {hits.hits.1._id: "3"}
  - close_to: {hits.hits.1._score: {value: 0.9976501, error: 0.0001}}

  - do:
      search:
        index: test
        body:
          fields: [ "name" ]
          knn:
            field: end_normalized
            query_vector: [-0.5, 90.0, -10, 14.8, -156.0]
            k: 3
            num_candidates: 3

  - match: {hits.hits.0._id: "2"}
  - close_to: {hits.hits.0._score: {value: 0.999405, error: 0.0001}}

  - match: {hits.hits.1._id: "3"}
  - close_to: {hits.hits.1._score: {value: 0.9976501, error: 0.0001}}

  # With a normalized query vector, all should be the same

  - do:
      search:
        index: test
        body:
          fields: [ "name" ]
          knn:
            field: vector
            query_vector: [-0.0027626718, 0.4972809, -0.055253435, 0.081775084, -0.86195356]
            k: 3
            num_candidates: 3

  - match: {hits.hits.0._id: "2"}
  - close_to: {hits.hits.0._score: {value: 0.999405, error: 0.0001}}

  - match: {hits.hits.1._id: "3"}
  - close_to: {hits.hits.1._score: {value: 0.9976501, error: 0.0001}}

  - do:
      search:
        index: test
        body:
          fields: [ "name" ]
          knn:
            field: normalized_vector
            query_vector: [-0.0027626718, 0.4972809, -0.055253435, 0.081775084, -0.86195356]
            k: 3
            num_candidates: 3

  - match: {hits.hits.0._id: "2"}
  - close_to: {hits.hits.0._score: {value: 0.999405, error: 0.0001}}

  - match: {hits.hits.1._id: "3"}
  - close_to: {hits.hits.1._score: {value: 0.9976501, error: 0.0001}}

  - do:
      search:
        index: test
        body:
          fields: [ "name" ]
          knn:
            field: first_normalized
            query_vector: [-0.0027626718, 0.4972809, -0.055253435, 0.081775084, -0.86195356]
            k: 3
            num_candidates: 3

  - match: {hits.hits.0._id: "2"}
  - close_to: {hits.hits.0._score: {value: 0.999405, error: 0.0001}}

  - match: {hits.hits.1._id: "3"}
  - close_to: {hits.hits.1._score: {value: 0.9976501, error: 0.0001}}

  - do:
      search:
        index: test
        body:
          fields: [ "name" ]
          knn:
            field: middle_normalized
            query_vector: [-0.0027626718, 0.4972809, -0.055253435, 0.081775084, -0.86195356]
            k: 3
            num_candidates: 3

  - match: {hits.hits.0._id: "2"}
  - close_to: {hits.hits.0._score: {value: 0.999405, error: 0.0001}}

  - match: {hits.hits.1._id: "3"}
  - close_to: {hits.hits.1._score: {value: 0.9976501, error: 0.0001}}

  - do:
      search:
        index: test
        body:
          fields: [ "name" ]
          knn:
            field: end_normalized
            query_vector: [-0.0027626718, 0.4972809, -0.055253435, 0.081775084, -0.86195356]
            k: 3
            num_candidates: 3

  - match: {hits.hits.0._id: "2"}
  - close_to: {hits.hits.0._score: {value: 0.999405, error: 0.0001}}

  - match: {hits.hits.1._id: "3"}
  - close_to: {hits.hits.1._score: {value: 0.9976501, error: 0.0001}}
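Every search in this test expects the same two scores because cosine similarity ignores vector length: once both sides are unit length, a plain dot product gives the same value as the full cosine formula. The sketch below illustrates that equivalence with hypothetical class and method names. The `(1 + cosine) / 2` mapping to `_score` follows the documented scoring for `cosine` on `dense_vector` fields. How the PR wires this into Lucene is not shown in this excerpt, so treat the rest as an assumption.

```java
// Hypothetical sketch: full cosine vs. dot product of pre-normalized vectors.
final class CosineScoreSketch {

    /** Dot product accumulated in double. */
    static double dot(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += (double) a[i] * b[i];
        }
        return sum;
    }

    /** Returns a unit-length copy of v. */
    static float[] normalize(float[] v) {
        double magnitude = Math.sqrt(dot(v, v));
        float[] out = new float[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = (float) (v[i] / magnitude);
        }
        return out;
    }

    /** Full cosine: one dot product plus two magnitude computations. */
    static double cosine(float[] a, float[] b) {
        return dot(a, b) / Math.sqrt(dot(a, a) * dot(b, b));
    }

    /** Maps a cosine in [-1, 1] to a score in [0, 1]. */
    static double score(double cosine) {
        return (1.0 + cosine) / 2.0;
    }

    public static void main(String[] args) {
        float[] query = {-0.5f, 90.0f, -10f, 14.8f, -156.0f};
        float[] moose = {-0.5f, 100.0f, -13f, 14.8f, -156.0f};

        double full = score(cosine(query, moose));
        // Cheaper at query time when the document vector was normalized at index time.
        double fast = score(dot(normalize(query), normalize(moose)));
        System.out.println("full cosine score:    " + full);
        System.out.println("normalized dot score: " + fast);
    }
}
```

Both paths give about 0.999405 for document "2". Normalizing once at index time moves the magnitude work out of the per-comparison path, which is presumably where the speedup in the PR title comes from.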
@@ -87,6 +87,7 @@ private static IndexVersion def(int id, Version luceneVersion) {
    public static final IndexVersion NEW_SPARSE_VECTOR = def(8_500_001, Version.LUCENE_9_7_0);
    public static final IndexVersion SPARSE_VECTOR_IN_FIELD_NAMES_SUPPORT = def(8_500_002, Version.LUCENE_9_7_0);
    public static final IndexVersion UPGRADE_LUCENE_9_8 = def(8_500_003, Version.LUCENE_9_8_0);
    public static final IndexVersion NORMALIZED_VECTOR_COSINE = def(8_500_004, Version.LUCENE_9_8_0);

    /*
     * STOP! READ THIS FIRST! No, really,
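A new index version constant usually exists so that behaviour can differ between indices created before and after a change. The sketch below shows that kind of gate in isolation. It is an assumption about how `NORMALIZED_VECTOR_COSINE` might be used, not code from this PR. The class and the `storesNormalizedCosine` helper are hypothetical, and only the id `8_500_004` comes from the diff above.

```java
// Hypothetical sketch: gate the normalized-cosine storage format on the
// version an index was created with. Only the id 8_500_004 is from the diff.
final class VersionGateSketch {

    /** Id of the NORMALIZED_VECTOR_COSINE index version, as defined in the diff. */
    static final int NORMALIZED_VECTOR_COSINE_ID = 8_500_004;

    /**
     * Assumed rule: indices created on or after the new version store
     * unit-length vectors plus magnitude; older indices keep raw vectors.
     */
    static boolean storesNormalizedCosine(int indexCreatedVersionId) {
        return indexCreatedVersionId >= NORMALIZED_VECTOR_COSINE_ID;
    }

    public static void main(String[] args) {
        System.out.println(storesNormalizedCosine(8_500_003)); // index created before the change
        System.out.println(storesNormalizedCosine(8_500_004)); // index created with the change
    }
}
```

Gating on the creation version would keep existing indices readable in their old format while new indices pick up the faster path.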