Add l1norm and l2norm distances for vectors #44116

mayya-sharipova · 2019-07-09T13:54:49Z

Add L1norm - Manhattan distance
Add L2norm - Euclidean distance
relates to #37947

elasticmachine · 2019-07-09T13:55:06Z

Pinging @elastic/es-search

mayya-sharipova · 2019-07-09T13:56:31Z

@cbuescher Can you please review this PR when you have time? I have tried to address all your comments from #40255 here.

cbuescher

Thanks @mayya-sharipova, looks very good already. I left a few minor questions and some asks around testing, nothing that seems to hold up this PR for a long time though.

cbuescher · 2019-07-10T09:52:57Z

docs/reference/query-dsl/script-score-query.asciidoc

@@ -31,8 +30,7 @@ GET /_search
     }
 }
 --------------------------------------------------
-// CONSOLE
-// TEST[setup:twitter]
+// NOTCONSOLE


Why doesn't this work anymore as a tested console snippet? Can it (or the data) be changed so we can keep it tested?

@cbuescher This doesn't work anymore because there can be only one TEST SETUP per doc file. I decided to use TEST SETUP for testing vectors, so I can't use the setup with twitter data for this example and made it NOTCONSOLE.

An alternative could be to create a separate doc file for vector functions. I can do this.

No, that's fine then. I see enough snippets that are tested later. One without testing is okay, I think, if solving this involves too much work.

cbuescher · 2019-07-10T09:57:11Z

docs/reference/query-dsl/script-score-query.asciidoc

+PUT my_index/_doc/1
+{
+  "my_dense_vector": [0.5, 10, 6],
+  "my_sparse_vector": {"2": 1.5, "15" : 2, "50": -1.1, "4545": 1.1}


Just curious for my own sake, can the sparse vector keys also be ids, e.g. "feature1" etc...? Since they are strings here one might think so.

No, they can only be integers. We throw an exception if a vector key is not convertible to an integer.

Thanks, good to know.

cbuescher · 2019-07-10T10:00:27Z

docs/reference/query-dsl/script-score-query.asciidoc

+
+NOTE: Unlike `cosineSimilarity` that represent similarity, `l1norm` and
+`l2norm` shown below represent distances or differences. This means, that
+the more similar are the vectors, the less will be the scores produced by the


Not sure 100% since not a native speaker, but "the more similar the vectors are" sounds better to me here. Probably also "... the lower the scores will be that are produced...", but maybe better double check with a native speaker. Doesn't matter much though.

cbuescher · 2019-07-10T10:19:14Z

x-pack/plugin/src/test/resources/rest-api-spec/test/vectors/15_dense_vector_l1l2.yml

+
+  - match: {hits.hits.1._id: "2"}
+  - gte: {hits.hits.1._score: 12.25}
+  - lte: {hits.hits.1._score: 12.35}


This is quite a wide range that's allowed here; I'm getting 12.3 as the expected value. Do we need that big a margin here? If so, do you know why? Should we document it in case the value deviates so much from the numerically expected value?

cbuescher · 2019-07-10T10:26:00Z

x-pack/plugin/src/test/resources/rest-api-spec/test/vectors/35_sparse_vector_l1l2.yml

+
+  - match: {hits.hits.1._id: "2"}
+  - gte: {hits.hits.1._score: 12.25}
+  - lte: {hits.hits.1._score: 12.35}


Same here as above.

cbuescher · 2019-07-10T10:36:45Z

x-pack/plugin/vectors/src/main/java/org/elasticsearch/xpack/vectors/query/ScoreScriptUtils.java

-            //break vector into two arrays dims and values
-            int n = queryVector.size();
-            queryValues = new double[n];
-            queryDims = new int[n];


Just curious about the deletions here: what's the relation of this to adding the l1/l2 norm? Or is this part of a different change?

@cbuescher One of your comments from the previous PR was that the *Sparse classes should share a common abstract superclass, e.g. for sharing the common code in the constructor. I thought this was a very good point and redesigned all *Sparse classes to extend VectorSparseFunctions so that they share common code in the constructor.

Great, got it, thanks. I just wasn't sure where this moved to; it might have been in a different PR or I missed it here.
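For readers following the refactoring, here is a minimal sketch of what "sharing the constructor code" could look like. It is not the actual `VectorSparseFunctions` source: the class names `SparseQueryBaseSketch` and `L1NormSparseSketch`, the `Map<String, Number>` parameter type, and the exception message are assumptions made for illustration. It also reflects the rule stated earlier in the thread that sparse vector keys must be convertible to integers.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical base class: a stand-in for the shared superclass discussed above,
// not the actual VectorSparseFunctions source.
class SparseQueryBaseSketch {
    final int[] queryDims;       // dimension ids, sorted ascending
    final double[] queryValues;  // queryValues[i] belongs to queryDims[i]

    SparseQueryBaseSketch(Map<String, Number> queryVector) {
        // A TreeMap keeps the dimensions sorted by their integer value.
        TreeMap<Integer, Double> sorted = new TreeMap<>();
        for (Map.Entry<String, Number> entry : queryVector.entrySet()) {
            int dim;
            try {
                dim = Integer.parseInt(entry.getKey());
            } catch (NumberFormatException e) {
                // Assumed message; the thread only says that an exception is thrown.
                throw new IllegalArgumentException(
                    "Vector key [" + entry.getKey() + "] is not convertible to an integer", e);
            }
            sorted.put(dim, entry.getValue().doubleValue());
        }
        // Break the vector into two parallel arrays: dims and values.
        int n = sorted.size();
        queryDims = new int[n];
        queryValues = new double[n];
        int i = 0;
        for (Map.Entry<Integer, Double> entry : sorted.entrySet()) {
            queryDims[i] = entry.getKey();
            queryValues[i] = entry.getValue();
            i++;
        }
    }
}

// A subclass inherits the parsed arrays instead of repeating the constructor logic.
class L1NormSparseSketch extends SparseQueryBaseSketch {
    L1NormSparseSketch(Map<String, Number> queryVector) {
        super(queryVector);
    }
}
```

Sorting the dimensions once in the constructor is a design choice of this sketch; it lets each distance function walk the query and document dimensions in a single pass.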

cbuescher · 2019-07-10T10:59:31Z

...lugin/vectors/src/test/java/org/elasticsearch/xpack/vectors/query/ScoreScriptUtilsTests.java

+
+    public void testSparseVectorFunctionsSpecialCases() {
+        // Query and document vector have missing dimensions that are present in one, and absent in another
+        int[] docVectorDims = {2, 10, 11, 50, 113, 4545};


I know it might be paranoid, but can you add something so the last doc dimension is larger than the biggest query dimension, to trigger the while-loops at the end of the function?
And the other way around: we'd probably need a second case where the last queryVector dim > highest doc dim. Sorry, might be a bit paranoid, but those are code paths I don't see covered otherwise.
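To make the two uncovered code paths concrete, here is a minimal sketch of a merge over two sorted sparse vectors with trailing while-loops. It is an illustration under assumptions, not the code from `ScoreScriptUtils.java`: the class name `SparseDistanceSketch`, the method signatures, and the sample vectors are all hypothetical.

```java
// Hypothetical helper, not the PR's implementation.
final class SparseDistanceSketch {

    // Sums |q - d| (or (q - d)^2 when squared is true) over the union of dimensions.
    // Both dims arrays are assumed to be sorted in ascending order.
    private static double accumulate(int[] qDims, double[] qVals,
                                     int[] dDims, double[] dVals, boolean squared) {
        double sum = 0;
        int q = 0;
        int d = 0;
        // Main loop: runs while both vectors still have dimensions left.
        while (q < qDims.length && d < dDims.length) {
            double diff;
            if (qDims[q] == dDims[d]) {
                diff = qVals[q++] - dVals[d++];   // dimension present in both
            } else if (qDims[q] < dDims[d]) {
                diff = qVals[q++];                // only in the query vector
            } else {
                diff = dVals[d++];                // only in the document vector
            }
            sum += squared ? diff * diff : Math.abs(diff);
        }
        // Trailing loop 1: the query has dims beyond the highest doc dim.
        while (q < qDims.length) {
            double diff = qVals[q++];
            sum += squared ? diff * diff : Math.abs(diff);
        }
        // Trailing loop 2: the doc has dims beyond the highest query dim.
        while (d < dDims.length) {
            double diff = dVals[d++];
            sum += squared ? diff * diff : Math.abs(diff);
        }
        return sum;
    }

    static double l1norm(int[] qDims, double[] qVals, int[] dDims, double[] dVals) {
        return accumulate(qDims, qVals, dDims, dVals, false);
    }

    static double l2norm(int[] qDims, double[] qVals, int[] dDims, double[] dVals) {
        return Math.sqrt(accumulate(qDims, qVals, dDims, dVals, true));
    }
}
```

With query dims `{1, 5}` and doc dims `{5, 9}`, the second trailing loop runs; swapping the two arguments exercises the first one. A test that checks both orderings covers the paths in question.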

- organize vector functions as a separate doc
- increase precision in tests calculations
- add a separate test when sparse doc dims are bigger and less than query vector dims

jtibshirani · 2019-07-10T20:50:19Z

docs/reference/vectors/vector-functions.asciidoc

+--------------------------------------------------
+// CONSOLE
+
+NOTE: Unlike `cosineSimilarity` that represent similarity, `l1norm` and


I had a couple suggestions around this transformation:

We could include a 1 in the denominator to avoid division by 0 when a document vector matches the query exactly: 1 / (1 + l1norm(params.queryVector, doc['my_dense_vector'])).

Would it make sense to use this transformation in the search examples themselves, instead of just including it in a note? We could give a description of it in a callout, as we do for cosineSimilarity and dotProduct. It would be harder to miss that way, and I also think it makes the examples more helpful + realistic, since it seems very unlikely a user would want to score by distance descending.

@jtibshirani Thanks, these are very good suggestions. I have addressed them in the last commit.

Thanks @mayya-sharipova, the new commit you added looks good to me!
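The transformation suggested above can be sketched in plain Java to show why the extra 1 in the denominator matters. This is an illustration only: the class `DistanceScoreSketch` and its methods are hypothetical and are not the Painless functions or the plugin code. The document vector `[0.5, 10, 6]` comes from the docs snippet quoted earlier in the thread; the query vectors are made up.

```java
// Hypothetical helper illustrating the 1 / (1 + distance) transformation.
final class DistanceScoreSketch {

    // Manhattan distance between two dense vectors of equal length.
    static double l1norm(double[] query, double[] doc) {
        requireSameLength(query, doc);
        double sum = 0;
        for (int i = 0; i < query.length; i++) {
            sum += Math.abs(query[i] - doc[i]);
        }
        return sum;
    }

    // Euclidean distance between two dense vectors of equal length.
    static double l2norm(double[] query, double[] doc) {
        requireSameLength(query, doc);
        double sum = 0;
        for (int i = 0; i < query.length; i++) {
            double diff = query[i] - doc[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Maps a distance in [0, inf) to a score in (0, 1]: closer vectors score higher,
    // and an exact match yields 1 instead of a division by zero.
    static double toScore(double distance) {
        return 1.0 / (1.0 + distance);
    }

    private static void requireSameLength(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same number of dimensions");
        }
    }
}
```

For example, scoring the document vector against itself gives a distance of 0 and a score of 1, while a query of `[1.5, 10, 6]` is at l1 distance 1 and scores 0.5.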

cbuescher · 2019-07-10T20:51:51Z

...lugin/vectors/src/test/java/org/elasticsearch/xpack/vectors/query/ScoreScriptUtilsTests.java


        // test l1norm
        L1NormSparse l1Norm = new L1NormSparse(queryVector);
        double result3 = l1Norm.l1normSparse(dvs);
-        assertEquals("l1normSparse result is not equal to the expected value!", 517.18, result3, 0.1);
+        assertEquals("l1normSparse result is not equal to the expected value!", 517.184, result3, 0.001);


I would have expected the l1 norm to change by 11.5 when adding the "4546" doc vector dimension with a value of 11.5f.
Could you re-check the expected value? I think the while-loop in the function might miss the last "biggest" value.
Same for the l2 norm, by the way.

Sorry, my bad. I now see you took out the "11" and moved it to "4546" with the same value. All good.

cbuescher

LGTM

mayya-sharipova · 2019-07-11T14:10:01Z

@elasticmachine run elasticsearch-ci/2

Add l1norm and l2norm distances for vectors

eb48e80

mayya-sharipova added the :Search Relevance/Ranking Scoring, rescoring, rank evaluation. label Jul 9, 2019

mayya-sharipova added >enhancement v7.4.0 v8.0.0 labels Jul 9, 2019

mayya-sharipova requested a review from cbuescher July 9, 2019 13:55

cbuescher requested changes Jul 10, 2019

Address Christoph's feedback

670159f

jtibshirani reviewed Jul 10, 2019

cbuescher reviewed Jul 10, 2019

cbuescher approved these changes Jul 10, 2019

mayya-sharipova added 2 commits July 11, 2019 09:15

Made examples more realistic

13b6e77

Merge remote-tracking branch 'upstream/master' into l1l2-vector-dista…

2874dbf

…nces

mayya-sharipova merged commit 16747f8 into elastic:master Jul 11, 2019

mayya-sharipova added a commit that referenced this pull request Jul 11, 2019

Add l1norm and l2norm distances for vectors (#44116)

32cb47b

mayya-sharipova deleted the l1l2-vector-distances branch July 11, 2019 18:33

mayya-sharipova mentioned this pull request Jul 18, 2019

Add more distance metrics for vector fields #40473

Closed

codebrain mentioned this pull request Oct 14, 2019

7.4 meta ticket elastic/elasticsearch-net#4133

Closed

56 tasks

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

jtibshirani added :Search Relevance/Vectors Vector search and removed :Search Relevance/Ranking Scoring, rescoring, rank evaluation. labels Jul 21, 2022
