-
Notifications
You must be signed in to change notification settings - Fork 24.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add l1norm and l2norm distances for vectors #44116
Add l1norm and l2norm distances for vectors #44116
Conversation
Add L1norm - Manhattan distance Add L2norm - Euclidean distance relates to elastic#37947
Pinging @elastic/es-search |
@cbuescher Can you please review this PR when you have time. I have tried to address here all your comments from #40255 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mayya-sharipova, looks very good already. I left a few minor questions and some asks around testing, nothing that seems to hold up this PR for a long time though.
@@ -31,8 +30,7 @@ GET /_search | |||
} | |||
} | |||
-------------------------------------------------- | |||
// CONSOLE | |||
// TEST[setup:twitter] | |||
// NOTCONSOLE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why doesn't this work anymore as a tested console snippet? Can it (or the data) be changed so we can keep it tested?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cbuescher this doesn't work anymore as there could only one TEST SETUP per a doc file. I have decided to use TEST SETUP for testing vectors, and unable to use setup with twitter data for this example, so I made it NOTCONSOLE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
an alternative could be to create a separate doc file for vector functions. I can do this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, thats fine then. I see enough snippets that are tested later. One without testing if okay I think if the way to solve this involves too much work.
PUT my_index/_doc/1 | ||
{ | ||
"my_dense_vector": [0.5, 10, 6], | ||
"my_sparse_vector": {"2": 1.5, "15" : 2, "50": -1.1, "4545": 1.1} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious for my own sake, can the sparse vector keys also be ids, e.g. "feature1" etc...? Since they are strings here one might think so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, they can only be integers. We throw an exception if a vector key is not convertible to an integer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, good to know.
|
||
NOTE: Unlike `cosineSimilarity` that represent similarity, `l1norm` and | ||
`l2norm` shown below represent distances or differences. This means, that | ||
the more similar are the vectors, the less will be the scores produced by the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure 100% since not a native speaker, but "the more similar the vectors are" sounds better to me here. Probably also "... the lower the scores will be that are produced...", but maybe better double check with a native speaker. Doesn't matter much though.
|
||
- match: {hits.hits.1._id: "2"} | ||
- gte: {hits.hits.1._score: 12.25} | ||
- lte: {hits.hits.1._score: 12.35} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is quite some range thats allowed here, I'm getting 12.3 as the expected value. Do we need that big of a margin here? If so, do you know why? Should we document the fact in case the value deviates from the numerically expected value so much?
|
||
- match: {hits.hits.1._id: "2"} | ||
- gte: {hits.hits.1._score: 12.25} | ||
- lte: {hits.hits.1._score: 12.35} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here as above
//break vector into two arrays dims and values | ||
int n = queryVector.size(); | ||
queryValues = new double[n]; | ||
queryDims = new int[n]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious about the deletions here, whats the relation of this to adding the l1/l2 norm? Or is this part of a different change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cbuescher One of your comments from the previous PR was for *Sparse classes to share a common abstract superclass for e.g. sharing the common code in the constructor. I thought this is a very good point and redesigned all *Sparse classes to extend from VectorSparseFunctions
to share a common code in the constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, got it, thanks. I just wasn't sure where this moved to, might have been in a differen PR or I missed it here..
|
||
public void testSparseVectorFunctionsSpecialCases() { | ||
// Query and document vector have missing dimensions that are present in one, and absent in another | ||
int[] docVectorDims = {2, 10, 11, 50, 113, 4545}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know if might be paranoid, but can you add something so the last doc dimension is larger than the biggest query dimention to trigger the while-loops at the end of the function?
And the other way around, we'd probably need a second case where the last queryVector dim > highest doc dim. Sorry, might be a bit paranoid but those are code paths I don't see covered otherwise.
- organize vector functions as a separate doc - increase precision in tests calculations - add a separate test when sparse doc dims are bigger and less than query vector dims
-------------------------------------------------- | ||
// CONSOLE | ||
|
||
NOTE: Unlike `cosineSimilarity` that represent similarity, `l1norm` and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a couple suggestions around this transformation:
- We could include a 1 in the denominator to avoid division by 0 when a document vector matches the query exactly:
1 / (1 + l1norm(params.queryVector, doc['my_dense_vector']))
. - Would it make sense to use this transformation in the search examples themselves, instead of just including it in a note? We could give a description of it in a callout, as we do for
cosineSimilarity
anddotProduct
. It would be harder to miss that way, and I also think it makes the examples more helpful + realistic, since it seems very unlikely a user would want to score by distance descending.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jtibshirani Thanks these are very good suggestions, I have addressed them in the last commit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mayya-sharipova, the new commit you added looks good to me!
|
||
// test l1norm | ||
L1NormSparse l1Norm = new L1NormSparse(queryVector); | ||
double result3 = l1Norm.l1normSparse(dvs); | ||
assertEquals("l1normSparse result is not equal to the expected value!", 517.18, result3, 0.1); | ||
assertEquals("l1normSparse result is not equal to the expected value!", 517.184, result3, 0.001); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have expeced the l1 norm to change by 11.5 when adding the "4546" doc vector dimentions of 11.5f?
Could you re-check the expected value? I think the while-loop in the function might miss the last "biggest" value?
Same for the l2-norm also btw.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, my bad. I now saw you took out the "11" and moved it to "4546" with the same value. All good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@elasticmachine run elasticsearch-ci/2 |
Add L1norm - Manhattan distance Add L2norm - Euclidean distance relates to #37947
Add L1norm - Manhattan distance
Add L2norm - Euclidean distance
relates to #37947