Add l1norm and l2norm distances for vectors #44116

Merged · mayya-sharipova merged 4 commits into elastic:master from the l1l2-vector-distances branch on Jul 11, 2019

Conversation

mayya-sharipova (Contributor)

Add L1norm - Manhattan distance
Add L2norm - Euclidean distance
relates to #37947

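For context, here is a minimal sketch of how the new functions are meant to be called from a `script_score` query. The surrounding query shape and the query vector values are illustrative only; the index and field names follow the `my_dense_vector` example quoted further down, and `l2norm` is assumed to take the same arguments as the `l1norm` call discussed later in this thread:

--------------------------------------------------
GET my_index/_search
{
  "query": {
    "script_score": {
      "query": {"match_all": {}},
      "script": {
        "source": "l2norm(params.queryVector, doc['my_dense_vector'])",
        "params": {
          "queryVector": [4, 3.4, -0.2]
        }
      }
    }
  }
}
--------------------------------------------------

Both functions return a distance, so a smaller score means a more similar vector; see the NOTE and the transformation discussed in the review below for how to turn that into a descending-similarity ranking.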
mayya-sharipova added the :Search Relevance/Ranking (Scoring, rescoring, rank evaluation) label on Jul 9, 2019
elasticmachine (Collaborator)

Pinging @elastic/es-search

mayya-sharipova (Contributor Author)

@cbuescher Can you please review this PR when you have time? I have tried to address all your comments from #40255 here.

cbuescher (Member) left a comment

Thanks @mayya-sharipova, this looks very good already. I left a few minor questions and some asks around testing; nothing that should hold up this PR for long, though.

@@ -31,8 +30,7 @@ GET /_search
 }
 }
 --------------------------------------------------
-// CONSOLE
-// TEST[setup:twitter]
+// NOTCONSOLE
cbuescher (Member)

Why doesn't this work anymore as a tested console snippet? Can it (or the data) be changed so we can keep it tested?

mayya-sharipova (Contributor Author)

@cbuescher This doesn't work anymore because there can be only one TEST SETUP per doc file. I decided to use the TEST SETUP for testing vectors, and since I can't also use the twitter setup for this example, I made it NOTCONSOLE.

mayya-sharipova (Contributor Author)

An alternative could be to create a separate doc file for vector functions. I can do this.

cbuescher (Member)

No, that's fine then. I see enough snippets that are tested later. One without testing is okay, I think, if the way to solve this involves too much work.

PUT my_index/_doc/1
{
"my_dense_vector": [0.5, 10, 6],
"my_sparse_vector": {"2": 1.5, "15" : 2, "50": -1.1, "4545": 1.1}
cbuescher (Member)

Just curious for my own sake: can the sparse vector keys also be IDs, e.g. "feature1"? Since they are strings here, one might think so.

mayya-sharipova (Contributor Author)

No, they can only be integers. We throw an exception if a vector key is not convertible to an integer.
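To make that concrete, a small hypothetical example reusing the `my_sparse_vector` field from the snippet above: every key is a string that parses as an integer, so this document is accepted, while a key such as "feature1" would hit the exception mentioned here.

--------------------------------------------------
PUT my_index/_doc/2
{
  "my_sparse_vector": {"2": 1.5, "15": 2, "4545": 1.1}
}
--------------------------------------------------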

cbuescher (Member)

Thanks, good to know.


NOTE: Unlike `cosineSimilarity` that represent similarity, `l1norm` and
`l2norm` shown below represent distances or differences. This means, that
the more similar are the vectors, the less will be the scores produced by the
cbuescher (Member)

Not 100% sure since I'm not a native speaker, but "the more similar the vectors are" sounds better to me here. Probably also "... the lower the scores will be that are produced ...", but maybe double-check with a native speaker. It doesn't matter much though.


- match: {hits.hits.1._id: "2"}
- gte: {hits.hits.1._score: 12.25}
- lte: {hits.hits.1._score: 12.35}
cbuescher (Member)

This is quite a wide range that's allowed here; I'm getting 12.3 as the expected value. Do we need that big a margin here? If so, do you know why? Should we document it if the value deviates this much from the numerically expected value?


- match: {hits.hits.1._id: "2"}
- gte: {hits.hits.1._score: 12.25}
- lte: {hits.hits.1._score: 12.35}
cbuescher (Member)

Same here as above.

//break vector into two arrays dims and values
int n = queryVector.size();
queryValues = new double[n];
queryDims = new int[n];
cbuescher (Member)

Just curious about the deletions here: what's the relation of this to adding the l1/l2 norm? Or is this part of a different change?

mayya-sharipova (Contributor Author)

@cbuescher One of your comments on the previous PR was for the *Sparse classes to share a common abstract superclass, e.g. to share the common code in the constructor. I thought this was a very good point and redesigned all *Sparse classes to extend VectorSparseFunctions so they share the common constructor code.

cbuescher (Member)

Great, got it, thanks. I just wasn't sure where this had moved to; it might have been in a different PR, or I missed it here.


public void testSparseVectorFunctionsSpecialCases() {
// Query and document vector have missing dimensions that are present in one, and absent in another
int[] docVectorDims = {2, 10, 11, 50, 113, 4545};
cbuescher (Member)

I know it might be paranoid, but can you add something so that the last doc dimension is larger than the biggest query dimension, to trigger the while-loops at the end of the function?
And the other way around: we'd probably need a second case where the last queryVector dim > highest doc dim. Sorry, might be a bit paranoid, but those are code paths I don't see covered otherwise.
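For illustration (hypothetical values, not the ones used in the test): a dimension present in only one vector still contributes |value - 0| to the distance, and a trailing dimension beyond the other vector's largest one is only reached by the while-loops after the shared merge loop, which is why both orderings need a case.

--------------------------------------------------
{
  "case_doc_dim_largest":   {"query": {"2": 0.5, "15": 2},     "doc": {"2": 1.5, "50": -1.1, "4545": 1.1}},
  "case_query_dim_largest": {"query": {"2": 0.5, "4546": 1.0}, "doc": {"2": 1.5, "50": -1.1}}
}
--------------------------------------------------

In the first case the trailing doc dimensions ("50", "4545") are only consumed after the query is exhausted (l1norm = 1.0 + 2 + 1.1 + 1.1 = 5.2); in the second, the trailing query dimension ("4546") is only consumed after the doc is exhausted (l1norm = 1.0 + 1.1 + 1.0 = 3.1).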

- organize vector functions as a separate doc
- increase precision in tests calculations
- add a separate test when sparse doc dims
are bigger and less than query vector dims
--------------------------------------------------
// CONSOLE

NOTE: Unlike `cosineSimilarity` that represent similarity, `l1norm` and
jtibshirani (Contributor) · Jul 10, 2019

I had a couple of suggestions around this transformation:

  • We could include a 1 in the denominator to avoid division by 0 when a document vector matches the query exactly: 1 / (1 + l1norm(params.queryVector, doc['my_dense_vector'])) (see the sketch after this list).
  • Would it make sense to use this transformation in the search examples themselves, instead of just including it in a note? We could give a description of it in a callout, as we do for cosineSimilarity and dotProduct. It would be harder to miss that way, and I also think it makes the examples more helpful + realistic, since it seems very unlikely a user would want to score by distance descending.
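A sketch of that first suggestion as a complete query; only the 1 / (1 + l1norm(...)) script source comes from the comment above, while the surrounding `script_score` query shape and the query vector values are illustrative:

--------------------------------------------------
GET my_index/_search
{
  "query": {
    "script_score": {
      "query": {"match_all": {}},
      "script": {
        "source": "1 / (1 + l1norm(params.queryVector, doc['my_dense_vector']))",
        "params": {
          "queryVector": [4, 3.4, -0.2]
        }
      }
    }
  }
}
--------------------------------------------------

With the 1 in the denominator, an exact match scores 1.0 and larger distances shrink toward 0, so more similar documents rank higher.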

mayya-sharipova (Contributor Author)

@jtibshirani Thanks, these are very good suggestions; I have addressed them in the last commit.

jtibshirani (Contributor)

Thanks @mayya-sharipova, the new commit you added looks good to me!


 // test l1norm
 L1NormSparse l1Norm = new L1NormSparse(queryVector);
 double result3 = l1Norm.l1normSparse(dvs);
-assertEquals("l1normSparse result is not equal to the expected value!", 517.18, result3, 0.1);
+assertEquals("l1normSparse result is not equal to the expected value!", 517.184, result3, 0.001);
cbuescher (Member)

I would have expected the l1 norm to change by 11.5 when adding the "4546" doc vector dimension of 11.5f.
Could you re-check the expected value? I think the while-loop in the function might miss the last "biggest" value.
Same for the l2 norm too, btw.

cbuescher (Member)

Sorry, my bad. I see now that you took out the "11" and moved it to "4546" with the same value. All good.

cbuescher (Member) left a comment

LGTM

mayya-sharipova (Contributor Author)

@elasticmachine run elasticsearch-ci/2

mayya-sharipova merged commit 16747f8 into elastic:master on Jul 11, 2019
mayya-sharipova added a commit that referenced this pull request Jul 11, 2019
Add L1norm - Manhattan distance
Add L2norm - Euclidean distance
relates to #37947
mayya-sharipova deleted the l1l2-vector-distances branch on July 11, 2019 at 18:33
jtibshirani added the :Search Relevance/Vectors (Vector search) label and removed the :Search Relevance/Ranking (Scoring, rescoring, rank evaluation) label on Jul 21, 2022