Vectors.most_similar should always return 1.0 for identical vectors #4506

Closed
ines opened this issue Oct 22, 2019 · 3 comments
Labels
bug Bugs and behaviour differing from documentation feat / vectors Feature: Word vectors and similarity

Comments

@ines
Member

ines commented Oct 22, 2019

We should probably hard-code the workaround for the imprecision, just like we do for the built-in similarity methods.

How to reproduce the behaviour

import numpy
import pytest
from spacy.vectors import Vectors


@pytest.mark.xfail
def test_vectors_most_similar_identical():
    """Test that most similar identical vectors are assigned a score of 1.0."""
    data = numpy.asarray([[4, 2, 2, 2], [4, 2, 2, 2], [1, 1, 1, 1]], dtype="f")
    v = Vectors(data=data, keys=["A", "B", "C"])
    keys, _, scores = v.most_similar(numpy.asarray([[4, 2, 2, 2]], dtype="f"))
    assert scores[0][0] == 1.0  # not 1.0000002
    data = numpy.asarray([[1, 2, 3], [1, 2, 3], [1, 1, 1]], dtype="f")
    v = Vectors(data=data, keys=["A", "B", "C"])
    keys, _, scores = v.most_similar(numpy.asarray([[1, 2, 3]], dtype="f"))
    assert scores[0][0] == 1.0  # not 0.9999999

@ines ines added bug Bugs and behaviour differing from documentation feat / vectors Feature: Word vectors and similarity labels Oct 22, 2019
@svlandeg
Member

Just out of curiosity - why is this a requirement? Shouldn't equality on real values always be tested with some margin?
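
For illustration, a tolerance-based comparison of the kind this question alludes to might look like the sketch below. The helper name `scores_match` and the default tolerance of `1e-6` are assumptions made for this example; neither comes from spaCy.

```python
import numpy


def scores_match(actual, expected, tol=1e-6):
    """Return True if two similarity scores agree within an absolute tolerance.

    Hypothetical helper for illustration only; not part of spaCy.
    """
    # rtol=0 so that only the absolute tolerance `tol` is applied.
    return bool(numpy.isclose(actual, expected, rtol=0.0, atol=tol))


# The two float32 values quoted in the report both pass a tolerance check
# against 1.0, while a clearly different score does not.
near_above = numpy.float32(1.0000002)
near_below = numpy.float32(0.9999999)
print(scores_match(near_above, 1.0), scores_match(near_below, 1.0), scores_match(0.99, 1.0))
```

In a pytest suite, `pytest.approx` serves a similar purpose, e.g. `assert score == pytest.approx(1.0)`.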

@ines
Member Author

ines commented Oct 22, 2019

@svlandeg It's not really a requirement, but when we didn't clip the values for the .similarity methods, people found this pretty confusing (e.g. when nlp("x").similarity(nlp("x")) wasn't 1.0). And since we're doing it this way for other similarity comparisons, we might as well do it for most_similar too.
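
As a rough sketch of the kind of workaround discussed here, the snippet below computes cosine similarities with plain NumPy, then rounds and clips them so that identical vectors score exactly 1.0. The function name `cosine_scores` and the choice of four decimals are assumptions for illustration; this is not claimed to be the code used in spaCy or in the commit that closed this issue. It also assumes every row is non-zero.

```python
import numpy


def cosine_scores(queries, table, decimals=4):
    """Cosine similarity of each query row against each table row.

    Hypothetical helper: rounds, then clips to [-1, 1], to hide float32 noise.
    Assumes no row is the zero vector (otherwise the norm division fails).
    """
    q = queries / numpy.linalg.norm(queries, axis=1, keepdims=True)
    t = table / numpy.linalg.norm(table, axis=1, keepdims=True)
    scores = q @ t.T
    # Rounding maps values such as 0.9999999 or 1.0000002 onto 1.0.
    scores = numpy.around(scores, decimals=decimals)
    # Clipping guards against anything still outside the valid cosine range.
    return numpy.clip(scores, -1.0, 1.0)


data = numpy.asarray([[4, 2, 2, 2], [4, 2, 2, 2], [1, 1, 1, 1]], dtype="f")
query = numpy.asarray([[4, 2, 2, 2]], dtype="f")
scores = cosine_scores(query, data)
print(scores)
```

Clipping alone only fixes scores that overshoot (1.0000002); the rounding step is what pulls an undershoot such as 0.9999999 up to 1.0, at the cost of discarding precision beyond the chosen number of decimals.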

@ines ines closed this as completed in 9489c5f Oct 22, 2019
@lock

lock bot commented Nov 21, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Nov 21, 2019