Vectors.most_similar should always return 1.0 for identical vectors #4506

ines · 2019-10-22T16:20:18Z

We should probably hard-code the workaround for floating-point imprecision, just like we do for the built-in similarity methods.

How to reproduce the behaviour

spaCy/spacy/tests/vocab_vectors/test_vectors.py

Lines 144 to 154 in 74a19ae

    
           @pytest.mark.xfail 
        
           def test_vectors_most_similar_identical(): 
        
               """Test that most similar identical vectors are assigned a score of 1.0.""" 
        
               data = numpy.asarray([[4, 2, 2, 2], [4, 2, 2, 2], [1, 1, 1, 1]], dtype="f") 
        
               v = Vectors(data=data, keys=["A", "B", "C"]) 
        
               keys, _, scores = v.most_similar(numpy.asarray([[4, 2, 2, 2]], dtype="f")) 
        
               assert scores[0][0] == 1.0  # not 1.0000002 
        
               data = numpy.asarray([[1, 2, 3], [1, 2, 3], [1, 1, 1]], dtype="f") 
        
               v = Vectors(data=data, keys=["A", "B", "C"]) 
        
               keys, _, scores = v.most_similar(numpy.asarray([[1, 2, 3]], dtype="f")) 
        
               assert scores[0][0] == 1.0  # not 0.9999999

svlandeg · 2019-10-22T17:25:56Z

Just out of curiosity - why is this a requirement? Shouldn't equality on real values always be tested with some margin?

ines · 2019-10-22T18:03:06Z

@svlandeg It's not really a requirement, but when we didn't clip the values for the .similarity methods, people found this pretty confusing (e.g. when nlp("x").similarity(nlp("x")) wasn't 1.0). And since we're already doing it this way for the other similarity comparisons, we might as well do it for most_similar, too.
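For readers following along, here is a minimal sketch of the kind of workaround being discussed, written against plain NumPy. It is not taken from spaCy: the helper name `clipped_cosine_scores` and its `decimals` parameter are assumptions made for illustration, and it assumes that no row is an all-zero vector. Clipping alone only handles scores slightly above 1.0, so the sketch also rounds first to absorb scores slightly below it.

```python
import numpy


def clipped_cosine_scores(queries, data, decimals=4):
    """Cosine similarity of every query row against every data row.

    Hypothetical helper for illustration only; not spaCy's implementation.
    Assumes every row has a non-zero norm.
    """
    queries = numpy.asarray(queries, dtype="f")
    data = numpy.asarray(data, dtype="f")
    # Normalise rows to unit length so the dot product equals the cosine.
    q_unit = queries / numpy.linalg.norm(queries, axis=1, keepdims=True)
    d_unit = data / numpy.linalg.norm(data, axis=1, keepdims=True)
    scores = q_unit @ d_unit.T
    # Rounding absorbs results that land just below 1.0 (e.g. 0.9999999).
    scores = numpy.around(scores, decimals=decimals)
    # Clipping guards against anything still outside the range [-1, 1].
    return numpy.clip(scores, -1.0, 1.0)


data = numpy.asarray([[1, 2, 3], [1, 2, 3], [1, 1, 1]], dtype="f")
scores = clipped_cosine_scores(numpy.asarray([[1, 2, 3]], dtype="f"), data)
assert scores[0][0] == 1.0
```

The trade-off of rounding is that nearly identical vectors also report exactly 1.0, which seems acceptable when the goal is to avoid surprising users rather than to preserve the last few bits of precision.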

lock · 2019-11-21T18:54:50Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

ines added the labels "bug" (Bugs and behaviour differing from documentation) and "feat / vectors" (Feature: Word vectors and similarity) on Oct 22, 2019

ines closed this as completed in 9489c5f Oct 22, 2019

lock bot locked as resolved and limited conversation to collaborators Nov 21, 2019
