## Similarities

Before we talk about _how_ things are made comparable, let's demonstrate _that_ they are.

You can generally use `.similarity()` to compare any two tokens,
as well as larger units like a span or doc, whose vectors are the average of their parts.

In [27]:
import spacy
english_md  = spacy.load('en_core_web_md')   

for one, other in ( # compare some prepared pairs
        ('ducks are great', 'cats are nice'),
        ('ducks are great', 'goats are cool'),
        ('ducks are great', 'Forks are spoons'),
        ('ducks are great', 'Forks and spoons'),
        ('ducks are great', 'Forks and spoons and knives'),
        ('ducks are great', 'Forks and spoons are knives'),
        ('ducks and blah and blah and blah', 'blue and bleh and bleh and bleh'),
        ('ducks',           'cats'),
        ('ducks',           'blue'),
        ('red',             'blue'),
    ):
    sim = english_md( one ).similarity(english_md( other ))
    print( "%.3f   %50s %50s"%( sim, one, other ))

# These three-word phrases are actually quite contrived - real sentences have a narrower range of 

0.869                                      ducks are great                                      cats are nice
0.872                                      ducks are great                                     goats are cool
0.757                                      ducks are great                                   Forks are spoons
0.307                                      ducks are great                                   Forks and spoons
0.420                                      ducks are great                        Forks and spoons and knives
0.750                                      ducks are great                        Forks and spoons are knives
0.858                     ducks and blah and blah and blah                    blue and bleh and bleh and bleh
0.533                                                ducks                                               cats
0.218                                                ducks                                               blue
0.811     

There are some fundamental limitations to keep in mind:
- assume it does not consider word order, just which words are present
- the shorter the sentence/sequence, the more a single differing word changes its meaning
- the longer the sequence, the less an average of meanings tells you
- at any length, function words dilute the meaning (and might make texts compare well to others for _non-contentful_ reasons)
- similar filler words can make two sentences look more alike than they really are
- 'static vectors' basically means a word has the same vector in all contexts
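
To make the ordering and dilution points concrete, here is a minimal sketch using made-up three-dimensional vectors rather than a real model. The `toy_vectors` table and both helper functions are assumptions for illustration only; the sketch relies on nothing beyond "a phrase vector is the average of its word vectors, compared by cosine similarity".

```python
import numpy as np

# Made-up 3-dimensional "static vectors"; the numbers are invented for illustration.
toy_vectors = {
    "ducks": np.array([0.9, 0.1, 0.0]),
    "forks": np.array([0.0, 0.2, 0.9]),
    "are":   np.array([0.1, 0.9, 0.1]),
    "great": np.array([0.5, 0.5, 0.1]),
}

def phrase_vector(phrase):
    """Average the static vectors of the words in a phrase."""
    return np.mean([toy_vectors[word] for word in phrase.lower().split()], axis=0)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Word order is lost: both orderings average to the same vector.
order_sim = cosine(phrase_vector("ducks are great"), phrase_vector("great are ducks"))

# Shared filler dilutes: the bare content words are much less alike than the padded phrases.
bare_sim = cosine(phrase_vector("ducks"), phrase_vector("forks"))
padded_sim = cosine(phrase_vector("ducks are are are"), phrase_vector("forks are are are"))

print("reordered: %.3f   bare: %.3f   padded: %.3f" % (order_sim, bare_sim, padded_sim))
```

The padded pair scores far higher than the bare pair even though the only contentful words are unchanged, which is the same effect as the `blah`/`bleh` pair in the output above.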

In [20]:
# these are sentences from two different wikipedia articles; we are trying to see if it can tell a difference.
text = """
Because it is smaller, the Moon has less gravity than Earth (only 1/6 of the amount on Earth). 

In 286, the capital of the Western Roman Empire became Mediolanum (now Milan). 

So if a person weighs 60 kilograms on Earth, the person would only weigh 10 kilograms on the moon. 

In 402, the capital was again moved, this time to Ravenna.

But even though the Moon's gravity is weaker than the Earth's gravity, it is still there. 

In 476, the Goths captured the capital.

If a person dropped a ball while standing on the moon, it would still fall down. However, it would fall much more slowly.

In AD 398, Alaric led the Visigoths and began making attacks closer and closer to the capital.

A person who jumped as high as possible on the moon would jump higher than on Earth, but still fall back to the ground.

Its surface gravity is about one sixth of Earth's, about half of that of Mars.

Rome ceased to be the capital from the time of the division. 

The Moon has been an important source of inspiration and knowledge for humans.

In 455, the Vandals captured the city. 
"""

# use the model to split into sentences.
doc = english_md( text )
sents = list(doc.sents) 
# and see which sentences match best
res = []
for i in range(len(sents)-1):
    for j in range(i+1, len(sents)):
        one, other = sents[i], sents[j] 
        sim = one.similarity( other )
        if sim > 0.85:
            res.append( (sim, one, other) )
res.sort(reverse=True)

for sim, one, other in res:
     print( "%.2f\n       [%s]\n       [%s]"%( sim, one.text.strip(), other.text.strip() ))

0.98
       [In 476, the Goths captured the capital.]
       [In 455, the Vandals captured the city.]
0.91
       [In 402, the capital was again moved, this time to Ravenna.]
       [In AD 398, Alaric led the Visigoths and began making attacks closer and closer to the capital.]
0.91
       [In 476, the Goths captured the capital.]
       [In AD 398, Alaric led the Visigoths and began making attacks closer and closer to the capital.]
0.91
       [In 402, the capital was again moved, this time to Ravenna.]
       [In 476, the Goths captured the capital.]
0.90
       [In 402, the capital was again moved, this time to Ravenna.]
       [In 455, the Vandals captured the city.]
0.90
       [In AD 398, Alaric led the Visigoths and began making attacks closer and closer to the capital.]
       [In 455, the Vandals captured the city.]
0.88
       [In 286, the capital of the Western Roman Empire became Mediolanum (now Milan).]
       [In 476, the Goths captured the capital.]
0.88
       [If a per

So that looks good, but such examples are often a little... synthetic. 

Yes, it puts things that are alike together even though they are separated in the text,
but it's not clear what the reason for the difference between a 0.9 and a 0.7 score is,
and below 0.7 it starts to make mistakes even with these fairly distinct topics.

We hid that fact from you by using a higher threshold.

Also, as spacy documentation points out...

## Similarity is subjective

Say, in the context of any text whatsoever, `I like burgers` and `I like pasta` are quite similar,
but in the context of food they should be considered quite dissimilar ([from here](https://spacy.io/usage/spacy-101#vectors-similarity)).

## It's also technically messy

We can assume these are 'word embeddings', which often just means 'vectors, but we spent a bunch of time compacting meaning into relatively few dense dimensions'.


`similarity()` is nice, but you may have seen other uses of this semantic sort of thing, such as:
- semantic comparison
  - "how similar are two documents, by what they talk about?"

- semantic search 
  - querying for documents and allowing for variations in the words used to express ideas -- by querying by such meaningful vectors


For many of these, you will
- possibly train it for your use
- need to fetch the underlying numbers
- ensure they remain the same between all things you compare

All of that takes a bunch more planning, forethought, and design up front.
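
As a rough idea of what "fetch the underlying numbers" leads to, here is a minimal sketch of ranking stored vectors against a query vector. The function name `rank_by_similarity` and the example vectors are made up; in practice the vectors would all have to come from the same model, which is why the sketch refuses to compare vectors of different sizes.

```python
import numpy as np

def rank_by_similarity(query_vec, doc_vecs):
    """Return (index, cosine similarity) pairs for doc_vecs, best match first.

    Assumes no all-zero vectors; those would need separate handling.
    """
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)
    if doc_vecs.shape[1] != query_vec.shape[0]:
        # Vectors from different models often differ in size, and are not comparable anyway.
        raise ValueError("query and document vectors have different dimensions")
    sims = doc_vecs @ query_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
    order = np.argsort(-sims)  # indices sorted by descending similarity
    return [(int(i), float(sims[i])) for i in order]

# Made-up 4-dimensional vectors standing in for stored document vectors.
stored = [
    [1.0, 0.0, 0.0, 0.0],
    [0.7, 0.7, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
ranking = rank_by_similarity([1.0, 0.1, 0.0, 0.0], stored)
print(ranking)
```

With spacy, the stored vectors could for example be each document's `doc.vector`, as long as every document and every query goes through the same pipeline.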


It also takes insight into how spacy in particular works.

The thing that makes it very flexible and modular _also_ means that not all vectors are the same, _at all_.

If you squint, vectors could be said to come in two kinds
- static vectors  - predetermined, always the same for the same word
- dynamic vectors - computed using the context of the sentence the word is used in
Spacy seems to have all combinations:
- neither - models without any vectors exist, though they do not seem common
- static only - 
- static and dynamic - start with static vectors, then adjust for some contextual awareness, e.g. when detecting something as a VERB, adjust for verbiness
- dynamic only - 



Models may
- use static vectors only
  - e.g. with tok2vec
- be context-sensitive only, as the tiny models are, which, yes, is confusing
- be transformers, which work in their own way

As a result, the same text may get different vectors.



There is a general machine-learning question of how you deal with new data.

One way is to avoid needing to at all. 
Say, use one model to analyse all text, and stick with those vectors.

Another is to 





Spacy has, however, made things a bit complex.
- `_sm` models don't have vectors. While `similarity()` still does something, you should assume it is extremely basic.
- `_md` and `_lg` models tend to have **static word vectors**; for English these may be [GloVe vectors](https://nlp.stanford.edu/projects/glove/), great in themselves _but_ a word will receive the same vector in all contexts
- `_trf` models do context-sensitive embeddings (and put those values in a different place)

This means that
- spacy's `similarity()` call does different things in different models
- you can't always pick out the vectors directly (though you can often get away with it if you stick to one model)


Note that
- scores on spans and docs act as the average of their components
- ...which also means that e.g. function words can dilute larger-span vectors (and might make them compare well for non-contentful reasons)
- (...so...) `similarity()` does not consider ordering, just words' presence
- shorter sentences have minimal and more volatile meaning

In [35]:
import wetsuite.helpers.spacy

In [None]:
# Things like 'find similar words within texts' will rely on some variant of 'compare everything to everything'
# I've not found a spacy way to do such mass comparisons other than to call .similarity() a lot, which is a bunch of overhead
# Since it seems to just be cosine similarity, we can use scipy to do a lot more comparisons in one go - code for which is in our helper

print( "SENTENCE SIMILARITY" )
# yes, these thresholds are chosen to give good results with this example. Play with them to see how messy it actually is.
for score, one, two in wetsuite.helpers.spacy.similar_sentences(doc,     thresh=0.5, n=5):   
    print( "    %5.2f  %40r  %40r"%(score, one, two) )
    
print( "TOKEN SIMILARITY" )
for score, one, two in wetsuite.helpers.spacy.similar_chunks(doc, 1,0,0, thresh=0.6, n=5):
    print( "    %5.2f  %40s  %40s"%(score, one, two) )

print( "ENTITY AND NOUN CHUNK SIMILARITY" )
for score, one, two in wetsuite.helpers.spacy.similar_chunks(doc, 0,1,1, thresh=0.7, n=5):
    print( "    %5.2f  %40s  %40s"%(score, one, two) )

# It's generally not so useful to compare tokens with phrases from the same document, in that the top similarities will be phrases with their own head/root.
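
The 'compare everything to everything in one go' idea mentioned in the comments above can be sketched as below. This is not the helper's actual implementation: it uses plain numpy rather than scipy, the function names are made up, and `thresh` and `n` only mirror the parameter names used in the calls above. It assumes similarity is plain cosine similarity.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """All-pairs cosine similarity for an (n, d) array of row vectors."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # all-zero rows (no vector) end up with similarity 0 instead of NaN
    unit = vectors / norms
    return unit @ unit.T

def top_pairs(vectors, thresh=0.5, n=5):
    """Return up to n (score, i, j) tuples with score >= thresh, best first."""
    sims = cosine_similarity_matrix(vectors)
    i_idx, j_idx = np.triu_indices(len(sims), k=1)  # each unordered pair once, no self-pairs
    pairs = [
        (float(sims[i, j]), int(i), int(j))
        for i, j in zip(i_idx, j_idx)
        if sims[i, j] >= thresh
    ]
    pairs.sort(reverse=True)
    return pairs[:n]

# Made-up 2-dimensional vectors; the last one stands in for an item without a vector.
vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.0, 0.0]]
pairs = top_pairs(vecs, thresh=0.5)
print(pairs)
```

With spacy, the input could for example be `numpy.stack([sent.vector for sent in doc.sents])`, with the returned indices pointing back into `list(doc.sents)`.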

In [None]:
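# Since the average of a sentence or document would include a lot of function words,
#   direct comparison would still work, but be watered down depending on how many of those there are


#   so you might like 
# At the same time, spacy prefers its parsed objects immutable, so you would have to work around it
import numpy
from importlib import reload
# ASSUMPTION: helpers_spacy refers to the helper module imported above as wetsuite.helpers.spacy,
#   and `paris` is a Doc that was parsed in an earlier cell
import wetsuite.helpers.spacy as helpers_spacy
reload(helpers_spacy)  # pick up edits to the helper without restarting the kernel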
for sent in paris.sents:
    print( '-'*80 )
    print( sent )
    sg = helpers_spacy.interesting_words( sent )
    print( sg )
    vpt = helpers_spacy.vector_per_tag(sent, average=True) 
    for tag, ary in vpt.items():
        print( tag, numpy.linalg.norm(ary))
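
One way to work around both the dilution and the immutability is to leave the parsed object alone and compute your own average over just the contentful tokens. A minimal sketch, not the helper's implementation: `content_vector` is a made-up name, and the stand-in tokens built by `fake_token` only exist so the example runs without a model. It relies on the `is_stop`, `is_punct`, `has_vector` and `vector` attributes that spacy tokens have.

```python
from types import SimpleNamespace
import numpy as np

def content_vector(tokens):
    """Average the vectors of tokens that are not stop words or punctuation.

    Expects objects with spacy's Token attributes: is_stop, is_punct, has_vector, vector.
    Returns None when no token qualifies.
    """
    vecs = [t.vector for t in tokens if t.has_vector and not t.is_stop and not t.is_punct]
    if not vecs:
        return None
    return np.mean(vecs, axis=0)

def fake_token(text, vector, is_stop=False, is_punct=False):
    """Stand-in for a spacy Token, with made-up vector values."""
    return SimpleNamespace(
        text=text, vector=np.array(vector, dtype=float),
        has_vector=True, is_stop=is_stop, is_punct=is_punct,
    )

sentence = [
    fake_token("the",     [0.1, 0.9], is_stop=True),
    fake_token("moon",    [0.9, 0.1]),
    fake_token("has",     [0.2, 0.8], is_stop=True),
    fake_token("gravity", [0.7, 0.3]),
    fake_token(".",       [0.0, 0.0], is_punct=True),
]
vec = content_vector(sentence)  # average of the "moon" and "gravity" vectors only
print(vec)
```

With a real pipeline you would pass a `Span` or `Doc` directly, since iterating over those yields tokens.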


One of the simpler of word embeddings / vectors
This assists tasks like calculating similarity of larger chunks of text as well.



semantic similarity: Compare to best paragraph
so also split better

maybe separate repo
...with a note that 


split docs
  bwb, cvdr    XML based
  rechtspraak  okayish