## Similarities
Most models attach a vector to each token (from tok2vec, or from the transformer in the _trf models).
These act as word embeddings, so words with similar meaning should compare as similar.

This assists tasks like calculating similarity of larger chunks of text as well.
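
As a rough sketch of what is being computed (assuming the default behaviour, where a span or doc vector is the plain average of its token vectors and similarity is cosine similarity, as the code further down also assumes):

$$
\mathbf{v}_{\text{span}} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{v}_i
\qquad\qquad
\text{sim}(a, b) = \frac{\mathbf{v}_a \cdot \mathbf{v}_b}{\lVert \mathbf{v}_a \rVert \, \lVert \mathbf{v}_b \rVert}
$$

Here $\mathbf{v}_i$ are the vectors of the $n$ tokens in the span, and $a$ and $b$ are any two tokens, spans or docs. The result lies between -1 and 1, with values near 1 meaning the vectors point in almost the same direction.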





semantic similarity: compare to best paragraph
so also split better

maybe separate repo
...with a note that 


split docs
  bwb, cvdr    XML based
  rechtspraak  okayish
  




In [None]:
import spacy

english_lg  = spacy.load('en_core_web_lg')   

In [5]:
# You can generally compare any token/span (spans/sentence/docs will act as their average) using .similarity(). 
# 
# There are some fundamental limitations to this to keep in mind, like 
# - that it does not consider ordering, just words' presence
# - how volatile the meaning of short sentences may be
# - how function words dilute larger-span vectors (and might make them compare well for non-contentful reasons)
# - 'static vectors' basically means a word has the same vector in all contexts.

for one, other in (
        ('ducks are great', 'cats are nice'),
        ('ducks are great', 'goats are cool'),
        ('ducks are great', 'Forks are spoons'),
        ('ducks are great', 'Forks and spoons'),
        ('ducks are great', 'Forks and spoons and knives'),
        ('ducks are great', 'Forks and spoons are knives'),
        ('ducks and blah and blah and blah', 'blue and bleh and bleh and bleh'),
        ('ducks',           'blue'),
    ):
    sim = english_lg( one ).similarity(english_lg( other ))
    print( "%.3f   %50s %50s"%( sim, one, other ))

# These three-word phrases are actually quite contrived - real sentences tend to fall in a narrower range of scores

print()
# these are sentences from two different wikipedia articles; we are trying to see if it can tell a difference.
text = """Because it is smaller, the Moon has less gravity than Earth (only 1/6 of the amount on Earth). 
So if a person weighs 60 kilograms on Earth, the person would only weigh 10 kilograms on the moon. 
But even though the Moon's gravity is weaker than the Earth's gravity, it is still there. 
If a person dropped a ball while standing on the moon, it would still fall down. However, it would fall much more slowly.
A person who jumped as high as possible on the moon would jump higher than on Earth, but still fall back to the ground.
Rome ceased to be the capital from the time of the division. 
In 286, the capital of the Western Roman Empire became Mediolanum (now Milan). 
In 402, the capital was again moved, this time to Ravenna.
In AD 398, Alaric led the Visigoths and began making attacks closer and closer to the capital.
By 410, he had sacked the Rome. In 455, the Vandals captured the city. In 476, the Goths captured the capital """

doc = english_lg( text )
sents = list(doc.sents)
for i in range(len(sents)-1):
    one, other = sents[i], sents[i+1] 
    sim = one.similarity( other )
    print( "%.3f\n  [%s]\n  [%s]"%( sim, one.text.strip(), other.text.strip() ))

0.869                                      ducks are great                                      cats are nice
0.872                                      ducks are great                                     goats are cool
0.796                                      ducks are great                                   Forks are spoons
0.383                                      ducks are great                                   Forks and spoons
0.450                                      ducks are great                        Forks and spoons and knives
0.761                                      ducks are great                        Forks and spoons are knives
0.868                     ducks and blah and blah and blah                    blue and bleh and bleh and bleh
0.218                                                ducks                                               blue

0.732
  [Because it is smaller, the Moon has less gravity than Earth (only 1/6 of the amount on Earth).]
  [So if a per


spaCy has, however, made things a bit complex.
- _sm models don't have word vectors. similarity() still does something, but you should assume the result is extremely basic.
- _md and _lg models tend to have **static word vectors** (for English these may be GloVe vectors), which are great in themselves _but_ mean a word receives the same vector in all contexts
- _trf models produce context-sensitive embeddings (and put those values in a different place)

This means that
* spacy's similarity() call does different things in different models
* you can't always pick out the vectors directly (though you can often get away with it if you stick to one model)


Note that
- scores on spans and docs act as the average of their components
- ...which also means e.g. function words can dilute larger-span vectors (and might make them compare well for non-contentful reasons)
- (...so...) similarity() does not consider ordering, just words' presence
- shorter sentences have minimal and more volatile meaning
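
The points above can be illustrated without loading any model. The following is a minimal sketch using made-up three-dimensional vectors (the values in `TOY_VECTORS` and the helper names `average_vector` and `cosine` are invented for illustration, they are not spaCy API). It assumes span vectors are plain averages compared with cosine similarity.

```python
import math

# Made-up 3-dimensional "word vectors"; real static vectors have hundreds of dimensions.
TOY_VECTORS = {
    "ducks": [0.9, 0.1, 0.0],
    "forks": [0.0, 0.2, 0.9],
    "chase": [0.5, 0.5, 0.1],
    "and":   [0.1, 0.9, 0.1],   # stands in for a function word
}

def average_vector(words):
    """Average the toy vectors of the given words, the way a span vector would be built."""
    vectors = [TOY_VECTORS[w] for w in words]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# 1. Word order is lost: both orderings average to the same vector.
order_sim = cosine(average_vector(["ducks", "chase", "forks"]),
                   average_vector(["forks", "chase", "ducks"]))

# 2. Shared function words pull otherwise unrelated phrases together.
bare_sim   = cosine(average_vector(["ducks"]), average_vector(["forks"]))
padded_sim = cosine(average_vector(["ducks", "and", "and", "and"]),
                    average_vector(["forks", "and", "and", "and"]))

print("reordered: %.3f   bare: %.3f   padded: %.3f" % (order_sim, bare_sim, padded_sim))
```

The padded pair scores far higher than the bare pair even though the only contentful words are unchanged, which is the same effect as the 'ducks and blah and blah and blah' example earlier.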

In [6]:
#!pip3 install spacy
#!python3 -m spacy download en_core_web_trf   # works better, but can be rather slow without GPU properly set up

In [11]:
import spacy

english_trf = spacy.load('en_core_web_trf')

In [15]:
import spacy
import numpy as np
from spacy.language import Language


# We use the @ character to register the following Class definition
# with spaCy under the name 'tensor2attr'.
@Language.factory('tensor2attr')
# We begin by declaring the class name: Tensor2Attr. The name is
# declared using 'class', followed by the name and a colon.
class Tensor2Attr:

    # We continue by defining the first method of the class,
    # __init__(), which is called when this class is used for
    # creating a Python object. Custom components in spaCy
    # require passing two variables to the __init__() method:
    # 'name' and 'nlp'. The variable 'self' refers to any
    # object created using this class!
    def __init__(self, name, nlp):
        # We do not really do anything with this class, so we
        # simply move on using 'pass' when the object is created.
        pass

    # The __call__() method is called whenever some other object
    # is passed to an object representing this class. Since we know
    # that the class is a part of the spaCy pipeline, we already know
    # that it will receive Doc objects from the preceding layers.
    # We use the variable 'doc' to refer to any object received.
    def __call__(self, doc):
        # When an object is received, the class will instantly pass
        # the object forward to the 'add_attributes' method. The
        # reference to self informs Python that the method belongs
        # to this class.
        self.add_attributes(doc)

        # After the 'add_attributes' method finishes, the __call__
        # method returns the object.
        return doc

    # Next, we define the 'add_attributes' method that will modify
    # the incoming Doc object by calling a series of methods.
    def add_attributes(self, doc):
        # spaCy Doc objects have an attribute named 'user_hooks',
        # which allows customising the default attributes of a
        # Doc object, such as 'vector'. We use the 'user_hooks'
        # attribute to replace the attribute 'vector' with the
        # Transformer output, which is retrieved using the
        # 'doc_tensor' method defined below.
        doc.user_hooks['vector'] = self.doc_tensor

        # We then perform the same for both Spans and Tokens that
        # are contained within the Doc object.
        doc.user_span_hooks['vector'] = self.span_tensor
        doc.user_token_hooks['vector'] = self.token_tensor

        # We also replace the 'similarity' method, because the
        # default 'similarity' method looks at the default 'vector'
        # attribute, which is empty! We must first replace the
        # vectors using the 'user_hooks' attribute.
        doc.user_hooks['similarity'] = self.get_similarity
        doc.user_span_hooks['similarity'] = self.get_similarity
        doc.user_token_hooks['similarity'] = self.get_similarity

    # Define a method that takes a Doc object as input and returns
    # Transformer output for the entire Doc.
    def doc_tensor(self, doc):
        # Return Transformer output for the entire Doc. As noted
        # above, this is the last item under the attribute 'tensor'.
        # Average the output along axis 0 to handle batched outputs.
        return doc._.trf_data.tensors[-1].mean(axis=0)

    # Define a method that takes a Span as input and returns the Transformer
    # output.
    def span_tensor(self, span):
        # Get alignment information for Span. This is achieved by using
        # the 'doc' attribute of Span that refers to the Doc that contains
        # this Span. We then use the 'start' and 'end' attributes of a Span
        # to retrieve the alignment information. Finally, we flatten the
        # resulting array to use it for indexing.
        tensor_ix = span.doc._.trf_data.align[span.start: span.end].data.flatten()

        # Fetch Transformer output shape from the final dimension of the output.
        # We do this here to maintain compatibility with different Transformers,
        # which may output tensors of different shape.
        out_dim = span.doc._.trf_data.tensors[0].shape[-1]

        # Get Token tensors under tensors[0]. Reshape batched outputs so that
        # each "row" in the matrix corresponds to a single token. This is needed
        # for matching alignment information under 'tensor_ix' to the Transformer
        # output.
        tensor = span.doc._.trf_data.tensors[0].reshape(-1, out_dim)[tensor_ix]

        # Average vectors along axis 0 ("columns"). This yields a 768-dimensional
        # vector for each spaCy Span.
        return tensor.mean(axis=0)

    # Define a function that takes a Token as input and returns the Transformer
    # output.
    def token_tensor(self, token):
        # Get alignment information for Token; flatten array for indexing.
        # Again, we use the 'doc' attribute of a Token to get the parent Doc,
        # which contains the Transformer output.
        tensor_ix = token.doc._.trf_data.align[token.i].data.flatten()

        # Fetch Transformer output shape from the final dimension of the output.
        # We do this here to maintain compatibility with different Transformers,
        # which may output tensors of different shape.
        out_dim = token.doc._.trf_data.tensors[0].shape[-1]

        # Get Token tensors under tensors[0]. Reshape batched outputs so that
        # each "row" in the matrix corresponds to a single token. This is needed
        # for matching alignment information under 'tensor_ix' to the Transformer
        # output.
        tensor = token.doc._.trf_data.tensors[0].reshape(-1, out_dim)[tensor_ix]

        # Average vectors along axis 0 (columns). This yields a 768-dimensional
        # vector for each spaCy Token.
        return tensor.mean(axis=0)

    # Define a function for calculating cosine similarity between vectors
    def get_similarity(self, doc1, doc2):
        # Calculate and return cosine similarity
        return np.dot(doc1.vector, doc2.vector) / (doc1.vector_norm * doc2.vector_norm)


In [16]:
# While _trf models produce contextual rather than static word embeddings, these are not placed in .vector
# You could fish out the tensors like   toks_vectors, doc_vector = doc._.trf_data.tensors
#   but it's handier to augment spacy so that similarity() works with transformer tensors, via a custom pipeline component

# tensor2attr is not part of basic spacy; it exists because the Tensor2Attr class above registers it (it is also defined in our helpers_spacy)
if not english_trf.has_pipe('tensor2attr'):
    print("adding transformer based similarity")
    english_trf.add_pipe('tensor2attr')

adding transformer based similarity


In [None]:
import spacy.tokens
spacy.tokens.Doc

In [17]:
doc = english_trf("The bank stores investment capital in Paris, the capital of France")
first_capital, second_capital = doc[4], doc[9]
print( first_capital, second_capital, round( first_capital.similarity(second_capital), 2) ) 
# without contextual word embeddings, both capitals would be the same, and their similarity 1.0

capital capital 0.61


In [18]:
# figure out what word it's close to 
# TODO: fix that this is less valid because we are comparing to words that are ripped from context
money  = english_trf("money")[0]
city   = english_trf("city")[0]
lyon   = english_trf("lyon")[0]

print ('              ','money', 'city', 'lyon')
for in_sent, example in ( (' bank_capital ',first_capital), ('city_capital',second_capital) ):
    print( '%15s %4.2f %4.2f %4.2f'%(in_sent, example.similarity(money), example.similarity(city), example.similarity(lyon)) )

               money city lyon
  bank_capital  0.55 0.48 0.32
   city_capital 0.35 0.34 0.15


In [None]:
import numpy

# Collect a sample of vocabulary strings that have a static vector in the _lg model,
# to compare against in the next cell.
vocablist = []
for string in list(english_lg.vocab.strings)[::10]:   # every tenth string, to keep this quick
    if len(string) < 4:                                # skip very short strings
        continue
    if not english_lg.vocab.has_vector(string):        # skip strings without a vector
        continue

    vector = english_lg.vocab.get_vector(string)
    # vector length, the string, and its hash in the string store
    print( numpy.linalg.norm(vector), string, english_lg.vocab.strings[string] )

    vocablist.append( string )
    if len(vocablist) > 1000:
        break

print( len(vocablist) )

#english_lg.vocab.lookups
#for string in 
#print(len(vocablist))
#vocablist[0].prob

In [19]:
#english_md = spacy.load("en_core_web_md")   

allsim1, allsim2 = {}, {}
#vocablist = list(english_lg.vocab.strings)

print( len(vocablist) )

for word in vocablist[::10]:
    isolated = english_trf(word)
    allsim1[word] = first_capital.similarity( isolated )
    allsim2[word] = second_capital.similarity( isolated )

allsim1 = list(allsim1.items())
allsim1.sort(key = lambda x:x[1], reverse=True)
print( allsim1[:10] )

allsim2 = list(allsim2.items())
allsim2.sort(key = lambda x:x[1], reverse=True)
print( allsim2[:10] )





In [None]:
allsim1, allsim2

In [None]:
# Things like 'find similar words within texts' will rely on some variant of 'compare everything to everything'.
# I've not found a spacy way to do such mass comparisons other than calling .similarity() a lot, which is a lot of overhead.
# Since it seems to just be cosine similarity, we can use scipy to do many more comparisons in one go - code for which is in our helper
import wetsuite.helpers.spacy

print( "SENTENCE SIMILARITY" )
# yes, these thresholds are chosen to give good results with this example. Play with them to see how messy it actually is.
for score, one, two in wetsuite.helpers.spacy.similar_sentences(doc,     thresh=0.5, n=5):   
    print( "    %5.2f  %40r  %40r"%(score, one, two) )
    
print( "TOKEN SIMILARITY" )
for score, one, two in wetsuite.helpers.spacy.similar_chunks(doc, 1,0,0, thresh=0.6, n=5):
    print( "    %5.2f  %40s  %40s"%(score, one, two) )

print( "ENTITY AND NOUN CHUNK SIMILARITY" )
for score, one, two in wetsuite.helpers.spacy.similar_chunks(doc, 0,1,1, thresh=0.7, n=5):
    print( "    %5.2f  %40s  %40s"%(score, one, two) )

# It's generally not so useful to compare tokens with phrases from the same document, in that the top similarities will be phrases with their own head/root.
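
To show what 'many comparisons in one go' can look like, here is a minimal sketch. It is not the helper's actual code, and it uses numpy rather than scipy. The names `pairwise_cosine` and `top_pairs` are hypothetical, and the `thresh` and `n` parameters only imitate the calls above. The idea is to normalise every vector once, after which a single matrix product gives all pairwise cosine similarities.

```python
import numpy as np

def pairwise_cosine(vectors):
    """Cosine similarity between every pair of rows of an (n, d) array."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # keep all-zero rows at zero instead of dividing by zero
    unit = vectors / norms
    return unit @ unit.T             # (n, n) matrix of cosine similarities

def top_pairs(vectors, labels, thresh=0.5, n=5):
    """Return up to n (score, label, label) tuples scoring at least thresh, best first."""
    sims = pairwise_cosine(vectors)
    rows, cols = np.triu_indices(len(labels), k=1)   # each unordered pair once, no self-pairs
    scored = [(float(sims[i, j]), labels[i], labels[j])
              for i, j in zip(rows, cols) if sims[i, j] >= thresh]
    return sorted(scored, reverse=True)[:n]

# Stand-in vectors so this runs without a model. With spaCy you could build the
# array from e.g. [sent.vector for sent in doc.sents] and use the sentence texts as labels.
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(top_pairs(toy, ["a", "b", "c"], thresh=0.5))
```

For a few hundred sentences or tokens this is one small matrix multiplication, rather than tens of thousands of separate .similarity() calls.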

In [None]:
# Since the average of a sentence or document includes a lot of function words,
#   direct comparison still works, but is watered down depending on how many of those there are,
#   so you might like to average over only the more contentful words.
# At the same time, spacy prefers its parsed objects immutable, so you would have to work around it.
import numpy
from importlib import reload
reload(helpers_spacy)
for sent in paris.sents:
    print( '-'*80 )
    print( sent )
    sg = helpers_spacy.interesting_words( sent )
    print( sg )
    vpt = helpers_spacy.vector_per_tag(sent, average=True) 
    for tag, ary in vpt.items():
        print( tag, numpy.linalg.norm(ary))
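
One way to work around the immutability is to leave the parsed objects alone and compute your own average over a filtered set of tokens. The sketch below is an assumption about how that could look, not what helpers_spacy does. `content_vector` is a hypothetical name, and it relies only on the `.is_stop`, `.is_punct` and `.vector` attributes that spaCy tokens have. Stand-in objects are used so that it runs without a model.

```python
from types import SimpleNamespace
import numpy as np

def content_vector(tokens):
    """Average the vectors of tokens that are neither stop words nor punctuation.

    Works on anything with .is_stop, .is_punct and .vector, such as spaCy tokens.
    Returns None when no token survives the filter.
    """
    kept = [tok.vector for tok in tokens if not (tok.is_stop or tok.is_punct)]
    if not kept:
        return None
    return np.mean(kept, axis=0)

# Stand-in tokens with made-up vectors; with spaCy, pass a Doc or Span instead.
fake_sentence = [
    SimpleNamespace(text="the",   is_stop=True,  is_punct=False, vector=np.array([0.0, 1.0])),
    SimpleNamespace(text="ducks", is_stop=False, is_punct=False, vector=np.array([1.0, 0.0])),
    SimpleNamespace(text="swim",  is_stop=False, is_punct=False, vector=np.array([0.5, 0.5])),
    SimpleNamespace(text=".",     is_stop=False, is_punct=True,  vector=np.array([0.0, 0.0])),
]
print(content_vector(fake_sentence))
```

The resulting vectors can be compared with cosine similarity as before. Whether stop-word filtering is the right notion of 'contentful' depends on the texts, and filtering on part-of-speech tags is an alternative.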