# Transformers and spaCy

`spaCy` has a wrapper for the `transformers` library, called `spacy-transformers`. The wrapper does not support every task (for instance, `<MASK>` prediction, since it is considered a Natural Language Generation task), but many are supported, including fine-tuning.
    
More information can be found on spaCy's webpage and in [this Explosion blog post](https://explosion.ai/blog/spacy-transformers).
    
    
# Word embeddings
Word embeddings are typically real-valued vector representations of words learned from text (often through a language modelling task). You can regard them as one of the main components of the linguistic knowledge a model has acquired after training.
    
    
You can easily access a language model's word embeddings.
    
### Static embeddings

In [23]:
import spacy

nlp = spacy.load('en_core_web_md') #spaCy sm models do not ship with word embeddings

doc1 = nlp('Apple shares rose on the news.')
doc2 = nlp('Apple pie is delicious.')

print('First thirty dimensions of the vector for "Apple" in doc1:\n',
      doc1[0].vector[:30])
print('\n First thirty dimensions of the vector for "Apple" in doc2:\n',
      doc2[0].vector[:30])
print('\n Number of dimensions/shape of the word embeddings in en_core_web_md:\n', doc1[0].vector.shape)

print('\n Are the two vectors identical?', (doc1[0].vector == doc2[0].vector).all())

First thirty dimensions of the vector for "Apple" in doc1:
 [-0.6334     0.18981   -0.53544   -0.52658   -0.30001    0.30559
 -0.49303    0.14636    0.012273   0.96802    0.0040354  0.25234
 -0.29864   -0.014646  -0.24905   -0.67125   -0.053366   0.59426
 -0.068034   0.10315    0.66759    0.024617  -0.37548    0.52557
  0.054449  -0.36748   -0.28013    0.090898  -0.025687  -0.5947   ]

 First thirty dimensions of the vector for "Apple" in doc2:
 [-0.6334     0.18981   -0.53544   -0.52658   -0.30001    0.30559
 -0.49303    0.14636    0.012273   0.96802    0.0040354  0.25234
 -0.29864   -0.014646  -0.24905   -0.67125   -0.053366   0.59426
 -0.068034   0.10315    0.66759    0.024617  -0.37548    0.52557
  0.054449  -0.36748   -0.28013    0.090898  -0.025687  -0.5947   ]

 Number of dimensions/shape of the word embeddings in en_core_web_md:
 (300,)

 Are the two vectors identical? True
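
The result `True` follows from how static embeddings work: each word form is mapped to exactly one vector, whatever sentence it occurs in. A minimal, spaCy-free sketch of that idea is shown here. The table `toy_vectors`, its values, and the helper `static_vector` are hypothetical and exist only for illustration.

```python
# Hypothetical toy lookup table; the real en_core_web_md vectors have 300 dimensions.
toy_vectors = {
    "Apple": [0.1, -0.4, 0.7],
    "shares": [0.9, 0.2, -0.3],
    "pie": [-0.5, 0.6, 0.1],
}


def static_vector(word):
    """Look up a word's vector; the surrounding sentence plays no role."""
    return toy_vectors[word]


sentence_1 = ["Apple", "shares"]
sentence_2 = ["Apple", "pie"]

vec_1 = static_vector(sentence_1[0])
vec_2 = static_vector(sentence_2[0])

# Same word form, same vector, whatever the neighbouring words are.
print(vec_1 == vec_2)
```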


<div class="alert alert-block alert-info"> <b>Discussion.</b> What do you expect to be the similarity ranking of the words "cat", "snake", "car" and "random" with respect to the word "dog"?</div>

<div class="alert alert-block alert-success"> <b>Activity.</b> Use the word embeddings of spaCy's en_core_web_md to retrieve the similarities of the words above.
</div>



<div class="alert alert-block alert-success"> <b>Activity.</b> Use the word embeddings of spaCy's en_core_web_md to find:<br>
    
  1. The word with the smallest cosine similarity to the word "apple";
  2. The word with the largest cosine similarity to the word "apple";
  3. The word with the second largest cosine similarity to the word "apple".
</div>
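
The activities above rely on cosine similarity, which compares the direction of two vectors and ignores their length. The following is a plain-Python sketch of the formula. The function name `cosine_similarity` and all vectors are made up for illustration and are not taken from any spaCy model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors, for illustration only.
u = [1.0, 2.0, 0.0]
parallel = [2.0, 4.0, 0.0]     # same direction as u, twice the length
orthogonal = [-2.0, 1.0, 0.0]  # at a right angle to u
opposite = [-1.0, -2.0, 0.0]   # points the other way

print(cosine_similarity(u, parallel))
print(cosine_similarity(u, orthogonal))
print(cosine_similarity(u, opposite))
```

Up to floating-point rounding, the three values are the maximum (1), zero, and the minimum (-1) of the measure.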

***    

spaCy's `similarity` method is not restricted to tokens; it can also be applied to documents and spans. The vector of a document or span is the average of the vectors of the tokens it contains.

In [41]:
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

# Similarity of two documents
print(doc1, "<->", doc2, doc1.similarity(doc2))
# Similarity of a span and a token
salty_fries = doc1[2:4]
burgers = doc1[5]
print(salty_fries, "<->", burgers, salty_fries.similarity(burgers))

I like salty fries and hamburgers. <-> Fast food tastes very good. 0.8015960629076846
salty fries <-> hamburgers 0.5733411312103271
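
The averaging behind span and document vectors can be sketched without spaCy. The helper `mean_vector` and the two-dimensional vectors in `toy` are hypothetical; real token vectors in en_core_web_md have 300 dimensions.

```python
def mean_vector(vectors):
    """Average a list of equal-length vectors dimension by dimension."""
    n = len(vectors)
    return [sum(dims) / n for dims in zip(*vectors)]


# Hypothetical 2-dimensional token vectors, for illustration only.
toy = {"salty": [1.0, 3.0], "fries": [3.0, 1.0]}

# The span vector is the per-dimension mean of its token vectors.
span_vector = mean_vector([toy["salty"], toy["fries"]])
print(span_vector)  # [2.0, 2.0]
```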


<div class="alert alert-block alert-info"> <b>Discussion.</b> Is the "mean embedding" of the tokens within a document/span a reasonable representation for a document/span? What are its disadvantages? Can you think of alternatives?</div>

See, among others, __[Nouns are Vectors, Adjectives are Matrices](https://aclanthology.org/D10-1115.pdf)__ for compositional vector representations built from static vectors.

***

### Contextualized embeddings

In [50]:
nlp_trf = spacy.load('en_core_web_trf')

doc1 = nlp_trf("Apple shares rose on the news.")

In [32]:
# Number of spaCy tokens in doc1
len(doc1)  # why 7 and not 6?

7

In [33]:
# The "tensors" attribute of a TransformerData object contains
# a Python list with the vector representations generated by the
# transformer for the individual tokens and for the entire doc1 object.

# Check the shape of the first item in the list.
# This is the output for the individual tokens.
doc1._.trf_data.tensors[0].shape  # 1 batch of 9 vectors with 768 dimensions each

(1, 9, 768)

You can think of a **tensor** as a bundle of numerical objects; or as a multi-dimensional array; or as a higher-dimensional matrix.
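
To make the idea of a shape concrete, here is a small sketch that builds a tensor from nested Python lists and reads off its shape. The helper `shape` is hypothetical and assumes regularly nested, non-empty lists; NumPy arrays expose the same information through their `shape` attribute.

```python
def shape(nested):
    """Return the shape of a regularly nested, non-empty list as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# 1 batch containing 3 vectors with 4 dimensions each (all zeros).
toy_tensor = [[[0.0] * 4 for _ in range(3)]]

print(shape(toy_tensor))        # (1, 3, 4)
print(shape(toy_tensor[0]))     # (3, 4)
print(shape(toy_tensor[0][1]))  # (4,)
```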

In [34]:
# Check the shape of the second item in the list. 
#This is the output for the entire document
doc1._.trf_data.tensors[1].shape

(1, 768)

In [37]:
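# Vectors of the first (and only) batch. Note that the slice [:10]
# selects up to ten rows (one per transformer token), not ten dimensions,
# so all 9 vectors are returned with their full 768 dimensions.
doc1._.trf_data.tensors[0][0][:10] 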

array([[-0.02419195,  0.00347404, -0.82839984, ...,  0.3204193 ,
         0.6912029 , -0.03307062],
       [-0.6120319 ,  0.68185633, -0.32543862, ...,  1.0084194 ,
        -0.27445263,  0.17246294],
       [-1.1976653 ,  1.0952203 , -0.51286656, ...,  1.0510603 ,
        -1.0514488 , -0.6002351 ],
       ...,
       [-0.24873887, -0.60771364, -0.25459698, ..., -0.6928086 ,
        -0.34879273,  0.36271068],
       [-0.03674727, -0.06959243, -1.5592436 , ...,  0.25073642,
         1.0069203 ,  0.31602612],
       [-0.0369583 , -0.0689571 , -1.5576652 , ...,  0.2519927 ,
         1.0056583 ,  0.31408337]], dtype=float32)

This looks very similar to the embeddings we had before; only that we're now talking about tensors and accessing them is a little more cumbersome. 

However, we had established that there were **7** tokens in `doc1` but we now have **9** embeddings.

In [38]:
# Access the Transformer tokens under the key 'input_texts'
doc1._.trf_data.tokens['input_texts']

[['<s>', 'Apple', 'Ġshares', 'Ġrose', 'Ġon', 'Ġthe', 'Ġnews', '.', '</s>']]

In [40]:
doc_split = nlp_trf('Do you really need a representation for words like miscommunication or wordly?')
doc_split._.trf_data.tokens['input_texts']

[['<s>',
  'Do',
  'Ġyou',
  'Ġreally',
  'Ġneed',
  'Ġa',
  'Ġrepresentation',
  'Ġfor',
  'Ġwords',
  'Ġlike',
  'Ġmis',
  'communication',
  'Ġor',
  'Ġword',
  'ly',
  '?',
  '</s>']]

Sub-word representations are a standard solution for representing out-of-vocabulary words. See also __[fastText](https://fasttext.cc/)__ for static sub-word representations.
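
The splitting of rare words into known pieces can be illustrated with a deliberately simplified sketch. It uses greedy longest-match-first segmentation over a tiny hypothetical vocabulary (`toy_vocab`). This is an assumption made for illustration and is not the tokenizer used above: the `Ġ` markers in the output suggest a byte-level BPE tokenizer, which builds pieces from learned merge rules instead.

```python
def segment(word, vocab):
    """Greedy longest-match-first split of `word` into pieces from `vocab`."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No piece matches at this position, not even a single character.
            return ["<unk>"]
    return pieces


# Hypothetical sub-word vocabulary, for illustration only.
toy_vocab = {"mis", "communication", "word", "ly", "s"}

print(segment("miscommunication", toy_vocab))  # ['mis', 'communication']
print(segment("wordly", toy_vocab))            # ['word', 'ly']
print(segment("xyz", toy_vocab))               # ['<unk>']
```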

Since spaCy tokens may not correspond 1:1 to the transformer's vectors, we first need to align them.

In [55]:
# Indices of the transformer vectors aligned with token 9 (miscommunication)
doc_split[9], doc_split._.trf_data.align[9].data

(miscommunication,
 array([[10],
        [11]], dtype=int32))

In [56]:
doc1[0], doc1._.trf_data.align[0].data

(Apple, array([[1]], dtype=int32))
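
Once the alignment is known, a token split into several pieces still needs a single vector. The component in the next section averages the aligned vectors; the sketch below shows that step in plain Python. The rows, the `alignment` mapping, and the helper `token_vector` are hypothetical toy data, not spaCy objects.

```python
# Hypothetical 2-dimensional transformer output, one row per sub-word piece.
wordpiece_rows = [
    [0.0, 0.0],  # row 0: <s>
    [1.0, 1.0],  # row 1: 'like'
    [2.0, 0.0],  # row 2: 'mis'
    [4.0, 2.0],  # row 3: 'communication'
    [0.0, 0.0],  # row 4: </s>
]

# Token index -> indices of the rows that the token is aligned with.
alignment = {0: [1], 1: [2, 3]}


def token_vector(token_index):
    """Average the rows aligned with one token."""
    rows = [wordpiece_rows[i] for i in alignment[token_index]]
    return [sum(dims) / len(rows) for dims in zip(*rows)]


print(token_vector(0))  # [1.0, 1.0]  (one piece: the row itself)
print(token_vector(1))  # [3.0, 1.0]  (mean of rows 2 and 3)
```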

### Adding a class to the spaCy pipeline

There is no good native way to compare contextualized word embeddings in spaCy. However, one of the main advantages of this package is how easy it is to modify or add components to the pipeline.

In [57]:
# Class from: https://applied-language-technology.mooc.fi

# Import the Language object under the 'language' module in spaCy,
# and NumPy for calculating cosine similarity.
from spacy.language import Language
import numpy as np

# We use the @ character to register the following class definition
# with spaCy under the name 'tensor2attr'. We then declare the class
# name: Tensor2Attr. The name is declared using 'class', followed by
# the name and a colon.
@Language.factory('tensor2attr')
class Tensor2Attr:
    
    # We continue by defining the first method of the class, 
    # __init__(), which is called when this class is used for 
    # creating a Python object. Custom components in spaCy 
    # require passing two variables to the __init__() method:
    # 'name' and 'nlp'. The variable 'self' refers to any
    # object created using this class!
    def __init__(self, name, nlp):
        
        # We do not really do anything with this class, so we
        # simply move on using 'pass' when the object is created.
        pass

    # The __call__() method is called whenever some other object
    # is passed to an object representing this class. Since we know
    # that the class is a part of the spaCy pipeline, we already know
    # that it will receive Doc objects from the preceding layers.
    # We use the variable 'doc' to refer to any object received.
    def __call__(self, doc):
        
        # When an object is received, the class will instantly pass
        # the object forward to the 'add_attributes' method. The
        # reference to self informs Python that the method belongs
        # to this class.
        self.add_attributes(doc)
        
        # After the 'add_attributes' method finishes, the __call__
        # method returns the object.
        return doc
    
    # Next, we define the 'add_attributes' method that will modify
    # the incoming Doc object by calling a series of methods.
    def add_attributes(self, doc):
        
        # spaCy Doc objects have an attribute named 'user_hooks',
        # which allows customising the default attributes of a 
        # Doc object, such as 'vector'. We use the 'user_hooks'
        # attribute to replace the attribute 'vector' with the 
        # Transformer output, which is retrieved using the 
        # 'doc_tensor' method defined below.
        doc.user_hooks['vector'] = self.doc_tensor
        
        # We then perform the same for both Spans and Tokens that
        # are contained within the Doc object.
        doc.user_span_hooks['vector'] = self.span_tensor
        doc.user_token_hooks['vector'] = self.token_tensor
        
        # We also replace the 'similarity' method, because the 
        # default 'similarity' method looks at the default 'vector'
        # attribute, which is empty! We must first replace the
        # vectors using the 'user_hooks' attribute.
        doc.user_hooks['similarity'] = self.get_similarity
        doc.user_span_hooks['similarity'] = self.get_similarity
        doc.user_token_hooks['similarity'] = self.get_similarity
    
    # Define a method that takes a Doc object as input and returns 
    # Transformer output for the entire Doc.
    def doc_tensor(self, doc):
        
        # Return Transformer output for the entire Doc. As noted
        # above, this is the last item under the attribute 'tensors'.
        # Average the output along axis 0 to handle batched outputs.
        return doc._.trf_data.tensors[-1].mean(axis=0)
    
    # Define a method that takes a Span as input and returns the Transformer 
    # output.
    def span_tensor(self, span):
        
        # Get alignment information for Span. This is achieved by using
        # the 'doc' attribute of Span that refers to the Doc that contains
        # this Span. We then use the 'start' and 'end' attributes of a Span
        # to retrieve the alignment information. Finally, we flatten the
        # resulting array to use it for indexing.
        tensor_ix = span.doc._.trf_data.align[span.start: span.end].data.flatten()
        
        # Fetch Transformer output shape from the final dimension of the output.
        # We do this here to maintain compatibility with different Transformers,
        # which may output tensors of different shape.
        out_dim = span.doc._.trf_data.tensors[0].shape[-1]
        
        # Get Token tensors under tensors[0]. Reshape batched outputs so that
        # each "row" in the matrix corresponds to a single token. This is needed
        # for matching alignment information under 'tensor_ix' to the Transformer
        # output.
        tensor = span.doc._.trf_data.tensors[0].reshape(-1, out_dim)[tensor_ix]
        
        # Average vectors along axis 0 ("columns"). This yields a 768-dimensional
        # vector for each spaCy Span.
        return tensor.mean(axis=0)
    
    # Define a method that takes a Token as input and returns the Transformer
    # output.
    def token_tensor(self, token):
        
        # Get alignment information for Token; flatten array for indexing.
        # Again, we use the 'doc' attribute of a Token to get the parent Doc,
        # which contains the Transformer output.
        tensor_ix = token.doc._.trf_data.align[token.i].data.flatten()
        
        # Fetch Transformer output shape from the final dimension of the output.
        # We do this here to maintain compatibility with different Transformers,
        # which may output tensors of different shape.
        out_dim = token.doc._.trf_data.tensors[0].shape[-1]
        
        # Get Token tensors under tensors[0]. Reshape batched outputs so that
        # each "row" in the matrix corresponds to a single token. This is needed
        # for matching alignment information under 'tensor_ix' to the Transformer
        # output.
        tensor = token.doc._.trf_data.tensors[0].reshape(-1, out_dim)[tensor_ix]

        # Average vectors along axis 0 (columns). This yields a 768-dimensional
        # vector for each spaCy Token.
        return tensor.mean(axis=0)
    
    # Define a function for calculating cosine similarity between vectors
    def get_similarity(self, doc1, doc2):
        
        # Calculate and return cosine similarity
        return np.dot(doc1.vector, doc2.vector) / (doc1.vector_norm * doc2.vector_norm)
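
As the comments in `span_tensor` and `token_tensor` explain, the output is reshaped with `reshape(-1, out_dim)` so that every row is one transformer token and the alignment indices can address rows directly. Below is a plain-Python sketch of that flattening step; the values in `batched` and `tensor_ix` are made up for illustration.

```python
# Hypothetical batched output: 2 batches of 2 vectors with 2 dimensions each.
batched = [
    [[1.0, 1.0], [2.0, 2.0]],  # batch 0
    [[3.0, 3.0], [5.0, 7.0]],  # batch 1
]

# Plain-Python counterpart of reshape(-1, out_dim): concatenate the batches
# so that every vector becomes one row of a single matrix.
flat = [vector for batch in batched for vector in batch]

# Alignment indices can now address rows regardless of batch boundaries.
tensor_ix = [2, 3]
rows = [flat[i] for i in tensor_ix]
pooled = [sum(dims) / len(rows) for dims in zip(*rows)]

print(len(flat))  # 4
print(pooled)     # [4.0, 5.0]
```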

In [58]:
# Add the component named 'tensor2attr', which we registered using the
# @Language decorator and its 'factory' method to the pipeline.
nlp_trf.add_pipe('tensor2attr')

# Call the 'pipeline' attribute to examine the pipeline
nlp_trf.pipeline

[('transformer',
  <spacy_transformers.pipeline_component.Transformer at 0x7f9ec419c100>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x7f9ec419cac0>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7f9ec81f3890>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x7f9e98c0fac0>),
 ('lemmatizer',
  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7f9ec7861c00>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7f9e9088cdd0>),
 ('tensor2attr', <__main__.Tensor2Attr at 0x7f9e90143b80>)]

In [69]:
doc1 = nlp_trf("Apple shares rose on the news.")
doc2 = nlp_trf('I really enjoyed this delicious apple.')

# doc1[0] is the first token of doc1: 'Apple'
print(doc1[0])  # Apple

# doc2[5] is the sixth token of doc2: 'apple'
print(doc2[5])  # apple

print(doc1[0].similarity(doc2[5])) 

Apple
apple
0.23767997


In [70]:
doc1s = nlp("Apple shares rose on the news.")
doc2s = nlp('I really enjoyed this delicious apple.')

# doc1s[0] is the first token of doc1s: 'Apple'
print(doc1s[0])  # Apple

# doc2s[5] is the sixth token of doc2s: 'apple'
print(doc2s[5])  # apple

print(doc1s[0].similarity(doc2s[5])) 

Apple
apple
1.0000001192092896


In [71]:
doc3 = nlp_trf('The new iPhone by Apple is expected by September 2024')
doc4 = nlp_trf('I liked the red apple the most')

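# doc3[4] is the fifth token of doc3: 'Apple' (the company)
print(doc3[4])  # Apple

# doc4[4] is the fifth token of doc4: 'apple' (the fruit)
print(doc4[4])  # apple

# Company sense vs. company sense (doc1 and doc3)
print(doc1[0].similarity(doc3[4])) 
# Fruit sense vs. fruit sense (doc2 and doc4)
print(doc2[5].similarity(doc4[4])) 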

Apple
apple
0.46383777
0.89162207


<div class="alert alert-block alert-success"> <b>Activity.</b> Use the word embeddings of spaCy's en_core_web_md to answer the following questions:<br>
    
  1. What is the average similarity between a noun and a determiner?
  2. What is the average similarity between a noun and a verb?
</div>

<div class="alert alert-block alert-success"> <b>Activity.</b> Use spaCy's en_core_web_trf to answer the following questions:<br>
    
  1. What is the average similarity between a noun and a determiner?
  2. What is the average similarity between a noun and a verb?
</div>