# spaCy Transformers Tutorial

Example extracted from:
    - https://explosion.ai/blog/spacy-transformers
    - https://github.com/explosion/spacy-transformers

In [6]:
import spacy
import torch
import numpy
from numpy.testing import assert_almost_equal

In [7]:
is_using_gpu = spacy.prefer_gpu()
if is_using_gpu:
    torch.set_default_tensor_type("torch.cuda.FloatTensor")

In [8]:
nlp = spacy.load("en_trf_bertbaseuncased_lg")
doc = nlp("Here is some text to encode.")
assert doc.tensor.shape == (7, 768)  # Always has one row per token
doc._.trf_word_pieces_  # String values of the wordpieces
doc._.trf_word_pieces  # Wordpiece IDs (note: *not* spaCy's hash values!)
doc._.trf_alignment  # Alignment between spaCy tokens and wordpieces
# The raw transformer output has one row per wordpiece.
assert len(doc._.trf_last_hidden_state) == len(doc._.trf_word_pieces)
# To avoid losing information, we calculate the doc.tensor attribute such that
# the sum-pooled vectors match (apart from numeric error)
assert_almost_equal(doc.tensor.sum(axis=0), doc._.trf_last_hidden_state.sum(axis=0), decimal=5)
span = doc[2:4]
# Access the tensor from Span elements (especially helpful for sentences)
assert numpy.array_equal(span.tensor, doc.tensor[2:4])

In [9]:
# .vector and .similarity use the transformer outputs
apple1 = nlp("Apple shares rose on the news.")
apple2 = nlp("Apple sold fewer iPhones this quarter.")
apple3 = nlp("Apple pie is delicious.")
print(apple1[0].similarity(apple2[0]))  # 0.73428553
print(apple1[0].similarity(apple3[0]))  # 0.43365782

0.7342854
0.43365785


# An Example Closer to Our HSBC Application

In [10]:
# Compare the first token of each sentence: "OECD" vs. "Water", then "OECD" vs. "HSBC"
hsbc1 = nlp("OECD is directing financial flows towards all SDGs.")
hsbc2 = nlp("Water in rivers flow from high to low altitute.")
hsbc3 = nlp("HSBC is supporting the flow of finance towards a green transition.")
print(hsbc1[0].similarity(hsbc2[0]))
print(hsbc1[0].similarity(hsbc3[0])) 

0.4038766
0.6132948
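
For an application that has to pick the closest of several candidate sentences, one option is to pool each sentence's token rows into a single vector and rank the candidates by cosine similarity. The sketch below does this with random arrays in place of `doc.tensor`. The helper names (`sentence_vector`, `rank_by_similarity`) and the choice of sum-pooling are assumptions made for illustration, not a description of how the library computes `doc.vector`.

```python
import numpy as np

def sentence_vector(token_rows):
    """Sum-pool a (n_tokens, width) array into one vector (pooling choice is an assumption)."""
    return np.asarray(token_rows, dtype=np.float64).sum(axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_by_similarity(reference_rows, candidates):
    """Return (name, score) pairs sorted from most to least similar to the reference."""
    ref = sentence_vector(reference_rows)
    scored = [(name, cosine(ref, sentence_vector(rows))) for name, rows in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Random stand-ins for doc.tensor; real inputs would be e.g. hsbc1.tensor.
rng = np.random.default_rng(0)
reference = rng.normal(size=(9, 8))
candidates = {
    "near_copy": reference + 0.01 * rng.normal(size=(9, 8)),  # almost identical rows
    "unrelated": rng.normal(size=(11, 8)),                    # independent rows
}

for name, score in rank_by_similarity(reference, candidates):
    print(name, round(score, 4))
```

With real documents the arrays passed in would have one row per spaCy token and 768 columns for this model.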


# Important Aspect

The most important features are the raw outputs of the transformer, which can be accessed at doc._.trf_last_hidden_state (the attribute used in the cell above). This attribute gives you a tensor with one row per wordpiece token. The doc.tensor attribute gives you one row per spaCy token, which is useful if you're working on token-level tasks such as part-of-speech tagging or spelling correction. We've taken care to calculate an alignment between the models' various wordpiece tokenization schemes and spaCy's linguistically-motivated tokenization, with a weighting scheme that ensures that no information is lost.
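
To make the "no information is lost" property concrete, here is a minimal sketch of one weighting scheme that preserves the column sums. Each wordpiece row is split evenly among the tokens it is aligned to, and each token row is the sum of its shares. This is an illustration under stated assumptions, not the library's actual implementation: `pool_wordpieces` is a hypothetical helper, the alignment is a hand-written list of wordpiece indices per token, and every wordpiece is assumed to be aligned to at least one token.

```python
import numpy as np

def pool_wordpieces(wp_rows, alignment):
    """Build one row per token from wordpiece rows.

    alignment[t] lists the wordpiece indices for token t. A wordpiece shared by
    k tokens contributes 1/k of its row to each, so column sums are preserved.
    """
    wp_rows = np.asarray(wp_rows, dtype=np.float64)
    # How many tokens each wordpiece is aligned to.
    counts = np.zeros(len(wp_rows), dtype=np.int64)
    for wp_indices in alignment:
        for i in wp_indices:
            counts[i] += 1
    token_rows = np.zeros((len(alignment), wp_rows.shape[1]))
    for t, wp_indices in enumerate(alignment):
        for i in wp_indices:
            token_rows[t] += wp_rows[i] / counts[i]
    return token_rows

# Toy data: 5 wordpieces of width 4, aligned to 3 tokens (values are made up).
rng = np.random.default_rng(0)
wp_rows = rng.normal(size=(5, 4))
alignment = [[0, 1], [1, 2], [3, 4]]  # wordpiece 1 is shared by tokens 0 and 1

token_rows = pool_wordpieces(wp_rows, alignment)
# Same check as in the notebook: the sum-pooled vectors match up to numeric error.
np.testing.assert_almost_equal(token_rows.sum(axis=0), wp_rows.sum(axis=0), decimal=5)
print(token_rows.shape)
```

A wordpiece that is aligned to no token would simply be dropped by this sketch, and the sums would then no longer match.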

In [11]:
doc.tensor

array([[-0.48589775, -0.4306911 , -0.23172405, ..., -0.80624217,
         0.05815674, -0.14307418],
       [-0.10500746, -0.6504953 , -0.03901236, ..., -0.22964466,
        -0.37916154,  0.7867269 ],
       [ 0.07593302, -0.4535252 ,  0.24927792, ...,  0.28133318,
        -0.4593692 ,  0.8306134 ],
       ...,
       [ 0.854046  ,  0.04648603,  0.0145665 , ...,  0.26930487,
         0.07758062,  0.57686764],
       [ 0.8994547 ,  0.66806537,  0.2916338 , ..., -0.6893833 ,
        -0.829394  ,  0.5464675 ],
       [ 0.4891531 ,  0.2633897 , -0.57394785, ...,  0.21986733,
        -0.3543096 , -0.43486243]], dtype=float32)

In [14]:
len(doc.tensor)

7

In [15]:
doc._.trf_last_hidden_state.shape  # raw transformer output: one row per wordpiece