## Token Embedding Analysis

One of the key innovations of the Transformer architecture is that a token's embedding is _affected_ by nearby tokens (e.g., other words in the sentence). This notebook demonstrates the effect by taking a single keyword and comparing its output embedding vectors across multiple sentences that contain it.
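
To make the idea concrete before loading a real model, here is a minimal toy sketch of a single attention-style mixing step. It is not DistilBERT's actual computation: the two-dimensional vectors in `STATIC` are made-up numbers, there are no learned projections, and `contextualize` is a hypothetical helper introduced only for illustration. It shows only that the same static vector for "pilot" comes out differently depending on its neighbor.

```python
import math

# Hypothetical 2-d static embeddings (made-up numbers, not from any model).
STATIC = {
    "pilot": [1.0, 0.0],
    "flight": [0.0, 1.0],
    "tv": [0.0, -1.0],
}


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def contextualize(tokens, position):
    """Mix all token vectors, weighted by dot-product similarity
    with the token at `position` (one simplified attention step)."""
    vectors = [STATIC[t] for t in tokens]
    query = vectors[position]
    scores = [sum(q * k for q, k in zip(query, vec)) for vec in vectors]
    weights = softmax(scores)
    dim = len(query)
    return [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]


# Same keyword, two different neighbors.
in_flight = contextualize(["flight", "pilot"], position=1)
in_tv = contextualize(["tv", "pilot"], position=1)
print(in_flight)
print(in_tv)
```

The two printed vectors share their first component but differ in the sign of the second, which is the neighbor's contribution.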

In [2]:
import torch 
import numpy as np 
import torch.nn as F  # note: `F` aliases torch.nn here (not torch.nn.functional); used below for F.CosineSimilarity
from typing import List, Tuple
from transformers import DistilBertTokenizer, DistilBertModel


### Load Tokenizer and Model

In [3]:
model_checkpoint = 'distilbert-base-uncased'

In [4]:
tokenizer = DistilBertTokenizer.from_pretrained(model_checkpoint)

In [5]:
def show_decoding(tokenizer, encoding: torch.Tensor) -> List[Tuple[int, str]]:
    """Show encoded/decoded pairs for example sentence."""
    return [(_enc.item(), tokenizer.decode(_enc)) for _enc in encoding]

In [6]:
model = DistilBertModel.from_pretrained(model_checkpoint)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Keywords and Example Text

We set the keyword and create example sentences that contain it. Notice the different _ways_ in which the `keyword` is used in the example sentences below. At the end of this notebook, we will compare the individual embeddings for the `keyword` from each of the example sentences. We expect a pair of sentences to yield more _similar_ embeddings when the `keyword` is used in a similar way in both. For instance, keep in mind `examples[0]` and `examples[3]`, which both use the keyword in a flight-related sense.

In [7]:
# set keyword
keyword = "pilot"
keyword_id = tokenizer.encode(keyword)[1]
print('keyword id:', keyword_id)

keyword id: 4405


In [8]:
# create examples where the above keyword is used in different forms
examples = [
    "Attention passengers, this is the pilot speaking. Please prepare for landing.", # <- flight-related
    "The tv show was funny but it didn't get approved after the pilot.", 
    "So, are you happy with your honda pilot? How does it handle the rough roads around here?", 
    "Even though the flight was turbulent, I trusted the pilot had everything under control.", # <- flight-related
    "She was the best pilot the commander had ever seen."
]

### Text Encoding

Encoding is the preprocessing step before input to the model: each token is translated to its integer id from the `tokenizer` vocabulary.
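
As a rough sketch of what this step does, the snippet below encodes and pads two short inputs by hand. The mini-vocabulary reuses ids that appear in this notebook's outputs (`[PAD]`, `[CLS]`, `[SEP]`, "the", "pilot"), but `encode` and `pad_batch` are hypothetical helpers, and the real tokenizer also handles lowercasing, punctuation, and subword splitting.

```python
# Mini-vocabulary; the ids match those shown in this notebook's outputs.
vocab = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102, "the": 1996, "pilot": 4405}


def encode(words, vocab):
    """Map words to ids and wrap them in the [CLS] ... [SEP] special tokens."""
    return [vocab["[CLS]"]] + [vocab[w] for w in words] + [vocab["[SEP]"]]


def pad_batch(batch, pad_id):
    """Right-pad every id list to the length of the longest one."""
    width = max(len(ids) for ids in batch)
    return [ids + [pad_id] * (width - len(ids)) for ids in batch]


batch = pad_batch(
    [encode(["the", "pilot"], vocab), encode(["pilot"], vocab)],
    pad_id=vocab["[PAD]"],
)
print(batch)  # [[101, 1996, 4405, 102], [101, 4405, 102, 0]]
```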

In [9]:
encodings = tokenizer(examples, padding=True, return_tensors='pt')
encodings = encodings['input_ids'] # we aren't interested in the attention masks in this case
print(encodings.size())

torch.Size([5, 22])


In [10]:
# view encoded tensors from example text (padding makes all examples of equal length)
encodings[0]

tensor([ 101, 3086, 5467, 1010, 2023, 2003, 1996, 4405, 4092, 1012, 3531, 7374,
        2005, 4899, 1012,  102,    0,    0,    0,    0,    0,    0])

In [11]:
# view the token-by-token decoding of example `3`
# every word in this sentence maps to a single vocabulary entry
show_decoding(tokenizer, encodings[3])

[(101, '[ C L S ]'),
 (2130, 'e v e n'),
 (2295, 't h o u g h'),
 (1996, 't h e'),
 (3462, 'f l i g h t'),
 (2001, 'w a s'),
 (22609, 't u r b u l e n t'),
 (1010, ','),
 (1045, 'i'),
 (9480, 't r u s t e d'),
 (1996, 't h e'),
 (4405, 'p i l o t'),
 (2018, 'h a d'),
 (2673, 'e v e r y t h i n g'),
 (2104, 'u n d e r'),
 (2491, 'c o n t r o l'),
 (1012, '.'),
 (102, '[ S E P ]'),
 (0, '[ P A D ]'),
 (0, '[ P A D ]'),
 (0, '[ P A D ]'),
 (0, '[ P A D ]')]

In [12]:
# get indices for where `keyword` occurs in each encoded vector
keyword_enc_idx = np.where(encodings.numpy() == keyword_id)[1]
keyword_enc_idx

array([ 7, 15,  9, 11,  5])
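
The lookup above relies on the keyword appearing exactly once in every sentence. If a sentence lacked it or repeated it, the index array would no longer line up one-to-one with the rows. A more defensive variant might look like the sketch below, where `keyword_positions` is a hypothetical helper that works on plain lists (e.g., `encodings.tolist()`) and takes the first occurrence in each row.

```python
from typing import List


def keyword_positions(batch_ids: List[List[int]], keyword_id: int) -> List[int]:
    """Return the first position of `keyword_id` in each row.

    Raises ValueError if any row does not contain the keyword.
    """
    positions = []
    for row_number, row in enumerate(batch_ids):
        if keyword_id not in row:
            raise ValueError(f"keyword id {keyword_id} not found in row {row_number}")
        positions.append(row.index(keyword_id))
    return positions


rows = [
    [101, 1996, 4405, 102, 0],     # keyword once
    [101, 4405, 1996, 4405, 102],  # keyword twice: first occurrence wins
]
print(keyword_positions(rows, 4405))  # [2, 1]
```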

### Model Outputs

Run the encoded text examples through a forward pass of the model. The model outputs an embedding for every token in every text example. We then extract the `keyword` embedding from each sentence's output to compare the `keyword` representations and see how they are affected by their _context_ (i.e., neighboring words).

In [13]:
# first dim size should equal len(examples)
outputs = model(encodings)
print('output shape:', outputs[0].size())

output shape: torch.Size([5, 22, 768])


In [14]:
outputs[0].size()

torch.Size([5, 22, 768])

In [15]:
outputs[0]

tensor([[[-0.2736, -0.1543,  0.0635,  ...,  0.0185,  0.4955,  0.3776],
         [ 0.3556,  0.3899,  0.3734,  ..., -0.0779,  0.2457, -0.2127],
         [-0.0223,  0.0153,  0.3945,  ...,  0.0511,  0.2932,  0.0463],
         ...,
         [ 0.3637, -0.1927,  0.4301,  ...,  0.0364,  0.3203, -0.3774],
         [ 0.4107, -0.2260,  0.4248,  ...,  0.0392,  0.3263, -0.4284],
         [ 0.4081, -0.2384,  0.4259,  ...,  0.0639,  0.3257, -0.4584]],

        [[-0.0736, -0.3944,  0.1900,  ..., -0.1812,  0.5740,  0.2573],
         [ 0.0144, -0.7304, -0.4040,  ..., -0.1761,  0.8622, -0.2593],
         [ 0.0136, -0.6067,  0.0610,  ..., -0.2159,  0.7598, -0.2419],
         ...,
         [ 0.2783, -0.3372,  0.3197,  ...,  0.1241,  0.2205, -0.1912],
         [ 0.3623, -0.3474,  0.3242,  ...,  0.1781,  0.2146, -0.2292],
         [ 0.3049, -0.4622,  0.1602,  ...,  0.2063,  0.2096, -0.1351]],

        [[ 0.0841,  0.0536, -0.1024,  ..., -0.1595,  0.4923,  0.1829],
         [ 0.2789, -0.5166,  0.1365,  ...,  0

In [16]:
# extract embeddings for `keyword` in each of the sentence outputs
keyword_embeddings = torch.stack([outputs[0][i][keyword_enc_idx[i]] for i in range(len(examples))])
keyword_embeddings.size()

torch.Size([5, 768])

In [17]:
# example embedding for `keyword` from example text `0` (shortened for print-out)
keyword_embeddings[0][:10]

tensor([ 0.0386, -0.3837,  0.0909,  0.0236,  0.1591,  0.0252, -0.3695,  0.3892,
         0.3110, -0.6666], grad_fn=<SliceBackward0>)

In [18]:
keyword_embeddings[1][:10]

tensor([ 0.3484, -0.5333,  0.0284,  0.2435, -0.2398, -0.2844,  0.2531,  0.2528,
         0.2653, -0.0074], grad_fn=<SliceBackward0>)

### Generate Embedding for Keyword Only

In [19]:
# create single token sentence
base_keyword = f"{keyword}"
base_keyword

'pilot'

In [20]:
# tokenize - first and last `id` will be beginning and end of sentence tokens
base_encoding = tokenizer(base_keyword, return_tensors='pt')
base_encoding = base_encoding['input_ids']

In [21]:
# forward pass - get token embedding
base_output = model(base_encoding)
base_embedding = base_output[0][0][1]

In [22]:
base_embedding[:10]

tensor([ 0.0967, -0.0604, -0.1340, -0.0524,  0.2880,  0.0868, -0.0932, -0.0259,
         0.6041, -0.8347], grad_fn=<SliceBackward0>)

### Cosine Similarity Between Keyword Embeddings 

\*including the `base_embedding`

Based on the examples, we expect the keyword embeddings from sentences that use "pilot" in the flight-related sense to be more similar to each other than to the keyword embeddings from the other sentences.
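
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction rather than magnitude. The pure-Python sketch below spells that out on toy vectors; `cosine_similarity` and `pairwise` are illustrative helpers, not part of any library used in this notebook.

```python
import math


def cosine_similarity(u, v):
    """dot(u, v) / (|u| * |v|); assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise(vectors):
    """Full similarity matrix: entry [i][j] compares vectors i and j."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy vectors: the first two point the same way, the third is orthogonal.
toy = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
for row in pairwise(toy):
    print(row)
```

Vectors that differ only in length score 1, and orthogonal vectors score 0, which is why the diagonal of a pairwise matrix is always 1.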

In [23]:
cos = F.CosineSimilarity(dim=1)

In [24]:
sim_scores = torch.stack([
    cos(keyword_embeddings, keyword_embeddings[i]) 
    for i in range(len(examples))
])

In [25]:
sim_scores

tensor([[1.0000, 0.6127, 0.7506, 0.9188, 0.8510],
        [0.6127, 1.0000, 0.5606, 0.5833, 0.6129],
        [0.7506, 0.5606, 1.0000, 0.7733, 0.7424],
        [0.9188, 0.5833, 0.7733, 1.0000, 0.8347],
        [0.8510, 0.6129, 0.7424, 0.8347, 1.0000]], grad_fn=<StackBackward0>)

In [26]:
# important similarity scores based on example texts
sim_scores[0, 3]

tensor(0.9188, grad_fn=<SelectBackward0>)

In [27]:
# compare all sentence-based keyword embeddings with the `base_embedding`
cos(keyword_embeddings, base_embedding)

tensor([0.6686, 0.5095, 0.5919, 0.6675, 0.6742], grad_fn=<SumBackward1>)

## Domain Specific Terminology

When we ask "off-the-shelf" pretrained tokenizers and models to handle domain-specific terminology, we encounter several _types_ of issues, such as:

- Out of Vocabulary terms
- Homonyms (i.e. words with same spelling but different meanings)
- Breaking of alpha-numeric references, codes, etc. (e.g., AMM reference)
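
The first and third issues follow from how WordPiece-style tokenizers split a word they do not know: they repeatedly take the longest prefix that is in the vocabulary and mark continuation pieces with `##`. The sketch below is a simplified version of that greedy longest-match-first idea, not the library's implementation. `wordpiece` is a hypothetical helper, the `toy_vocab` entries are chosen by hand to mirror pieces that appear in this notebook's output, and the `[UNK]` fallback is an assumption about unmatched words.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: treat the whole word as unknown
        pieces.append(match)
        start = end
    return pieces


# Hand-picked toy vocabulary (an assumption, far smaller than the real one).
toy_vocab = {"inspected", "na", "##cel", "##les", "999", "##9"}
print(wordpiece("inspected", toy_vocab))  # ['inspected']
print(wordpiece("nacelles", toy_vocab))   # ['na', '##cel', '##les']
print(wordpiece("99999", toy_vocab))      # ['999', '##9', '##9']
```

A domain term missing from the vocabulary is still encoded, but as fragments whose pretrained embeddings carry little of the term's meaning.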

In [28]:
# observe how the tokenizer regards nacelles, ac, and the amm reference
example = "inspected right side nacelles for ac 99999 per manual reference 123-456-789."
example_enc = tokenizer(example, return_tensors='pt')['input_ids'].flatten()

In [31]:
# tokenizer vocab does not have entry for "nacelles", breaks up ac number & manual reference
show_decoding(tokenizer, example_enc)

[(101, '[ C L S ]'),
 (20456, 'i n s p e c t e d'),
 (2157, 'r i g h t'),
 (2217, 's i d e'),
 (6583, 'n a'),
 (29109, '# # c e l'),
 (4244, '# # l e s'),
 (2005, 'f o r'),
 (9353, 'a c'),
 (25897, '9 9 9'),
 (2683, '# # 9'),
 (2683, '# # 9'),
 (2566, 'p e r'),
 (6410, 'm a n u a l'),
 (4431, 'r e f e r e n c e'),
 (13138, '1 2 3'),
 (1011, '-'),
 (3429, '4 5'),
 (2575, '# # 6'),
 (1011, '-'),
 (6275, '7 8'),
 (2683, '# # 9'),
 (1012, '.'),
 (102, '[ S E P ]')]