# Conversion of Contextual to Static Word Representations

## Author: Sandip S Panesar

### V1. October 2021

##### Copy this notebook into Colab and run from there for best results.

In [3]:
# NB: You may need to install the transformers package first
!pip install transformers
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch



**Introduction**

High-dimensional mathematical word representations are a fundamental building block of modern natural language processing (NLP) techniques. Though there are numerous ways to quantify words in text mathematically, e.g. frequency, TF-IDF etc., these methods all have particular drawbacks. An alternative, first described by Mikolov et al. (2013), was to use neural models to estimate these word representations in a vector space. This technique resulted in the popular Word2Vec language model, which has been widely adopted. Models like Word2Vec are trained on a single corpus of text data and produce a single representation for each word in the corpus vocabulary. The dimensionality of these representations is a hyperparameter chosen at training time, with values between 100 and 300 being typical. A 1x200 representation, for example, gives each word 200 "dimensions" of features that encode its semantic and syntactic properties.

Similar high-dimensional word representations are produced by models such as GloVe (Global Vectors for Word Representation), described by Pennington et al. (2014), a log-bilinear model trained on the non-zero entries of a global word-word co-occurrence matrix.

Nevertheless, the problem with the word representations produced by both of these models is that they are "static." This means that a single representation is produced for each word, and that representation blends *every* context in which the word appears in the training corpus. Think about the word "flies". Depending upon how it's used, it could be a verb or a noun:

- **Noun:** There are *flies* in my kitchen.

![](https://www.pestworld.org/media/560912/istock_000001759709small_2-flies.jpg?preset=pestFeature1280)

- **Verb:** Jane *flies* to London on Tuesday.

![](https://media-cldnry.s-nbcnews.com/image/upload/t_fit-2000w,f_auto,q_auto:best/newscms/2020_29/3397778/200717-british-airways-747-al-0858.jpg)

Naturally, for tasks such as machine translation, among many others, a single static word representation may introduce inaccuracies and hurt performance.

**Contextual Language Models**

A substantial advancement in NLP came with the introduction of contextualized language models such as Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al. 2018). Since the release of Word2Vec, advances had already been made by creating models that could incorporate syntax, morphology, subwords and subcharacters. Nevertheless, the single biggest performance increase has been conferred by models that can incorporate context. BERT, in particular, is a model that has first been pre-trained on massive text corpora (e.g. Wikipedia) and is designed to produce "just in time" word embeddings for tasks that notably include language translation and prediction. As a result of this initial training, BERT in its core state holds a context-agnostic representation for every token in its vocabulary. When a sentence is passed into the model, these representations are adjusted by the surrounding words (NB: this adjustment happens at inference time and is a different process from "fine-tuning", i.e. further training an already trained BERT model to improve performance in a particular domain or task). The resulting output is a sequence of representations, one for each word in the sentence, each uniquely influenced by every other word in that sentence, i.e. they are contextual word representations.

**Contextual language model architectures**

![](https://miro.medium.com/max/1348/1*lxd3DCwPKYkjmQb3yfy3Hw.png)

**Transformers: The building blocks of contextual language models**

![](https://miro.medium.com/max/1400/1*abz_nltyDYtC6ThqNg4O6w.png)

**Scaled Dot Product Attention**

$$
\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

where $d_k$ is the dimensionality of the key vectors.
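
The attention formula can be sketched in a few lines of PyTorch. This is a minimal illustration on random toy matrices, not BERT's actual multi-head implementation; the function name 'scaled_dot_product_attention' and the sizes (6 tokens, 8 dimensions) are assumptions chosen for the example, and the scaling factor is taken to be the dimensionality of the key vectors.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for 2-D inputs.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
    """
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = Q @ K.T / math.sqrt(d_k)
    # Each row becomes a probability distribution over the keys
    weights = torch.softmax(scores, dim=-1)
    # Each output row is a weighted average of the value vectors
    return weights @ V, weights

torch.manual_seed(0)
# Toy input: 6 tokens with 8-dimensional queries, keys and values (illustrative sizes)
Q, K, V = torch.randn(6, 8), torch.randn(6, 8), torch.randn(6, 8)
output, weights = scaled_dot_product_attention(Q, K, V)

print(output.shape)         # torch.Size([6, 8])
print(weights.sum(dim=-1))  # every row sums to 1, up to floating-point error
```

Because every output row mixes the value vectors of all tokens, the representation of one token depends on the whole sentence, which is what makes the resulting representations contextual.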

**Contextualized Sequences of Word Representations from BERT**

NB: This notebook and all subsequent examples use the core bert-base-uncased model (via the Hugging Face transformers package) to demonstrate the technique.

In [4]:
# Initialize tokenizer and model objects from HuggingFace transformers library

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Let's use the above contextual examples of the word *flies* to see how the word representation produced by BERT changes depending upon context.

Before being passed into BERT, the text first needs to be tokenized using the model's tokenizer:

In [5]:
ct_1 = tokenizer.tokenize("there are flies in my kitchen")
ct_2 = tokenizer.tokenize("jane flies to london on tuesday")

tk_1 = tokenizer.convert_tokens_to_ids(ct_1)
tk_2 = tokenizer.convert_tokens_to_ids(ct_2)

This produces, for each sentence, a sequence of tokens and the corresponding token IDs, which index into BERT's vocabulary of ~30,000 subword tokens:

In [6]:
print("Context 1:")
print(ct_1)
print(tk_1)
print("Context 2:")
print(ct_2)
print(tk_2)

Context 1:
['there', 'are', 'flies', 'in', 'my', 'kitchen']
[2045, 2024, 10029, 1999, 2026, 3829]
Context 2:
['jane', 'flies', 'to', 'london', 'on', 'tuesday']
[4869, 10029, 2000, 2414, 2006, 9857]


In this particular case, we are interested in the word "flies", which is the 3rd token in the first example and the 2nd token in the second example. Note that its token ID is the same in both (10029): the ID is the row index for that token in the pre-trained BERT's context-agnostic word-representation (embedding) matrix. NB: As BERT uses subword (WordPiece) tokenization, words that don't appear in the BERT vocabulary are constructed from multiple tokens. In this example, there are as many tokens as there are words in each sentence; be aware, however, that a tokenized sequence may contain more tokens than the sentence has words.
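
To make the subword point concrete, here is a minimal sketch of greedy longest-match-first splitting, the strategy WordPiece tokenizers use at inference time. The function 'wordpiece_tokenize' and the toy vocabulary are hypothetical and written only for illustration; the real tokenizer uses BERT's full vocabulary and may split words differently.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary, not BERT's real one
toy_vocab = {"flies", "kitchen", "accr", "##ued", "##s"}

print(wordpiece_tokenize("flies", toy_vocab))    # ['flies']
print(wordpiece_tokenize("accrued", toy_vocab))  # ['accr', '##ued']
print(wordpiece_tokenize("zebra", toy_vocab))    # ['[UNK]']
```

A word that is split this way yields one representation per piece, so any later step that looks up "the" vector for a word has to handle several positions.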

The next step is to encode the tokenized sequences using the model's encoder:

In [7]:
with torch.no_grad():
  reps_1 = model(torch.tensor([tk_1]), output_hidden_states=True)
  reps_2 = model(torch.tensor([tk_2]), output_hidden_states=True)

Above, we have passed the tokenized examples into the BERT model and requested the hidden states for each one. The output is in the form of a "masked LM output", which can be indexed like a list object: index 0 holds the prediction logits and index 1 holds the tuple of hidden states, which is what we inspect below:

In [8]:
print("Length of the 'reps_1' object: {}".format(len(reps_1[1])))
print("Length of the 'reps_2' object: {}".format(len(reps_2[1])))

Length of the 'reps_1' object: 13
Length of the 'reps_2' object: 13


Why is the length of this object 13? BERT-base has 12 transformer encoder layers, and the output of the input embedding layer is returned as well, at index 0, giving 13 tensors in total. Following the input, the context-agnostic representations (which are looked up using the token IDs) are successively transformed by each layer, each with its own weights, to produce a final output.

Let's look further into the model outputs:

In [9]:
print("Type of first item in 'reps_1' object: {}".format(type(reps_1[1][1])))
print("Shape of first item in 'reps_1' object: {}".format(reps_1[1][1].shape))

Type of first item in 'reps_1' object: <class 'torch.Tensor'>
Shape of first item in 'reps_1' object: torch.Size([1, 6, 768])


Here we can see that the output of the model is a Torch tensor object. Its shape is 1 x 6 x 768, so it is a 3-dimensional tensor rather than a simple matrix. The first dimension is the batch size, which is 1 because we passed in a single sentence; we can ignore it for the time being. The second dimension is 6, because there were 6 tokens in the first example ("there are flies in my kitchen"). Why is the 3rd dimension 768? Because this is the hidden size that BERT-base uses for all of its representations. The model output therefore contains a sequence of 13 torch tensors of size 1 x 6 x 768: the embedding output followed by the outputs of the 12 layers. Note that the tensor inspected above sits at index 1, i.e. it is the output of the first transformer layer. We would expect the same dimensionalities for the second example ("Jane flies to London on Tuesday") also.

Looking further into the outputs from the 1st BERT layer:

In [10]:
reps_1[1][1]

tensor([[[-0.3371,  0.0284, -0.0151,  ...,  0.1688,  0.5847,  0.2671],
         [ 0.1085, -0.5766,  0.5589,  ...,  0.8240,  0.8500,  0.2674],
         [ 1.2492,  0.4529,  0.3716,  ...,  0.4312,  1.6800, -0.7760],
         [-0.3168, -0.4342,  0.3652,  ...,  0.8458, -0.1063,  0.4743],
         [ 0.0768,  0.0895, -0.5986,  ..., -0.1931,  0.1779, -0.1596],
         [ 1.1235,  1.1240, -1.7649,  ...,  0.0018,  0.5081, -1.2387]]])

Above is the sequence of BERT representations, one row per token, for the sentence in example 1. We know that the word of interest, "flies", is the 3rd token in the sequence and is represented by a single token rather than multiple subword tokens. We can extract its vector from the sequence:

In [11]:
reps_1[1][1][0][2]

tensor([ 1.2492,  0.4529,  0.3716, -0.2537, -0.6448, -0.5668,  0.2715,  0.0713,
         0.5645,  0.2954,  0.1321, -0.2661, -1.3168, -0.8103,  0.7747, -0.1451,
        -0.3072, -0.8303,  0.0562,  0.0047,  0.9308, -1.1341, -0.0853, -0.6373,
         1.1573, -0.7362,  0.3110,  0.6170, -0.3286,  0.1840, -1.2341, -0.5636,
         0.1067,  0.0057,  0.3447, -0.1934, -0.8136, -0.7385, -0.3345, -0.0770,
         0.8960, -1.0697, -0.6062, -0.7009, -0.0875,  0.0882, -1.0935,  1.3017,
         0.9258, -0.0748, -1.0685,  0.7274,  0.4908,  0.4242, -0.2799, -0.6097,
         1.0959,  0.5043, -0.2942, -0.3891, -0.9965, -0.6558, -0.1532,  1.0867,
        -1.0657,  0.1961, -0.9462,  0.4023, -1.3862, -0.0482,  1.5956,  0.3197,
        -0.0546, -1.3346,  0.1428,  0.3159,  0.5501,  0.6684,  1.0249, -0.0060,
        -0.8474, -1.3680,  0.7003,  0.2155, -0.4345, -0.2939, -0.3236, -0.1450,
        -0.5835, -0.2909,  0.0689, -0.1768,  0.2719,  0.2286,  0.1724,  0.3491,
         0.0193,  0.0952, -0.7759, -1.51

This is the representation vector for the word "flies" in this particular context.
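
The chain of indices used above reads as layer, then sentence in the batch, then token position. The following sketch reproduces that indexing on random stand-in tensors with the same shapes, so it runs without the model; the variable names are assumptions for illustration and the values are not real BERT output.

```python
import torch

torch.manual_seed(0)
# Stand-in for 'reps_1[1]': 13 tensors (embedding output + 12 encoder layers),
# each of shape (batch=1, tokens=6, hidden=768). Random values, for shape only.
hidden_states = tuple(torch.randn(1, 6, 768) for _ in range(13))

layer, batch_item, token_position = 1, 0, 2  # layer 1, only sentence, third token
vector = hidden_states[layer][batch_item][token_position]
print(vector.shape)  # torch.Size([768])

# Equivalent single indexing expression on the layer tensor
same_vector = hidden_states[layer][batch_item, token_position]
print(torch.equal(vector, same_vector))  # True

# Stacking the tuple gives one tensor: (layers, batch, tokens, hidden)
all_layers = torch.stack(hidden_states)
print(all_layers.shape)  # torch.Size([13, 1, 6, 768])
```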

To illustrate the difference between the representation of "flies" produced from this context and the one produced from the second context, let's check whether they are the same:

In [12]:
torch.equal(reps_1[1][1][0][2], reps_2[1][1][0][1])

False

Thus, we have demonstrated that the BERT word representation for "flies" differs between the two contexts. The same holds for the other layers, e.g. the 12th layer:

In [13]:
torch.equal(reps_1[1][12][0][2], reps_2[1][12][0][1])

False
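
Exact equality only tells us that the two vectors are not identical; it does not say how far apart they are. A graded comparison such as cosine similarity is one way to quantify this. The sketch below uses two noisy copies of a random vector as stand-ins for the two "flies" representations, so the numbers are illustrative only and say nothing about the real BERT vectors.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Toy stand-ins for two contextual vectors of the same word (not real BERT output)
base = torch.randn(768)
vec_context_1 = base + 0.1 * torch.randn(768)
vec_context_2 = base + 0.1 * torch.randn(768)

# Exact equality fails as soon as any component differs
print(torch.equal(vec_context_1, vec_context_2))  # False

# Cosine similarity reports how closely the two directions agree
similarity = F.cosine_similarity(vec_context_1, vec_context_2, dim=0)
print(similarity.item())
```

With the real outputs, the same call could be applied to 'reps_1[1][12][0][2]' and 'reps_2[1][12][0][1]'.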

**Converting Contextual Word Representations into Static Word Representations**

Above, we showed how BERT produces unique, contextualized representations for a single word depending upon the context in which it appears. Nevertheless, these representations, aside from differing in dimensionality from those produced by a Word2Vec or GloVe model, cannot readily be compared to the vectors produced by those models. This is because a word representation produced by Word2Vec, for example, reflects the word as it appears in all contexts in a particular corpus, whereas a BERT representation reflects the word in one particular example only. Thus, to permit comparison for academic and other purposes, Bommasani et al. (2020) devised a method to convert contextual word representations (like the ones we have just produced) into representations similar to those produced by static language models.

Specifically, Bommasani et al. (2020) proposed two methods to produce word representations from contextual language models that are comparable to those produced by static models. The first is simply to use a single context for each word, like we have done above. Nevertheless, even though BERT has been trained on corpora much larger than would be practical for training a Word2Vec or GloVe model, on word-similarity benchmarks such as SimLex999 and SimVerb3500 (see the Bommasani et al. paper for more detail) a single contextualized example substantially underperforms static language models.

The second approach is to aggregate several contextual examples and produce a single, aggregated representation for each word. Though this is a largely experimental technique, Bommasani et al. (2020) demonstrated that word representations produced this way substantially outperformed those produced by static language models at various language benchmarking tasks. The following sections demonstrate how to achieve this. The utils.py file contains some helper functions.

**Batch Encoding a Series of Sentences**

If we want to aggregate a series of contextual examples for a particular word of interest, we will need to encode all of them. Let's use the word "interest" in this example:

In [14]:
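# Let's define some helper functions first:

def encode(text, tokenizer, add_special_tokens=False):
    # Convert text to a (1, n_tokens) tensor of token IDs
    encoding = tokenizer.encode(
        text,
        add_special_tokens=add_special_tokens,
        return_tensors='pt')
    # Fall back to the unknown token if the text produced no tokens
    if encoding.shape[1] == 0:
        text = tokenizer.unk_token
        encoding = torch.tensor([[tokenizer.vocab[text]]])
    return encoding

def represent(batch_ids, model, layer=-1):
    # Return the hidden states of the chosen layer, without tracking gradients
    with torch.no_grad():
        reps = model(batch_ids, output_hidden_states=True)
        return reps.hidden_states[layer]

def find_sublist_indices(sublist, mainlist):
    # Flatten both inputs (tensors or lists) to plain lists of token IDs
    sublist = torch.as_tensor(sublist).flatten().tolist()
    mainlist = torch.as_tensor(mainlist).flatten().tolist()
    indices = []
    length = len(sublist)
    for i in range(0, len(mainlist)-length+1):
        # Compare a window of the same length, so multi-token words also match
        if mainlist[i:i+length] == sublist:
            indices.append(i)
    return indices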

In [15]:
batch = ['the bank account accrued interest over time', 
         'do you have any interest in this job opportunity?', 
         'Jane did not show any interest in Tom', 
         'the central bank adjusted the interest rates']

encoded_batch = [encode(example, tokenizer) for example in batch]

This produces a list of token ID tensors, one for each example in the batch:

In [16]:
encoded_batch

[tensor([[ 1996,  2924,  4070, 16222, 28551,  3037,  2058,  2051]]),
 tensor([[2079, 2017, 2031, 2151, 3037, 1999, 2023, 3105, 4495, 1029]]),
 tensor([[4869, 2106, 2025, 2265, 2151, 3037, 1999, 3419]]),
 tensor([[ 1996,  2430,  2924, 10426,  1996,  3037,  6165]])]

The next step is to pass each of these sequences into the model and obtain the model representations for each. For this example we are using layer 1:

In [17]:
batch_reps = [represent(example, model, layer=1) for example in encoded_batch]

print("Length of 'batch_reps': {}".format(len(batch_reps)))

for idx, item in enumerate(batch_reps):
  print("Shape of BERT output for example {}: {}".format(idx+1, item.shape))

Length of 'batch_reps': 4
Shape of BERT output for example 1: torch.Size([1, 8, 768])
Shape of BERT output for example 2: torch.Size([1, 10, 768])
Shape of BERT output for example 3: torch.Size([1, 8, 768])
Shape of BERT output for example 4: torch.Size([1, 7, 768])


Here, we have encoded each of the 4 examples using the tokenized sequences. We can see that the sequence lengths of the representations correspond to the lengths of the examples in the 'encoded_batch' list, and that each token is represented by a 768-dimensional vector.
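
The examples above are passed through the model one at a time, which is why their sequence lengths can differ. An alternative, sketched below under stated assumptions, is to pad the sequences to a common length and build an attention mask so that they fit into a single tensor. The token IDs are copied from the first three rows of 'encoded_batch'; 'PAD_ID = 0' assumes that 0 is the '[PAD]' ID of bert-base-uncased.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Token ID sequences of different lengths (first three rows of 'encoded_batch', squeezed)
sequences = [
    torch.tensor([1996, 2924, 4070, 16222, 28551, 3037, 2058, 2051]),
    torch.tensor([2079, 2017, 2031, 2151, 3037, 1999, 2023, 3105, 4495, 1029]),
    torch.tensor([4869, 2106, 2025, 2265, 2151, 3037, 1999, 3419]),
]

PAD_ID = 0  # assumption: 0 is the [PAD] id in bert-base-uncased
# Right-pad every sequence to the length of the longest one
input_ids = pad_sequence(sequences, batch_first=True, padding_value=PAD_ID)
# 1 marks a real token, 0 marks padding
attention_mask = (input_ids != PAD_ID).long()

print(input_ids.shape)                     # torch.Size([3, 10])
print(attention_mask.sum(dim=1).tolist())  # [8, 10, 8]
```

Both tensors could then be passed together, e.g. 'model(input_ids, attention_mask=attention_mask, output_hidden_states=True)', in which case the vectors at padded positions should be ignored when extracting representations.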

Nevertheless, these representations are for **all** words in the example sequences. In order to aggregate contextual representations, we need to extract only the representations corresponding to the word of interest, which in this example is "interest". We achieve this using an indexing function that extracts the representation (or representations, in the case of multiple subwords) from the sequences we have produced.

This function locates the word of interest by comparing its token ID against the sequence of token IDs for each example, so first we need to get that token ID from the BERT model tokenizer:

In [18]:
toi = encode('interest', tokenizer)

toi

tensor([[3037]])

In our case, 3037 is the token ID of interest, so we need to find the position of 3037 in each sequence of token IDs and pull out the representation at that position:

In [19]:
indices = [find_sublist_indices(toi, ids.squeeze(0)) for ids in encoded_batch]

indices

[[5], [4], [5], [5]]
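
When the word of interest is split into several subword tokens, the search has to match a run of consecutive IDs rather than a single ID. A minimal sketch on plain Python lists follows; the helper name 'find_subsequence_starts' is hypothetical, and the IDs are copied from the first row of 'encoded_batch', where the two IDs 16222 and 28551 together cover the word "accrued".

```python
def find_subsequence_starts(target_ids, sequence_ids):
    """Return every start index at which target_ids occurs inside sequence_ids."""
    n = len(target_ids)
    return [
        start
        for start in range(len(sequence_ids) - n + 1)
        if sequence_ids[start:start + n] == target_ids
    ]

# Token IDs of the first example sentence, copied from 'encoded_batch'
sentence_ids = [1996, 2924, 4070, 16222, 28551, 3037, 2058, 2051]

print(find_subsequence_starts([3037], sentence_ids))          # [5]
print(find_subsequence_starts([16222, 28551], sentence_ids))  # [3]
print(find_subsequence_starts([9999], sentence_ids))          # []
```

For a multi-token word, the vectors at positions 'start' to 'start + n - 1' would then need to be pooled (for instance averaged) into one vector before aggregating across contexts.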

Above are the index positions of the word "interest" in each contextual example. The next step is to extract the corresponding representations from the representation sequences:

In [20]:
final_reps = [rep.squeeze()[idx] for idx, rep in zip(indices, batch_reps)]

print("Length of 'final_reps': {}".format(len(final_reps)))

for idx, item in enumerate(final_reps):
  print("Shape of object {} in 'final_reps': {}".format(idx+1, item.shape))

Length of 'final_reps': 4
Shape of object 1 in 'final_reps': torch.Size([1, 768])
Shape of object 2 in 'final_reps': torch.Size([1, 768])
Shape of object 3 in 'final_reps': torch.Size([1, 768])
Shape of object 4 in 'final_reps': torch.Size([1, 768])


Above, we can see that there are 4 items in this list, one for each of the 4 examples, and that each is a 1 x 768 tensor. These are the individual contextualized representations for the word "interest", each influenced by the unique sequence of words that appeared around it in its contextual example.

Now, all that is left to do is to take the mean across the examples:

In [21]:
interest_rep = torch.mean(torch.cat(final_reps), axis=0).squeeze(0)

print("The shape of the final, aggregated vector is: {}".format(interest_rep.shape))

print(interest_rep)

The shape of the final, aggregated vector is: torch.Size([768])
tensor([-1.5860e-01,  4.2653e-01, -1.3746e+00, -2.0712e-01,  1.4237e+00,
         3.5543e-01, -4.3161e-01,  1.3834e-01, -6.5748e-01, -9.0278e-01,
         8.9920e-01, -8.3442e-01, -1.6335e+00,  5.8129e-01, -8.0425e-01,
         2.4763e-02,  4.5348e-01, -1.6919e-01,  5.5616e-01, -6.3543e-01,
         1.3068e+00, -4.7555e-01,  5.5960e-01, -6.5125e-01,  1.9772e-01,
         4.9402e-02, -1.1139e+00,  8.9068e-01, -6.4682e-01,  1.4834e-01,
        -1.3721e-01, -5.6790e-01,  2.8271e-01,  8.3907e-01, -2.5547e-01,
        -2.4574e-01,  4.2206e-01, -4.1671e-01, -2.5220e-01,  7.0160e-01,
         1.2924e-01,  1.0697e-01, -7.5490e-01, -4.6156e-01, -5.2881e-01,
        -9.4269e-01, -1.2130e-01,  4.7840e-01, -7.9150e-02, -1.2404e+00,
        -3.1382e-01,  1.7650e-01,  8.0450e-02,  8.2559e-01,  7.4104e-01,
         7.0324e-01,  5.3583e-01,  7.3397e-01,  2.2428e-01, -9.2912e-01,
         9.8462e-01, -4.0586e-01, -4.7924e-01, -7.9504e-01, 

We can repeat this process for some other words:

In [22]:
# Let's make a helper function to make the process easier
def encode_represent(batch, tokenizer, model, toi, layer=1):
  # Tokenize each example and get its representations from the chosen layer
  encoded_batch = [encode(example, tokenizer) for example in batch]
  batch_reps = [represent(example, model, layer=layer) for example in encoded_batch]
  # Locate the token of interest in each example and pull out its vectors
  toi = encode(toi, tokenizer)
  indices = [find_sublist_indices(toi, ids.squeeze(0)) for ids in encoded_batch]
  final_reps = [rep.squeeze(0)[idx] for idx, rep in zip(indices, batch_reps)]
  # Average over all occurrences to get a single static vector
  return torch.mean(torch.cat(final_reps), dim=0)


batch2 = ['the sky is blue',
          'blue is the color of the ocean',
          'she has blue eyes']

blue_rep = encode_represent(batch2, tokenizer, model, 'blue')

batch3 = ['i was so angry i was seeing red',
          'he drives a red corvette',
          'the girl in the red dress']

red_rep = encode_represent(batch3, tokenizer, model, 'red')

# And just for an example, let's create another batch for 'blue' but in a different context
batch4 = ['blue cheese is a type of cheese',
          'the presence of mold in the cheese gives it characteristic blue streaks',
          'gorgonzola is a type of blue cheese',
          'not everybody likes blue cheese due to its strong flavor and smell']

blue2_rep = encode_represent(batch4, tokenizer, model, 'blue')

The above code generates a list of contextual representations for each batch, i.e. one list for 'red' and two for 'blue' (general and cheese-related contexts). The mean of each list is then taken, yielding three aggregated word representations. Subsequently, operations such as cosine similarity can be performed between any two of these tensor objects:

$$
\mathrm{similarity}(a, b) = \cos(\theta) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
$$

In [23]:
cos_br = torch.dot(blue_rep, red_rep)/(torch.norm(blue_rep)*torch.norm(red_rep))
print("Similarity between words 'blue' and 'red': {}".format(cos_br))
cos_bi = torch.dot(blue_rep, interest_rep)/(torch.norm(blue_rep)*torch.norm(interest_rep))
print("Similarity between words 'blue' and 'interest': {}".format(cos_bi))
cos_bb = torch.dot(blue_rep, blue2_rep)/(torch.norm(blue_rep)*torch.norm(blue2_rep))
print("Similarity between words 'blue' and 'blue' in different contexts: {}".format(cos_bb))

Similarity between words 'blue' and 'red': 0.5347683429718018
Similarity between words 'blue' and 'interest': 0.1250074803829193
Similarity between words 'blue' and 'blue' in different contexts: 0.8751708269119263


Note how the similarity between 'blue' and 'red' is greater than the similarity between 'blue' and 'interest'. Note also how the similarity between the two 'blue' representations, albeit built from different contexts, is almost 0.9.

Here, we have described a process of aggregating several contextual examples for a particular word of interest. The single, aggregated representation can be used in the same way as a static word representation, e.g. for linear analogies. Moreover, taking the mean is only one way to aggregate the representations. Bommasani et al. (2020) tried various approaches including taking the maximum, minimum, first and last of the various contextual examples.
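
The alternative aggregation strategies mentioned above differ only in the reduction applied across the stacked contextual vectors. The sketch below illustrates this on random stand-ins for 'final_reps'; the 'aggregators' dictionary and its names are assumptions made for this example and are not taken from the paper's code.

```python
import torch

torch.manual_seed(0)
# Toy stand-in for 'final_reps': four contextual vectors of shape (1, 768)
context_vectors = [torch.randn(1, 768) for _ in range(4)]
stacked = torch.cat(context_vectors)  # shape (4, 768)

aggregators = {
    "mean": lambda x: x.mean(dim=0),
    "max": lambda x: x.max(dim=0).values,  # element-wise maximum
    "min": lambda x: x.min(dim=0).values,  # element-wise minimum
    "first": lambda x: x[0],
    "last": lambda x: x[-1],
}

static_vectors = {name: fn(stacked) for name, fn in aggregators.items()}
for name, vec in static_vectors.items():
    print(name, tuple(vec.shape))  # each is (768,)
```

Whichever reduction is chosen, the result is a single 768-dimensional vector per word that can be compared with cosine similarity exactly as shown earlier.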

**References**

1. Mikolov, T., Chen, K., Corrado, G. and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

2. Pennington, J., Socher, R. and Manning, C.D., 2014, October. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532-1543).

3. Bommasani, R., Davis, K. and Cardie, C., 2020, July. Interpreting pretrained contextualized representations via reductions to static embeddings. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 4758-4781).

4. Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.