<div style="text-align: right">Dino Konstantopoulos, 3 June 2021</div>



### BERT embeddings

We can use BERT to extract word and sentence embedding vectors from text data. Embeddings are useful for keyword/search expansion, semantic search and information retrieval. 

Typically, these vectors serve as high-quality feature inputs to downstream models. NLP models such as LSTMs or CNNs require numerical vectors as input, which usually means translating features like vocabulary and parts of speech into numerical representations. In the past, words were represented either as uniquely indexed values (one-hot encoding) or, more helpfully, as neural word embeddings, where each vocabulary word is matched against a fixed-length feature vector produced by a model like Word2Vec or fastText. BERT offers an advantage over such models: under Word2Vec each word has one fixed representation regardless of the context in which it appears, whereas BERT produces word representations that are dynamically informed by the surrounding words. For example, given two sentences:

"The man was accused of robbing a bank."
"The man went fishing by the bank of the river."

Word2Vec would produce the same embedding for the word "bank" in both sentences, while BERT would produce a different embedding for "bank" in each sentence. This will help us shape topic models into *belief* models, which is what we're after in our research.
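
One common way to compare two embedding vectors is cosine similarity. The sketch below is a minimal illustration only: the three 4-dimensional vectors are made-up placeholders, not output from Word2Vec or BERT, and the names `static_bank`, `contextual_bank_money` and `contextual_bank_river` are hypothetical. It shows how a static embedding compares identically with itself across sentences, while two context-dependent vectors for the same word can point in different directions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up placeholder vectors, NOT real model output.
static_bank = [0.2, 0.7, -0.1, 0.4]            # one vector reused in every sentence
contextual_bank_money = [0.9, 0.1, -0.3, 0.2]  # hypothetical "bank" in sentence 1
contextual_bank_river = [-0.2, 0.8, 0.5, 0.1]  # hypothetical "bank" in sentence 2

# A static embedding is the same vector in both sentences.
same = cosine_similarity(static_bank, static_bank)

# Context-dependent embeddings of the same word can differ.
diff = cosine_similarity(contextual_bank_money, contextual_bank_river)

print(round(same, 4))
print(round(diff, 4))
```

With real BERT output, the same function could be applied to the two "bank" token vectors extracted from each sentence.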

## 1. Loading Pre-Trained BERT

In [1]:
!pip install pytorch-pretrained-bert --quiet


In [2]:
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM

# Load the pre-trained tokenizer (multilingual, cased vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')

The pre-trained model you are loading is a cased model but you have not set `do_lower_case` to False. We are setting `do_lower_case=False` for you but you may want to check this behavior.


## 2. Sentence Tokenization


BERT provides its own tokenizer, imported above. Let's see how it handles the sentence below.

In [None]:
text = "Mao Zhedong believes the peasants are taken advantage of by capitalists "
marked_text = "[CLS] " + text + " [SEP]"

# Tokenize our sentence with the BERT tokenizer.
tokenized_text = tokenizer.tokenize(marked_text)

# Single-sentence input, so every token gets the same segment ID.
segments_ids = [1] * len(tokenized_text)

# Map the token strings to their vocabulary indices.
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

# Print out the tokens.
print(tokenized_text)
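
BERT's tokenizer splits words it does not know into sub-word pieces, marking continuation pieces with a `##` prefix. The sketch below is a simplified, standalone illustration of the greedy longest-match-first idea behind that splitting, not the library's actual implementation. The function `wordpiece_tokenize` and the vocabulary `toy_vocab` are hypothetical; the real multilingual vocabulary is much larger.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"peasant", "##s", "capital", "##ist", "##ists", "the"}

print(wordpiece_tokenize("peasants", toy_vocab))     # ['peasant', '##s']
print(wordpiece_tokenize("capitalists", toy_vocab))  # ['capital', '##ists']
print(wordpiece_tokenize("xyz", toy_vocab))          # ['[UNK]']
```

Because the longest match wins, "capitalists" becomes `capital` + `##ists` rather than `capital` + `##ist` + `##s`.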

## 3. Extracting Embeddings 



In [0]:
# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-multilingual-cased')

# Put the model in evaluation mode, which disables dropout for a plain feed-forward pass.
model.eval()

In [20]:
# Run the forward pass and collect the hidden states of every layer.
with torch.no_grad():
    encoded_layers, _ = model(tokens_tensor, segments_tensors)

# Combine the per-layer tensors. `stack` creates a new leading
# dimension, giving shape [12 x 1 x 23 x 768].
token_embeddings = torch.stack(encoded_layers, dim=0)

# Remove dimension 1, the batch dimension (size 1) -> [12 x 23 x 768].
token_embeddings = torch.squeeze(token_embeddings, dim=1)

# Swap dimensions 0 and 1 so tokens come first -> [23 x 12 x 768].
token_embeddings = token_embeddings.permute(1,0,2)

token_embeddings.size()

torch.Size([23, 12, 768])
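
The reshaping above can be traced on plain nested lists. This is a minimal sketch with toy sizes (3 layers, 4 tokens, 2 features) standing in for the real 12, 23 and 768; the names `layers`, `squeezed` and `per_token` are hypothetical, and ordinary Python lists are used instead of tensors so that the index bookkeeping is visible.

```python
# Toy sizes standing in for 12 layers, 23 tokens, 768 features.
n_layers, n_tokens, n_features = 3, 4, 2

# Mimic `encoded_layers`: one [batch=1][token][feature] nested list per layer.
# Each value encodes its own position so it is easy to track.
layers = [
    [[[float(100 * layer_idx + 10 * token_idx + feat_idx)
       for feat_idx in range(n_features)]
      for token_idx in range(n_tokens)]]
    for layer_idx in range(n_layers)
]

# "squeeze": drop the batch axis of size 1 -> [layer][token][feature]
squeezed = [layer[0] for layer in layers]

# "permute(1, 0, 2)": swap layer and token axes -> [token][layer][feature]
per_token = [
    [squeezed[layer_idx][token_idx] for layer_idx in range(n_layers)]
    for token_idx in range(n_tokens)
]

print(len(per_token), len(per_token[0]), len(per_token[0][0]))  # 4 3 2
```

After the swap, `per_token[t]` holds every layer's vector for token `t`, which is the layout the per-token loops in the next section rely on.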

## 3.1 Word Vectors

There are many ways to extract word vectors from BERT. A simple one is to sum, for each token, its vectors from the last four layers.




In [28]:
# Stores the token vectors, with shape [23 x 768]
token_vecs_sum = []

# `token_embeddings` is a [23 x 12 x 768] tensor.

# For each token in the sentence...
for token in token_embeddings:

    # `token` is a [12 x 768] tensor

    # Sum the vectors from the last four layers.
    sum_vec = torch.sum(token[-4:], dim=0)
    
    # Use `sum_vec` to represent `token`.
    token_vecs_sum.append(sum_vec)

print('Shape is: %d x %d' % (len(token_vecs_sum), len(token_vecs_sum[0])))

Shape is: 23 x 768


In [31]:
token_vecs_sum[3][:15]

tensor([-1.6947, -1.0643,  1.2002,  0.7157,  2.4702, -0.1323,  2.2795,  2.8649,
        -1.2580,  0.9318,  1.1011,  5.5358,  3.1182, -0.4945,  2.5315])
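
Summing is only one of the possible strategies. Another option is to concatenate the last four layers, which keeps each layer's values separate at the cost of a vector four times as long. The sketch below compares the two on a toy token with 12 layers of 3 features each; the helper names `sum_last_four` and `concat_last_four` are hypothetical, and plain lists stand in for tensors.

```python
def sum_last_four(token_layers):
    """Element-wise sum of the last four layer vectors for one token."""
    return [sum(values) for values in zip(*token_layers[-4:])]

def concat_last_four(token_layers):
    """Concatenate the last four layer vectors for one token, end to end."""
    return [value for layer in token_layers[-4:] for value in layer]

# Toy token: 12 layers, 3 features each (the real model here uses 768 features).
# Every feature in layer i is set to float(i) so the result is easy to check.
toy_token = [[float(layer_idx)] * 3 for layer_idx in range(12)]

summed = sum_last_four(toy_token)
concatenated = concat_last_four(toy_token)

print(len(summed), len(concatenated))  # 3 12
```

With 768 features per layer, the summed vector would stay at length 768 while the concatenated one would have length 3072.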

## 3.2 Sentence Vectors


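To get a single vector for our entire sentence we have multiple application-dependent strategies. A simple approach is to average the token vectors of one hidden layer, producing a single vector of length 768. The code below uses the last layer (`encoded_layers[11]`); the second-to-last layer (`encoded_layers[-2]`) is a common alternative.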

In [0]:
# `encoded_layers` is a list of 12 tensors, each of shape [1 x 23 x 768].

# Index 11 selects the last layer and index 0 the only batch item,
# so `token_vecs` is a tensor with shape [23 x 768].
token_vecs = encoded_layers[11][0]

# Calculate the average of all 23 token vectors.
sentence_embedding = torch.mean(token_vecs, dim=0)

In [33]:
print ("Our final sentence embedding vector of shape:", sentence_embedding.size())

Our final sentence embedding vector of shape: torch.Size([768])


In [34]:
sentence_embedding

tensor([-1.8472e-01, -3.1975e-01,  2.0524e-01,  1.8466e-01,  6.7442e-01,
         6.5859e-02,  2.7543e-01,  3.7008e-01,  4.8470e-02,  6.4544e-02,
        -2.5896e-01,  3.8808e-01,  1.8641e-01,  1.8684e-01,  2.7901e-01,
        -1.8391e-01,  2.4382e-01, -1.4729e-02, -1.2464e-01,  3.1313e-01,
        -3.8217e-01,  4.9745e-02,  2.6459e-01,  1.6558e-01,  1.8589e-01,
         3.9221e-01, -3.5080e-01,  2.2916e-01, -1.0496e-01, -5.7553e-01,
         1.2044e-03,  2.6013e-01, -1.8011e-01,  1.1600e-01,  5.0010e-03,
        -4.6986e-02, -5.1652e-02,  2.4950e-01,  1.3273e-01, -1.1434e-01,
        -1.2804e-02,  3.4448e-01,  1.9676e-01,  1.7263e-01,  1.3718e-01,
        -1.2040e-01,  2.6909e-01, -4.9007e-02,  1.1978e-01,  1.9021e-01,
         1.9566e-01, -2.2370e-01, -7.8432e-02,  3.2046e-02,  1.4038e-01,
         3.3115e-01,  1.2462e-01,  1.8779e-01, -9.8005e-03,  2.0713e-01,
        -1.7347e-01,  1.4861e-01,  3.9736e-01,  3.6141e-02, -3.4722e-01,
         2.5525e-01,  1.5488e-01, -1.0872e-01,  1.2