<a href="https://colab.research.google.com/github/UniVR-DH/CAT-tools/blob/main/L11-NeuralMT/Test_Embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Based on:

https://colab.research.google.com/drive/1yFphU6PW9Uo6lmDly_ud9a6c4RCYlwdX#scrollTo=Q51eN4KAkbIJ


In [None]:
!pip install transformers



In [None]:
import torch
from transformers import BertTokenizer, BertModel
import matplotlib.pyplot as plt
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


Because BERT is a pretrained model that expects input data in a specific format, we will need:

* A special token, `[SEP]`, to mark the end of a sentence, or the separation between two sentences

* A special token, `[CLS]`, at the beginning of our text. This token is used for classification tasks, but BERT expects it no matter what your application is.
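
As a minimal sketch of this input format, the hypothetical helper `mark_text` below (an illustration, not part of `transformers`) wraps one sentence, or a pair of sentences, in the special tokens by plain string concatenation. The example sentences in the pair are made up.

```python
def mark_text(sentence_a, sentence_b=None):
    """Wrap one or two sentences in BERT's special tokens (illustrative helper)."""
    marked = "[CLS] " + sentence_a + " [SEP]"
    if sentence_b is not None:
        # A second sentence is appended and closed with its own [SEP].
        marked += " " + sentence_b + " [SEP]"
    return marked

single = mark_text("Here is the sentence I want embeddings for.")
pair = mark_text("The bank was closed.", "We went fishing instead.")
print(single)
print(pair)
```

For a single sentence this produces the same string as the manual concatenation used in the next cell.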

In [None]:
text = "Here is the sentence I want embeddings for."
marked_text = "[CLS] " + text + " [SEP]"

# Tokenize our sentence with the BERT tokenizer.
tokenized_text = tokenizer.tokenize(marked_text)

# Print out the tokens.
print (tokenized_text)

['[CLS]', 'here', 'is', 'the', 'sentence', 'i', 'want', 'em', '##bed', '##ding', '##s', 'for', '.', '[SEP]']


Notice how the word "embeddings" is represented:

`['em', '##bed', '##ding', '##s']`


The original word has been split into smaller subwords and characters.
The two hash signs preceding some of these subwords are the tokenizer's way of marking a subword or character that continues a larger word, that is, one that is preceded by another subword. For example, the `##bed` token is distinct from the `bed` token: the first is used whenever the subword "bed" occurs inside a larger word, and the second is used only for the standalone word "bed" (the thing you sleep on).
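
To make the splitting concrete, here is a rough sketch of greedy longest-match-first subword splitting, which is how WordPiece tokenization is usually described. The function `wordpiece` and the five-entry `toy_vocab` are assumptions made for illustration; the real tokenizer uses the full vocabulary of about 30,000 entries and also handles casing and punctuation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary (hypothetical, far smaller than BERT's).
toy_vocab = {"em", "##bed", "##ding", "##s", "bed"}
print(wordpiece("embeddings", toy_vocab))
print(wordpiece("bed", toy_vocab))
```

With this toy vocabulary, "embeddings" is split into the same four pieces shown above, while the standalone word "bed" maps to the plain `bed` entry rather than `##bed`.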

In [None]:
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )

# Put the model in "evaluation" mode: this disables dropout, so the forward pass is deterministic.
model.eval()


BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

After breaking the text into tokens, we then have to convert the sentence from a list of strings to a list of vocabulary indices.
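
Conceptually, this conversion is a dictionary lookup. The sketch below uses a tiny excerpt of the vocabulary with the IDs that this notebook prints for `bert-base-uncased`; the helper `tokens_to_ids`, the `[UNK]` entry (ID 100), and the fallback to it for unknown tokens are assumptions of this sketch, not the library's code.

```python
# Toy excerpt of the vocabulary (token -> ID).
toy_vocab = {"[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
             "the": 1996, "bank": 2924, "river": 2314}

def tokens_to_ids(tokens, vocab):
    """Look up each token, falling back to the [UNK] id (assumed behaviour)."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

ids = tokens_to_ids(["[CLS]", "the", "river", "bank", "[SEP]"], toy_vocab)
print(ids)
```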


## Embedding a sentence

From here on, we'll use the example sentence below, which contains three instances of the word "bank" used with two different meanings.

In [None]:
# Define a new example sentence with multiple meanings of the word "bank"
text = "After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank."

# Add the special tokens.
marked_text = "[CLS] " + text + " [SEP]"

# Split the sentence into tokens.
tokenized_text = tokenizer.tokenize(marked_text)

# Map the token strings to their vocabulary indices.
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
print("Size: ", len(indexed_tokens))
# Display the words with their indices.
for tup in zip(tokenized_text, indexed_tokens):
    print('{:<12} {:>6,}'.format(tup[0], tup[1]))

Size:  22
[CLS]           101
after         2,044
stealing     11,065
money         2,769
from          2,013
the           1,996
bank          2,924
vault        11,632
,             1,010
the           1,996
bank          2,924
robber       27,307
was           2,001
seen          2,464
fishing       5,645
on            2,006
the           1,996
mississippi   5,900
river         2,314
bank          2,924
.             1,012
[SEP]           102


In [None]:
# Mark each of the 22 tokens as belonging to sentence "1".
segments_ids = [1] * len(tokenized_text)
#print (segments_ids)

# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])


# Run the text through BERT, and collect all of the hidden states produced
# by the embedding layer and the 12 encoder layers.
with torch.no_grad():

    # Note: the second positional argument of `BertModel.forward` is
    # `attention_mask`, so the all-ones tensor acts as an attention mask
    # here and `token_type_ids` falls back to its default of all zeros.
    outputs = model(tokens_tensor, segments_tensors)

    # The number of objects the model returns depends on how it was
    # configured in the `from_pretrained` call earlier. In this case,
    # because we set `output_hidden_states = True`, the third item holds the
    # hidden states from all layers. See the documentation for more details:
    # https://huggingface.co/transformers/model_doc/bert.html#bertmodel
    hidden_states = outputs[2]
    print("Ready")

Ready
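
The cell above gives every token the same segment ID because the input is a single sentence. For a sentence pair, the usual BERT convention is 0 for the first sentence (including `[CLS]` and the first `[SEP]`) and 1 for the second. The helper `segment_ids` below is a hypothetical illustration of that convention, and the token list is made up.

```python
def segment_ids(tokens):
    """Assign 0 up to and including the first [SEP], 1 afterwards."""
    ids, current = [], 0
    for tok in tokens:
        ids.append(current)
        if tok == "[SEP]":
            current = 1  # following tokens belong to the second sentence
    return ids

pair_tokens = ["[CLS]", "the", "bank", "[SEP]", "the", "river", "[SEP]"]
print(segment_ids(pair_tokens))
```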


In [None]:
# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

# Concatenate the tensors for all layers. We use `stack` here to
# create a new dimension in the tensor.
token_embeddings = torch.stack(hidden_states, dim=0)

# Remove dimension 1, the "batches".
token_embeddings = torch.squeeze(token_embeddings, dim=1)

# Swap dimensions 0 and 1.
token_embeddings = token_embeddings.permute(1,0,2)

token_embeddings.size()

torch.Size([22, 13, 768])
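
The `permute(1,0,2)` call regroups the data from layers x tokens x features to tokens x layers x features. The same regrouping can be pictured with plain nested lists; the sketch below uses made-up numbers (3 layers, 2 tokens, 2 features) and standard-library Python only, so it illustrates the idea rather than the tensor API.

```python
# Toy "hidden states": 3 layers x 2 tokens x 2 features (made-up numbers).
layer_major = [
    [[0.0, 0.1], [1.0, 1.1]],   # layer 0: token 0, token 1
    [[0.2, 0.3], [1.2, 1.3]],   # layer 1
    [[0.4, 0.5], [1.4, 1.5]],   # layer 2
]

# zip(*...) regroups by token: token_major[t][l] == layer_major[l][t].
token_major = [list(per_token) for per_token in zip(*layer_major)]

# Shape is now tokens x layers x features.
print(len(token_major), len(token_major[0]), len(token_major[0][0]))
print(token_major[1])
```

After the regrouping, iterating over the outer list yields one token at a time together with all of its layer vectors, which is what the loop in the next cell relies on.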

In [None]:
# Stores the token vectors, with shape [22 x 768]
token_vecs_sum = []

# `token_embeddings` is a [22 x 13 x 768] tensor: 22 tokens, 13 layers (the embedding output plus 12 encoder layers), 768 dimensions each.

# For each token in the sentence...
for token in token_embeddings:

    # `token` is a [13 x 768] tensor

    # Sum the vectors from the last four layers.
    sum_vec = torch.sum(token[-4:], dim=0)

    # Use `sum_vec` to represent `token`.
    token_vecs_sum.append(sum_vec)

print ('Shape is: %d x %d' % (len(token_vecs_sum), len(token_vecs_sum[0])))

Shape is: 22 x 768


In [None]:
# `hidden_states` has shape [13 x 1 x 22 x 768]

# `token_vecs` is a tensor with shape [22 x 768]
token_vecs = hidden_states[-2][0]

# Calculate the average of all 22 token vectors.
sentence_embedding = torch.mean(token_vecs, dim=0)

print ("Our final sentence embedding vector of shape:", sentence_embedding.size())

Our final sentence embedding vector of shape: torch.Size([768])
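
The two pooling steps used so far (summing the last four layers of one token, and averaging the token vectors of one layer) can be sketched with plain lists. The helper names `sum_last_layers` and `mean_pool` and all numbers below are made up for illustration; the notebook itself performs these steps with `torch.sum` and `torch.mean`.

```python
def sum_last_layers(token_layers, k=4):
    """Element-wise sum of the last k layer vectors of one token."""
    return [sum(values) for values in zip(*token_layers[-k:])]

def mean_pool(token_vectors):
    """Element-wise average over all token vectors of one layer."""
    n = len(token_vectors)
    return [sum(values) / n for values in zip(*token_vectors)]

# One token with 5 layers of 2-dimensional vectors (made-up numbers).
token_layers = [[9.0, 9.0], [1.0, 2.0], [1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
word_vector = sum_last_layers(token_layers)
print(word_vector)

# One layer with 3 token vectors (made-up numbers).
layer_vectors = [[1.0, 0.0], [2.0, 3.0], [3.0, 6.0]]
sentence_vector = mean_pool(layer_vectors)
print(sentence_vector)
```

Summing layers yields one vector per token (a word-level embedding), whereas averaging over tokens yields a single vector for the whole sentence.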


### Confirming contextually dependent vectors

To confirm that the values of these vectors are in fact contextually dependent, let's look at the different instances of the word "bank" in our example sentence:

"After stealing money from the **bank vault**, the **bank robber** was seen fishing on the Mississippi **river bank**."

Let's find the index of those three instances of the word "bank" in the example sentence.

In [None]:
for i, token_str in enumerate(tokenized_text):
  print (i, token_str)

0 [CLS]
1 after
2 stealing
3 money
4 from
5 the
6 bank
7 vault
8 ,
9 the
10 bank
11 robber
12 was
13 seen
14 fishing
15 on
16 the
17 mississippi
18 river
19 bank
20 .
21 [SEP]


In [None]:
print('First 5 out of 768 vector values for each instance of "bank".')
print('')
print("bank vault   ", str(token_vecs_sum[6][:5]))
print("bank robber  ", str(token_vecs_sum[10][:5]))
print("river bank   ", str(token_vecs_sum[19][:5]))

First 5 out of 768 vector values for each instance of "bank".

bank vault    tensor([ 3.3596, -2.9805, -1.5421,  0.7065,  2.0031])
bank robber   tensor([ 2.7359, -2.5577, -1.3094,  0.6797,  1.6633])
river bank    tensor([ 1.5266, -0.8895, -0.5152, -0.9298,  2.8334])


In [None]:
from scipy.spatial.distance import cosine

# Calculate the cosine similarity between the word bank
# in "bank robber" vs "river bank" (different meanings).
diff_bank = 1 - cosine(token_vecs_sum[10], token_vecs_sum[19])

# Calculate the cosine similarity between the word bank
# in "bank robber" vs "bank vault" (same meaning).
same_bank = 1 - cosine(token_vecs_sum[10], token_vecs_sum[6])

print('Vector similarity for  *similar*  meanings:  %.2f' % same_bank)
print('Vector similarity for *different* meanings:  %.2f' % diff_bank)

Vector similarity for  *similar*  meanings:  0.94
Vector similarity for *different* meanings:  0.69
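
SciPy's `cosine` returns the cosine *distance*, which is why the cell above computes `1 - cosine(...)` to obtain a similarity. As a minimal sketch, the similarity itself is the dot product divided by the product of the two vector norms; the function `cosine_similarity` and the short test vectors below are illustrative only.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

parallel = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(round(parallel, 6), round(orthogonal, 6), round(opposite, 6))
```

Vectors pointing in the same direction score close to 1, orthogonal vectors score 0, and opposite vectors score -1, regardless of their lengths.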


In [None]:
def test_sentence_embedding(sentence, tkzer, bert_model):
  # Add the special tokens.
  marked_text = "[CLS] " + sentence + " [SEP]"

  # Split the sentence into tokens.
  tokenized_text = tkzer.tokenize(marked_text)

  # Map the token strings to their vocabulary indices.
  indexed_tokens = tkzer.convert_tokens_to_ids(tokenized_text)

  # Mark every token as belonging to sentence "1".
  segments_ids = [1] * len(tokenized_text)

  # Convert inputs to PyTorch tensors
  tokens_tensor = torch.tensor([indexed_tokens])
  segments_tensors = torch.tensor([segments_ids])

  # Run the text through BERT, and collect all of the hidden states produced
  # by the embedding layer and the encoder layers.
  with torch.no_grad():

      # Note: the second positional argument of `BertModel.forward` is
      # `attention_mask`, so the all-ones tensor acts as an attention mask.
      outputs = bert_model(tokens_tensor, segments_tensors)

      # The number of objects the model returns depends on how it was
      # configured in the `from_pretrained` call. Because the models here are
      # loaded with `output_hidden_states = True`, the third item holds the
      # hidden states from all layers. See the documentation for more details:
      # https://huggingface.co/transformers/model_doc/bert.html#bertmodel
      hidden_states = outputs[2]

  # `hidden_states` holds one [1 x n_tokens x 768] tensor per layer.

  # `token_vecs` is the second-to-last layer, with shape [n_tokens x 768].
  token_vecs = hidden_states[-2][0]

  # Calculate the average of all token vectors.
  sentence_embedding = torch.mean(token_vecs, dim=0)

  return sentence_embedding

In [None]:
s1 = "The river bank was full of grass and flowers."
s2 = "The riverside was abundant with grasses and blossoms."
s3 = "The computer was consuming a lot of energy."
s4 = "La riva del fiume era piena di erba e fiori."

s1e = test_sentence_embedding(s1, tokenizer, model)
s2e = test_sentence_embedding(s2, tokenizer, model)
s3e = test_sentence_embedding(s3, tokenizer, model)
s4e = test_sentence_embedding(s4, tokenizer, model)


# Compare S1 and S3 different meanings
diff_mean = 1 - cosine(s1e, s3e)


# Compare S1 and S2 similar meanings
same_mean = 1 - cosine(s1e, s2e)

# Compare S1 and S4 translation
trans_mean = 1 - cosine(s1e, s4e)



print('Vector similarity for  *DIFFERENT*  meanings:  %.2f' % diff_mean)
print('Vector similarity for *SIMILAR* meanings:  %.2f' % same_mean)
print('Vector similarity for *TRANSLATION* meaning:  %.2f' % trans_mean)

Vector similarity for  *DIFFERENT*  meanings:  0.62
Vector similarity for *SIMILAR* meanings:  0.95
Vector similarity for *TRANSLATION* meaning:  0.43


## Different model

In [None]:
# Load the multilingual tokenizer and model (BertTokenizer and BertModel are already imported above).
ml_tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-uncased')
ml_model = BertModel.from_pretrained("bert-base-multilingual-uncased",
                                     output_hidden_states = True # Whether the model returns all hidden-states.
                                     )

In [None]:
s1 = "The river bank was full of grass and flowers."
s2 = "The riverside was abundant with grasses and blossoms."
s3 = "The computer was consuming a lot of energy."
s4 = "La riva del fiume era piena di erba e fiori."
s5 = "La rive du fleuve était pleine d'herbe et de fleurs."
s6 = "Oggi non è Lunedì, è Venerdì."

s1e = test_sentence_embedding(s1, ml_tokenizer, ml_model)
s2e = test_sentence_embedding(s2, ml_tokenizer, ml_model)
s3e = test_sentence_embedding(s3, ml_tokenizer, ml_model)
s4e = test_sentence_embedding(s4, ml_tokenizer, ml_model)
s5e = test_sentence_embedding(s5, ml_tokenizer, ml_model)
s6e = test_sentence_embedding(s6, ml_tokenizer, ml_model)



# Compare S1 and S2 similar meanings
same_mean = 1 - cosine(s1e, s2e)


# Compare S1 and S3 different meanings
diff_mean = 1 - cosine(s1e, s3e)


# Compare S1 and S4 translation
trans_mean1 = 1 - cosine(s1e, s4e)

# Compare S1 and S5 translation
trans_mean2 = 1 - cosine(s1e, s5e)

# Compare S1 and S6 (an unrelated Italian sentence, i.e. a wrong translation)
trans_mean3 = 1 - cosine(s1e, s6e)



print('Vector similarity for  *DIFFERENT*  meanings:  %.2f' % diff_mean)
print('Vector similarity for *SIMILAR* meanings:  %.2f' % same_mean)
print('Vector similarity for *TRANSLATION IT* meaning:  %.2f' % trans_mean1)
print('Vector similarity for *TRANSLATION FR* meaning:  %.2f' % trans_mean2)
print('Vector similarity for *TRANSLATION IT WRONG* meaning:  %.2f' % trans_mean3)

Vector similarity for  *DIFFERENT*  meanings:  0.91
Vector similarity for *SIMILAR* meanings:  0.97
Vector similarity for *TRANSLATION IT* meaning:  0.94
Vector similarity for *TRANSLATION FR* meaning:  0.93
Vector similarity for *TRANSLATION IT WRONG* meaning:  0.84
