<a href="https://colab.research.google.com/github/coda-nsit/BERT_experiments/blob/master/BERT_word_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What does this notebook do?
I extract the embedding of every token fed to the BERT model. I also changed one config parameter, `output_hidden_states`, so that the model returns the hidden states of all layers.

Reference: https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/

The reference above uses the older `pytorch-pretrained-bert` package rather than the Transformers library. I have adapted its code to use `transformers`.


In [3]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/37/ba/dda44bbf35b071441635708a3dd568a5ca6bf29f77389f7c7c6818ae9498/transformers-2.7.0-py3-none-any.whl (544kB)
Collecting tokenizers==0.5.2
  Downloading https://files.pythonhosted.org/packages/d1/3f/73c881ea4723e43c1e9acf317cf407fab3a278daab3a69c98dcac511c04f/tokenizers-0.5.2-cp36-cp36m-manylinux1_x86_64.whl (3.7MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/a6/b4/7a41d630547a4afd58143597d5a49e07bfd4c42914d8335b2a5657efc14b/sacremoses-0.0.38.tar.gz (860kB)
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/74/f4/2d5214cbf13d06e7cb2c20d84115ca25b53ea76fa1f0ade0e3c9749de214/sentencepiece-0.1.85-cp36-cp36m-manylinux1_x86_64.whl (1.0MB)

In [0]:
import logging

import torch
%matplotlib inline
import matplotlib.pyplot as plt

from transformers import BertTokenizer, BertModel, BertForMaskedLM, BertConfig

In [0]:
# change logging to see everything that is output
logging.basicConfig(level=logging.INFO)

# Tokenization

In [9]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

INFO:transformers.tokenization_utils:loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084


## Sample tokenization
`['em', '##bed', '##ding', '##s']`: The word "embeddings" has been split into smaller subwords. The two hash signs preceding some of these subwords are the tokenizer’s way of marking that the subword continues a larger word and is preceded by another subword. So, for example, the ‘##bed’ token is separate from the ‘bed’ token; the first is used whenever the subword ‘bed’ occurs inside a larger word and the second is used when the standalone word ‘bed’ (the thing you sleep on) occurs.

BERT's tokenizer is built on a WordPiece model, which greedily creates a fixed-size vocabulary of individual characters, subwords, and words that best fits the language data it was trained on.

https://colab.research.google.com/drive/1fCKIBJ6fgWQ-f6UKs7wDTpNTL9N-Cq9X: Notebook on BERT's vocabulary.
https://youtu.be/zJW57aCBCTk: Video on BERT's vocabulary
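
To make the `##` convention concrete, here is a minimal sketch of greedy longest-match-first splitting of a single word. It is a simplified illustration, not the library's implementation: the function name `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical, and the real `bert-base-uncased` vocabulary is far larger.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        tokens.append(match)
        start = end
    return tokens


# Hypothetical toy vocabulary, chosen only for this illustration.
toy_vocab = {"em", "bed", "##bed", "##ding", "##s", "for"}

pieces = wordpiece_tokenize("embeddings", toy_vocab)
bed_alone = wordpiece_tokenize("bed", toy_vocab)
print(pieces)     # ['em', '##bed', '##ding', '##s']
print(bed_alone)  # ['bed']
```

The standalone word maps to `bed`, while the same letters inside "embeddings" map to `##bed`, which is why the two are separate vocabulary entries.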


In [10]:
text = "Here is the sentence I want embeddings for."
marked_text = "[CLS] " + text + " [SEP]"

# Tokenize our sentence with the BERT tokenizer.
tokenized_text = tokenizer.tokenize(marked_text)

# Print out the tokens.
print(tokenized_text)

['[CLS]', 'here', 'is', 'the', 'sentence', 'i', 'want', 'em', '##bed', '##ding', '##s', 'for', '.', '[SEP]']


# Prepare the text for input

In [18]:
text = "After stealing money from the bank vault, the bank robber was seen " \
       "fishing on the Mississippi river bank."

# 1. Add the special tokens
marked_text = "[CLS] " + text + " [SEP]"

# 2. Split the sentence into tokens
tokenized_text = tokenizer.tokenize(marked_text)

# 3. Map the token strings to their vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

# 4. Display the words with their indices.
for tup in zip(tokenized_text, indexed_tokens):
    print('{:<12} {:>6,}'.format(tup[0], tup[1]))

# 5. Segment IDs: 0 for sentence 1 and 1 for sentence 2.
#    The input is a single sentence, so every token gets segment ID 0.
segments_ids = [0] * len(tokenized_text)

[CLS]           101
after         2,044
stealing     11,065
money         2,769
from          2,013
the           1,996
bank          2,924
vault        11,632
,             1,010
the           1,996
bank          2,924
robber       27,307
was           2,001
seen          2,464
fishing       5,645
on            2,006
the           1,996
mississippi   5,900
river         2,314
bank          2,924
.             1,012
[SEP]           102
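
The segment ID convention from step 5 (0 for sentence 1, 1 for sentence 2) can be sketched for both the single-sentence and the sentence-pair case. This is an illustrative helper only: `build_segment_ids` is a hypothetical name, and it assumes the tokens already contain the `[CLS]` and `[SEP]` markers.

```python
def build_segment_ids(tokens, sep_token="[SEP]"):
    """Assign 0 up to and including the first [SEP], then 1 to the rest."""
    segment_ids = []
    current_segment = 0
    for token in tokens:
        segment_ids.append(current_segment)
        if token == sep_token:
            current_segment = 1  # everything after the first [SEP] is sentence 2
    return segment_ids


single = ["[CLS]", "the", "bank", "[SEP]"]
pair = ["[CLS]", "the", "bank", "[SEP]", "the", "river", "[SEP]"]

single_ids = build_segment_ids(single)
pair_ids = build_segment_ids(pair)
print(single_ids)  # [0, 0, 0, 0]
print(pair_ids)    # [0, 0, 0, 0, 1, 1, 1]
```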


## Convert the input to Tensors

In [0]:
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

# Get the embeddings

In [43]:
# Initializing a BERT bert-base-uncased style configuration
configuration = BertConfig()

# changed the default config so that the model also returns the hidden states of all layers
configuration.output_hidden_states = True

# outputting raw hidden-states without any specific head on top
model = BertModel.from_pretrained("bert-base-uncased", config=configuration)
model.eval()

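# disable gradient tracking and run the forward pass
# the model returns a tuple here: last-layer hidden states, pooled [CLS] output, all hidden states
with torch.no_grad():
  # pass the segment IDs by keyword: the second positional argument is attention_mask
  embeddings, cls, hidden_states = model(tokens_tensor, token_type_ids=segments_tensors)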

INFO:transformers.modeling_utils:loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-pytorch_model.bin from cache at /root/.cache/torch/transformers/aa1ef1aede4482d0dbcd4d52baad8ae300e60902e88fcb0bebdec09afd232066.36ca03ab34a1a5d5fa7bc3d03d55c4fa650fed07220e2eeebc06ce58d0e9a157


In [42]:
print("embeddings are:")
display(embeddings)
print("\n")

# (batch size, number of tokens, embedding vector size)
print("embeddings shape:", embeddings.shape)
print("\n")

print("input token shape:", len(tokenized_text))
print("\n")

# 12 + 1: one for the output of the embeddings + one for the output of each layer
# hidden_states[0] is the embedding output; hidden_states[i] is the output of encoder layer i
# each entry has shape: (batch_size, sequence_length, embedding vector size)
print("hidden state shape:", len(hidden_states))
print("hidden state layer 1:", hidden_states[0].shape)
print("hidden state layer 2:", hidden_states[1].shape)
print("hidden state layer 3:", hidden_states[2].shape)

embeddings are:


tensor([[[-0.4964, -0.1831, -0.5231,  ..., -0.1902,  0.3738,  0.3964],
         [-0.1323, -0.2762, -0.3495,  ..., -0.4567,  0.3786, -0.1096],
         [-0.3626, -0.4002,  0.0676,  ..., -0.3207, -0.2709, -0.3004],
         ...,
         [ 0.2961, -0.2856, -0.0382,  ..., -0.6056, -0.5163,  0.2005],
         [ 0.4878, -0.0909, -0.2358,  ..., -0.0017, -0.5945, -0.2431],
         [-0.2517, -0.3519, -0.4688,  ...,  0.2500,  0.0336, -0.2627]]])



embeddings shape: torch.Size([1, 22, 768])


input token shape: 22


hidden state shape: 13
hidden state layer 1: torch.Size([1, 22, 768])
hidden state layer 2: torch.Size([1, 22, 768])
hidden state layer 3: torch.Size([1, 22, 768])
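
Since the tuple holds 12 + 1 entries, the index into `hidden_states` is offset by one from the printed labels above: index 0 is the embedding output, not the first encoder layer. A small sketch of that mapping follows. The helper `label_hidden_states` is a hypothetical name, and it only labels positions without touching the model.

```python
def label_hidden_states(num_hidden_states):
    """Name each position of a hidden_states tuple of the given length."""
    labels = ["embedding output"]
    labels += ["encoder layer {}".format(i) for i in range(1, num_hidden_states)]
    return labels


# 13 entries for bert-base-uncased: 1 embedding output + 12 encoder layers.
labels = label_hidden_states(13)
for index, label in enumerate(labels):
    print("hidden_states[{}] -> {}".format(index, label))
```

Under this layout, `hidden_states[12]` is the output of the final encoder layer, which is the same tensor the notebook stores in `embeddings`.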
