# Visualising Embeddings with BERT

BERT is one of the most influential machine learning models. It is a pre-trained language model developed by Google that can be fine-tuned for a wide range of natural language processing tasks. BERT uses a transformer architecture and is trained on massive amounts of text data, allowing it to understand the context and meaning of words in a sentence.

You don't need to understand BERT in detail at this point. For now, the key points are:

- BERT stands for Bidirectional Encoder Representations from Transformers.
- It is pre-trained to fill in missing (masked) words in text.
- It contains the word embeddings within its first layer's parameters. These BERT embeddings are widely used as a good starting point for word embeddings.

We will talk about BERT more later, but we can already start using it.

In [6]:
#@title # Run the following cell to install the necessary libraries for this practical. { display-mode: "form" } 
#@markdown Don't worry about what's in this collapsed cell

%pip install -q transformers
import numpy as np

Note: you may need to restart the kernel to use updated packages.


In [3]:
# Run this cell to import the BERT model
from transformers import BertModel

model_name = 'bert-base-uncased'
model: BertModel = BertModel.from_pretrained(model_name)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


You can see the layers that the model contains by printing its `modules` attribute (a bound method whose printed representation includes the full module tree). Run the code in the cell below and look at the output.
- Can you find out how many embeddings the `BERT` model contains?
- Can you find the size of the output layer? 

In [4]:
print(model.modules)

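# How many embeddings does BERT contain?
# 30522 word embeddings (plus 512 position and 2 token-type embeddings), each of size 768.
# Parameters in the three embedding tables: 30522 * 768 + 512 * 768 + 2 * 768

# What is the size of the output layer?
# Output: 768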


<bound method Module.modules of BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(

You can see that the very first layer is the embedding layer. The parameters of its `word_embeddings` table are the embeddings for the 30,522 tokens (words and word fragments) in BERT's vocabulary.
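As a quick check on the printed module summary, the sizes of the three embedding tables can be multiplied out. The sketch below hard-codes the shapes shown in the output above rather than reading them from the model, so it runs without downloading anything. The names `embedding_tables` and `param_counts` are our own.

```python
# Shapes copied from the printed BertEmbeddings summary above: (rows, dimensions).
embedding_tables = {
    "word_embeddings": (30522, 768),
    "position_embeddings": (512, 768),
    "token_type_embeddings": (2, 768),
}

# Each table holds rows * dimensions parameters.
param_counts = {name: rows * dim for name, (rows, dim) in embedding_tables.items()}
total_params = sum(param_counts.values())

for name, count in param_counts.items():
    print(f"{name}: {count:,} parameters")
print(f"Total across the three tables: {total_params:,}")
```

The word embedding table dominates the count, which is why it is the one we extract and visualise in the rest of this section.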

Now, let's get those embeddings. In the code block below:

- Create a variable called `n_embeddings`, equal to the number of embeddings that you found in the model's `modules` attribute.
- Add a statement to print the shape of the embedding matrix.

In [14]:
embedding_weights = list(model.embeddings.word_embeddings.weight.shape)
n_weights = np.array(embedding_weights).prod()

n_embeddings = 30000
embedding_matrix = model.embeddings.word_embeddings.weight.detach()
embedding_matrix = embedding_matrix[:n_embeddings]

print(f"Number of weights: {n_weights}")
print(f"Embeddings weights shape: {model.embeddings.word_embeddings.weight.shape}")


Number of weights: 23440896
Embeddings weights shape: torch.Size([30522, 768])


Now we have the embedding matrix, but we don't know which word each of those embeddings corresponds to. This is where we need to use the vocab to map from the index of the word (its row in the embedding matrix) to the word itself.
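The idea of mapping a row index to a word can be illustrated with a toy example. This is only a sketch: `toy_vocab` and `toy_embeddings` are made-up stand-ins with 4 entries and 3 dimensions, not BERT's real vocabulary or weights.

```python
import numpy as np

# Hypothetical toy vocabulary: position in the list is the token's index.
toy_vocab = ["[PAD]", "hello", "world", "and"]

# Toy embedding matrix: one row per vocabulary entry, 3 dimensions each.
toy_embeddings = np.arange(len(toy_vocab) * 3, dtype=float).reshape(len(toy_vocab), 3)

# Index -> word and word -> index mappings built from the vocabulary.
index_to_word = dict(enumerate(toy_vocab))
word_to_index = {word: index for index, word in index_to_word.items()}

# Looking up a word's embedding is just selecting its row.
row = word_to_index["world"]
world_vector = toy_embeddings[row]
print(f"'world' is row {row}: {world_vector}")
```

The real vocab plays the role of `index_to_word` here, with one entry for every row of BERT's embedding matrix.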

In `HuggingFace` `transformers` models, the vocab is accessible through the tokeniser. In the same way that we loaded in a pre-trained `BERT` model, we can load in the corresponding tokeniser.

Check out the docs [here](https://huggingface.co/docs/transformers/main_classes/tokenizer).

In the code block below, we can explore the tokenisation process:

- Define a variable called `sentence`, consisting of a sentence of your choice inside a string.
- Encode the sentence using the tokeniser's `encode` method, assigning the output to a variable called `tokens`.
- Print the tokens.
- Print the number of words in your original sentence.
- Print the length of the `tokens` variable. How does it compare to the sentence?

In [13]:
from transformers import BertTokenizer

bert_tokenizer: BertTokenizer = BertTokenizer.from_pretrained(model_name)

sentence: str = "Hellow world and welcome"
tokens = bert_tokenizer.encode(sentence)
print(f"Tokens: {tokens}")

print(f"Number of tokens: {len(tokens)}")
print(f"Number of words: {len(sentence.split())}")  # split on whitespace to count words

print(bert_tokenizer.decode(tokens))


Tokens: [101, 7592, 2860, 2088, 1998, 6160, 102]
Number of tokens: 7
Number of words: 4
[CLS] hellow world and welcome [SEP]


We can also convert a sequence of token IDs back into text using the tokeniser's `decode` method, as in the last line of the cell above.
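To see roughly what decoding involves, here is a minimal sketch that rebuilds the example sentence from its IDs. The `id_to_token` table is read off the example output above and covers only those seven IDs. `simple_decode` is our own illustrative helper, not the library's implementation. In BERT's vocabulary, a piece that continues the previous word is written with a leading `##`.

```python
# Id -> token pairs read off the example output above (not the full BERT vocabulary).
id_to_token = {
    101: "[CLS]", 7592: "hello", 2860: "##w", 2088: "world",
    1998: "and", 6160: "welcome", 102: "[SEP]",
}

def simple_decode(token_ids):
    """Join word pieces back into text, gluing ## pieces onto the previous word."""
    words = []
    for token_id in token_ids:
        token = id_to_token[token_id]
        if token.startswith("##") and words:
            words[-1] += token[2:]  # continuation piece: append without a space
        else:
            words.append(token)
    return " ".join(words)

token_ids = [101, 7592, 2860, 2088, 1998, 6160, 102]
print(simple_decode(token_ids))
```

This also explains why the misspelt "Hellow" produced two IDs: it was split into `hello` and `##w`, which decoding glues back together.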

In the cell below:
- Create a for-loop that steps through the tokens in `tokens` and prints each on a new line.

In [None]:

# TODO Create a for-loop that steps through the tokens in `tokens` and prints each on a new line.




Depending on your sentence, you might see a variety of different token values. Most of them will be full words, but some will be word fragments. A fragment that continues the previous piece of a word is marked with a leading `##`.
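The splitting of unknown words into fragments can be sketched as a greedy longest-match-first search over the vocabulary, which is the general idea behind WordPiece tokenisation. The version below is simplified and runs on a tiny hypothetical vocabulary (`toy_vocab`). The function name `wordpiece_split` is our own, and the real tokeniser also handles lowercasing, punctuation and other details that are omitted here.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, far smaller than BERT's real one.
toy_vocab = {"hello", "world", "and", "welcome", "##w", "##s"}

for word in "hellow world and welcome".split():
    print(word, "->", wordpiece_split(word, toy_vocab))
```

Words that are in the vocabulary come back as a single piece, while "hellow" is split into a known word plus a continuation fragment.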

Regardless of the sentence you chose, you can see that it starts and ends with a pair of special tokens:

`[CLS]` is a special token that is used to represent the entire input sequence in a single vector. It is used for classification tasks, and always appears at the start of the input.

`[SEP]` is a separator token. It marks the end of a sequence, so when the input contains two sequences (for example, two sentences) the model can tell them apart and learn how they relate. With a single sentence, it simply appears at the end.
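To make the layout concrete, here is a minimal sketch of how two sentences are arranged around the special tokens, together with the segment ids that tell the model which sentence each token belongs to. The two possible segment ids correspond to the two rows of the `token_type_embeddings` table in the module printout earlier. The helper `build_pair_input` is our own illustration and works on already-tokenised lists, not raw text.

```python
def build_pair_input(tokens_a, tokens_b):
    """Lay out two token lists as [CLS] A [SEP] B [SEP] with segment ids."""
    tokens = ["[CLS]", *tokens_a, "[SEP]", *tokens_b, "[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segment_ids = build_pair_input(["hello", "world"], ["and", "welcome"])
for token, segment in zip(tokens, segment_ids):
    print(f"{segment}  {token}")
```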




So far we have examined how a sentence is split into tokens by the tokeniser, and we know that each token is mapped to an embedding. Now let's attempt to visualise these embeddings. The embeddings themselves are of very high dimensionality (768 dimensions each, equal to the size of the output layer of `BERT`), but we can attempt to visualise a 3D projection of them.

Run the code block below to label each embedding with three features and write the embeddings and labels to a TensorBoard log:
- Length
- Number of vowels
- Whether the token is a number

In [None]:
from torch.utils.tensorboard import SummaryWriter
from time import time
import tensorboard as tb
import tensorflow as tf
tf.io.gfile = tb.compat.tensorflow_stub.io.gfile

n_embeddings = 30000  # number of embeddings to visualise
embedding_matrix = embedding_matrix[:n_embeddings]


def create_embedding_labels():
    # ADD NEW COLS
    label_functions = {
        "Length": lambda word: len(word),
        "# vowels": lambda word: len([char for char in word if char in "aeiou"]),
        "is number": lambda word: word.isdigit(), # boolean label for numbers
        # "is preposition": lambda word: word in prepositions
    }
    labels = [
        [
            word,
            *[label_function(word) for label_function in label_functions.values()]
        ]
        for word in list(bert_tokenizer.ids_to_tokens.values())[:n_embeddings]
    ]

    label_names = ["Word", *list(label_functions.keys())]

    return labels, label_names


def visualise_embeddings(embeddings, labels=None, label_names="Label"):
    print("Embedding")

    writer = SummaryWriter()
    start = time()
    writer.add_embedding(
        mat=embeddings,
        metadata=labels,
        metadata_header=label_names
    )
    print(f"Total time: {time() - start:.2f}s")
    writer.close()  # flush and close the event files

    print("Embedding done")

labels, label_names = create_embedding_labels()
visualise_embeddings(embedding_matrix, labels, label_names)

Now, open TensorBoard by running the next cell:

In [None]:
%load_ext tensorboard
%tensorboard --logdir runs


_Note that this is a 3D projection of much higher-dimensional embeddings, so most information is lost when we visualise it._
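One way to get a feel for how much is lost is to run a projection by hand and measure the fraction of variance that the three retained directions capture. The sketch below uses PCA, one of the projection methods offered by the TensorBoard projector, implemented with NumPy's SVD. It runs on random stand-in data (`fake_embeddings`), so the printed figure says nothing about BERT's real embeddings. To try it on the real ones, you could substitute the embedding matrix converted to a NumPy array.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the embedding matrix: 200 random vectors with 768 dimensions.
fake_embeddings = rng.normal(size=(200, 768))

# PCA via SVD: centre the data, then project onto the top 3 principal directions.
centred = fake_embeddings - fake_embeddings.mean(axis=0)
_, singular_values, components = np.linalg.svd(centred, full_matrices=False)
projected = centred @ components[:3].T

# Fraction of total variance captured by the 3 retained directions.
variance = singular_values ** 2
retained = variance[:3].sum() / variance.sum()
print(f"Projected shape: {projected.shape}")
print(f"Variance retained by 3 components: {retained:.1%}")
```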