# BERT and Related Models - Embeddings

(C) 2024 by [Damir Cavar](http://damir.cavar.me/)

This notebook pulls embeddings for single words or longer texts from a pre-trained BERT model.

Prerequisites:
You will have to install the `transformers` Python module. The code also requires PyTorch (`torch`).

In [None]:
!pip install -U transformers

We need to import `torch` (PyTorch) and the BERT tokenizer and model classes from `transformers`.

In [None]:
import random
import torch
from transformers import BertTokenizer, BertModel
from numpy import dot

We seed the random number generators of Python and PyTorch for reproducibility. If CUDA is available, we also seed the generators on all GPUs.

In [2]:
random_seed = 42
random.seed(random_seed)
torch.manual_seed(random_seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(random_seed)
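
As a minimal sketch of why seeding matters, the following uses only Python's standard `random` module. The helper `draw` is hypothetical and not part of the notebook. It creates its own generator object, so the global seed set above stays untouched.

```python
import random

def draw(seed: int, n: int = 3) -> list:
    """Draw n pseudo-random floats from a private generator (hypothetical helper)."""
    rng = random.Random(seed)  # separate generator, does not disturb random.seed()
    return [rng.random() for _ in range(n)]

first = draw(42)
second = draw(42)  # same seed: the sequence is reproduced
other = draw(43)   # different seed: a different sequence

print(first == second)
print(first == other)
```

The same idea applies to `torch.manual_seed`: any operation that draws random numbers becomes repeatable when the seed is fixed.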

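The following function returns the embedding of a text, which can be a single word. It averages the last hidden layer over all tokens, including the special `[CLS]` and `[SEP]` tokens, and returns the result as a plain Python list of floats.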

In [3]:
def get_embedding(text: str, tokenizer, model) -> list[float]:
	# Tokenize the text as a batch of one, adding [CLS] and [SEP].
	example_encoding = tokenizer.batch_encode_plus(
		[ text ],
		padding            = True,
		truncation         = True,
		return_tensors     = 'pt',
		add_special_tokens = True
	)
	example_input_ids = example_encoding['input_ids']
	example_attention_mask = example_encoding['attention_mask']
	# Inference only, so no gradients are needed.
	with torch.no_grad():
		example_outputs = model(example_input_ids, attention_mask=example_attention_mask)
		# Average pooling over the token dimension; a batch of one contains no padding.
		example_embedding = example_outputs.last_hidden_state.mean(dim=1)
	# Return the vector of the only batch element as a list of floats.
	return example_embedding[0].tolist()
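
The average pooling step can be illustrated on toy data without loading a model. The sketch below uses plain Python lists in place of tensors; `mean_pool` and the toy values are hypothetical. The masked variant is an assumption that goes beyond the notebook: it would only matter if several texts of different lengths were batched and padded, which `get_embedding` does not do.

```python
def mean_pool(token_vectors, attention_mask=None):
    """Average token vectors; if a mask is given, skip positions marked 0."""
    if attention_mask is None:
        attention_mask = [1] * len(token_vectors)
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("no tokens left to average")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Three "tokens" with two dimensions each; the last one plays the role of padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]

plain = mean_pool(tokens)              # averages all three positions
masked = mean_pool(tokens, [1, 1, 0])  # ignores the padded position

print(plain)
print(masked)
```

Without the mask, the padded position pulls the average toward zero. That is why pooling over padded batches usually weights by the attention mask.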

In the following code segment we initialize the BERT tokenizer, load the model, and determine the length of the embedding vectors (768 for `bert-base-uncased`):

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model     = BertModel.from_pretrained('bert-base-uncased')
vector_length = len(get_embedding("test", tokenizer, model))

This is the list of words for which we want to pull embeddings. Converting it to a `set` removes duplicates, but the resulting order is arbitrary:

In [5]:
word_list = list(set("""
apple banana cherry date elderberry fig grapefruit honeydew
car airplane train bus bicycle motorcycle boat ship
dog cat rabbit hamster parrot goldfish
""".split()))
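
Because a `set` does not preserve order, the word list may come out in a different order on each run. If a stable order is wanted, one option is to deduplicate with `dict.fromkeys`, which keeps the first occurrence of each word in its original position. This is sketched here as an alternative, not as what the notebook does, and the name `ordered_words` is hypothetical.

```python
text = """
apple banana cherry date elderberry fig grapefruit honeydew
car airplane train bus bicycle motorcycle boat ship
dog cat rabbit hamster parrot goldfish
"""

# dict keys are unique and keep insertion order (Python 3.7+).
ordered_words = list(dict.fromkeys(text.split()))

print(len(ordered_words))
print(ordered_words[:3])
```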

Pull embeddings for each word:

In [None]:
for word in word_list:
    vector = get_embedding(word, tokenizer, model)
    print(word, vector)
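
Once the vectors are available, a common next step is to compare them. The notebook imports `dot` from `numpy` but does not use it, so the following is only a guess at the intended use: a minimal cosine similarity sketch in plain Python. The function `cosine_similarity` and the toy vectors are hypothetical. With real data, the arguments would be lists returned by `get_embedding`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot_product = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot_product / (norm_a * norm_b)

# Toy vectors: the first pair points in the same direction, the second is orthogonal.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

print(same_direction)
print(orthogonal)
```

Values close to 1 indicate vectors pointing in a similar direction, and values close to 0 indicate unrelated directions.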
