# BERT and Related Models - Embeddings

(C) 2024 by [Damir Cavar](http://damir.cavar.me/)

**Download:** This and various other Jupyter notebooks are available from my [GitHub repo](https://github.com/dcavar/python-tutorial-for-ipython).

This code pulls embeddings for words or text from BERT.

Prerequisites:
You will have to install the `transformers` Python module. The code below also imports `torch` and `numpy`, so these must be installed as well.

In [None]:
!pip install -U transformers

We will need to import `torch` (PyTorch) and `transformers`.

In [2]:
import random
import torch
from transformers import BertTokenizer, BertModel
from numpy import dot

We'll seed the random number generators of Python and PyTorch for reproducibility. If CUDA is available, the generators on all GPUs are seeded as well.

In [3]:
random_seed = 42
random.seed(random_seed)
torch.manual_seed(random_seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(random_seed)

The following function returns the embedding of a text, which may be a single word, as a list of floats. It averages the last-layer vectors of all tokens of the text (average pooling).

In [4]:
def get_embedding(text: str, tokenizer, model) -> list[float]:
    """Return the average-pooled last-layer embedding of text as a list of floats."""
    # Encode the text as a batch of one; special tokens ([CLS], [SEP]) are added.
    example_encoding = tokenizer.batch_encode_plus(
        [ text ],
        padding            = True,
        truncation         = True,
        return_tensors     = 'pt',
        add_special_tokens = True
    )
    example_input_ids = example_encoding['input_ids']
    example_attention_mask = example_encoding['attention_mask']
    with torch.no_grad():  # inference only, no gradients needed
        example_outputs = model(example_input_ids, attention_mask=example_attention_mask)
        example_embedding = example_outputs.last_hidden_state.mean(dim=1) # Average pooling over all tokens
    return example_embedding[0].tolist()
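
The function above encodes one text at a time, so no padding occurs and a plain `mean(dim=1)` averages over exactly the tokens of that text, including `[CLS]` and `[SEP]`. If several texts of different lengths were encoded in one batch, the vectors at padded positions would enter the average too. The following is a minimal sketch of mask-weighted pooling under that assumption; `masked_mean_pool` and the toy tensors are hypothetical names for illustration and are not part of the notebook.

```python
import torch

def masked_mean_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors, ignoring positions where attention_mask is 0.

    hidden:         (batch, seq_len, dim) token vectors, e.g. last_hidden_state
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)  # (batch, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)                    # (batch, dim)
    counts = mask.sum(dim=1).clamp(min=1.0)                # (batch, 1), avoids division by zero
    return summed / counts

# Toy tensors standing in for model output: 2 sequences, 3 positions, 2 dimensions.
toy_hidden = torch.tensor([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third position is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],      # no padding
])
toy_mask = torch.tensor([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(toy_hidden, toy_mask)
print(pooled)
```

In `get_embedding`, the equivalent call would take `example_outputs.last_hidden_state` and `example_attention_mask` as arguments. For a single text the result is the same as the plain mean.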

In the following code segment we initialize the BERT tokenizer and load the model:

In [5]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model     = BertModel.from_pretrained('bert-base-uncased')
vector_length = len(get_embedding("test", tokenizer, model))

This is the list of words for which we want to pull embeddings:

In [6]:
word_list = list(set("""
apple banana cherry date elderberry fig grapefruit honeydew
car airplane train bus bicycle motorcycle boat ship
dog cat rabbit hamster parrot goldfish
""".split()))
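
Because the list is built from a `set`, the order of the words is not fixed: string hashing is randomized per Python session by default, so the words may be printed in a different order on each run. If a stable order is wanted, one option is to remove duplicates with `dict.fromkeys`, which keeps the first-seen order. This is a sketch of an alternative, not the notebook's approach; `raw_words` and `ordered_word_list` are hypothetical names.

```python
raw_words = """
apple banana cherry date elderberry fig grapefruit honeydew
car airplane train bus bicycle motorcycle boat ship
dog cat rabbit hamster parrot goldfish
""".split()

# dict.fromkeys drops duplicates and keeps insertion order (Python 3.7+).
ordered_word_list = list(dict.fromkeys(raw_words))
print(ordered_word_list[:3])
```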

Pull embeddings for each word:

In [8]:
for word in word_list:
    vector = get_embedding(word, tokenizer, model)
    print(word, len(vector), vector)

car 768 [0.4418266713619232, 0.014356344938278198, -0.08874145150184631, 0.14305876195430756, -0.1058681309223175, -0.16398702561855316, 0.17393989861011505, -0.07295441627502441, 0.11597371101379395, -0.23322458565235138, 0.1699400544166565, -0.13494233787059784, 0.22166548669338226, 0.20501254498958588, -0.22787950932979584, -0.25944387912750244, 0.02095557563006878, 0.49590834975242615, 0.16833575069904327, 0.03698566555976868, 0.08673761039972305, -0.17832374572753906, 0.11849794536828995, -0.08503928035497665, 0.054103489965200424, 0.30846744775772095, -0.09469827264547348, -0.012507428415119648, 0.022822966799139977, -0.3175179660320282, -0.008021880872547626, -0.17893286049365997, -0.10547595471143723, 0.5247851014137268, -0.11582878232002258, -0.37133362889289856, 0.1536560356616974, -0.26614469289779663, -0.1305670291185379, 0.05255204439163208, 0.32447195053100586, -0.029834752902388573, -0.011087954044342041, -0.21899373829364777, 0.2313397377729416, 0.22769267857074738, -0.
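
Each vector is a plain list of 768 floats, so two words can be compared directly, for example with cosine similarity. The following is a minimal sketch in pure Python; `cosine_similarity` is a hypothetical helper, and the toy vectors only stand in for real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors given as sequences of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot_product = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot_product / (norm_a * norm_b)

# Toy vectors for illustration only; real inputs would be the lists returned by get_embedding.
toy_a = [1.0, 0.0, 1.0]
toy_b = [1.0, 0.0, 0.0]
toy_c = [0.0, 1.0, 0.0]

print(cosine_similarity(toy_a, toy_b))
print(cosine_similarity(toy_a, toy_c))
```

With the notebook's objects in scope, `cosine_similarity(get_embedding("dog", tokenizer, model), get_embedding("cat", tokenizer, model))` would compare two of the words from the list.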
