### After running the tokens through the model and obtaining the last hidden state, it is crucial to extract the essential information from that output. This notebook walks through several common embedding pooling strategies and compares them.

In [103]:
# for auto reload when changes are made in the package
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [104]:
from transformers import AutoModel, AutoTokenizer
from helpers import show_tokenization, search_through_vocab_dict_using_id, search_through_vocab_dict_using_token, embedding_pooling

In [105]:
model_card = 'emilyalsentzer/Bio_ClinicalBERT'

In [106]:
tokenizer = AutoTokenizer.from_pretrained(model_card)
model = AutoModel.from_pretrained(model_card)

for param in model.parameters():
    param.requires_grad = False

The full pipeline is: input > tokenization > model > output from `last_hidden_state`. To make experimenting easier, the tokenizer can optionally be given `padding` and `max_length`. In that case the number of tokens equals the defined maximum length, and each padded position receives token ID 0 and an attention mask value of 0.

In [160]:
text = 'Patient has no history of fatigue, weight change, loss of appetite, or weakness.'
inputs = tokenizer(text, return_tensors="pt")  # optionally add: padding="max_length", max_length=512

print(inputs)
print(f"\ntotal number of tokens is {len(inputs['input_ids'][0])}")
# show_tokenization(inputs, tokenizer)

output = model(**inputs)['last_hidden_state'] # batch_size, sequence_length, hidden_size
print(f'last_hidden_state outputs shape: {output.shape}')

{'input_ids': tensor([[  101,  5351,  1144,  1185,  1607,  1104, 18418,   117,  2841,  1849,
           117,  2445,  1104, 21518,   117,  1137, 11477,   119,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

total number of tokens is 19
last_hidden_state outputs shape: torch.Size([1, 19, 768])
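
To make the effect of `padding="max_length"` concrete, here is a minimal pure-Python sketch of what padding does to the token IDs and the attention mask. The helper `pad_to_max_length` is hypothetical and written only for illustration; it is not part of `transformers` or of the `helpers` module. The padding ID of 0 follows the description above, and the token IDs are copied from the printed output.

```python
def pad_to_max_length(input_ids, max_length, pad_id=0):
    """Right-pad a list of token IDs and build the matching attention mask.

    Hypothetical helper for illustration only; the real tokenizer does this
    internally when called with padding="max_length".
    """
    n_pad = max_length - len(input_ids)
    if n_pad < 0:
        raise ValueError("sequence is longer than max_length")
    padded_ids = input_ids + [pad_id] * n_pad
    # 1 marks a real token, 0 marks a padded position
    attention_mask = [1] * len(input_ids) + [0] * n_pad
    return padded_ids, attention_mask


# The 19 token IDs printed above for the example sentence
example_ids = [101, 5351, 1144, 1185, 1607, 1104, 18418, 117, 2841, 1849,
               117, 2445, 1104, 21518, 117, 1137, 11477, 119, 102]

padded_ids, attention_mask = pad_to_max_length(example_ids, max_length=24)
print(len(padded_ids), sum(attention_mask))
```

With padding enabled, the sequence dimension of `last_hidden_state` grows to `max_length`, which is why the attention mask matters for the pooling steps that follow.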


To derive a single embedding from an LLM, you typically pool the hidden states with a strategy such as averaging the embeddings of all tokens (mean pooling), taking the [CLS] token’s embedding, or taking the element-wise maximum over tokens (max pooling). The right choice often depends on the task and the model design. The attention mask is used during pooling to exclude padding tokens; it is not needed for [CLS] pooling, which reads only the first position.

In [161]:
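# pass the attention mask here too, so all four calls use the same argument order
cls_pooling = embedding_pooling(output, inputs['attention_mask'], 'cls')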
eos_pooling = embedding_pooling(output, inputs['attention_mask'], 'eos')
max_pooling = embedding_pooling(output, inputs['attention_mask'], 'max')
mean_pooling = embedding_pooling(output, inputs['attention_mask'], 'mean')
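
The implementation of `embedding_pooling` lives in the `helpers` module and is not shown here. As a rough guide to what the four strategies compute, the following is a minimal PyTorch sketch under stated assumptions: the function name `pool_embeddings` is hypothetical, padding is assumed to be on the right, and "eos" is taken to mean the last non-padded token (for BERT-style models, the [SEP] token). The actual helper may differ.

```python
import torch


def pool_embeddings(hidden, attention_mask, strategy="mean"):
    """Pool (batch, seq_len, hidden_size) states into (batch, hidden_size).

    Hypothetical sketch, not the notebook's `embedding_pooling` helper.
    attention_mask is a (batch, seq_len) integer tensor of 0s and 1s.
    """
    # (batch, seq_len, 1) so it broadcasts over the hidden dimension
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)

    if strategy == "cls":
        # first position only; the mask is not needed
        return hidden[:, 0]
    if strategy == "eos":
        # index of the last real token, assuming right padding
        last = attention_mask.sum(dim=1) - 1
        batch_idx = torch.arange(hidden.size(0), device=hidden.device)
        return hidden[batch_idx, last]
    if strategy == "max":
        # padded positions are set to -inf so they can never win the max
        masked = hidden.masked_fill(mask == 0, float("-inf"))
        return masked.max(dim=1).values
    if strategy == "mean":
        # sum over real tokens only, then divide by the real token count
        summed = (hidden * mask).sum(dim=1)
        counts = mask.sum(dim=1).clamp(min=1e-9)
        return summed / counts
    raise ValueError(f"unknown pooling strategy: {strategy}")
```

Each call returns one vector per input sequence, so for the example above every strategy would yield a tensor of shape `[1, 768]`. Masking matters only once padding is enabled; without padding, every position is a real token.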