As a first step, let's take a closer look at the `emilyalsentzer/Bio_ClinicalBERT` model from Hugging Face (HF), a clinical language model that we will use as a feature extractor. The model was pre-trained on clinical text from the MIMIC-III v1.4 database.

- model card: https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT
- dataset paper (MIMIC-III, the pre-training data): https://www.nature.com/articles/sdata201635
- model paper: https://arxiv.org/abs/1904.03323 

In [7]:
# for auto reload when changes are made in the package
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [69]:
from transformers import AutoModel, AutoTokenizer
from helpers import show_tokenization, search_through_vocab_dict_using_id, search_through_vocab_dict_using_token

In [11]:
model_card = 'emilyalsentzer/Bio_ClinicalBERT'

In [26]:
tokenizer = AutoTokenizer.from_pretrained(model_card)
model = AutoModel.from_pretrained(model_card)

In [25]:
tokenizer

BertTokenizerFast(name_or_path='emilyalsentzer/Bio_ClinicalBERT', vocab_size=28996, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [37]:
model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(28996, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

#### How many tokens does the tokenizer's vocabulary hold, and how large is each token's embedding (the embedding matrix)?

<img style="display: block;margin-left: auto;margin-right: auto;" src="https://miro.medium.com/v2/resize:fit:828/format:webp/0*luNBhHsLBbjMSHew.png" alt="image info" />


In [36]:
# number of unique tokens (words, subwords, or characters) that the tokenizer is capable of recognizing and mapping to an index in its vocabulary
tokenizer_vocab_size = tokenizer.vocab_size
print(f"the vocab size is: {tokenizer_vocab_size}") 

# extracting the embedding dimension for each token
embedding_dim = model.config.hidden_size
print(f"the embedding dimension is: {embedding_dim}")

# (each token in the vocabulary is represented by a vector with 768 elements)
print(f"the size of the embedding matrix is therefore {tokenizer_vocab_size} × {embedding_dim}")

the vocab size is: 28996
the embedding dimension is: 768
the size of the embedding matrix is therefore 28996 × 768
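
To make that shape concrete, here is a minimal sketch of an embedding lookup that uses plain Python lists in place of the model's real weight tensor. The tiny 5 × 4 matrix, its values, and the `lookup` helper are illustrative assumptions and are not part of `transformers`; only the 28996 and 768 figures come from the outputs above.

```python
# Toy illustration of an embedding matrix: one row per vocabulary entry,
# one column per embedding dimension. All values below are made up.
vocab_size, embedding_dim = 28996, 768

# Number of learnable weights in the word-embedding table alone
num_weights = vocab_size * embedding_dim
print(f"word-embedding weights: {num_weights:,}")

# A tiny stand-in matrix (5 tokens x 4 dimensions) so the lookup is easy to see
toy_matrix = [[round(0.1 * row + 0.01 * col, 2) for col in range(4)] for row in range(5)]

def lookup(matrix, token_ids):
    """Return the embedding row for each token id (a plain index operation)."""
    return [matrix[i] for i in token_ids]

# The same id always maps to the same row, wherever it appears in the input
toy_vectors = lookup(toy_matrix, [3, 0, 3])
print(toy_vectors)
```

The word-embedding table alone therefore holds roughly 22.3 million weights, and turning an id into a vector is nothing more than selecting a row.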


##### Example 1

In [54]:
text = 'Patient has no history of fatigue, weight change, loss of appetite, or weakness.'
inputs = tokenizer(text, return_tensors="pt")

In [64]:
inputs

{'input_ids': tensor([[  101,  5351,  1144,  1185,  1607,  1104, 18418,   117,  2841,  1849,
           117,  2445,  1104, 21518,   117,  1137, 11477,   119,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [52]:
# show each token and its respective id
show_tokenization(inputs, tokenizer)

| | id | token |
|---|---|---|
| 0 | tensor(101) | [CLS] |
| 1 | tensor(5351) | patient |
| 2 | tensor(1144) | has |
| 3 | tensor(1185) | no |
| 4 | tensor(1607) | history |
| 5 | tensor(1104) | of |
| 6 | tensor(18418) | fatigue |
| 7 | tensor(117) | , |
| 8 | tensor(2841) | weight |
| 9 | tensor(1849) | change |


What does the table above mean? Each token is paired with an id, which is its index in the tokenizer's vocabulary. <br>
In other words, the tokenizer represents the word `history` as the number `1607`, because the model operates on numbers rather than text.

There are a few ways to verify this mapping:

In [61]:
# here are some examples of the vocabularies and their paired ids from the tokenizer
vocab_dict = tokenizer.get_vocab()
print("first few tokens:", list(vocab_dict.items())[:10])

first few tokens: [('Jersey', 3308), ('Bridges', 17501), ('Beth', 6452), ('Schumacher', 24656), ('calf', 23256), ('app', 12647), ('develops', 11926), ('##quisition', 18540), ('##ination', 9400), ('##zon', 9515)]


In [87]:
# if we want to decode 1607 for the word history, we can do the following
print(tokenizer.decode(1607))

# or search through all of the vocabularies based on the specific id
print(search_through_vocab_dict_using_id(tokenizer, 1607))

# or search using the token text value 
search_through_vocab_dict_using_token(tokenizer, 'history')

history
history


1607
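
The `helpers` module is not shown in this notebook, so the following is only a sketch of how such lookups could work on the dictionary returned by `tokenizer.get_vocab()`. The function names `find_token_by_id` and `find_id_by_token` and the five-entry `toy_vocab` are hypothetical; the ids are copied from the Example 1 table.

```python
# A small slice of the vocabulary, copied from the Example 1 table.
# The real dictionary from tokenizer.get_vocab() has 28996 entries.
toy_vocab = {"[CLS]": 101, "patient": 5351, "has": 1144, "no": 1185, "history": 1607}

NOT_FOUND = "Token not found in the vocabulary."

def find_id_by_token(vocab, token):
    """Forward lookup: token text -> id (a direct dictionary access)."""
    return vocab.get(token, NOT_FOUND)

def find_token_by_id(vocab, token_id):
    """Reverse lookup: id -> token text, by scanning the dictionary's items."""
    for token, idx in vocab.items():
        if idx == token_id:
            return token
    return NOT_FOUND

print(find_token_by_id(toy_vocab, 1607))
print(find_id_by_token(toy_vocab, "history"))
print(find_id_by_token(toy_vocab, "malignant"))
```

The reverse lookup scans every entry, so for repeated queries it is cheaper to invert the dictionary once. `transformers` tokenizers also provide `convert_ids_to_tokens` and `convert_tokens_to_ids` for the same purpose.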

##### Example 2

In [67]:
text = 'A malignant mass with oval shape is visible on the right view of the mammogram.'
inputs = tokenizer(text, return_tensors="pt")
print(inputs)
show_tokenization(inputs, tokenizer)

{'input_ids': tensor([[  101,   170, 12477,  2646, 15454,  3367,  1114, 13102,  3571,  1110,
          5085,  1113,  1103,  1268,  2458,  1104,  1103, 12477,  6262, 28012,
           119,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


| | id | token |
|---|---|---|
| 0 | tensor(101) | [CLS] |
| 1 | tensor(170) | a |
| 2 | tensor(12477) | ma |
| 3 | tensor(2646) | ##li |
| 4 | tensor(15454) | ##gnant |
| 5 | tensor(3367) | mass |
| 6 | tensor(1114) | with |
| 7 | tensor(13102) | oval |
| 8 | tensor(3571) | shape |
| 9 | tensor(1110) | is |


Here we can see that some words are not in the tokenizer's vocabulary as whole words. The tokenizer cannot represent `malignant` or `mammogram` with a single token, so it splits each of them into smaller pieces that are in the vocabulary. This process is called `subword tokenization`; the `##` prefix marks a piece that continues the preceding one. <br><br> We can verify this below by looking at the text of each token.

In [68]:
tokenizer.tokenize(text)

['a',
 'ma',
 '##li',
 '##gnant',
 'mass',
 'with',
 'oval',
 'shape',
 'is',
 'visible',
 'on',
 'the',
 'right',
 'view',
 'of',
 'the',
 'ma',
 '##mm',
 '##ogram',
 '.']

This is definitely a challenge for medical data, whose terminology may fall outside the domain of the data the vocabulary was built from. The vocabulary does not contain the whole word `malignant`, but it does contain the subwords `ma`, `##li`, and `##gnant`, so the tokenizer can still represent and process the word. <br><br> We can verify this as well:

In [85]:
search_through_vocab_dict_using_token(tokenizer, 'malignant')

'Token not found in the vocabulary.'
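
BERT-style tokenizers use WordPiece, which splits a word greedily into the longest vocabulary pieces it can find. The sketch below is a simplified version of that idea, not the `transformers` implementation; the `wordpiece_split` function and the seven-entry `toy_vocab` are hypothetical and contain only the pieces seen in Example 2.

```python
# Toy vocabulary: only the pieces needed for this illustration.
# "##" marks a piece that continues a word rather than starting one.
toy_vocab = {"a", "mass", "ma", "##li", "##gnant", "##mm", "##ogram"}

def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

for word in ["mass", "malignant", "mammogram"]:
    print(word, "->", wordpiece_split(word, toy_vocab))
```

With this toy vocabulary, `mass` stays whole because the longest match wins over `ma`, while `malignant` and `mammogram` break into the same pieces shown in the tokenizer output above. A word with no matching piece at some position falls back to `[UNK]`.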