### Continuing with the medical examples from the previous notebook, let's take a closer look at the model output and the embedding matrix.

In [1]:
# for auto reload when changes are made in the package
%load_ext autoreload
%autoreload 2

In [8]:
from transformers import AutoModel, AutoTokenizer
from helpers import show_tokenization, search_through_vocab_dict_using_id, search_through_vocab_dict_using_token

In [3]:
model_card = 'emilyalsentzer/Bio_ClinicalBERT'

In [12]:
tokenizer = AutoTokenizer.from_pretrained(model_card)
model = AutoModel.from_pretrained(model_card)

for param in model.parameters():
    param.requires_grad = False

In [7]:
model.config

BertConfig {
  "_attn_implementation_autoset": true,
  "_name_or_path": "emilyalsentzer/Bio_ClinicalBERT",
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.46.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

Using the same example as before, let's show the token matrix.

In [24]:
text = 'Patient has no history of fatigue, weight change, loss of appetite, or weakness.'
inputs = tokenizer(text, return_tensors="pt")
print(inputs)
print(f"\ntotal number of tokens is {len(inputs['input_ids'][0])}")

show_tokenization(inputs, tokenizer)

{'input_ids': tensor([[  101,  5351,  1144,  1185,  1607,  1104, 18418,   117,  2841,  1849,
           117,  2445,  1104, 21518,   117,  1137, 11477,   119,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

total number of tokens is 19


| | id | token |
|---|---|---|
| 0 | tensor(101) | [CLS] |
| 1 | tensor(5351) | patient |
| 2 | tensor(1144) | has |
| 3 | tensor(1185) | no |
| 4 | tensor(1607) | history |
| 5 | tensor(1104) | of |
| 6 | tensor(18418) | fatigue |
| 7 | tensor(117) | , |
| 8 | tensor(2841) | weight |
| 9 | tensor(1849) | change |


#### Embedding matrix

The tokenizer generated 19 tokens for the input sentence. Before sending the tokens through the model, let's access the model's embedding matrix.

In [44]:
# here we can see that the shape is as we explained before
print(model.embeddings.word_embeddings.weight.data.shape)

# each row is the embedding vector (size 768) for one token in the vocabulary
print(model.embeddings.word_embeddings.weight.data)

torch.Size([28996, 768])
tensor([[-0.0333, -0.0794, -0.0196,  ..., -0.0365, -0.0359,  0.0013],
        [ 0.0125, -0.0182, -0.0349,  ..., -0.0387, -0.0596, -0.0106],
        [-0.0384, -0.0131,  0.0037,  ..., -0.0394, -0.0423, -0.0357],
        ...,
        [-0.0045, -0.0044, -0.0520,  ..., -0.0384, -0.0762, -0.0117],
        [-0.0235,  0.0125, -0.0237,  ..., -0.0818,  0.0034, -0.0393],
        [ 0.0488, -0.0234, -0.0319,  ..., -0.0522, -0.0444, -0.0116]])


<b>The embedding matrix, with shape (`vocab_size`, `embedding_dim`), here (28996, 768), is the initial lookup table that maps token IDs (indices into the vocabulary) to their corresponding dense vector representations. It is part of the input layer of the model: the looked-up vectors are what gets passed on to the transformer layers for further processing.</b>
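To make the "lookup table" idea concrete, here is a minimal sketch using a plain `torch.nn.Embedding` with the same shape as the matrix above. The weights are random stand-ins (an assumption made so the example runs without downloading the model), so the values differ from Bio_ClinicalBERT's, but the lookup mechanics are the same.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Sizes taken from the config printed earlier; the weights are random
# stand-ins, NOT the trained Bio_ClinicalBERT embeddings.
vocab_size, hidden_size = 28996, 768
word_embeddings = nn.Embedding(vocab_size, hidden_size)

# Token IDs produced by the tokenizer for the example sentence.
input_ids = torch.tensor([[101, 5351, 1144, 1185, 1607, 1104, 18418, 117, 2841, 1849,
                           117, 2445, 1104, 21518, 117, 1137, 11477, 119, 102]])

# Calling the layer is a row lookup: one row of the matrix per token ID.
looked_up = word_embeddings(input_ids)
manual = word_embeddings.weight[input_ids]

print(looked_up.shape)                   # torch.Size([1, 19, 768])
print(torch.equal(looked_up, manual))    # True

# The comma (ID 117) sits at positions 7 and 10. A pure lookup ignores
# context, so both positions get exactly the same vector.
print(torch.equal(looked_up[0, 7], looked_up[0, 10]))  # True
```

With the real model, the equivalent lookup would be `model.embeddings.word_embeddings(inputs['input_ids'])`.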

#### Model output explanation

Let's run the tokens through the model.

In [23]:
model(**inputs).last_hidden_state.shape # batch_size, sequence_length, hidden_size

torch.Size([1, 19, 768])

The `last_hidden_state` is a 3D tensor with the shape (batch_size, sequence_length, hidden_size), where: 
- `batch_size` is the number of input sequences in the batch.
- `sequence_length` is the number of tokens in each input sequence (including special tokens like [CLS] and [SEP]).
- `hidden_size` is the size of the hidden layers, i.e. the embedding dimension (768 for this model, as shown by `hidden_size` in the config above).
<br><br>
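As a small illustration of how the three dimensions are indexed, the sketch below uses a random tensor with the same shape as the output above. The tensor is a stand-in (an assumption so the snippet runs on its own); with the real model you would index `model(**inputs).last_hidden_state` the same way.

```python
import torch

# Random stand-in with the same shape as the output above: (1, 19, 768).
last_hidden_state = torch.randn(1, 19, 768)

batch_size, sequence_length, hidden_size = last_hidden_state.shape

first_sequence = last_hidden_state[0]    # (19, 768): one vector per token
cls_vector = last_hidden_state[0, 0]     # (768,): vector at the [CLS] position
sep_vector = last_hidden_state[0, -1]    # (768,): vector at the [SEP] position

print(batch_size, sequence_length, hidden_size)
print(first_sequence.shape, cls_vector.shape, sep_vector.shape)
```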

Each element in `last_hidden_state` is a vector of length hidden_size that represents a token's contextually aware embedding. The model generates these embeddings based on the entire input sequence, so each token embedding takes into account its relationship with other tokens in the sequence. <br>

[ <br>
&nbsp;  [vector_for_[CLS]],     # Embedding for the [CLS] token <br>
&nbsp;  [vector_for_'patient'], # Embedding for the word 'patient' <br>
&nbsp;  [vector_for_'has'],     # Embedding for the word 'has' <br>
&nbsp;  [vector_for_'no'],      # Embedding for the word 'no' <br>
&nbsp;  [vector_for_'history'], # Embedding for the word 'history' <br>
&nbsp;  [vector_for_'of'],      # Embedding for the word 'of' <br>
&nbsp;  [vector_for_'fatigue'], # Embedding for the word 'fatigue' <br>
&nbsp;  [vector_for_','],       # Embedding for the token ',' <br>
&nbsp;  ...                     # one vector for each of the remaining tokens <br>
]<br>
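The difference between a static lookup and a contextual embedding can be checked directly. The sketch below builds a tiny, randomly initialised BERT from a hypothetical configuration (the sizes and token IDs are made up so the example runs quickly without downloading weights) and feeds it two sequences that share one token ID at the same position. It is an illustration under those assumptions, not a statement about Bio_ClinicalBERT's actual values.

```python
import torch
from transformers import BertConfig, BertModel

torch.manual_seed(0)

# Hypothetical tiny configuration, chosen only so the example runs fast.
# The weights are random, not trained.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)
tiny_model = BertModel(config)
tiny_model.eval()  # disable dropout so the outputs are deterministic

# Two made-up sequences that share token ID 7 at position 1.
ids_a = torch.tensor([[1, 7, 3, 4]])
ids_b = torch.tensor([[1, 7, 5, 6]])

with torch.no_grad():
    # Static vectors: plain rows of the embedding matrix.
    static_a = tiny_model.embeddings.word_embeddings(ids_a)[0, 1]
    static_b = tiny_model.embeddings.word_embeddings(ids_b)[0, 1]
    # Contextual vectors: output of the last transformer layer.
    context_a = tiny_model(input_ids=ids_a).last_hidden_state[0, 1]
    context_b = tiny_model(input_ids=ids_b).last_hidden_state[0, 1]

same_static = torch.equal(static_a, static_b)
same_context = torch.allclose(context_a, context_b)
print(f"same static embedding: {same_static}")
print(f"same contextual embedding: {same_context}")
```

The static vectors match because they come from the same row of the embedding matrix, while the contextual vectors differ because self-attention mixes in information from the other tokens in each sequence.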


<b>The `last_hidden_state` is the output of the last transformer layer in the model. It holds one contextually enriched embedding per token in the input sequence. <br>These embeddings have been updated through every transformer layer, so each one incorporates context from the entire input sequence. They are the representations used for downstream tasks, and they carry richer information than the raw embeddings from the embedding matrix.</b>




<b>In other words, the output of the last hidden layer (i.e., `last_hidden_state`) holds contextually rich embeddings for the entire input sentence.
For each token in the input sentence, the model generates a vector of size `hidden_size` (768 for this model).
These vectors capture the meaning of each token in context: each token embedding is influenced by its surrounding tokens, so it reflects the token's role in the sentence.</b>
