# Interpret and deploy MedCLIP pipeline

## Method
1. Tokenize each input text into token IDs (before any attention is applied) -- tokenizer
2. Pass the token IDs through the neural network, which applies the attention mechanism -- model
3. Pool the per-token outputs into a single embedding per text


## How to proceed
- SentenceTransformer would let us run SentenceTransformer("...").encode(text) without handling pooling ourselves.
- However, SentenceTransformer is not available for this model, so we use Transformers, which requires more manual work.
- Transformers requires an AutoTokenizer to preprocess the text into token IDs, with padding and truncation parameters.
- We then use AutoModel, which is the actual neural network, to produce an output from those IDs.
- The output is one embedding per token, so we pool across tokens to get a single representative embedding for the whole sentence.

In [4]:
from transformers import AutoTokenizer, AutoModel
import torch
tokeniser = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT") # Load the tokenizer that turns text into token IDs
texts = [
    "Patient has a cold", 
    "Patient has a sore throat and no other symptoms", 
    "Patient is vomiting blood and has a collapsed lung"
    ]

In [5]:
inputs = tokeniser(texts, padding=True, truncation=True, return_tensors="pt") # Token IDs and attention mask as PyTorch tensors
print(inputs)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


{'input_ids': tensor([[  101,  5351,  1144,   170,  2504,   102,     0,     0,     0,     0,
             0,     0],
        [  101,  5351,  1144,   170, 15939,  2922,  1105,  1185,  1168,  8006,
           102,     0],
        [  101,  5351,  1110, 26979,  1158,  1892,  1105,  1144,   170,  7322,
         13093,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
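
The truncation warning above appears because `truncation=True` was passed without a `max_length`, and this tokenizer does not define a default one. A minimal sketch of one way to make the limit explicit follows. The helper name `tokenize_batch` is hypothetical, and the default of 512 is an assumption based on the usual BERT-base position limit; it is worth checking `model.config.max_position_embeddings` for the checkpoint actually loaded.

```python
def tokenize_batch(tokenizer, texts, max_length=512):
    """Tokenize texts with an explicit truncation limit.

    `tokenizer` is any Hugging Face-style tokenizer callable.
    `max_length=512` is an assumed default (BERT-base position limit),
    not something confirmed for every checkpoint.
    """
    return tokenizer(
        texts,
        padding=True,           # pad to the longest text in the batch
        truncation=True,        # cut texts longer than max_length
        max_length=max_length,  # explicit limit, so truncation takes effect
        return_tensors="pt",
    )
```

With the notebook's objects this would be called as `tokenize_batch(tokeniser, texts)`. The short example sentences are far below the limit, so the resulting tensors should not change.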


In [None]:
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT") # Load neural network
with torch.no_grad(): # Inference only: skip gradient tracking to save time and memory
    outputs = model(**inputs) # Forward pass: attention layers produce one contextual embedding per token

hidden_states = outputs.last_hidden_state # Final-layer output, shape [batch, tokens, hidden size]
print(hidden_states.shape)

torch.Size([3, 12, 768])


In [8]:
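# Mean pooling over real tokens only: the mask zeroes out padding positions
# (similar to filtering a pandas DataFrame with a boolean mask)
mask = inputs["attention_mask"].unsqueeze(-1) # [batch, tokens] -> [batch, tokens, 1] so it broadcasts
pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1) # Sum of real-token vectors / number of real tokens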
print(pooled)

tensor([[ 0.5553,  0.0857, -0.4640,  ...,  0.0184,  0.0646, -0.1259],
        [ 0.1498,  0.1194, -0.3643,  ...,  0.2049,  0.1319, -0.3943],
        [ 0.1813,  0.0279, -0.3916,  ...,  0.2560,  0.1692, -0.3637]])
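
To see why the mask matters, here is a toy sketch that uses hand-made tensors in place of model outputs. The numbers are made up for illustration; padding positions in a real model hold whatever the network computes, not 9s. It compares the masked mean used above with a naive mean over every position.

```python
import torch

# Toy "hidden states": 1 sentence, 4 token positions, 2 dimensions.
# The last two positions stand in for padding and hold arbitrary values.
hidden_states = torch.tensor([[[1.0, 2.0],
                               [3.0, 4.0],
                               [9.0, 9.0],
                               [9.0, 9.0]]])
attention_mask = torch.tensor([[1, 1, 0, 0]])

mask = attention_mask.unsqueeze(-1)  # shape [1, 4, 1]
masked_mean = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1)
naive_mean = hidden_states.mean(dim=1)

print(masked_mean)  # values: [[2.0, 3.0]], the average of the two real tokens
print(naive_mean)   # values: [[5.5, 6.0]], skewed by the padding positions
```

The masked version divides by the number of real tokens (2), not the padded length (4), so the padded slots do not contribute to the sentence embedding.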


## Full class

In [13]:
from transformers import AutoTokenizer, AutoModel
import torch

class Model:
    """Load a model with AutoTokenizer and AutoModel. Defaults to Bio_ClinicalBERT."""
    def __init__(self, link="emilyalsentzer/Bio_ClinicalBERT"):
        self.tokenizer = AutoTokenizer.from_pretrained(link)
        self.model = AutoModel.from_pretrained(link) # Load neural network

    def embeddings(self, texts):
        """Return one mean-pooled embedding per input text."""
        inputs = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt") # Token IDs and attention mask

        with torch.no_grad(): # Inference only: skip gradient tracking to save time and memory
            outputs = self.model(**inputs) # Forward pass: one contextual embedding per token

        hidden_states = outputs.last_hidden_state # Final-layer output, shape [batch, tokens, hidden size]
        # Mean pooling over real tokens only: the mask zeroes out padding positions
        mask = inputs["attention_mask"].unsqueeze(-1)
        pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1)
        return pooled

In [14]:
tmp = Model().embeddings(["Patient has a cold", "Patient is vomiting blood"])
print(tmp)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


tensor([[ 0.5553,  0.0857, -0.4640,  ...,  0.0184,  0.0646, -0.1259],
        [ 0.2254,  0.1918, -0.5101,  ...,  0.2051,  0.2852, -0.2781]])
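
The first row here matches the first row of the earlier three-sentence batch, even though that batch was padded to 12 tokens. The sketch below checks the pooling step alone for this padding invariance, with random tensors standing in for model outputs. `mean_pool` is a hypothetical helper that mirrors the pooling lines in the class; the sketch does not exercise the model's own attention masking.

```python
import torch

def mean_pool(hidden_states, attention_mask):
    """Average token vectors over positions where attention_mask == 1."""
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
    return (hidden_states * mask).sum(dim=1) / mask.sum(dim=1)

torch.manual_seed(0)
real = torch.randn(1, 6, 768)  # 6 "real" token vectors

# Same tokens, padded to length 12 with arbitrary vectors and a zero mask.
padded = torch.cat([real, torch.randn(1, 6, 768)], dim=1)
short_mask = torch.ones(1, 6, dtype=torch.long)
long_mask = torch.cat([short_mask, torch.zeros(1, 6, dtype=torch.long)], dim=1)

same = torch.allclose(
    mean_pool(real, short_mask),
    mean_pool(padded, long_mask),
    atol=1e-6,
)
print(same)  # True
```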
