# CHI-31: Freetext Clustering Proof of Concept

**Goal:** FreeTextAnalysis: visualize free-text fields such as ‘other comorbidities’, either by clustering or by mapping entries to a category scheme such as ICD codes.

**Background:** Researchers currently have no visibility of free-text fields. Mapping to ICD codes may be a longer-term option, but it is more complex because ICD codes are tiered and some entries may not map well. A simple clustering approach is a better first bet.

**Value:** Surface to researchers the data contained in their free-text fields, notably ‘other comorbidities’ or similar.

**Deliverables:** POC demo in feature branch, video sent to Esteban & Co for feedback

**Stakeholders:** Esteban, Laura Merson

**Blockers:** None. Aim for something computationally simple and cheap in the first instance.

**Opportunities:**

Notes:

Omid has lots of good ideas here:

- Use BERTopic as a resource for clustering
- Good framework-level tool: we could potentially drop in Omid’s compact BERT-based models
- Could be an easy win for ISARIC: compact enough that we don’t need to hit an API or download a big model
- Naming the clusters might be a heavier task, but there has been no absolute requirement for this in conversations with ISARIC to date


## Plan
1. Get dataset of short, clinical free text to experiment with
2. Compile BERTopic modelling pipeline including:
    * Omid's lightweight clinical language models for encoding
    * probably BERTopic defaults for other modular components
3. Test on example dataset
    * cluster free text
    * visualize similar to how it might look on dashboard
    * (probably don't integrate into Vertex due to data not appearing in example df - but I could stitch something in to maintain the cohesion of demos)

In [1]:
import os
import pandas as pd
import numpy as np
from transformers import AutoTokenizer, AutoModel, Pipeline
import torch.nn.functional as F
from typing import List, Union, Dict, Tuple



## 1. Example dataset

Try MIMIC-IV demo dataset at https://physionet.org/content/mimic-iv-demo/2.2/

Or the ISARIC data sent by Omid ('terms_training'): contents unspecified, possibly cancer related? Better to use the publicly available ICN dataset.

In [2]:
# Read the ICN dataset directly from GitHub; "?raw=true" returns the file itself rather than the HTML page
# (pd.read_excel needs openpyxl installed for .xlsx files). Browsable copy:
# https://github.com/nlpie-research/Lightweight-Clinical-Transformers/blob/main/ICN/ISARIC%20Anonymised%20Clinical%20Terms%209AUG23.xlsx
icn_url = "https://github.com/nlpie-research/Lightweight-Clinical-Transformers/blob/main/ICN/ISARIC%20Anonymised%20Clinical%20Terms%209AUG23.xlsx?raw=true"
data_df = pd.read_excel(icn_url)
data_df

Unnamed: 0,FREE-TEXT TERM,CONTROLLED TERMS
0,ppm insertion,
1,c. diff,
2,diseases of the respiratory system,
3,under investigation for jerky movements at she...,
4,cva hypothyrodism,
...,...,...
5535,mestatic pancoast tumor identified on ct,malignant neoplasm
5536,known aml,malignant neoplasm
5537,new lung cancer diagnosis,malignant neoplasm
5538,metastatic breast ca,malignant neoplasm


In [3]:
# # data_dir = "../data/physionet.org/files/mimic-iv-demo/2.2/"
# data_dir = "../data/omid_isaric/"
# # # try with diagnoses table - descriptions of ICD categories
# # filepath = "hosp/d_icd_diagnoses.csv"
# filepath = "terms_training.csv"

# # might be able to check validity of clusters later by looking at ICD super category?

In [4]:
# d_icd_diagnoses_df = pd.read_csv(os.path.join(data_dir, filepath))
# data_df = pd.read_csv(os.path.join(data_dir, filepath))

In [5]:
data_df.dtypes

FREE-TEXT TERM      object
CONTROLLED TERMS    object
dtype: object

## 2. BERTopic modelling pipeline

In [6]:
import torch
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, TextGeneration
from flair.embeddings import TransformerDocumentEmbeddings
from transformers.pipelines import pipeline

In [7]:
data_df.sample(10)

Unnamed: 0,FREE-TEXT TERM,CONTROLLED TERMS
2209,respiratoryfailure,
2109,possible hap,
5405,multiple myeloma,malignant neoplasm
1891,moderate copd,
3507,breast cancer july year redacted pallative sta...,malignant neoplasm
3947,progressive cancer symptoms,malignant neoplasm
2500,bibasilar consolidation,
2282,alitism,
840,billiary sepsis,
373,acute retroperionteal hematoma with active ext...,


In [8]:
# select subset of table
freetext_col = "FREE-TEXT TERM"
n = None
if n:
    data_subsample = data_df.sample(n=min(n, len(data_df)))
else:
    data_subsample = data_df
docs = data_subsample[freetext_col].astype(str).tolist()

In [9]:
docs

['ppm insertion',
 'c. diff',
 'diseases of the respiratory system',
 "under investigation for jerky movements at sheffield children's hospital",
 'cva hypothyrodism',
 'joint pain and swollen',
 'uti urinary sepsis',
 'consolidation on chest x-ray',
 'cephalitis',
 'hyponatremic',
 'superficial thrombophlebitis of leg',
 'constipated with overflow diarrhoea',
 'seenotes',
 'ppm and a cholecystectomy',
 'fractured l2 spine',
 'uti',
 'depressive disorder nec',
 'diagnosed with mild cognitive impairment',
 'itching and burning sensation in the extremities',
 'left sided pneumonia secondary to a recent chest injury',
 'af with fast ventricular responses',
 'restaurant worker',
 'bronchiectasis',
 'cholangitis due to deranged liver function tests. hypotension',
 'under investugators for liver disease - not diagnosed at time of admission',
 'hypokalaemia (diagnosis on discharge from uhnm)',
 'he gets very cold',
 'runny nose',
 'left forefoot amputation(arterial clot)',
 'erthymia',
 'chol

In [10]:
# try flair as simpler implementation
distil_biobert = TransformerDocumentEmbeddings('nlpie/distil-biobert')

In [11]:
# example using HF pipeline to create sentence embeddings - but presume bertopic does this under the hood anyway
class SentenceEncoderPipeline(Pipeline):
    def __init__(self, model, tokenizer, device=None, max_length=512):
        """
        Initialize the sentence encoder pipeline.
        
        Args:
            model: Pre-trained model
            tokenizer: Associated tokenizer
            device: Device to use ('cuda' or 'cpu')
            max_length: Maximum sequence length
        """
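        # Store max_length on the instance: preprocess() reads self.max_length,
        # and passing it to the base Pipeline only forwards it to
        # _sanitize_parameters, where it is ignored.
        self.max_length = max_length
        super().__init__(
            model=model,
            tokenizer=tokenizer,
            device=device if device is not None else -1,
        )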

    def _sanitize_parameters(
        self,
        return_tensors=None,
        normalize=None,
        **kwargs
    ) -> Tuple[Dict, Dict, Dict]:
        """
        Sanitize and separate parameters for different pipeline stages.
        
        Returns:
            tuple: (preprocess_params, forward_params, postprocess_params)
        """
        preprocess_params = {}
        forward_params = {}
        postprocess_params = {}

        # Handle return_tensors parameter
        if return_tensors is not None:
            postprocess_params["return_tensors"] = return_tensors

        # Handle normalize parameter
        if normalize is not None:
            forward_params["normalize"] = normalize

        return preprocess_params, forward_params, postprocess_params
        
    def _mean_pooling(self, model_output, attention_mask):
        """
        Perform mean pooling on token embeddings.
        """
        token_embeddings = model_output[0]
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    
    def preprocess(self, inputs, **kwargs):
        """
        Preprocess the inputs before model forward pass.
        """
        return self.tokenizer(
            inputs,
            padding=True,
            truncation=True,
            max_length=self.max_length,
            return_tensors="pt"
        ).to(self.device)
    
    def _forward(self, model_inputs, **kwargs):
        """
        Forward pass through the model.
        """
        normalize = kwargs.get('normalize', True)
        
        with torch.no_grad():
            outputs = self.model(**model_inputs)
        
        embeddings = self._mean_pooling(outputs, model_inputs['attention_mask'])
        
        if normalize:
            embeddings = F.normalize(embeddings, p=2, dim=1)
            
        return {"embeddings": embeddings}
    
    def postprocess(self, model_outputs, **kwargs):
        """
        Postprocess the model outputs.
        """
        return model_outputs["embeddings"].cpu().numpy()

def create_sentence_encoder_pipeline(
    model_name: str = 'bert-base-uncased',
    device: Union[int, str, torch.device] = -1,
    max_length: int = 512,
    **kwargs
) -> SentenceEncoderPipeline:
    """
    Create a sentence encoder pipeline.
    
    Args:
        model_name: Name of the HuggingFace model to use
        device: Device to use (-1 for CPU, a GPU index, a device string or a torch.device)
        max_length: Maximum sequence length
        **kwargs: Additional arguments to pass to pipeline creation
        
    Returns:
        SentenceEncoderPipeline instance
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    
    return SentenceEncoderPipeline(
        model=model,
        tokenizer=tokenizer,
        device=device,
        max_length=max_length,
        **kwargs
    )
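
The pooling step in `_mean_pooling` and the normalization in `_forward` can be checked in isolation. The following is a minimal NumPy sketch of the same idea (masked mean over tokens, then L2 normalization), not the pipeline itself; the function name `masked_mean_pool` and the toy arrays are assumptions made purely for illustration.

```python
import numpy as np


def masked_mean_pool(token_embeddings, attention_mask, normalize=True):
    """Average token vectors per sequence, ignoring padded positions.

    token_embeddings: array of shape (batch, seq_len, hidden)
    attention_mask:   array of shape (batch, seq_len), 1 = real token, 0 = padding
    """
    # Broadcast the mask over the hidden dimension
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Guard against division by zero for fully padded rows
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    pooled = summed / counts
    if normalize:
        norms = np.linalg.norm(pooled, axis=1, keepdims=True)
        pooled = pooled / np.clip(norms, 1e-12, None)
    return pooled


# Toy batch (hypothetical values): 2 sequences, 3 token slots, hidden size 2
tokens = np.array([
    [[1.0, 0.0], [3.0, 0.0], [100.0, 100.0]],  # third slot is padding
    [[0.0, 2.0], [0.0, 4.0], [0.0, 6.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(tokens, mask, normalize=False)
unit = masked_mean_pool(tokens, mask)
print(pooled)
print(unit)
```

The padded slot in the first sequence carries large values on purpose: if the mask were ignored, they would dominate the average.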

In [12]:
# experiment with different models here

# set device to gpu if available 
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


# ##### to embed documents
# # embedding_model = pipeline("feature-extraction", 
# #                            model="nlpie/distil-biobert", 
# #                            device=device)

embedding_model = create_sentence_encoder_pipeline(
    model_name="nlpie/distil-biobert",
    device=device)

# embedding_model = distil_biobert

# ##### to describe clusters

# representation_model = KeyBERTInspired()

# try a huggingface model
prompt = """
I have a topic that contains the following documents: \n[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]

Based on the above information, can you give a short label of the topic?
"""

# try an open source lightweight medical llm

# Create your representation model
generator = pipeline(
    # task='text2text-generation',
    model='google/flan-t5-base',
    # model='FreedomIntelligence/Apollo-0.5B',
    # model='facebook/MobileLLM-125M',
    # trust_remote_code=True,
    device=device,
)

llm_one_word = TextGeneration(generator, prompt=prompt)

# representation_model = TextGeneration(generator)
# chain the two: KeyBERTInspired picks keywords, MaximalMarginalRelevance diversifies them
keybert_mmr = [KeyBERTInspired(), MaximalMarginalRelevance()]

# try combining models:
representation_model = {
    "Name": llm_one_word,
    "Main": keybert_mmr,
}

#### create model

topic_model = BERTopic(
    embedding_model=embedding_model,
    representation_model=representation_model,
    # nr_topics="auto", # merge topics clustered together
    nr_topics=9,
    )

In [13]:
llm_one_word.prompt

'\nI have a topic that contains the following documents: \n[DOCUMENTS]\nThe topic is described by the following keywords: [KEYWORDS]\n\nBased on the above information, can you give a short label of the topic?\n'
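
To make the two placeholders concrete, here is a rough sketch of how a template like this could be filled for a single topic. It only mimics the substitution with plain string replacement; BERTopic's own `TextGeneration` logic may format documents and keywords differently, and `fill_prompt` plus the toy inputs are hypothetical names used for illustration.

```python
PROMPT = """
I have a topic that contains the following documents: 
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]

Based on the above information, can you give a short label of the topic?
"""


def fill_prompt(template, docs, keywords):
    """Substitute representative documents and keywords into the template."""
    # One bullet per representative document
    doc_block = "".join(f"- {doc}\n" for doc in docs)
    return (
        template
        .replace("[KEYWORDS]", ", ".join(keywords))
        .replace("[DOCUMENTS]", doc_block)
    )


# Toy inputs (hypothetical): a few documents and keywords for one topic
filled = fill_prompt(
    PROMPT,
    docs=["uti", "uti urinary sepsis"],
    keywords=["uti", "urinary", "kidney"],
)
print(filled)
```

Printing the filled prompt for a couple of topics is a cheap way to see what the generator model is actually being asked before judging the labels it returns.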

## 3. Fit model on example dataset

In [14]:
topics, probs = topic_model.fit_transform(docs)

In [15]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,953,"[lung cancer with liver and pleural mets, , , ...","[cancer, tumour, carcinoma, lumpectomy, adenoc...","[lung cancer and lobectomy in year redacted, p..."
1,0,2358,"[metastatic prostate/lung cancer, , , , , , , ...","[cancer, carcinoma, tumor, tumour, metastasis,...","[breast cancer year redacted, breast cancer in..."
2,1,1190,"[diabetic ketoacidosis, , , , , , , , , ]","[pneumonitis, infection, covid, amputation, in...","[previous covid 19 in april year redacted, cov..."
3,2,425,"[pulmonary embolism, , , , , , , , , ]","[respiratory, pulmonary, lung, emphysema, apne...","[type 1 respiratory failure, pulmonary mets, p..."
4,3,241,"[cva, , , , , , , , , ]","[cva, dvt, pe, lvsd, pes, lvs, lvf, ckd, prev,...","[previous cva and mi, recurrent dvt's & pe, pr..."
5,4,150,"[anxiety & depression, , , , , , , , , ]","[anxiety, depressive, depression, psychiatric,...","[mixed anxiety disorder, anxiety and depressiv..."
6,5,113,"[uti, , , , , , , , , ]","[uti, ut, urinary, urinating, uropathy, kidney...","[uti, uti, uti]"
7,6,99,"[meningioma, , , , , , , , , ]","[meningioma, meningioma___, menigioma, meningi...","[meningioma, meningioma, meningioma]"
8,7,11,"[electrolyte imbalance, , , , , , , , , ]","[electrolyte, electrolytes, imbalance, dehydra...",[ibd and admission with major electrolyte imba...


In [16]:
topic_model.get_topic(0)

[('cancer', 0.61546826),
 ('carcinoma', 0.56458044),
 ('tumor', 0.55808854),
 ('tumour', 0.54486686),
 ('metastasis', 0.5371325),
 ('malignancy', 0.5318986),
 ('malignant', 0.5132815),
 ('metastases', 0.47693974),
 ('adenocarcinoma', 0.4607199),
 ('metastatic', 0.45346665)]

In [17]:
topic_model.get_document_info(docs)

Unnamed: 0,Document,Topic,Name,Representation,Representative_Docs,Top_n_words,Probability,Representative_document
0,ppm insertion,3,"[cva, , , , , , , , , ]","[cva, dvt, pe, lvsd, pes, lvs, lvf, ckd, prev,...","[previous cva and mi, recurrent dvt's & pe, pr...",cva - dvt - pe - lvsd - pes - lvs - lvf - ckd ...,0.791406,False
1,c. diff,-1,"[lung cancer with liver and pleural mets, , , ...","[cancer, tumour, carcinoma, lumpectomy, adenoc...","[lung cancer and lobectomy in year redacted, p...",cancer - tumour - carcinoma - lumpectomy - ade...,0.000000,False
2,diseases of the respiratory system,2,"[pulmonary embolism, , , , , , , , , ]","[respiratory, pulmonary, lung, emphysema, apne...","[type 1 respiratory failure, pulmonary mets, p...",respiratory - pulmonary - lung - emphysema - a...,0.531045,False
3,under investigation for jerky movements at she...,-1,"[lung cancer with liver and pleural mets, , , ...","[cancer, tumour, carcinoma, lumpectomy, adenoc...","[lung cancer and lobectomy in year redacted, p...",cancer - tumour - carcinoma - lumpectomy - ade...,0.000000,False
4,cva hypothyrodism,6,"[meningioma, , , , , , , , , ]","[meningioma, meningioma___, menigioma, meningi...","[meningioma, meningioma, meningioma]",meningioma - meningioma___ - menigioma - menin...,0.691257,False
...,...,...,...,...,...,...,...,...
5535,mestatic pancoast tumor identified on ct,0,"[metastatic prostate/lung cancer, , , , , , , ...","[cancer, carcinoma, tumor, tumour, metastasis,...","[breast cancer year redacted, breast cancer in...",cancer - carcinoma - tumor - tumour - metastas...,0.364190,False
5536,known aml,3,"[cva, , , , , , , , , ]","[cva, dvt, pe, lvsd, pes, lvs, lvf, ckd, prev,...","[previous cva and mi, recurrent dvt's & pe, pr...",cva - dvt - pe - lvsd - pes - lvs - lvf - ckd ...,1.000000,False
5537,new lung cancer diagnosis,0,"[metastatic prostate/lung cancer, , , , , , , ...","[cancer, carcinoma, tumor, tumour, metastasis,...","[breast cancer year redacted, breast cancer in...",cancer - carcinoma - tumor - tumour - metastas...,1.000000,False
5538,metastatic breast ca,-1,"[lung cancer with liver and pleural mets, , , ...","[cancer, tumour, carcinoma, lumpectomy, adenoc...","[lung cancer and lobectomy in year redacted, p...",cancer - tumour - carcinoma - lumpectomy - ade...,0.000000,False
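
One possible sanity check, following the earlier note about validating clusters against a higher-level category, is to cross-tabulate topic assignments against a known label such as `CONTROLLED TERMS` and compute a purity score per topic. The sketch below uses a small hand-made table rather than the real output; the column names and the "other" label are assumptions, and in the real data unlabelled rows would first need filling (for example with `fillna`).

```python
import pandas as pd

# Toy stand-in (hypothetical values) for topic assignments joined to a controlled label
toy = pd.DataFrame({
    "topic": [0, 0, 0, 1, 1, 2, 2, 2],
    "controlled_term": [
        "malignant neoplasm", "malignant neoplasm", "other",
        "other", "other",
        "malignant neoplasm", "other", "other",
    ],
})

# Rows = topics, columns = labels, cells = document counts
table = pd.crosstab(toy["topic"], toy["controlled_term"])

# Purity per topic: share of documents carrying the topic's most common label
purity = table.max(axis=1) / table.sum(axis=1)

# Overall purity: documents matching their topic's majority label / all documents
overall_purity = table.max(axis=1).sum() / table.to_numpy().sum()

print(table)
print(purity)
print(overall_purity)
```

Purity rewards many small clusters, so it is best read alongside the topic counts rather than on its own.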


In [18]:
topic_model.visualize_topics()


In [19]:
custom_labels = list(topic_model.get_topic_info()['Name'].apply(lambda x: x[0]))
counts = list(topic_model.get_topic_info()['Count'])
custom_labels = [label + f"<br>n={count}" for label, count in zip(custom_labels, counts)]
custom_labels

['lung cancer with liver and pleural mets<br>n=953',
 'metastatic prostate/lung cancer<br>n=2358',
 'diabetic ketoacidosis<br>n=1190',
 'pulmonary embolism<br>n=425',
 'cva<br>n=241',
 'anxiety & depression<br>n=150',
 'uti<br>n=113',
 'meningioma<br>n=99',
 'electrolyte imbalance<br>n=11']

In [20]:
topic_model.set_topic_labels(custom_labels)

In [21]:
fig = topic_model.visualize_barchart(
    custom_labels=True,
    title="Topics and keywords <br>",
)

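# Add a "master" x-axis title as an annotation.
# Note: with KeyBERTInspired as the main representation, the keyword scores shown
# appear to be embedding similarities rather than raw c-TF-IDF weights.
fig.add_annotation(
    dict(
        text="Keyword score",  # Text for the shared x-axis
        x=0.5,  # Position at the center
        y=-0.15,  # Position below the bottom subplot
        xref="paper",
        yref="paper",
        showarrow=False,
        # font=dict(size=16)
    )
)
# fig.show()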


In [22]:
# dummy visualizing over time - create some synthetic timestamps

# Define the start and end dates
start_date = "2021-01-01"
end_date = "2023-12-31"

# Generate random dates between the start and end dates
n = len(docs)  # Number of random dates to generate
date_range = pd.date_range(start=start_date, end=end_date)
random_dates = np.random.choice(date_range, n)

# compute topic frequencies per time bin using the synthetic timestamps
topics_over_time = topic_model.topics_over_time(docs, random_dates, nr_bins=20)
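
For intuition about what `nr_bins=20` does to the synthetic dates, the binning and counting can be sketched with pandas alone. This is an illustrative approximation under stated assumptions, not BERTopic's implementation (which also recomputes topic keywords per bin); the toy topic assignments, the seed and the column names are made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the toy example is repeatable
n_docs = 200

# Random dates in the same range as the synthetic timestamps above
start = pd.Timestamp("2021-01-01")
n_days = (pd.Timestamp("2023-12-31") - start).days + 1
day_offsets = rng.integers(0, n_days, size=n_docs)

toy = pd.DataFrame({
    "timestamp": start + pd.to_timedelta(day_offsets, unit="D"),
    "topic": rng.integers(0, 3, size=n_docs),  # hypothetical topic ids 0..2
})

# Split the date range into 20 equal-width intervals
toy["bin"] = pd.cut(toy["timestamp"], bins=20)

# Count documents per topic and time bin (rows = topics, columns = bins)
counts = (
    toy.groupby(["topic", "bin"], observed=False)
    .size()
    .unstack(fill_value=0)
)
print(counts.shape)
```

Because the dates here are random, the resulting curves should be roughly flat; any apparent trend in the dummy plot is noise.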

In [23]:
# visualize
topic_model.visualize_topics_over_time(topics_over_time,
                                       custom_labels=True)