# CHI-31: Freetext Clustering Proof of Concept

**Goal:** FreeTextAnalysis: some visualization of free-text fields such as ‘other comorbidities’, either by clustering them or by mapping them to a category, e.g. an ICD code.

**Background:** Researchers currently have no visibility of free-text fields. Mapping to ICD codes may be a longer-term option, but it is more complex because ICD codes are tiered and some categories may not map well. A simple clustering approach is a better first bet.

**Value:** Surface to researchers the data contained in their free-text fields, notably ‘other comorbidities’ or similar.

**Deliverables:** POC demo in feature branch, video sent to Esteban & Co for feedback

**Stakeholders:** Esteban, Laura Merson

**Blockers:** None. We should go for something computationally simple and cheap in the first instance.

**Opportunities:**

Notes:

Omid has lots of good ideas here:

- Use BERTopic as the resource for clustering
- Good framework-level tool: we could potentially drop in Omid’s compact BERT-based models
- Could be an easy win for ISARIC: compact enough that we don’t need to hit an API or download a big model
- If we then want to name the clusters, that might be a heavier task, but there has been no absolute requirement for this in conversations with ISARIC to date


## Plan
1. Get a dataset of short, clinical free text to experiment with
2. Compile a BERTopic modelling pipeline including:
    * Omid's lightweight clinical LLMs for encoding
    * probably BERTopic defaults for the other modular components
3. Test on the example dataset
    * cluster the free text
    * visualize it similarly to how it might look on the dashboard
    * (probably don't integrate into Vertex, because this data does not appear in the example df, but I could stitch something in to keep the demos cohesive)

In [1]:
import os
import pandas as pd
import numpy as np
from transformers import AutoTokenizer, AutoModel, Pipeline
import torch.nn.functional as F
from typing import List, Union, Dict, Tuple



## 1. Example dataset

Try the MIMIC-IV demo dataset at https://physionet.org/content/mimic-iv-demo/2.2/

In [2]:
data_dir = "../data/physionet.org/files/mimic-iv-demo/2.2/"
# try with the diagnoses table: descriptions of ICD categories
filepath = "hosp/d_icd_diagnoses.csv"

# might be able to check validity of clusters later by looking at ICD super category?
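One way the validity check mentioned above could look is a cluster purity score: the fraction of documents whose ICD super category matches the majority category of their cluster. The sketch below is only illustrative. `super_category` and `cluster_purity` are hypothetical helpers, the grouping rule (leading letter for ICD-10, three-character category for ICD-9) is a crude assumed proxy for a super category, and the topic ids are made up.

```python
from collections import Counter, defaultdict


def super_category(icd_code: str, icd_version: int) -> str:
    """Crude, hypothetical proxy for an ICD super category."""
    if icd_version == 10:
        return icd_code[0]   # ICD-10: leading letter, e.g. "Z"
    return icd_code[:3]      # ICD-9: three-character category, e.g. "011"


def cluster_purity(topics, categories) -> float:
    """Fraction of documents that share their cluster's majority category."""
    by_topic = defaultdict(Counter)
    for topic, category in zip(topics, categories):
        by_topic[topic][category] += 1
    majority_total = sum(c.most_common(1)[0][1] for c in by_topic.values())
    return majority_total / len(topics)


# Toy example: (icd_code, icd_version) pairs; topic ids are invented.
codes = [("01160", 9), ("01186", 9), ("01200", 9), ("Z88", 10), ("Z89012", 10)]
toy_topics = [0, 0, 1, 1, 1]

cats = [super_category(code, version) for code, version in codes]
purity = cluster_purity(toy_topics, cats)
print(cats, purity)
```

A purity near 1 would suggest clusters line up with the chosen categories; a low value would suggest they cut across them. The score depends heavily on how coarse the category proxy is.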

In [3]:
d_icd_diagnoses_df = pd.read_csv(os.path.join(data_dir, filepath))

In [4]:
d_icd_diagnoses_df

| | icd_code | icd_version | long_title |
|---|---|---|---|
| 0 | 0090 | 9 | Infectious colitis, enteritis, and gastroenter... |
| 1 | 01160 | 9 | Tuberculous pneumonia [any form], unspecified |
| 2 | 01186 | 9 | Other specified pulmonary tuberculosis, tuberc... |
| 3 | 01200 | 9 | Tuberculous pleurisy, unspecified |
| 4 | 01236 | 9 | Tuberculous laryngitis, tubercle bacilli not f... |
| ... | ... | ... | ... |
| 109770 | Z88 | 10 | Allergy status to drugs, medicaments and biolo... |
| 109771 | Z89012 | 10 | Acquired absence of left thumb |
| 109772 | Z90410 | 10 | Acquired total absence of pancreas |
| 109773 | Z948 | 10 | Other transplanted organ and tissue status |


## 2. BERTopic modelling pipeline

In [5]:
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, TextGeneration, MaximalMarginalRelevance
from transformers.pipelines import pipeline
import torch
from typing import Tuple
from flair.embeddings import TransformerDocumentEmbeddings

In [6]:
# select a subset of the table (.loc slicing is label-based and inclusive, so this gives 1001 rows)
docs = d_icd_diagnoses_df.loc[:1000, 'long_title']

In [7]:
docs

0       Infectious colitis, enteritis, and gastroenter...
1           Tuberculous pneumonia [any form], unspecified
2       Other specified pulmonary tuberculosis, tuberc...
3                       Tuberculous pleurisy, unspecified
4       Tuberculous laryngitis, tubercle bacilli not f...
                              ...                        
996     Chronic glomerulonephritis with lesion of prol...
997                      Infection of kidney, unspecified
998                                     Urethral caruncle
999                      Urinary obstruction, unspecified
1000                   Other specified disorders of penis
Name: long_title, Length: 1001, dtype: object

In [8]:
# try flair as a simpler implementation
distil_biobert = TransformerDocumentEmbeddings('nlpie/distil-biobert')

In [9]:
# # example using HF pipeline to create sentence embeddings - but presume bertopic does this under the hood anyway
# class SentenceEncoderPipeline(Pipeline):
#     def __init__(self, model, tokenizer, device=None, max_length=512):
#         """
#         Initialize the sentence encoder pipeline.
        
#         Args:
#             model: Pre-trained model
#             tokenizer: Associated tokenizer
#             device: Device to use ('cuda' or 'cpu')
#             max_length: Maximum sequence length
#         """
#         super().__init__(
#             model=model,
#             tokenizer=tokenizer,
#             device=device if device is not None else -1,
#             max_length=max_length
#         )

#     def _sanitize_parameters(
#         self,
#         return_tensors=None,
#         normalize=None,
#         **kwargs
#     ) -> Tuple[Dict, Dict, Dict]:
#         """
#         Sanitize and separate parameters for different pipeline stages.
        
#         Returns:
#             tuple: (preprocess_params, forward_params, postprocess_params)
#         """
#         preprocess_params = {}
#         forward_params = {}
#         postprocess_params = {}

#         # Handle return_tensors parameter
#         if return_tensors is not None:
#             postprocess_params["return_tensors"] = return_tensors

#         # Handle normalize parameter
#         if normalize is not None:
#             forward_params["normalize"] = normalize

#         return preprocess_params, forward_params, postprocess_params
        
#     def _mean_pooling(self, model_output, attention_mask):
#         """
#         Perform mean pooling on token embeddings.
#         """
#         token_embeddings = model_output[0]
#         input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
#         return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    
#     def preprocess(self, inputs, **kwargs):
#         """
#         Preprocess the inputs before model forward pass.
#         """
#         return self.tokenizer(
#             inputs,
#             padding=True,
#             truncation=True,
#             max_length=self.max_length,
#             return_tensors="pt"
#         ).to(self.device)
    
#     def _forward(self, model_inputs, **kwargs):
#         """
#         Forward pass through the model.
#         """
#         normalize = kwargs.get('normalize', True)
        
#         with torch.no_grad():
#             outputs = self.model(**model_inputs)
        
#         embeddings = self._mean_pooling(outputs, model_inputs['attention_mask'])
        
#         if normalize:
#             embeddings = F.normalize(embeddings, p=2, dim=1)
            
#         return {"embeddings": embeddings}
    
#     def postprocess(self, model_outputs, **kwargs):
#         """
#         Postprocess the model outputs.
#         """
#         return model_outputs["embeddings"].cpu().numpy()

# def create_sentence_encoder_pipeline(
#     model_name: str = 'bert-base-uncased',
#     device: int = -1,
#     max_length: int = 512,
#     **kwargs
# ) -> SentenceEncoderPipeline:
#     """
#     Create a sentence encoder pipeline.
    
#     Args:
#         model_name: Name of the HuggingFace model to use
#         device: Device to use (-1 for CPU, 0+ for GPU)
#         max_length: Maximum sequence length
#         **kwargs: Additional arguments to pass to pipeline creation
        
#     Returns:
#         SentenceEncoderPipeline instance
#     """
#     tokenizer = AutoTokenizer.from_pretrained(model_name)
#     model = AutoModel.from_pretrained(model_name)
    
#     return SentenceEncoderPipeline(
#         model=model,
#         tokenizer=tokenizer,
#         device=device,
#         max_length=max_length,
#         **kwargs
#     )
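The pooling and normalization steps in the commented-out pipeline above can be illustrated in isolation. The following is a minimal NumPy sketch of masked mean pooling followed by L2 normalization, using toy arrays in place of real model output. `mean_pool` is a hypothetical helper written for this illustration, not part of transformers or BERTopic.

```python
import numpy as np


def mean_pool(token_embeddings, attention_mask, normalize=True):
    """Average token vectors per sentence, ignoring padded positions."""
    mask = attention_mask[..., None].astype(float)      # (batch, tokens, 1)
    summed = (token_embeddings * mask).sum(axis=1)      # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)      # avoid division by zero
    pooled = summed / counts
    if normalize:
        norms = np.linalg.norm(pooled, axis=1, keepdims=True)
        pooled = pooled / np.clip(norms, 1e-12, None)   # unit-length rows
    return pooled


# Toy batch: 2 sentences, 3 token slots, 2-dimensional embeddings.
tokens = np.array([
    [[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]],   # last slot is padding
    [[0.0, 2.0], [0.0, 4.0], [0.0, 6.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = mean_pool(tokens, mask, normalize=False)
unit = mean_pool(tokens, mask)
print(pooled)
print(unit)
```

The padded slot in the first sentence does not affect its pooled vector, which is the point of weighting by the attention mask.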

In [10]:
# experiment with different models here

# set device to gpu if available 
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


# ##### to embed documents
# # embedding_model = pipeline("feature-extraction", 
# #                            model="nlpie/distil-biobert", 
# #                            device=device)

# embedding_model = create_sentence_encoder_pipeline(
#     model_name="nlpie/distil-biobert",
#     device=device)

embedding_model = distil_biobert

# ##### to describe clusters

# representation_model = KeyBERTInspired()

# try a Hugging Face model
prompt = """
I have a topic that contains the following documents: \n[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]

Based on the above information, can you give a short label of the topic?
"""

# Create your representation model
generator = pipeline('text2text-generation', 
                     model='google/flan-t5-base',
                     device=device)

# NOTE: `prompt` above is not passed in, so TextGeneration uses its default prompt;
# use TextGeneration(generator, prompt=prompt) to apply the custom prompt
llm_one_word = TextGeneration(generator)

# representation_model = TextGeneration(generator)
# chain KeyBERT-inspired keywords with MMR for more diverse keywords
keybert_mmr = [KeyBERTInspired(), MaximalMarginalRelevance()]

# try combining models:
representation_model = {
    "Name": llm_one_word,
    "Main": keybert_mmr,
}

#### create model

topic_model = BERTopic(
    embedding_model=embedding_model,
    representation_model=representation_model,
    nr_topics="auto", # merge topics clustered together
    )

## 3. Fit model on example dataset

In [11]:
topics, probs = topic_model.fit_transform(docs)

In [12]:
topic_model.get_topic_info()

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 1 | [hemorrhage, , , , , , , , , ] | [hemorrhage, perforation, puncture, vaccinatio... | [Accidental cut, puncture, perforation or hemo... |
| 1 | 0 | 841 | [antepartum, , , , , , , , , ] | [complication, hemorrhage, antepartum, obstruc... | [Prolonged pregnancy, delivered, with or witho... |
| 2 | 1 | 95 | [leukemia, , , , , , , , , ] | [neoplasm, neoplasms, hemangioma, malignant, l... | [Malignant neoplasm of vulva, unspecified site... |
| 3 | 2 | 42 | [concussion, , , , , , , , , ] | [hemorrhage, intracranial, laceration, subarac... | [Open fractures involving skull or face with o... |
| 4 | 3 | 22 | [tuberculosis, , , , , , , , , ] | [bacteriological, tuberculous, tuberculoma, pe... | [Other specified miliary tuberculosis, tubercl... |
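A quick way to read the table above is to turn the `Count` column into shares of the 1001 documents, which shows how unevenly the documents are spread across topics. The sketch below uses the counts exactly as reported; the 0.5 dominance threshold is an arbitrary, hypothetical cut-off chosen only for illustration.

```python
# Counts copied from the get_topic_info() output, keyed by topic id.
topic_counts = {-1: 1, 0: 841, 1: 95, 2: 42, 3: 22}

total = sum(topic_counts.values())
shares = {topic: count / total for topic, count in topic_counts.items()}
largest_topic = max(shares, key=shares.get)

for topic, share in shares.items():
    print(f"topic {topic:>2}: {share:6.1%}")

# Hypothetical threshold: flag the run if one topic absorbs most documents.
DOMINANCE_THRESHOLD = 0.5
is_dominated = shares[largest_topic] > DOMINANCE_THRESHOLD
print(largest_topic, is_dominated)
```

Here topic 0 holds roughly 84% of the documents, so a check like this would flag the run as dominated by a single cluster.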


In [13]:
topic_model.get_topic(0)

[('complication', 0.9605564),
 ('hemorrhage', 0.9563548),
 ('antepartum', 0.9520293),
 ('obstruction', 0.951741),
 ('pregnancy', 0.9454864),
 ('not', 0.9454721),
 ('complicating', 0.94444),
 ('unspecified', 0.9436005),
 ('congenital', 0.9435065),
 ('without', 0.94184005)]

In [14]:
topic_model.get_document_info(docs)

| | Document | Topic | Name | Representation | Representative_Docs | Top_n_words | Probability | Representative_document |
|---|---|---|---|---|---|---|---|---|
| 0 | Infectious colitis, enteritis, and gastroenter... | 0 | [antepartum, , , , , , , , , ] | [complication, hemorrhage, antepartum, obstruc... | [Prolonged pregnancy, delivered, with or witho... | complication - hemorrhage - antepartum - obstr... | 1.000000 | False |
| 1 | Tuberculous pneumonia [any form], unspecified | 0 | [antepartum, , , , , , , , , ] | [complication, hemorrhage, antepartum, obstruc... | [Prolonged pregnancy, delivered, with or witho... | complication - hemorrhage - antepartum - obstr... | 0.801948 | False |
| 2 | Other specified pulmonary tuberculosis, tuberc... | 3 | [tuberculosis, , , , , , , , , ] | [bacteriological, tuberculous, tuberculoma, pe... | [Other specified miliary tuberculosis, tubercl... | bacteriological - tuberculous - tuberculoma - ... | 0.853436 | False |
| 3 | Tuberculous pleurisy, unspecified | 0 | [antepartum, , , , , , , , , ] | [complication, hemorrhage, antepartum, obstruc... | [Prolonged pregnancy, delivered, with or witho... | complication - hemorrhage - antepartum - obstr... | 0.832191 | False |
| 4 | Tuberculous laryngitis, tubercle bacilli not f... | 3 | [tuberculosis, , , , , , , , , ] | [bacteriological, tuberculous, tuberculoma, pe... | [Other specified miliary tuberculosis, tubercl... | bacteriological - tuberculous - tuberculoma - ... | 0.901871 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 996 | Chronic glomerulonephritis with lesion of prol... | 0 | [antepartum, , , , , , , , , ] | [complication, hemorrhage, antepartum, obstruc... | [Prolonged pregnancy, delivered, with or witho... | complication - hemorrhage - antepartum - obstr... | 1.000000 | False |
| 997 | Infection of kidney, unspecified | 0 | [antepartum, , , , , , , , , ] | [complication, hemorrhage, antepartum, obstruc... | [Prolonged pregnancy, delivered, with or witho... | complication - hemorrhage - antepartum - obstr... | 1.000000 | False |
| 998 | Urethral caruncle | 0 | [antepartum, , , , , , , , , ] | [complication, hemorrhage, antepartum, obstruc... | [Prolonged pregnancy, delivered, with or witho... | complication - hemorrhage - antepartum - obstr... | 1.000000 | False |
| 999 | Urinary obstruction, unspecified | 0 | [antepartum, , , , , , , , , ] | [complication, hemorrhage, antepartum, obstruc... | [Prolonged pregnancy, delivered, with or witho... | complication - hemorrhage - antepartum - obstr... | 1.000000 | False |


In [15]:
topic_model.visualize_topics()


In [17]:
custom_labels = list(topic_model.get_topic_info()['Name'].apply(lambda x: x[0]))
custom_labels

['hemorrhage', 'antepartum', 'leukemia', 'concussion', 'tuberculosis']

In [18]:
topic_model.set_topic_labels(custom_labels)
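The list passed above relies on label order matching topic order, and on the first entry of each `Name` list being non-empty. A slightly more defensive variant, sketched below in plain Python, builds an explicit topic-id-to-label mapping and falls back to a placeholder when every candidate is blank. `first_non_empty` and the fallback label format are hypothetical; the ids and names are copied from the `get_topic_info()` output.

```python
# Topic ids and `Name` lists as reported by get_topic_info().
topic_ids = [-1, 0, 1, 2, 3]
names = [
    ["hemorrhage"] + [""] * 9,
    ["antepartum"] + [""] * 9,
    ["leukemia"] + [""] * 9,
    ["concussion"] + [""] * 9,
    ["tuberculosis"] + [""] * 9,
]


def first_non_empty(candidates, fallback):
    """Return the first non-blank label, or the fallback if all are blank."""
    return next((c.strip() for c in candidates if c.strip()), fallback)


label_by_topic = {
    topic_id: first_non_empty(name, fallback=f"topic {topic_id}")
    for topic_id, name in zip(topic_ids, names)
}
print(label_by_topic)
```

Assuming `set_topic_labels` accepts a mapping of topic ids to labels as well as a list (worth confirming against the installed BERTopic version), `topic_model.set_topic_labels(label_by_topic)` would apply these labels without depending on list order.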

In [19]:
# topic_model.set_topic_labels
topic_model.visualize_barchart(custom_labels=True)

In [21]:
# dummy visualization over time: create some synthetic timestamps

# Define the start and end dates
start_date = "2021-01-01"
end_date = "2023-12-31"

# Generate random dates between the start and end dates
n = len(docs)  # Number of random dates to generate
date_range = pd.date_range(start=start_date, end=end_date)
random_dates = np.random.choice(date_range, n)

# compute topic frequencies over the synthetic timestamps
topics_over_time = topic_model.topics_over_time(docs, random_dates, nr_bins=20)
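The cell above draws dates without a fixed seed, so the resulting plot changes on every run. The standard-library sketch below shows the same idea with a seeded generator, plus a rough illustration of what binning by time means: counting documents per (topic, bin) pair. It is not BERTopic's implementation, whose binning may differ. `random_dates` and `bin_index` are hypothetical helpers and the topic ids are made up.

```python
import random
from collections import Counter
from datetime import date, timedelta


def random_dates(start, end, n, seed=0):
    """Draw n dates uniformly between start and end (inclusive)."""
    rng = random.Random(seed)   # fixed seed makes the draw reproducible
    span_days = (end - start).days
    return [start + timedelta(days=rng.randint(0, span_days)) for _ in range(n)]


def bin_index(day, start, end, nr_bins):
    """Map a date to one of nr_bins equal-width bins over [start, end]."""
    span_days = (end - start).days + 1
    return min((day - start).days * nr_bins // span_days, nr_bins - 1)


start, end = date(2021, 1, 1), date(2023, 12, 31)
toy_topics = [0, 0, 1, 0, 3, 1]   # invented topic ids, one per document
dates = random_dates(start, end, len(toy_topics), seed=42)

# Count documents per (topic, time bin) pair.
counts = Counter(
    (topic, bin_index(day, start, end, nr_bins=20))
    for topic, day in zip(toy_topics, dates)
)
print(counts)
```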

In [22]:
# visualize
topic_model.visualize_topics_over_time(topics_over_time)