# CHI-31: Freetext Clustering Proof of Concept

**Goal:** FreeTextAnalysis: visualize free text fields such as ‘other comorbidities’, either by clustering or by mapping to a category (e.g. an ICD code).

**Background**: Currently researchers have no visibility of free text fields. Mapping to ICD codes may be a longer-term option, but it is more complex because ICD codes are tiered and some categories may not map well. A simple clustering approach is a better first bet.

**Value:** Surface to researchers the data contained in their free text fields, notably ‘other comorbidities’ or similar.

**Deliverables:** POC demo in feature branch, video sent to Esteban & Co for feedback

**Stakeholders:** Esteban, Laura Merson

**Blockers:** None. We should go for something computationally simple and cheap in the first instance.

**Opportunities:**

Notes:

Omid has lots of good ideas here:

- Use BERTopic as a resource to do the clustering
- Good framework-level tool: we could potentially drop in Omid’s compact BERT-based models
- Could be an easy win for ISARIC: compact enough that we don’t need to hit an API or download a big model
- If we then want to name the clusters, that might be a heavier task, but there has been no absolute requirement for this in conversations with ISARIC to date


## Plan
1. Get dataset of short, clinical free text to experiment with
2. Compile BERTopic modelling pipeline including:
    * Omid’s lightweight clinical LLMs for encoding
    * probably BERTopic defaults for other modular components
3. Test on example dataset
    * cluster free text
    * visualize similar to how it might look on dashboard
    * (probably don't integrate into Vertex due to data not appearing in example df - but I could stitch something in to maintain the cohesion of demos)

In [1]:
import os
import pandas as pd
import numpy as np

## 1. Example dataset

Try the MIMIC-IV demo dataset at https://physionet.org/content/mimic-iv-demo/2.2/

In [2]:
data_dir = "../data/mimic-iv-demo/2.2/"
# try with the diagnoses table - descriptions of ICD categories
filepath = "hosp/d_icd_diagnoses.csv"

# might be able to check validity of clusters later by looking at ICD super category?

In [3]:
d_icd_diagnoses_df = pd.read_csv(os.path.join(data_dir, filepath))

In [4]:
d_icd_diagnoses_df

Unnamed: 0,icd_code,icd_version,long_title
0,0090,9,"Infectious colitis, enteritis, and gastroenter..."
1,01160,9,"Tuberculous pneumonia [any form], unspecified"
2,01186,9,"Other specified pulmonary tuberculosis, tuberc..."
3,01200,9,"Tuberculous pleurisy, unspecified"
4,01236,9,"Tuberculous laryngitis, tubercle bacilli not f..."
...,...,...,...
109770,Z88,10,"Allergy status to drugs, medicaments and biolo..."
109771,Z89012,10,Acquired absence of left thumb
109772,Z90410,10,Acquired total absence of pancreas
109773,Z948,10,Other transplanted organ and tissue status


## 2. BERTopic modelling pipeline

In [5]:
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, TextGeneration
from transformers.pipelines import pipeline
import torch



In [6]:
# select subset of table
# (.loc slicing is label-based and inclusive, so this keeps rows 0..1000, i.e. 1001 rows)
docs = d_icd_diagnoses_df.loc[:1000, 'long_title']

In [7]:
docs

0       Infectious colitis, enteritis, and gastroenter...
1           Tuberculous pneumonia [any form], unspecified
2       Other specified pulmonary tuberculosis, tuberc...
3                       Tuberculous pleurisy, unspecified
4       Tuberculous laryngitis, tubercle bacilli not f...
                              ...                        
996     Chronic glomerulonephritis with lesion of prol...
997                      Infection of kidney, unspecified
998                                     Urethral caruncle
999                      Urinary obstruction, unspecified
1000                   Other specified disorders of penis
Name: long_title, Length: 1001, dtype: object

In [8]:
# experiment with different models here

# set device to gpu if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


##### to embed documents
# NOTE: this pipeline is built here but not yet passed to BERTopic (see below)
embedding_model = pipeline("feature-extraction",
                           model="nlpie/distil-biobert",
                           device=device)

##### to describe clusters

# representation_model = KeyBERTInspired()

# try a huggingface model
# NOTE: `prompt` is defined but not passed to TextGeneration below, so the
# default prompt is used; pass prompt=prompt to apply this one
prompt = """
I have a topic that contains the following documents: \n[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]

Based on the above information, can you give a short label of the topic?
"""

# Create your representation model
generator = pipeline('text2text-generation',
                     model='google/flan-t5-base',
                     device=device)
representation_model = TextGeneration(generator)

#### create model

topic_model = BERTopic(
    # embedding_model=embedding_model,  # disabled: BERTopic uses its default embedding model
    representation_model=representation_model,
    nr_topics="auto", # merge topics clustered together
    )

Some weights of BertModel were not initialized from the model checkpoint at nlpie/distil-biobert and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [9]:
topics, probs = topic_model.fit_transform(docs)

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


In [10]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,181,-1_erythema___,"[erythema, , , , , , , , , ]","[Insect bite, nonvenomous of shoulder and uppe..."
1,0,206,0_sprain___,"[sprain, , , , , , , , , ]",[Closed dislocation of interphalangeal (joint)...
2,1,104,1_pregnancy___,"[pregnancy, , , , , , , , , ]","[Prolonged pregnancy, antepartum condition or ..."
3,2,80,2_migraine___,"[migraine, , , , , , , , , ]",[Personal history of other disorders of nervou...
4,3,57,3_neoplasm___,"[neoplasm, , , , , , , , , ]","[Malignant neoplasm of vulva, unspecified site..."
5,4,48,4_tuberculosis___,"[tuberculosis, , , , , , , , , ]","[Other specified pulmonary tuberculosis, tuber..."
6,5,48,5_glaucoma___,"[glaucoma, , , , , , , , , ]",[One eye: total vision impairment; other eye: ...
7,6,44,6_concussion___,"[concussion, , , , , , , , , ]",[Open fractures involving skull or face with o...
8,7,44,7_drowning in water___,"[drowning in water, , , , , , , , , ]",[Nontraffic accident involving other off-road ...
9,8,36,8_poisoning___,"[poisoning, , , , , , , , , ]",[Anti-infectives and other drugs and preparati...
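
For the dashboard step in the plan, the topic table above could be reduced to a compact list of labels, counts, and shares. The following is only a minimal sketch under stated assumptions: `topic_summary` and its arguments are hypothetical names, it is not part of BERTopic, and the toy inputs merely mimic the shape of `topics` and of the labels shown in `get_topic_info()`.

```python
from collections import Counter


def topic_summary(topics, labels, include_outliers=False):
    """Return one dict per topic with its label, document count, and share.

    topics: iterable of topic ids, one per document (-1 is the outlier topic).
    labels: dict mapping topic id to a display label (hypothetical input).
    """
    counts = Counter(t for t in topics if include_outliers or t != -1)
    total = sum(counts.values())
    if total == 0:
        return []
    return [
        {
            "topic": topic,
            "label": labels.get(topic, str(topic)),
            "count": count,
            "share": round(count / total, 3),
        }
        for topic, count in counts.most_common()  # largest topics first
    ]


# Toy data only: these are not the real assignments from the notebook.
toy_topics = [0, 0, 0, 1, 1, 4, -1]
toy_labels = {0: "sprain", 1: "pregnancy", 4: "tuberculosis"}
for row in topic_summary(toy_topics, toy_labels):
    print(row)
```

Outliers are dropped by default so that the shares describe only clustered documents; whether the dashboard should show the outlier group is an open design choice.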


In [11]:
topic_model.get_topic(0)

[('sprain', 1),
 ('', 0),
 ('', 0),
 ('', 0),
 ('', 0),
 ('', 0),
 ('', 0),
 ('', 0),
 ('', 0),
 ('', 0)]
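
The generated representation above holds a single label followed by empty strings. If a generative model turns out to be too heavy, a much cheaper fallback could be plain term-frequency labelling per cluster. The sketch below is an assumption-laden illustration, not BERTopic's own c-TF-IDF: `keyword_labels` and the small `STOPWORDS` set are hypothetical and would need tuning for real clinical text.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stopword list; a real one would be larger and domain-specific.
STOPWORDS = {"of", "and", "or", "the", "with", "without", "other",
             "unspecified", "not", "in", "to", "any", "form"}


def keyword_labels(docs, topics, top_n=3):
    """Label each topic with its most frequent non-stopword terms."""
    term_counts = defaultdict(Counter)
    for doc, topic in zip(docs, topics):
        words = re.findall(r"[a-z]+", doc.lower())
        term_counts[topic].update(
            w for w in words if w not in STOPWORDS and len(w) > 2
        )
    return {
        topic: [word for word, _ in counts.most_common(top_n)]
        for topic, counts in term_counts.items()
    }


# Toy usage with titles taken from the table shown earlier in this notebook;
# the topic ids here are illustrative.
toy_docs = [
    "Tuberculous pneumonia [any form], unspecified",
    "Tuberculous pleurisy, unspecified",
    "Urethral caruncle",
]
toy_topics = [4, 4, -1]
print(keyword_labels(toy_docs, toy_topics))
```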

In [12]:
topic_model.get_document_info(docs)

Unnamed: 0,Document,Topic,Name,Representation,Representative_Docs,Top_n_words,Probability,Representative_document
0,"Infectious colitis, enteritis, and gastroenter...",0,0_sprain___,"[sprain, , , , , , , , , ]",[Closed dislocation of interphalangeal (joint)...,sprain - - - - - - - - -,0.772043,False
1,"Tuberculous pneumonia [any form], unspecified",4,4_tuberculosis___,"[tuberculosis, , , , , , , , , ]","[Other specified pulmonary tuberculosis, tuber...",tuberculosis - - - - - - - - -,0.487465,False
2,"Other specified pulmonary tuberculosis, tuberc...",4,4_tuberculosis___,"[tuberculosis, , , , , , , , , ]","[Other specified pulmonary tuberculosis, tuber...",tuberculosis - - - - - - - - -,1.000000,True
3,"Tuberculous pleurisy, unspecified",4,4_tuberculosis___,"[tuberculosis, , , , , , , , , ]","[Other specified pulmonary tuberculosis, tuber...",tuberculosis - - - - - - - - -,0.653610,False
4,"Tuberculous laryngitis, tubercle bacilli not f...",4,4_tuberculosis___,"[tuberculosis, , , , , , , , , ]","[Other specified pulmonary tuberculosis, tuber...",tuberculosis - - - - - - - - -,1.000000,False
...,...,...,...,...,...,...,...,...
996,Chronic glomerulonephritis with lesion of prol...,12,12_nephroptosis___,"[nephroptosis, , , , , , , , , ]",[Acute kidney failure with lesion of renal cor...,nephroptosis - - - - - - - - -,1.000000,False
997,"Infection of kidney, unspecified",12,12_nephroptosis___,"[nephroptosis, , , , , , , , , ]",[Acute kidney failure with lesion of renal cor...,nephroptosis - - - - - - - - -,0.369595,False
998,Urethral caruncle,-1,-1_erythema___,"[erythema, , , , , , , , , ]","[Insect bite, nonvenomous of shoulder and uppe...",erythema - - - - - - - - -,0.000000,False
999,"Urinary obstruction, unspecified",-1,-1_erythema___,"[erythema, , , , , , , , , ]","[Insect bite, nonvenomous of shoulder and uppe...",erythema - - - - - - - - -,0.000000,False
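
The earlier idea of checking cluster validity against the ICD super category could be sketched as a simple purity score per topic. This is a rough illustration only: `topic_purity` is a hypothetical helper, and treating the first three characters of `icd_code` as the super category is an assumption (for example, ICD-9 E-codes use four characters, and ICD-9 and ICD-10 prefixes are not comparable with each other).

```python
from collections import Counter, defaultdict


def topic_purity(icd_codes, topics, prefix_len=3):
    """For each topic, return (dominant ICD prefix, share of docs with it).

    Documents in the outlier topic (-1) are skipped.
    """
    prefixes_by_topic = defaultdict(list)
    for code, topic in zip(icd_codes, topics):
        if topic == -1:
            continue
        # Assumption: the leading characters approximate the ICD category.
        prefixes_by_topic[topic].append(str(code)[:prefix_len])

    purity = {}
    for topic, prefixes in prefixes_by_topic.items():
        dominant, count = Counter(prefixes).most_common(1)[0]
        purity[topic] = (dominant, count / len(prefixes))
    return purity


# Codes and topic ids for the first five rows shown in the tables above.
toy_codes = ["0090", "01160", "01186", "01200", "01236"]
toy_topics = [0, 4, 4, 4, 4]
print(topic_purity(toy_codes, toy_topics))
```

A purity near 1 would suggest that a cluster lines up with a single ICD category, while low values may indicate either a poor cluster or a sensible grouping that cuts across the ICD hierarchy, so the score is a prompt for inspection rather than a verdict.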


In [13]:
topic_model.visualize_topics()


In [17]:
topic_model.visualize_barchart()

In [14]:
# dummy visualizing over time - create some synthetic timestamps

# Define the start and end dates
start_date = "2021-01-01"
end_date = "2023-12-31"

# Generate random dates between the start and end dates
n = len(docs)  # Number of random dates to generate
date_range = pd.date_range(start=start_date, end=end_date)
random_dates = np.random.choice(date_range, n)

# compute topic frequencies per (synthetic) timestamp
topics_over_time = topic_model.topics_over_time(docs, random_dates)



In [15]:
# visualize
topic_model.visualize_topics_over_time(topics_over_time)