## 1. Data Exploration 

We have four datasets to explore. Below is a summary of each dataset:
- **PMC-Patients**: A large-scale dataset of 167K patient summaries extracted from open-access case studies published in PubMed Central. Each note is a detailed case presentation written by a doctor, covering the patient's visit, medical history, symptoms, administered treatments, discharge summary, and the outcome of the intervention. These case presentations span a diverse range of medical scenarios, which makes them a useful basis for model training and evaluation.
- **NoteChat**: An extension of PMC-Patients with 167K synthetic patient-doctor conversations. Each dialogue transcript within the NoteChat dataset was generated from a clinical note by ChatGPT (version `gpt-3.5-turbo-0613`).
- **Augmented-Clinical-Notes**: The PMC-Patients and NoteChat datasets are augmented by extracting structured patient information from the 30,000 longest clinical notes. GPT-4 (version `gpt-4-turbo-0613`) is prompted with zero-shot instructions, using each clinical note and a structured template that defines key medical features.
- **MACCROBAT**: A dataset of 200 annotated clinical case reports with detailed entity and relation annotations. The annotations follow the MACCROBAT schema, which includes 33 entity types and 15 relation types relevant to clinical case reports.

How we will tie these datasets together:
1. **MACCROBAT Annotations**: We will use the MACCROBAT dataset to fine-tune our model for entity and relation extraction tasks. The detailed annotations will help the model learn to identify and classify various medical entities and their relationships within clinical texts.
2. **PMC-Patients**: In the demo, we will extract medical entities and relations from these clinical notes.
3. **NoteChat**: In the demo, we will extract medical entities and relations from the dialogue transcripts to simulate real-time extraction.
4. **Augmented-Clinical-Notes**: This dataset contains both the PMC-Patients notes and the NoteChat conversations, along with structured patient information extracted by GPT-4, so the demo will use it in place of the other two datasets.
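
Fine-tuning for entity extraction usually needs token-level labels rather than character offsets. The sketch below shows one common way to convert character-offset entities into BIO tags. It is an illustration, not necessarily the preprocessing this project uses: the helper name `char_spans_to_bio` is hypothetical, the whitespace tokenization is a simplifying assumption (a real pipeline would use the model's own tokenizer), and the sample text and offsets are illustrative values in the style of MACCROBAT annotations.

```python
import re

def char_spans_to_bio(text, entities):
    """Assign a BIO label to each whitespace-delimited token.

    `entities` is a list of {"type", "start", "end"} dicts with character
    offsets. When entities overlap, the first one in the list wins
    (a simplification made for this sketch).
    """
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    labels = ["O"] * len(tokens)
    for ent in entities:
        inside = False
        for i, (_, tok_start, tok_end) in enumerate(tokens):
            overlaps = tok_start < ent["end"] and tok_end > ent["start"]
            if overlaps and labels[i] == "O":
                labels[i] = ("I-" if inside else "B-") + ent["type"]
                inside = True
    return [(tok, lab) for (tok, _, _), lab in zip(tokens, labels)]

text = "The patient was a 3-year-old girl with the following features of VACTERL association"
entities = [
    {"type": "Age", "start": 18, "end": 28},
    {"type": "Sex", "start": 29, "end": 33},
    {"type": "Diagnostic_procedure", "start": 65, "end": 84},
]
tagged = char_spans_to_bio(text, entities)
for token, label in tagged:
    print(f"{token}\t{label}")
```

Multi-token entities receive a `B-` tag on their first token and `I-` tags on the rest, so "VACTERL association" is labeled as a single `Diagnostic_procedure` mention.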

### MACCROBAT Dataset Exploration

In [32]:
def read_and_extract_maccrobat_file(file_id):
    '''Reads a MACCROBAT2018 file given its file ID (without extension) and returns the text and entities.'''
    # Step 1: Read text
    with open(f"../data/MACCROBAT2018/{file_id}.txt", encoding="utf-8") as f:
        text = f.read()

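    # Step 2: Read entities (lines starting with "T" in the .ann file).
    # Discontinuous entities list several spans separated by ';', for example:
    #   T15	Disease_disorder 474 481;490 503	cardiac malformations
    # Each span is stored as a separate entity with the same type.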
    entities = []
    with open(f"../data/MACCROBAT2018/{file_id}.ann", encoding="utf-8") as f:
        for line in f:
            if line.startswith("T"):
                parts = line.strip().split('\t')
                eid, etype_offsets, etext = parts
                # The first token is the entity type; the remainder holds one or
                # more "start end" spans separated by ';'.
                etype, offsets = etype_offsets.split(maxsplit=1)
                for offset in offsets.split(';'):
                    start, end = offset.split()
                    entities.append({
                        "type": etype,
                        "start": int(start),
                        "end": int(end)
                    })
    return text, entities

# test function
text, entities = read_and_extract_maccrobat_file("21308977")
print("Text:", text[:200], "...")
print("Entities:", entities)

Text: The patient was a 3-year-old girl with the following features of VACTERL association: absent C1 vertebra, supernumerary lumbar vertebrae, hypoplastic sacrum/coccyx, fatty filum terminale with tethered ...
Entities: [{'type': 'Age', 'start': 18, 'end': 28}, {'type': 'Sex', 'start': 29, 'end': 33}, {'type': 'Diagnostic_procedure', 'start': 65, 'end': 84}, {'type': 'Subject', 'start': 554, 'end': 564}, {'type': 'Disease_disorder', 'start': 626, 'end': 648}, {'type': 'History', 'start': 626, 'end': 648}, {'type': 'Detailed_description', 'start': 609, 'end': 625}, {'type': 'Disease_disorder', 'start': 485, 'end': 503}, {'type': 'Disease_disorder', 'start': 512, 'end': 539}, {'type': 'Disease_disorder', 'start': 474, 'end': 481}, {'type': 'Disease_disorder', 'start': 490, 'end': 503}, {'type': 'Disease_disorder', 'start': 734, 'end': 747}, {'type': 'Detailed_description', 'start': 717, 'end': 733}, {'type': 'Diagnostic_procedure', 'start': 2831, 'end': 2853}, {'type': 'Diagnostic_proce
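
A quick sanity check on the parsed offsets is to slice the text with each `start`/`end` pair and confirm that the result reads like the labeled entity. The sketch below does this for the first three entities printed above, using only the opening of the printed text so that it runs without the dataset files. The helper name `entity_surface_forms` is hypothetical.

```python
def entity_surface_forms(text, entities):
    """Return (type, covered text) pairs by slicing `text` with each entity's offsets."""
    return [(ent["type"], text[ent["start"]:ent["end"]]) for ent in entities]

# Opening of the text and the first three entities from the output above
text = "The patient was a 3-year-old girl with the following features of VACTERL association"
entities = [
    {"type": "Age", "start": 18, "end": 28},
    {"type": "Sex", "start": 29, "end": 33},
    {"type": "Diagnostic_procedure", "start": 65, "end": 84},
]

surface_forms = entity_surface_forms(text, entities)
for etype, span in surface_forms:
    print(f"{etype}: {span!r}")
# Age: '3-year-old'
# Sex: 'girl'
# Diagnostic_procedure: 'VACTERL association'
```

The slices line up with the labels, which suggests the offsets are zero-based with an exclusive end, matching Python's slicing convention.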

In [33]:
# Collect file IDs from the .txt files in the MACCROBAT directory (each ID also has a .ann file)
import os
maccrobat_dir = "../data/MACCROBAT/"
all_file_ids = set()
for filename in os.listdir(maccrobat_dir):
    if filename.endswith(".txt"):
        file_id = filename[:-4]
        all_file_ids.add(file_id)
print(f"Total number of files: {len(all_file_ids)}")

# Average entities per file and average text length using the read_and_extract_maccrobat_file function
total_entities = 0
total_text_length = 0
for file_id in all_file_ids:
    text, entities = read_and_extract_maccrobat_file(file_id)
    total_entities += len(entities)
    total_text_length += len(text)

average_entities_per_file = total_entities / len(all_file_ids)
print(f"Average entities per file: {average_entities_per_file}")

average_text_length_per_file = total_text_length / len(all_file_ids)
print(f"Average text length per file: {average_text_length_per_file}")

Total number of files: 200
Average entities per file: 125.5
Average text length per file: 2828.58
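
Beyond the averages, the distribution of entity types shows which labels dominate. The sketch below counts types with `collections.Counter`, using only the 14 complete entities visible in the truncated output for file `21308977`. It is a sketch on that partial sample, not a statistic for the whole corpus. The same pattern could be applied to the entity lists of all files.

```python
from collections import Counter

# Entity types of the 14 complete entries in the truncated output above
sample_types = [
    "Age", "Sex", "Diagnostic_procedure", "Subject", "Disease_disorder",
    "History", "Detailed_description", "Disease_disorder", "Disease_disorder",
    "Disease_disorder", "Disease_disorder", "Disease_disorder",
    "Detailed_description", "Diagnostic_procedure",
]
entities = [{"type": t} for t in sample_types]

# Count how often each entity type occurs
type_counts = Counter(ent["type"] for ent in entities)
for etype, count in type_counts.most_common():
    print(f"{etype}: {count}")
```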


### Augmented Clinical Notes

In [1]:
# Load jsonl data
import json

with open("../data/augmented-clinical-notes/augmented_notes_30K.jsonl", "r") as f:
    data = [json.loads(line) for line in f]


In [3]:
# Keep only `idx`, `full_note`, and `conversation`,
# and only the first 30 samples for demo purposes
data_sample = [{"idx": item["idx"], "full_note": item["full_note"], "conversation": item["conversation"]} for item in data[:30]]
print(f"Loaded {len(data_sample)} samples from augmented clinical notes.")

# Save the sample to a new jsonl file
with open("../data/augmented-clinical-notes/augmented_notes_30K_sample.jsonl", "w") as f:
    for item in data_sample:
        f.write(json.dumps(item) + "\n")

Loaded 30 samples from augmented clinical notes.
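
Because the demo needs only the first 30 records, the file can also be read lazily instead of parsing all 30K lines. The sketch below uses `itertools.islice` for this. The helper name `load_first_n_jsonl` and the temporary stand-in file are assumptions made so that the example runs without the real dataset.

```python
import json
import os
import tempfile
from itertools import islice

def load_first_n_jsonl(path, n):
    """Parse only the first `n` lines of a JSON Lines file."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in islice(f, n)]

# Small stand-in file with the same three keys kept in the sample
records = [
    {"idx": i, "full_note": f"note {i}", "conversation": f"chat {i}"}
    for i in range(100)
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    sample = load_first_n_jsonl(demo_path, 30)

print(f"Loaded {len(sample)} samples.")
```

Reading stops after the 30th line, so memory use and parse time do not grow with the size of the full file.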


In [35]:
# Display example entry
print(json.dumps(data[0], indent=2))

{
  "note": "A a sixteen year-old girl, presented to our Outpatient department with the complaints of discomfort in the neck and lower back as well as restriction of body movements. She was not able to maintain an erect posture and would tend to fall on either side while standing up from a sitting position. She would keep her head turned to the right and upwards due to the sustained contraction of the neck muscles. There was a sideways bending of the back in the lumbar region. To counter the abnormal positioning of the back and neck, she would keep her limbs in a specific position to allow her body weight to be supported. Due to the restrictions with the body movements at the neck and in the lumbar region, she would require assistance in standing and walking. She would require her parents to help her with daily chores, including all activities of self-care.\nShe had been experiencing these difficulties for the past four months since when she was introduced to olanzapine tablets for the

Here, `note` comes from PMC-Patients, `conversation` comes from NoteChat, and `summary` is the structured patient information added by the GPT-4 augmentation.