# Step 1. Loading Public Clinical Notes Data from Hugging Face
The description below is taken from the dataset page on Hugging Face at [https://huggingface.co/datasets/AGBonnet/augmented-clinical-notes](https://huggingface.co/datasets/AGBonnet/augmented-clinical-notes).

### Augmented Clinical Notes

The Augmented Clinical Notes dataset is an extension of existing datasets containing 30,000 triplets from different sources:

- Real clinical notes (PMC-Patients): Clinical notes correspond to patient summaries from the PMC-Patients dataset, which are extracted from PubMed Central case studies.
- Synthetic dialogues (NoteChat): Synthetic patient-doctor conversations were generated from clinical notes using GPT 3.5.
- Structured patient information (ours): From clinical notes, we generate structured patient summaries using GPT-4 and a tailored medical information template (see details below).

This dataset was used to train MediNote-7B and MediNote-13B, a set of clinical note generators fine-tuned from the MediTron large language models.

Our full report is available [here](https://huggingface.co/datasets/AGBonnet/augmented-clinical-notes/blob/main/report.pdf).
### Dataset Details

- Curated by: Antoine Bonnet and Paul Boulenger
- Language(s): English only
- Repository: EPFL-IC-Make-Team/ClinicalNotes
- Paper: MediNote: Automated Clinical Notes



In [1]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

In [2]:
# load in the clinical notes dataset from huggingface
df = pd.read_json("hf://datasets/AGBonnet/augmented-clinical-notes/augmented_notes_30K.jsonl", lines=True, nrows=300)



In [3]:
df

Unnamed: 0,note,conversation,idx,summary,full_note
0,"A a sixteen year-old girl, presented to our Ou...","Doctor: Good morning, what brings you to the O...",155216,"{\n""visit motivation"": ""Discomfort in the neck...","A a sixteen year-old girl, presented to our Ou..."
1,This is the case of a 56-year-old man that was...,"Doctor: Hi, how are you feeling today?\nPatien...",77465,"{\n""visit motivation"": ""Complaints of a dull p...",This is the case of a 56-year-old man that was...
2,A 36-year old female patient visited our hospi...,"Doctor: Hello, what brings you to the hospital...",133948,"{\n""visit motivation"": ""Pain and restricted ra...",A 36-year old female patient visited our hospi...
3,A 49-year-old male presented with a complaint ...,"Doctor: Good morning, Mr. [Patient's Name]. I'...",80176,"{\n""visit motivation"": ""Pain in the left proxi...",A 49-year-old male presented with a complaint ...
4,A 47-year-old male patient was referred to the...,"Doctor: Good morning, how are you feeling toda...",72232,"{\n""visit motivation"": ""Recurrent attacks of p...",A 47-year-old male patient was referred to the...
...,...,...,...,...,...
295,"A 36-year-old man, originally from Latin Ameri...","Doctor: Hello, what brings you in today?\nPati...",174877,"{\n""visit motivation"": ""Complaints of abdomina...","A 36-year-old man, originally from Latin Ameri..."
296,An otherwise healthy 22-year-old caucasian wom...,"Doctor: Good morning, how can I help you today...",41761,"{\n""visit motivation"": ""Accelerated growth of ...",An otherwise healthy 22-year-old caucasian wom...
297,"A 36-year-old man, originally from Latin Ameri...","Doctor: Good afternoon, sir. I understand that...",7876,"{\n""visit motivation"": ""Complaints of abdomina...","A 36-year-old man, originally from Latin Ameri..."
298,Our patient is a 38-year-old male who presente...,"Doctor: Hi there, how are you feeling today?\n...",182286,"{\n""visit motivation"": ""Right chest wall and s...",Our patient is a 38-year-old male who presente...
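The `summary` column shown above holds the structured patient information as a JSON string. A minimal sketch of parsing it, assuming every value is valid JSON; the two-row `toy` frame and the `visit_motivation` column name are hypothetical stand-ins, not part of the dataset:

```python
import json
import pandas as pd

# Hypothetical stand-in for the real `summary` column (JSON stored as text)
toy = pd.DataFrame({
    "idx": [1, 2],
    "summary": [
        '{"visit motivation": "Discomfort in the neck"}',
        '{"visit motivation": "Pain in the left knee"}',
    ],
})

# Parse each JSON string into a dict, then pull out a single field
parsed = toy["summary"].apply(json.loads)
toy["visit_motivation"] = parsed.apply(lambda d: d.get("visit motivation"))
print(toy[["idx", "visit_motivation"]])
```

On the real data, the same two lines applied to `df['summary']` would raise a `json.JSONDecodeError` on any malformed entry, so wrapping `json.loads` in a try/except may be worthwhile.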


# Step 2. Importing scispaCy and Creating a Model for Named Entity Linking

In [4]:
import spacy
import scispacy
from scispacy.linking import EntityLinker

In [5]:
# now we create our model instance which can be used to process biomedical text
nlp = spacy.load("en_core_sci_sm")



In [6]:
# now we add a linker to the UMLS knowledge base to our model pipeline
# (linker_name is "umls" because the later steps rely on UMLS CUIs and semantic type ids)
nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "umls"})

<scispacy.linking.EntityLinker at 0x33e5e3850>

In [7]:
linker = nlp.get_pipe("scispacy_linker")

### Quick Example: Getting the Entities from one of the clinical notes

In [9]:
doc = nlp(df['note'][0])
for ent in doc.ents:
    print(ent.text, ent.label_, ent._.kb_ents)

year-old ENTITY []
girl ENTITY [('C0043210', 0.9925637245178223)]
Outpatient department ENTITY [('C0029921', 0.7193280458450317)]
complaints ENTITY []
discomfort ENTITY []
neck ENTITY [('C0027530', 0.9866217970848083), ('C0007859', 0.7661657333374023), ('C0746787', 0.7246235609054565)]
restriction ENTITY [('C1135809', 0.8025245070457458), ('C0425422', 0.7399560213088989)]
body movements ENTITY [('C0026649', 0.8049532771110535), ('C1621968', 0.7099595069885254)]
erect ENTITY []
posture ENTITY [('C1262869', 0.9909231662750244), ('C1256755', 0.7185211181640625)]
fall ENTITY []
side ENTITY []
standing ENTITY [('C4277736', 0.7244225740432739)]
sitting position ENTITY [('C0277814', 0.9761723279953003), ('C2584297', 0.714629590511322)]
head ENTITY [('C0018670', 0.9888428449630737), ('C0018681', 0.7552989721298218), ('C0411280', 0.7389206290245056), ('C3658201', 0.7268886566162109)]
right ENTITY [('C0225844', 0.7840086817741394), ('C0030706', 0.7462760806083679), ('C0035617', 0.729196786880493

# Step 3. Definition of Function for Extracting Named Entities from the Clinical Notes
Here we define a function that, given the text of a note, extracts the top 3 linked vocabulary terms (ranked by confidence score) for each entity in the note and returns them as a pandas `DataFrame`.

In [None]:
def get_linked_entities_for_doc(text, nlp, linker):
  # fixed columns, so a note with no linked entities still yields a DataFrame with a 'score' column
  columns = ['entity_name', 'cui', 'score', 'name', 'definition', 'type_ids']
  # run the pipeline on the note text
  doc = nlp(text)
  # collect one record per (entity, candidate concept) pair
  linked_entities = []
  for ent in doc.ents: # loop over all recognized entities
    for cui, score in ent._.kb_ents[:3]: # top 3 linked vocabulary terms for each entity
      kb_entity = linker.kb.cui_to_entity[cui] # look up the knowledge base entry once
      linked_entities.append({
          'entity_name': ent.text,
          'cui': cui,
          'score': score,
          'name': kb_entity.canonical_name,
          'definition': kb_entity.definition,
          'type_ids': ','.join(kb_entity.types),
      })
  return pd.DataFrame(linked_entities, columns=columns).drop_duplicates()


#### Here we get the linked entities for a single document and filter to only include those with a score of at least 0.9

In [None]:
get_linked_entities_for_doc(df['full_note'][1], nlp, linker).query('score >= 0.9')

In [None]:
import pprint

In [None]:
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(df['full_note'][1])

# Step 4. Extracting All Named Entities from The Clinical Notes
Now we loop over the clinical notes dataset and use the function defined above to extract, for each note, all linked entities with a confidence score >= 0.9.

In [None]:
from tqdm import tqdm

In [None]:
linked_entity_dfs = []
for _, row in tqdm(df.iterrows(), total=len(df)):
  linked_entities = get_linked_entities_for_doc(row['full_note'], nlp, linker).query('score >= 0.9')
  linked_entity_dfs.append(
      linked_entities.assign(type_ids_lst=lambda x: x['type_ids'].str.split(','))
      .explode('type_ids_lst').assign(note_id=row['idx'])
    )
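The split-and-explode step in the loop above turns the comma-separated `type_ids` string into one row per semantic type. A small sketch on toy data shows the effect; the CUIs come from the example output earlier, while the type ids assigned to them here are made up for illustration:

```python
import pandas as pd

# Toy linked-entity table; the type_ids values are illustrative assumptions
toy = pd.DataFrame({
    "entity_name": ["neck", "head"],
    "cui": ["C0027530", "C0018670"],
    "type_ids": ["T029", "T023,T029"],
})

exploded = (
    toy.assign(type_ids_lst=lambda x: x["type_ids"].str.split(","))  # string -> list
    .explode("type_ids_lst")                                          # one row per list element
    .assign(note_id=155216)                                           # tag rows with the note id
)
print(exploded)
```

Because an entity with two semantic types becomes two rows, any later row count (for example with `value_counts`) counts entity-type pairs rather than distinct entity mentions.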

In [None]:
linked_entities_all = pd.concat(linked_entity_dfs)

In [None]:
linked_entities_all

In [None]:
# inspecting the 50 most frequently occurring linked entity names
linked_entities_all['entity_name'].value_counts().head(50)

# Step 5. Labeling Semantic Types for All Linked Entities
Now we load in a dataset of semantic type labels and link these into the linked entities dataset

In [None]:
# first load in the file containing the labels for the semantic types
semantic_type_labels = pd.read_csv('https://github.com/expmed/arch_workshop_scispacy_entity_linking_ws11/raw/refs/heads/main/umls_terms.txt')

In [None]:
semantic_type_labels

In [None]:
# now we add in these semantic type labels
linked_entities_final = linked_entities_all.merge(
    semantic_type_labels,
    left_on='type_ids_lst',
    right_on='tui',
    how='left'
)
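A left merge keeps every linked entity even when its semantic type id has no label. One way to check for such gaps is the `indicator=True` option of `DataFrame.merge`. The sketch below uses toy frames; the `tui` and `label` column names are assumed to match the label file, and `T999` is a deliberately invalid id:

```python
import pandas as pd

# Toy stand-ins for linked_entities_all and semantic_type_labels
entities = pd.DataFrame({
    "cui": ["C1", "C2", "C3"],
    "type_ids_lst": ["T023", "T029", "T999"],  # T999 is a made-up id with no label
})
labels = pd.DataFrame({
    "tui": ["T023", "T029"],
    "label": ["Body Part, Organ, or Organ Component", "Body Location or Region"],
})

merged = entities.merge(
    labels, left_on="type_ids_lst", right_on="tui", how="left", indicator=True
)
# rows marked 'left_only' found no matching semantic type label
unmatched = merged.loc[merged["_merge"] == "left_only", "type_ids_lst"].unique()
print(unmatched)
```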

In [None]:
linked_entities_final

## Mini Exercise: Show the Top 20 Most Frequently Occurring Semantic Types Among Linked Entities

In [None]:
# Your solution below...

# Step 6. Utilizing Publicly Available Crosswalk File to Link Entities to MeSH Terms and SNOMED Terms
Here we use data extracted from the `MRCONSO.RRF` file hosted at [the National Library of Medicine](https://www.nlm.nih.gov/research/umls/new_users/online_learning/Meta_006.html) to crosswalk the CUIs of the linked entities to Medical Subject Headings (MeSH) terms and Systematized Nomenclature of Medicine (SNOMED) terms.

In [None]:
# first we load in mappings from UMLS concept unique ids to Medical Subject Heading (MeSH) terms
mrconso_mesh_mappings = pd.read_parquet('https://github.com/expmed/arch_workshop_scispacy_entity_linking_ws11/raw/refs/heads/main/mrconso_mesh.parquet')

In [None]:
mrconso_mesh_mappings

In [None]:
# now we can merge the linked entities with the MeSH mappings
linked_entities_mesh = linked_entities_final.merge(
    mrconso_mesh_mappings,
    left_on='cui',
    right_on='CUI',
    how='left'
)

In [None]:
linked_entities_mesh

In [None]:
# now we load in the mappings from CUIs to Systematized Nomenclature of Medicine - Clinical Terms
mrconso_snomed_mappings = pd.read_parquet('https://github.com/expmed/arch_workshop_scispacy_entity_linking_ws11/raw/refs/heads/main/mrconso_snomed.parquet')

In [None]:
linked_entities_snomed = linked_entities_final.merge(
    mrconso_snomed_mappings,
    left_on='cui',
    right_on='CUI',
    how='left'
)

In [None]:
linked_entities_snomed

In [None]:
print(f"{len(linked_entities_snomed.dropna(subset=['CODE'])) / len(linked_entities_snomed) * 100:.1f}% of the entities have a SNOMED code")

In [None]:
print(f"{len(linked_entities_mesh.dropna(subset=['CODE'])) / len(linked_entities_mesh) * 100:.1f}% of the entities have a MeSH code")
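The percentages above are computed over rows of the merged tables, and a single CUI can appear in many rows (once per note, semantic type, and mapped code). As a complementary check, coverage can also be measured per distinct CUI. A minimal sketch on toy data, assuming the merged table has the `cui` and `CODE` columns used above:

```python
import pandas as pd

# Toy merged table: C1 appears twice, C2 has no mapped code
linked = pd.DataFrame({
    "cui": ["C1", "C1", "C2", "C3"],
    "CODE": ["D01", "D01", None, "D03"],
})

# share of rows with a code (what the cells above report)
row_pct = linked["CODE"].notna().mean() * 100
# share of distinct CUIs with at least one code
per_cui = linked.groupby("cui")["CODE"].apply(lambda s: s.notna().any())
cui_pct = per_cui.mean() * 100

print(f"{row_pct:.1f}% of rows, {cui_pct:.1f}% of distinct CUIs have a code")
```

The two numbers can differ noticeably when a few frequent concepts dominate the rows.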

# Step 7. Utilize MeSH Hierarchy to Semantically Group Linked Entities
The MeSH hierarchy file used here was originally downloaded from the NIH National Library of Medicine [MeSH download page](https://www.nlm.nih.gov/databases/download/mesh.html). The original file is in XML format, which I then processed and converted into a CSV file for easier loading and reduced disk storage.

In [None]:
# now load in the MeSH Hierarchy file
mesh_hierarchy = pd.read_csv('https://github.com/expmed/arch_workshop_scispacy_entity_linking_ws11/raw/refs/heads/main/mesh_hierarchy.csv')

In [None]:
mesh_hierarchy

In [None]:
# now we link in the tree numbers to the MeSH mapped entities
linked_entities_mesh_hierarchy = linked_entities_mesh.merge(
    mesh_hierarchy[['UI', 'tree_number']],
    left_on='CODE',
    right_on='UI',
    how='inner'
)

In [None]:
linked_entities_mesh_hierarchy

In [None]:
set(linked_entities_mesh_hierarchy['tree_number'])

In [None]:
# format the mesh hierarchy as a lookup table/dictionary
mesh_dictionary = {
    row['tree_number']: row['name']
    for _, row in tqdm(mesh_hierarchy.iterrows(), total=len(mesh_hierarchy))
}

Here we define a function that iteratively walks down the MeSH hierarchy by taking longer and longer prefixes of the MeSH tree numbers, until all tree numbers and their ancestors have been enumerated.
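The prefix idea can be sketched for a single tree number. The helper name `tree_number_ancestors` is hypothetical and is not used elsewhere in this notebook; it only illustrates how each dot-separated prefix identifies one ancestor:

```python
def tree_number_ancestors(tree_number):
    """Return the ancestor tree numbers of a MeSH tree number, top level first."""
    parts = tree_number.split(".")
    # prefixes of length 1 .. len(parts) - 1; the full tree number itself is excluded
    return [".".join(parts[:level]) for level in range(1, len(parts))]

print(tree_number_ancestors("C04.557.337"))
# prints ['C04', 'C04.557']
```

A top-level tree number such as `C04` has no ancestors, so the helper returns an empty list for it.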

In [None]:
# now we specify a function to walk down the MeSH tree, level by level, for each entity
def walk_mesh_hierarchy(entities_df, mesh_hierarchy):
  result = entities_df.copy()
  # get the set of distinct tree numbers in the dataset
  tree_nums = set(entities_df['tree_number'].tolist())
  # start at the top level
  level = 1
  # while we still have tree numbers to process
  while len(tree_nums) > 0:
    print(f"Processing level {level}")
    # save the mappings for the current level in a list
    level_mappings = []
    # keep track of tree numbers to remove after processing this level
    to_remove = set()
    # loop over the tree nums
    for tree_num in tree_nums:
      # get the prefix for the current tree level
      prefix = '.'.join(tree_num.split(".")[:level])
      # if the prefix is different from the tree number, save a mapping for the current level
      if prefix != tree_num:
        level_mappings.append({
            'tree_number': tree_num,
            f'level_{level}_tree_number': prefix,
            f'level_{level}_parent_name': mesh_hierarchy[prefix]
        })
      else:
        # we have already enumerated all ancestors if the prefix matches, so remove the tree number
        to_remove.add(tree_num)
    # merge in the mappings for the current level if we have any
    if len(level_mappings) > 0:
      result = result.merge(
          pd.DataFrame(level_mappings),
          on='tree_number',
          how='left'
      )
    # move one level down the tree
    level += 1
    # update the set of tree_nums
    tree_nums = tree_nums - to_remove
  # return the result dataframe
  return result


In [None]:
linked_entities_mesh_final = walk_mesh_hierarchy(
    linked_entities_mesh_hierarchy,
    mesh_dictionary
)

In [None]:
linked_entities_mesh_final[['note_id', 'entity_name', 'name', 'definition', 'label', 'level_1_parent_name', 'level_2_parent_name', 'tree_number']]

# Exercises

In [None]:
# Exercise 1: Count the number of patient notes that mention respiratory tract diseases

In [None]:
# Exercise 2: For entities with a tree number prefixed by 'C' (Diseases), rank them by the number of notes mentioning each kind of disease
# Use the level 1 parent name


In [None]:
# Exercise 3: What are the 10 most frequent anatomical parts mentioned in notes tagged with Neoplasms?
# Note: MeSH terms categorized as anatomical have a tree number prefixed by 'A'

