## Reproducing the LePendu et al. steps

Goal: annotate clinical text from EHR systems to extract disease and drug mentions.<br>

Reproduce steps:
1. [DONE] Obtain lexicon from ontologies downloaded from [UMLS](https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html) (umls-2021AB-full.zip), using [Owlready2](https://pypi.org/project/Owlready2/)
2. [TODO] Annotate by string matching the text to the lexicon downloaded
3. [DONE] Apply NegEx trigger rules to separate negated terms (negation detection: the ability to discern whether a term is negated within the context of the narrative)
4. Compile and combine terms:
    1. [DONE] Compile terms (both positive and negative) into a temporally ordered series of sets for each patient
    2. [TODO] Combine them with coded and structured data when possible
5. [TODO] Reason over the structure of the ontologies to normalize and aggregate terms for further analysis
6. From each note, output: set of negated terms, set of non-negated terms
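
Steps 4.1 and 6 can be sketched roughly as follows. This is only a minimal illustration, not the paper's implementation: the `NoteTerms` record, the `build_timeline` helper, and the sample notes are all hypothetical names and data introduced here.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import date


@dataclass
class NoteTerms:
    """Terms extracted from one note (hypothetical record layout)."""
    patient_id: str
    note_date: date
    positive: set = field(default_factory=set)  # non-negated terms
    negated: set = field(default_factory=set)   # negated terms


def build_timeline(notes):
    """Group notes by patient and sort each patient's notes by date."""
    timeline = defaultdict(list)
    for note in notes:
        timeline[note.patient_id].append(note)
    for patient_notes in timeline.values():
        patient_notes.sort(key=lambda n: n.note_date)
    return dict(timeline)


# Made-up sample notes, deliberately out of order
notes = [
    NoteTerms("p1", date(2020, 3, 1), {"myocardial infarction"}, set()),
    NoteTerms("p1", date(2020, 1, 5), {"rheumatoid arthritis"}, {"fever"}),
    NoteTerms("p2", date(2020, 2, 2), {"seizure"}, {"allergies"}),
]
timeline = build_timeline(notes)
for patient_id, patient_notes in timeline.items():
    for n in patient_notes:
        print(patient_id, n.note_date, sorted(n.positive), sorted(n.negated))
```

Keeping the positive and negated sets side by side per note means the per-note output of step 6 and the per-patient series of step 4.1 come from the same structure.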

__Additional:__
1. __Ontology:__<br>
    For this study, use a subset of those ontologies (Table 3) that are most relevant to clinical domains, including:
    - Unified Medical Language System (UMLS) terminologies such as __SNOMED-CT__, 
    - the __National Drug File (NDFRT)__, and 
    - __RxNORM__
    - as well as ontologies like the Human Disease Ontology. <br>
    
    The resulting lexicon contains __2.8 million unique terms.__
2. The output of the annotation workflow is a set of __negated and non-negated terms__ from each note (Figure 1, step 3). <br>
   As a result, for each patient we end up with a temporal series of terms mentioned in the notes (red denotes negated terms in Figure 1, step 4).<br>
   We also include __manually encoded ICD9 terms__ for each patient encounter as additional terms.<br>
   Because each encounter’s date is recorded, we can order each set of terms for a patient to create a timeline view of the patient’s record. <br>
   Using the terms as features, we can define patterns of interest (such as patients with rheumatoid arthritis, who take rofecoxib, and then get myocardial infarction), which we can use in data mining applications.
3. __Normalizing and aggregating terms.__<br>
We use the __RxNORM terminology__ to __normalize the drug having the trade name Vioxx into its primary active ingredient__, rofecoxib.<br>
From the set of ontologies we use, the Annotator identifies all notes containing any string denoting this term as either its primary label or synonym.<br>
We use __all other ontologies to normalize strings denoting rheumatoid arthritis or myocardial infarction__ and the Annotator identifies all notes containing them.<br>
As an option, we can also enable reasoning to infer all subsumed terms, which increases the number of notes that we can identify beyond pure string matches.<br>
For example, patients with Caplan’s or Felty’s syndrome may also fit the cohort of patients with rheumatoid arthritis.<br>
Therefore, notes that mention these diseases can automatically be included as well even though their associated strings look nothing alike.<br>
We did not use such reasoning for results reported in this specific study.
4. Obtaining ICD9 discharge codes<br>
Patient visits include in some cases the discharge diagnosis in the form of an ICD9 code. <br>
The ICD9 codes for rheumatoid arthritis begin with 714 and the ICD9 code for myocardial infarction begins with 410. <br>
We __include these manually encoded terms as part of the analysis__ as a __comparison against what we can find in the text itself__.
5. We are also extending the system to discern additional contextual cues such as family history versus recent diagnosis.
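
The pattern of interest from item 2 (rheumatoid arthritis, then rofecoxib, then myocardial infarction) combined with the ICD9 prefixes from item 4 could be checked along these lines. This is a sketch under assumptions: the encounter layout (`terms` and `icd9` sets), the helper names, and the choice to let consecutive steps match the same encounter are not taken from the paper.

```python
def icd9_matches(code, prefix):
    """True if an ICD9 code (with or without the dot) starts with the prefix."""
    return code.replace(".", "").startswith(prefix)


def follows_pattern(encounters, steps):
    """Check whether the steps occur in order across date-sorted encounters.

    encounters: list of dicts with 'terms' and 'icd9' sets, sorted by date.
    steps: list of predicates, each taking one encounter.
    Assumption: a later step may match the same encounter as the previous one.
    """
    position = 0
    for step in steps:
        while position < len(encounters) and not step(encounters[position]):
            position += 1
        if position == len(encounters):
            return False
    return True


def has_ra(e):
    return "rheumatoid arthritis" in e["terms"] or any(icd9_matches(c, "714") for c in e["icd9"])


def takes_rofecoxib(e):
    return "rofecoxib" in e["terms"]


def has_mi(e):
    return "myocardial infarction" in e["terms"] or any(icd9_matches(c, "410") for c in e["icd9"])


# Made-up encounters for one patient, already sorted by date
encounters = [
    {"date": "2001-02-03", "terms": {"rheumatoid arthritis"}, "icd9": {"714.0"}},
    {"date": "2001-06-10", "terms": {"rofecoxib"}, "icd9": set()},
    {"date": "2002-01-15", "terms": set(), "icd9": {"410.71"}},
]
steps = [has_ra, takes_rofecoxib, has_mi]
print(follows_pattern(encounters, steps))
```

Accepting either a text term or a matching ICD9 code in each predicate mirrors the idea of comparing coded diagnoses against what is found in the text itself.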

Timeline:
1. day 1: Reproduce 1&2, and additional 1&2
2. day 2: Reproduce 4.2. and additional 3&4
3. day 3: Revise steps, read the paper again, Reproduce 5,6, and additional 5

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', 100) 

path = 'data_pool/'

In [2]:
mimic_iii_demo = path + 'mimic-iii-clinical-database-demo-1.4/'
df = pd.read_csv(mimic_iii_demo + 'D_ICD_DIAGNOSES.csv')
print('df.iloc[0][\'long_title\']', df.iloc[0]['long_title'])
df.head(3)

df.iloc[0]['long_title'] Erythema nodosum with hypersensitivity reaction in tuberculosis, tubercle bacilli not found by bacteriological or histological examination, but tuberculosis confirmed by other methods [inoculation of animals]


| | row_id | icd9_code | short_title | long_title |
|---|---|---|---|---|
| 0 | 1 | 1716 | Erythem nod tb-oth test | Erythema nodosum with hypersensitivity reaction in tuberculosis, tubercle bacilli not found by b... |
| 1 | 2 | 1720 | TB periph lymph-unspec | Tuberculosis of peripheral lymph nodes, unspecified |
| 2 | 3 | 1721 | TB periph lymph-no exam | Tuberculosis of peripheral lymph nodes, bacteriological or histological examination not done |


### Obtaining ontologies from UMLS

Annotate by string matching the text to the lexicon downloaded

More terminologies:
- [UMLS Metathesaurus Vocabulary Documentation](https://www.nlm.nih.gov/research/umls/sourcereleasedocs/index.html)
- [VSAC Downloadable Resources](https://vsac.nlm.nih.gov/download/ccda?rel=20210810)
- [RxNorm](https://www.nlm.nih.gov/research/umls/rxnorm/index.html)
- [CDE](https://cde.nlm.nih.gov/home)

Some guide on querying:
- [Documentation from source](https://bitbucket.org/jibalamy/owlready2/src/master/doc/)
- [Owlready2](https://pypi.org/project/Owlready2/)
- [How to extract/get all the terms(classes/properties) in an ontology](https://stackoverflow.com/questions/52655960/how-to-extract-get-all-the-termsclasses-properties-in-an-ontology)
- [querying the owl ontology](https://stackoverflow.com/questions/22542348/querying-the-owl-ontology)
- [SPARQL query](https://owlready2.readthedocs.io/en/v0.35/sparql.html)

In [3]:
from owlready2 import get_ontology
from owlready2.pymedtermino2.umls import import_umls



In [4]:
%%time

# include more from: https://www.nlm.nih.gov/research/umls/sourcereleasedocs/index.html
import_umls("umls-2021AB-full.zip", terminologies = ["ICD10", "SNOMEDCT_US", "CUI"])

PYM = get_ontology("http://PYM/").load()
PYM_classes_list = list(PYM.classes())

# Each class prints as e.g. `CUI["C0439111"] # H`; the labels after '#' are the terms.
terms = []
for pym_class in PYM_classes_list:
    str_class_selected = str(pym_class)
    if '#' in str_class_selected:
        strings_selected = str_class_selected[str_class_selected.index('#')+2:-1].replace(' ; ',';').split('; ')
        terms.extend(strings_selected)

# Deduplicate and drop empty strings
unique_terms = [term for term in set(terms) if term]

print(len(unique_terms), 'unique terms obtained')

Importing UMLS from umls-2021AB-full.zip with Python version 3.8.8 and Owlready version 2-0.35...
Full UMLS release - importing UMLS from inner Zip file 2021AB-full/2021ab-1-meta.nlm...
  Parsing 2021AB/META/MRSTY.RRF.gz as MRSTY with encoding UTF-8
  Parsing 2021AB/META/MRRANK.RRF.gz as MRRANK with encoding UTF-8
  Parsing 2021AB/META/MRCONSO.RRF.aa.gz as MRCONSO with encoding UTF-8
  Parsing 2021AB/META/MRCONSO.RRF.ab.gz as MRCONSO with encoding UTF-8
  Parsing 2021AB/META/MRDEF.RRF.gz as MRDEF with encoding UTF-8
Full UMLS release - importing UMLS from inner Zip file 2021AB-full/2021ab-2-meta.nlm...
  Parsing 2021AB/META/MRREL.RRF.aa.gz as MRREL with encoding UTF-8
  Parsing 2021AB/META/MRREL.RRF.ab.gz as MRREL with encoding UTF-8
  Parsing 2021AB/META/MRREL.RRF.ac.gz as MRREL with encoding UTF-8
  Parsing 2021AB/META/MRREL.RRF.ad.gz as MRREL with encoding UTF-8
  Parsing 2021AB/META/MRSAT.RRF.aa.gz as MRSAT with encoding UTF-8
  Parsing 2021AB/META/MRSAT.RRF.ab.gz as MRSAT with enc

NameError: name 'unique_terms' is not defined

In [54]:
terms = []
for i in range(len(PYM_classes_list)):
    str_class_selected = str(PYM_classes_list[i])
    if '#' in str_class_selected:
        strings_selected = str_class_selected[str_class_selected.index('#')+2:-1].replace(' ; ',';').split('; ')
        if 'H' in strings_selected:
            print(str_class_selected, strings_selected)
        terms.extend(strings_selected)

CUI["C0439111"] # H
 ['H']
SNOMEDCT_US["257967009"] # H
 ['H']


In [55]:
# Adjacent string literals are concatenated, so no indentation whitespace leaks into the text
sample_clinical_text = ("This is a 31 year old male s/p seizure on ladder with resulting fall 15-20 feet on [**09-17**] "
                        "now presenting to the T/SICU post surgical repair of multiple facial fractures, "
                        "right mandibular fracture, and left distal radius fracture. "
                        "He needs to remain intubated for 48 hours post-op. "
                        "His past medical history is significant only for seizure disorder, "
                        "and his only medication is depakote. He has no known allergies.")

In [46]:
for i, unique_term in enumerate(unique_terms):
    if unique_term in sample_clinical_text:
        print(i, unique_term)

4409 c
18300 5
19537 -2
23501 z
34599 a
50498 S
56555 m
71021 48 hours
79176 w
88425 15
103363 rem
109455 cal
125378 g
127627 y
134625 17
158107 year
169429 U
176354 I
176963 hour
178763 s
180525 t
185016 k
229427 rad
236416 is
236907 n
241478 ft
248897 2
253897 20 feet
263116 us
275576 3
282099 7
285780 i
307054 20
329881 -1
330978 r
334461 d
373637 31
380794 u
381711 0
387280 8
415304 4
419479 ng
419986 48
420545 e
434100 H
440647 b
451148 9
463734 T
472654 l
486132 1
486857 p
486890 8 hours
491100 f
491383 o
494675 C
503530 h
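
The plain substring test above mostly returns single characters and word fragments (`rem`, `cal`, `rad`). One possible refinement, shown here only as a sketch, is to match whole words and skip very short terms. The `match_lexicon` helper, the `min_length=3` threshold, and the small sample lexicon are assumptions for illustration, not part of the original workflow.

```python
import re


def match_lexicon(text, lexicon, min_length=3):
    """Return lexicon terms that occur in text as whole words (case-insensitive)."""
    matches = []
    for term in lexicon:
        if len(term) < min_length:
            continue  # skip single letters and other very short terms
        # (?<!\w) and (?!\w) require a non-word character or a string edge around the term
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            matches.append(term)
    return matches


text = "He needs to remain intubated for 48 hours post-op. His only medication is depakote."
lexicon = ["rem", "cal", "H", "48 hours", "hour", "depakote", "seizure"]
print(match_lexicon(text, lexicon))  # ['48 hours', 'depakote']
```

Running one regular expression per term will be slow for a lexicon with millions of entries; looking up word n-grams from the note in a set of lexicon terms is likely a better fit at that scale.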


### Separate negated terms using NegEx

TODO: re-read the NegEx code example.

In [None]:
import csv
import pandas as pd
from ordered_set import OrderedSet #4.0.2

from negex.python.negex import negTagger, sortRules

In [None]:
def find_negated_terms(sentence):
    negated_term_count = sentence.count('[NEGATED]')//2
    
    negated_terms = OrderedSet()
    if negated_term_count > 0:
        unpack_negated_words = sentence.split('[NEGATED]')
        for i in range(negated_term_count):
            # store negated terms
            negated_term = unpack_negated_words[2*(i+1)-1]
            negated_terms.append(negated_term)

    return negated_terms

def get_patients_negated_terms(negex_triggers_text_filepath = './negex/python/negex_triggers.txt',
                               patients_clinical_reports_text_filepath = './negex/python/Annotations-1-120.txt'):
    # Load and sort the NegEx trigger rules
    with open(negex_triggers_text_filepath) as rfile:
        irules = sortRules(rfile.readlines())

    tagged_sentences = []
    with open(patients_clinical_reports_text_filepath, 'r') as reports_file:
        reports = csv.reader(reports_file, delimiter = '\t')
        next(reports, None)  # skip header
        for row in reports:
            # row[1] is passed as the phrase to check, row[2] as the sentence containing it
            tagger = negTagger(sentence = row[2], phrases = [row[1]], rules = irules, negP=False)
            tagged_sentences.append(tagger.getNegTaggedSentence())

    # Collect the negated terms of each tagged sentence
    patients_negated_terms = [find_negated_terms(sentence) for sentence in tagged_sentences]

    return patients_negated_terms, tagged_sentences

In [None]:
%%time
patients_negated_terms, tagged_sentences = get_patients_negated_terms()

In [None]:
tagged_sentences[0]

### Summary

Output the negated and non-negated terms
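
A possible way to produce both sets from one tagged sentence is sketched below. The `[NEGATED]...[NEGATED]` markers are the ones parsed by `find_negated_terms` above; the `[PHRASE]...[PHRASE]` markers for non-negated terms are an assumption about the tagger's output and should be checked against real tagged sentences. The `split_terms` helper and the example sentence are hypothetical.

```python
import re

# Matches [NEGATED]term[NEGATED] or [PHRASE]term[PHRASE]; \1 forces the closing tag to equal the opening tag
TAG_PATTERN = re.compile(r"\[(NEGATED|PHRASE)\](.+?)\[\1\]")


def split_terms(tagged_sentence):
    """Return (negated_terms, non_negated_terms) in order of first appearance."""
    negated, non_negated = [], []
    for tag, term in TAG_PATTERN.findall(tagged_sentence):
        target = negated if tag == "NEGATED" else non_negated
        if term not in target:
            target.append(term)
    return negated, non_negated


example = "He has [PHRASE]seizure disorder[PHRASE] and no known [NEGATED]allergies[NEGATED]."
print(split_terms(example))  # (['allergies'], ['seizure disorder'])
```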

### Additional info

Steps:
1. Gather dataset
2. Preprocess using Open Biomedical Annotator
    - Normalize records:
        - Depending on the data type:
            - __diagnoses, medications, procedures, lab tests__: count presence of each normalized code in patient EHRs<br>
                → aiming to facilitate the modelling of related clinical events
            - __free text clinical notes__: LePendu et al:
                - Allowed identifying the negated tags and those related to family history
                    - A tag that appeared as negated in the note was considered not relevant and discarded <br>
                      → Negated tags were identified using NegEx:
                              a regular expression algorithm that implements several phrases indicating negation:
                              - filters out sentences containing phrases that falsely appear to be negation phrases,
                              - and limits the scope of the negation phrases [23]
                    - A tag that was related to family history was just flagged as such and differentiated from the directly patient-related tags.
                    - We then analyzed similarities in the representation of temporally consecutive notes to remove duplicated information (e.g., notes recorded twice by mistake)
                
                - The parsed notes were further processed to reduce the sparseness of the representation (about 2 million normalized tags were extracted) and to obtain a semantic abstraction of the embedded clinical information. 
                    - To this aim we modeled the parsed notes using topic modeling <br>
                        → an unsupervised inference process that captures patterns of word co-occurrences within documents to define topics and represent a document as a multinomial over these topics.<br>
                             → Topic modeling has been applied to generalize clinical notes and improve automatic processing of patients data in several studies (e.g., see [5, 26–28]). <br>
                        → We used latent Dirichlet allocation as our implementation of topic modeling and we estimated the number of topics through perplexity analysis over one million random notes. <br>
                        We found that 300 topics obtained the best mathematical generalization; therefore, each note was eventually summarized as a multinomial of 300 topic probabilities. <br>
                        For each patient, we eventually retained one single topic-based representation averaged over all the notes available before the split-point.
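
The final averaging step (one topic-based representation per patient, averaged over the notes before the split-point) might look like the sketch below. It assumes each note has already been converted into a list of topic probabilities by some topic model; the `average_topic_vector` helper, the strict "before" comparison, and the three-topic toy vectors (instead of 300) are illustrative assumptions.

```python
from datetime import date


def average_topic_vector(notes, split_point):
    """Average the topic distributions of all notes dated before split_point.

    notes: list of (note_date, topic_probabilities) pairs.
    Returns None if the patient has no notes before the split-point.
    """
    vectors = [topics for note_date, topics in notes if note_date < split_point]
    if not vectors:
        return None
    n_topics = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(n_topics)]


# Made-up notes for one patient
notes = [
    (date(2013, 5, 1), [0.5, 0.25, 0.25]),
    (date(2013, 9, 1), [0.25, 0.5, 0.25]),
    (date(2014, 2, 1), [0.0, 0.0, 1.0]),  # after the split-point, ignored
]
print(average_topic_vector(notes, split_point=date(2014, 1, 1)))  # [0.375, 0.375, 0.25]
```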
                        
                        
# HADCL Goals

- Analyze the trade-offs between traditional approaches and recent sequence modeling approaches for modeling EHR data
- Benchmark different pretrained language models for understanding EHR data
- Explore the use of EHR data for allocation estimation in the Hong Kong population
    - Unplanned readmission
    - Mortality risk
    - Disease diagnosis
    - Length of stay

In [None]:
# mock data

mock_data = {}
mock_data['age'] = [30,29,60]
mock_data['gender'] = ['M', 'F', 'M']
mock_data['race'] = ['race_a', 'race_b', 'race_a']
mock_data['diagnoses'] = [['001', '139', '140', '239'],
                          ['005', '199', '240', '533', '213', '343'],
                          ['009', '239', '440', '335', '213']]
mock_data['medications'] = [['A', 'B', 'C', 'D'],
                            ['Z', 'A', 'D', 'W', 'A', 'B'],
                            ['A', 'F', 'N', 'G', 'C']]
mock_data['procedures'] = [['A', 'B', 'C', 'D'],
                            ['Z', 'A', 'D', 'W', 'A', 'B'],
                            ['A', 'F', 'N', 'G', 'C']]
mock_data['lab_tests'] = [['A', 'B', 'C', 'D'],
                            ['Z', 'A', 'D', 'W', 'A', 'B'],
                            ['A', 'F', 'N', 'G', 'C']]
mock_data['free_text_clinical_notes'] = [['ut aut reiciendis voluptatibu', 'erum hic tenetur a sapie', ' molestiae non recus', 'ur aut perferendi'],
                            ['ctus, s maiores alias consequats dolorib', ' et molestiae non recusandae. Itaqu', ' rnte delectus, s maiores ali', 'onsequats doloribus asperiores repellat.', 'facere possimus, omnis voluptas ass', 'officiis debitis aut rerum necessi'],
                            ['laceat, umenda est, omnis dolor rep', ' Temporibus autem quibusdam et aut ', 't, ut et voluptates repudiand', 'evenieae sinte earumas c', 'quod maxime pellendus.tatibus saepe ']]


df_demog_clinidesc = pd.DataFrame(mock_data)
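
For the coded fields (diagnoses, medications, procedures, lab tests), counting the presence of each normalized code per patient could be sketched as follows, using the same mock medication lists as above. The `count_codes` helper and the sorted-vocabulary column order are assumptions made for illustration.

```python
from collections import Counter


def count_codes(patient_codes, vocabulary):
    """Return one count per vocabulary code for a single patient."""
    counts = Counter(patient_codes)
    return [counts[code] for code in vocabulary]


# Same values as mock_data['medications']
medications = [['A', 'B', 'C', 'D'],
               ['Z', 'A', 'D', 'W', 'A', 'B'],
               ['A', 'F', 'N', 'G', 'C']]

# One column per distinct code, in sorted order
vocabulary = sorted({code for codes in medications for code in codes})
count_matrix = [count_codes(codes, vocabulary) for codes in medications]

print(vocabulary)
for row in count_matrix:
    print(row)
```

Each row is one patient and each column one code, so the result can be concatenated with other per-patient features.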

In [None]:
def openbiomedicalannotator(all_clinical_records=True):
    # Placeholder: the annotator wrapper is not implemented yet
    if all_clinical_records:
#         return harmonized_codes_for_procedures_and_lab_tests,\
#                 normalized_medications_based_on_brand_name_and_dosage,\
#                 extracted_clinical_concepts_from_free_text_notes
        pass

# Deprecated

In [None]:
## deprecated

# icd9_cm_v32 = path + 'ICD-9-CM-v32-master-descriptions/' #
# df_term = pd.read_excel(icd9_cm_v32 + 'CMS32_DESC_LONG_SHORT_SG.xlsx')
# df_term.head()

# sample_clinical_text = "This is a 31 year old male s/p seizure on ladder with resulting fall 15-20 feet on [**09-17**] now presenting to the T/SICU post surgical repair of multiple facial fractures, right mandibular fracture, and left distal radius fracture. He needs to remain intubated for 48 hours post-op. His past medical history is significant only for seizure disorder, and his only medication is depakote. He has no known allergies."

In [None]:
## deprecated

# ICD10       = PYM["ICD10"]
# SNOMEDCT_US = PYM["SNOMEDCT_US"]
# CUI         = PYM["CUI"]

In [None]:
## deprecated
## for inspecting purposes

# default_world.set_backend(filename = "pym.sqlite3")
# include more from: https://www.nlm.nih.gov/research/umls/sourcereleasedocs/index.html
# import_umls("umls-2021AB-full.zip", terminologies = ["ICD10", "SNOMEDCT_US", "CUI"])
# default_world.save()

# %%time
# count = 0
# for i in range(len(PYM_classes_list)):
#     str_class_selected = str(PYM_classes_list[i])
#     if '_' in str_class_selected and 'SNOMEDCT_US' not in str_class_selected:
#         count += 1
#         print(i, str(PYM_classes_list[i]))
#         if count == 1000:
#             break

# list(default_world.sparql("""
#            SELECT (COUNT(?x) AS ?nb)
#            { ?x a owl:Class . }
#     """))

# select_all_list = list(default_world.sparql("""
#                                select *
#                                {?s a owl:Class.}
#                         """))

# PYM.__dict__
# type(PYM.get_children_of(ICD10["K44"])[0])

# a = PYM.get_children_of(ICD10["K44"])[0]
# a.mro()

# PYM.get_children_of(ICD10["K44"])[0].is_a

# PYM.get_children_of(SNOMEDCT_US["346453007"])

# SNOMEDCT_US[186675001]

# SNOMEDCT_US[186675001] >> ICD10

In [None]:
## deprecated

# onto_path.append('data_pool/Read_V22015.owl')
# link = "https://bioportal.bioontology.org/ontologies/CSO"

# link = "https://data.bioontology.org/ontologies/CSO/submissions/1/download?apikey=6ff7e312-2d31-49ab-8163-5faa3568fa6f"
# onto = get_ontology(link)
# onto.load()

# onto = get_ontology("file:///home/bryan/hadcl/data_pool/Read_V22015.owl").load()

# len(list(onto.classes()))

# for annot_prop in onto.metadata:
#     print(annot_prop, ":", annot_prop[onto.metadata])

In [None]:
# deprecated

# list_count, tagged_sentences, negation_flags, scopes = [], [], [], []

# count = 0
# for i, row in enumerate(reports):
#     if count == 0:
#         count = count+1
#         continue
#     if count == 1:
#         tagger = negTagger(sentence = row[2], phrases = [row[1]], rules = irules, negP=False)
#         if len(tagger.getScopes()) > 0:
#             if len(tagger.getScopes()[0])  >1:
#                 print('tagger.getNegTaggedSentence()\n', tagger.getNegTaggedSentence(), '\n')
#                 print('tagger.getNegationFlag()\n', tagger.getNegationFlag(), '\n')
#                 print('tagger.getScopes()\n', tagger.getScopes(), '\n')
#                 tagged_sentences.append(tagger.getNegTaggedSentence())
#                 negation_flags.append(tagger.getNegationFlag())
#                 scopes.append(tagger.getScopes())
#                 list_count.append(i)
#         if count > 3:
#             break