## Reproducing LePendu et al.

The pipeline annotates clinical text from EHR systems to __extract disease and drug mentions from the EHR.__<br>

TODO: re-read the paper.

### More to do
- Obtaining ICD9 discharge codes<br>
Some patient visits include the discharge diagnosis in the form of an ICD9 code. <br>
The ICD9 codes for rheumatoid arthritis begin with 714, and the ICD9 codes for myocardial infarction begin with 410. <br>
We __include these manually encoded terms as part of the analysis__ as a __comparison against what we can find in the text itself__.
- We are also extending the system to discern additional contextual cues, such as family history versus recent diagnosis.
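
The prefix rule for discharge codes can be sketched as below. This is a minimal illustration only: the prefixes (714 and 410) come from the notes above, while the function name `conditions_from_icd9`, the dictionary layout, and the sample codes are assumptions made for the example.

```python
# Prefixes are taken from the notes above; everything else is a hypothetical sketch.
ICD9_PREFIXES = {
    "rheumatoid arthritis": "714",
    "myocardial infarction": "410",
}

def conditions_from_icd9(codes):
    """Return the conditions whose ICD9 prefix matches any of the discharge codes."""
    found = set()
    for code in codes:
        # Accept codes written with or without the decimal point, e.g. "714.0" or "7140".
        normalized = str(code).replace(".", "").strip()
        for condition, prefix in ICD9_PREFIXES.items():
            if normalized.startswith(prefix):
                found.add(condition)
    return found

# Made-up codes for one visit, purely for illustration.
visit_codes = ["714.0", "41071", "4019"]
print(sorted(conditions_from_icd9(visit_codes)))
```

The resulting set could then be compared with the terms recognized in the free text of the same visit.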

### More details
- More terminologies: [UMLS Metathesaurus Vocabulary Documentation](https://www.nlm.nih.gov/research/umls/sourcereleasedocs/index.html), [VSAC Downloadable Resources](https://vsac.nlm.nih.gov/download/ccda?rel=20210810), [CDE](https://cde.nlm.nih.gov/home)<br>
- Some guides on querying: [How to extract/get all the terms(classes/properties) in an ontology](https://stackoverflow.com/questions/52655960/how-to-extract-get-all-the-termsclasses-properties-in-an-ontology), [querying the owl ontology](https://stackoverflow.com/questions/22542348/querying-the-owl-ontology), [SPARQL query](https://owlready2.readthedocs.io/en/v0.35/sparql.html)

### HADCL Goals

- Analyze the trade-off between traditional approaches and recent sequence modeling approaches for modeling EHR data
- Benchmark different pretrained language models for understanding EHR data
- Explore the use of EHR data for allocation estimation in the Hong Kong population
    - Unplanned readmission
    - Mortality risk
    - Disease diagnosis
    - Length of stay

## 1. Initialize

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', 100) 

from terminology import obtain_terminologies
from annotate import get_patients_annotated_terms
from helper import temporally_ordered_list_join

path = 'data_pool/'



In [2]:
# sample_clinical_text = "This is a 31 year old male s/p seizure on ladder with resulting fall 15-20 feet on [**09-17**] now presenting to the T/SICU post surgical repair of multiple facial fractures, open reduction of mandibular fracture, right mandibular fracture, and left distal radius fracture. He needs to remain intubated for 48 hours post-op. His past medical history is significant only for seizure disorder, and his only medication is depakote. He has no known allergies."
                        # (the "Open reduction of mandibular fracture" is fabricated)

# mimic_iii_demo = path + 'mimic-iii-clinical-database-demo-1.4/'
# df = pd.read_csv(mimic_iii_demo + 'D_ICD_DIAGNOSES.csv')
# print('df.iloc[0][\'long_title\']', df.iloc[0]['long_title'])
# df.head(3)

In [3]:
df = pd.read_csv('./negex/python/Annotations-1-120.txt', delimiter='\t')
df.head()

| | Report No. | Concept | Sentence | Negation |
|---|---|---|---|---|
| 0 | 1 | shortness of breath | S_O_H Counters Report Type Record Type Subgroup Classifier 1,01TdvtyYejbW DS DS 1504 E_O_H [... | Affirmed |
| 1 | 1 | staph bacteremia | S_O_H Counters Report Type Record Type Subgroup Classifier 1,01TdvtyYejbW DS DS 1504 E_O_H [... | Affirmed |
| 2 | 1 | infection | The patient was transferred to **INSTITUTION for explantation of a pacemaker system that was f... | Affirmed |
| 3 | 1 | infection | The patient was sent back to **INSTITUTION the following day for further evaluation and manage... | Affirmed |
| 4 | 1 | infection | It was planned to follow up in the **INSTITUTION for reimplantation of the pacemaker once th... | Affirmed |


In [4]:
df.shape

(2376, 4)

## 2. Obtaining terminologies

The terminologies obtained are the UMLS terminologies SNOMEDCT_US, RXNORM, and NDFRT, plus the Human Disease Ontology.<br>
They are downloaded from [UMLS](https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html) (umls-2021AB-full.zip) and [OLS Ontology Search](https://www.ebi.ac.uk/ols/ontologies/doid), and loaded using [Owlready2](https://pypi.org/project/Owlready2/) ([Source](https://bitbucket.org/jibalamy/owlready2/src/master/doc/)).<br>

__TODO__:
1. Also include the __manually encoded ICD9 terms__ for each patient encounter as additional terms
2. Combine with coded and structured data when possible
3. The resulting lexicon should contain __2.8 million unique terms__; only 550k have been gathered so far

In [5]:
%%time

unique_terms = obtain_terminologies()

Importing UMLS from umls-2021AB-full.zip with Python version 3.8.8 and Owlready version 2-0.35...
Full UMLS release - importing UMLS from inner Zip file 2021AB-full/2021ab-1-meta.nlm...
  Parsing 2021AB/META/MRRANK.RRF.gz as MRRANK with encoding UTF-8
  Parsing 2021AB/META/MRCONSO.RRF.aa.gz as MRCONSO with encoding UTF-8
  Parsing 2021AB/META/MRCONSO.RRF.ab.gz as MRCONSO with encoding UTF-8
  Parsing 2021AB/META/MRDEF.RRF.gz as MRDEF with encoding UTF-8
Full UMLS release - importing UMLS from inner Zip file 2021AB-full/2021ab-2-meta.nlm...
  Parsing 2021AB/META/MRREL.RRF.aa.gz as MRREL with encoding UTF-8
  Parsing 2021AB/META/MRREL.RRF.ab.gz as MRREL with encoding UTF-8
  Parsing 2021AB/META/MRREL.RRF.ac.gz as MRREL with encoding UTF-8
  Parsing 2021AB/META/MRREL.RRF.ad.gz as MRREL with encoding UTF-8
  Parsing 2021AB/META/MRSAT.RRF.aa.gz as MRSAT with encoding UTF-8
  Parsing 2021AB/META/MRSAT.RRF.ab.gz as MRSAT with encoding UTF-8
  Parsing 2021AB/META/MRSAT.RRF.ac.gz as MRSAT with 



NameError: name 'choose_label' is not defined

## 3. Annotate the recognized and negated terms

Annotate each note by string matching its text against the downloaded lexicon, then apply the NegEx trigger rules to separate the negated terms from the recognized ones.
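
As a rough illustration of this step, the sketch below matches lexicon terms in one sentence and treats a term as negated when a trigger word precedes it. It is a heavily simplified stand-in and not the NegEx implementation used by `get_patients_annotated_terms`: the tiny lexicon, the trigger list, and the function name `annotate_sentence` are all assumptions, and real NegEx additionally limits the scope of each trigger and filters pseudo-negation phrases.

```python
import re

# Tiny stand-ins for the real lexicon and the NegEx trigger rules (both assumptions).
LEXICON = ["seizure disorder", "mandibular fracture", "allergies"]
NEGATION_TRIGGERS = ["no", "not", "denies", "without"]

def annotate_sentence(sentence, lexicon=LEXICON, triggers=NEGATION_TRIGGERS):
    """Split the lexicon terms found in one sentence into (recognized, negated) sets."""
    text = sentence.lower()
    recognized, negated = set(), set()
    for term in lexicon:
        match = re.search(r"\b" + re.escape(term.lower()) + r"\b", text)
        if match is None:
            continue
        # Simplification: any trigger word earlier in the sentence negates the term.
        preceding = text[:match.start()]
        if any(re.search(r"\b" + re.escape(t) + r"\b", preceding) for t in triggers):
            negated.add(term)
        else:
            recognized.add(term)
    return recognized, negated

print(annotate_sentence("He has no known allergies."))
print(annotate_sentence("His past medical history is significant only for seizure disorder."))
```

Matching on word boundaries avoids hits inside longer words, which plain substring matching would produce.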

In [6]:
%%time
df_result = get_patients_annotated_terms(unique_terms[:10], df)
df_result.to_csv('result.csv', index=False)
df_result = pd.read_csv('result.csv')
df_result.head(2)

NameError: name 'unique_terms' is not defined

## 4. Compile terms for timeline view of the patient's record

From each note we output a set of negated terms and a set of non-negated terms, then compile them into a temporally ordered series of sets for each patient to create a timeline view of the patient’s record.

The patient ID and admission date are currently __mocked__.

In [7]:
# mock id and admission date

import random
import datetime

random_id, random_date = [], []
for _ in range(df_result.shape[0]):
    n = random.randint(1, 30)       # one of 30 mock patients
    month = random.randint(1, 12)
    day = random.randint(1, 28)     # capped at 28 so the date is valid in every month
    random_date.append(datetime.date(2020, month, day))
    random_id.append(n)
    
df_result['patient_id'] = random_id
df_result['admission_date'] = random_date

NameError: name 'df_result' is not defined

In [None]:
# sort once so that each patient's notes are in admission order before grouping
df_sorted = df_result.sort_values(['patient_id', 'admission_date'])
df_result_grouped_pos = df_sorted.groupby(['patient_id'])['recognized_terms'].apply(temporally_ordered_list_join).reset_index()
df_result_grouped_neg = df_sorted.groupby(['patient_id'])['negated_terms'].apply(temporally_ordered_list_join).reset_index()
df_result_grouped = df_result_grouped_pos.merge(df_result_grouped_neg, on='patient_id', how='left')
df_result_grouped.head()
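
The body of `temporally_ordered_list_join` lives in `helper` and is not shown in this notebook. The sketch below is only a guess at what such a helper might do: it takes one patient's per-note term strings, already sorted by admission date, and returns an ordered list of term sets. The function name `join_term_sets_in_order` and the `;` separator are assumptions.

```python
def join_term_sets_in_order(term_strings, sep=";"):
    """Turn per-note term strings (already sorted by date) into an ordered list of sets."""
    timeline = []
    for raw in term_strings:
        # None or NaN (NaN is the only value not equal to itself) means no terms in that note.
        if raw is None or raw != raw:
            timeline.append(set())
            continue
        timeline.append({t.strip() for t in str(raw).split(sep) if t.strip()})
    return timeline

# Three consecutive notes for one mock patient.
print(join_term_sets_in_order(["infection; fever", "", "fever"]))
```

Because it accepts any iterable, a function of this shape could be passed to `groupby(...).apply(...)` in the same way as the real helper.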

## 5. Normalize and aggregate terms

Use the __RxNORM terminology__ to normalize the drug having the trade name Vioxx into its primary active ingredient, rofecoxib.

We reason over the structure of the ontologies to normalize and to aggregate terms for further analysis<br>
From the set of ontologies we use, the Annotator identifies all notes containing any string denoting this term as either its primary label or synonym.<br>
We use __all other ontologies to normalize strings denoting rheumatoid arthritis or myocardial infarction__ and the Annotator identifies all notes containing them.<br>
As an option, we can also enable reasoning to infer all subsumed terms, which increases the number of notes that we can identify beyond pure string matches.<br>
For example, patients with Caplan’s or Felty’s syndrome may also fit the cohort of patients with rheumatoid arthritis.<br>
Therefore, notes that mention these diseases can automatically be included as well even though their associated strings look nothing alike.<br>
We did not use such reasoning for results reported in this specific study.
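
The optional subsumption reasoning can be pictured with a toy is-a hierarchy. This is a sketch under stated assumptions: the `IS_A` dictionary is hand-written for illustration and is not extracted from SNOMEDCT_US or the Human Disease Ontology, and `subsumed_terms` is a hypothetical name.

```python
# Toy child -> parents hierarchy, hand-written for illustration only.
IS_A = {
    "Caplan's syndrome": ["rheumatoid arthritis"],
    "Felty's syndrome": ["rheumatoid arthritis"],
    "rheumatoid arthritis": ["arthritis"],
    "myocardial infarction": ["heart disease"],
}

def subsumed_terms(term, is_a=IS_A):
    """Return the term plus every descendant reachable through is-a links."""
    # Invert the child -> parents map so we can walk downwards from the query term.
    children = {}
    for child, parents in is_a.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)
    result, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current in result:
            continue  # already visited; also guards against cycles
        result.add(current)
        stack.extend(children.get(current, []))
    return result

print(sorted(subsumed_terms("rheumatoid arthritis")))
```

Notes would then be matched against every string in the returned set rather than against the query term alone.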

## 6. Finishing Touches

__TO DO__:
- Using the terms as features, we can define patterns of interest (such as patients with rheumatoid arthritis, who take rofecoxib, and then get myocardial infarction), which we can use in data mining applications.

# Deprecated

### Additional info

Steps:
1. Gather dataset
2. Preprocess using Open Biomedical Annotator
    - Normalize records:
        - depending on the data type:
            - __diagnoses, medications, procedures, lab tests__: count presence of each normalized code in patient EHRs<br>
                → aiming to facilitate the modelling of related clinical events
            - __free text clinical notes__: LePendu et al:
                - Allowed identifying the negated tags and those related to family history
                    - A tag that appeared as negated in the note was considered not relevant and discarded <br>
                      → Negated tags were identified using NegEx:
                              a regular expression algorithm that implements several phrases indicating negation:
                              - filters out sentences containing phrases that falsely appear to be negation phrases,
                              - and limits the scope of the negation phrases
                    - A tag that was related to family history was just flagged as such and differentiated from the directly patient-related tags.
                    - We then analyzed similarities in the representation of temporally consecutive notes to remove duplicated information (e.g., notes recorded twice by mistake)
                
                - The parsed notes were further processed to reduce the sparseness of the representation (about 2 million normalized tags were extracted) and to obtain a semantic abstraction of the embedded clinical information. 
                    - To this aim we modeled the parsed notes using topic modeling <br>
                        → an unsupervised inference process that captures patterns of word co-occurrences within documents to define topics and represent a document as a multinomial over these topics.<br>
                             → Topic modeling has been applied to generalize clinical notes and improve automatic processing of patient data in several studies (e.g., see refs. 5, 26–28). <br>
                        → We used latent Dirichlet allocation as our implementation of topic modeling and we estimated the number of topics through perplexity analysis over one million random notes. <br>
                        We found that 300 topics obtained the best mathematical generalization; therefore, each note was eventually summarized as a multinomial of 300 topic probabilities. <br>
                        For each patient, we eventually retained one single topic-based representation averaged over all the notes available before the split-point.
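
One of the steps above, removing duplicated information from temporally consecutive notes, can be sketched as follows. The notes above do not say which similarity measure was used, so Jaccard similarity over tag sets and the 0.9 threshold are assumptions, as are the function names.

```python
def jaccard(a, b):
    """Jaccard similarity of two tag sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def drop_consecutive_duplicates(notes, threshold=0.9):
    """Keep a note unless it nearly repeats the previously kept one.

    notes: list of tag sets, ordered by time.
    """
    kept = []
    for tags in notes:
        if kept and jaccard(kept[-1], tags) >= threshold:
            continue  # treated as a duplicate of the previous note
        kept.append(tags)
    return kept

# The second and fourth notes repeat their predecessors and are dropped.
notes = [{"infection", "fever"}, {"infection", "fever"}, {"fever"}, {"fever"}]
print(drop_consecutive_duplicates(notes))
```

Only neighbouring notes are compared, so the same tags recorded again much later in the timeline are kept.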

In [None]:
# sanity checks

# for i, term in enumerate(split_semicolons_term):
#     if 'mandibular fracture' in term:
#         print(i, term)

# for term in split_semicolons_term:
#     if len(term) <3:
#         print(term)

# terms = []
# for i in range(len(PYM_classes_list)):
#     str_class_selected = str(PYM_classes_list[i])
#     if '#' in str_class_selected:
#         strings_selected = str_class_selected[str_class_selected.index('#')+2:-1].replace(' ; ',';').split('; ')
#         if 'H' in strings_selected:
#             print(str_class_selected, strings_selected)
#         terms.extend(strings_selected)

# recognized_terms = []
# for term in split_semicolons_term:
#     if ' ' + term.lower() + ' ' in sample_clinical_text:
#         recognized_terms.append(term)

In [None]:
## deprecated

# icd9_cm_v32 = path + 'ICD-9-CM-v32-master-descriptions/' #
# df_term = pd.read_excel(icd9_cm_v32 + 'CMS32_DESC_LONG_SHORT_SG.xlsx')
# df_term.head()

# sample_clinical_text = "This is a 31 year old male s/p seizure on ladder with resulting fall 15-20 feet on [**09-17**] now presenting to the T/SICU post surgical repair of multiple facial fractures, right mandibular fracture, and left distal radius fracture. He needs to remain intubated for 48 hours post-op. His past medical history is significant only for seizure disorder, and his only medication is depakote. He has no known allergies."

In [None]:
## deprecated

# ICD10       = PYM["ICD10"]
# SNOMEDCT_US = PYM["SNOMEDCT_US"]
# CUI         = PYM["CUI"]

In [None]:
## deprecated
## for inspecting purposes

# default_world.set_backend(filename = "pym.sqlite3")
# include more from: https://www.nlm.nih.gov/research/umls/sourcereleasedocs/index.html
# import_umls("umls-2021AB-full.zip", terminologies = ["ICD10", "SNOMEDCT_US", "CUI"])
# default_world.save()

# %%time
# count = 0
# for i in range(len(PYM_classes_list)):
#     str_class_selected = str(PYM_classes_list[i])
#     if '_' in str_class_selected and 'SNOMEDCT_US' not in str_class_selected:
#         count += 1
#         print(i, str(PYM_classes_list[i]))
#         if count == 1000:
#             break

# list(default_world.sparql("""
#            SELECT (COUNT(?x) AS ?nb)
#            { ?x a owl:Class . }
#     """))

# select_all_list = list(default_world.sparql("""
#                                select *
#                                {?s a owl:Class.}
#                         """))

# PYM.__dict__
# type(PYM.get_children_of(ICD10["K44"])[0])

# a = PYM.get_children_of(ICD10["K44"])[0]
# a.mro()

# PYM.get_children_of(ICD10["K44"])[0].is_a

# PYM.get_children_of(SNOMEDCT_US["346453007"])

# SNOMEDCT_US[186675001]

# SNOMEDCT_US[186675001] >> ICD10

In [None]:
## deprecated

# onto_path.append('data_pool/Read_V22015.owl')
# link = "https://bioportal.bioontology.org/ontologies/CSO"

# link = "https://data.bioontology.org/ontologies/CSO/submissions/1/download?apikey=6ff7e312-2d31-49ab-8163-5faa3568fa6f"
# onto = get_ontology(link)
# onto.load()

# onto = get_ontology("file:///home/bryan/hadcl/data_pool/Read_V22015.owl").load()

# len(list(onto.classes()))

# for annot_prop in onto.metadata:
#     print(annot_prop, ":", annot_prop[onto.metadata])

In [None]:
# deprecated

# list_count, tagged_sentences, negation_flags, scopes = [], [], [], []

# count = 0
# for i, row in enumerate(reports):
#     if count == 0:
#         count = count+1
#         continue
#     if count == 1:
#         tagger = negTagger(sentence = row[2], phrases = [row[1]], rules = irules, negP=False)
#         if len(tagger.getScopes()) > 0:
#             if len(tagger.getScopes()[0])  >1:
#                 print('tagger.getNegTaggedSentence()\n', tagger.getNegTaggedSentence(), '\n')
#                 print('tagger.getNegationFlag()\n', tagger.getNegationFlag(), '\n')
#                 print('tagger.getScopes()\n', tagger.getScopes(), '\n')
#                 tagged_sentences.append(tagger.getNegTaggedSentence())
#                 negation_flags.append(tagger.getNegationFlag())
#                 scopes.append(tagger.getScopes())
#                 list_count.append(i)
#         if count > 3:
#             break

In [None]:
# # mock data

# mock_data = {}
# mock_data['age'] = [30,29,60]
# mock_data['gender'] = ['M', 'F', 'M']
# mock_data['race'] = ['race_a', 'race_b', 'race_a']
# mock_data['diagnoses'] = [['001', '139', '140', '239'],
#                           ['005', '199', '240', '533', '213', '343'],
#                           ['009', '239', '440', '335', '213']]
# mock_data['medications'] = [['A', 'B', 'C', 'D'],
#                             ['Z', 'A', 'D', 'W', 'A', 'B'],
#                             ['A', 'F', 'N', 'G', 'C']]
# mock_data['procedures'] = [['A', 'B', 'C', 'D'],
#                             ['Z', 'A', 'D', 'W', 'A', 'B'],
#                             ['A', 'F', 'N', 'G', 'C']]
# mock_data['lab_tests'] = [['A', 'B', 'C', 'D'],
#                             ['Z', 'A', 'D', 'W', 'A', 'B'],
#                             ['A', 'F', 'N', 'G', 'C']]
# mock_data['free_text_clinical_notes'] = [['ut aut reiciendis voluptatibu', 'erum hic tenetur a sapie', ' molestiae non recus', 'ur aut perferendi'],
#                             ['ctus, s maiores alias consequats dolorib', ' et molestiae non recusandae. Itaqu', ' rnte delectus, s maiores ali', 'onsequats doloribus asperiores repellat.', 'facere possimus, omnis voluptas ass', 'officiis debitis aut rerum necessi'],
#                             ['laceat, umenda est, omnis dolor rep', ' Temporibus autem quibusdam et aut ', 't, ut et voluptates repudiand', 'evenieae sinte earumas c', 'quod maxime pellendus.tatibus saepe ']]


# df_demog_clinidesc = pd.DataFrame(mock_data)

# def openbiomedicalannotator(all_clinical_records=True):
#     if all_clinical_records:
# #         return harmonized_codes_for_procedures_and_lab_tests,\
# #                 normalized_medications_based_on_brand_name_and_dosage,\
# #                 extracted_clinical_concepts_from_free_text_notes
#         pass