## Reproducing LePendu et al

The pipeline annotates clinical text from EHR systems to __extract disease and drug mentions from the EHR__.<br>

Re-read the paper!

### More to do
- Obtaining ICD9 discharge codes<br>
In some cases, patient visits include the discharge diagnosis in the form of an ICD9 code. <br>
The ICD9 codes for rheumatoid arthritis begin with 714 and the ICD9 codes for myocardial infarction begin with 410. <br>
We __include these manually encoded terms as part of the analysis__ as a __comparison against what we can find in the text itself__.
- We are also extending the system to discern additional contextual cues such as family history versus recent diagnosis.
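
The ICD9 prefixes above can be turned into a simple cohort flag. This is a minimal sketch, assuming discharge codes arrive as strings with or without the decimal point (for example `714.0` or `41071`); `ICD9_PREFIXES` and `flag_conditions` are hypothetical names, not part of this repo.

```python
# Hypothetical helper: map ICD9 discharge codes to the two conditions of interest.
ICD9_PREFIXES = {
    "rheumatoid_arthritis": "714",
    "myocardial_infarction": "410",
}


def flag_conditions(icd9_codes):
    """Return the set of conditions whose ICD9 prefix matches any of the codes."""
    found = set()
    for code in icd9_codes:
        # Drop whitespace and the optional decimal point before comparing prefixes.
        normalized = str(code).strip().replace(".", "")
        for condition, prefix in ICD9_PREFIXES.items():
            if normalized.startswith(prefix):
                found.add(condition)
    return found


print(flag_conditions(["714.0", "4280"]))
```

The resulting flags could then be compared against the terms recognized in the note text for the same visit.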

### More details
- More terminologies: [UMLS Metathesaurus Vocabulary Documentation](https://www.nlm.nih.gov/research/umls/sourcereleasedocs/index.html), [VSAC Downloadable Resources](https://vsac.nlm.nih.gov/download/ccda?rel=20210810), [CDE](https://cde.nlm.nih.gov/home)<br>
- Some guides on querying: [How to extract/get all the terms(classes/properties) in an ontology](https://stackoverflow.com/questions/52655960/how-to-extract-get-all-the-termsclasses-properties-in-an-ontology), [querying the owl ontology](https://stackoverflow.com/questions/22542348/querying-the-owl-ontology), [SPARQL query](https://owlready2.readthedocs.io/en/v0.35/sparql.html)

### HADCL Goals

- Analyze the trade-off between traditional approaches and recent sequence modeling approaches for modeling EHR data
- Benchmark different pretrained language models for understanding EHR data
- Explore the use of EHR data allocation estimation in the Hong Kong population
    - Unplanned readmission
    - Mortality risk
    - Disease diagnosis
    - Length of stay

## 1. Initialize

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', 100) 

from terminology import obtain_lexicons
from annotate import get_patients_annotated_terms
from helper import temporally_ordered_list_join
from normalizer import get_rxnorm_normalizer

path = 'data_pool/'



In [2]:
# sample_clinical_text = "This is a 31 year old male s/p seizure on ladder with resulting fall 15-20 feet on [**09-17**] now presenting to the T/SICU post surgical repair of multiple facial fractures, open reduction of mandibular fracture, right mandibular fracture, and left distal radius fracture. He needs to remain intubated for 48 hours post-op. His past medical history is significant only for seizure disorder, and his only medication is depakote. He has no known allergies."
                        # (the "Open reduction of mandibular fracture" is fabricated)

# mimic_iii_demo = path + 'mimic-iii-clinical-database-demo-1.4/'
# df = pd.read_csv(mimic_iii_demo + 'D_ICD_DIAGNOSES.csv')
# print('df.iloc[0][\'long_title\']', df.iloc[0]['long_title'])
# df.head(3)

In [3]:
df = pd.read_csv('./negex/python/Annotations-1-120.txt', delimiter='\t')
df.head()

Unnamed: 0,Report No.,Concept,Sentence,Negation
0,1,shortness of breath,"S_O_H Counters Report Type Record Type Subgroup Classifier 1,01TdvtyYejbW DS DS 1504 E_O_H [...",Affirmed
1,1,staph bacteremia,"S_O_H Counters Report Type Record Type Subgroup Classifier 1,01TdvtyYejbW DS DS 1504 E_O_H [...",Affirmed
2,1,infection,The patient was transferred to **INSTITUTION for explantation of a pacemaker system that was f...,Affirmed
3,1,infection,The patient was sent back to **INSTITUTION the following day for further evaluation and manage...,Affirmed
4,1,infection,It was planned to follow up in the **INSTITUTION for reimplantation of the pacemaker once th...,Affirmed


In [4]:
df.shape

(2376, 4)

## 2. Obtaining terminologies

The terminologies obtained are the UMLS terminologies SNOMEDCT_US, RXNORM, and NDFRT, plus the Human Disease Ontology.<br>
The ontologies are downloaded from [UMLS](https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html) (umls-2021AB-full.zip) and [OLS Ontology Search](https://www.ebi.ac.uk/ols/ontologies/doid), and imported using [Owlready2](https://pypi.org/project/Owlready2/) ([Source](https://bitbucket.org/jibalamy/owlready2/src/master/doc/))<br>

__TODO__:
1. Need to also include __manually encoded ICD9 terms__ for each patient encounter as additional terms
2. Combine with coded and structured data when possible
3. The resulting lexicon should contain __2.8 million unique terms__; only 550k have been gathered so far

In [5]:
%%time

unique_terms = obtain_lexicons()

Importing UMLS from umls-2021AB-full.zip with Python version 3.8.8 and Owlready version 2-0.35...
Full UMLS release - importing UMLS from inner Zip file 2021AB-full/2021ab-1-meta.nlm...
  Parsing 2021AB/META/MRRANK.RRF.gz as MRRANK with encoding UTF-8
  Parsing 2021AB/META/MRCONSO.RRF.aa.gz as MRCONSO with encoding UTF-8
  Parsing 2021AB/META/MRCONSO.RRF.ab.gz as MRCONSO with encoding UTF-8
  Parsing 2021AB/META/MRDEF.RRF.gz as MRDEF with encoding UTF-8
Full UMLS release - importing UMLS from inner Zip file 2021AB-full/2021ab-2-meta.nlm...
  Parsing 2021AB/META/MRREL.RRF.aa.gz as MRREL with encoding UTF-8
  Parsing 2021AB/META/MRREL.RRF.ab.gz as MRREL with encoding UTF-8
  Parsing 2021AB/META/MRREL.RRF.ac.gz as MRREL with encoding UTF-8
  Parsing 2021AB/META/MRREL.RRF.ad.gz as MRREL with encoding UTF-8
  Parsing 2021AB/META/MRSAT.RRF.aa.gz as MRSAT with encoding UTF-8
  Parsing 2021AB/META/MRSAT.RRF.ab.gz as MRSAT with encoding UTF-8
  Parsing 2021AB/META/MRSAT.RRF.ac.gz as MRSAT with 

  return pd.np.nan
  return pd.np.nan


Obtaining Negex samples' Concept
478914 unique terms obtained
CPU times: user 9min 8s, sys: 7.01 s, total: 9min 15s
Wall time: 9min 15s


In [6]:
for unique_term in unique_terms:
    if len(unique_term) < 5:
        print(unique_term)

GB9
Pi
LR2
BL55
acre
3.75
555
Chi
1.24
Dry
MI
KI10
Bee
mol
Psi
625
Out
10.8
mm/h
U/mg
CHF
Does
Avar
s
face
1011
Vine
w
1432
Zinc
PVC
840
RGA
LU2
8G
Ammi
PUL
1.25
290
3.7
Foam
kPa
179
ns
7.5
Pus
Qwo
6.6
2
26
Herb
15
CV23
30mm
28.6
Hot
V
Hz
HEP
4.56
1A
Taro
Rum
3/24
Pack
11.6
6712
volt
g/L
1295
pint
12.6
Dike
In
A
ft
3.1
Dirk
Door
66
If
Muse
72
>97
450
BL7
D
T
50
0.46
BL17
bran
500
Fan
720
ST40
J12
4.8
tube
P1.9
Luis
22.5
Gin
Onge
pain
Hemi
LI15
Hoya
Ibu
3400
Bulb
TE5
0.42
GB2
1.05
Soil
Sium
3705
kat
GB17
CV18
0.04
mm/s
313
SP16
Inzo
GB24
0.8
Xepi
Rest
Gy/h
ST44
mmHg
8.75
41.3
F
Oat
Peak
KI11
Beye
1.14
Cats
Bali
Boa
Fog
4.4
Dull
BL29
Snow
Ruff
nCi
Ogen
Togo
J16
0.2
Gaze
Guam
Cod
KI3
P1.7
JVD
mU/L
a
LR4
O157
On
Ofev
GV14
ORB
P1.6
Alli
Beer
km
0.92
440
TE18
PE's
Phi
i
33.9
6/5
0.99
139
312
GB6
GB28
174
TE13
KI19
1.08
GV24
I
Iron
335
-1
Jay
Cafe
tin
kg-m
PC7
LYM
Fume
Fire
Cox
Bim
ST32
SI14
Rusk
SP8
/12h
770
Atom
Lamb
BL60
Max
BL25
PC2
1.88
Seed
Rice
mIU
Tie
Inca
250
ST19
boil
CAU
>
G
Per
Al

## 3. Annotate the recognized and negated terms

Annotate each sentence by string matching its text against the downloaded lexicon, then apply the NegEx trigger rules to separate negated terms from recognized ones.
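
To make the two steps concrete, here is a simplified sketch of lexicon matching followed by a NegEx-style scope check. It is not the code behind `get_patients_annotated_terms`: the trigger list, the terminator list, and the five-token scope are small illustrative assumptions rather than the real NegEx rule file, and `annotate_sentence` is a hypothetical name.

```python
import re

# Assumption: a tiny illustrative subset of triggers, not the full NegEx rules.
NEGATION_TRIGGERS = ["no", "denies", "without", "no evidence of"]
TERMINATORS = {"but", "however", "although"}  # words that end a negation scope
SCOPE_TOKENS = 5  # assumption: maximum number of tokens a trigger can negate


def negation_scopes(text):
    """Return (start, end) character spans that fall under a negation trigger."""
    scopes = []
    for trigger in NEGATION_TRIGGERS:
        for match in re.finditer(r"\b" + re.escape(trigger) + r"\b", text):
            start = end = match.end()
            tokens = list(re.finditer(r"\S+", text[start:]))[:SCOPE_TOKENS]
            for token in tokens:
                if token.group().strip(".,;:") in TERMINATORS:
                    break  # the scope stops at a terminator such as "but"
                end = start + token.end()
            scopes.append((start, end))
    return scopes


def annotate_sentence(sentence, lexicon):
    """Split lexicon terms found in the sentence into (recognized, negated) sets."""
    text = sentence.lower()
    scopes = negation_scopes(text)
    recognized, negated = set(), set()
    for term in lexicon:
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        for match in re.finditer(pattern, text):
            in_scope = any(s <= match.start() and match.end() <= e for s, e in scopes)
            (negated if in_scope else recognized).add(term)
    return recognized, negated


recognized, negated = annotate_sentence(
    "The patient denies chest pain but reports shortness of breath.",
    ["chest pain", "shortness of breath"],
)
print(recognized, negated)
```

A term that occurs both inside and outside a negation scope ends up in both sets, so a real implementation would need to decide how to resolve that case.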

In [7]:
%%time
df_result = get_patients_annotated_terms(unique_terms[:10], df)
df_result.to_csv('result.csv', index=False)
df_result = pd.read_csv('result.csv')
df_result.head(2)

CPU times: user 2.31 s, sys: 4 ms, total: 2.31 s
Wall time: 2.34 s


Unnamed: 0,negated_terms,recognized_terms,tagged_sentences
0,OrderedSet(),OrderedSet(),"S O H Counters Report Type Record Type Subgroup Classifier 1,01TdvtyYejbW DS DS 1504 E O H [Repo..."
1,OrderedSet(),OrderedSet(),The patient was transferred to **INSTITUTION for explantation of a pacemaker system that was fel...


## 4. Compile terms for timeline view of the patient's record

For each note, the annotation step outputs a set of negated terms and a set of non-negated terms. We compile these into a temporally ordered series of sets for each patient to create a timeline view of the patient’s record.

The patient id and admission date are currently __mocked__.

In [8]:
# mock id and admission date

import random
import datetime

random_id, random_date = [], []
for i in range(df_result.shape[0]):
    n = random.randint(1,30)
    month = random.randint(1,12)
    day = random.randint(1,28)
    random_date.append(datetime.date(2020, month, day))
    random_id.append(n)
    
df_result['patient_id'] = random_id
df_result['admission_date'] = random_date

In [9]:
# Sort once so each patient's notes are in chronological order before grouping.
df_sorted = df_result.sort_values(['patient_id', 'admission_date'])
df_result_grouped_pos = df_sorted.groupby('patient_id')['recognized_terms'].apply(temporally_ordered_list_join).reset_index()
df_result_grouped_neg = df_sorted.groupby('patient_id')['negated_terms'].apply(temporally_ordered_list_join).reset_index()
df_result_grouped = df_result_grouped_pos.merge(df_result_grouped_neg, on='patient_id', how='left')
df_result_grouped.head(2)

Unnamed: 0,patient_id,recognized_terms,negated_terms
0,1,"[OrderedSet(), OrderedSet(), OrderedSet(), OrderedSet(), OrderedSet(), OrderedSet(), OrderedSet(...","[OrderedSet(), OrderedSet(), OrderedSet(), OrderedSet(), OrderedSet(), OrderedSet(), OrderedSet(..."
1,2,"[OrderedSet(), OrderedSet(), OrderedSet(), OrderedSet(), OrderedSet(), OrderedSet(), OrderedSet(...","[OrderedSet(), OrderedSet(), OrderedSet(), OrderedSet(), OrderedSet(), OrderedSet(), OrderedSet(..."
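
The helper `temporally_ordered_list_join` is not shown in this notebook. The pandas-free sketch below only illustrates the intended result of the timeline step, under the assumption that each note is a `(patient_id, admission_date, recognized, negated)` tuple; `build_timelines` is a hypothetical name.

```python
import datetime
from collections import defaultdict


def build_timelines(notes):
    """Group notes by patient and keep their term sets in chronological order.

    notes: iterable of (patient_id, admission_date, recognized, negated) tuples.
    """
    ordered = sorted(notes, key=lambda note: (note[0], note[1]))
    timelines = defaultdict(lambda: {"recognized": [], "negated": []})
    for patient_id, _date, recognized, negated in ordered:
        timelines[patient_id]["recognized"].append(set(recognized))
        timelines[patient_id]["negated"].append(set(negated))
    return dict(timelines)


notes = [
    (1, datetime.date(2020, 5, 2), {"infection"}, set()),
    (1, datetime.date(2020, 1, 9), {"shortness of breath"}, {"chest pain"}),
    (2, datetime.date(2020, 3, 1), set(), {"infection"}),
]
print(build_timelines(notes)[1]["recognized"])
```

Each patient then maps to two parallel lists, one entry per note, which matches the shape of the grouped frame above.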


## 5. Normalize and aggregate terms

### TODO:
- Check the terminologies to obtain the correct lexicon: the short terms (`len(unique_term) < 5` and `len(unique_term) < 2`) still look weird
- Find a better normalizer dict
    - Google more
    - [Subset from what's done] Derive the normalizer from RXNorm UMLS
    - [No good join] Merge from PYM['RXNORM'] to the RXNATOMARCHIVE and RXNCONSO
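
One possible way to approach the first TODO item is to filter obviously noisy entries before annotation. The sketch below is an assumption, not a tested cleaning rule for this lexicon: the stop list is hand-picked, and `is_noisy` and `clean_lexicon` are hypothetical names. A pure length cut-off would also drop valid clinical abbreviations such as `MI` and `CHF`, so the filter targets numeric-like strings and common English words instead.

```python
import re

# Assumption: a small hand-picked stop list; a real one would be much larger.
STOP_WORDS = {"a", "i", "if", "in", "on", "out", "per", "does"}
# Digits and punctuation only, e.g. "3.75", ">97", "6/5".
NUMERIC_LIKE = re.compile(r"^[\W\d]+$")


def is_noisy(term, min_length=2):
    """Flag terms that are too short, purely numeric, or common English words."""
    stripped = term.strip()
    return (
        len(stripped) < min_length
        or NUMERIC_LIKE.match(stripped) is not None
        or stripped.lower() in STOP_WORDS
    )


def clean_lexicon(terms):
    """Return the subset of terms that pass the noise filter."""
    return {term for term in terms if not is_noisy(term)}


print(clean_lexicon({"MI", "CHF", "3.75", ">97", "In", "s", "pain", "6/5"}))
```

Unit strings such as `mmHg` and ordinary words such as `Door` still pass this filter, so restricting by semantic type in the source terminology is probably still needed.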

In [10]:
normalizer_dict = get_rxnorm_normalizer()

Produced 14869 keys as the RXNorm normalizer dict.


In [11]:
normalizer_dict

{'a acalifornia hn hnlike atexas xa inactivated b bmassachusettslike': 'formula flulaval',
 'a acalifornia hnlike': 'acaliforniahnvlike',
 'a acalifornia hnlike avictoria b bbrisbane antigen bmassachusettslike': 'quadravalent',
 'a acalifornia hnlike avictoria b bbrisbanelike bmassachusettslike': 'quadravalent',
 'a acalifornia hnlike avictoria b bmassachusettslike': 'trivalent recombinant',
 'a acalifornia hnlike avictoria b bwisconsinlike': 'intradermal fluzone formula',
 'a acalifornia hnlike unt avictoria b bwisconsinlike': 'formula flumist',
 'a atexas xa hn inactivated acaliforniahnvlike b bmassachusettslike': 'formula flulaval',
 'a avictoria hnlike acaliforniahnvlike b bbrisbane antigen bmassachusettslike': 'quadravalent',
 'a avictoria hnlike acaliforniahnvlike b bmassachusettslike': 'trivalent recombinant',
 'a avictoria hnlike acaliforniahnvlike b bwisconsinlike': 'intradermal fluzone formula',
 'a avictoria hnlike unt acaliforniahnvlike b bwisconsinlike': 'formula flumist',

## 6. Finishing Touches

We reason over the structure of the ontologies to normalize and to aggregate terms for further analysis<br>
From the set of ontologies we use, the Annotator identifies all notes containing any string denoting a given term as either its primary label or synonym.<br>
We use __all other ontologies to normalize strings denoting rheumatoid arthritis or myocardial infarction__ and the Annotator identifies all notes containing them.<br>
As an option, we can also enable reasoning to infer all subsumed terms, which increases the number of notes that we can identify beyond pure string matches.<br>
For example, patients with Caplan’s or Felty’s syndrome may also fit the cohort of patients with rheumatoid arthritis.<br>
Therefore, notes that mention these diseases can automatically be included as well even though their associated strings look nothing alike.<br>
We did not use such reasoning for results reported in this specific study.
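
The optional subsumption reasoning described above can be sketched as a walk over is-a edges. The hierarchy below is a toy assumption written by hand for illustration; real edges would come from the loaded ontologies, and `subsumed_terms` is a hypothetical name.

```python
# Assumption: a toy is-a hierarchy (child -> parents), not taken from any ontology file.
IS_A = {
    "Caplan's syndrome": ["rheumatoid arthritis"],
    "Felty's syndrome": ["rheumatoid arthritis"],
    "rheumatoid arthritis": ["arthritis"],
}


def subsumed_terms(term, is_a=IS_A):
    """Return the term plus every descendant reachable through is-a edges."""
    # Invert child -> parents into parent -> children for a top-down walk.
    children = {}
    for child, parents in is_a.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)
    result, stack = {term}, [term]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in result:
                result.add(child)
                stack.append(child)
    return result


print(sorted(subsumed_terms("rheumatoid arthritis")))
```

Passing the expanded set to the annotator instead of the single term would pull in notes that mention only a subsumed disease.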

__TO DO__:
- Using the terms as features, we can define patterns of interest (such as patients with rheumatoid arthritis who take rofecoxib and then get myocardial infarction), which we can use in data mining applications.

# Deprecated

### Additional info

Steps:
1. Gather dataset
2. Preprocess using Open Biomedical Annotator
    - Normalize records:
        - by data type:
            - __diagnoses, medications, procedures, lab tests__: count the presence of each normalized code in the patient's EHRs<br>
                → aiming to facilitate the modelling of related clinical events
            - __free-text clinical notes__: annotate following LePendu et al.:
                - This allows identifying the negated tags and those related to family history
                    - A tag that appeared as negated in the note was considered not relevant and discarded<br>
                      → Negated tags were identified using NegEx, a regular expression algorithm that implements several phrases indicating negation, filters out sentences containing phrases that falsely appear to be negation phrases, and limits the scope of the negation phrases
                    - A tag that was related to family history was just flagged as such and differentiated from the directly patient-related tags.
                    - We then analyzed similarities in the representation of temporally consecutive notes to remove duplicated information (e.g., notes recorded twice by mistake)
                
                - The parsed notes were further processed to reduce the sparseness of the representation (about 2 million normalized tags were extracted) and to obtain a semantic abstraction of the embedded clinical information. 
                    - To this aim we modeled the parsed notes using topic modeling <br>
                        → an unsupervised inference process that captures patterns of word co-occurrences within documents to define topics and represent a document as a multinomial over these topics.<br>
                             → Topic modeling has been applied to generalize clinical notes and improve automatic processing of patients' data in several studies (e.g., see refs. 5, 26–28). <br>
                        → We used latent Dirichlet allocation as our implementation of topic modeling and we estimated the number of topics through perplexity analysis over one million random notes. <br>
                        We found that 300 topics obtained the best mathematical generalization; therefore, each note was eventually summarized as a multinomial of 300 topic probabilities. <br>
                        For each patient, we eventually retained one single topic-based representation averaged over all the notes available before the split-point.
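
The last step above, one averaged topic representation per patient, can be sketched without any topic modeling library. The sketch assumes the per-note topic distributions have already been inferred (by LDA in the description above) and uses 3 topics instead of 300 to stay readable; `patient_topic_vector` is a hypothetical name.

```python
import datetime


def patient_topic_vector(notes, split_point):
    """Average the topic distributions of all notes dated before split_point.

    notes: list of (note_date, topic_probabilities) pairs for one patient.
    Returns None when the patient has no note before the split-point.
    """
    vectors = [topics for note_date, topics in notes if note_date < split_point]
    if not vectors:
        return None
    n_topics = len(vectors[0])
    return [sum(vec[k] for vec in vectors) / len(vectors) for k in range(n_topics)]


patient_notes = [
    (datetime.date(2020, 1, 1), [0.5, 0.25, 0.25]),
    (datetime.date(2020, 2, 1), [0.25, 0.25, 0.5]),
    (datetime.date(2020, 6, 1), [1.0, 0.0, 0.0]),  # after the split-point, ignored
]
print(patient_topic_vector(patient_notes, datetime.date(2020, 3, 1)))
```

Because each note vector sums to one, the averaged vector does as well, so it can still be read as a distribution over topics.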