# Extract the HbA1c values from MIMIC-III dataset documents

In this notebook I develop a Python script for extracting HbA1c values from text documents. The main pattern I am looking for is a mention of A1c followed by a value.

I will be using PyConText for this task. While experimenting with PyConText on the MIMIC-III dataset, I found that using modifiers with numbers sometimes returns very odd results, so I am going to use a single regular expression to capture both the mention of HbA1c and its value. I will then strip everything that is not part of the number to obtain the actual value.
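
As a rough illustration of the single-expression idea, the sketch below uses only Python's `re` module instead of PyConText. The names `A1C_PATTERN` and `find_a1c_values` are hypothetical, and the pattern is an assumption made for this example, not the rule stored in the YAML target file.

```python
import re

# Hypothetical pattern (an assumption, not the rule in A1c_targets.yml):
# an "A1c" mention, an optional colon or whitespace, then a number with an
# optional decimal part captured in group 1.
A1C_PATTERN = re.compile(r"A1c\s*:?\s*(\d{1,2}(?:\.\d+)?)", re.IGNORECASE)

def find_a1c_values(text):
    """Return (matched phrase, numeric value) pairs for each A1c mention."""
    return [(m.group(0), float(m.group(1))) for m in A1C_PATTERN.finditer(text)]

print(find_a1c_values("DIABETES MONITORING %HbA1c 6; new HbA1c 13.6"))
```

Capturing the number in its own group avoids a second clean-up pass, although the two-step approach described above works just as well when the match comes back from PyConText as a single phrase.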


The first step is to import PyConText and define the functions (taken from the PyConText GitHub page and modified with help from Jeff Ferraro) so that I can run the actual text parsing.

In [1]:
import pyConTextNLP.pyConText as pyConText
# itemData has been rewritten to accept a relative local path, so it can be pointed at customized YAML files later
import itemData
import re
import glob
import pandas as pd

In [2]:
my_targets=itemData.get_items('Yaml_Files/A1c_targets.yml')
my_modifiers=itemData.get_items('Yaml_Files/A1c_modifiers.yml')

Copy over the functions *markup_sentence* and *markup_doc* (your revised version) from the previous notebook below.

In [3]:
# Read one example document; the with-block closes the file automatically
with open(r"Text_Files/36662.txt", "r") as example_data:
    text = example_data.read()
# print(text)

In [4]:
text_remove_1 = re.sub(r"\[\*\*.*?\*\*\]", "", text)

In [5]:
text_remove_2 = re.sub(r"\d{1,2}:\d{2}\s?P?A?\.?M\.?", "", text_remove_1)

In [6]:
text_remove_3 = re.sub(r"\s{2,}", r" ", text_remove_2)

In [7]:
#print(text_remove_3)
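
The three substitutions above can be wrapped in one helper so the same cleaning is applied to every document. This is a minimal sketch: the function name `clean_note` and the sample string are made up for illustration, while the patterns are the ones used in the cells above.

```python
import re

def clean_note(text):
    """Apply the three clean-up substitutions in order and return the result."""
    # Remove de-identification placeholders such as [**2796-5-29**]
    text = re.sub(r"\[\*\*.*?\*\*\]", "", text)
    # Remove clock times such as "12:07 AM"
    text = re.sub(r"\d{1,2}:\d{2}\s?P?A?\.?M\.?", "", text)
    # Collapse runs of whitespace (including newlines) into one space
    return re.sub(r"\s{2,}", " ", text)

sample = "Vancomycin - [**2796-5-29**] 12:07 AM\n   HgA1c:6.2"
print(clean_note(sample))
```

Note that collapsing whitespace also removes line breaks, so span offsets reported later refer to the cleaned text, not the original file.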

In [9]:
# def markup_sentence(s, modifiers, targets, prune_inactive=True):
#     """
#     """
#     markup = pyConText.ConTextMarkup()
#     markup.setRawText(s)
#     markup.cleanText()
#     markup.markItems(my_modifiers, mode="modifier")
#     markup.markItems(my_targets, mode="target")
#     markup.pruneMarks()
#     markup.dropMarks('Exclusion')
#     # apply modifiers to any targets within the modifiers scope
#     markup.applyModifiers()
#     markup.pruneSelfModifyingRelationships()
#     if prune_inactive:
#         markup.dropInactiveModifiers()
#     return markup

# def markup_doc(doc_text:str)->pyConText.ConTextDocument:
#     rslts=[]
#     context = pyConText.ConTextDocument()
#     for s in doc_text.split('.'):
#         m = markup_sentence(s, modifiers=my_modifiers, targets=my_targets)
#         rslts.append(m)

#     for r in rslts:
#         context.addMarkup(r)
#     return context

# def get_output(something):
#     context=markup_doc(something)
#     output = context.getDocumentGraph()
#     return output

In [10]:
## This one is the same, it just doesn't split it into sentences. 
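def markup_sentence(s, modifiers, targets, prune_inactive=True):
    """
    Mark up one piece of text with the given modifiers and targets.
    """
    markup = pyConText.ConTextMarkup()
    markup.setRawText(s)
    markup.cleanText()
    # Use the arguments rather than the global my_modifiers / my_targets
    markup.markItems(modifiers, mode="modifier")
    markup.markItems(targets, mode="target")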
    markup.pruneMarks()
    markup.dropMarks('Exclusion')
    # apply modifiers to any targets within the modifiers scope
    markup.applyModifiers()
    markup.pruneSelfModifyingRelationships()
    if prune_inactive:
        markup.dropInactiveModifiers()
    return markup

def markup_doc(doc_text:str)->pyConText.ConTextDocument:
    rslts=[]
    context = pyConText.ConTextDocument()
    #for s in doc_text.split('.'):
    m = markup_sentence(doc_text, modifiers=my_modifiers, targets=my_targets)
    rslts.append(m)

    for r in rslts:
        context.addMarkup(r)
    return context

def get_output(something):
    context=markup_doc(something)
    output = context.getDocumentGraph()
    return output

In [11]:
get_output(text_remove_3)

__________________________________________
rawText: Admission Date: Discharge Date: Date of Birth: Sex: M Service: CARDIOTHORACIC Allergies:
Patient recorded as having No Known Allergies to Drugs Attending:
Chief Complaint:
Abnormal EKG, no symptoms Major Surgical or Invasive Procedure:
Coronary artery bypass grafts x 4(Left Internal Mammary Artery >
left anterior descending, saphenous vein graft > obtuse
marginal, saphenous vein graft > diagonal, saphenous vein graft
> Posterior descending artery) History of Present Illness:
64 year old white male underwent a strees test, which was
positive due to an abnormal EKG on routine examination. He
underwent cardiac catherization at outside hospital which
revealed coronary artery disease. He was referred for surgical
intervention. Past Medical History:
Diabetes mellitus ( diet-controlled)
hypertension
hyperlipidemia
s/p Melanoma resection Social History:
Works as a custodian.
52 pack year smoker, quit 20 years ago.
Drinks 8 shots/weekend
Lives

In [12]:
from xml.etree import ElementTree
context=markup_doc(text_remove_3)
print(context.getXML())


<ConTextDocument>
Admission Date: Discharge Date: Date of Birth: Sex: M Service: CARDIOTHORACIC Allergies: Patient recorded as having No Known Allergies to Drugs Attending: Chief Complaint: Abnormal EKG, no symptoms Major Surgical or Invasive Procedure: Coronary artery bypass grafts x 4(Left Internal Mammary Artery > left anterior descending, saphenous vein graft > obtuse marginal, saphenous vein graft > diagonal, saphenous vein graft > Posterior descending artery) History of Present Illness: 64 year old white male underwent a strees test, which was positive due to an abnormal EKG on routine examination. He underwent cardiac catherization at outside hospital which revealed coronary artery disease. He was referred for surgical intervention. Past Medical History: Diabetes mellitus ( diet-controlled) hypertension hyperlipidemia s/p Melanoma resection Social History: Works as a custodian. 52 pack year smoker, quit 20 years ago. Drinks 8 shots/weekend Lives with his wife Family History: no

In [13]:
context=markup_doc(text_remove_3)
root = ElementTree.fromstring(context.getDocumentGraph().getXML())
for this in root.findall('.//phrase'):
    print(this.text)
for this in root.findall('.//spanStart'):
    print(this.text)
for this in root.findall('.//spanStop'):
    print(this.text)
for this in root.findall('.//id'):
    print(this.text)

#print(root.find('.//phrase'))

 A1c 6.2 
 4036 
 4043 
 340050500664543781943983675328977982271 


In [14]:
tmp_text = "DIABETES MONITORING %HbA1c 6; new HbA1c 13.6"
context=markup_doc(tmp_text)
root = ElementTree.fromstring(context.getDocumentGraph().getXML())
for this in root.findall('.//phrase'):
    print(this.text)
for this in root.findall('.//spanStart'):
    print(this.text)
for this in root.findall('.//spanStop'):
    print(this.text)

 A1c 6 
 A1c 13.6 
 23 
 36 
 28 
 44 


It looks like the way that I am doing it now, I can 

In [15]:
import numpy as np
def extract_temperature(input_doc):
    output_list_numbers = []
    output_list_text = []
    context = markup_doc(input_doc)
    root = ElementTree.fromstring(context.getDocumentGraph().getXML())
    for this in root.findall('.//phrase'):  # search for <phrase> anywhere in the document
        # Collect the whole-number tokens in the phrase, e.g. " A1c 6 " -> [6]
        digits = [int(s) for s in this.text.split() if s.isdigit()]
        if digits:
            output_list_numbers.extend(digits)
        else:
            output_list_text.append(this.text)
    # The two lists can differ in length, so return them separately
    # instead of stacking them into one array
    return output_list_numbers, output_list_text

In [16]:
tmp_text = "DIABETES MONITORING not f/u %HbA1c 6"

In [17]:
get_output(tmp_text)

__________________________________________
rawText: DIABETES MONITORING not f/u %HbA1c 6
cleanedText: None
********************************
TARGET: <id> 340076367867323064107564591493323182911 </id> <phrase> A1c 6 </phrase> <category> ['hba1c_value'] </category> 
----MODIFIED BY: <id> 340076318745862305263675283496073974591 </id> <phrase> f/u </phrase> <category> ['future_order_a1c'] </category> 
__________________________________________

In [18]:
tmp_text = "DIABETES MONITORING %HbA1c 6; new f/u HbA1c 13.6"
from xml.etree import ElementTree
context=markup_doc(tmp_text)
print(context.getXML())


<ConTextDocument>
DIABETES MONITORING %HbA1c 6; new f/u HbA1c 13.6 <section>
<sectionLabel> document </sectionLabel>
<sentence>
<sentenceNumber> 0 </sentenceNumber>
<sentenceOffset> 0 </sentenceOffset></sentence>

<ConTextMarkup>
<rawText> DIABETES MONITORING %HbA1c 6; new f/u HbA1c 13.6 </rawText>
<cleanText> DIABETES MONITORING %HbA1c 6; new f/u HbA1c 13.6 </cleanText>
<nodes>

<node>
    <category> target </category>

<tagObject>
<id> 340087865458267134148236166591395943231 </id>
<phrase> A1c 6 </phrase>
<literal> A1C_COLON_OR_SPACE </literal>
<category> ['hba1c_value'] </category>
<spanStart> 23 </spanStart>
<spanStop> 28 </spanStop>
<scopeStart> 0 </scopeStart>
<scopeStop> 48 </scopeStop>
</tagObject>

</node>

<node>
    <category> modifier </category>

<tagObject>
<id> 340087792568357621025045580530961634111 </id>
<phrase> f/u </phrase>
<literal> f/u </literal>
<category> ['future_order_a1c'] </category>
<spanStart> 34 </spanStart>
<spanStop> 37 </spanStop>
<scopeStart> 37 </sc

In [20]:
tmp_text = "DIABETES MONITORING %HbA1c 6; new f/u HbA1c 13.6"
context=markup_doc(tmp_text)
root = ElementTree.fromstring(context.getDocumentGraph().getXML())
for node in root.findall('.//node'):
    phrase = node.find('.//phrase').text
    literal = node.find('.//literal').text
    print(literal, phrase)

 A1C_COLON_OR_SPACE   A1c 6 
 f/u   f/u 
 A1C_COLON_OR_SPACE   A1c 13.6 


In [21]:
tmp_text = "DIABETES MONITORING %HbA1c 6; new f/u HbA1c 13.6"
context=markup_doc(tmp_text)
root = ElementTree.fromstring(context.getDocumentGraph().getXML())
for node in root.findall('.//node'):
    phrase = node.find('.//phrase').text
    literal = node.find('.//literal').text
    Start = node.find('.//spanStart').text
    Stop = node.find('.//spanStop').text
    Node_ID = node.find('.//id').text
    category = node.find('.//category').text

    # These elements only exist when a modifier relationship is present,
    # so fall back to the string "None" when they are missing
    modified_by = node.findtext('.//modifyingNode', default="None")
    modifying_category = node.findtext('.//modifyingCategory', default="None")
    node_modified = node.findtext('.//modifiedNode', default="None")

    print(Start, Stop, phrase, literal, modified_by, Node_ID, modifying_category, node_modified)

 23   28   A1c 6   A1C_COLON_OR_SPACE  None  8820238909091154756036543416812196671  None None
 34   37   f/u   f/u  None  8820185033940645056286979806925968191  None  8820252377878782180973934319283753791 
 40   48   A1c 13.6   A1C_COLON_OR_SPACE   8820185033940645056286979806925968191   8820252377878782180973934319283753791   ['future_order_a1c']  None


OK, I have figured out how to get the pieces of a node that are available for every node. I can put these into lists, add the lists to a dataframe, and then transpose the dataframe to have something to work with. The next step is to read in the documents and figure out how to apply this to each of them.
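
The per-node collection step can be sketched with the standard library alone. The XML below is a hand-written sample shaped like the `<node>` elements printed earlier, not real PyConText output, and `FIELDS` and `nodes_to_rows` are hypothetical names.

```python
from xml.etree import ElementTree

# Hand-written sample (an assumption): real PyConText output has more fields.
SAMPLE_XML = """
<nodes>
  <node>
    <category> target </category>
    <tagObject>
      <id> 1 </id>
      <phrase> A1c 6 </phrase>
      <literal> A1C_COLON_OR_SPACE </literal>
      <spanStart> 23 </spanStart>
      <spanStop> 28 </spanStop>
    </tagObject>
  </node>
  <node>
    <category> modifier </category>
    <tagObject>
      <id> 2 </id>
      <phrase> f/u </phrase>
      <literal> f/u </literal>
      <spanStart> 34 </spanStart>
      <spanStop> 37 </spanStop>
    </tagObject>
  </node>
</nodes>
"""

FIELDS = ["id", "phrase", "literal", "spanStart", "spanStop", "modifyingNode"]

def nodes_to_rows(xml_text):
    """Return one dict per <node>, using "None" for any field that is missing."""
    root = ElementTree.fromstring(xml_text)
    rows = []
    for node in root.findall(".//node"):
        row = {}
        for field in FIELDS:
            # findtext returns the default when the element does not exist
            row[field] = node.findtext(".//" + field, default="None").strip()
        rows.append(row)
    return rows

rows = nodes_to_rows(SAMPLE_XML)
for row in rows:
    print(row)
```

Because each dict is already one row, passing the list to `pd.DataFrame(rows)` gives one row per node without a transpose.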

In [22]:
import os
os.listdir()

['.DS_Store',
 'Yaml_Files',
 'First_Run_Extract_A1c.csv',
 'Text_Files',
 '__pycache__',
 'NLP_Extract_A1c_Values.ipynb',
 'visual.py',
 'itemData.py',
 '.ipynb_checkpoints',
 'tmp_NLP_Extract_A1c_Values.ipynb']

In [23]:
import glob
list_of_files = glob.glob("/Users/david/Documents/David_Sant/Classes/NLP_BMI_6115_Biomedical_Text_Processing/Final_Project/Clamp_Documents/test/*.txt")
list_of_files[0:3]
#len(list_of_files)

['/Users/david/Documents/David_Sant/Classes/NLP_BMI_6115_Biomedical_Text_Processing/Final_Project/Clamp_Documents/test/591025.txt',
 '/Users/david/Documents/David_Sant/Classes/NLP_BMI_6115_Biomedical_Text_Processing/Final_Project/Clamp_Documents/test/460590.txt',
 '/Users/david/Documents/David_Sant/Classes/NLP_BMI_6115_Biomedical_Text_Processing/Final_Project/Clamp_Documents/test/26356.txt']

In [24]:
# Strip the directory and the ".txt" extension to get each document identifier
list_of_identifiers = [os.path.splitext(os.path.basename(w))[0] for w in list_of_files]
print(list_of_identifiers[0:10])

['591025', '460590', '26356', '19164', '739925', '502100', '542040', '316110', '717600', '649768']


In [25]:
list_of_text = []
for file in list_of_files:
    # read(), not readlines(): one string per document
    with open(file, 'r') as text_file:
        list_of_text.append(text_file.read())

In [26]:
print(list_of_text[0])

CVICU
   HPI:
   HD3
   readmit left pleural effusion
   66M s/p CABGx3(LIMA-LAD,SVG-OM,SVG-PLB) [**5-14**]
   EF:60% Wt:104kg Cr:1.1 HgA1c:6.2
   PMH:HTN,GERD,peripheral neuropathy, CAD, chronic diastolic HF
   [**Last Name (un) **]: HCTZ 25', lopressor 75''', lipitor 10', Naproxen 500mg''prn,
   omeprazole 20mg', MVI, colace 100''
   Current medications:
   Albuterol Inhaler, Albuterol 0.083% Neb Soln, Argatroban, Aspirin,
   Atorvastatin, Docusate Sodium, Furosemide, Metoprolol Tartrate,
   Multivitamins, Naproxen, Omeprazole, Piperacillin-Tazobactam Na,
   Potassium Chloride, Vancomycin
   24 Hour Events:
   Transferred to ICU for respiratory distress, resolved with oxygen
   Started on heparin for pulmonary embolsm
   Allergies:
   No Known Drug Allergies
   Last dose of Antibiotics:
   Vancomycin - [**2796-5-29**] 12:07 AM
   Piperacillin/Tazobactam (Zosyn) - [**2796-5-29**] 04:06 AM
   Infusions:
   Heparin Sodium - 1,300 units/hour
   Other ICU medications:
   Other medications

In [27]:
text_df = pd.DataFrame({"Identifier" : list_of_identifiers, "Text": list_of_text}) 
text_df.head()
# I might end up changing the identifier to the index column, but for now I am just going to use this. 

Unnamed: 0,Identifier,Text
0,591025,CVICU\n HPI:\n HD3\n readmit left pleura...
1,460590,TITLE:\n Chief Complaint:\n 24 Hour Events...
2,26356,Admission Date: [**3367-2-4**] D...
3,19164,Admission Date: [**2807-7-8**] Discharge ...
4,739925,[**2945-9-9**] 3:04 PM\n CHEST (PA & LAT) ...


In [29]:
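def get_a1c_flag(a):
    """Bucket an A1c value: Good (< 7.1), Moderate (7.1 to < 10.1), Poor (>= 10.1)."""
    try:
        value = float(a)
    except (TypeError, ValueError):
        return "Not a value"
    if value < 7.1:
        return "Good"
    elif value < 10.1:
        return "Moderate"
    elif value >= 10.1:
        return "Poor"
    else:
        # Only reachable for NaN, which fails every comparison
        return "Not Sure"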
output_array = []
for identifier, raw_text in zip(text_df["Identifier"], text_df["Text"]):
    # Strip MIMIC de-identification placeholders such as [**2796-5-29**]
    remove_MIMIC_comments = re.sub(r"\[\*\*.*?\*\*\]", "", raw_text)
    # Strip clock times so they are not mistaken for values
    remove_times = re.sub(r"\d{1,2}:\d{2}\s?P?A?\.?M\.?", "", remove_MIMIC_comments)
    # Collapse runs of whitespace into a single space
    cleaned_text = re.sub(r"\s{2,}", r" ", remove_times)

    context = markup_doc(cleaned_text)
    root = ElementTree.fromstring(context.getDocumentGraph().getXML())
    for node in root.findall('.//node'):
        phrase = node.find('.//phrase').text
        # Drop the "A1c" mention first so its "1" is not read as part of the value
        tmp1 = re.sub(r"[Aa]1[Cc]", "", phrase)
        # Keep the first number, with an optional decimal part; modifier
        # phrases such as "f/u" contain no number and yield an empty string
        value_match = re.search(r"\d+(?:\.\d+)?", tmp1)
        A1c_Value = value_match.group(0) if value_match else ""
        A1c_Flag = get_a1c_flag(A1c_Value)
        literal = node.find('.//literal').text
        Start = node.find('.//spanStart').text
        Stop = node.find('.//spanStop').text
        Node_ID = node.find('.//id').text
        category = node.find('.//category').text
        # Modifier links are optional; use the string "None" when absent
        modified_by = node.findtext('.//modifyingNode', default="None")
        modifying_category = node.findtext('.//modifyingCategory', default="None")
        node_modified = node.findtext('.//modifiedNode', default="None")
        output_array.append([identifier, Start, Stop, phrase, literal, A1c_Value, A1c_Flag, Node_ID,
                             modifying_category, modified_by, node_modified])
            
#output_array

[['591025',
  ' 110 ',
  ' 117 ',
  ' A1c:6.2 ',
  ' A1C_COLON_OR_SPACE ',
  '6.2',
  'Good',
  ' 28289837286129485628891116091057132351 ',
  'None',
  'None',
  'None'],
 ['502100',
  ' 4005 ',
  ' 4008 ',
  ' f/u ',
  ' f/u ',
  '',
  'Not a value',
  ' 28297033580130656258674737688066151231 ',
  'None',
  'None',
  ' 28297312463262706469143066962771333951 '],
 ['502100',
  ' 5048 ',
  ' 5061 ',
  ' A1c in of 9.2 ',
  ' A1C_IN_OF ',
  '9.2',
  'Moderate',
  ' 28297312463262706469143066962771333951 ',
  " ['future_order_a1c'] ",
  ' 28297033580130656258674737688066151231 ',
  'None'],
 ['608452',
  ' 125 ',
  ' 132 ',
  ' A1c 6.3 ',
  ' A1C_COLON_OR_SPACE ',
  '6.3',
  'Good',
  ' 28306105204738539525329198470379623231 ',
  'None',
  'None',
  'None'],
 ['608452',
  ' 1413 ',
  ' 1420 ',
  ' A1c 6.3 ',
  ' A1C_COLON_OR_SPACE ',
  '6.3',
  'Good',
  ' 28306144026538171514854619306915287871 ',
  'None',
  'None',
  'None'],
 ['472998',
  ' 2143 ',
  ' 2150 ',
  ' A1c 5.5 ',
  ' A1C_COLO

In [30]:
len(output_array)

44

In [31]:
type(output_array)

list

In [32]:
test_df = pd.DataFrame(output_array, columns=("Identifier", "Start", "Stop", "Phrase", "Annotation_Type", "A1c_Value", "A1c_Flag", "Node_ID", "Modifying_Category", "Modified_By", "Node_Modified"))
test_df.head()

Unnamed: 0,Identifier,Start,Stop,Phrase,Annotation_Type,A1c_Value,A1c_Flag,Node_ID,Modifying_Category,Modified_By,Node_Modified
0,591025,110,117,A1c:6.2,A1C_COLON_OR_SPACE,6.2,Good,28289837286129485628891116091057132351,,,
1,502100,4005,4008,f/u,f/u,,Not a value,28297033580130656258674737688066151231,,,28297312463262706469143066962771333951
2,502100,5048,5061,A1c in of 9.2,A1C_IN_OF,9.2,Moderate,28297312463262706469143066962771333951,['future_order_a1c'],28297033580130656258674737688066151231,
3,608452,125,132,A1c 6.3,A1C_COLON_OR_SPACE,6.3,Good,28306105204738539525329198470379623231,,,
4,608452,1413,1420,A1c 6.3,A1C_COLON_OR_SPACE,6.3,Good,28306144026538171514854619306915287871,,,


In [33]:
test_df.to_csv("First_Run_Extract_A1c.csv")

In [None]:
#A1c_Value_Results.groupby("Identifier").max()
# Seemed to work, but I want it to take the max A1c value, not the max start, stop, or index

In [None]:
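# Convert to numbers; non-numeric entries (e.g. the empty value on modifier rows) become NaN
A1c_Value_Results["A1c_Value"] = pd.to_numeric(A1c_Value_Results["A1c_Value"], errors="coerce")
# Keep, for each identifier, the full row holding the highest A1c value
A1c_Value_Results = A1c_Value_Results.dropna(subset=["A1c_Value"])
A1c_Value_Results = A1c_Value_Results.loc[A1c_Value_Results.groupby("Identifier")["A1c_Value"].idxmax()]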
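
The selection this cell aims for (the whole row with the highest A1c value per document, rather than the maximum of every column) can also be sketched in plain Python. The rows and the helper name `max_a1c_per_identifier` are made up for illustration and only mimic a few columns of the output table.

```python
def max_a1c_per_identifier(rows):
    """Keep, for each identifier, the whole row with the largest numeric A1c value."""
    best = {}
    for row in rows:
        try:
            value = float(row["A1c_Value"])
        except ValueError:
            continue  # skip rows with no usable value, such as modifier rows
        current = best.get(row["Identifier"])
        if current is None or value > float(current["A1c_Value"]):
            best[row["Identifier"]] = row
    return best

# Made-up rows: the highest value for "A" is not the row with the highest Start
rows = [
    {"Identifier": "A", "Start": 900, "A1c_Value": "6.3"},
    {"Identifier": "A", "Start": 100, "A1c_Value": "9.2"},
    {"Identifier": "B", "Start": 50, "A1c_Value": ""},
    {"Identifier": "B", "Start": 10, "A1c_Value": "5.5"},
]
best = max_a1c_per_identifier(rows)
print(best)
```

Taking a column-wise `max()` over the same rows would pair the value 9.2 with the start offset 900, which is exactly the mismatch noted in the comment above.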