# Extract the HbA1c values from MIMIC-III dataset documents

In this notebook I develop a Python script for extracting HbA1c values from text documents. The main pattern I am looking for is "A1c" followed by a value.

I will be using PyConText to accomplish this task. In earlier experiments with PyConText on the MIMIC-III dataset, attaching modifiers to numbers sometimes returned very odd results, so I use a single regular expression to capture both the HbA1c mention and its value. I then remove everything that is not part of the number to obtain the actual value.
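
As a minimal sketch of that idea, independent of PyConText: one regular expression matches the A1c mention together with the number that follows, and a second substitution strips everything that is not part of the number. The pattern `A1C_PATTERN` and the helper `extract_a1c_values` are illustrative assumptions, not the contents of the YAML target files used below.

```python
import re

# Hypothetical pattern: "A1c" (any case), a few connecting characters
# such as ":", "-" or " was ", then a number like 6.2 or 10.4.
A1C_PATTERN = re.compile(r"a1c\D{0,8}?\d{1,2}(?:\.\d)?", re.IGNORECASE)

def extract_a1c_values(text):
    """Return every A1c value found in text as a float."""
    values = []
    for match in A1C_PATTERN.finditer(text):
        phrase = match.group(0)
        # Drop the mention itself so the "1" in "A1c" is not read as a value.
        without_mention = re.sub(r"a1c", "", phrase, flags=re.IGNORECASE)
        # Keep only digits and the decimal point.
        number = re.sub(r"[^\d.]", "", without_mention)
        values.append(float(number))
    return values

print(extract_a1c_values("EF:60% Wt:104kg Cr:1.1 HgA1c:6.2"))
print(extract_a1c_values("last A1C was 10.4"))
```

Removing the mention before stripping non-numeric characters matters: otherwise "A1c:6.2" would collapse to "16.2".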


The first step is to import PyConText and define the functions (taken from the PyConText GitHub page and modified with help from Jeff Ferraro) that run the actual text parsing.

In [1]:
import pyConTextNLP.pyConText as pyConText
# itemData has been rewritten to accept a relative local path, so it can be pointed at customized YAML files
import itemData
import re
import glob
import pandas as pd
from xml.etree import ElementTree
import math


In [2]:
my_targets=itemData.get_items('Yaml_Files/A1c_targets.yml')
my_modifiers=itemData.get_items('Yaml_Files/A1c_modifiers.yml')

The functions *markup_sentence* and *markup_doc* were both covered in the NLP lab.

In [3]:
## This one is the same, it just doesn't split the text into sentences.
def markup_sentence(s, modifiers, targets, prune_inactive=True):
    """
    Mark up the text s with the given modifiers and targets and return the ConTextMarkup.
    """
    markup = pyConText.ConTextMarkup()
    markup.setRawText(s)
    markup.cleanText()
    # Use the arguments rather than the global my_modifiers / my_targets
    markup.markItems(modifiers, mode="modifier")
    markup.markItems(targets, mode="target")
    markup.pruneMarks()
    markup.dropMarks('Exclusion')
    # apply modifiers to any targets within the modifiers scope
    markup.applyModifiers()
    markup.pruneSelfModifyingRelationships()
    if prune_inactive:
        markup.dropInactiveModifiers()
    return markup

def markup_doc(doc_text:str)->pyConText.ConTextDocument:
    rslts=[]
    context = pyConText.ConTextDocument()
    #for s in doc_text.split('.'):
    m = markup_sentence(doc_text, modifiers=my_modifiers, targets=my_targets)
    rslts.append(m)

    for r in rslts:
        context.addMarkup(r)
    return context

def get_output(something):
    context=markup_doc(something)
    output = context.getDocumentGraph()
    return output

Ok, I have figured out how to get the pieces that I need from every node. I can put these into lists, add the lists into a dataframe, and then transpose the dataframe to have something to work with. The next step is reading in the documents and figuring out how to apply the markup to each of them.
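
A rough sketch of the node-to-list step, using only the standard library. The XML string here is hand-written for illustration: it reuses the tag names queried later in this notebook, but the real layout produced by PyConText may nest them differently. `SAMPLE_XML`, `FIELDS`, and `nodes_to_rows` are hypothetical names.

```python
from xml.etree import ElementTree

# Hand-written stand-in for the XML of a document graph (an assumption,
# not real PyConText output).
SAMPLE_XML = """
<graph>
  <node>
    <id>1</id>
    <phrase>A1c:6.2</phrase>
    <literal>A1C_COLON_OR_SPACE</literal>
    <spanStart>110</spanStart>
    <spanStop>117</spanStop>
  </node>
  <node>
    <id>2</id>
    <phrase>f/u</phrase>
    <literal>f/u</literal>
    <spanStart>4005</spanStart>
    <spanStop>4008</spanStop>
  </node>
</graph>
""".strip()

FIELDS = ("id", "phrase", "literal", "spanStart", "spanStop")

def nodes_to_rows(xml_text):
    """Turn every <node> into a list of field values, one list per node."""
    root = ElementTree.fromstring(xml_text)
    rows = []
    for node in root.findall(".//node"):
        # findtext returns the default instead of raising when a tag is absent.
        rows.append([node.findtext(".//" + field, default="None") for field in FIELDS])
    return rows

rows = nodes_to_rows(SAMPLE_XML)
for row in rows:
    print(row)
```

Building one list per node, rather than one list per field, gives rows that can go straight into a dataframe without a transpose.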

In [4]:
import glob
list_of_files = glob.glob("/Users/david/Documents/David_Sant/Classes/NLP_BMI_6115_Biomedical_Text_Processing/Final_Project/Clamp_Documents/test/*.txt")
list_of_files[0:3]
#len(list_of_files)

['/Users/david/Documents/David_Sant/Classes/NLP_BMI_6115_Biomedical_Text_Processing/Final_Project/Clamp_Documents/test/591025.txt',
 '/Users/david/Documents/David_Sant/Classes/NLP_BMI_6115_Biomedical_Text_Processing/Final_Project/Clamp_Documents/test/460590.txt',
 '/Users/david/Documents/David_Sant/Classes/NLP_BMI_6115_Biomedical_Text_Processing/Final_Project/Clamp_Documents/test/26356.txt']

In [5]:
replaced_list = [w.replace('/Users/david/Documents/David_Sant/Classes/NLP_BMI_6115_Biomedical_Text_Processing/Final_Project/Clamp_Documents/test/', '') for w in list_of_files] 
list_of_identifiers = [i.replace(".txt", "") for i in replaced_list] 
print(list_of_identifiers[0:10])

['591025', '460590', '26356', '19164', '739925', '502100', '542040', '316110', '717600', '649768']
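
The identifiers can also be derived without hard-coding the directory string. A small sketch, assuming every file is named `<identifier>.txt`; the paths below are made up and stand in for the glob results.

```python
from pathlib import Path

# Hypothetical paths standing in for the glob results.
example_files = [
    "/data/notes/test/591025.txt",
    "/data/notes/test/460590.txt",
    "/data/notes/test/26356.txt",
]

# Path.stem drops both the directory and the final suffix.
example_identifiers = [Path(p).stem for p in example_files]
print(example_identifiers)
```

This keeps working if the notes are moved to another folder, since nothing depends on the absolute path.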


In [6]:
list_of_text = []
for file in list_of_files:
    # The with block closes each file automatically
    with open(file, 'r') as text_file:
        list_of_text.append(text_file.read())  # read(), not readlines(), keeps each note as one string

In [7]:
print(list_of_text[0])

CVICU
   HPI:
   HD3
   readmit left pleural effusion
   66M s/p CABGx3(LIMA-LAD,SVG-OM,SVG-PLB) [**5-14**]
   EF:60% Wt:104kg Cr:1.1 HgA1c:6.2
   PMH:HTN,GERD,peripheral neuropathy, CAD, chronic diastolic HF
   [**Last Name (un) **]: HCTZ 25', lopressor 75''', lipitor 10', Naproxen 500mg''prn,
   omeprazole 20mg', MVI, colace 100''
   Current medications:
   Albuterol Inhaler, Albuterol 0.083% Neb Soln, Argatroban, Aspirin,
   Atorvastatin, Docusate Sodium, Furosemide, Metoprolol Tartrate,
   Multivitamins, Naproxen, Omeprazole, Piperacillin-Tazobactam Na,
   Potassium Chloride, Vancomycin
   24 Hour Events:
   Transferred to ICU for respiratory distress, resolved with oxygen
   Started on heparin for pulmonary embolsm
   Allergies:
   No Known Drug Allergies
   Last dose of Antibiotics:
   Vancomycin - [**2796-5-29**] 12:07 AM
   Piperacillin/Tazobactam (Zosyn) - [**2796-5-29**] 04:06 AM
   Infusions:
   Heparin Sodium - 1,300 units/hour
   Other ICU medications:
   Other medications

In [8]:
text_df = pd.DataFrame({"Identifier" : list_of_identifiers, "Text": list_of_text}) 
text_df.head()
# I might end up changing the identifier to the index column, but for now I am just going to use this. 

Unnamed: 0,Identifier,Text
0,591025,CVICU\n HPI:\n HD3\n readmit left pleura...
1,460590,TITLE:\n Chief Complaint:\n 24 Hour Events...
2,26356,Admission Date: [**3367-2-4**] D...
3,19164,Admission Date: [**2807-7-8**] Discharge ...
4,739925,[**2945-9-9**] 3:04 PM\n CHEST (PA & LAT) ...


In [9]:
def get_a1c_flag(a):
    """Bucket an A1c value: < 7.1 Good, 7.1 to < 10.1 Moderate, >= 10.1 Poor."""
    try:
        value = float(a)
    except (TypeError, ValueError):
        return "Not a value"
    if value < 7.1:
        return "Good"
    elif value < 10.1:
        return "Moderate"
    elif value >= 10.1:
        return "Poor"
    else:
        return "Not Sure"  # only reachable when the value is NaN

output_array = []
for i in range(len(text_df)):
    raw_text = text_df["Text"][i]
    # Strip MIMIC de-identification placeholders, clock times, and repeated whitespace
    remove_MIMIC_comments = re.sub(r"\[\*\*.*?\*\*\]", "", raw_text)
    remove_times = re.sub(r"\d{1,2}:\d{2}\s?P?A?\.?M\.?", "", remove_MIMIC_comments)
    cleaned_text = re.sub(r"\s{2,}", r" ", remove_times)

    context = markup_doc(cleaned_text)
    root = ElementTree.fromstring(context.getDocumentGraph().getXML())
    for node in root.findall('.//node'):
        phrase = node.find('.//phrase').text
        # Remove the "A1c" mention first so its "1" is not kept as part of the value
        tmp1 = re.sub(r"[Aa]1[Cc]", "", phrase)
        # Keep only digits and the decimal point
        A1c_Value = re.sub(r"[^\d.]", "", tmp1)
        A1c_Flag = get_a1c_flag(A1c_Value)
        literal = node.find('.//literal').text
        Start = node.find('.//spanStart').text
        Stop = node.find('.//spanStop').text
        Node_ID = node.find('.//id').text
        category = node.find('.//category').text
        # Nodes without a modifier relationship lack these tags, so find() returns None
        try:
            modified_by = node.find('.//modifyingNode').text
        except AttributeError:
            modified_by = "None"
        try:
            modifying_category = node.find('.//modifyingCategory').text
        except AttributeError:
            modifying_category = "None"
        try:
            node_modified = node.find('.//modifiedNode').text
        except AttributeError:
            node_modified = "None"
        output_array.append([text_df["Identifier"][i], Start, Stop, phrase, literal, A1c_Value, A1c_Flag, Node_ID,
                             modifying_category, modified_by, node_modified])
            
#output_array
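
To see what the three cleaning substitutions in the loop do, here they are applied to a short made-up string. The regular expressions are copied from the cell above; the wrapper `clean_note` and the sample text are assumptions for illustration.

```python
import re

def clean_note(raw_text):
    """Apply the three cleaning substitutions in the same order as the loop."""
    # 1. Remove MIMIC de-identification placeholders such as [**2796-5-29**].
    text = re.sub(r"\[\*\*.*?\*\*\]", "", raw_text)
    # 2. Remove clock times such as "12:07 AM" so they are not read as values.
    text = re.sub(r"\d{1,2}:\d{2}\s?P?A?\.?M\.?", "", text)
    # 3. Collapse runs of whitespace (including newlines) into a single space.
    return re.sub(r"\s{2,}", " ", text)

sample = "Vancomycin - [**2796-5-29**] 12:07 AM\n   HgA1c:6.2"
print(repr(clean_note(sample)))
```

Because the text is cleaned before markup, the Start and Stop offsets reported later refer to positions in the cleaned text, not in the original file.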

In [10]:
len(output_array)

44

In [11]:
type(output_array)

list

In [12]:
test_df = pd.DataFrame(output_array, columns=("Identifier", "Start", "Stop", "Phrase", "Annotation_Type", "A1c_Value", "A1c_Flag", "Node_ID", "Modifying_Category", "Modified_By", "Node_Modified"))
test_df.head()

Unnamed: 0,Identifier,Start,Stop,Phrase,Annotation_Type,A1c_Value,A1c_Flag,Node_ID,Modifying_Category,Modified_By,Node_Modified
0,591025,110,117,A1c:6.2,A1C_COLON_OR_SPACE,6.2,Good,272471105618916522623705373226404696895,,,
1,502100,4005,4008,f/u,f/u,,Not a value,272478004015026639619579643098160452415,,,272478244076359057840522551536329970495
2,502100,5048,5061,A1c in of 9.2,A1C_IN_OF,9.2,Moderate,272478244076359057840522551536329970495,['future_order_a1c'],272478004015026639619579643098160452415,
3,608452,125,132,A1c 6.3,A1C_COLON_OR_SPACE,6.3,Good,272487144568135910296207810263710716735,,,
4,608452,1413,1420,A1c 6.3,A1C_COLON_OR_SPACE,6.3,Good,272487180220809041715159727358488367935,,,


I had put in f/u as a modifier so that there would be a modifier file, but all of my targets require a value. In other words, if a note has f/u together with a value, it is either a typing mistake or it refers to something else. As such, I can get rid of all of the modifier rows and the columns for node ID, modifying category, modified by, and node modified.

In [13]:
modifier_columns = test_df[test_df["Node_Modified"]!="None"]
modifier_columns

Unnamed: 0,Identifier,Start,Stop,Phrase,Annotation_Type,A1c_Value,A1c_Flag,Node_ID,Modifying_Category,Modified_By,Node_Modified
1,502100,4005,4008,f/u,f/u,,Not a value,272478004015026639619579643098160452415,,,272478244076359057840522551536329970495
17,558451,3115,3118,f/u,f/u,,Not a value,272511513566362047721164832511955063615,,,272511580910300184845851787024312849215
41,664150,4330,4333,f/u,f/u,,Not a value,272546792282566399345408485762160677695,,,272546936477822175306502906012150289215


In [14]:
modifier_columns = test_df[test_df["Node_Modified"]!="None"]
A1c_Value_Results = test_df[["Identifier", "Start", "Stop", "Phrase", "Annotation_Type", "A1c_Value", "A1c_Flag"]].drop(modifier_columns.index, axis = 0)
len(A1c_Value_Results)

41

In [15]:
A1c_Value_Results.head()

Unnamed: 0,Identifier,Start,Stop,Phrase,Annotation_Type,A1c_Value,A1c_Flag
0,591025,110,117,A1c:6.2,A1C_COLON_OR_SPACE,6.2,Good
2,502100,5048,5061,A1c in of 9.2,A1C_IN_OF,9.2,Moderate
3,608452,125,132,A1c 6.3,A1C_COLON_OR_SPACE,6.3,Good
4,608452,1413,1420,A1c 6.3,A1C_COLON_OR_SPACE,6.3,Good
5,472998,2143,2150,A1c 5.5,A1C_COLON_OR_SPACE,5.5,Good


In [16]:
A1c_Value_Results["A1c_Value"] = A1c_Value_Results["A1c_Value"].apply(pd.to_numeric)
A1c_Value_Results = A1c_Value_Results.groupby("Identifier").apply(lambda x: x.loc[x.A1c_Value.idxmax()])
A1c_Value_Results.to_csv("Output_Files/A1c_Results_annotated_files.csv")
modifier_columns.to_csv("Output_Files/Modifier_Columns_to_A1c_annotated_files.csv") # Not needed, but I am saving it just in case

In [17]:
A1c_Value_Results.head()

Identifier,Identifier,Start,Stop,Phrase,Annotation_Type,A1c_Value,A1c_Flag
1352,1352,3902,3909,A1c-5.6,A1C_DASH,5.6,Good
23934,23934,1507,1517,A1C of 6.4,A1C_IN_OF,6.4,Good
24525,24525,2999,3006,A1c-6.5,A1C_DASH,6.5,Good
26293,26293,487,498,A1C was 5.7,A1C_IS_OR_WAS,5.7,Good
29424,29424,2202,2209,A1c-5.1,A1C_DASH,5.1,Good
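
The groupby in cell 16 keeps one row per document: the one with the highest A1c value. A plain-Python sketch of the same selection, using made-up (identifier, value) pairs rather than the real results:

```python
# Made-up (identifier, value) pairs; one document can contribute several hits.
hits = [("608452", 6.3), ("608452", 7.0), ("1352", 5.6)]

highest = {}
for identifier, value in hits:
    # Keep the value only if it beats the best one seen so far for this document.
    if identifier not in highest or value > highest[identifier]:
        highest[identifier] = value

print(highest)
```

Taking the maximum is one possible rule for documents with several mentions; the most recent or the first mention would be alternatives with different clinical meaning.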


In [18]:
Manual_Results = pd.read_csv("Manual_Annotation_Results/Manual.Annotation.A1c.Results.csv", index_col = 1)

In [19]:
Manual_Results.head()

Identifier,File,Sentence,HbA1c
10,10.txt,No mentions of diabetes or HbA1c,
1245072,1245072.txt,insulin-dependent diabetes,
1352,1352.txt,%HbA1c-5.6,5.6
1489918,1489918.txt,DIABETIC DIAGNOSIS,
1557914,1557914.txt,No mentions of diabetes or HbA1c,


In [20]:
Manual_Results.dtypes

File         object
Sentence     object
HbA1c       float64
dtype: object

In [21]:
A1c_Value_Results.dtypes

Identifier          object
Start               object
Stop                object
Phrase              object
Annotation_Type     object
A1c_Value          float64
A1c_Flag            object
dtype: object

In [22]:
A1c_Value_Results["Identifier"] = A1c_Value_Results["Identifier"].apply(pd.to_numeric)

In [23]:
Merged_Manual_and_Machine = pd.merge(Manual_Results, A1c_Value_Results, on=['Identifier'], how = 'outer')
Merged_Manual_and_Machine.head()

Defaulting to column, but this will raise an ambiguity error in a future version
  exec(code_obj, self.user_global_ns, self.user_ns)


Unnamed: 0,Identifier,File,Sentence,HbA1c,Start,Stop,Phrase,Annotation_Type,A1c_Value,A1c_Flag
0,10,10.txt,No mentions of diabetes or HbA1c,,,,,,,
1,1245072,1245072.txt,insulin-dependent diabetes,,,,,,,
2,1352,1352.txt,%HbA1c-5.6,5.6,3902.0,3909.0,A1c-5.6,A1C_DASH,5.6,Good
3,1489918,1489918.txt,DIABETIC DIAGNOSIS,,,,,,,
4,1557914,1557914.txt,No mentions of diabetes or HbA1c,,,,,,,


If both HbA1c and A1c_Value are NaN = True Negative
<br>If both HbA1c and A1c_Value are numbers (and they match) = True Positive
<br>If HbA1c is NaN and A1c_Value is a number = False Positive
<br>If HbA1c is a number and A1c_Value is NaN = False Negative
<br>If both columns give a number but the numbers don't match, the row is labeled Non_Matching_Values rather than being forced into one of the four categories above. Luckily, I don't think I have any of those.

In [24]:
def get_category(manual, machine):
    if math.isnan(manual):
        if math.isnan(machine):
            return "True_Negative"
        else:
            return "False_Positive"
    else:
        if math.isnan(machine):
            return "False_Negative"
        elif manual == machine:
            return "True_Positive"
        else:
            return "Non_Matching_Values"


Merged_Manual_and_Machine["Category"] = Merged_Manual_and_Machine.apply(lambda x: get_category(x["HbA1c"], x["A1c_Value"]), axis = 1)
#["HbA1c", "A1c_Value"].apply(get_category, axis = 1)

In [25]:
Merged_Manual_and_Machine.head()

Unnamed: 0,Identifier,File,Sentence,HbA1c,Start,Stop,Phrase,Annotation_Type,A1c_Value,A1c_Flag,Category
0,10,10.txt,No mentions of diabetes or HbA1c,,,,,,,,True_Negative
1,1245072,1245072.txt,insulin-dependent diabetes,,,,,,,,True_Negative
2,1352,1352.txt,%HbA1c-5.6,5.6,3902.0,3909.0,A1c-5.6,A1C_DASH,5.6,Good,True_Positive
3,1489918,1489918.txt,DIABETIC DIAGNOSIS,,,,,,,,True_Negative
4,1557914,1557914.txt,No mentions of diabetes or HbA1c,,,,,,,,True_Negative


In [26]:
results = Merged_Manual_and_Machine.groupby(["Category"]).size()
results

Category
True_Negative    53
True_Positive    39
dtype: int64

Luckily for me, there were no instances where a value was picked up but was incorrect. In fact, there are only true positives and true negatives, so every metric below comes out to 1 (100%).

I only have two possibilities here: yes, there is an A1c value, or no, there is no A1c value.
<br>Below, TP = True positives, TN = True Negatives, FP = False Positives, FN = False Negatives

Recall Yes A1c = TP/(TP+FN) = 39/(39+0) = 1
<br>Precision Yes A1c = TP/(TP+FP) = 39/(39+0) = 1
<br>F-Measure = 2xPrecisionxRecall/(Precision + Recall) = 2x1x1/(1+1) = 1

Accuracy = (TP + TN)/total = (53+39)/92 = 1
<br>NPV = TN/(FN+TN) = 53/(0+53) = 1
<br>PPV = TP/(TP+FP) = 39/(39+0) = 1
<br>Sens = TP/(TP+FN) = 39/(39+0) = 1
<br>Spec = TN/(FP+TN) = 53/(0+53) = 1
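
The arithmetic above can be checked with a few lines of Python. This is a sketch with hypothetical helper names (`safe_ratio`, `summarize`); the counts are the ones from the Category table (TP = 39, TN = 53, FP = 0, FN = 0).

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None

def summarize(tp, tn, fp, fn):
    """Compute the evaluation metrics from the four confusion-matrix counts."""
    precision = safe_ratio(tp, tp + fp)   # same as PPV
    recall = safe_ratio(tp, tp + fn)      # same as sensitivity
    if precision is None or recall is None or precision + recall == 0:
        f_measure = None
    else:
        f_measure = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": safe_ratio(tp + tn, tp + tn + fp + fn),
        "npv": safe_ratio(tn, tn + fn),
        "specificity": safe_ratio(tn, tn + fp),
    }

metrics = summarize(tp=39, tn=53, fp=0, fn=0)
for name, value in metrics.items():
    print(name, value)
```

With no false positives or false negatives, every ratio has equal numerator and denominator, so each metric is 1.0.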
