# Extract the HbA1c values from MIMIC-III dataset documents

In this notebook I develop a Python script that extracts HbA1c values from text documents. The main pattern I am looking for is a mention of A1c followed by a value.

I will be using PyConText to accomplish this task. While experimenting with PyConText on the MIMIC-III dataset, I found that combining modifiers with numbers sometimes returns very odd results, so I use a single regular expression to capture both the HbA1c mention and its value. I then strip everything that is not part of the number to obtain the actual value.
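
As a rough illustration of that single-pattern idea, the sketch below uses plain `re` rather than PyConText. The pattern `A1C_WITH_VALUE` and the helper `extract_a1c_value` are hypothetical names introduced for this example; the real target patterns live in `Yaml_Files/A1c_targets.yml` and are not reproduced here.

```python
import re

# Hypothetical pattern (an assumption, not the contents of A1c_targets.yml):
# an A1c mention, an optional short connector, then a number such as 6.3 or 10.
A1C_WITH_VALUE = re.compile(
    r"a1c\s*(?::|is|was|of|in of)?\s*\d{1,2}(?:\.\d)?",
    re.IGNORECASE,
)

def extract_a1c_value(text):
    """Return the first A1c value found in text as a float, or None if no match."""
    match = A1C_WITH_VALUE.search(text)
    if match is None:
        return None
    # Drop the "A1c" mention first so its "1" is not mistaken for part of the value.
    without_mention = re.sub(r"a1c", "", match.group(0), flags=re.IGNORECASE)
    # Keep only digits and the decimal point.
    return float(re.sub(r"[^\d.]", "", without_mention))
```

Because the mention and the value must sit close together, a looser phrasing with extra words between them is simply not matched, which trades some recall for not picking up unrelated numbers.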

This notebook is almost identical to the previous notebook, except that here I run the script on the test dataset.


First, import PyConText and define the functions (taken from the PyConText GitHub page and modified with help from Jeff Ferraro) that run the actual text parsing.

In [1]:
import pyConTextNLP.pyConText as pyConText
# itemData has been rewritten to accept a relative local path, so it can be pointed at customized YAML files later
import os
import itemData
import re
import glob
import pandas as pd
from xml.etree import ElementTree
import math


In [2]:
my_targets=itemData.get_items('Yaml_Files/A1c_targets.yml')
my_modifiers=itemData.get_items('Yaml_Files/A1c_modifiers.yml')

The functions *markup_sentence* and *markup_doc* were both covered in the NLP lab.

In [3]:
## Same as the lab version, except that it does not split the text into sentences.
def markup_sentence(s, modifiers, targets, prune_inactive=True):
    """
    Mark up the text s with the given modifiers and targets and return the ConTextMarkup.
    """
    markup = pyConText.ConTextMarkup()
    markup.setRawText(s)
    markup.cleanText()
    # Use the arguments rather than the module-level my_modifiers / my_targets
    markup.markItems(modifiers, mode="modifier")
    markup.markItems(targets, mode="target")
    markup.pruneMarks()
    markup.dropMarks('Exclusion')
    # apply modifiers to any targets within the modifiers scope
    markup.applyModifiers()
    markup.pruneSelfModifyingRelationships()
    if prune_inactive:
        markup.dropInactiveModifiers()
    return markup

def markup_doc(doc_text:str)->pyConText.ConTextDocument:
    rslts=[]
    context = pyConText.ConTextDocument()
    #for s in doc_text.split('.'):
    m = markup_sentence(doc_text, modifiers=my_modifiers, targets=my_targets)
    rslts.append(m)

    for r in rslts:
        context.addMarkup(r)
    return context

def get_output(something):
    context=markup_doc(something)
    output = context.getDocumentGraph()
    return output

OK, I have figured out how to get the same pieces out of every node. I can put these into lists, add the lists to a dataframe, and transpose the dataframe to get something to work with. The next step is reading in the documents and figuring out how to apply the markup functions to them.
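
The node-to-table idea described above can be sketched with the standard library alone. The XML string and the helper `nodes_to_rows` are hypothetical: the sample only mimics the tag names that the parsing code in this notebook looks up, and real PyConText output contains more fields.

```python
from xml.etree import ElementTree

# Hypothetical XML (an assumption): two nodes carrying only the tags used here.
SAMPLE_XML = """
<graph>
  <node><id>1</id><phrase>A1c 6.3</phrase><literal>A1C_COLON_OR_SPACE</literal>
        <spanStart>125</spanStart><spanStop>132</spanStop></node>
  <node><id>2</id><phrase>A1C was 5.7</phrase><literal>A1C_IS_OR_WAS</literal>
        <spanStart>487</spanStart><spanStop>498</spanStop></node>
</graph>
"""

FIELDS = ["id", "phrase", "literal", "spanStart", "spanStop"]

def nodes_to_rows(xml_text):
    """Return one dict per <node>, keyed by tag name."""
    root = ElementTree.fromstring(xml_text.strip())
    rows = []
    for node in root.findall(".//node"):
        # findtext returns None when a tag is missing, so no try/except is needed.
        rows.append({name: node.findtext(".//" + name) for name in FIELDS})
    return rows

rows = nodes_to_rows(SAMPLE_XML)
```

Passing the returned list to `pd.DataFrame(...)` should give one row per node, with no transpose needed.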

In [4]:
os.listdir('Text_Files/')

['.DS_Store',
 'Training_Dataset',
 'test_files.txt',
 'Testing_Dataset',
 'list_of_Files.txt']

In [5]:
import glob
list_of_files = glob.glob("Text_Files/Testing_Dataset/*.txt") # New Folder
print(len(list_of_files))
list_of_files[0:3]
#len(list_of_files)

75


['Text_Files/Testing_Dataset/750400.txt',
 'Text_Files/Testing_Dataset/1520872.txt',
 'Text_Files/Testing_Dataset/1671892.txt']

In [6]:
replaced_list = [w.replace('Text_Files/Testing_Dataset/', '') for w in list_of_files] # New Folder
list_of_identifiers = [i.replace(".txt", "") for i in replaced_list] 
print(list_of_identifiers[0:10])

['750400', '1520872', '1671892', '502100', '542040', '460197', '612341', '499708', '1874098', '1028562']


In [7]:
list_of_text = []
for file in list_of_files:
    # read() returns the whole file as one string (readlines() would give a list of lines)
    with open(file, 'r') as text_file:
        list_of_text.append(text_file.read())

In [8]:
print(list_of_text[0])

[**2642-3-27**] 11:24 AM
 CHEST (PORTABLE AP)                                             Clip # [**Clip Number (Radiology) 107400**]
 Reason: please confirm right arm picc tip; page 0-2443 with results.
 ______________________________________________________________________________
 UNDERLYING MEDICAL CONDITION:
  63 year old woman with bilateral pneumococcal pneumonia with effusions,
  sepsis, s/p tube thoracostomy and now s/p tracheostomy with new NG tube
  placement.

  Requiring longterm access with triple lumen cl out.
 REASON FOR THIS EXAMINATION:
  please confirm right arm picc tip; page 0-2443 with results. thanks
 ______________________________________________________________________________
                                 FINAL REPORT
 INDICATION:  For PICC line placement in patient with pneumonia and
 tracheostomy.

 FINDINGS:  PICC line is in the right brachiocephalic vein. Tracheostomy tube
 is 3 cm above carina. Chest tube is in right upper hemithorax. The tube
 extends

In [9]:
text_df = pd.DataFrame({"Identifier" : list_of_identifiers, "Text": list_of_text}) 
text_df.head()
# I might end up changing the identifier to the index column, but for now I am just going to use this. 

Unnamed: 0,Identifier,Text
0,750400,[**2642-3-27**] 11:24 AM\n CHEST (PORTABLE AP)...
1,1520872,Neonatology Attending\n\nDOL 127 PMA 42 6/7 we...
2,1671892,Nursing note addendum\nPalliative care consult...
3,502100,TITLE:\n Chief Complaint:\n 24 Hour Events...
4,542040,61 year old male s/p replacement of L perc ne...


In [10]:
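def get_a1c_flag(a):
    """Bin an extracted A1c value into Good (< 7.1), Moderate (< 10.1), or Poor."""
    try:
        value = float(a)
    except (TypeError, ValueError):
        # An empty or non-numeric string, e.g. from a modifier node such as f/u
        return "Not a value"
    if value < 7.1:
        return "Good"
    elif value < 10.1:
        return "Moderate"
    elif value >= 10.1:
        return "Poor"
    else:
        # Only reachable for NaN
        return "Not Sure"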
i = 0
output_array = []
while i < len(text_df):
    raw_text = text_df["Text"][i]
    remove_MIMIC_comments = re.sub(r"\[\*\*.*?\*\*\]", "", raw_text)
    remove_times = re.sub(r"\d{1,2}:\d{2}\s?P?A?\.?M\.?", "", remove_MIMIC_comments)
    cleaned_text = re.sub(r"\s{2,}", r" ", remove_times)
    
    context=markup_doc(cleaned_text)
    root = ElementTree.fromstring(context.getDocumentGraph().getXML())
    for node in root.findall('.//node'):
        phrase = node.find('.//phrase').text
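        # Remove the A1c mention first so that its "1" is not kept as part of the value
        tmp1 = re.sub(r"[Aa]1[Cc]", "", phrase)
        # Keep only digits and the decimal point
        A1c_Value = re.sub(r"[^\d.]", "", tmp1)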
        A1c_Flag = get_a1c_flag(A1c_Value)
        literal = node.find('.//literal').text
        Start = node.find('.//spanStart').text
        Stop = node.find('.//spanStop').text
        Node_ID = node.find('.//id').text
        category = node.find('.//category').text
        # node.find returns None when the tag is absent, so .text raises AttributeError
        try:
            modified_by = node.find('.//modifyingNode').text
        except AttributeError:
            modified_by = "None"
        try:
            modifying_category = node.find('.//modifyingCategory').text
        except AttributeError:
            modifying_category = "None"
        try:
            node_modified = node.find('.//modifiedNode').text
        except AttributeError:
            node_modified = "None"
        output_array.append([text_df["Identifier"][i], Start, Stop, phrase, literal, A1c_Value, A1c_Flag, Node_ID,
                             modifying_category, modified_by, node_modified])
    i += 1
            
#output_array

In [11]:
len(output_array)

20

In [12]:
type(output_array)

list

In [13]:
test_df = pd.DataFrame(output_array, columns=("Identifier", "Start", "Stop", "Phrase", "Annotation_Type", "A1c_Value", "A1c_Flag", "Node_ID", "Modifying_Category", "Modified_By", "Node_Modified"))
test_df.head() # New dataframe name

Unnamed: 0,Identifier,Start,Stop,Phrase,Annotation_Type,A1c_Value,A1c_Flag,Node_ID,Modifying_Category,Modified_By,Node_Modified
0,502100,4005,4008,f/u,f/u,,Not a value,264947198095629370295783696176431780671,,,264947492032112298216476168224487527231
1,502100,5048,5061,A1c in of 9.2,A1C_IN_OF,9.2,Moderate,264947492032112298216476168224487527231,['future_order_a1c'],264947198095629370295783696176431780671,
2,608452,125,132,A1c 6.3,A1C_COLON_OR_SPACE,6.3,Good,264956787872420096851206018736180450111,,,
3,608452,1413,1420,A1c 6.3,A1C_COLON_OR_SPACE,6.3,Good,264956829071064604268661567379034624831,,,
4,472998,2143,2150,A1c 5.5,A1C_COLON_OR_SPACE,5.5,Good,264960922790221716306985025794948485951,,,


I had put in f/u as a modifier so that there would be a modifier file, but all of my targets require a value. In other words, if a note has f/u together with a value, it is either a typing mistake or it refers to something else. As such, I can get rid of all of the modifier rows and of the columns for node ID, modifying category, modified by, and node modified.

In [14]:
modifier_columns = test_df[test_df["Node_Modified"]!="None"]
modifier_columns

Unnamed: 0,Identifier,Start,Stop,Phrase,Annotation_Type,A1c_Value,A1c_Flag,Node_ID,Modifying_Category,Modified_By,Node_Modified
0,502100,4005,4008,f/u,f/u,,Not a value,264947198095629370295783696176431780671,,,264947492032112298216476168224487527231
9,558451,3115,3118,f/u,f/u,,Not a value,264975419167116951252834516531541463871,,,264975494433871339803955230398294283071


In [15]:
modifier_columns = test_df[test_df["Node_Modified"]!="None"]
A1c_Value_Results = test_df[["Identifier", "Start", "Stop", "Phrase", "Annotation_Type", "A1c_Value", "A1c_Flag"]].drop(modifier_columns.index, axis = 0)
len(A1c_Value_Results)

18

In [16]:
A1c_Value_Results.head()

Unnamed: 0,Identifier,Start,Stop,Phrase,Annotation_Type,A1c_Value,A1c_Flag
1,502100,5048,5061,A1c in of 9.2,A1C_IN_OF,9.2,Moderate
2,608452,125,132,A1c 6.3,A1C_COLON_OR_SPACE,6.3,Good
3,608452,1413,1420,A1c 6.3,A1C_COLON_OR_SPACE,6.3,Good
4,472998,2143,2150,A1c 5.5,A1C_COLON_OR_SPACE,5.5,Good
5,607400,876,884,A1c: 6.9,A1C_COLON_OR_SPACE,6.9,Good


In [17]:
A1c_Value_Results["A1c_Value"] = A1c_Value_Results["A1c_Value"].apply(pd.to_numeric)
# Keep only the row with the highest A1c value for each document
A1c_Value_Results = A1c_Value_Results.groupby("Identifier").apply(lambda x: x.loc[x.A1c_Value.idxmax()])
A1c_Value_Results.to_csv("Output_Files/A1c_Results_Test_Dataset.csv") # New output file names
modifier_columns.to_csv("Output_Files/Modifier_Columns_to_A1c_Test_Dataset.csv") # Not needed, but saved just in case

In [18]:
A1c_Value_Results.head()

Identifier,Identifier,Start,Stop,Phrase,Annotation_Type,A1c_Value,A1c_Flag
26293,26293,487,498,A1C was 5.7,A1C_IS_OR_WAS,5.7,Good
3010,3010,1143,1153,A1C of 6.7,A1C_IN_OF,6.7,Good
405628,405628,173,180,A1C:6.4,A1C_COLON_OR_SPACE,6.4,Good
410226,410226,560,567,A1c 7.0,A1C_COLON_OR_SPACE,7.0,Good
420089,420089,129,137,A1c: 6.2,A1C_COLON_OR_SPACE,6.2,Good


In [19]:
Manual_Results = pd.read_csv("Manual_Annotation_Results/Manual.Annotation.Test_Dataset.A1c.Results.csv", index_col = 1)

In [20]:
Manual_Results.head()

Identifier,File,Sentence,HbA1c
1028562,1028562.txt,No mentions of diabetes or HbA1c,
1032480,1032480.txt,No mentions of diabetes or HbA1c,
1037568,1037568.txt,No mentions of diabetes or HbA1c,
1129782,1129782.txt,No mentions of diabetes or HbA1c,
114118,114118.txt,No mentions of diabetes or HbA1c,


In [21]:
Manual_Results.dtypes

File         object
Sentence     object
HbA1c       float64
dtype: object

In [22]:
A1c_Value_Results.dtypes

Identifier          object
Start               object
Stop                object
Phrase              object
Annotation_Type     object
A1c_Value          float64
A1c_Flag            object
dtype: object

In [23]:
A1c_Value_Results["Identifier"] = A1c_Value_Results["Identifier"].apply(pd.to_numeric)

In [24]:
Merged_Manual_and_Machine = pd.merge(Manual_Results, A1c_Value_Results, on=['Identifier'], how = 'outer')
Merged_Manual_and_Machine.head(8)

Defaulting to column, but this will raise an ambiguity error in a future version
  exec(code_obj, self.user_global_ns, self.user_ns)


Unnamed: 0,Identifier,File,Sentence,HbA1c,Start,Stop,Phrase,Annotation_Type,A1c_Value,A1c_Flag
0,1028562,1028562.txt,No mentions of diabetes or HbA1c,,,,,,,
1,1032480,1032480.txt,No mentions of diabetes or HbA1c,,,,,,,
2,1037568,1037568.txt,No mentions of diabetes or HbA1c,,,,,,,
3,1129782,1129782.txt,No mentions of diabetes or HbA1c,,,,,,,
4,114118,114118.txt,No mentions of diabetes or HbA1c,,,,,,,
5,1156963,1156963.txt,No mentions of diabetes or HbA1c,,,,,,,
6,1185261,1185261.txt,No mentions of diabetes or HbA1c,,,,,,,
7,1245072,1245072.txt,insulin-dependent diabetes,,,,,,,


If both HbA1c and A1c_Value are NaN = True Negative
<br>If both HbA1c and A1c_Value are numbers (and they match) = True Positive
<br>If HbA1c is NaN and A1c_Value is a number = False Positive
<br>If HbA1c is a number and A1c_Value is NaN = False Negative
<br>If both columns give a number but the numbers don't match, label it Non_Matching_Values, since it does not fit cleanly into any of the four categories. Luckily, I don't think I have any of those.

In [25]:
def get_category(manual, machine):
    if math.isnan(manual):
        if math.isnan(machine):
            return "True_Negative"
        else:
            return "False_Positive"
    else:
        if math.isnan(machine):
            return "False_Negative"
        elif manual == machine:
            return "True_Positive"
        else:
            return "Non_Matching_Values"


Merged_Manual_and_Machine["Category"] = Merged_Manual_and_Machine.apply(lambda x: get_category(x["HbA1c"], x["A1c_Value"]), axis = 1)
#["HbA1c", "A1c_Value"].apply(get_category, axis = 1)

In [26]:
Merged_Manual_and_Machine.head()

Unnamed: 0,Identifier,File,Sentence,HbA1c,Start,Stop,Phrase,Annotation_Type,A1c_Value,A1c_Flag,Category
0,1028562,1028562.txt,No mentions of diabetes or HbA1c,,,,,,,,True_Negative
1,1032480,1032480.txt,No mentions of diabetes or HbA1c,,,,,,,,True_Negative
2,1037568,1037568.txt,No mentions of diabetes or HbA1c,,,,,,,,True_Negative
3,1129782,1129782.txt,No mentions of diabetes or HbA1c,,,,,,,,True_Negative
4,114118,114118.txt,No mentions of diabetes or HbA1c,,,,,,,,True_Negative


In [27]:
Merged_Manual_and_Machine.to_csv("Output_Files/Test_Dataset_A1c_Result_Comparison.csv")

In [28]:
results = Merged_Manual_and_Machine.groupby(["Category"]).size()
results

Category
True_Negative    58
True_Positive    17
dtype: int64

Again, the script never grabbed a wrong value: whenever it extracts a value, it is the correct one. I did not see any such cases in the dataset, but a phrase like "HbA1c last week of about 6.8%" would not be picked up. I have deleted all MIMIC-added comments, and the only case I saw in the training dataset with a time in the middle was a MIMIC comment. Once again, every metric comes out at 100%.
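
The clean-up applied before markup can be sketched on its own. The function name `clean_note` and the sample strings are hypothetical; the three patterns are copied from the loop earlier in this notebook. The first sample mimics a MIMIC placeholder sitting between the mention and the value.

```python
import re

def clean_note(raw_text):
    """Strip MIMIC placeholders and clock times, then collapse whitespace."""
    # MIMIC de-identification placeholders look like [** ... **].
    text = re.sub(r"\[\*\*.*?\*\*\]", "", raw_text)
    # Clock times such as "11:24 AM" (same pattern as in the extraction loop).
    text = re.sub(r"\d{1,2}:\d{2}\s?P?A?\.?M\.?", "", text)
    # Collapse the runs of whitespace left behind by the removals.
    return re.sub(r"\s{2,}", r" ", text)

# Hypothetical snippets, not taken from the dataset.
with_placeholder = clean_note("A1c [**2642-3-27**] 6.9")
with_time = clean_note("Seen at 11:24 AM   today")
```

Once the placeholder is gone, the mention and the value end up next to each other, which is what the target patterns expect.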

I only have two possible outcomes here: yes (the document has an A1c value) or no (it has no A1c value).
<br>Below, TP = True positives, TN = True Negatives, FP = False Positives, FN = False Negatives

Recall Yes A1c = TP/(TP+FN) = 17/(17+0) = 1
<br>Precision Yes A1c = TP/(TP+FP) = 17/(17+0) = 1
<br>F-Measure = 2xPrecisionxRecall/(Precision + Recall) = 2x1x1/(1+1) = 1

Accuracy = (TP + TN)/total = (58+17)/75 = 1
<br>NPV = TN/(FN+TN) = 58/(0+58) = 1
<br>PPV = TP/(TP+FP) = 17/(17+0) = 1
<br>Sens = TP/(TP+FN) = 17/(17+0) = 1
<br>Spec = TN/(FP+TN) = 58/(0+58) = 1
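
The arithmetic above can be checked with a small helper. This is a minimal sketch using the standard definitions (precision = TP/(TP+FP), recall = TP/(TP+FN)); the function name `binary_metrics` is hypothetical, and it does not guard against zero denominators.

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)   # also called PPV
    recall = tp / (tp + fn)      # also called sensitivity
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "npv": tn / (fn + tn),
        "specificity": tn / (fp + tn),
    }

# Counts reported for the test dataset in this notebook.
metrics = binary_metrics(tp=17, tn=58, fp=0, fn=0)
```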
