# Extract the Diabetes Mention from MIMIC-III dataset documents

In this notebook I develop a Python script for finding mentions of diabetes in text documents. I run it on a custom CSV file of phrases that I wrote myself, to make sure the script picks up everything it should.
 


First step: import pyConText and define the functions (taken from the pyConText GitHub page and modified with help from Jeff Ferraro) that do the actual text parsing.

In [1]:
import pyConTextNLP.pyConText as pyConText
# itemData has been rewritten, so that it can take relative local path, where you can redirect it to your customized yml files later
import os
import itemData
import re
import glob
import pandas as pd
from xml.etree import ElementTree
import math
reasonable_distance = 50 # Set the maximum distance that you want for modifiers 

In [2]:
os.listdir('Yaml_Files')

['A1c_modifiers.yml',
 'Diabetes_modifiers.tab_delimited.txt',
 'Diabetes_targets.yml',
 'A1c_targets.yml',
 'Diabetes_modifiers.yml']

In [3]:
my_targets=itemData.get_items('Yaml_Files/Diabetes_targets.yml')
my_modifiers=itemData.get_items('Yaml_Files/Diabetes_modifiers.yml')

The functions *markup_sentence* and *markup_doc* were both ones that we went over in the NLP lab.

In [4]:
## This one is the same, it just doesn't split it into sentences. 
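def markup_sentence(s, modifiers, targets, prune_inactive=True):
    """
    Mark up the text s with the given modifier and target items and return the ConTextMarkup.
    """
    markup = pyConText.ConTextMarkup()
    markup.setRawText(s)
    markup.cleanText()
    # Use the items passed in as arguments rather than the module-level globals
    markup.markItems(modifiers, mode="modifier")
    markup.markItems(targets, mode="target")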
    markup.pruneMarks()
    markup.dropMarks('Exclusion')
    # apply modifiers to any targets within the modifiers scope
    markup.applyModifiers()
    markup.pruneSelfModifyingRelationships()
    if prune_inactive:
        markup.dropInactiveModifiers()
    return markup

def markup_doc(doc_text:str)->pyConText.ConTextDocument:
    rslts=[]
    context = pyConText.ConTextDocument()
    #for s in doc_text.split('.'):
    m = markup_sentence(doc_text, modifiers=my_modifiers, targets=my_targets)
    rslts.append(m)

    for r in rslts:
        context.addMarkup(r)
    return context

def get_output(something):
    context=markup_doc(something)
    output = context.getDocumentGraph()
    return output

OK, I have figured out how to pull the same fields out of every node. I can put these into lists, add the lists to a dataframe, transpose the dataframe, and then have something to work with. The next step is reading in the documents and figuring out how to apply this to each of them.
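As a minimal sketch of the node-to-row idea, the snippet below parses a small hand-written XML string shaped like the pyConText output shown later in this notebook. The sample XML, the `<modifies>` wrapper tag, and the helper names `field` and `nodes_to_rows` are all assumptions made for illustration. Unlike the notebook's own cells, this sketch strips the padding spaces around each value, so the padded comparisons used later (such as `" NOT "`) would need adjusting if it were swapped in.

```python
from xml.etree import ElementTree

# Hand-written stand-in for pyConText's XML output; the <modifies> wrapper is a guess.
SAMPLE_XML = """
<ConTextMarkup>
  <node>
    <category> target </category>
    <tagObject>
      <id> 1 </id><phrase> DMII </phrase><literal> DMII </literal>
      <spanStart> 0 </spanStart><spanStop> 4 </spanStop>
    </tagObject>
    <modified_by><modifyingNode> 2 </modifyingNode></modified_by>
  </node>
  <node>
    <category> modifier </category>
    <tagObject>
      <id> 2 </id><phrase> not </phrase><literal> NOT </literal>
      <spanStart> 9 </spanStart><spanStop> 12 </spanStop>
    </tagObject>
    <modifies><modifiedNode> 1 </modifiedNode></modifies>
  </node>
</ConTextMarkup>
"""


def field(node, path, default="None"):
    """Return the stripped text found at path, or default when the element is absent."""
    found = node.find(path)
    if found is None or found.text is None:
        return default
    return found.text.strip()


def nodes_to_rows(xml_text):
    """Turn every <node> element into one flat dictionary."""
    root = ElementTree.fromstring(xml_text.strip())
    rows = []
    for node in root.findall(".//node"):
        rows.append({
            "Start": int(field(node, ".//spanStart")),
            "Stop": int(field(node, ".//spanStop")),
            "Phrase": field(node, ".//phrase"),
            "Annotation_Type": field(node, ".//literal"),
            "Node_ID": field(node, ".//id"),
            "Modified_By": field(node, ".//modifyingNode"),
            "Node_Modified": field(node, ".//modifiedNode"),
        })
    return rows


rows = nodes_to_rows(SAMPLE_XML)
for row in rows:
    print(row)
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)`, which gives one row per node without a transpose step.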

In [5]:
os.listdir('Text_Files/')

['.DS_Store',
 'Training_Dataset',
 'test_files.txt',
 'Testing_Dataset',
 'list_of_Files.txt']

In [6]:
tmp_text = "DMII but not crazy t2dm"
from xml.etree import ElementTree
context=markup_doc(tmp_text)
print(context.getXML())


<ConTextDocument>
DMII but not crazy t2dm <section>
<sectionLabel> document </sectionLabel>
<sentence>
<sentenceNumber> 0 </sentenceNumber>
<sentenceOffset> 0 </sentenceOffset></sentence>

<ConTextMarkup>
<rawText> DMII but not crazy t2dm </rawText>
<cleanText> DMII but not crazy t2dm </cleanText>
<nodes>

<node>
    <category> target </category>

<tagObject>
<id> 59383491076813773259727402539711746879 </id>
<phrase> DMII </phrase>
<literal> DMII </literal>
<category> ['diabetes_type_2'] </category>
<spanStart> 0 </spanStart>
<spanStop> 4 </spanStop>
<scopeStart> 0 </scopeStart>
<scopeStop> 23 </scopeStop>
</tagObject>
<modified_by>
<modifyingNode> 59383055321919944805870638047984898879 </modifyingNode>
<modifyingCategory> ['diabetes_negated'] </modifyingCategory>
</modified_by>

</node>

<node>
    <category> modifier </category>

<tagObject>
<id> 59383055321919944805870638047984898879 </id>
<phrase> not </phrase>
<literal> NOT </literal>
<category> ['diabetes_negated'] </category>
<

In [39]:
diabetes_phrases = pd.read_csv("Diabetes.Phrases.For.Training.csv", index_col=False)
# I have manually typed up 37 phrases that I want to test with the system to see what kind of annotations I get.
# Many of these aren't in my dataset, but they will be in a larger dataset if it is tested.
diabetes_phrases.head()

Unnamed: 0,Line,Text
0,1,Diabetes Mellitus
1,2,Diabetes
2,3,Diabetic
3,4,DM
4,5,Insulin dependent diabetes mellitus


In [8]:
diabetes_phrases["Text"][28]

'IDDM'

In [9]:
# output_array = []
# raw_text = diabetes_phrases["Text"][28]
# remove_MIMIC_comments = re.sub(r"\[\*\*.*?\*\*\]", "", raw_text)
# remove_times = re.sub(r"\d{1,2}:\d{2}\s?P?A?\.?M\.?", "", remove_MIMIC_comments)
# cleaned_text = re.sub(r"\s{2,}", r" ", remove_times)
    
# context=markup_doc(cleaned_text)
# root = ElementTree.fromstring(context.getDocumentGraph().getXML())
# for node in root.findall('.//node'):
#     phrase = node.find('.//phrase').text
#     tmp1 =  re.sub(r"[A|a]1[C|c]", "", phrase)
#         #A1c_Value = re.sub(r"[^\d{1,2}\.?\d{0,1}]", "", tmp1)
#         #A1c_Flag = get_a1c_flag(A1c_Value)
#     literal = node.find('.//literal').text
#     Start = node.find('.//spanStart').text
#     Stop = node.find('.//spanStop').text
#     Node_ID = node.find('.//id').text
#     category = node.find('./category').text #This picks up target or modifier, not useful
#     try:
#         modified_by = node.find('.//modifyingNode').text
#     except:
#         modified_by = "None"
#     try:
#         modifying_category = node.find('.//modifyingCategory').text
#     except:
#         modifying_category = "None"
#     try:
#         node_modified = node.find('.//modifiedNode').text
#     except:
#         node_modified = "None"
#     output_array.append([diabetes_phrases["Line"][28], Start, Stop, phrase, literal, Node_ID, modifying_category, modified_by, node_modified])

In [10]:
# output_array

In [11]:
i = 0
output_array = []
while i < len(diabetes_phrases):
    raw_text = diabetes_phrases["Text"][i]
    remove_MIMIC_comments = re.sub(r"\[\*\*.*?\*\*\]", "", raw_text)
    remove_times = re.sub(r"\d{1,2}:\d{2}\s?P?A?\.?M\.?", "", remove_MIMIC_comments)
    cleaned_text = re.sub(r"\s{2,}", r" ", remove_times)
    
    context=markup_doc(cleaned_text)
    root = ElementTree.fromstring(context.getDocumentGraph().getXML())
    for node in root.findall('.//node'):
        phrase = node.find('.//phrase').text
        #tmp1 =  re.sub(r"[A|a]1[C|c]", "", phrase)
        #A1c_Value = re.sub(r"[^\d{1,2}\.?\d{0,1}]", "", tmp1)
        #A1c_Flag = get_a1c_flag(A1c_Value)
        literal = node.find('.//literal').text
        Start = node.find('.//spanStart').text
        Stop = node.find('.//spanStop').text
        Node_ID = node.find('.//id').text
        category = node.find('./category').text #This picks up target or modifier, not useful
        # find() returns None when the element is missing, so .text raises AttributeError
        try:
            modified_by = node.find('.//modifyingNode').text
        except AttributeError:
            modified_by = "None"
        try:
            modifying_category = node.find('.//modifyingCategory').text
        except AttributeError:
            modifying_category = "None"
        try:
            node_modified = node.find('.//modifiedNode').text
        except AttributeError:
            node_modified = "None"
        output_array.append([diabetes_phrases["Line"][i], Start, Stop, phrase, literal, Node_ID,
                             modifying_category, modified_by, node_modified])
    i += 1
            
#output_array

In [12]:
output_array
train_df = pd.DataFrame(output_array, columns=("Identifier", "Start", "Stop", "Phrase", "Annotation_Type", "Node_ID", "Modifying_Category", "Modified_By", "Node_Modified"))
train_df

Unnamed: 0,Identifier,Start,Stop,Phrase,Annotation_Type,Node_ID,Modifying_Category,Modified_By,Node_Modified
0,1,0,17,Diabetes Mellitus,DIABETES_OR_DIABETIC,59432374853085074356022619157069058879,,,
1,2,0,8,Diabetes,DIABETES_OR_DIABETIC,59433083152857951879200705439985062719,,,
2,3,0,8,Diabetic,DIABETES_OR_DIABETIC,59433459486629894634804274773749158719,,,
3,4,0,2,DM,DM,59433888903270721947514031781959979839,,,
4,5,0,17,Insulin dependent,DIABETES_TYPE_1,59434210569610529860724661570398343999,,,59434263652479414417830849244845069119
5,5,18,35,diabetes mellitus,DIABETES_OR_DIABETIC,59434263652479414417830849244845069119,['diabetes_type_1'],59434210569610529860724661570398343999,
6,6,0,17,Diabetes Mellitus,DIABETES_OR_DIABETIC,59434990174729670221806582042869650239,['diabetes_type_1'],59434806365392637128543365020904870719,
7,6,19,25,type 1,DIABETES_TYPE_1,59434806365392637128543365020904870719,,,59434990174729670221806582042869650239
8,7,0,17,Diabetes Mellitus,DIABETES_OR_DIABETIC,59435674706053793465683390262600553279,['diabetes_type_1'],59435619246340033480647074781835318079,
9,7,18,24,Type I,DIABETES_TYPE_1,59435619246340033480647074781835318079,,,59435674706053793465683390262600553279


In [13]:
# modifier_columns = train_df[train_df["Node_Modified"]!="None"]
# modifier_columns
modifier_columns = train_df[train_df["Node_Modified"]!="None"]
Diabetes_Results = train_df.drop(modifier_columns.index, axis = 0)
print(len(Diabetes_Results))
print(len(modifier_columns))

42
37


In [14]:
modifier_columns

Unnamed: 0,Identifier,Start,Stop,Phrase,Annotation_Type,Node_ID,Modifying_Category,Modified_By,Node_Modified
4,5,0,17,Insulin dependent,DIABETES_TYPE_1,59434210569610529860724661570398343999,,,59434263652479414417830849244845069119
7,6,19,25,type 1,DIABETES_TYPE_1,59434806365392637128543365020904870719,,,59434990174729670221806582042869650239
9,7,18,24,Type I,DIABETES_TYPE_1,59435619246340033480647074781835318079,,,59435674706053793465683390262600553279
10,8,0,6,Type 1,DIABETES_TYPE_1,59436333092084287002328792612827845439,,,59436386174953171559434980287274570559
14,11,0,6,Type 1,DIABETES_TYPE_1,59438050758647596253167820645671129919,,,59438109387487856808777639868194378559
17,12,3,9,Type 1,DIABETES_TYPE_1,59438682999384460082581817126394811199,,,59438736874534969782331380736281039679
19,13,18,24,Type 1,DIABETES_TYPE_1,59439246311619936502022107223881700159,,,59439298602207195916484918962888921919
21,14,11,18,Insulin,DIABETES_TYPE_1,59440009278824948867593133052123435839,,,59440092468395588845147606273271288639
23,14,37,43,Type 1,DIABETES_TYPE_1,59440029085865577433677531438111019839,,,59440092468395588845147606273271288639
25,15,18,25,Type II,DIABETES_TYPE_2,59441333973702187367317697106973053759,,,59441377549191570212703373556145738559


In [15]:
Diabetes_Results

Unnamed: 0,Identifier,Start,Stop,Phrase,Annotation_Type,Node_ID,Modifying_Category,Modified_By,Node_Modified
0,1,0,17,Diabetes Mellitus,DIABETES_OR_DIABETIC,59432374853085074356022619157069058879,,,
1,2,0,8,Diabetes,DIABETES_OR_DIABETIC,59433083152857951879200705439985062719,,,
2,3,0,8,Diabetic,DIABETES_OR_DIABETIC,59433459486629894634804274773749158719,,,
3,4,0,2,DM,DM,59433888903270721947514031781959979839,,,
5,5,18,35,diabetes mellitus,DIABETES_OR_DIABETIC,59434263652479414417830849244845069119,['diabetes_type_1'],59434210569610529860724661570398343999,
6,6,0,17,Diabetes Mellitus,DIABETES_OR_DIABETIC,59434990174729670221806582042869650239,['diabetes_type_1'],59434806365392637128543365020904870719,
8,7,0,17,Diabetes Mellitus,DIABETES_OR_DIABETIC,59435674706053793465683390262600553279,['diabetes_type_1'],59435619246340033480647074781835318079,
11,8,7,15,Diabetes,DIABETES_OR_DIABETIC,59436386174953171559434980287274570559,['diabetes_type_1'],59436333092084287002328792612827845439,
12,9,0,3,DM1,DM1,59437005739184033106554961800966198079,,,
13,10,0,3,DMI,DMI,59437666502059402071130491957512000319,,,


Testing to see if this might work for limiting distance

In [16]:
node_locations = Diabetes_Results[["Start", "Stop", "Node_ID"]]

In [17]:
node_locations.head()

Unnamed: 0,Start,Stop,Node_ID
0,0,17,59432374853085074356022619157069058879
1,0,8,59433083152857951879200705439985062719
2,0,8,59433459486629894634804274773749158719
3,0,2,59433888903270721947514031781959979839
5,18,35,59434263652479414417830849244845069119


In [18]:
# Assign the renamed frame instead of renaming in place: node_locations is a slice of
# Diabetes_Results, and in-place changes on a slice trigger pandas' SettingWithCopyWarning.
node_locations = node_locations.rename(columns={"Start":"Node_Start", "Stop":"Node_Stop", "Node_ID":"Node_Modified"})

# pd.merge(modifier_columns, node_locations, on=''


In [19]:
node_locations.head()

Unnamed: 0,Node_Start,Node_Stop,Node_Modified
0,0,17,59432374853085074356022619157069058879
1,0,8,59433083152857951879200705439985062719
2,0,8,59433459486629894634804274773749158719
3,0,2,59433888903270721947514031781959979839
5,18,35,59434263652479414417830849244845069119


In [20]:
modifier_columns = pd.merge(modifier_columns, node_locations, on='Node_Modified') #, how='right'
modifier_columns = modifier_columns[pd.notnull(modifier_columns['Identifier'])] # Drop the ones that weren't modifier nodes

In [21]:
modifier_columns

Unnamed: 0,Identifier,Start,Stop,Phrase,Annotation_Type,Node_ID,Modifying_Category,Modified_By,Node_Modified,Node_Start,Node_Stop
0,5,0,17,Insulin dependent,DIABETES_TYPE_1,59434210569610529860724661570398343999,,,59434263652479414417830849244845069119,18,35
1,6,19,25,type 1,DIABETES_TYPE_1,59434806365392637128543365020904870719,,,59434990174729670221806582042869650239,0,17
2,7,18,24,Type I,DIABETES_TYPE_1,59435619246340033480647074781835318079,,,59435674706053793465683390262600553279,0,17
3,8,0,6,Type 1,DIABETES_TYPE_1,59436333092084287002328792612827845439,,,59436386174953171559434980287274570559,7,15
4,11,0,6,Type 1,DIABETES_TYPE_1,59438050758647596253167820645671129919,,,59438109387487856808777639868194378559,7,9
5,12,3,9,Type 1,DIABETES_TYPE_1,59438682999384460082581817126394811199,,,59438736874534969782331380736281039679,0,2
6,13,18,24,Type 1,DIABETES_TYPE_1,59439246311619936502022107223881700159,,,59439298602207195916484918962888921919,0,17
7,14,11,18,Insulin,DIABETES_TYPE_1,59440009278824948867593133052123435839,,,59440092468395588845147606273271288639,0,8
8,14,37,43,Type 1,DIABETES_TYPE_1,59440029085865577433677531438111019839,,,59440092468395588845147606273271288639,0,8
9,15,18,25,Type II,DIABETES_TYPE_2,59441333973702187367317697106973053759,,,59441377549191570212703373556145738559,0,17


In [22]:
modifier_columns["Distance"] = modifier_columns.apply(lambda x: max((int(x["Start"]) - int(x["Node_Stop"])), (int(x["Node_Start"])-int(x["Stop"]))), axis = 1)


In [23]:
modifier_columns

Unnamed: 0,Identifier,Start,Stop,Phrase,Annotation_Type,Node_ID,Modifying_Category,Modified_By,Node_Modified,Node_Start,Node_Stop,Distance
0,5,0,17,Insulin dependent,DIABETES_TYPE_1,59434210569610529860724661570398343999,,,59434263652479414417830849244845069119,18,35,1
1,6,19,25,type 1,DIABETES_TYPE_1,59434806365392637128543365020904870719,,,59434990174729670221806582042869650239,0,17,2
2,7,18,24,Type I,DIABETES_TYPE_1,59435619246340033480647074781835318079,,,59435674706053793465683390262600553279,0,17,1
3,8,0,6,Type 1,DIABETES_TYPE_1,59436333092084287002328792612827845439,,,59436386174953171559434980287274570559,7,15,1
4,11,0,6,Type 1,DIABETES_TYPE_1,59438050758647596253167820645671129919,,,59438109387487856808777639868194378559,7,9,1
5,12,3,9,Type 1,DIABETES_TYPE_1,59438682999384460082581817126394811199,,,59438736874534969782331380736281039679,0,2,1
6,13,18,24,Type 1,DIABETES_TYPE_1,59439246311619936502022107223881700159,,,59439298602207195916484918962888921919,0,17,1
7,14,11,18,Insulin,DIABETES_TYPE_1,59440009278824948867593133052123435839,,,59440092468395588845147606273271288639,0,8,3
8,14,37,43,Type 1,DIABETES_TYPE_1,59440029085865577433677531438111019839,,,59440092468395588845147606273271288639,0,8,29
9,15,18,25,Type II,DIABETES_TYPE_2,59441333973702187367317697106973053759,,,59441377549191570212703373556145738559,0,17,1


In [24]:
#reasonable_distance = 50
modifier_columns = modifier_columns[modifier_columns["Distance"] <= reasonable_distance]

In [25]:
modifier_columns.tail()

Unnamed: 0,Identifier,Start,Stop,Phrase,Annotation_Type,Node_ID,Modifying_Category,Modified_By,Node_Modified,Node_Start,Node_Stop,Distance
27,38,0,6,Mother,DIABETES_IN_OTHER,59455448470854103559059986961725412159,,,59455690116749772065289647270773936959,43,51,37
28,38,10,13,not,NOT,59455642579852263506687091144403735359,,,59455690116749772065289647270773936959,43,51,30
29,38,17,21,risk,HYPOTHETICAL_DIABETES,59455588704701753806937527534517506879,,,59455690116749772065289647270773936959,43,51,22
30,38,36,42,type 2,DIABETES_TYPE_2,59455523737608492110180700828478231359,,,59455690116749772065289647270773936959,43,51,1
31,39,9,18,insipidus,INSIPIDUS,59457545640315856136076088070090806079,,,59457626453041620685700433484920148799,0,8,1


Now I only have the modifiers that are within a specified distance (50 characters) of the node they modify. 
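The distance used for this filter is the character gap between the modifier span and the span of the node it modifies. A minimal standalone sketch of that calculation follows; the function names `span_gap` and `within_distance` are made up for illustration, and the spans in the usage lines are taken from the tables above.

```python
def span_gap(start_a, stop_a, start_b, stop_b):
    """Character gap between two spans; zero or negative when they touch or overlap."""
    # Only one of the two differences can be positive for non-overlapping spans,
    # so taking the max picks the real gap whichever span comes first.
    return max(start_a - stop_b, start_b - stop_a)


def within_distance(modifier_span, target_span, limit=50):
    """True when the modifier is at most `limit` characters away from the target."""
    return span_gap(*modifier_span, *target_span) <= limit


# "type 1" at 19-25 modifying "Diabetes Mellitus" at 0-17
print(span_gap(19, 25, 0, 17))
# "Mother" at 0-6 modifying "diabetes" at 43-51
print(within_distance((0, 6), (43, 51)))
```

Because the order of the two spans does not matter, the same function covers modifiers that come before the target and modifiers that come after it.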

In [26]:
def get_negated(value):
    if value == " DENIES ":
        return "Negated_Diabetes"
    elif value == " NOT ":
        return "Negated_Diabetes"
    else:
        return ""
    
def get_other(value):
    if value == " DIABETES_IN_OTHER ":
        return "Diabetes_in_other"
    else:
        return ""
    
def get_type(value):
    if value == " DIABETES_TYPE_1 ":
        return "Diabetes_Type_1"
    elif value == " DIABETES_TYPE_2 ":
        return "Diabetes_Type_2"
    elif value == " DIABETES_GESTATIONAL ":
        return "Diabetes_Gestational"
    elif value == " INSIPIDUS ":
        return "Diabetes_Insipidus"
    else:
        return "No_Type"
    
    
    

def get_hypothetical(value):
    if value == " HYPOTHETICAL_DIABETES ":
        return "Diabetes_Hypothetical"
    else:
        return ""

# Take an explicit copy first: modifier_columns is a filtered slice, and adding columns
# to a slice triggers pandas' SettingWithCopyWarning.
modifier_columns = modifier_columns.copy()
modifier_columns["Negated"] = modifier_columns["Annotation_Type"].apply(get_negated)
modifier_columns["Diabetes_in_other"] = modifier_columns["Annotation_Type"].apply(get_other)
modifier_columns["Type"] = modifier_columns["Annotation_Type"].apply(get_type)
modifier_columns["Hypothetical"] = modifier_columns["Annotation_Type"].apply(get_hypothetical)


In [27]:
# def get_negated_start(anno_type, neg_start):
#     if anno_type == " DENIES " or anno_type == " NOT ":
#         return neg_start
#     else:
#         return 0

# def get_negated_stop(anno_type, neg_stop):
#     if anno_type == " DENIES " or anno_type == " NOT ":
#         return neg_stop
#     else:
#         return 0

# modifier_columns["Negated_Start"] = modifier_columns.apply(lambda x: get_negated_start(x["Annotation_Type"], x["Start"]), axis = 1)
# modifier_columns["Negated_Stop"] = modifier_columns.apply(lambda x: get_negated_start(x["Annotation_Type"], x["Stop"]), axis = 1)

# def get_in_other_start(anno_type, other_start):
#     if anno_type == " DIABETES_IN_OTHER ":
#         return other_start
#     else:
#         return 0

# def get_in_other_stop(anno_type, other_stop):
#     if anno_type == " DIABETES_IN_OTHER ":
#         return other_stop
#     else:
#         return 0

# modifier_columns["Other_Start"] = modifier_columns.apply(lambda x: get_in_other_start(x["Annotation_Type"], x["Start"]), axis = 1)
# modifier_columns["Other_Stop"] = modifier_columns.apply(lambda x: get_in_other_stop(x["Annotation_Type"], x["Stop"]), axis = 1)

# def get_hypothetical_start(anno_type, hypo_start):
#     if anno_type == " HYPOTHETICAL_DIABETES ":
#         return hypo_start
#     else:
#         return 0

# def get_hypothetical_stop(anno_type, hypo_stop):
#     if anno_type == " HYPOTHETICAL_DIABETES ":
#         return hypo_stop
#     else:
#         return 0

# modifier_columns["Hypo_Start"] = modifier_columns.apply(lambda x: get_hypothetical_start(x["Annotation_Type"], x["Start"]), axis = 1)
# modifier_columns["Hypo_Stop"] = modifier_columns.apply(lambda x: get_hypothetical_stop(x["Annotation_Type"], x["Stop"]), axis = 1)


# def get_type_start(anno_type, type_start):
#     if anno_type == " DIABETES_TYPE_1 " or anno_type == " DIABETES_TYPE_2 " or anno_type == " DIABETES_GESTATIONAL " or anno_type == " INSIPIDUS ":
#         return type_start
#     else:
#         return 0

# def get_type_stop(anno_type, type_stop):
#     if anno_type == " DIABETES_TYPE_1 " or anno_type == " DIABETES_TYPE_2 " or anno_type == " DIABETES_GESTATIONAL " or anno_type == " INSIPIDUS ":
#         return type_stop
#     else:
#         return 0

# modifier_columns["Type_Start"] = modifier_columns.apply(lambda x: get_type_start(x["Annotation_Type"], x["Start"]), axis = 1)
# modifier_columns["Type_Stop"] = modifier_columns.apply(lambda x: get_type_stop(x["Annotation_Type"], x["Stop"]), axis = 1)


In [28]:
modifier_columns

Unnamed: 0,Identifier,Start,Stop,Phrase,Annotation_Type,Node_ID,Modifying_Category,Modified_By,Node_Modified,Node_Start,Node_Stop,Distance,Negated,Diabetes_in_other,Type,Hypothetical
0,5,0,17,Insulin dependent,DIABETES_TYPE_1,59434210569610529860724661570398343999,,,59434263652479414417830849244845069119,18,35,1,,,Diabetes_Type_1,
1,6,19,25,type 1,DIABETES_TYPE_1,59434806365392637128543365020904870719,,,59434990174729670221806582042869650239,0,17,2,,,Diabetes_Type_1,
2,7,18,24,Type I,DIABETES_TYPE_1,59435619246340033480647074781835318079,,,59435674706053793465683390262600553279,0,17,1,,,Diabetes_Type_1,
3,8,0,6,Type 1,DIABETES_TYPE_1,59436333092084287002328792612827845439,,,59436386174953171559434980287274570559,7,15,1,,,Diabetes_Type_1,
4,11,0,6,Type 1,DIABETES_TYPE_1,59438050758647596253167820645671129919,,,59438109387487856808777639868194378559,7,9,1,,,Diabetes_Type_1,
5,12,3,9,Type 1,DIABETES_TYPE_1,59438682999384460082581817126394811199,,,59438736874534969782331380736281039679,0,2,1,,,Diabetes_Type_1,
6,13,18,24,Type 1,DIABETES_TYPE_1,59439246311619936502022107223881700159,,,59439298602207195916484918962888921919,0,17,1,,,Diabetes_Type_1,
7,14,11,18,Insulin,DIABETES_TYPE_1,59440009278824948867593133052123435839,,,59440092468395588845147606273271288639,0,8,3,,,Diabetes_Type_1,
8,14,37,43,Type 1,DIABETES_TYPE_1,59440029085865577433677531438111019839,,,59440092468395588845147606273271288639,0,8,29,,,Diabetes_Type_1,
9,15,18,25,Type II,DIABETES_TYPE_2,59441333973702187367317697106973053759,,,59441377549191570212703373556145738559,0,17,1,,,Diabetes_Type_2,


In [29]:
def max_len(s):
    return max(s, key=len)
def max_val(s):
    return max(s, key=int)
subset = modifier_columns.groupby("Node_Modified").agg({'Diabetes_in_other': max_len, "Hypothetical": max_len, "Negated": max_len, "Type": max_len, "Distance": max_val})
#                                                        'Negated_Start' : max_val, 'Negated_Stop': max_val, "Other_Start": max_val,
#                                                        "Other_Stop":max_val, "Hypo_Start":max_val, "Hypo_Stop":max_val,
#                                                        "Type_Start":max_val, "Type_Stop":max_val})

In [30]:
Diabetes_Results

Unnamed: 0,Identifier,Start,Stop,Phrase,Annotation_Type,Node_ID,Modifying_Category,Modified_By,Node_Modified
0,1,0,17,Diabetes Mellitus,DIABETES_OR_DIABETIC,59432374853085074356022619157069058879,,,
1,2,0,8,Diabetes,DIABETES_OR_DIABETIC,59433083152857951879200705439985062719,,,
2,3,0,8,Diabetic,DIABETES_OR_DIABETIC,59433459486629894634804274773749158719,,,
3,4,0,2,DM,DM,59433888903270721947514031781959979839,,,
5,5,18,35,diabetes mellitus,DIABETES_OR_DIABETIC,59434263652479414417830849244845069119,['diabetes_type_1'],59434210569610529860724661570398343999,
6,6,0,17,Diabetes Mellitus,DIABETES_OR_DIABETIC,59434990174729670221806582042869650239,['diabetes_type_1'],59434806365392637128543365020904870719,
8,7,0,17,Diabetes Mellitus,DIABETES_OR_DIABETIC,59435674706053793465683390262600553279,['diabetes_type_1'],59435619246340033480647074781835318079,
11,8,7,15,Diabetes,DIABETES_OR_DIABETIC,59436386174953171559434980287274570559,['diabetes_type_1'],59436333092084287002328792612827845439,
12,9,0,3,DM1,DM1,59437005739184033106554961800966198079,,,
13,10,0,3,DMI,DMI,59437666502059402071130491957512000319,,,


In [31]:
subset = subset.reset_index()
subset.rename(columns={"Node_Modified":"Node_ID"}, inplace = True)

In [32]:
subset

Unnamed: 0,Node_ID,Diabetes_in_other,Hypothetical,Negated,Type,Distance
0,59434263652479414417830849244845069119,,,,Diabetes_Type_1,1
1,59434990174729670221806582042869650239,,,,Diabetes_Type_1,2
2,59435674706053793465683390262600553279,,,,Diabetes_Type_1,1
3,59436386174953171559434980287274570559,,,,Diabetes_Type_1,1
4,59438109387487856808777639868194378559,,,,Diabetes_Type_1,1
5,59438736874534969782331380736281039679,,,,Diabetes_Type_1,1
6,59439298602207195916484918962888921919,,,,Diabetes_Type_1,1
7,59440092468395588845147606273271288639,,,,Diabetes_Type_1,29
8,59441377549191570212703373556145738559,,,,Diabetes_Type_2,1
9,59442707790040184710931569159071879999,,,,Diabetes_Type_2,1


In [33]:
Final_table = pd.merge(Diabetes_Results, subset, on='Node_ID', how = "left") # how='right'
Final_table

Unnamed: 0,Identifier,Start,Stop,Phrase,Annotation_Type,Node_ID,Modifying_Category,Modified_By,Node_Modified,Diabetes_in_other,Hypothetical,Negated,Type,Distance
0,1,0,17,Diabetes Mellitus,DIABETES_OR_DIABETIC,59432374853085074356022619157069058879,,,,,,,,
1,2,0,8,Diabetes,DIABETES_OR_DIABETIC,59433083152857951879200705439985062719,,,,,,,,
2,3,0,8,Diabetic,DIABETES_OR_DIABETIC,59433459486629894634804274773749158719,,,,,,,,
3,4,0,2,DM,DM,59433888903270721947514031781959979839,,,,,,,,
4,5,18,35,diabetes mellitus,DIABETES_OR_DIABETIC,59434263652479414417830849244845069119,['diabetes_type_1'],59434210569610529860724661570398343999,,,,,Diabetes_Type_1,1.0
5,6,0,17,Diabetes Mellitus,DIABETES_OR_DIABETIC,59434990174729670221806582042869650239,['diabetes_type_1'],59434806365392637128543365020904870719,,,,,Diabetes_Type_1,2.0
6,7,0,17,Diabetes Mellitus,DIABETES_OR_DIABETIC,59435674706053793465683390262600553279,['diabetes_type_1'],59435619246340033480647074781835318079,,,,,Diabetes_Type_1,1.0
7,8,7,15,Diabetes,DIABETES_OR_DIABETIC,59436386174953171559434980287274570559,['diabetes_type_1'],59436333092084287002328792612827845439,,,,,Diabetes_Type_1,1.0
8,9,0,3,DM1,DM1,59437005739184033106554961800966198079,,,,,,,,
9,10,0,3,DMI,DMI,59437666502059402071130491957512000319,,,,,,,,


In [34]:
# Final_table["Distance_Type"] = Final_table.apply(lambda x: max((int(x["Type_Start"]) - int(x["Stop"])), (int(x["Start"])-int(x["Type_Stop"]))) if x["Type"] != "No_Type" else 0, axis = 1)
# Final_table["Distance_Negated"] = Final_table.apply(lambda x: max((int(x["Negated_Start"]) - int(x["Stop"])), (int(x["Start"])-int(x["Negated_Stop"]))) if x["Negated"] == "Negated_Diabetes" else 0, axis = 1)
# Final_table["Distance_Hypo"] = Final_table.apply(lambda x: max((int(x["Hypo_Start"]) - int(x["Stop"])), (int(x["Start"])-int(x["Hypo_Stop"]))) if x["Hypothetical"] == "Diabetes_Hypothetical" else 0, axis = 1)
# Final_table["Distance_Other"] = Final_table.apply(lambda x: max((int(x["Other_Start"]) - int(x["Stop"])), (int(x["Start"])-int(x["Other_Stop"]))) if x["Diabetes_in_other"] == "Diabetes_in_other" else 0, axis = 1)



In [35]:
Final_table

Unnamed: 0,Identifier,Start,Stop,Phrase,Annotation_Type,Node_ID,Modifying_Category,Modified_By,Node_Modified,Diabetes_in_other,Hypothetical,Negated,Type,Distance
0,1,0,17,Diabetes Mellitus,DIABETES_OR_DIABETIC,59432374853085074356022619157069058879,,,,,,,,
1,2,0,8,Diabetes,DIABETES_OR_DIABETIC,59433083152857951879200705439985062719,,,,,,,,
2,3,0,8,Diabetic,DIABETES_OR_DIABETIC,59433459486629894634804274773749158719,,,,,,,,
3,4,0,2,DM,DM,59433888903270721947514031781959979839,,,,,,,,
4,5,18,35,diabetes mellitus,DIABETES_OR_DIABETIC,59434263652479414417830849244845069119,['diabetes_type_1'],59434210569610529860724661570398343999,,,,,Diabetes_Type_1,1.0
5,6,0,17,Diabetes Mellitus,DIABETES_OR_DIABETIC,59434990174729670221806582042869650239,['diabetes_type_1'],59434806365392637128543365020904870719,,,,,Diabetes_Type_1,2.0
6,7,0,17,Diabetes Mellitus,DIABETES_OR_DIABETIC,59435674706053793465683390262600553279,['diabetes_type_1'],59435619246340033480647074781835318079,,,,,Diabetes_Type_1,1.0
7,8,7,15,Diabetes,DIABETES_OR_DIABETIC,59436386174953171559434980287274570559,['diabetes_type_1'],59436333092084287002328792612827845439,,,,,Diabetes_Type_1,1.0
8,9,0,3,DM1,DM1,59437005739184033106554961800966198079,,,,,,,,
9,10,0,3,DMI,DMI,59437666502059402071130491957512000319,,,,,,,,


In [36]:
def get_new_type(anno_type, modify_type):
    if anno_type == " DMII " or anno_type == " DM2 " or anno_type == " T2DM " or anno_type == " NIDDM ":
        return "Diabetes_Type_2"
    elif anno_type == " DMI " or anno_type == " DM1 " or anno_type == " T1DM " or anno_type == " IDDM ":
        return "Diabetes_Type_1"
    elif anno_type == " GDM ":
        return "Diabetes_Gestational"
    else:
        if modify_type == "Diabetes_Type_1":
            return "Diabetes_Type_1"
        elif modify_type == "Diabetes_Type_2":
            return "Diabetes_Type_2"
        elif modify_type == "Diabetes_Gestational":
            return "Diabetes_Gestational"
        elif modify_type == "Diabetes_Insipidus":
            return "Diabetes_Insipidus"
        else:
            return "Diabetes_Type_Not_Specified"
        
        

Final_table["Diabetes_Type"] = Final_table.apply(lambda x: get_new_type(x["Annotation_Type"], x["Type"]), axis = 1)


In [37]:
       

Final_table["Diabetes_Negated"] = Final_table.apply(lambda x: "Negated_Diabetes" if x["Negated"] == "Negated_Diabetes" else None, axis = 1)
Final_table["Diabetes_Hypothetical"] = Final_table.apply(lambda x: "Diabetes_Hypothetical" if x["Hypothetical"] == "Diabetes_Hypothetical" else None, axis = 1)
Final_table["Diabetes_In_Other_Person"] = Final_table.apply(lambda x: "Diabetes_in_other" if x["Diabetes_in_other"] == "Diabetes_in_other" else None, axis = 1)


Final_table.tail()

Unnamed: 0,Identifier,Start,Stop,Phrase,Annotation_Type,Node_ID,Modifying_Category,Modified_By,Node_Modified,Diabetes_in_other,Hypothetical,Negated,Type,Distance,Diabetes_Type,Diabetes_Negated,Diabetes_Hypothetical,Diabetes_In_Other_Person
37,37,17,25,diabetic,DIABETES_OR_DIABETIC,59454511993973184954589631272232440639,['hypothetical_diabetes'],59454466833920551823917202952180749119,,,Diabetes_Hypothetical,,No_Type,1.0,Diabetes_Type_Not_Specified,,Diabetes_Hypothetical,
38,38,43,51,diabetes,DIABETES_OR_DIABETIC,59455690116749772065289647270773936959,['diabetes_in_other'],59455448470854103559059986961725412159,,Diabetes_in_other,Diabetes_Hypothetical,Negated_Diabetes,Diabetes_Type_2,37.0,Diabetes_Type_2,Negated_Diabetes,Diabetes_Hypothetical,Diabetes_in_other
39,39,0,8,Diabetes,DIABETES_OR_DIABETIC,59457626453041620685700433484920148799,['diabetes_insipidus'],59457545640315856136076088070090806079,,,,,Diabetes_Insipidus,1.0,Diabetes_Insipidus,,,
40,40,75,83,Diabetes,DIABETES_OR_DIABETIC,59458634235268802128074623363968422719,['diabetes_type_2'],59458566099049039860744292916171133759,,,,,,,Diabetes_Type_Not_Specified,,,
41,41,106,114,diabetes,DIABETES_OR_DIABETIC,59459353626984431648259972743037473599,['diabetes_in_other'],59459169025365773412353379785633190719,,,,,,,Diabetes_Type_Not_Specified,,,


In [38]:
Final_table.to_csv("tmp.csv")

Well, that took forever to figure out. What a pain. At least now I know which annotations are a "Mention of Diabetes" and whether each one is negated, hypothetical, a specified type, or about a person other than the patient. One of my examples above is "Mother is not at risk of developing Type 2 diabetes". It is now labeled Diabetes_in_other, Diabetes_Hypothetical, Negated_Diabetes, and Diabetes_Type_2.
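To make the combination of flags concrete, here is a compact sketch that collapses the modifier annotation types attached to one mention into a single summary. It is a simplified stand-in for the pandas steps above, not the notebook's actual code; the function name `summarize_modifiers` and the dictionary layout are assumptions, and it takes the first type it sees instead of the longest label.

```python
NEGATION_TYPES = {"DENIES", "NOT"}
TYPE_LABELS = {
    "DIABETES_TYPE_1": "Diabetes_Type_1",
    "DIABETES_TYPE_2": "Diabetes_Type_2",
    "DIABETES_GESTATIONAL": "Diabetes_Gestational",
    "INSIPIDUS": "Diabetes_Insipidus",
}


def summarize_modifiers(annotation_types):
    """Collapse the annotation types of all modifiers on one mention into flags."""
    # Strip padding so values read from the XML (e.g. " NOT ") match as well.
    cleaned = [value.strip() for value in annotation_types]
    types_found = [TYPE_LABELS[value] for value in cleaned if value in TYPE_LABELS]
    return {
        "Negated": any(value in NEGATION_TYPES for value in cleaned),
        "Hypothetical": "HYPOTHETICAL_DIABETES" in cleaned,
        "In_Other_Person": "DIABETES_IN_OTHER" in cleaned,
        "Type": types_found[0] if types_found else "Diabetes_Type_Not_Specified",
    }


# Modifiers found for "Mother is not at risk of developing Type 2 diabetes"
mother_example = summarize_modifiers(
    [" DIABETES_IN_OTHER ", " NOT ", " HYPOTHETICAL_DIABETES ", " DIABETES_TYPE_2 "]
)
print(mother_example)
```

A mention with no modifiers comes back with every flag false and the type left as not specified.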

One more problem: My program did not pick up my examples that said "NIDDM", "IDDM", or "t2dm". I need to double check my targets file. 

OK, now it is fixed. I had put NIDDM and IDDM in the modifiers file when I should not have, and I had not included t2dm or t1dm, but now they are included. With these changes the program gets everything right on the fake dataset that I made.

The only remaining problem is that when one mention has two modifiers of the same kind, the aggregation keeps the longest label, and when the labels tie in length it keeps the first one. This only matters for the type. For instance, for "type 1 diabetes type 2" it will take "Type 1". I guess it shouldn't matter, because no single mention should be given two types of diabetes.
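This behaviour follows from aggregating with Python's built-in `max` and `key=len`: an empty string or `"No_Type"` loses to any longer label, and when two labels have the same length `max` returns the first one it encounters. A small sketch, reusing the `max_len` helper defined earlier in the notebook on plain lists:

```python
def max_len(values):
    """Return the longest string; on a tie in length, max() keeps the first one."""
    return max(values, key=len)


# A real label beats the placeholder because it is longer
placeholder_vs_label = max_len(["No_Type", "Diabetes_Type_2"])

# Both type labels are 15 characters long, so the first one wins
tie_first = max_len(["Diabetes_Type_1", "Diabetes_Type_2"])
tie_second = max_len(["Diabetes_Type_2", "Diabetes_Type_1"])

print(placeholder_vs_label, tie_first, tie_second)
```

So which type survives a conflict depends on row order within the group, not on anything in the text itself.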

Additionally, if there are two mentions of diabetes in one document, this will return both of them along with the info about them. 

Now to try this on actual MIMIC data.

This missed a couple of the A1c values that it should not have. It turned out I missed anything with a dash between the mention of A1c and the result. I have added one entry to the targets file that should fix this:

Comments: ''
<br>Direction: ''
<br>Lex: A1C_DASH
<br>Regex: a1c-\s?\d{1,2}\.?\d{0,1}
<br>Type: HbA1c_Value

Luckily for me, there were no instances where a value was picked up but it was incorrect. 
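The dash pattern from the new targets entry can be checked on its own with Python's `re` module. This is only a sketch: the sample strings are invented, the helper name `find_a1c_dash` is hypothetical, and compiling with `re.IGNORECASE` is an assumption about how the pattern is applied inside pyConText.

```python
import re

# Regex copied from the A1C_DASH targets entry; case-insensitive matching is assumed
A1C_DASH = re.compile(r"a1c-\s?\d{1,2}\.?\d{0,1}", re.IGNORECASE)


def find_a1c_dash(text):
    """Return every 'A1c-<value>' match found in the text."""
    return [match.group(0) for match in A1C_DASH.finditer(text)]


print(find_a1c_dash("HbA1c-7.5 on admission"))
print(find_a1c_dash("a1c- 10"))
print(find_a1c_dash("A1c 7.5"))
```

The last call finds nothing because the pattern requires the dash, which is why it was added as a separate entry next to the existing A1c targets.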

Adding the new entry with a dash made it pick up all of the results. In fact, there are only true positives and true negatives. This means 100% for everything.

I only have two possibilities here (Yes, A1c value or no, no A1c value)
<br>Below, TP = True positives, TN = True Negatives, FP = False Positives, FN = False Negatives

Recall Yes A1c = TP/(TP+FN) = 22/(22+0) = 1
<br>Precision Yes A1c = TP/(TP+FP) = 22/(22+0) = 1
<br>F-Measure = 2 x Precision x Recall/(Precision + Recall) = 2x1x1/(1+1) = 1

Accuracy = (TP + TN)/total = (53+22)/75 = 1
<br>NPV = TN/(FN+TN) = 53/(0+53) = 1
<br>PPV = TP/(TP+FP) = 22/(22+0) = 1
<br>Sens = TP/(TP+FN) = 22/(22+0) = 1
<br>Spec = TN/(FP+TN) = 53/(0+53) = 1
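
The same figures can be reproduced from the four confusion-matrix counts. The sketch below uses the standard definitions and the counts reported above (TP = 22, TN = 53, FP = 0, FN = 0). The function name `binary_metrics` is made up for illustration, and it does not guard against a zero denominator.

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute the usual binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)  # identical to PPV
    recall = tp / (tp + fn)     # identical to sensitivity
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "npv": tn / (fn + tn),
        "specificity": tn / (fp + tn),
    }


metrics = binary_metrics(tp=22, tn=53, fp=0, fn=0)
for name, value in metrics.items():
    print(f"{name}: {value}")
```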
