# Processing Multisentence Documents

In [1]:
import pyConTextNLP.pyConTextGraph as pyConText
import pyConTextNLP.itemData as itemData
from textblob import TextBlob
import networkx as nx
import pyConTextNLP.display.html as html
from IPython.display import display, HTML
import os

In [2]:
user_domain = os.path.expanduser('~')

## Define ``markup_sentence``

We are putting the functionality we went through in the previous two notebooks ([BasicSentenceMarkup](./BasicSentenceMarkup.ipynb) and [BasicSentenceMarkupPart2](BasicSentenceMarkupPart2.ipynb)) into a function ``markup_sentence``. We add one step to the function: ``dropInactiveModifiers`` will delete any modifier node that does not get attached to a target node.

In [3]:
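def markup_sentence(s, modifiers, targets, prune_inactive=True):
    """Mark up a single sentence and return the resulting ConTextMarkup.

    s: the raw sentence text.
    modifiers, targets: itemData collections of modifier and target rules.
    prune_inactive: if True, drop modifiers that are not attached to any target.
    """
    markup = pyConText.ConTextMarkup()
    markup.setRawText(s)
    markup.cleanText()
    markup.markItems(modifiers, mode="modifier")
    markup.markItems(targets, mode="target")
    markup.pruneMarks()
    markup.dropMarks('Exclusion')
    # apply modifiers to any targets within the modifier's scope
    markup.applyModifiers()
    markup.pruneSelfModifyingRelationships()
    if prune_inactive:
        markup.dropInactiveModifiers()
    return markup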

### Test on a couple of single sentences

In [4]:
# You'll have to change these file paths so that they work on your computer
modifiers = itemData.instantiateFromCSVtoitemData("/Users/alec/Box Sync/Bucher_Surgical_MIMICIII/pyConText_implement/modifiers.tsv")
targets = itemData.instantiateFromCSVtoitemData(
    "file:///Users/alec/Box Sync/Bucher_Surgical_MIMICIII/pyConText_implement/targets.tsv")


sentences = ['please evaluate for fistula, perforation, loculated collection, fluid collections, fluid collection, abscess in the liver or abdomen, no contraindications for iv contrast ',
'There is no evidence of abscess or abscesses in the abdomen and it is without fluid collection.',
'There is a fluid collection and heterogeneous collection and heterogeneous collections in the abdomen.',
'There is a fluid collection and a collection of fluid.']

#example 1
markup1 = markup_sentence(sentences[0], modifiers, targets)

print(markup1.nodes(),'\n','_'*42,'\n')

#relationship between 'abdomen' and 'abscess'
print(markup1.isModifiedByCategory(markup1.nodes()[-1],'anatomy'))
#is an indication
print(markup1.isModifiedByCategory(markup1.nodes()[-1],'indication'))

#example 2
markup2 = markup_sentence(sentences[1],modifiers,targets)

#example 3
markup3 = markup_sentence(sentences[2], modifiers, targets)

#modified by anatomy
markup3.edges()

#example 4, a mention of fluid collection but without any modifiers
markup4 = markup_sentence(sentences[3],modifiers,targets)

#no relationship
markup4.edges()
markup4.nodes()[0]




[<id> 284301838825675093358783961289817775056 </id> <phrase>  abdomen </phrase> <category> ['anatomy'] </category> , <id> 284309041299472940101896920974426037200 </id> <phrase> abscess </phrase> <category> ['fluid_collection'] </category> , <id> 284308737063328885327816998459866012624 </id> <phrase> fluid collection </phrase> <category> ['fluid_collection'] </category> , <id> 284246182467615747919461748409206496208 </id> <phrase> liver </phrase> <category> ['anatomy'] </category> , <id> 284307422826569098708079612043210447824 </id> <phrase> evaluate for </phrase> <category> ['indication'] </category> , <id> 284309240320617175935705670083499449296 </id> <phrase> loculated collection </phrase> <category> ['fluid_collection'] </category> , <id> 284308715830181331503893940954495114192 </id> <phrase> fluid collections </phrase> <category> ['fluid_collection'] </category> ] 
 __________________________________________ 

True
True


<id> 284313017127124230916446255992848964560 </id> <phrase> collection of fluid </phrase> <category> ['fluid_collection'] </category> 

In [5]:
markups = [markup1, markup2, markup3,markup4]
#markups = [markup2]
context_short = pyConText.ConTextDocument()
for m in markups:
    context_short.addMarkup(m)

clrs = {\
    "anatomy": "orange",
    "fluid_collection": "green",
    "indication": "purple",
    "definite_negated_existence":"blue"
    }
display(HTML(html.mark_document_with_html(context_short,colors = clrs, default_color="black")))

#NOTE:this is not working with "without" or "there is no", so I'll have to work on that 





Looking at the above example, I would identify the following annotations:

**Fluid Collection, Definitive Existence**: 1, in sentence 3

**Indication**: 1, the entire first sentence. There are multiple targets there, but since we are annotating at a sentence level there should only be one.

**Negated**: 1, the second sentence.

Then there should be one left over: the final sentence, which does not have anatomy.

### Apply rules about definitive, negated and indication

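Perhaps there's a better way that you can suggest, but for now I've written this function to apply the rules of whether a mention of fluid collection is definitive, negated, an indication, or should be ignored (returned as ``ignored``):

1. A definitive mention of fluid collection must be modified by an anatomical modifier and must not be modified by a negated existence modifier or an indication modifier.
2. A negated mention must be modified by a negated existence modifier.
3. An indicated mention must be modified by an indication modifier and must not be modified by a negated existence modifier.
4. Any other mention, one with no negation, indication, or anatomy modifier, is ignored.

The checks run in the order negation, indication, anatomy, so negation takes precedence when a target has several modifiers.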
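
The order of these checks can be sketched without pyConText as a small function over the modifier categories attached to one target. The names `classify_mention` and `classify_sentence` are hypothetical and exist only for illustration; the category strings are the ones used in this notebook. The sketch checks negation first, then indication, then anatomy, and collapses a sentence's targets into a set of labels.

```python
def classify_mention(modifier_categories):
    """Classify one fluid-collection target from the categories of its modifiers.

    Hypothetical helper: it mirrors the order of the checks only and does not
    touch pyConText objects.
    """
    cats = set(modifier_categories)
    if "definite_negated_existence" in cats:
        return "negated"
    if "indication" in cats:
        return "indication"
    if "anatomy" in cats:
        return "definitive"
    return "ignored"


def classify_sentence(mentions):
    """Return the distinct labels found in a sentence (one entry per label)."""
    return {classify_mention(cats) for cats in mentions}


# A target modified by both anatomy and negation counts as negated.
print(classify_mention(["anatomy", "definite_negated_existence"]))
# A target modified only by anatomy is definitive.
print(classify_mention(["anatomy"]))
# Three targets in one sentence, all modified by anatomy: one label.
print(classify_sentence([["anatomy"], ["anatomy"], ["anatomy"]]))
```

Returning a set per sentence is what keeps a sentence with several targets from being counted more than once under the same label.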
        
 

In [6]:

def fluid_collection_classifier(document):
    """Takes a ConTextDocument and returns four sets of sentence markups:
    definite_evidence: markups classified as definitive evidence of fluid collection.
    negated_evidence: markups classified as negated evidence of fluid collection.
    indication: markups classified as an indication of fluid collection.
    ignored: markups that contain a target with no negation, indication,
        or anatomy modifier attached to it.
    A markup can appear in more than one set if its targets are classified differently."""
    definite_evidence = []
    negated_evidence = []
    indication = []
    ignored = []

    # TODO: what about different sections?
    markups = [m[1] for m in document.getSectionMarkups()]

    for m in markups:
        for tO in m.nodes():
            if tO.getCategory() == ['fluid_collection']:
                # negation takes precedence over the other modifiers
                if m.isModifiedByCategory(tO, "definite_negated_existence"):
                    negated_evidence.append(m)
                # then indication
                elif m.isModifiedByCategory(tO, "indication"):
                    indication.append(m)
                # anatomy on its own counts as definitive evidence
                elif m.isModifiedByCategory(tO, 'anatomy'):
                    definite_evidence.append(m)
                else:
                    ignored.append(m)
    # collapse multiple targets from the same sentence into a single markup
    definite_evidence = set(definite_evidence)
    negated_evidence = set(negated_evidence)
    indication = set(indication)
    ignored = set(ignored)
    print('definitive evidence:', len(definite_evidence))
    print('negated evidence:', len(negated_evidence))
    print('indication:', len(indication))
    print('ignored:', len(ignored))
    return definite_evidence, negated_evidence, indication, ignored

In [21]:
fluid_collection_classifier(context_short)[0]

definitive evidence: 1
negated evidence: 1
indication: 1
ignored: 1


{__________________________________________
 rawText: There is a fluid collection and heterogeneous collection and heterogeneous collections in the abdomen.
 cleanedText: There is a fluid collection and heterogeneous collection and heterogeneous collections in the abdomen.
 ********************************
 TARGET: <id> 67223191162141001461114722518375521232 </id> <phrase> fluid collection </phrase> <category> ['fluid_collection'] </category> 
 ----MODIFIED BY: <id> 67222766499189925003891422353729969104 </id> <phrase>  abdomen </phrase> <category> ['anatomy'] </category> 
 ********************************
 TARGET: <id> 67223269598021890583749910974030076880 </id> <phrase> heterogeneous collection </phrase> <category> ['fluid_collection'] </category> 
 ----MODIFIED BY: <id> 67222766499189925003891422353729969104 </id> <phrase>  abdomen </phrase> <category> ['anatomy'] </category> 
 ********************************
 TARGET: <id> 67223291148082094463170666007623031760 </id> <phrase> hete

## Test on an entire document

The function worked on the above example. Now let's try it on an actual note.

In [17]:
input_report = '/Users/alec/Box Sync/Bucher_Surgical_MIMICIII/Radiology_Annotation/Adjudication/Batch_3/corpus/Yes_28226_116465_05-29-93.txt'
report = ''
with open(input_report,'r') as f0:
    report += f0.read()
type(report)

str

In [18]:
# Here are the annotations from eHOST:
"""Fluid collection-definitive	
Fluid collection-definitive	
Fluid collection-definitive	
Fluid collection-definitive	
fluid collection-indication	
fluid collection-indication	
fluid collection-indication	
fluid collection-indication	
"""
print("Hand annotations \n \
4 definitive \n \
4 indications")

Hand annotations 
 4 definitive 
 4 indications


In [19]:
context = pyConText.ConTextDocument()

blob = TextBlob(report.lower())
print(type(blob.sentences))

<class 'list'>


In [20]:
# mark up each sentence, then add the markups to the document in order
rslts = []
for s in blob.sentences:
    m = markup_sentence(s.raw, modifiers=modifiers, targets=targets)
    rslts.append(m)

for r in rslts:
    context.addMarkup(r)

In [21]:
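# Collect the fluid-collection targets that are negated.
# getSectionNodes() appears to be what raises the KeyError shown below,
# so walk the sentence markups instead, as fluid_collection_classifier does.
negated_findings = []
for m in (sm[1] for sm in context.getSectionMarkups()):
    for node in m.nodes():
        if (node.getCategory() == ['fluid_collection']
                and m.isModifiedByCategory(node, 'definite_negated_existence')):
            negated_findings.append(node)
context.getSectionNodes()  # still fails; see the error below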

KeyError: '__sectionNumber'

In [9]:
clrs = {\
    "anatomy": "orange",
    "fluid_collection": "green",
    "indication": "purple",
    "definite_negated_existence":"blue"
    }
display(HTML(html.mark_document_with_html(context,colors = clrs, default_color="black")))

In [27]:
fluid_collection_classifier(context)




#should be:
#4 definitive 
#4 indications

definitive evidence: 3
negated evidence: 0
indication: 4
ignored: 2


({__________________________________________
  rawText: new areas of fluid collection are
   noted in the anterior abdominal wall.
  cleanedText: new areas of fluid collection are noted in the anterior abdominal wall.
  ********************************
  TARGET: <id> 108749154938761762982416865889713839056 </id> <phrase> fluid collection </phrase> <category> ['fluid_collection'] </category> 
  ----MODIFIED BY: <id> 108748803324176524678504849440865967056 </id> <phrase>  abdominal wall </phrase> <category> ['anatomy'] </category> 
  __________________________________________,
  __________________________________________
  rawText: the multiloculated fluid
   collection in the lesser sac and in the left anterior pararenal space
   extending into the left lower pelvis have midly increased in size.
  cleanedText: the multiloculated fluid collection in the lesser sac and in the left anterior pararenal space extending into the left lower pelvis have midly increased in size.
  ***************

In [15]:
context.getXML()

'\n<ConTextDocument>\n<section>\n<sectionLabel> document </sectionLabel>\n</section>\n\n</ConTextDocument>\n'

**February 2**: Now gets all but one of the definitive mentions (3 of 4). Will debug soon.
The first run achieved:

- **Definite**:
    - Sensitivity: 0.5
    - PPV: 1.0
    - Specificity: 1.0
- **Indication**:
    - Sensitivity: 1.0
    - PPV: 1.0
    - Specificity: 1.0
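
For reference, these three figures come from counts of true and false positives and negatives. The sketch below is a minimal illustration and not the code used to produce the numbers above. It assumes the first run found 2 of the 4 hand-annotated definitive mentions with no false positives, and the true-negative count is purely hypothetical, since the notebook does not record it.

```python
def sensitivity(tp, fn):
    """Share of annotated mentions that the classifier found."""
    return tp / (tp + fn)


def ppv(tp, fp):
    """Positive predictive value: share of classifier hits that were correct."""
    return tp / (tp + fp)


def specificity(tn, fp):
    """Share of true negatives that the classifier left alone."""
    return tn / (tn + fp)


# Assumed counts for the 'Definite' class in the first run:
# 4 hand-annotated mentions, 2 found, none wrongly flagged.
tp, fn, fp = 2, 2, 0
tn = 10  # hypothetical: number of sentences correctly left unflagged

print(sensitivity(tp, fn))   # 0.5
print(ppv(tp, fp))           # 1.0
print(specificity(tn, fp))   # 1.0
```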

Examples of misses:

In [14]:
#across sentences:
'''there is also a new collection within the anterior abdominal wall.'''
#missing 'abdominal wall' 
'''New areas of fluid collection are noted in the anterior abdominal wall'''
''

''

## Try with another document

In [15]:
#this one contains a 'negated' class, but is classified as 'present'
input_report = '/Users/alec/Box Sync/Adjudication/Batch_3/corpus/No_69411_129942_04-30-58.txt'
report2 = ''
with open(input_report,'r') as f0:
    report2 += f0.read().lower()
report2
#len(report2)

'[**2604-3-11**] 11:18 am\n ct abdomen w/contrast; ct pelvis w/contrast                     clip # [**clip number (radiology) 53688**]\n reason: r/o obstruction  only one bottle of oral contrast pleaseplea\n admitting diagnosis: metastastic breast cancer to liver/sda\n ______________________________________________________________________________\n underlying medical condition:\n  only one bottle of oral contrast please   60f s/p r trisegmentectomy, r.\n  adrenalectomy, ccy, r-y hepaticojej, rfa x4 l. lat segment lesions [**3-2**] for\n  metastatic breast ca, w/ hepatic encephalopathy\n reason for this examination:\n  r/o obstruction  only one bottle of oral contrast pleaseplease use iv contrast\n no contraindications for iv contrast\n ______________________________________________________________________________\n provisional findings impression (pfi): aglc sun [**2604-3-11**] 8:42 pm\n  pfi:  status post right hepatic trisegmentectomy, cholecystectomy and right\n  adrenalectomy.  two

In [16]:
context2 = pyConText.ConTextDocument()

blob = TextBlob(report2)  # report2 was already lower-cased when it was read in
rslts = []
for s in blob.sentences:
    m = markup_sentence(s.raw, modifiers=modifiers, targets=targets)
    rslts.append(m)

for r in rslts:
    context2.addMarkup(r)

#display(HTML(html.mark_document_with_html(context2,colors = clrs, default_color="black")))

In [17]:
fluid_collection_classifier(context2)
#print(context2.getRawText())
#should be:
#3 definitive 
#1 negated

definitive evidence: 2
negated evidence: 1
indication: 1
ignored: 0


({__________________________________________
  rawText: trace fluid along the resection bed is unchanged, but
   there is now new nonenhancing, non-hemorrhagic fluid collection seen in the
   posteromedial subdiaphragmatic location, with trace fluid tracking along the
   dome of the liver.
  cleanedText: trace fluid along the resection bed is unchanged, but there is now new nonenhancing, non-hemorrhagic fluid collection seen in the posteromedial subdiaphragmatic location, with trace fluid tracking along the dome of the liver.
  ********************************
  TARGET: <id> 67551583972946375712720425144508931024 </id> <phrase> fluid collection </phrase> <category> ['fluid_collection'] </category> 
  ----MODIFIED BY: <id> 67550760792337852509638127767763999696 </id> <phrase> resection bed </phrase> <category> ['anatomy'] </category> 
  ----MODIFIED BY: <id> 67551126034167043267495280716532866000 </id> <phrase> subdiaphragmatic </phrase> <category> ['anatomy'] </category> 
  ----MODIFIE

**February 2**: This has gotten better. It got the negated mention but is still classifying one mention as indication instead of definitive.

## Observations and next steps

On these two documents, pyConText is doing quite well with definitive evidence and indication. The negation was working in the earlier examples, but not on this final document. Here is what I think the next steps should be:

- Add more modifiers and targets, particularly negation modifiers
- Add modifiers for historical and probable
- Beef up targets and modifiers with regular expressions
- What should pyConText output? XML files? A sqlite database? Writing out to an Excel file?
- Make it easier/automated to compare the output of my classifier function with annotations. Perhaps an XML parser would help
- Test on all the adjudicated batches
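
As a starting point for the automated comparison, the per-class counts printed by the classifier could be checked against the hand-annotation counts. This is only a sketch: `compare_counts` is a hypothetical helper, the numbers are the ones reported for the first document above, and comparing counts says nothing about whether the same sentences were matched.

```python
from collections import Counter


def compare_counts(expected, predicted):
    """Return {label: (expected, predicted, difference)} for every label seen.

    Both arguments are Counters, so a label missing from one side counts as 0.
    """
    labels = sorted(set(expected) | set(predicted))
    return {
        label: (expected[label], predicted[label],
                predicted[label] - expected[label])
        for label in labels
    }


# Hand annotations and classifier output for the first document.
hand = Counter({"definitive": 4, "indication": 4})
classifier = Counter({"definitive": 3, "negated": 0,
                      "indication": 4, "ignored": 2})

for label, (exp, pred, diff) in compare_counts(hand, classifier).items():
    print(f"{label:<12} expected={exp} predicted={pred} diff={diff:+d}")
```

The `ignored` row has no counterpart in the annotations, so its difference is only informative as a list of sentences worth reviewing by hand.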