# From Mention-Level Annotations to Document Classification

## 1. Why do we need document classification?

Consider a document that contains multiple mentions. How do we reach a document-level conclusion when those mentions carry conflicting information? For example:

>Small **left pleural effusion**. **Right pleural effusion can be excluded**.

In this example, should we conclude that the report indicates pneumonia or does not indicate pneumonia?

There are many other situations in which we need to draw a document-level conclusion from multiple mention-level annotations. We could certainly train a machine learning classifier for this task, which you will learn about in another class. Here, however, we are going to learn how to do it in a rule-based way.

## 2. Pick up where we left off with pyConText

In [None]:
#import everything that we will need
import pyConTextNLP
from pyConTextNLP import pyConTextGraph
from pyConTextNLP.itemData import itemData
from pyConTextNLP.utils import get_document_markups
from pyConTextNLP.display.html import mark_document_with_html
import os
import os.path
from nlp_pneumonia_utils import Annotation
from nlp_pneumonia_utils import AnnotatedDocument
from nlp_pneumonia_utils import read_brat_annotations
from nlp_pneumonia_utils import read_annotations
from nlp_pneumonia_utils import calculate_prediction_metrics
from nlp_pneumonia_utils import markup_context_document
from DocumentClassifier import DocumentClassifier
from IPython.display import display, HTML, Image

from visual import Vis
from visual import snippets_markup
from visual import view_pycontext_output
from visual import convertMarkups2DF


In [None]:
# Let's just consider the example at the beginning as a document,
# and run pyConText to get markups

report = "Right pleural effusion can be excluded. Likely small left pleural effusion. "

targets = itemData(["effusion", "EVIDENCE_OF_PNEUMONIA", r"effusion[s]?", ""])

modifiers = pyConTextNLP.itemData.instantiateFromCSVtoitemData(os.path.join(os.getcwd(),'KB/pneumonia_modifiers.tsv'))

markups=markup_context_document(report,modifiers,targets)

In [None]:
# To confirm what we get from pyConText
view_pycontext_output(markups)
        


## 3. Use DocumentClassifier to define the rules for document classification

After processing a document, we get a list of pyConText markups. But this is not the end: we still want to conclude whether the document is pneumonia-positive or not. That's where the DocumentClassifier comes in.

### 3.1 Simple document classification rules
First, let's start with two simple cases:
1. If there is a true mention of pneumonia evidence, we should conclude "PNEUMONIA_DOC_YES"
2. Otherwise, we should conclude "PNEUMONIA_DOC_NO"


We could easily write some Python code for that. Here we go one step further and externalize the rule definitions, so that we can reuse our code directly in other projects. Here is an example rule file [doc_inferences.csv](../../../edit/work/decart_rule_based_nlp/KB/doc_inferences.csv):

```python
DocConclusion,EvidenceTypes
# The rules in the document inferences file are processed from top to bottom.
# If any rule matches, the rules below it are skipped.
# If the document has an EVIDENCE_OF_PNEUMONIA annotation, conclude PNEUMONIA_DOC_YES.
PNEUMONIA_DOC_YES,EVIDENCE_OF_PNEUMONIA
# If no rule above matched, conclude PNEUMONIA_DOC_NO (default conclusion).
PNEUMONIA_DOC_NO
```
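
The top-to-bottom, first-match behaviour described in the rule file can be sketched in plain Python. This is only an illustration, not the DocumentClassifier implementation: `parse_doc_rules` and `conclude` are hypothetical helpers, and we assume a rule matches when every evidence type it lists is present among the document's annotation types.

```python
import csv

# Same content as the example rule file, embedded as a string for the sketch.
RULES_TEXT = """DocConclusion,EvidenceTypes
# comment lines start with '#'
PNEUMONIA_DOC_YES,EVIDENCE_OF_PNEUMONIA
PNEUMONIA_DOC_NO
"""


def parse_doc_rules(text):
    """Return a list of (conclusion, required_evidence_types) in file order."""
    # Drop blank lines and comment lines before handing the rest to csv.
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.lstrip().startswith("#")]
    rules = []
    for row in csv.reader(lines[1:]):  # lines[0] is the header row
        conclusion = row[0].strip()
        required = {cell.strip() for cell in row[1:] if cell.strip()}
        rules.append((conclusion, required))
    return rules


def conclude(rules, annotation_types):
    """Return the conclusion of the first rule whose evidence types are all present."""
    present = set(annotation_types)
    for conclusion, required in rules:
        # Assumption: a rule with no evidence types always matches (the default).
        if required <= present:
            return conclusion
    return None


rules = parse_doc_rules(RULES_TEXT)
print(conclude(rules, ["EVIDENCE_OF_PNEUMONIA"]))  # PNEUMONIA_DOC_YES
print(conclude(rules, ["NEG_EVIDENCE"]))           # PNEUMONIA_DOC_NO
```

Because the default rule lists no evidence types, it always matches, so it must stay at the bottom of the file.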

### 3.2 How to integrate pyConText processed results
However, the rules above don't make any use of the pyConText output yet. We definitely want to exclude evidence that is negated. A simple fix is to go over all the annotations, change each negated "EVIDENCE_OF_PNEUMONIA" annotation to another annotation type, and then let our DocumentClassifier apply the rules above to draw the correct conclusion.

Thus, we need another inference component (a feature inferencer) to make this change. Its rule file ([featurer_inferences.csv](../../../edit/work/decart_rule_based_nlp/KB/featurer_inferences.csv)) has the following format:

```python
ConclusionType,SourceType,ModifierValues
# If an annotation is 'EVIDENCE_OF_PNEUMONIA' and has only the modifier 'DEFINITE_NEGATED_EXISTENCE', change its annotation type to NEG_EVIDENCE.
NEG_EVIDENCE,EVIDENCE_OF_PNEUMONIA,DEFINITE_NEGATED_EXISTENCE
```
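
The retyping step can be illustrated with a minimal sketch in plain Python. It does not reproduce the actual feature inferencer: `infer_features` is a hypothetical helper, the annotations are hand-written stand-ins rather than real pyConText output, and we assume a rule fires when the annotation has the rule's source type and carries all of the listed modifier values.

```python
# One rule, mirroring the example file: (ConclusionType, SourceType, ModifierValues)
FEATURE_RULES = [
    ("NEG_EVIDENCE", "EVIDENCE_OF_PNEUMONIA", {"DEFINITE_NEGATED_EXISTENCE"}),
]


def infer_features(annotations, rules):
    """Return the (possibly changed) type of each annotation, in input order.

    annotations: list of dicts with a 'type' string and a 'modifiers' set.
    """
    new_types = []
    for ann in annotations:
        new_type = ann["type"]  # keep the original type unless a rule fires
        for conclusion_type, source_type, required_modifiers in rules:
            if ann["type"] == source_type and required_modifiers <= ann["modifiers"]:
                new_type = conclusion_type
                break  # first matching rule wins (an assumption of this sketch)
        new_types.append(new_type)
    return new_types


# Hand-written stand-ins for the two mentions in the example report.
toy_annotations = [
    {"text": "Right pleural effusion", "type": "EVIDENCE_OF_PNEUMONIA",
     "modifiers": {"DEFINITE_NEGATED_EXISTENCE"}},
    {"text": "left pleural effusion", "type": "EVIDENCE_OF_PNEUMONIA",
     "modifiers": set()},  # simplified: no modifiers recorded in this toy data
]

toy_types = infer_features(toy_annotations, FEATURE_RULES)
print(toy_types)  # ['NEG_EVIDENCE', 'EVIDENCE_OF_PNEUMONIA']
```

Only the negated mention is retyped, so a document-level rule that looks for EVIDENCE_OF_PNEUMONIA still finds the non-negated one.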

### 3.3 Dissect DocumentClassifier step by step
To use these two sets of rules, we need to instantiate our DocumentClassifier. The DocumentClassifier also wraps pyConText internally to make it easier to use:



In [None]:
pos_doc_type='PNEUMONIA_DOC_YES'
TARGETS_FILE_PATH = 'KB/pneumonia_targets.tsv'
MODIFIERS_FILE_PATH = 'KB/pneumonia_modifiers.tsv'
FEATURE_INFERENCER_FILE_PATH = 'KB/featurer_inferences.csv'
DOC_INFERENCER_FILE_PATH = 'KB/doc_inferences.csv'
# re-create the classifier in case the rule files/regular expressions have been updated
classifier = DocumentClassifier(TARGETS_FILE_PATH, MODIFIERS_FILE_PATH,
                               FEATURE_INFERENCER_FILE_PATH, DOC_INFERENCER_FILE_PATH,
                               {pos_doc_type})

Now let's check how this "classifier" gets to the conclusion step by step:

In [None]:
# First, use pyConText to process the input text
context_doc = markup_context_document(report, classifier.modifiers, classifier.targets)
view_pycontext_output(context_doc)

In [None]:
# read out the pyConText output into dataframe format
annotations, relations, doc_txt = convertMarkups2DF(get_document_markups(context_doc))

In [None]:
# Let's see what is in annotations:
annotations

In [None]:
# And what is in "relations":
relations

Then we use feature inference rules to change our negated annotations' type:

In [None]:
matched_conclusion_types = classifier.feature_inferencer.process(annotations, relations)
matched_conclusion_types

Now the type of the 1st annotation has been changed to 'neg_evidence'. Next, we can draw the document-level conclusion by simply checking whether any 'EVIDENCE_OF_PNEUMONIA' annotation remains:

In [None]:
doc_conclusion = classifier.document_inferencer.process(matched_conclusion_types)
doc_conclusion

### 3.4 Wrap up
To make the call even simpler, DocumentClassifier already wraps all the code above into a single function:

In [None]:
doc_conclusion = classifier.classify_doc(report)
doc_conclusion

To visualize the last document that the classifier processed:

In [None]:
view_pycontext_output(classifier.get_last_context_doc())

## 4. Exercise
Let's try switching the order of the sentences in the example and see what happens. Does the order of the mention-level annotations affect the final conclusion?

In [None]:
# Your code goes here:



## 5. Quiz
Let's try a few questions to see whether you've understood the content of this notebook:

In [None]:
from quiz_utils import doc_classify_1
doc_classify_1()

In [None]:
from quiz_utils import doc_classify_2
doc_classify_2()

In [None]:
from quiz_utils import doc_classify_3
doc_classify_3()

<br/><br/>This material was presented as part of the DeCART Data Science for the Health Science Summer Program at the University of Utah in 2018.<br/>
Presenters: Dr. Wendy Chapman, Jianlin Shi.