# From Mention-Level Annotations to Document Classification

## 1. Why do we need document classification?

Consider a document that contains several mentions. How do we reach a document-level conclusion when these mentions carry conflicting information? For example:

>Small **left pleural effusion**. **Right pleural effusion can be excluded**.

In this example, should we conclude that the report indicates pneumonia or does not indicate pneumonia?

There are many other situations in which we need to draw a document-level conclusion from multiple mention-level annotations. We could certainly train a machine learning classifier for this task, which you will learn about in another class. Here, we are going to learn how to do it in a rule-based way.

## 2. Pick up where we left off with pyConText

In [None]:
#import everything that we will need
import pyConTextNLP
from pyConTextNLP import pyConTextGraph
from pyConTextNLP.itemData import itemData
from pyConTextNLP.display.html import mark_document_with_html
import os
import os.path
import radnlp
import radnlp.rules as rules
import radnlp.utils as utils
import radnlp.split as split
import radnlp.view as rview
import radnlp.schema as schema
import radnlp.classifier as classifier
from radnlp.data import classrslts
from nlp_pneumonia_utils import Annotation
from nlp_pneumonia_utils import AnnotatedDocument
from nlp_pneumonia_utils import read_brat_annotations
from nlp_pneumonia_utils import read_annotations
from nlp_pneumonia_utils import calculate_prediction_metrics
from nlp_pneumonia_utils import mark_text
from nlp_pneumonia_utils import pneumonia_html_markup
from IPython.display import display, HTML, Image

In [None]:
# Let's just consider the example at the beginning as a document,
# and run pyConText to get markups

report = "Right pleural effusion can be excluded. Likely small left pleural effusion. "

targets = itemData(["effusion", "SPAN_POSITIVE_PNEUMONIA_EVIDENCE", r"effusion[s]?", ""])

modifiers = pyConTextNLP.itemData.instantiateFromCSVtoitemData(os.path.join(os.getcwd(),'KB/lexical_kb_05042016.tsv'))

markup = utils.mark_report(split.get_sentences(report),
                         modifiers,
                         targets)

In [None]:
# To confirm what we get from pyConText
print(markup.getDocumentGraph())
        

context_html = pyConTextNLP.display.html.\
    mark_document_with_html(markup, colors = {"span_positive_pneumonia_evidence": "blue"})
display(HTML(context_html))

## 3. Use RadNLP to define the rules for document classification

RadNLP is a simple rule-based document classification package. Let's treat document classification as a **Q**(question)&**A**(answer) task. For example, to answer the question "does the report indicate pneumonia?", we can proceed in two steps:

1. For each mention-level annotation, determine what answer it gives. (Let's call these classification rules.)
2. Among all the answers from step 1, decide which one to prioritize as the document-level conclusion. (Let's call these schema rules.)

Now let's start from something really simple:

### (1) Define classification rules

In the example above, we found both a positive indication of pneumonia and a negated one. Let's represent the question with the name 'DISEASE_STATE'. If a mention is positive, we conclude 'DISEASE_STATE=1'; otherwise, 'DISEASE_STATE=0'.

Intuitively, we want the following behavior: whenever we find a positive indication, conclude that the report is pneumonia positive, whether or not a negated indication also exists. We start with a rule that assigns '0' to a negated mention:

```Python
@CLASSIFICATION_RULE,DISEASE_STATE,RULE,0,DEFINITE_NEGATED_EXISTENCE,,,,,,,,,
```
Let's break down this rule to interpret it:


| Element | Meaning |
|:--- |:--- |
| @CLASSIFICATION_RULE | Marks this line as a classification rule (the other two rule types are not covered here) |
| DISEASE_STATE | The rule assigns a value to 'DISEASE_STATE' |
| RULE | Marks this line as a matching rule, as opposed to the `DEFAULT` line shown below |
| 0 | 'DISEASE_STATE' is assigned the value '0' when the rule matches |
| DEFINITE_NEGATED_EXISTENCE | The rule matches when a 'DEFINITE_NEGATED_EXISTENCE' modifier is found |
| ,,,,, | Further match conditions can be added, separated by commas |



Then we need one more rule that supplies a default value when no rule matches. Because pyConText is designed to identify context clues that make a mention non-positive, the default value (when no such clue is found) should be positive.
```Python
@CLASSIFICATION_RULE,DISEASE_STATE,DEFAULT,1,,,,,,,,,,
```
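
To make the matching logic concrete, here is a minimal pure-Python sketch of how a rule line and a default line could be applied to a single mention. It is not RadNLP's own parser: `parse_rule_line` and `classify_mention` are hypothetical helpers, the modifier categories of each mention are supplied by hand, and the sketch assumes a rule matches as soon as any one of its listed modifiers is present.

```python
# Illustrative only: a simplified stand-in for RadNLP's rule handling.
RULE_LINES = [
    "@CLASSIFICATION_RULE,DISEASE_STATE,RULE,0,DEFINITE_NEGATED_EXISTENCE,,,,,,,,,",
    "@CLASSIFICATION_RULE,DISEASE_STATE,DEFAULT,1,,,,,,,,,,",
]


def parse_rule_line(line):
    """Split one rule line into (question, kind, value, conditions)."""
    fields = line.split(",")
    question, kind, value = fields[1], fields[2], int(fields[3])
    # Empty slots between commas are unused match conditions; drop them.
    conditions = [f.lower() for f in fields[4:] if f]
    return question, kind, value, conditions


def classify_mention(modifier_categories, rule_lines=RULE_LINES):
    """Return the value of the first matching RULE, else the DEFAULT value."""
    default = None
    for line in rule_lines:
        _question, kind, value, conditions = parse_rule_line(line)
        if kind == "DEFAULT":
            default = value
        elif any(c in modifier_categories for c in conditions):
            # Assumption: one matching modifier is enough to fire the rule.
            return value
    return default


negated = classify_mention({"definite_negated_existence"})
plain = classify_mention(set())
print(negated, plain)
```

A mention modified by a definite negation gets the value from the `RULE` line, while a mention with no modifiers falls through to the `DEFAULT` line.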

### (2) Define schema rules

Since the example contains two mention-level annotations, we get two answers to the question 'DISEASE_STATE': 0 and 1. Let's define rules that prioritize '1' over '0'.
```Python
1,Negative,DISEASE_STATE == 0
2,Positive,DISEASE_STATE == 1
```
Again, let's break down one of the rules to interpret it:


| Element | Meaning |
|:--- |:--- |
| 1 | The priority score is '1' |
| Negative | A name for this answer, here 'Negative'. It does not affect the output, but makes the answer easier to read |
| DISEASE_STATE == 0 | The condition: this row applies when 'DISEASE_STATE == 0' |


These two rules give each answer of 'DISEASE_STATE' a score. If 'DISEASE_STATE == 0', score=1. If 'DISEASE_STATE == 1', score=2. The answer with a higher score will be prioritized.
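
The scoring step can be sketched in a few lines of plain Python. This is a hand-rolled illustration rather than RadNLP's schema code: `SCHEMA_ROWS`, `schema_score`, and `document_score` are hypothetical names, each condition is written as a Python function instead of being parsed from the schema file, and the fallback score of -1 mirrors the starting value used in the step-by-step function later in this notebook.

```python
# Illustrative only: a hand-rolled version of the schema lookup.
SCHEMA_ROWS = [
    (1, "Negative", lambda answers: answers["DISEASE_STATE"] == 0),
    (2, "Positive", lambda answers: answers["DISEASE_STATE"] == 1),
]


def schema_score(answers, rows=SCHEMA_ROWS):
    """Return the highest score whose condition holds, or -1 if none does."""
    matched = [score for score, _label, cond in rows if cond(answers)]
    return max(matched, default=-1)


def document_score(mention_answers):
    """Keep the highest schema score seen across all mentions."""
    return max((schema_score(a) for a in mention_answers), default=-1)


# One negated mention and one positive mention, as in the example report.
mentions = [{"DISEASE_STATE": 0}, {"DISEASE_STATE": 1}]
doc_score = document_score(mentions)
labels = {score: label for score, label, _cond in SCHEMA_ROWS}
print(doc_score, labels[doc_score])
```

Each mention is scored on its own, and the document keeps the highest score, which is how 'Positive' ends up taking priority over 'Negative'.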


#### Question: Why don't we just use 1 and 0 as the scores?

### (3) Let's put rules into code

These rules have already been saved in two files for testing. Classification rules are saved in [classificationRules.csv](KB/classificationRules.csv). Schema rules are saved in [schema.csv](KB/schema.csv). We will try to edit them later. For now, let's just read them as they are.

In [None]:
# Let's load these rules in code
cls_rules = rules.read_rules(os.path.join(os.getcwd(),'KB/classificationRules.csv'))
print('Classification Rules:\n\n'+str(cls_rules[0]))


schema_rules = schema.read_schema(os.path.join(os.getcwd(),'KB/schema.csv'))
print("\nSchema:\n\n"+str(schema_rules))


In [None]:
# This function illustrates how the classification answers will be drawn for each mention-level annotation,
# and how document schema answer will be updated from each mention-level answer
def classify_document_targets(doc,
                              classification_rules,
                              category_rules,
                              severity_rules,
                              schema,
                              neg_filters=["definite_negated_existence",
                                           "probable_negated_existence"],
                              exclusion_categories=["QUALITY_FEATURE",
                                                    "ARTIFACT"]):
    """
    Look at the targets and their modifiers to get an overall
    classification for the document_markup
    """
    rslts = {}

    qualityInducedUncertainty = False
    try:
        g = doc.getDocumentGraph()
    except AttributeError:
        # doc is already a graph rather than a pyConText document
        g = doc
    targets = [n[0] for n in g.nodes(data=True)
               if n[1].get("category", "") == 'target']
    

    if targets:
        for t in targets:
            # print the target annotation
            print(t)
            severity_values = []
            current_rslts = {}
            current_category = t.getCategory()

            if not any(t.isA(c) for c in exclusion_categories):
                # apply every classification rule to this target
                current_rslts = {
                    rk: utils.generic_classifier(g, t, classification_rules[rk])
                    for rk in classification_rules}
                print('The classification answer for this mention: \n\t' + str(current_rslts))

                current_category = t.categoryString()
                # compare current_rslts with rslts and keep the
                # answer with the highest schema score
                docr = classifier.classify_result(current_rslts, schema)
                print('The schema answer for this mention = ' + str(docr))
                trslts = rslts.get(current_category, (-1, '', []))
                if trslts[0] < docr:
                    trslts = (docr, t.getXML(), severity_values)
                rslts[current_category] = trslts
                print('Current document level schema answer = ' + str(trslts[0]) + '\n')
            elif t.isA('QUALITY_FEATURE'):
                qualityInducedUncertainty = True
            elif not utils.modifies(g, t, neg_filters):
                # non-negated artifacts make the report uncertain
                qualityInducedUncertainty = True

    return rslts



In [None]:
# Let's see how the answer is obtained step by step

clssfy =   classify_document_targets(markup, cls_rules[0],cls_rules[1],cls_rules[2],schema_rules)

print('Finally we got answer: '+str(clssfy['span_positive_pneumonia_evidence'][0]))                                

In [None]:
# Without the step-by-step printing, you can call RadNLP's classifier function directly:

clssfy =   classifier.classify_document_targets(markup, cls_rules[0],cls_rules[1],cls_rules[2],schema_rules)
print('Finally we got answer: '+str(clssfy['span_positive_pneumonia_evidence'][0]))                                

### (4) Let's try to switch the sentences in the example
See what happens. Does the order of the mention-level annotations affect the final conclusion?
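
Before rerunning the cells with the sentences swapped, it can help to reason about a toy model of the update step. The sketch below is not RadNLP code: `running_best` is a hypothetical helper that copies only the "replace the stored score when a higher one arrives" comparison from the step-by-step function, and the scores 1 and 2 are the schema scores assumed for the negated and positive mentions.

```python
from itertools import permutations


def running_best(scores):
    """Mimic the document-level update: keep a score only if it beats the stored one."""
    best = -1  # same starting value as the step-by-step function
    for s in scores:
        if best < s:
            best = s
    return best


# Schema scores of the two mentions: 1 (Negative) and 2 (Positive).
scores = [1, 2]
results = {order: running_best(order) for order in permutations(scores)}
print(results)
```

Comparing the value for each ordering shows what the update rule alone predicts; running the real pipeline on the swapped report is still the actual check.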

### (5) Let's try to add one more question 

Is the document's conclusion certain or uncertain?