In [66]:
import warnings
warnings.filterwarnings("ignore")

# 4. Fixing Errors and Handling Special Cases

# Overview
We've walked through the basic workflow for classifying documents as positive or not for COVID-19. However, applying this to millions of actual clinical documents showed us that this is a challenging problem. Clinical text is extremely messy, and we need some flexibility for handling data issues or challenging cases.

This notebook will review 2 additional capabilities our pipeline provides:
- **Preprocessing** for cleaning up text
- **Postprocessing** for defining special business logic

To illustrate these, we'll use some examples which our current pipeline classifies incorrectly and see how we can make specific improvements to fix these issues.

In [67]:
import cov_bsv
from cov_bsv import visualize_doc

from medspacy.visualization import visualize_dep

In [68]:
nlp = cov_bsv.load(enable=["tagger", "parser", "concept_tagger", "target_matcher", "context", "document_classifier"])

In [69]:
nlp.pipe_names

['tagger',
 'parser',
 'concept_tagger',
 'target_matcher',
 'context',
 'document_classifier']

We'll use these texts as our examples. Note that with the components we've looked at so far, each of these texts is classified incorrectly. For example:
- **"PATIENT SAFETY CHECK: The patient has been diagnosed with COVID-19: Y/N. Patient presents for routine exam."**: **"diagnosed with COVID-19"** makes our system think this is positive, but this phrase may occur in many patients' notes and does not indicate actual COVID-19 status
- **"The patient presented with suspicion for COVID-19 and later tested positive."**: Our system identifies that **"COVID-19"** is uncertain

We'll need to handle some special cases to fix them.

In [70]:
texts = [
    "PATIENT SAFETY CHECK: The patient has been diagnosed with COVID-19: Y/N. Patient presents for routine exam.",
    "++ COVID-19 Safety Check:++",  # trailing comma keeps this a separate list item
    "The patient presented with suspicion for COVID-19 and later tested positive.",
    "Patient tested today for COVID-19. Results are positive.",    
]

In [71]:
docs = list(nlp.pipe(texts))

In [72]:
for doc in docs:
    visualize_doc(doc)

# Preprocessing
In preprocessing, we clean up the text before it is processed by any spaCy components. Preprocessing is implemented in medSpaCy in the `Preprocessor` class, and rules are defined using `PreprocessingRule`. **Note that this preprocessor is *destructive***: it modifies the text and does not preserve the original.

Let's do some error analysis on our first two examples:

---
**"PATIENT SAFETY CHECK: The patient has been diagnosed with COVID-19: Y/N. Patient presents for routine exam."**

This is EHR template text which was likely copied and pasted or auto-populated into the note. The phrase may occur in many patients' notes and does not indicate actual COVID-19 status. This was one of the most common reasons for false positives in our operations.

In [73]:
doc = nlp("PATIENT SAFETY CHECK: The patient has been diagnosed with COVID-19: Y/N. Patient presents for routine exam.")

visualize_doc(doc)
visualize_dep(doc)

---
**"++ COVID-19 Safety Check:++"**

Here the **+** symbol is used as header formatting. This causes our system to incorrectly classify this as positive:

In [74]:
doc = nlp("++ COVID-19 Safety Check:++")

visualize_doc(doc)
visualize_dep(doc)

To fix these issues, we'll use preprocessing to modify the text before it's processed by the pipeline. First, we'll remove the entire template questionnaire. Then we'll rewrite the other section title to remove the ++ symbols.

We instantiate the `Preprocessor` class by passing the model's tokenizer:

In [75]:
from medspacy.preprocess import Preprocessor, PreprocessingRule

In [76]:
preprocessor = Preprocessor(nlp.tokenizer)

Preprocess rules contain a compiled regular expression, an optional replacement (default is a blank string), and an optional description. We then add these rules to the `preprocessor` component.

In [77]:
import re

In [78]:
preprocess_rules = [
    # Raw strings keep the regex backslashes intact; "\." matches the literal period after "Y/N"
    PreprocessingRule(re.compile(r"PATIENT SAFETY CHECK: The patient has been diagnosed with COVID-19: Y/N\."),
                      desc="Remove templated questionnaire."),
    PreprocessingRule(re.compile(r"\+\+ COVID-19 Safety Check:\+\+"), repl="COVID-19 Safety Check:",
                      desc="Remove '+' symbols around section header.")
]

In [79]:
preprocessor.add(preprocess_rules)
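
Before relying on these rules in the pipeline, it can help to check the regular expressions on their own. The sketch below uses only Python's `re` module rather than medSpaCy; `apply_rules` is a hypothetical helper, and it assumes the rules behave as plain substitutions applied in order. It also illustrates why this kind of preprocessing is destructive: the cleaned string is shorter than the original, so character offsets no longer line up with the source note.

```python
import re

# (pattern, replacement) pairs mirroring the two rules above.
RULES = [
    (re.compile(r"PATIENT SAFETY CHECK: The patient has been diagnosed with COVID-19: Y/N\."), ""),
    (re.compile(r"\+\+ COVID-19 Safety Check:\+\+"), "COVID-19 Safety Check:"),
]


def apply_rules(text, rules=RULES):
    """Hypothetical helper: apply each substitution in order and return the cleaned text."""
    for pattern, repl in rules:
        text = pattern.sub(repl, text)
    return text


template = (
    "PATIENT SAFETY CHECK: The patient has been diagnosed with COVID-19: Y/N. "
    "Patient presents for routine exam."
)
header = "++ COVID-19 Safety Check:++"

cleaned_template = apply_rules(template)
cleaned_header = apply_rules(header)

print(repr(cleaned_template))
print(repr(cleaned_header))
# The cleaned text is shorter, so offsets into the original note are lost.
print(len(template), len(cleaned_template))
```

Note that the first rule leaves a leading space behind, since only the template sentence itself is removed.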

To add this to the pipeline, we set the preprocessor to be the model tokenizer (this is because this technically occurs before the regular spaCy pipeline):

In [80]:
nlp.tokenizer = preprocessor
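
The idea of swapping in the preprocessor as the tokenizer can be sketched with a small wrapper. This is a toy illustration under stated assumptions, not medSpaCy's implementation: `ToyPreprocessor` and `whitespace_tokenizer` are hypothetical names, and the wrapper simply cleans the raw string and then delegates to whatever tokenizer it was given.

```python
import re


class ToyPreprocessor:
    """Hypothetical stand-in for a preprocessing wrapper (not medSpaCy's class)."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer  # the wrapped tokenizer, called after cleaning
        self.rules = []             # list of (compiled pattern, replacement) pairs

    def add(self, rules):
        self.rules.extend(rules)

    def __call__(self, text):
        # Clean the raw string first, then hand the result to the wrapped tokenizer.
        for pattern, repl in self.rules:
            text = pattern.sub(repl, text)
        return self.tokenizer(text)


def whitespace_tokenizer(text):
    """Toy tokenizer used only for this illustration."""
    return text.split()


toy = ToyPreprocessor(whitespace_tokenizer)
toy.add([(re.compile(r"\+\+"), "")])
tokens = toy("++ COVID-19 Safety Check:++")
print(tokens)
```

Because the wrapper is callable with the same input as the tokenizer it replaces, everything downstream only ever sees the cleaned text.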

Now let's see how our model performs on these texts:

In [81]:
text = "PATIENT SAFETY CHECK: The patient has been diagnosed with COVID-19: Y/N. Patient presents for routine exam."
print("Original text:", text)
doc = nlp(text)

visualize_doc(doc)
visualize_dep(doc)

Original text: PATIENT SAFETY CHECK: The patient has been diagnosed with COVID-19: Y/N. Patient presents for routine exam.


In [82]:
text = "++ COVID-19 Safety Check:++"
print("Original text:", text)
doc = nlp(text)

visualize_doc(doc)
visualize_dep(doc)

Original text: ++ COVID-19 Safety Check:++


We've now fixed two of our system errors! Let's move on to the next case.

# Postprocessing
The postprocessor iterates through each entity and checks a series of conditions on each. If all conditions evaluate as `True`, then some action is taken on the entity. Use cases include removing an entity or changing one of its attributes.

Let's look at the next two texts:

---
**"The patient presented with suspicion for COVID-19 and later tested positive."**

Here, the mention of COVID-19 has both an **"uncertain"** and **"positive"** modifier. Since our case definition requires that any positive entity not be uncertain, our system misclassifies this.

---
**"Patient tested today for COVID-19. Results are positive."**

The positive results are given in the next sentence, but the scope of the ConText algorithm is explicitly defined to be *within the same sentence*. Here we need to make an exception for this entity to look for positive results in the following sentence.

MedSpaCy provides the `Postprocessor` for this. The design pattern for a postprocessing rule is as follows:
- A `PostprocessingRule` contains a list of `patterns` and an `action` to take if all of the `patterns` evaluate as `True`
- Each `PostprocessingPattern` takes a `condition`, which evaluates as `True` or `False`. If all patterns return `True`, the action is taken
- Each pattern can take optional `condition_args` to pass into the condition check, and each rule takes optional `action_args`
- The module `postprocessing_functions` offers utility functions for the `condition` and `action` arguments, but you can also write your own functions

This framework allows very flexible and specific rules for handling cases like this.

---
Example:
```python
PostprocessingRule(
    patterns=[
        PostprocessingPattern(function_to_test_condition1),
        PostprocessingPattern(function_to_test_condition2),
        ...,
    ],
    action=function_act_on_ent,
    description="Description of the rule",
)
```
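
The control flow described above (every pattern must return `True` before the action runs) can be sketched in plain Python. This is a minimal illustration, not medSpaCy's code: `ToyEntity`, `ToyPattern`, `ToyRule`, `has_modifier`, and `set_uncertain` are hypothetical names, and the entity is a simple dataclass rather than a spaCy span.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ToyEntity:
    """Hypothetical stand-in for an entity with ConText-style attributes."""
    label: str
    is_uncertain: bool = False
    modifiers: List[str] = field(default_factory=list)


@dataclass
class ToyPattern:
    condition: Callable[..., bool]
    condition_args: Tuple = ()

    def __call__(self, ent):
        return self.condition(ent, *self.condition_args)


@dataclass
class ToyRule:
    patterns: List[ToyPattern]
    action: Callable
    action_args: Tuple = ()

    def __call__(self, ent, i):
        # The action fires only if every pattern evaluates to True.
        if all(pattern(ent) for pattern in self.patterns):
            self.action(ent, i, *self.action_args)
            return True
        return False


def has_modifier(ent, text):
    return text in ent.modifiers


def set_uncertain(ent, i, value):
    ent.is_uncertain = value


rule = ToyRule(
    patterns=[
        ToyPattern(lambda ent: ent.label == "COVID-19"),
        ToyPattern(lambda ent: ent.is_uncertain),
        ToyPattern(has_modifier, condition_args=("tested positive",)),
    ],
    action=set_uncertain,
    action_args=(False,),
)

ents = [
    ToyEntity("COVID-19", is_uncertain=True, modifiers=["suspicion for", "tested positive"]),
    ToyEntity("COVID-19", is_uncertain=True, modifiers=["suspicion for"]),
]
fired = [rule(ent, i) for i, ent in enumerate(ents)]
print(fired, [ent.is_uncertain for ent in ents])
```

Only the first toy entity satisfies all three conditions, so only its `is_uncertain` flag is changed.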

In [83]:
from medspacy.postprocess import Postprocessor, PostprocessingRule, PostprocessingPattern
# Also import utility functions for rules
from medspacy.postprocess import postprocessing_functions

In [84]:
postprocessor = Postprocessor(debug=True) # Set to be verbose

In [85]:
# Needs to go *before* the document classifier
nlp.add_pipe(postprocessor, before="document_classifier")

## Postprocessing example 1
Let's write a rule to fix our first example:
**"The patient presented with suspicion for COVID-19 and later tested positive."**

We will write a rule which says:
- If an entity:
    - Has a label of **"COVID-19"**; AND
    - `is_uncertain is True`; AND
    - is modified by a phrase like **"tested positive"**
- Then set `is_uncertain` to `False`

First we'll define a function which sets the entity's `is_uncertain` attribute. An action function must take the entity and the index of the entity in `doc.ents` as its first two arguments:

In [86]:
def set_is_uncertain(ent, i, value=True):
    ent._.is_uncertain = value

In [87]:
postprocess_rule1 = PostprocessingRule(
    patterns=[
        PostprocessingPattern(lambda ent: ent.label_ == "COVID-19"), # Condition function can be a lambda or some other func
        PostprocessingPattern(lambda ent: ent._.is_uncertain is True),
        # Check if it is modified by a phrase like "tested positive"
        PostprocessingPattern(postprocessing_functions.is_modified_by_text,
                                  # Provide optional positional arguments to the condition func
                                  # which will call func(ent, *args)
                                  condition_args=("test(ed)? positive",) 
                             ),
        
    ],
    action=set_is_uncertain, # Function which will call func(ent, i, *args)
    action_args=(False,), # Optional argument
    
)

In [88]:
postprocessor.add([postprocess_rule1])
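
It is worth checking which phrasings the regular expression `test(ed)? positive` actually covers. The sketch below tests it against a few modifier strings using only the `re` module. `any_modifier_matches` is a hypothetical helper, and the case-insensitive search is an assumption made for this illustration rather than a statement about how `is_modified_by_text` matches internally.

```python
import re

# Same pattern as in the rule above; case-insensitivity is an assumption here.
PATTERN = re.compile(r"test(ed)? positive", flags=re.IGNORECASE)


def any_modifier_matches(modifier_texts, pattern=PATTERN):
    """Hypothetical helper: True if any modifier text contains a match for the pattern."""
    return any(pattern.search(text) for text in modifier_texts)


examples = {
    "tested positive": any_modifier_matches(["suspicion for", "tested positive"]),
    "test positive": any_modifier_matches(["Test positive"]),
    "tests positive": any_modifier_matches(["tests positive"]),
    "no result": any_modifier_matches(["suspicion for"]),
}
print(examples)
```

The pattern does not match "tests positive", so a variant such as `test(ed|s)? positive` may be needed if that phrasing appears in your notes.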

Now let's see if we've fixed the false negative:

In [89]:
text = "The patient presented with suspicion for COVID-19 and later tested positive."
doc = nlp(text)

visualize_doc(doc)
visualize_dep(doc)

COVID-19
Passed: PostprocessingRule: None - None on ent: COVID-19 The patient presented with suspicion for COVID-19 and later tested positive.



## Postprocessing example 2
Now let's fix our second false negative:

**"Patient tested today for COVID-19. Results are positive."**

We'll write a rule which looks in the next sentence for any tests which don't have results:
- If an entity:
    - Has a label of **"COVID-19"**; AND
    - Is modified by **"test"**; AND
    - `is_positive` is `False`; AND
    - `is_negative` is `False`; AND
    - the following sentence contains the phrase **"results are positive"**
- Then set `is_positive` to `True`

First, we'll need to define some functions to check the contents of the next sentence, and to set `is_positive` to True.

In [90]:
def next_sentence_contains(ent, target):
    "Returns True if the sentence following an ent contains a target phrase (regex syntax)"
    next_sent = get_next_sentence(ent)
    if next_sent is None:
        return False
    return postprocessing_functions.span_contains(next_sent, target, regex=True)


In [91]:
def get_next_sentence(ent):
    "Return the sentence following a span. If the span is in the last sentence, return None."
    sent = ent.sent
    try:
        return ent.doc[sent.end].sent
    except IndexError:
        return None

In [92]:
def set_is_positive(ent, i, value=True):
    ent._.is_positive = value
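
The lookup in `get_next_sentence` relies on the fact that `sent.end` is the index of the first token of the following sentence, and that indexing past the end of the document raises `IndexError`. The toy below mimics that logic with a plain token list and hand-written sentence boundaries; `tokens`, `sentences`, `sentence_of`, and `next_sentence_text` are hypothetical names for illustration, not spaCy objects.

```python
import re

tokens = ["Patient", "tested", "today", "for", "COVID-19", ".",
          "Results", "are", "positive", "."]
# (start, end) token offsets for each sentence, end exclusive.
sentences = [(0, 6), (6, 10)]


def sentence_of(token_index):
    """Return the (start, end) span of the sentence containing a token."""
    tokens[token_index]  # raises IndexError past the end, like indexing a doc
    for start, end in sentences:
        if start <= token_index < end:
            return start, end


def next_sentence_text(token_index):
    """Return the text of the sentence after the one containing the token, or None."""
    _, end = sentence_of(token_index)
    try:
        next_start, next_end = sentence_of(end)
    except IndexError:
        return None
    return " ".join(tokens[next_start:next_end])


following = next_sentence_text(4)   # token 4 is "COVID-19"
last = next_sentence_text(8)        # token 8 is in the final sentence
found = re.search(r"results? (are|is) positive", following, flags=re.IGNORECASE) is not None
print(following, last, found)
```

An entity in the final sentence has no following sentence, which is why the condition function returns `False` when the lookup yields `None`.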

In [93]:
postprocess_rule2 = PostprocessingRule(
        patterns=[
            PostprocessingPattern(lambda ent: ent.label_ == "COVID-19"),
            
            PostprocessingPattern(
                postprocessing_functions.is_modified_by_category,
                condition_args=("TEST",),
            ),
            
            PostprocessingPattern(lambda ent: ent._.is_positive is False),
        
            PostprocessingPattern(
                next_sentence_contains,
                condition_args=("results? (are|is) positive",), # regex
            ),

        ],
        action=set_is_positive,
        action_args=(True,),
        description="If a test does not have any results within the same sentence, check the next sentence.",
    )

In [94]:
postprocessor.add([postprocess_rule2])

In [95]:
text = "Patient tested today for COVID-19. Results are positive."
doc = nlp(text)

visualize_doc(doc)
visualize_dep(doc)

COVID-19
Passed: PostprocessingRule: None - If a test does not have any results within the same sentence, check the next sentence. on ent: COVID-19 Patient tested today for COVID-19.



We've now fixed the 4 errors we identified! Let's take one last look at them with our pre- and postprocessing rules in place:

In [96]:
texts = [
    "PATIENT SAFETY CHECK: The patient has been diagnosed with COVID-19: Y/N. Patient presents for routine exam.",
    "++ COVID-19 Safety Check:++",
    "The patient presented with suspicion for COVID-19 and later tested positive.",
    "Patient tested today for COVID-19. Results are positive.",    
]

In [97]:
docs = list(nlp.pipe(texts))

COVID-19

COVID-19
Passed: PostprocessingRule: None - None on ent: COVID-19 The patient presented with suspicion for COVID-19 and later tested positive.

COVID-19
Passed: PostprocessingRule: None - If a test does not have any results within the same sentence, check the next sentence. on ent: COVID-19 Patient tested today for COVID-19.



In [98]:
for doc in docs:
    visualize_doc(doc)

# Conclusion
These notebooks have walked through some of the details of the COVID-19 NLP Biosurveillance model. Hopefully this gave a sufficient explanation of how our system was designed and gives you the tools needed for similar tasks. However, there are many more details which can't be covered or explained here. If you have questions or comments, or if you've applied NLP to combating the COVID-19 pandemic, we'd love to hear from you! Feel free to contact us using the contact information in the repository `README`.