In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys
sys.path.insert(0, "..")

In [3]:
from medspacy.visualization import visualize_ent, visualize_dep
from helpers import ENT_COLORS

# 3. Postprocessing
The language and documentation around housing status is very complex and noisy. Reasons for this include templates (e.g., questionnaires that contain housing-related terms but no useful information), complex linguistic structures, and the inherent difficulty of defining what someone's housing status is. Now that we've extracted concepts and set some important attributes, the next component implements complex, highly specific, and sometimes brittle logic through **postprocessing**.

In [4]:
from rehoused_nlp import build_nlp

In [5]:
%%capture
nlp = build_nlp()

for pipe in ('document_classifier',):
    nlp.remove_pipe(pipe)

In [6]:
nlp.pipe_names

['tagger',
 'parser',
 'concept_tagger',
 'target_matcher',
 'context',
 'sectionizer',
 'postprocessor']

## Postprocessing

See medspaCy's [notebook on Postprocessing](https://github.com/medspacy/medspacy/blob/master/notebooks/08-Preprocessing-Postprocessing.ipynb) for more information.

Each postprocessing rule has a description that documents what the rule is meant to do.

In [7]:
postprocessor = nlp.get_pipe("postprocessor")
postprocessor.debug = True # print out debugging info

In [8]:
postprocessor.rules[:10]

[PostprocessingRule: None - If the generic phrase 'housing' occurs in patient goals, allow it to be used as hypothetical housing,
 PostprocessingRule: None - If the generic phrase 'housing' is preceded by 'found' or modified by 'positive housing', allow it to be used as evidence of housing.,
 PostprocessingRule: None - Require a modifier for the exact phrase 'home',
 PostprocessingRule: None - If a patient 'does not have stable housing', count that as evidence of homelessness,
 PostprocessingRule: None - Ignore entities overlapping with 'housing situation',
 PostprocessingRule: None - If housing is being discussed in the same sentence as 'housing options', the housing should be hypothetical.,
 PostprocessingRule: None - Consider 'rental assistance' to be 'evidence of housing' only if it is being received,
 PostprocessingRule: None - If evidence of housing is modified by 'need', change to 'EVIDENCE' or 'RISK_OF_HOMELESSNESS',
 PostprocessingRule: None - If evidence of housing occurs in 

Let's consider the first rule. Note the description at the bottom:

---
```python
PostprocessingRule(
        patterns=[
            PostprocessingPattern(lambda ent: ent.lower_ == "housing"),
            PostprocessingPattern(lambda ent: ent.label_ == "EVIDENCE_OF_HOUSING"),
            (
                PostprocessingPattern(postprocessing_functions.is_modified_by_category, condition_args=("HYPOTHETICAL",)),
                PostprocessingPattern(postprocessing_functions.is_modified_by_text, condition_args=(r"(goal|secure|await|obtain|find|worried|need)",)),
                PostprocessingPattern(lambda ent: ent._.section_category == "patient_goals"),
            ),
        ],
        action=change_hypothetical_phrase_housing,
        # action_args=("RISK_OF_HOMELESSNESS",),
        description="If the generic phrase 'housing' occurs in patient goals, allow it to be used as hypothetical housing"
    ),
```

---

The motivation for this rule is that "housing" is an extremely generic word. When a patient is either homeless or in the process of being **"rehoused"**, the word **"housing"** will show up often without referring to anything specific about the patient. By default, the NLP sets `is_ignored` to `True` for this phrase. However, sometimes this isn't what we want. When "housing" is listed in a section like **"Patient goals"**, it suggests that the patient is working towards finding stable housing.

In the first example below, the entity meets all of the conditions in `patterns` (i.e., the phrase **"housing"** occurs in the patient goals section), so the rule changes `is_ignored` to `False` and `is_hypothetical` to `True`. The debugging information shows this logic.

In [9]:
doc = nlp("Patient goals: housing")
ent = doc.ents[-1]
print(ent)
print("is_ignored:", ent._.is_ignored)
print("is_hypothetical", ent._.is_hypothetical)

housing
Passed: PostprocessingRule: None - If the generic phrase 'housing' occurs in patient goals, allow it to be used as hypothetical housing on ent: housing Patient goals: housing
Passed: PostprocessingRule: None - If evidence of housing occurs in the goals section, set to hypothetical on ent: housing Patient goals: housing

housing
is_ignored: False
is_hypothetical True


In this second example, these conditions aren't met, so `is_ignored` stays `True` and `is_hypothetical` is `False`.

In [10]:
doc = nlp("Will talk about housing")
ent = doc.ents[-1]
print(ent)
print("is_ignored:", ent._.is_ignored)
print("is_hypothetical:", ent._.is_hypothetical)

housing

housing
is_ignored: True
is_hypothetical: False
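
The control flow behind these two examples can be sketched in plain Python. This is a simplified illustration, not medspaCy's implementation: the `Ent` class, `rule_passes`, and `apply_rule` are hypothetical names, and the sketch assumes that every item in `patterns` must hold while a tuple of patterns holds if any one of its members does.

```python
from dataclasses import dataclass, field
import re


@dataclass
class Ent:
    """Hypothetical stand-in for an entity; not the real spaCy Span API."""
    text: str
    label: str
    section_category: str = ""
    modifier_categories: set = field(default_factory=set)
    modifier_texts: list = field(default_factory=list)
    is_ignored: bool = True        # generic "housing" starts out ignored
    is_hypothetical: bool = False


def rule_passes(ent, patterns):
    """Assumed semantics: all items must hold; a tuple holds if any member holds."""
    for pattern in patterns:
        if isinstance(pattern, tuple):
            if not any(p(ent) for p in pattern):
                return False
        elif not pattern(ent):
            return False
    return True


CUES = re.compile(r"(goal|secure|await|obtain|find|worried|need)")

patterns = [
    lambda ent: ent.text.lower() == "housing",
    lambda ent: ent.label == "EVIDENCE_OF_HOUSING",
    (
        lambda ent: "HYPOTHETICAL" in ent.modifier_categories,
        lambda ent: any(CUES.search(t.lower()) for t in ent.modifier_texts),
        lambda ent: ent.section_category == "patient_goals",
    ),
]


def apply_rule(ent):
    """Un-ignore the entity and mark it hypothetical when the rule passes."""
    if rule_passes(ent, patterns):
        ent.is_ignored = False
        ent.is_hypothetical = True
    return ent


in_goals = apply_rule(
    Ent("housing", "EVIDENCE_OF_HOUSING", section_category="patient_goals")
)
no_context = apply_rule(Ent("housing", "EVIDENCE_OF_HOUSING"))

print(in_goals.is_ignored, in_goals.is_hypothetical)      # False True
print(no_context.is_ignored, no_context.is_hypothetical)  # True False
```

The first entity satisfies the section condition inside the tuple, so the action runs; the second satisfies none of the three alternatives, so it keeps its defaults.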


That example is fairly straightforward and generalizable. Let's look at a more complex and, to be honest, problematic example:

---
```python
PostprocessingRule(
        [
            PostprocessingPattern(lambda ent: ent.label_ == "EVIDENCE_OF_HOUSING"),
            PostprocessingPattern(lambda ent: ent.lower_ in ("house", "apartment", "apartment complex", "apartment building", "apt")),
            PostprocessingPattern(postprocessing_functions.is_modified_by_text,
                                  condition_args=(r"(apply|applied|visit|available|look)",),
                                  success_value=False),
            PostprocessingPattern(postprocessing_functions.is_modified_by_category, condition_args=("HYPOTHETICAL",),
                                  success_value=False),
            PostprocessingPattern(postprocessing_functions.is_modified_by_category, condition_args=("RESIDES_IN",),
                                  success_value=False),
            PostprocessingPattern(postprocessing_functions.is_modified_by_category, condition_args=("ACCEPTED",),
                                  success_value=False),
            PostprocessingPattern(is_preceded_by, condition_args=("maintain", 5),
                                  success_value=False),
            PostprocessingPattern(is_preceded_by, condition_args=(r"has ?(an|a)?", 3),
                                  success_value=False),
            PostprocessingPattern(lambda ent:ent._.window(5)._.contains(r"(his|her)( own)?"),
                                  success_value=False),
            # PostprocessingPattern(lambda ent:ent._.window(5, left=True, right=False)._.contains(r"transition")),
            PostprocessingPattern(postprocessing_functions.is_modified_by_category, condition_args=("POSITIVE_HOUSING",), success_value=False),
            # PostprocessingPattern(is_preceded_by, condition_args=(["got"],3), success_value=False),
        ],
        action=set_ignored, action_args=(True,),
        description="Require a modifier for 'house' or 'apartment' to be considered housing"
    ),
```
---

This rule was one of the most difficult to implement and is probably one of the main causes of false negatives in the paper. The basic motivation is that, similar to the word **"housing"**, words like **"house"** and **"apartment"** often do not refer specifically to the patient's housing. So to avoid false positives, we may want to ignore these phrases. However, there are obviously many scenarios in which these entities are very specifically about the patient. This rule tries to distinguish between the two scenarios by looking for contextual information. For example:
- **"maintain her apartment"** implies that the patient is living in an apartment
- **"applied to an apartment"** implies that the patient is trying to get housing
- **"his own house"** implies a patient has a house

This rule works by checking for each of these variations, any of which indicates that a mention is relevant. If none of the conditions are met, the entity is set to be ignored. (So unlike the previous rule we looked at, if this rule passes, the entity is ignored.) The problem is that it's very difficult, if not impossible, to enumerate and capture all the different situations in which an entity should be considered relevant. When the NLP misses such a situation, this can cause a false negative.
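
The "ignore unless some cue says otherwise" logic can be sketched with plain regular expressions. This is a rough approximation under stated assumptions, not the rule itself: `RELEVANCE_CUES` and `should_ignore` are hypothetical names, the cues are only loosely modelled on the patterns above, and the sketch searches the whole text rather than a token window or ConText modifiers.

```python
import re

# Hypothetical relevance cues, loosely modelled on the rule's patterns.
RELEVANCE_CUES = [
    re.compile(r"\b(apply|applied|visit|available|look)"),
    re.compile(r"\bmaintain"),
    re.compile(r"\bhas( an| a)?\b"),
    re.compile(r"\b(his|her)( own)?\b"),
    re.compile(r"\blives? in\b"),  # stand-in for a RESIDES_IN modifier
]

TARGETS = ("house", "apartment", "apartment complex", "apartment building", "apt")


def should_ignore(text, target):
    """Ignore the target only when no relevance cue is found in the text."""
    if target not in TARGETS:
        return False
    lowered = text.lower()
    return not any(cue.search(lowered) for cue in RELEVANCE_CUES)


examples = [
    ("house", "house"),
    ("He has his own house", "house"),
    ("applied to an apartment", "apartment"),
    ("Lives in an apartment", "apartment"),
    ("Got a house last month", "house"),  # no listed cue matches
]

for text, target in examples:
    print(f"{text!r}: ignored={should_ignore(text, target)}")
```

In this sketch the last sentence is ignored even though it is clearly about the patient, because none of the enumerated cues covers "got". That is the kind of gap that produces a false negative.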

If we are looking at just the phrase **"house"**, it will be ignored:

In [11]:
doc = nlp("house")
print("is_ignored", doc.ents[0]._.is_ignored)

house
Passed: PostprocessingRule: None - Require a modifier for 'house' or 'apartment' to be considered housing on ent: house house

is_ignored True


However, here are a few examples that should not be ignored, so the rule won't pass:

In [12]:
texts = [
    "He has his own house",
    "applied to an apartment",
    "Lives in an apartment",
    "She has a house on the North Side",
    "He denies any concerns about his house."
]

for text in texts:
    nlp(text)

house

apartment

apartment

She has a house

his house.



Postprocessing is useful because it's flexible and allows for implementing fairly specific and complex logic. However, it can also be very brittle and difficult to manage. You can undoubtedly find many scenarios where a rule should work but doesn't, or where a rule ends up causing a false positive or false negative. It's a trade-off, and we found that the rules generally improved performance, even though they were a common (and frustrating) cause of mistakes.

This notebook won't go through the rules in any more detail, but a good way to get a sense of what each rule does is to look at the tests that check that their logic generally works: `rehoused/tests/test_postprocessrules.py`