In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys
sys.path.insert(0, "..")

In [3]:
from medspacy.visualization import visualize_ent, visualize_dep
from helpers import ENT_COLORS

# 4. Document Classification

The main goal of the NLP system is to assign a classification to each document. Each document is classified as one of:
- **Stably Housed**
- **Unstably Housed**: This covers various levels of housing instability, including homelessness, staying in a temporary shelter, and being at risk of losing housing
- **Unknown**: No clear classification can be determined

The image below shows the logic for assigning document labels. In summary:

- The note is parsed for specific templates using the section detection logic. For example, **"Where are you currently living? Stable Housing"** gives a very clear answer as to the patient's housing situation. If this section contains an entity, the document is classified according to the entity in that section.
- If no such template is found, the system next looks in the remainder of the note for an asserted mention of **"EVIDENCE_OF_HOUSING"** (i.e., not hypothetical, negated, historical, or ignored).

<img src="../images/ssvf_document_classification.png" alt="Document classification logic" width="700" height="300"/>

In [4]:
from rehoused_nlp import build_nlp, visualize_doc_classification

In [5]:
%%capture
nlp = build_nlp()

In [6]:
clf = nlp.get_pipe("document_classifier")
clf.debug = True # Print out classification logic

In [7]:
nlp.pipe_names

['tagger',
 'parser',
 'concept_tagger',
 'target_matcher',
 'context',
 'sectionizer',
 'postprocessor',
 'document_classifier']

## 1. Structured template
Structured templates are known semi-structured questionnaires or note sections that can be treated as a "magic bullet" because they state the patient's housing situation directly. These templates are highly specific to the data, so the sections will need to be defined according to your EHR.

Using a simple example of **"Where are you currently living?"**, here is a document which would be matched in this step. We can visualize it and also access the document classification using the `document_classification` attribute:

In [8]:
text = "Where are you currently living? Stable Housing"
doc = nlp(text)
visualize_doc_classification(doc, colors=ENT_COLORS)

Found form answer: Where are you currently living? Stable Housing EVIDENCE_OF_HOUSING
Parsed template for answer: STABLY_HOUSED


In [9]:
print(doc._.document_classification)

STABLY_HOUSED
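
To make the idea of a template lookup concrete, here is a minimal sketch in plain Python. It is a simplified stand-in, not how `rehoused_nlp` works: the pipeline shown above relies on section detection and on the entity found inside the section, whereas this sketch uses a regular expression and a lookup table. The names `TEMPLATE`, `ANSWER_TO_LABEL`, and `parse_template` are hypothetical, and the answer mapping contains only the one answer shown in this notebook.

```python
import re
from typing import Optional

# Hypothetical question template; real templates depend on your EHR.
TEMPLATE = re.compile(
    r"Where are you currently living\?[ \t]*(?P<answer>[^\n]*)",
    re.IGNORECASE,
)

# Hypothetical mapping from a normalized answer to a document label.
ANSWER_TO_LABEL = {
    "stable housing": "STABLY_HOUSED",
}


def parse_template(text: str) -> Optional[str]:
    """Return a label if the template is present and the answer is recognized."""
    match = TEMPLATE.search(text)
    if match is None:
        return None  # Template not present: fall through to later steps
    answer = match.group("answer").strip().lower()
    return ANSWER_TO_LABEL.get(answer)  # None if the answer is not recognized


print(parse_template("Where are you currently living? Stable Housing"))
```

Returning `None` when the template is missing or the answer is unrecognized lets the caller fall back to the entity-based steps described in the following sections.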


## 2. Stable housing in note
If a determination can't be made through a template, we must infer from entities in the rest of the note. The classification logic described in the paper looks for *any* other asserted mention of stable housing. If one is found, the document will be classified as **"Stably Housed"**, *no matter* what else is in the note. This is of course not an ideal rule and will not always lead to the correct classification, particularly if the mention of stable housing was extracted by mistake. But generally the logic works.

In the example below, there are two entities which could potentially be used for classification: **"homeless"** and **"apartment"**. The mention of homelessness is implied to be historical, but because this is only implicit, the NLP does not recognize it as such. However, the later mention of stable housing supersedes it and correctly leads to a classification of **"Stably Housed"**:

In [10]:
text = "The patient is a 30-year-old gentleman who has experienced homeless. He is currently living in an apartment."
doc = nlp(text)
visualize_doc_classification(doc, colors=ENT_COLORS)

defaultdict(<class 'set'>, {'EVIDENCE_OF_HOMELESSNESS': {homeless}, 'EVIDENCE_OF_HOUSING': {apartment}})
Found evidence of housing:
{apartment}
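
The first line of the debug output is a `defaultdict` that maps each entity label to a set of mentions. Judging from the outputs in this notebook, only asserted mentions appear to be included in it. The sketch below shows one way such a grouping could be built. It is an assumption-laden illustration, not the package's code: `group_asserted` and the `(text, label, is_asserted)` tuples are hypothetical stand-ins for the extracted entities and their context flags.

```python
from collections import defaultdict

# Hypothetical (text, label, is_asserted) triples standing in for entities.
entities = [
    ("homeless", "EVIDENCE_OF_HOMELESSNESS", True),
    ("apartment", "EVIDENCE_OF_HOUSING", True),
    ("house", "EVIDENCE_OF_HOUSING", False),  # e.g. a hypothetical mention
]


def group_asserted(entities):
    """Group asserted mentions by label, skipping non-asserted ones."""
    grouped = defaultdict(set)
    for text, label, is_asserted in entities:
        if is_asserted:
            grouped[label].add(text)
    return grouped


grouped = group_asserted(entities)
print(dict(grouped))
```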


## 3. Unstable housing in note
If there is no asserted mention of stable housing, we'll next look for mentions of unstable housing. This includes any entity class that implies unstable housing ("Homelessness", "Doubling Up", or "Temporary Housing"), as well as hypothetical mentions of stable housing ("would like an apartment").

In [11]:
texts = [
    "The patient is currently homeless.",
    "He is staying at his mother's house",
    "She has a bed at the local shelter",
    "He is looking for an apartment"
]

In [12]:
for text in texts:
    doc = nlp(text)
    visualize_doc_classification(doc, colors=ENT_COLORS)
    print("---"*5)
    print()

defaultdict(<class 'set'>, {'EVIDENCE_OF_HOMELESSNESS': {homeless}})
Found unstable housing concept:
[('EVIDENCE_OF_HOMELESSNESS', {homeless})]


---------------

defaultdict(<class 'set'>, {'DOUBLING_UP': {mother's house}})
Found unstable housing concept:
[('DOUBLING_UP', {mother's house})]


---------------

defaultdict(<class 'set'>, {'TEMPORARY_HOUSING': {shelter}})
Found unstable housing concept:
[('TEMPORARY_HOUSING', {shelter})]


---------------

defaultdict(<class 'set'>, {})
Found hypothetical evidence of housing
{apartment}


---------------



## 4. Negated homelessness
If there are no mentions of stable or unstable housing, we next look for negated mentions of homelessness. We infer from these that the patient is stably housed.

In [13]:
doc = nlp("He is not currently homeless.")
visualize_dep(doc)
visualize_doc_classification(doc, colors=ENT_COLORS)

defaultdict(<class 'set'>, {})
Found negated evidence of homelessness
{homeless}


However, if there is some other mention of unstable housing, that will take precedence, as we saw in Step #3:

In [14]:
doc = nlp("He is not currently homeless. He is living in a shelter.")
visualize_dep(doc)
visualize_doc_classification(doc, colors=ENT_COLORS)

defaultdict(<class 'set'>, {'TEMPORARY_HOUSING': {shelter}})
Found unstable housing concept:
[('TEMPORARY_HOUSING', {shelter})]


## 5. "UNKNOWN" classifications
If the four previous steps fail to produce a classification, we do not assign a confident classification to the document. Instead we call it **"UNKNOWN"**. Most **"UNKNOWN"** documents will contain a keyword related to housing but not say anything specific. Some, however, will be false negatives caused by NLP mistakes in one of the previous steps.

The following texts will all be classified as **"UNKNOWN"**:

In [15]:
texts = [
    "We discussed where he is living.",
    "Patient with a history of homelessness",
    "Prior to that, he lived in an apartment",
    "Apartment, house, or shared space?"
]
for text in texts:
    doc = nlp(text)
    visualize_doc_classification(doc, colors=ENT_COLORS)
    print("---"*5)
    print()

defaultdict(<class 'set'>, {})
No relevant entities for document classification


---------------

defaultdict(<class 'set'>, {})
No relevant entities for document classification


---------------

defaultdict(<class 'set'>, {})
No relevant entities for document classification


---------------

defaultdict(<class 'set'>, {})
No relevant entities for document classification


---------------
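
The five steps in this notebook form a priority cascade, which can be summarized in a short sketch. This is plain Python written for illustration and is not the `rehoused_nlp` implementation: the `Mention` class, the `classify` function, and the `"UNSTABLY_HOUSED"` label string are assumptions, while the entity labels and the order of the checks follow the steps described above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

UNSTABLE_LABELS = {"EVIDENCE_OF_HOMELESSNESS", "DOUBLING_UP", "TEMPORARY_HOUSING"}


@dataclass(frozen=True)
class Mention:
    """Hypothetical stand-in for an extracted entity and its context flags."""
    label: str
    is_negated: bool = False
    is_hypothetical: bool = False
    is_historical: bool = False
    is_ignored: bool = False

    @property
    def is_asserted(self) -> bool:
        return not (
            self.is_negated
            or self.is_hypothetical
            or self.is_historical
            or self.is_ignored
        )


def classify(mentions: Iterable[Mention], template_label: Optional[str] = None) -> str:
    mentions = list(mentions)
    # Step 1: an answer parsed from a structured template wins outright.
    if template_label is not None:
        return template_label
    # Step 2: any asserted mention of stable housing.
    if any(m.label == "EVIDENCE_OF_HOUSING" and m.is_asserted for m in mentions):
        return "STABLY_HOUSED"
    # Step 3: asserted unstable housing, or hypothetical stable housing.
    if any(m.label in UNSTABLE_LABELS and m.is_asserted for m in mentions):
        return "UNSTABLY_HOUSED"
    if any(m.label == "EVIDENCE_OF_HOUSING" and m.is_hypothetical for m in mentions):
        return "UNSTABLY_HOUSED"
    # Step 4: negated homelessness is taken to mean stable housing.
    if any(m.label == "EVIDENCE_OF_HOMELESSNESS" and m.is_negated for m in mentions):
        return "STABLY_HOUSED"
    # Step 5: nothing usable was found.
    return "UNKNOWN"


# "He is not currently homeless. He is living in a shelter."
print(classify([
    Mention("EVIDENCE_OF_HOMELESSNESS", is_negated=True),
    Mention("TEMPORARY_HOUSING"),
]))
```

Because the checks run in a fixed order, an earlier rule always overrides a later one. This is why a shelter mention outranks negated homelessness, and why a historical mention on its own falls through to **"UNKNOWN"**.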

