In [1]:
%load_ext autoreload
%autoreload 2

In [25]:
import sys
sys.path.insert(0, "..")

In [3]:
from medspacy.visualization import visualize_ent
from helpers import ENT_COLORS

In [None]:
import warnings
warnings.filterwarnings("ignore")

# 1. Entity Extraction
This notebook will explain the logic underlying the first step of the NLP process: extracting entities and target concepts.

Let's load our NLP model. For this notebook, we'll disable some of the later components and just focus on the ones related to entity extraction.

In [4]:
from rehoused_nlp import build_nlp

In [6]:
%%capture
nlp = build_nlp()

for pipe in ('medspacy_context', 'medspacy_sectionizer', 'medspacy_postprocessor', 'document_classifier'):
    nlp.remove_pipe(pipe)

In [7]:
nlp.pipe_names

['tok2vec',
 'tagger',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'medspacy_concept_tagger',
 'medspacy_target_matcher']

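The first five components (`tok2vec`, `tagger`, `parser`, `attribute_ruler`, and `lemmatizer`) are standard spaCy components, including the POS tagger and dependency parser. We won't talk much about those. Instead we'll focus on the two remaining components, `medspacy_concept_tagger` and `medspacy_target_matcher`, which work in tandem to extract entities from the text.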

## ConceptTagger
The `ConceptTagger` identifies patterns in the text and assigns a semantic label to each matched token. Rules are stored as medspaCy `TargetRule` objects and defined by a `literal`, a `category`, and an optional `pattern`. If `pattern` is not `None`, the rule matches the phrase it defines; otherwise the rule matches the exact `literal` phrase. A pattern can be either a regular expression string or a spaCy dictionary-based pattern.

When a pattern is matched in the string, each token in the matched span will be assigned an attribute according to the `category` argument: `token._.concept_tag`. This semantic tag will be used in the next step to simplify entity extraction.
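The tagging step can be pictured with a small pure-Python sketch. This is a simplified stand-in, not medspaCy's implementation: the `TOY_RULES` dictionary and the `tag_tokens` helper are hypothetical names invented for illustration, the rules match exact lowercase phrases only, and leaving unmatched tokens with a blank tag is an assumption of the sketch.

```python
# Simplified stand-in for concept tagging; NOT medspaCy's implementation.
# Each hypothetical rule maps a tuple of lowercase tokens to a category.
TOY_RULES = {
    ("veteran",): "PATIENT",
    ("staying", "at"): "RESIDES",
    ("father",): "FAMILY",
    ("apartment",): "RESIDENCE",
}


def tag_tokens(tokens, rules=TOY_RULES):
    """Return one tag per token; unmatched tokens keep a blank tag."""
    tags = [""] * len(tokens)
    lowered = [token.lower() for token in tokens]
    for phrase, category in rules.items():
        n = len(phrase)
        for start in range(len(tokens) - n + 1):
            if tuple(lowered[start:start + n]) == phrase:
                # Every token inside the matched span receives the category.
                for i in range(start, start + n):
                    tags[i] = category
    return tags


tokens = ["The", "Veteran", "is", "staying", "at", "his", "father", "'s", "apartment", "."]
tags = tag_tokens(tokens)
for token, tag in zip(tokens, tags):
    print(token, tag)
```

The point to notice is that a multi-token match such as "staying at" labels both tokens, so later rules can refer to the label instead of the specific words.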

In [9]:
concept_tagger = nlp.get_pipe("medspacy_concept_tagger")

In [10]:
concept_tagger.rules[:10]

[TargetRule(literal="apartment", category="RESIDENCE", pattern=[{'IS_TITLE': True, 'OP': '+'}, {'LOWER': {'REGEX': 'apartment'}}], attributes=None, on_match=None),
 TargetRule(literal="<RESIDES>", category="RESIDES", pattern=[{'LEMMA': {'IN': ['reside', 'stay', 'live', 'sleep']}}, {'LOWER': {'IN': ['in', 'at']}, 'OP': '?'}], attributes=None, on_match=None),
 TargetRule(literal="move in", category="RESIDES", pattern=[{'LEMMA': 'move'}, {'LOWER': 'in'}], attributes=None, on_match=None),
 TargetRule(literal="current living situation:", category="RESIDES", pattern=None, attributes=None, on_match=None),
 TargetRule(literal="veteran", category="PATIENT", pattern=[{'LOWER': {'REGEX': '^vet(eran)?'}}, {'LOWER': "'s", 'OP': '?'}], attributes=None, on_match=None),
 TargetRule(literal="patient", category="PATIENT", pattern=[{'LOWER': {'IN': ['patient', 'pt', 'pt.']}}, {'LOWER': "'s", 'OP': '?'}], attributes=None, on_match=None),
 TargetRule(literal="patient", category="PATIENT", pattern=[{'LOWER'

Let's look at an example. Here you can see that **"Veteran"** is assigned a concept tag of **"PATIENT"**, **"staying at"** is **"RESIDES"**, **"father"** is assigned **"FAMILY"**, and **"apartment"** is assigned **"RESIDENCE"**. Any token not matched by a pattern has an empty concept tag, which prints as a blank in the output below.

In [11]:
text = "The Veteran is staying at his father's apartment."
doc = nlp(text)

In [12]:
for token in doc:
    print(token, token._.concept_tag)

The 
Veteran PATIENT
is 
staying RESIDES
at RESIDES
his 
father FAMILY
's 
apartment RESIDENCE
. 


To illustrate how the `ConceptTagger` makes entity extraction simpler, let's process a sentence with the same structure but different wording. Although the words differ, the concept tagger normalizes them to the same sequence of tags, which lets later rules use simpler patterns.

In [13]:
text = "The patient is sleeping in his mother's house."
doc = nlp(text)

In [14]:
for token in doc:
    print(token, token._.concept_tag)

The 
patient PATIENT
is 
sleeping RESIDES
in RESIDES
his 
mother FAMILY
's 
house RESIDENCE
. 
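To make the normalization concrete, the sketch below compares the two printed outputs. The token/tag pairs are copied from the outputs above; the `tag_sequence` helper is a hypothetical name used only for this illustration.

```python
# Token/tag pairs copied from the two printed outputs above.
veteran_sentence = [
    ("The", ""), ("Veteran", "PATIENT"), ("is", ""), ("staying", "RESIDES"),
    ("at", "RESIDES"), ("his", ""), ("father", "FAMILY"), ("'s", ""),
    ("apartment", "RESIDENCE"), (".", ""),
]
patient_sentence = [
    ("The", ""), ("patient", "PATIENT"), ("is", ""), ("sleeping", "RESIDES"),
    ("in", "RESIDES"), ("his", ""), ("mother", "FAMILY"), ("'s", ""),
    ("house", "RESIDENCE"), (".", ""),
]


def tag_sequence(pairs):
    """Keep only the non-empty concept tags, in order."""
    return [tag for _, tag in pairs if tag]


same_tags = tag_sequence(veteran_sentence) == tag_sequence(patient_sentence)
same_words = [text for text, _ in veteran_sentence] == [text for text, _ in patient_sentence]
print(tag_sequence(veteran_sentence))
print("same tags:", same_tags, "| same words:", same_words)
```

The two sentences share no content words in the tagged positions, yet their tag sequences are identical, so a single tag-based rule can cover both.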


Here is the number of rules in each concept tag category:

In [15]:
from collections import Counter
c = Counter([rule.category for rule in concept_tagger.rules])
c.most_common()

[('FAMILY', 46),
 ('RESIDENCE', 11),
 ('HOMELESS_LOCATION', 11),
 ('TEMPORARY_HOUSING', 6),
 ('RESIDES', 3),
 ('PATIENT', 3),
 ('HOMELESSNESS', 2),
 ('EMPLOYMENT', 1)]

## TargetMatcher
Target concepts are stored in a spaCy doc as Spans in `doc.ents`. There are numerous ways to add entities to a `Doc`. `TargetMatcher` is a medspaCy class which wraps together [spaCy's rule-based methods](https://spacy.io/usage/rule-based-matching/) and general regular expressions.

Similar to the `ConceptTagger` component, we can access the `TargetMatcher` and its rules:

In [16]:
target_matcher = nlp.get_pipe("medspacy_target_matcher")
target_matcher

<medspacy.target_matcher.target_matcher.TargetMatcher at 0x7fd12338c910>

In [17]:
target_matcher.rules[:10]

[TargetRule(literal="homeless", category="EVIDENCE_OF_HOMELESSNESS", pattern=[{'LOWER': {'REGEX': 'homeless'}}], attributes=None, on_match=None),
 TargetRule(literal="chronic homelessness", category="EVIDENCE_OF_HOMELESSNESS", pattern=[{'LOWER': {'REGEX': '^chronic'}}, {'LOWER': {'REGEX': '^homeless'}}], attributes=None, on_match=None),
 TargetRule(literal="literally homeless", category="EVIDENCE_OF_HOMELESSNESS", pattern=None, attributes=None, on_match=None),
 TargetRule(literal="homeless veteran", category="EVIDENCE_OF_HOMELESSNESS", pattern=None, attributes=None, on_match=None),
 TargetRule(literal="sleep in <HOMELESS_LOCATION>", category="EVIDENCE_OF_HOMELESSNESS", pattern=[{'_': {'concept_tag': 'RESIDES'}, 'OP': '+'}, {'OP': '?'}, {'_': {'concept_tag': 'HOMELESS_LOCATION'}}], attributes=None, on_match=None),
 TargetRule(literal="sleep in park", category="EVIDENCE_OF_HOMELESSNESS", pattern=[{'LEMMA': 'sleep'}, {'LOWER': 'in'}, {'POS': 'DET', 'OP': '?'}, {'LOWER': 'park'}], attribut

Let's look at a few specific rules. The first rule will match any token which contains "homeless", as defined in the `pattern` argument:

```python
TargetRule(literal="homeless", category="EVIDENCE_OF_HOMELESSNESS", pattern=[{'LOWER': {'REGEX': 'homeless'}}], attributes=None, on_match=None)
```
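The difference between this unanchored regex and the anchored `^homeless` regex used by the "chronic homelessness" rule in the list above can be sketched with Python's `re` module. This assumes spaCy's `REGEX` operator searches within the token text instead of requiring a full match; the `regex_matches` helper and the token string "nonhomeless" are hypothetical and exist only for illustration.

```python
import re

unanchored = re.compile("homeless")   # regex from the first rule
anchored = re.compile("^homeless")    # regex from the "chronic homelessness" rule


def regex_matches(pattern, token_text):
    """Mimic a 'LOWER' + 'REGEX' check: search the lowercased token text."""
    return bool(pattern.search(token_text.lower()))


results = {}
for token_text in ["homeless", "Homelessness", "nonhomeless"]:
    results[token_text] = (
        regex_matches(unanchored, token_text),
        regex_matches(anchored, token_text),
    )
    print(token_text, results[token_text])
```

Under that assumption, the unanchored regex accepts any token containing "homeless", while the anchored one only accepts tokens that begin with it.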

When an entity is added using a target rule, we can see the original rule in the `span._.target_rule` attribute.

In [18]:
doc = nlp("Homelessness")
print(doc.ents)
print(doc.ents[0].label_)
print(doc.ents[0]._.target_rule)


(Homelessness,)
EVIDENCE_OF_HOMELESSNESS
TargetRule(literal="homeless", category="EVIDENCE_OF_HOMELESSNESS", pattern=[{'LOWER': {'REGEX': 'homeless'}}], attributes=None, on_match=None)


The second rule matches a two-token sequence: a token starting with "chronic" followed by a token starting with "homeless":

In [19]:
doc = nlp("The Veteran is chronically homeless")
print(doc.ents)
print(doc.ents[0].label_)
print(doc.ents[0]._.target_rule)

(chronically homeless,)
EVIDENCE_OF_HOMELESSNESS
TargetRule(literal="chronic homelessness", category="EVIDENCE_OF_HOMELESSNESS", pattern=[{'LOWER': {'REGEX': '^chronic'}}, {'LOWER': {'REGEX': '^homeless'}}], attributes=None, on_match=None)


Let's return to one of the sentences we saw earlier. We already saw how the concept tagger assigns a tag to each token; now let's see which entity is extracted and what the rule looks like. The `pattern` checks the `concept_tag` attribute and matches an optional **"at"** followed by the token sequence **"FAMILY" "'s" "RESIDENCE"**, which covers both **"father's apartment"** and **"mother's house"**:

In [20]:
text = "The Veteran is staying at his father's apartment."
doc = nlp(text)
print(doc.ents)
print(doc.ents[0].label_)
doc.ents[0]._.target_rule

(father's apartment,)
DOUBLING_UP


TargetRule(literal="at <FAMILY> <RESIDENCE>", category="DOUBLING_UP", pattern=[{'LOWER': 'at', 'OP': '?'}, {'_': {'concept_tag': 'FAMILY'}, 'OP': '+'}, {'LOWER': "'s"}, {'_': {'concept_tag': 'RESIDENCE'}, 'OP': '+'}], attributes=None, on_match=None)

In [21]:
text = "The patient is sleeping in his mother's house."
doc = nlp(text)
print(doc.ents)
print(doc.ents[0].label_)
doc.ents[0]._.target_rule

(mother's house,)
DOUBLING_UP


TargetRule(literal="at <FAMILY> <RESIDENCE>", category="DOUBLING_UP", pattern=[{'LOWER': 'at', 'OP': '?'}, {'_': {'concept_tag': 'FAMILY'}, 'OP': '+'}, {'LOWER': "'s"}, {'_': {'concept_tag': 'RESIDENCE'}, 'OP': '+'}], attributes=None, on_match=None)
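The way one tag-based pattern covers both sentences can be sketched without spaCy. The code below is a rough analogy, not how the spaCy matcher or medspaCy works internally: each token is encoded as one character and an ordinary regular expression plays the role of the token pattern. The names `symbol`, `TOY_PATTERN`, and `find_span` are hypothetical.

```python
import re


def symbol(text, tag):
    """Encode one token as a single character for the toy pattern."""
    if tag == "FAMILY":
        return "F"
    if tag == "RESIDENCE":
        return "R"
    if text.lower() == "'s":
        return "S"
    if text.lower() == "at":
        return "a"
    return "x"


# a? F+ S R+ mirrors: optional "at", one or more FAMILY tokens,
# the token "'s", then one or more RESIDENCE tokens.
TOY_PATTERN = re.compile(r"a?F+SR+")


def find_span(pairs):
    """Return the token texts of the first match, or None if nothing matches."""
    encoded = "".join(symbol(text, tag) for text, tag in pairs)
    match = TOY_PATTERN.search(encoded)
    if match is None:
        return None
    # One character per token, so string offsets are token offsets.
    return [text for text, _ in pairs[match.start():match.end()]]


veteran_sentence = [
    ("The", ""), ("Veteran", "PATIENT"), ("is", ""), ("staying", "RESIDES"),
    ("at", "RESIDES"), ("his", ""), ("father", "FAMILY"), ("'s", ""),
    ("apartment", "RESIDENCE"), (".", ""),
]
patient_sentence = [
    ("The", ""), ("patient", "PATIENT"), ("is", ""), ("sleeping", "RESIDES"),
    ("in", "RESIDES"), ("his", ""), ("mother", "FAMILY"), ("'s", ""),
    ("house", "RESIDENCE"), (".", ""),
]

print(find_span(veteran_sentence))
print(find_span(patient_sentence))
```

In the first sentence the optional "at" is not part of the span because "his" sits between "at" and the FAMILY token, which is consistent with the extracted entity being "father's apartment".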

Let's go through the texts used in the introduction, visualize the docs, and see what target rules were used:

In [22]:
texts = [
    "The veteran is doing well in her new apartment.",
    "He has paid his rent.",
    "He signed a lease",
    "Veteran slept on the streets.",
    "The patient is currently literally homeless.",
    "Spent last night at the Mission.",
    "Got a bed at a shelter downtown.",
    "He stayed at his mother's house",
    "Cannot pay the upcoming rent",
    "Got an eviction notice.",
    "Patient with a history of homelessness",
    "Are you in a house, apartment, or room?",
    "Here to discuss his housing situation.",
    "She lives in an apartment building",
    "The patient is not currently homeless"
]

In [23]:
docs = list(nlp.pipe(texts))

In [26]:
for doc in docs:
    visualize_ent(doc, colors=ENT_COLORS)
    for ent in doc.ents:
        print(ent._.target_rule)

TargetRule(literal="<HIS/HER> residence", category="EVIDENCE_OF_HOUSING", pattern=[{'LOWER': {'IN': ['his', 'her']}}, {'POS': 'ADJ', 'OP': '*', 'LOWER': {'NOT_IN': ['temporary']}}, {'_': {'concept_tag': 'RESIDENCE'}, 'OP': '+', 'LOWER': {'NOT_IN': ['housing', 'home']}}, {'POS': 'NOUN', 'OP': '!'}], attributes=None, on_match=None)


TargetRule(literal="rent", category="EVIDENCE_OF_HOUSING", pattern=None, attributes=None, on_match=None)


TargetRule(literal="signed a lease", category="EVIDENCE_OF_HOUSING", pattern=[{'LEMMA': 'sign'}, {'POS': 'DET'}, {'LOWER': 'lease'}], attributes=None, on_match=None)


TargetRule(literal="sleep in <HOMELESS_LOCATION>", category="EVIDENCE_OF_HOMELESSNESS", pattern=[{'_': {'concept_tag': 'RESIDES'}, 'OP': '+'}, {'OP': '?'}, {'_': {'concept_tag': 'HOMELESS_LOCATION'}}], attributes=None, on_match=None)


TargetRule(literal="literally homeless", category="EVIDENCE_OF_HOMELESSNESS", pattern=None, attributes=None, on_match=None)


TargetRule(literal="at the <NOUN> mission", category="TEMPORARY_HOUSING", pattern=[{'LOWER': {'IN': ['at', 'in']}}, {'LOWER': 'the', 'OP': '?'}, {'POS': {'IN': ['PROPN', 'NOUN', 'ADJ']}, 'OP': '*'}, {'LOWER': 'mission'}], attributes=None, on_match=None)


TargetRule(literal="shelter", category="TEMPORARY_HOUSING", pattern=[{'LOWER': {'IN': ['homeless', 'community']}, 'OP': '?'}, {'LOWER': 'shelter'}], attributes=None, on_match=None)


TargetRule(literal="at <FAMILY> <RESIDENCE>", category="DOUBLING_UP", pattern=[{'LOWER': 'at', 'OP': '?'}, {'_': {'concept_tag': 'FAMILY'}, 'OP': '+'}, {'LOWER': "'s"}, {'_': {'concept_tag': 'RESIDENCE'}, 'OP': '+'}], attributes=None, on_match=None)


TargetRule(literal="rent", category="EVIDENCE_OF_HOUSING", pattern=None, attributes=None, on_match=None)


TargetRule(literal="eviction notice", category="RISK_OF_HOMELESSNESS", pattern=None, attributes=None, on_match=None)


TargetRule(literal="homeless", category="EVIDENCE_OF_HOMELESSNESS", pattern=[{'LOWER': {'REGEX': 'homeless'}}], attributes=None, on_match=None)


TargetRule(literal="house", category="EVIDENCE_OF_HOUSING", pattern=None, attributes=None, on_match=None)
TargetRule(literal="apartment", category="EVIDENCE_OF_HOUSING", pattern=None, attributes=None, on_match=None)


TargetRule(literal="housing situation", category="IGNORE", pattern=None, attributes=None, on_match=None)


TargetRule(literal="apartment", category="EVIDENCE_OF_HOUSING", pattern=None, attributes=None, on_match=None)


TargetRule(literal="homeless", category="EVIDENCE_OF_HOMELESSNESS", pattern=[{'LOWER': {'REGEX': 'homeless'}}], attributes=None, on_match=None)


Here is the number of target rules in each entity category. Note that some spans are matched specifically so that they can be ignored; these rules disambiguate certain phrases or problematic texts.

In [27]:
c = Counter([rule.category for rule in target_matcher.rules])
c.most_common()

[('EVIDENCE_OF_HOUSING', 69),
 ('IGNORE', 59),
 ('EVIDENCE_OF_HOMELESSNESS', 51),
 ('TEMPORARY_HOUSING', 37),
 ('RISK_OF_HOMELESSNESS', 33),
 ('DOUBLING_UP', 9),
 ('HOMELESSNESS_HEALTHCARE_SERVICE', 7),
 ('HEADER', 3)]