# 1. Target Matching & Concept Extraction

# Overview
In this notebook, we'll look at how we can extract mentions of COVID-19 from clinical text. 

There are multiple methods for extracting concepts from clinical text. We implemented a rule-based system, in which we curate patterns to match spans of text that represent our clinical concepts of interest. spaCy offers useful methods for [rule-based matching](https://spacy.io/usage/rule-based-matching/). We'll use extensions of these methods to identify mentions of COVID-19 in text. 

In [1]:
import cov_bsv
from cov_bsv import visualize_doc

In [2]:
nlp = cov_bsv.load(enable=["tagger", "parser"], load_rules=False)

In [3]:
nlp.pipe_names

['tagger', 'parser']

In [4]:
text = """Patient presents to be tested for COVID-19. 
His wife recently tested positive for novel coronavirus.

SARS-COV-2 results came back positive. 
"""

In [5]:
print(text)

Patient presents to be tested for COVID-19. 
His wife recently tested positive for novel coronavirus.

SARS-COV-2 results came back positive. 



MedSpaCy extends spaCy's rule-based matching with the `TargetMatcher` class, which we'll use to extract our target concepts. Rules are defined with a class called `TargetRule`. Extracted concepts are stored as `Span` objects in `doc.ents`.

In [6]:
from medspacy.ner import TargetMatcher, TargetRule

In [7]:
target_matcher = TargetMatcher(nlp)
target_matcher

<target_matcher.target_matcher.TargetMatcher at 0x11981bf98>

In [8]:
nlp.add_pipe(target_matcher)

Target rules require two positional arguments:

- `literal`: A span of text to match in the text (case insensitive)
- `category`: The label to assign to extracted concepts

Let's define rules now to extract COVID-19 from our example text. Note that there are several forms and synonyms.

You can also define token patterns in the `pattern` argument, which will override the phrase in `literal` (see the [spaCy documentation](https://spacy.io/usage/rule-based-matching/) on rule-based matching for more details):

In [9]:
target_rules = [
    TargetRule(literal="SARS-COV-2", category="COVID-19"),
    TargetRule(literal="novel coronavirus", category="COVID-19"),
    TargetRule(literal="COVID-19", category="COVID-19",
              pattern=[{"LOWER": {"REGEX": "covid-?19"}}]),
    
]

In [10]:
target_matcher.add(target_rules)

In [11]:
doc = nlp(text)

In [12]:
cov_bsv.visualize_doc(doc)
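
To make the matching behaviour of these three rules concrete, here is a minimal plain-Python sketch. It is not how MedSpaCy implements matching: it works on characters rather than spaCy tokens, and the names `LITERALS`, `COVID_REGEX` and `find_mentions` are invented for this illustration. It only shows that the literals match case-insensitively and that the regular expression `covid-?19` covers spellings with and without the hyphen.

```python
import re

# Plain-Python stand-ins for the three rules above (hypothetical names,
# character-based rather than token-based like spaCy's matcher).
LITERALS = ["SARS-COV-2", "novel coronavirus"]  # matched case-insensitively
COVID_REGEX = re.compile(r"covid-?19")          # same regex as the `pattern` rule


def find_mentions(text):
    """Return sorted (start, end, matched_text) tuples for every rule hit."""
    lowered = text.lower()  # lowercase once, like matching on the LOWER attribute
    hits = []
    for literal in LITERALS:
        for m in re.finditer(re.escape(literal.lower()), lowered):
            hits.append((m.start(), m.end(), text[m.start():m.end()]))
    for m in COVID_REGEX.finditer(lowered):
        hits.append((m.start(), m.end(), text[m.start():m.end()]))
    return sorted(hits)


sample = ("Tested for COVID-19. Wife positive for Novel Coronavirus. "
          "sars-cov-2 positive; moved to covid19 ward.")
mentions = [matched for _, _, matched in find_mentions(sample)]
print(mentions)
```

All four spellings in `sample` are found, each reported with its original casing.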

# Concept Tagging
There are many slight variations and combinations of concepts. To help reduce the complexity of the rules, one further step we took in our pipeline was to add a `concept_tag` attribute to each token. This acts essentially as a semantic label and allows us to define simpler rules to match variations and synonyms.

For example, let's say that we want to match spans of text which contain COVID-19 along with an associated diagnosis, such as **"pneumonia"** or **"respiratory failure"**. Rather than enumerating every combination of **"COVID-19"**/**"SARS-COV-2"** + **"pneumonia"**/**"pna"**/**"respiratory failure"** ..., we will first assign each token in those shorter phrases a `concept_tag` representing its label. MedSpaCy provides the `ConceptTagger` for this, and it behaves very similarly to the `TargetMatcher`:

In [13]:
text2 = """Patient admitted due to COVID-19 pneumonia.
Diagnoses:
- SARS-COV-2 respiratory failure
"""

In [14]:
from medspacy.ner import ConceptTagger

In [15]:
concept_tagger = ConceptTagger(nlp)

In [16]:
concept_tag_rules = [
    TargetRule("COVID-19", "COVID-19"),
    TargetRule("SARS-COV-2", "COVID-19"),
    TargetRule("novel coronavirus", "COVID-19"),
    TargetRule("pneumonia", "ASSOCIATED_DIAGNOSIS",
              pattern=[{"LOWER": {"IN": ["pneumonia", "pna"]}}]),
    TargetRule("respiratory failure", "ASSOCIATED_DIAGNOSIS"),
]

In [17]:
concept_tagger.add(concept_tag_rules)
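
To get a feel for how much enumeration the concept tags save, the short sketch below counts the literal phrases we would otherwise need for the synonyms used in this notebook. The variable names are illustrative only, and the tag-based count assumes at most one tagging rule per synonym plus one sequence rule per word order.

```python
from itertools import product

# Synonyms taken from the examples in this notebook.
covid_terms = ["COVID-19", "SARS-COV-2", "novel coronavirus"]
diagnosis_terms = ["pneumonia", "pna", "respiratory failure"]

# Without concept tags: one literal phrase per combination, in both word orders.
pairs = list(product(covid_terms, diagnosis_terms))
literal_phrases = [f"{c} {d}" for c, d in pairs] + [f"{d} {c}" for c, d in pairs]

# With concept tags: one tagging rule per synonym (an upper bound),
# plus two tag-sequence rules, one for each word order.
tag_based_rules = len(covid_terms) + len(diagnosis_terms) + 2

print(len(literal_phrases), "literal phrases vs.", tag_based_rules, "tag-based rules")
```

The gap widens quickly: every new synonym adds one tagging rule, but it would add a literal phrase for each term on the other side and each word order.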

We'll add it to our pipeline preceding the target matcher:

In [18]:
nlp.add_pipe(concept_tagger, before="target_matcher")

In [19]:
nlp.pipe_names

['tagger', 'parser', 'concept_tagger', 'target_matcher']

In [20]:
doc = nlp(text2)

Now let's look at the concept tags assigned to each token:

In [21]:
for token in doc:
    print(token, token._.concept_tag, sep=" -> ")

Patient -> 
admitted -> 
due -> 
to -> 
COVID-19 -> COVID-19
pneumonia -> ASSOCIATED_DIAGNOSIS
. -> 

 -> 
Diagnoses -> 
: -> 

 -> 
- -> 
SARS -> COVID-19
- -> COVID-19
COV-2 -> COVID-19
respiratory -> ASSOCIATED_DIAGNOSIS
failure -> ASSOCIATED_DIAGNOSIS

 -> 


Now we can define rules that match **one or more COVID-19 tokens** on their own, or next to **one or more associated diagnosis tokens** in either order:

In [22]:
new_target_rules = [
    TargetRule(literal="<COVID-19>", category="COVID-19",
              pattern=[
                  {"_": {"concept_tag": "COVID-19"}, "OP": "+"},
              ]),
    TargetRule(literal="<ASSOCIATED_DIAGNOSIS> <COVID-19>", category="COVID-19",
              pattern=[
                  {"_": {"concept_tag": "ASSOCIATED_DIAGNOSIS"}, "OP": "+"},
                  {"_": {"concept_tag": "COVID-19"}, "OP": "+"},
              ]),
    TargetRule(literal="<COVID-19> <ASSOCIATED_DIAGNOSIS>", category="COVID-19",
              pattern=[
                  {"_": {"concept_tag": "COVID-19"}, "OP": "+"},
                  {"_": {"concept_tag": "ASSOCIATED_DIAGNOSIS"}, "OP": "+"},
              ]),
]

In [23]:
target_matcher.add(new_target_rules)

In [24]:
doc2 = nlp(text2)

In [25]:
cov_bsv.visualize_doc(doc2)

In [26]:
for ent in doc2.ents:
    print(ent)
    for token in ent:
        print("\t", token, token._.concept_tag)
    print()

COVID-19 pneumonia
	 COVID-19 COVID-19
	 pneumonia ASSOCIATED_DIAGNOSIS

SARS-COV-2 respiratory failure
	 SARS COVID-19
	 - COVID-19
	 COV-2 COVID-19
	 respiratory ASSOCIATED_DIAGNOSIS
	 failure ASSOCIATED_DIAGNOSIS
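
The tag-sequence idea behind these rules can be sketched without spaCy. The function below is a hypothetical helper (`match_tag_sequence` is not part of MedSpaCy) that scans a list of `(token, concept_tag)` pairs for one or more tokens carrying a first tag followed by one or more tokens carrying a second tag, which is what the `"OP": "+"` patterns express. The token list is copied from the concept tag output above, with whitespace tokens left out.

```python
def match_tag_sequence(tagged_tokens, first_tag, second_tag):
    """Find runs of `first_tag` tokens followed by runs of `second_tag` tokens.

    tagged_tokens is a list of (text, tag) pairs. Returns (start, end) token
    index pairs with `end` exclusive, taking the longest run at each position.
    """
    spans = []
    i, n = 0, len(tagged_tokens)
    while i < n:
        j = i
        while j < n and tagged_tokens[j][1] == first_tag:   # one or more first_tag
            j += 1
        k = j
        while k < n and tagged_tokens[k][1] == second_tag:  # one or more second_tag
            k += 1
        if j > i and k > j:
            spans.append((i, k))
            i = k
        else:
            i += 1
    return spans


tagged = [
    ("Patient", ""), ("admitted", ""), ("due", ""), ("to", ""),
    ("COVID-19", "COVID-19"), ("pneumonia", "ASSOCIATED_DIAGNOSIS"), (".", ""),
    ("-", ""), ("SARS", "COVID-19"), ("-", "COVID-19"), ("COV-2", "COVID-19"),
    ("respiratory", "ASSOCIATED_DIAGNOSIS"), ("failure", "ASSOCIATED_DIAGNOSIS"),
]

spans = match_tag_sequence(tagged, "COVID-19", "ASSOCIATED_DIAGNOSIS")
for start, end in spans:
    print([text for text, _ in tagged[start:end]])
```

Swapping the two tag arguments gives the reversed order (diagnosis first, then COVID-19).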



# Marking spans to ignore
Some spans of text are going to be problematic for our pipeline. For example, phrases such as **"COVID-19 infection protocols"** or **"COVID-19 pandemic"** do not refer to a diagnosis of COVID-19. This also applies to modifiers, such as **"negative pressure room"**, which could be confused with a negative status.

The `target_matcher` will extract the longest matching span of text. To exclude these phrases, then, we will write rules using an **"IGNORE"** label to differentiate these longer, more specific spans from normal mentions of COVID-19.


In [27]:
ignore_rules = [
    TargetRule("COVID-19 infection protocols", "IGNORE"),
    TargetRule("COVID-19 pandemic", "IGNORE"),
    TargetRule("negative pressure room", "IGNORE"),
]

In [28]:
target_matcher.add(ignore_rules)

In [29]:
texts = [
    "Hospital is following COVID-19 infection protocols.",
    "He was laid off of his job due to the COVID-19 pandemic.",
    "Patient placed in negative pressure room due to positive COVID status."
]

In [30]:
docs = list(nlp.pipe(texts))

In [31]:
for doc in docs:
    visualize_doc(doc)
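
The effect of the longest-match rule can be sketched in plain Python. This is an illustration under stated assumptions, not MedSpaCy's implementation: `keep_longest` is a hypothetical helper, spans are `(start, end, label)` token indices with `end` exclusive, and dropping `IGNORE` spans afterwards is assumed to happen in some later step.

```python
def keep_longest(candidates):
    """Keep the longest spans and drop any candidate that overlaps a kept one."""
    kept = []
    # Longest first; ties are broken by the earliest start.
    for span in sorted(candidates, key=lambda s: (-(s[1] - s[0]), s[0])):
        overlaps = any(span[0] < k[1] and k[0] < span[1] for k in kept)
        if not overlaps:
            kept.append(span)
    return sorted(kept)


def drop_ignored(spans):
    """Assumed downstream step: discard spans labelled IGNORE."""
    return [s for s in spans if s[2] != "IGNORE"]


# "He was laid off of his job due to the COVID-19 pandemic ."
# Token 10 is "COVID-19" and token 11 is "pandemic"; both rules fire.
pandemic_candidates = [(10, 11, "COVID-19"), (10, 12, "IGNORE")]
pandemic_mentions = drop_ignored(keep_longest(pandemic_candidates))

# Hypothetical sentence: "Patient tested positive for COVID-19 ."
# Only the plain COVID-19 rule fires, on token 4.
diagnosis_candidates = [(4, 5, "COVID-19")]
diagnosis_mentions = drop_ignored(keep_longest(diagnosis_candidates))

print(pandemic_mentions)
print(diagnosis_mentions)
```

Because the `IGNORE` span is longer, it wins over the shorter COVID-19 span it contains, so no COVID-19 mention survives in the pandemic sentence.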

# Next Steps
We can now identify phrases signifying COVID-19 in our text. However, even though a text mentions COVID-19, that doesn't mean the patient is diagnosed with COVID-19. In the next notebook we'll see how we can use the context of a mention to assert attributes like positive status or negation.

[02-attribute-assertion.ipynb](02-attribute-assertion.ipynb)