# The goal of this notebook is to demonstrate how we can set up rules for concepts related to pneumonia using the powerful token-based matching in spaCy.
## Since this matcher operates on tokens but can also use regular expressions, we can do most of what we have done so far in class (a la pyConTextNLP).
## However, we can also go deeper than 'surface level' text processing and take advantage of lemmas (base forms), parts of speech, etc.
## Finally, this [Matcher](https://spacy.io/api/matcher) is not the fastest method if we have a very large vocabulary and we're only using surface forms (rather than token attributes). If you are working with a large dictionary, the spaCy authors recommend the [PhraseMatcher](https://spacy.io/api/phrasematcher).
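
As a rough illustration of the dictionary-style alternative mentioned above, here is a minimal PhraseMatcher sketch. The term list and the `phrase_*` names are made up for this example, and it assumes the spaCy v3 call form in which patterns are passed to `add` as a list. A blank English pipeline is used because only tokenization is needed.

```python
import spacy
from spacy.matcher import PhraseMatcher

# A blank English pipeline only tokenizes, which is all we need here.
nlp_blank = spacy.blank("en")

# Hypothetical term list; a real dictionary could hold thousands of entries.
terms = ["consolidation", "ground glass opacity", "infiltrate"]

# attr="LOWER" compares lowercased token text, so matching ignores case.
phrase_matcher = PhraseMatcher(nlp_blank.vocab, attr="LOWER")
phrase_matcher.add("FINDING", [nlp_blank.make_doc(term) for term in terms])

phrase_doc = nlp_blank("Patchy Consolidation and ground glass opacity in the left lower lobe")
phrase_hits = [phrase_doc[start:end].text for _, start, end in phrase_matcher(phrase_doc)]
print(phrase_hits)
```

Each term is tokenized once up front, so this approach tends to scale better than writing one token pattern per dictionary entry.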

In [None]:
import spacy
from spacy.matcher import Matcher

# Let's start working with some simple documents.  Let's see what happens when we create a Matcher without adding rules to it

In [None]:
# Load our default English pipeline
nlp = spacy.load("en_core_web_sm")

In [None]:
def print_matches(doc, matches):
    print('Total matches found: {}'.format(len(matches)))
    
    # Iterate over the matches and print the span text
    for match_id, start, end in matches:
        print("Match found in text [from tokens {0} to {1}] : {2}".format(start, end, doc[start : end].text))

In [None]:
matcher = Matcher(nlp.vocab)

print('Total rules in matcher: {}'.format(len(matcher)))

no_rules_doc = nlp("Patchy consolidation in the left lower lobe")

matches = matcher(no_rules_doc)

print_matches(no_rules_doc, matches)

# Before we move ahead, let's point out some documentation that describes the valid pattern types we'll use below:
spaCy rule-based matching:

https://spacy.io/usage/rule-based-matching

# Pattern syntax
## Each pattern is a list of one or more dictionaries, where each dictionary describes one token and contains the key/value attributes that token must satisfy for the sequence to be considered a match. Note that since spaCy has already run a tokenizer before we do any matching, we do not need to worry about word boundaries (\b in regular expressions); the tokenizer has already handled them. No tokenizer is perfect, but this is one less thing we need to worry about.

For example, this pattern matches the single word 'Hello', but only in this exact case:

[{"TEXT": "Hello"}]

But this pattern matches the two words 'hello world' regardless of the case of any of the characters in these tokens:

[{"LOWER": "hello"}, {"LOWER": "world"}]
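
To make the difference between the two patterns concrete, here is a small sketch that runs both against the same text. It assumes the spaCy v3 call form, where patterns are passed to `add` as a list, and uses a blank English pipeline since only tokenization is involved. The `demo_*`, `exact_*` and `lower_*` names are invented for this example.

```python
import spacy
from spacy.matcher import Matcher

demo_nlp = spacy.blank("en")  # tokenizer only; no trained model required
demo_doc = demo_nlp("Hello world! hello World! HELLO WORLD!")

# Case-sensitive match on a single token.
exact_matcher = Matcher(demo_nlp.vocab)
exact_matcher.add("HELLO_EXACT", [[{"TEXT": "Hello"}]])

# Case-insensitive match on two consecutive tokens.
lower_matcher = Matcher(demo_nlp.vocab)
lower_matcher.add("HELLO_WORLD", [[{"LOWER": "hello"}, {"LOWER": "world"}]])

exact_hits = [demo_doc[s:e].text for _, s, e in exact_matcher(demo_doc)]
lower_hits = [demo_doc[s:e].text for _, s, e in lower_matcher(demo_doc)]
print(exact_hits)
print(lower_hits)
```

The 'TEXT' pattern should find only the first "Hello", while the 'LOWER' pattern should find all three two-word greetings.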

# Here are some of the "token attributes" that can be used to match:
* 'TEXT' - The exact, case sensitive text of the token
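* 'LOWER' - The lowercased text of the token, which makes the match case insensitive
* 'POS' - The coarse-grained part of speech of the token (e.g. 'NOUN', 'ADJ', 'VERB')
* 'TAG' - The fine-grained part-of-speech tag of the token (e.g. 'NN' or 'JJ' in spaCy's English pipelines)
* 'LEMMA' - The lemma, or base form, of a word (e.g. the lemma of "walking" is "walk")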
* 'ENT_TYPE' - The entity type of the token (we will not be using this today, but this can be powerful if you have a good entity component in your pipeline)

# So let's start out simply.  Update the pattern below using 'LOWER' to match all three casings of 'consolidation' in this example sentence:

In [None]:
consolidation_case_text = 'Is consolidation the same as Consolidation and CONSOLIDATION?'

In [None]:
consolidation_lower_pattern = [{"CHANGE_ME": "CHANGE_ME"}]

matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("CONSOLIDATION", [consolidation_lower_pattern])

print('Total rules in matcher: {}'.format(len(matcher)))

case_doc = nlp(consolidation_case_text)

matches = matcher(case_doc)

print_matches(case_doc, matches)

# Each token attribute must be given a value to match.  This can either be a string, like our examples above, or a dictionary containing other matching criteria, like a regular expression.  For example, this will match any single token that contains "walk" or "talk":

[{"TEXT": {"REGEX": "(walk|talk)"}}]
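
Here is a minimal sketch of that regular expression pattern in action. It assumes the spaCy v3 call form for `add` and a blank English pipeline. The sentence and the `regex_*` names are invented for this example.

```python
import spacy
from spacy.matcher import Matcher

regex_nlp = spacy.blank("en")  # tokenizer only
regex_matcher = Matcher(regex_nlp.vocab)

# The regular expression is searched for within each token's exact text.
regex_matcher.add("WALK_TALK", [[{"TEXT": {"REGEX": "(walk|talk)"}}]])

regex_doc = regex_nlp("She walked on the sidewalk while talking and Walking")
regex_hits = [regex_doc[s:e].text for _, s, e in regex_matcher(regex_doc)]
print(regex_hits)
```

Because the expression may match anywhere inside a token, "sidewalk" is picked up too. Because 'TEXT' is case sensitive, the capitalized "Walking" is not. Anchors such as `^` and `$`, or applying the expression to 'LOWER' instead, are ways to tighten or loosen this.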

# So instead of a single string value, let's fill in the regular expression criteria below to match all of the forms of 'infiltrate':

In [None]:
infiltrate_variants_text = 'There exist not just one infiltrate but two infiltrates.  We have been infiltrated!'

In [None]:
infiltrate_regex_pattern = [{"TEXT": {"CHANGE_ME" : "CHANGE_ME"}}]

matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("INFILTRATE", [infiltrate_regex_pattern])

print('Total rules in matcher: {}'.format(len(matcher)))

case_doc = nlp(infiltrate_variants_text)

matches = matcher(case_doc)

print_matches(case_doc, matches)

# Now let's use some of the deeper power of spacy.  All of the information we've seen so far is at a surface level, meaning that we can see it in the literal text, but there is more to language like syntax, semantics, grammar, etc.  
## Let's use part of speech ('POS') to match a single adjective modifying another concept.  Some of the POS tags that might be useful include:
* 'NOUN'
* 'ADV'
* 'ADJ'
* 'PROPN'
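
As a sketch of how a 'POS' constraint combines with a literal token, the example below matches an adverb followed by the word "hazy". To keep it runnable without a trained model, the part-of-speech values are assigned by hand; in this notebook the loaded pipeline's tagger would normally supply them. The `pos_*` names and the phrase are invented for this example, and the spaCy v3 call form for `add` is assumed.

```python
import spacy
from spacy.matcher import Matcher

pos_nlp = spacy.blank("en")  # no tagger, so POS values are set by hand below
pos_doc = pos_nlp("slightly hazy opacity")

# Hand-assigned tags stand in for what a trained tagger would predict.
for token, pos in zip(pos_doc, ["ADV", "ADJ", "NOUN"]):
    token.pos_ = pos

pos_matcher = Matcher(pos_nlp.vocab)
# Any adverb, followed by the literal word "hazy" in any case.
pos_matcher.add("ADV_HAZY", [[{"POS": "ADV"}, {"LOWER": "hazy"}]])

pos_hits = [pos_doc[s:e].text for _, s, e in pos_matcher(pos_doc)]
print(pos_hits)
```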

# Update the pattern below to match consolidation and a single adjective before it 

In [None]:
consolidation_doc = nlp("Patchy consolidation in the left upper lobe")

In [None]:
# Write a pattern for a single adjective followed by 'consolidation'
consolidation_pattern = [{"CHANGE_ME": "CHANGE_ME"}, {"LOWER": "consolidation"}]

matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("ADJ_CONSOLIDATION", [consolidation_pattern])

print('Total rules in matcher: {}'.format(len(matcher)))

matches = matcher(consolidation_doc)

print_matches(consolidation_doc, matches)

# The Matcher also supports operators that act as quantifiers, much like regular expression quantifiers.  To use one, add the key 'OP' to a token's dictionary alongside any other token attributes you want to match.  The available values are:
* '!'	Negate the pattern, by requiring it to match exactly 0 times.
* '?'	Make the pattern optional, by allowing it to match 0 or 1 times.
* '+'	Require the pattern to match 1 or more times.
* '*'	Allow the pattern to match zero or more times.
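
The sketch below tries two of these operators. It assumes the spaCy v3 call form for `add` and a blank English pipeline. The sentence, the `op_*`-style names, and the use of the `IN` set-membership predicate with a short word list are choices made for this example only.

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

op_nlp = spacy.blank("en")  # tokenizer only
op_doc = op_nlp("No infiltrate in the left upper lobe and no acute infiltrate elsewhere")

# '?' makes the middle token optional.
optional_matcher = Matcher(op_nlp.vocab)
optional_matcher.add(
    "NO_INFILTRATE",
    [[{"LOWER": "no"}, {"LOWER": "acute", "OP": "?"}, {"LOWER": "infiltrate"}]],
)
optional_hits = [op_doc[s:e].text for _, s, e in optional_matcher(op_doc)]

# '+' requires one or more location words before "lobe".
plus_matcher = Matcher(op_nlp.vocab)
plus_matcher.add(
    "LOCATED_LOBE",
    [[{"LOWER": {"IN": ["left", "right", "upper", "lower"]}, "OP": "+"}, {"LOWER": "lobe"}]],
)
plus_spans = [op_doc[s:e] for _, s, e in plus_matcher(op_doc)]
plus_hits = [span.text for span in plus_spans]

# filter_spans keeps the longest span when matches overlap.
longest_hits = [span.text for span in filter_spans(plus_spans)]

print(optional_hits)
print(plus_hits)
print(longest_hits)
```

Note that '+' and '*' can report overlapping matches, here both "upper lobe" and "left upper lobe". Passing the spans through `filter_spans` is one way to keep only the longest.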

# Now modify the pattern below to match 1 or more adjectives before the word lobe

In [None]:
# Now let's use OP to capture multiple ADJECTIVES
lobe_pattern = [{"POS": "ADJ", "CHANGE_ME" : "CHANGE_ME"}, {"LOWER": "lobe"}]

matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("LOBE", [lobe_pattern])

print('Total rules in matcher: {}'.format(len(matcher)))

matches = matcher(consolidation_doc)

print_matches(consolidation_doc, matches)

# Now let's experiment with another linguistic feature below the surface using LEMMA (i.e. the base or dictionary form of a word)
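
As a small sketch of why lemmas are useful, the pattern below matches both the singular and plural of a word with one rule. The lemmas are assigned by hand so that the example runs without a trained model; in this notebook the loaded pipeline would normally supply them, and its predictions may differ. The `lemma_*` names and the phrase are invented for this example, and the spaCy v3 call form for `add` is assumed.

```python
import spacy
from spacy.matcher import Matcher

lemma_nlp = spacy.blank("en")  # no lemmatizer, so lemmas are set by hand below
lemma_doc = lemma_nlp("The lungs and the left lung")

# Hand-assigned lemmas stand in for the output of a trained pipeline.
hand_lemmas = ["the", "lung", "and", "the", "left", "lung"]
for token, lemma in zip(lemma_doc, hand_lemmas):
    token.lemma_ = lemma

lemma_matcher = Matcher(lemma_nlp.vocab)
# One rule covers every surface form that shares the lemma "lung".
lemma_matcher.add("LUNG", [[{"LEMMA": "lung"}]])

lemma_hits = [lemma_doc[s:e].text for _, s, e in lemma_matcher(lemma_doc)]
print(lemma_hits)
```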

In [None]:
infiltrate_text = """A pulmonary infiltrate is a substance denser than air,
    such as pus, blood, or protein, which lingers within the parenchyma of the lungs.
    Pulmonary infiltrates are associated with pneumonia, tuberculosis, and nocardiosis."""

In [None]:
infiltrate_doc = nlp(infiltrate_text)

In [None]:
infiltrate_pattern = [{"CHANGE_ME": "CHANGE_ME"}]

matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("INFILTRATE", [infiltrate_pattern])

print('Total rules in matcher: {}'.format(len(matcher)))

matches = matcher(infiltrate_doc)

print_matches(infiltrate_doc, matches)

# OK, let's go a bit further and combine what has been learned so far to match as many as you can of the following entities (or anatomical locations), along with any adjectives that modify them:
* Consolidation
* Infiltrate
* Opacity
* Any anatomical location (e.g. lobe)

In [None]:
doc_1 = 'patchy Consolidations present around the left lunG.'
doc_2 = 'Hazy opacities in the left lung and streaky opaciTy in the right lower lobe'
doc_3 = 'I saw a single alveolar infiltrate as well as bilateral infiltrates'
doc_4 = 'Linear opacities from the upper left lobe down to the lower right lobe.'
doc_5 = 'I was too busy studying anatomy like lobes, parenchyma, pleural cavity and diaphragm'

example_texts = [doc_1, doc_2, doc_3, doc_4, doc_5]

In [None]:
my_matcher = Matcher(nlp.vocab)

# TODO -- add as many patterns as you would like by calling
# my_matcher.add("my_match_id", [your_pattern_variable])

for i, example_text in enumerate(example_texts, start=1):
    print('Matching for Doc {0}: [{1}]'.format(i, example_text))
    example_doc = nlp(example_text)
    my_matches = my_matcher(example_doc)
    print_matches(example_doc, my_matches)
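
Once a matcher holds several rules, it can help to see which rule produced each match. The `print_matches` helper above ignores `match_id`, but that value can be looked up in the vocabulary's string store to recover the rule name. The sketch below shows one way to do this. The two rules, the sentence, and the `label_*` names are placeholders for this example, and the spaCy v3 call form for `add` is assumed.

```python
import spacy
from spacy.matcher import Matcher

label_nlp = spacy.blank("en")  # tokenizer only
label_matcher = Matcher(label_nlp.vocab)

# Two placeholder rules, each with its own name.
label_matcher.add("LUNG", [[{"LOWER": "lung"}]])
label_matcher.add("LOBE", [[{"LOWER": "lobe"}]])

label_doc = label_nlp("Opacity in the left lung and right lower lobe")

labeled_hits = []
for match_id, start, end in label_matcher(label_doc):
    # match_id is a hash; the string store maps it back to the rule name.
    rule_name = label_nlp.vocab.strings[match_id]
    labeled_hits.append((rule_name, label_doc[start:end].text))

print(labeled_hits)
```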