# NER with spaCy
**"Named Entity Recognition"** is a subtask of NLP where we extract specific named entities from the text. The definition of a "named entity" changes depending on the domain we're working on. We'll look at clinical NER later, but first we'll look at some examples in more general domains.

NER is often performed using news articles as source texts. In this case, named entities are typically proper nouns, such as:
- People
- Geopolitical entities, like countries
- Organizations

We won't go into the details of how NER is implemented in spaCy. If you want to learn more about NER and the various ways it's implemented, a great resource is [Chapter 17.1 of Jurafsky and Martin's textbook "Speech and Language Processing."](https://web.stanford.edu/~jurafsky/slp3/17.pdf)

In [1]:
import spacy
from spacy import displacy

In [2]:
nlp = spacy.load("en_core_web_sm")

Here is an excerpt from an article in the Guardian. We'll process this document with our `nlp` object and then look at which entities are extracted. One way to do this is with spaCy's `displacy` module, which visualizes the results of a spaCy pipeline.

In [3]:
text = """Germany will fight to the last hour to prevent the UK crashing out of the EU without a deal and is willing 
to hear any fresh ideas for the Irish border backstop, the country’s ambassador to the UK has said.
Speaking at a car manufacturers’ summit in London, Peter Wittig said Germany cherished its relationship 
with the UK and was ready to talk about solutions the new prime minister might have for the Irish border problem."""

In [4]:
doc = nlp(text)

In [5]:
displacy.render(doc, style="ent")

We can use spaCy's `explain` function to see definitions of what an entity type is. Look up any entity types that you're not familiar with:

In [6]:
spacy.explain("GPE")

'Countries, cities, states'
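To look up several labels at once, a small loop over `spacy.explain` can help. This is only a sketch: the list of labels below is an illustrative selection, not the full label set of any particular model, so swap in whichever labels appear in your displaCy output.

```python
import spacy

# Illustrative selection of entity labels; replace with the ones you want to look up.
labels = ["PERSON", "ORG", "GPE", "NORP"]

# spacy.explain is a glossary lookup, so no model needs to be loaded here.
definitions = {label: spacy.explain(label) for label in labels}

for label, definition in definitions.items():
    print(f"{label}: {definition}")
```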

The last example comes from a political news article, which is pretty typical for what NER is often trained on and used for. Let's look at another news article, this one with a business focus:

In [7]:
# Example 2
text = """Taco Bell’s latest marketing venture, a pop-up hotel, opened at 10 a.m. Pacific Time Thursday. 
The rooms sold out within two minutes.
The resort has been called “The Bell: A Taco Bell Hotel and Resort.” It’s located in Palm Springs, California."""

In [8]:
doc = nlp(text)

In [9]:
displacy.render(doc, style="ent")

## Discussion
Compare how the NER performs on each of these texts. Can you see any errors? Why do you think it might make those errors?

## Coding Exercise
Write a function that takes a doc as an argument and returns a dictionary mapping each entity type label to a list of the entities with that label in the doc. Try creating a few different doc instances and testing this function out.

**Note**: A doc's entities can be accessed in the attribute `doc.ents`. An entity's label can be accessed in the attribute `ent.label_`.
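As a quick illustration of these two attributes, here is a minimal sketch that lists each entity with its label. To keep it runnable without a trained model, it assumes a blank English pipeline and assigns two entities by hand; the sentence and the spans are made up for this example. With `en_core_web_sm`, `doc.ents` would be filled in by the NER component instead.

```python
import spacy
from spacy.tokens import Span

# Blank pipeline: tokenizer only, no statistical NER (an assumption for this sketch).
nlp = spacy.blank("en")
doc = nlp("Peter Wittig spoke in London.")

# Hand-assigned entities standing in for what a trained model would predict.
doc.ents = [Span(doc, 0, 2, label="PERSON"), Span(doc, 4, 5, label="GPE")]

# Each entity is a Span: ent.text is its string, ent.label_ is its type.
pairs = [(ent.text, ent.label_) for ent in doc.ents]
for text, label in pairs:
    print(text, label)
```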

In [None]:
from collections import defaultdict

def collect_entities(doc):
    """
    """
    d = defaultdict(list)
    # Your code here
    return d

In [None]:
collect_entities(doc)

# Clinical Text
Let's now try using spaCy's built-in NER model on clinical text.

In [None]:
clin_text = "76 year old man with hypotension, CKD Stage 3, status post RIJ line placement and Swan.  "

In [None]:
doc = nlp(clin_text)

In [None]:
displacy.render(doc, style="ent")

**Discussion**
- How did spaCy do with this sentence?
- What do you think caused it to make errors in the classifications?

General purpose NER models are typically made for extracting entities out of news articles. As we saw before, this includes mainly people, organizations, and geopolitical entities. 

**Discussion**
- What are some entity types we are interested in in the clinical domain?
- Does spaCy's out-of-the-box NER handle any of these types?

In [None]:
# Look the component up by name rather than relying on its position in the pipeline
ner = nlp.get_pipe("ner")

In [None]:
ner.labels

# Pattern Matching
It's clear that spaCy's out-of-the-box NER is not going to fit our needs, so we need to take matters into our own hands. spaCy has several methods which enable us to do rule-based matching, while still having access to the many linguistic attributes which are classified by spaCy's statistical models. 

One such method is called the `Matcher`. This is a class which allows us to write rules which will match tokens based on various attributes in order to extract information according to our own needs. The simplest form of this is going to be matching the exact string: for example, match the strings "hypotension" and "CKD Stage 3". However, there may be lots of different variations, and using spaCy's Matcher allows us to be flexible. For example, we may want to match not only "CKD Stage 3", but also stages 1-5. We may also want to handle different abbreviations or capitalizations.

Here is a good demonstration of the Matcher: https://explosion.ai/demos/matcher

Let's demonstrate this by writing a few rules with a Matcher. We'll first write a list of **patterns**. Each pattern is a list of dicts, and each dict represents a single token. The dict maps a certain attribute to a value. The simplest form of this is to just look at the "TEXT" attribute, which matches an exact string:



In [None]:
pattern1 = [{'TEXT': 'hypotension'}]
pattern2 = [{'TEXT': 'CKD'}, {'TEXT': 'Stage'}, {'TEXT': '3'}]

We then add each pattern to the Matcher object and run it on a doc:

In [None]:
from spacy.matcher import Matcher

In [None]:
matcher = Matcher(nlp.vocab)

In [None]:
doc = nlp(clin_text)

In [None]:
doc

In [None]:
# Matcher.add takes the rule name and a list of patterns (spaCy v3 signature)
matcher.add('CLINICAL_PATTERN1', [pattern1])
matcher.add('CLINICAL_PATTERN2', [pattern2])

In [None]:
matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)
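The `match_id` returned with each match is a hash of the rule name, and it can be turned back into the name through the vocabulary's string store. The sketch below shows this, assuming spaCy v3 and a blank English pipeline (tokenizer only), which is enough for `TEXT`-based patterns. It rebuilds the same two patterns so that it runs on its own.

```python
import spacy
from spacy.matcher import Matcher

# Blank pipeline: tokenizer only, which is all that TEXT patterns need.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("CLINICAL_PATTERN1", [[{"TEXT": "hypotension"}]])
matcher.add("CLINICAL_PATTERN2", [[{"TEXT": "CKD"}, {"TEXT": "Stage"}, {"TEXT": "3"}]])

doc = nlp("76 year old man with hypotension, CKD Stage 3, status post RIJ line placement and Swan.")

labeled = []
for match_id, start, end in matcher(doc):
    # match_id is an integer hash; the string store maps it back to the rule name.
    rule_name = nlp.vocab.strings[match_id]
    labeled.append((rule_name, doc[start:end].text))
    print(rule_name, "->", doc[start:end].text)
```

Keeping the rule name next to the matched text makes it possible to tell which rule produced each span once several rules are registered.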

However, we can match on a lot more than just the text, and this is where those linguistic attributes we looked at yesterday come in handy. Open up spaCy's documentation to see more about this:
https://spacy.io/usage/rule-based-matching#adding-patterns-attributes
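As one example of matching on something other than the exact text, the sketch below combines the `LOWER` attribute with the `IN` operator so that a single token pattern covers an abbreviation and its spelled-out form in any capitalization. It assumes spaCy v3 and a blank English pipeline; the rule name `HYPERTENSION` and the sentence are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Blank pipeline: LOWER is a lexical attribute, so no trained model is needed.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One token whose lowercase form is either the abbreviation or the full term.
pattern = [{"LOWER": {"IN": ["htn", "hypertension"]}}]
matcher.add("HYPERTENSION", [pattern])

doc = nlp("Patient has HTN and asthma. Mother had Hypertension and father had htn as well.")
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```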

Now, you'll write a slightly more complex pattern. Try writing a single pattern which matches both 'stage 4 ckd' and 'Stage 3 CKD'.

In [None]:
clin_text2 = "the pt presents for stage 4 ckd. He previously had Stage 3 CKD."

In [None]:
pattern = [
    {'': 'stage'},
    {'': ''},
    {'LOWER': ''},
]

In [None]:
# Matcher.add takes the rule name and a list of patterns (spaCy v3 signature)
matcher.add('CLINICAL_PATTERN', [pattern])

In [None]:
doc = nlp(clin_text2)

In [None]:
matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

# Assignment: Write your own rule-based matcher
Use the `Matcher` class to extract the following concepts from these texts:
- "Procedure"
- "Condition"

You'll first have to identify all of the instances of these concepts in the text below. Then add to the `patterns` list to match all of them.

In [None]:
long_text = (
    "There is continued mild-to-moderate congestive heart failure. "
    
    "87-year-old man with htn and end-stage renal disease. "
    
    "His wife recently died from end stage renal disease. "
    
    "The patient is s/p median sternotomy and right thoracotomy. "
    
    "The pt presents for stage 4 ckd. " 
    
    "He previously had stage 3 CKD."
    
    )

In [None]:
patterns = [
     [{'TEXT': 'htn'}],
    # Add the ckd pattern which you wrote earlier
    [{'': 'stage'}, {'': ''}, {'LOWER': 'ckd'}],
    
    # Add any other patterns

]

In [None]:
matcher = Matcher(nlp.vocab)

In [None]:
# Register every pattern under one rule name (spaCy v3 signature: name, list of patterns)
matcher.add('CLINICAL_PATTERN', patterns)

In [None]:
doc = nlp(long_text)

In [None]:
matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

# A quick note about negation
With pyConText, we looked at how we can detect modifiers in texts such as negation, certainty, and experiencer. These are very important concepts in clinical text, but aren't necessarily as much of a focus in other domains. There is (currently) no specific ConText module in spaCy, but we can do some basic negation detection by using dependency parsing.

Let's see how spaCy parses the following sentence. We can then look for negated terms and children of the negated word.

In [None]:
doc = nlp("There is not cancer.")

In [None]:
displacy.render(doc, style="dep")

In [None]:
#  https://stackoverflow.com/questions/54849111/negation-and-dependency-parsing-with-spacy
negation_tokens = [tok for tok in doc if tok.dep_ == 'neg']
negation_head_tokens = [token.head for token in negation_tokens]

In [None]:
# Here are the negated terms
for token in negation_head_tokens:
    print(f"Negated: {token.text}")
    print("Children:")
    # And here are all of its children
    for child in token.children:
        print('\t', child, child.pos_)
#     print([child for child in token.children])

Let's write a very simple negation rule: We'll say that if a term is negated (such as "**is** not"), then any of its children which are nouns should be considered a negated concept.

In [None]:
def get_negated_concepts(doc):
    negation_concepts = []
    negation_tokens = [tok for tok in doc if tok.dep_ == 'neg']
    negation_head_tokens = [token.head for token in negation_tokens]
    
    for token in negation_head_tokens:
        # And here are all of its children
        for child in token.children:
            if child.pos_ == 'NOUN':
                negation_concepts.append(child)
    return negation_concepts

In [None]:
get_negated_concepts(doc)

This worked! Now let's try it on another sentence:

In [None]:
doc = nlp("The patient does not have pneumonia, hypertension, or heart disease.")

In [None]:
displacy.render(doc, style="dep")

In [None]:
get_negated_concepts(doc)

**Discussion**: How did our negation function do? Were there any concepts which should have been negated, but weren't? Were there any concepts which were negated and shouldn't have been? What does that tell you about this method of negation, and how does it compare to pyConText?