# NER with spaCy
**"Named Entity Recognition"** is a subtask of NLP where we extract specific named entities from the text. The definition of a "named entity" changes depending on the domain we're working on. We'll look at clinical NER later, but first we'll look at some examples in more general domains.

NER is often performed using news articles as source texts. In this case, named entities are typically proper nouns, such as:
- People
- Geopolitical entities, like countries
- Organizations

We won't go into the details of how NER is implemented in spaCy. If you want to learn more about NER and the various ways it's implemented, a great resource is [Chapter 17.1 of Jurafsky and Martin's textbook "Speech and Language Processing."](https://web.stanford.edu/~jurafsky/slp3/17.pdf)

In [1]:
import spacy
from spacy import displacy

In [2]:
nlp = spacy.load("en_core_web_sm")

Here is an excerpt from an article in The Guardian. We'll process this document with our `nlp` object and then look at which entities are extracted. One way to do this is with spaCy's `displacy` module, which visualizes the results of a spaCy pipeline.

In [8]:
text = """Germany will fight to the last hour to prevent the UK crashing out of the EU without a deal and is willing 
to hear any fresh ideas for the Irish border backstop, the country’s ambassador to the UK has said.
Speaking at a car manufacturers’ summit in London, Peter Wittig said Germany cherished its relationship 
with the UK and was ready to talk about solutions the new prime minister might have for the Irish border problem."""

In [6]:
doc = nlp(text)

In [7]:
displacy.render(doc, style="ent")

We can use spaCy's `explain` function to see a short description of an entity type. Look up any entity types that you're not familiar with:

In [9]:
spacy.explain("GPE")

'Countries, cities, states'

The last example comes from a political news article, which is typical of the text that NER models are trained on and applied to. Let's look at another news article, this one with a business focus:

In [10]:
# Example 2
text = """Taco Bell’s latest marketing venture, a pop-up hotel, opened at 10 a.m. Pacific Time Thursday. 
The rooms sold out within two minutes.
The resort has been called “The Bell: A Taco Bell Hotel and Resort.” It’s located in Palm Springs, California."""

In [11]:
doc = nlp(text)

In [12]:
displacy.render(doc, style="ent")

## Discussion
Compare how the NER performs on each of these texts. Can you see any errors? Why do you think it might make those errors?

## Coding Exercise
Write a function that takes a doc as an argument and returns a dictionary mapping each entity type label to a list of the entities in the doc with that label. Try creating a few different doc instances and testing this function out.

**Note**: A doc's entities can be accessed in the attribute `doc.ents`. An entity's label can be accessed in the attribute `ent.label_`.

In [15]:
from collections import defaultdict

def collect_entities(doc):
    """Map each entity label in `doc` to the list of entities with that label."""
    d = defaultdict(list)
    for ent in doc.ents:
        d[ent.label_].append(ent)
    return d

In [16]:
collect_entities(doc)

defaultdict(list,
            {'ORG': [Taco Bell’s, Pacific Time],
             'TIME': [10 a.m., two minutes],
             'DATE': [Thursday],
             'WORK_OF_ART': [The Bell:],
             'PERSON': [Resort],
             'GPE': [Palm Springs, California]})
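
The dictionary above holds spaCy `Span` objects, each of which keeps a reference to its parent doc. When only the strings are needed (for counting or comparing results across docs, say), a variant that stores `ent.text` may be more convenient. The sketch below is a hedged illustration: `FakeEnt` and `FakeDoc` are hypothetical stand-ins that expose only `.text`, `.label_`, and `.ents`, so the example runs without loading a model. A real spaCy doc should work the same way, since the function touches only those attributes.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-ins for spaCy's Span and Doc (assumption: only these attributes are needed).
FakeEnt = namedtuple("FakeEnt", ["text", "label_"])
FakeDoc = namedtuple("FakeDoc", ["ents"])


def collect_entity_texts(doc):
    """Map each entity label to the list of entity strings with that label."""
    grouped = defaultdict(list)
    for ent in doc.ents:
        grouped[ent.label_].append(ent.text)
    # Return a plain dict so that looking up a missing label raises KeyError
    # instead of silently creating an empty list.
    return dict(grouped)


fake_doc = FakeDoc(ents=[
    FakeEnt("Palm Springs", "GPE"),
    FakeEnt("California", "GPE"),
    FakeEnt("Thursday", "DATE"),
])
result = collect_entity_texts(fake_doc)
print(result)
```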

# Clinical Text
Let's now try using spaCy's built-in NER model on clinical text.

In [21]:
clin_text = "76 year old man with hypotension, CKD Stage 3, status post RIJ line placement and Swan.  "

In [22]:
doc = nlp(clin_text)

In [23]:
displacy.render(doc, style="ent")

**Discussion**
- How did spaCy do with this sentence?
- What do you think caused it to make errors in the classifications?

General-purpose NER models are typically built to extract entities from news articles. As we saw before, these are mainly people, organizations, and geopolitical entities.

**Discussion**
- What entity types are we interested in for the clinical domain?
- Does spaCy's out-of-the-box NER handle any of these types?

In [24]:
# Look up the NER component by name rather than by its position in the pipeline
ner = nlp.get_pipe("ner")

In [25]:
ner.labels

('DATE',
 'GPE',
 'LOC',
 'EVENT',
 'ORG',
 'PERCENT',
 'CARDINAL',
 'NORP',
 'LAW',
 'TIME',
 'WORK_OF_ART',
 'QUANTITY',
 'MONEY',
 'PERSON',
 'FAC',
 'ORDINAL',
 'PRODUCT',
 'LANGUAGE')
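
One quick way to approach the second discussion question is a set difference between the labels we want and the labels the model offers. The sketch below copies the label tuple from the output above and uses `CONDITION` and `PROCEDURE` (the two types targeted in the assignment at the end of this notebook) as the desired clinical labels. The variable names are illustrative, not part of spaCy.

```python
# Labels copied from the `ner.labels` output shown in this notebook.
general_labels = {
    "DATE", "GPE", "LOC", "EVENT", "ORG", "PERCENT", "CARDINAL", "NORP", "LAW",
    "TIME", "WORK_OF_ART", "QUANTITY", "MONEY", "PERSON", "FAC", "ORDINAL",
    "PRODUCT", "LANGUAGE",
}

# Clinical entity types we would like to extract (taken from the assignment section).
clinical_labels = {"CONDITION", "PROCEDURE"}

# Anything left over is a type the general-purpose model cannot produce.
missing = sorted(clinical_labels - general_labels)
print(missing)
```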

# Pattern Matching
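spaCy's rule-based `Matcher` finds sequences of tokens that fit a pattern. Each pattern is a list of dictionaries, one per token, describing attributes the token must have, such as its exact text (`TEXT`), its lowercased text (`LOWER`), or its part-of-speech tag (`POS`). You can experiment with patterns interactively in the [rule-based matcher explorer](https://explosion.ai/demos/matcher).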

In [49]:
from spacy.matcher import Matcher

In [83]:
matcher = Matcher(nlp.vocab)

In [84]:
doc = nlp(clin_text)

In [85]:
doc

76 year old man with hypotension, CKD Stage 3, status post RIJ line placement and Swan.  

In [92]:
patterns = [
    [{'TEXT': 'hypotension'}],
    # LOWER is compared against the lowercased token text, so its value must be lowercase
    [{'LOWER': 'ckd'}, {'TEXT': 'Stage'}, {'TEXT': '3'}],
]

In [93]:
for pattern in patterns:
    print(pattern)
    matcher.add('CONDITION_PATTERN', None, pattern)

[{'TEXT': 'hypotension'}]
[{'LOWER': 'ckd'}, {'TEXT': 'Stage'}, {'TEXT': '3'}]


In [94]:
matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

hypotension
CKD Stage 3
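
To make the difference between `TEXT` and `LOWER` concrete, here is a small dependency-free sketch of how a token pattern is compared against a token sequence. It is a toy, not spaCy's implementation: `token_matches` and `find_matches` are hypothetical helpers, only the `TEXT` and `LOWER` keys are modelled, and the token list is hand-written rather than produced by spaCy's tokenizer. The point it illustrates is that `LOWER` is checked against the lowercased token, so a value containing capitals (such as `'CKD'`) can never match.

```python
def token_matches(token_text, spec):
    """Check one token against one pattern dict (TEXT and LOWER keys only)."""
    if "TEXT" in spec and token_text != spec["TEXT"]:
        return False
    if "LOWER" in spec and token_text.lower() != spec["LOWER"]:
        return False
    return True


def find_matches(tokens, pattern):
    """Return (start, end) token offsets where every pattern element matches."""
    spans = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(token_matches(tok, spec) for tok, spec in zip(window, pattern)):
            spans.append((start, start + len(pattern)))
    return spans


# Hand-tokenized stand-in for part of the clinical sentence (an assumption, not spaCy output).
tokens = ["76", "year", "old", "man", "with", "hypotension", ",", "CKD", "Stage", "3"]

lowercase_value = [{"LOWER": "ckd"}, {"TEXT": "Stage"}, {"TEXT": "3"}]
uppercase_value = [{"LOWER": "CKD"}, {"TEXT": "Stage"}, {"TEXT": "3"}]

good_spans = find_matches(tokens, lowercase_value)
bad_spans = find_matches(tokens, uppercase_value)

print([" ".join(tokens[s:e]) for s, e in good_spans])  # the span covering "CKD Stage 3"
print(bad_spans)  # empty: a lowercased token never equals "CKD"
```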


Let's look at a slightly different text now. Try to write a single pattern that captures both "stage 4 ckd" and "Stage 3 CKD".

In [97]:
clin_text2 = "the pt presents for stage 4 ckd. He previously had Stage 3 CKD."

In [98]:
pattern = [
    {'LOWER': 'stage'},
    {'POS': 'NUM'},
    {'LOWER': 'ckd'},
]

In [99]:
matcher.add('CONDITION_PATTERN', None, pattern)

In [100]:
doc = nlp(clin_text2)

In [101]:
matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

stage 4 ckd
Stage 3 CKD
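
Because `{'POS': 'NUM'}` depends on the part-of-speech tagger's prediction, it can be useful to cross-check the matches with a plain regular expression. The sketch below is a rough character-level analogue, not what the `Matcher` does internally. It only accepts digits, so unlike the token pattern it would miss a spelled-out stage such as "stage four ckd". The name `STAGE_CKD` is illustrative.

```python
import re

clin_text2 = "the pt presents for stage 4 ckd. He previously had Stage 3 CKD."

# Case-insensitive: the word "stage", one or more digits, then "ckd", separated by whitespace.
STAGE_CKD = re.compile(r"\bstage\s+\d+\s+ckd\b", re.IGNORECASE)

regex_hits = STAGE_CKD.findall(clin_text2)
print(regex_hits)
```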


# EntityRuler

In [26]:
from spacy.pipeline import EntityRuler

In [27]:
nlp = spacy.load("en_core_web_sm")
nlp.disable_pipes('ner')

[('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7f8ca05f0be8>)]

In [28]:
nlp.pipe_names

['tagger', 'parser']

In [29]:
ruler = EntityRuler(nlp)

In [30]:
patterns = [
    # LOWER (rather than TEXT) lets the pattern match both "stage 4 ckd" and "Stage 3 CKD"
    {"label": "CONDITION", "pattern": [{'LOWER': 'stage'}, {'POS': 'NUM'}, {'LOWER': 'ckd'}]}
]

In [31]:
ruler.add_patterns(patterns)
nlp.add_pipe(ruler, last=True, name='CONDITION_NER')

In [32]:
nlp.pipe_names

['tagger', 'parser', 'CONDITION_NER']

In [33]:
doc = nlp(clin_text2)

In [34]:
displacy.render(doc, style="ent")

# Assignment: Write your own NER
Use the `EntityRuler` class to extract the following concepts from these texts:
- "PROCEDURE"
- "CONDITION"

In [35]:
nlp = spacy.load("en_core_web_sm")
nlp.disable_pipes('ner')

[('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7f8ca48eae28>)]

In [92]:
long_text = '\n'.join([
    """There is continued mild-to-moderate congestive heart failure.""",
    
    """87-year-old man with htn and end-stage renal disease.""",
    
    """His wife recently died from end stage renal disease""",
    
    "The patient is s/p median sternotomy and right thoracotomy.",
    
    "The pt presents for stage 4 ckd.", 
    
    "He previously had stage 3 CKD."
    
    
])

In [93]:
long_text

'There is continued mild-to-moderate congestive heart failure.\n87-year-old man with htn and end-stage renal disease.\nHis wife recently died from end stage renal disease\nThe patient is s/p median sternotomy and right thoracotomy.\nThe pt presents for stage 4 ckd.\nHe previously had stage 3 CKD.'

In [82]:
patterns = [
    {"label": "CONDITION",
     "pattern": [{'TEXT': "end"}, {"TEXT": "-", "OP": "?"},
                 {"TEXT": "stage"}, {"TEXT": "renal"},
                 {"TEXT": "disease"}]},
    {"label": "CONDITION", "pattern": "congestive heart failure"},
    {"label": "CONDITION", "pattern": [{'TEXT': 'htn'}]},
    {"label": "CONDITION", "pattern": [{'TEXT': 'stage'}, 
                                       {'POS': 'NUM'}, {'LOWER': 'ckd'}]},
    
    {"label": "PROCEDURE", "pattern": "sternotomy"},
    {"label": "PROCEDURE", "pattern": "thoracotomy"}
    
]
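
The first pattern uses `"OP": "?"` to make the hyphen token optional, so that one rule covers both "end-stage renal disease" and "end stage renal disease". The following dependency-free sketch shows the idea behind an optional element. It is a toy under stated assumptions: `match_from` and `find_first` are hypothetical helpers, only the `TEXT` key and the `?` operator are modelled, and the token lists are written by hand on the assumption that the tokenizer splits "end-stage" into `end`, `-`, `stage` (which is what the pattern itself relies on).

```python
def match_from(tokens, i, pattern):
    """Return the end offset if `pattern` matches starting at tokens[i], else None."""
    if not pattern:
        return i  # every pattern element has been satisfied
    spec, rest = pattern[0], pattern[1:]
    # First try to consume the current token with this element.
    if i < len(tokens) and tokens[i] == spec["TEXT"]:
        end = match_from(tokens, i + 1, rest)
        if end is not None:
            return end
    # An element marked "?" is also allowed to match zero tokens.
    if spec.get("OP") == "?":
        return match_from(tokens, i, rest)
    return None


def find_first(tokens, pattern):
    """Return the text of the first matching token span, or None if nothing matches."""
    for start in range(len(tokens)):
        end = match_from(tokens, start, pattern)
        if end is not None:
            return " ".join(tokens[start:end])
    return None


esrd_pattern = [{"TEXT": "end"}, {"TEXT": "-", "OP": "?"},
                {"TEXT": "stage"}, {"TEXT": "renal"}, {"TEXT": "disease"}]

# Hand-written token lists (assumed tokenization, not spaCy output).
hyphenated = ["man", "with", "htn", "and", "end", "-", "stage", "renal", "disease", "."]
spaced = ["died", "from", "end", "stage", "renal", "disease"]

hit_hyphenated = find_first(hyphenated, esrd_pattern)
hit_spaced = find_first(spaced, esrd_pattern)
print(hit_hyphenated)
print(hit_spaced)
```

Both token lists yield a match from the same pattern, which is why the notebook needs only one rule for the two spellings.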

In [83]:
ruler = EntityRuler(nlp)

In [84]:
ruler.add_patterns(patterns)
# Add the ruler to the pipeline; without this step the doc will have no entities
nlp.add_pipe(ruler, last=True, name='CONDITION_NER')

In [94]:
doc = nlp(long_text)

In [86]:
options = {"colors": {"CONDITION": "#f5a742", 
                      "PROCEDURE": "#42f5e6"}}

In [87]:
displacy.render(doc, style="ent", options=options)

In [56]:
def get_conditions(doc):
    """Return the entities in `doc` labeled CONDITION."""
    conditions = []
    for ent in doc.ents:
        if ent.label_ == 'CONDITION':
            conditions.append(ent)
    return conditions

def get_procedures(doc):
    """Return the entities in `doc` labeled PROCEDURE."""
    procedures = []
    for ent in doc.ents:
        if ent.label_ == 'PROCEDURE':
            procedures.append(ent)
    return procedures