# Natural Language Processing

## Part 3: spaCy + Custom Pipeline + RegEx

In this part, we build custom pipeline components in spaCy and then look at regular expressions (RegEx).

## 1. Custom Pipeline

In [1]:
import spacy
from spacy.language import Language

In [2]:
# Create the nlp object
nlp = spacy.load('en_core_web_sm')

# Define a custom component
@Language.component("show_length")
def show_length(doc):

    # Print the doc's length
    print('Doc length:', len(doc))

    # Return the doc object
    return doc

# Add the component first in the pipeline
nlp.add_pipe("show_length", first=True)

# Process a text
doc = nlp("Hello world!")

Doc length: 3
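
The position of a component in the pipeline can be checked with `nlp.pipe_names`. The following is a minimal sketch, assuming spaCy v3 and a blank English pipeline so that no trained model is needed. The names `count_tokens_sketch`, `seen_lengths`, and `nlp_sketch` are hypothetical and exist only for this illustration.

```python
import spacy
from spacy.language import Language

# Lengths recorded by the component (hypothetical helper list)
seen_lengths = []

# Hypothetical component name, chosen so it does not clash with "show_length"
@Language.component("count_tokens_sketch")
def count_tokens_sketch(doc):
    # Record the number of tokens instead of printing it
    seen_lengths.append(len(doc))
    return doc

# Blank pipeline: tokenizer only, no trained components
nlp_sketch = spacy.blank("en")
nlp_sketch.add_pipe("sentencizer")
# first=True places the component ahead of everything already in the pipeline
nlp_sketch.add_pipe("count_tokens_sketch", first=True)

print(nlp_sketch.pipe_names)
nlp_sketch("Hello world!")
print(seen_lengths)
```

The tokenizer always runs before any component, which is why the component already sees three tokens even though it is first in the pipeline.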


Let's try another example.

In [3]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Britain is a place.  Mary is a doctor.")

In [4]:
for ent in doc.ents:
    print (ent.text, ent.label_)

Britain GPE
Mary PERSON


Let's create a sample component that removes all `GPE` entities from `doc.ents`.

This is not something you would normally want to do; it is only here to keep the example simple.

In [5]:
@Language.component("remove_gpe")
def remove_gpe(doc):
    original_ents = list(doc.ents)  #doc.ents is a tuple; copy it into a mutable list
    for ent in doc.ents:
        if ent.label_ == "GPE":
            original_ents.remove(ent)
    doc.ents = original_ents
    return doc

In [6]:
#add the component to the pipeline
nlp.add_pipe("remove_gpe", after="ner")  #"after" is optional here since "ner" is already the last component; it is shown only for demonstration

<function __main__.remove_gpe(doc)>

In [7]:
doc = nlp("Britain is a place.  Mary is a doctor.")
for ent in doc.ents:
    print (ent.text, ent.label_)

Mary PERSON


### A more complex one

In [8]:
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
list(nlp.pipe(animals))

[Golden Retriever, cat, turtle, Rattus norvegicus]

In [9]:
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
animal_patterns = list(nlp.pipe(animals))
print("animal_patterns:", animal_patterns)
matcher = PhraseMatcher(nlp.vocab)
# spaCy v3 signature: add(key, list_of_pattern_docs)
matcher.add("ANIMAL", animal_patterns)

# Define the custom component
@Language.component("animal_component")
def animal_component(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)
    # Create a Span for each match and assign the label 'ANIMAL'
    spans = [Span(doc, start, end, label="ANIMAL") for match_id, start, end in matches]
    # Overwrite the doc.ents with the matched spans
    doc.ents = spans
    return doc


# Add the component to the pipeline after the 'ner' component
nlp.add_pipe("animal_component", after="ner")
print(nlp.pipe_names)

# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])

animal_patterns: [Golden Retriever, cat, turtle, Rattus norvegicus]
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'animal_component']
[('cat', 'ANIMAL'), ('Golden Retriever', 'ANIMAL')]
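
The matcher call inside the component returns `(match_id, start, end)` triples, where `match_id` is a hash and `start`/`end` are token indices. The sketch below inspects those triples directly. It assumes spaCy v3 and uses a blank English pipeline; the names `pm_nlp`, `pm_matcher`, `pm_doc`, and `pm_results` are hypothetical.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Blank pipeline: the PhraseMatcher only needs the tokenizer and the vocab
pm_nlp = spacy.blank("en")
pm_matcher = PhraseMatcher(pm_nlp.vocab)
pm_matcher.add("ANIMAL", [pm_nlp.make_doc(name) for name in ["Golden Retriever", "cat"]])

pm_doc = pm_nlp("I have a cat and a Golden Retriever")

# Resolve the hashed match_id back to its string and slice the doc by token index
pm_results = sorted(
    (pm_nlp.vocab.strings[match_id], start, end, pm_doc[start:end].text)
    for match_id, start, end in pm_matcher(pm_doc)
)
for row in pm_results:
    print(row)
```

Because `start` and `end` are token indices, they can be passed straight to `Span(doc, start, end, label=...)`, which is what the component does.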


## 2. RegEx

**Strengths**: handles complex text patterns, is fast, and is supported in virtually every programming language

**Weakness**: patterns are hard to write and even harder to read

In [10]:
#import the (famous) regular expression library
import re

Now that we have it imported, we can begin to write out some RegEx rules. Let's say we want to find an occurrence of a date in a text. As noted in an earlier notebook, there are a finite number of ways this can be represented. Let's try to grab all instances of a day followed by a month first.

In [11]:
#(\d) means any digit (0-9)
#{1,2} means 1 or 2 times (no space inside the braces)
#| means or
pattern = r"((\d){1,2} (January|February|March|April|May|June|July|August|September|October|November|December))"

text = "This is a date 2 February. Another date would be 14 August."
matches = re.findall(pattern, text)
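#each tuple holds the outer group first, then the inner groups;
#a repeated group such as (\d){1,2} keeps only its last repetition, hence '4' for '14'
print (matches)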

[('2 February', '2', 'February'), ('14 August', '4', 'August')]
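
The extra tuple entries come from the capturing parentheses. As a minimal sketch using only the standard `re` module, non-capturing groups `(?:...)` make `findall` return whole matches, and named groups make the parts easier to read. The variable names (`months`, `date_text`, `plain_pattern`, `named_pattern`) are hypothetical.

```python
import re

months = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
date_text = "This is a date 2 February. Another date would be 14 August."

# (?:...) groups without capturing, so findall returns the full match strings
plain_pattern = r"\d{1,2} (?:" + months + r")"
whole_matches = re.findall(plain_pattern, date_text)
print(whole_matches)

# Named groups capture the complete day (both digits) and the month
named_pattern = r"(?P<day>\d{1,2}) (?P<month>" + months + r")"
day_month = [(hit.group("day"), hit.group("month"))
             for hit in re.finditer(named_pattern, date_text)]
print(day_month)
```

Placing the quantifier inside the group, as in `(\d{1,2})`, captures both digits of `14`, whereas `(\d){1,2}` keeps only the last one.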


`findall` returns a list of matches. Let's use `finditer` instead, which returns an iterator of match objects that we can loop over.

In [12]:
text = "This is a date 2 February. Another date would be 14 August."
iter_matches = re.finditer(pattern, text)
print (iter_matches)
for hit in iter_matches:
    print (hit)

<callable_iterator object at 0x13af48e50>
<re.Match object; span=(15, 25), match='2 February'>
<re.Match object; span=(49, 58), match='14 August'>


Each match object carries useful information: the start and end character offsets (the `span`) and the matched text (`match`). We can use the start and end offsets to slice the matched text out of the string.

In [13]:
text = "This is a date 2 February. Another date would be 14 August."
iter_matches = re.finditer(pattern, text)
for hit in iter_matches:
    start = hit.start()
    end = hit.end()
    print (text[start:end])

2 February
14 August
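
A match object also exposes the matched text directly through `group(0)`. The short sketch below collects `(start, end, text)` records, the same shape of data used later when building spans. The shortened month list and the names `sample`, `date_pattern`, and `records` are assumptions made for brevity.

```python
import re

sample = "This is a date 2 February. Another date would be 14 August."
# Shortened month list, for illustration only
date_pattern = r"\d{1,2} (?:February|August)"

# m.start() and m.end() are character offsets; m.group(0) is the matched text
records = [(m.start(), m.end(), m.group(0)) for m in re.finditer(date_pattern, sample)]
for start, end, matched in records:
    print(start, end, matched)
```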


### RegEx + spaCy

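spaCy offers three rule-based matching tools: `Matcher`, `PhraseMatcher`, and `EntityRuler`. The token-pattern-based `Matcher` and `EntityRuler` accept a `REGEX` operator. One major drawback of the `Matcher` and `PhraseMatcher` is that they only return matches and do not write them to `doc.ents`; the `EntityRuler` does.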

In [14]:
import spacy

#Sample text
text = "This is a sample number 555-5555."

#Start from a blank English pipeline (tokenizer only, no trained components)
nlp = spacy.blank("en")

#Create the Ruler and Add it
ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns (source: https://spacy.io/usage/rule-based-matching)
patterns = [
                {"label": "PHONE_NUMBER", "pattern": [{"SHAPE": "ddd"},
                {"ORTH": "-", "OP": "?"}, {"SHAPE": "dddd"}]}
            ]
#add patterns to ruler
ruler.add_patterns(patterns)

#create the doc
doc = nlp(text)

#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)

555-5555 PHONE_NUMBER


This method worked well for grabbing the phone number. But what if we wanted to use RegEx as opposed to linguistic features, such as shape? First, let's write some RegEx to capture 555-5555.

In [15]:
pattern = r"((\d){3}-(\d){4})"
text = "This is a sample number 555-5555."
matches = re.findall(pattern, text)
print (matches)

[('555-5555', '5', '5')]


So now we know that we have a RegEx pattern that works. Let's try to implement it in the spaCy EntityRuler with the code below. When we execute it, we get no output.

In [16]:
import spacy

text = "This is a sample number (555) 555-5555."

#Start from a blank English pipeline (tokenizer only, no trained components)
nlp = spacy.blank("en")

#Create the ruler and add it
ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns (source: https://spacy.io/usage/rule-based-matching)
patterns = [
                {
                    "label": "PHONE_NUMBER", 
                    #raw string so that \d is not treated as a Python escape sequence
                    "pattern": [{"TEXT": {"REGEX": r"((\d){3}-(\d){4})"}}]
                }
            ]
#add patterns to ruler
ruler.add_patterns(patterns)

#create the doc
doc = nlp(text)

#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)

This happens for one important reason: the `REGEX` operator in an EntityRuler pattern is applied to one token at a time, so it cannot match across tokens. spaCy's tokenizer splits `555-5555` into three tokens (`555`, `-`, `5555`), and no single token matches the whole pattern. So, what are we to do in this scenario? We have a few different options that we will explore in the next notebook. But before we get to that, let's use RegEx to capture the phone number with no hyphen.

In [17]:
import spacy

#Sample text
text = "This is a sample number 5555555."

#Start from a blank English pipeline (tokenizer only, no trained components)
nlp = spacy.blank("en")

#add the pipe
ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns (source: https://spacy.io/usage/rule-based-matching)
patterns = [
                {
                    "label": "PHONE_NUMBER", 
                    #raw string so that \d is not treated as a Python escape sequence
                    "pattern": [{"TEXT": {"REGEX": r"((\d){7})"}}]
                }
            ]
#add patterns to ruler
ruler.add_patterns(patterns)


#create the doc
doc = nlp(text)

#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)

5555555 PHONE_NUMBER
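
One way to keep the hyphenated form inside the EntityRuler is to give each token its own RegEx. This is a sketch assuming spaCy v3 and its default English tokenizer, which splits `555-5555` into `555`, `-`, and `5555`. The names `phone_nlp`, `phone_ruler`, `phone_doc`, and `phone_ents` are hypothetical.

```python
import spacy

phone_nlp = spacy.blank("en")
phone_ruler = phone_nlp.add_pipe("entity_ruler")

# One dictionary per token: each REGEX only ever sees a single token's text
phone_ruler.add_patterns([
    {
        "label": "PHONE_NUMBER",
        "pattern": [
            {"TEXT": {"REGEX": r"^\d{3}$"}},   # exactly three digits
            {"ORTH": "-"},                     # the hyphen is its own token
            {"TEXT": {"REGEX": r"^\d{4}$"}},   # exactly four digits
        ],
    }
])

phone_doc = phone_nlp("This is a sample number (555) 555-5555.")
print([token.text for token in phone_doc])

phone_ents = [(ent.text, ent.label_) for ent in phone_doc.ents]
print(phone_ents)
```

This works only as long as the tokenizer splits the number in this way, which is why inspecting the tokens first is a useful habit.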


It is inconvenient that the EntityRuler cannot apply a RegEx across tokens. A more flexible approach is to run the RegEx on the raw text and convert the matches into spaCy `Span` objects.

## 3. Span

We are going to try to grab multi-word names whose first name is Paul.

In [18]:
import re

text = "Paul Newman was an American actor, but Paul Hollywood is a British TV Host. The name Paul is quite common."

#[A-Z] --> one capital letter; \w+ --> one or more word characters (the rest of the surname)
pattern = r"Paul [A-Z]\w+"

matches = re.finditer(pattern, text)

for match in matches:
    print (match)

<re.Match object; span=(0, 11), match='Paul Newman'>
<re.Match object; span=(39, 53), match='Paul Hollywood'>
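
The sentence "The name Paul is quite common." produces no match because `[A-Z]` requires the word after "Paul" to start with a capital letter. A small sketch of that behaviour follows; the leading `\b` word boundary is an addition not present in the pattern above, and `name_pattern`, `samples`, and `found` are hypothetical names.

```python
import re

# \b keeps the pattern from matching in the middle of a longer word such as "JeanPaul"
name_pattern = r"\bPaul [A-Z]\w+"
samples = "Paul Newman and Paul Hollywood met. The name Paul is quite common."

found = [m.group(0) for m in re.finditer(name_pattern, samples)]
print(found)
```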


### Reconstruct Spans

We can use spaCy's `Span` to hold the results returned by `re`.

In [19]:
from spacy.tokens import Span

Here, we will create a blank spaCy English model and create the doc object of the text. It will have no entities in it because we are working with a blank model that does not have an "ner" component.

In [20]:
nlp = spacy.blank("en")
doc = nlp(text)

Even though this part is unnecessary, it is good to do it here because in other situations you will have entities. If you do, you need to store them as a separate list to which we will append things.

In [21]:
original_ents = list(doc.ents)
original_ents

[]

Now, let's iterate over the results from `re.finditer()`. In this cell, we grab the start and end character offsets of each match and pass them to `doc.char_span()`, which converts them into a span of tokens. This step matters because character offsets and token boundaries do not always align; when they do not, `char_span()` returns `None`, so we skip those cases. Finally, we append the span's token-level `start` and `end`, along with its `text`, to `my_ents`. The `text` is not necessary but it helps with debugging.

In [22]:
my_ents = []
for match in re.finditer(pattern, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    if span is not None:
        my_ents.append((span.start, span.end, span.text))
        
my_ents

[(0, 2, 'Paul Newman'), (8, 10, 'Paul Hollywood')]
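
To see when `doc.char_span()` returns `None`, consider character offsets that end in the middle of a token. This is a sketch assuming spaCy v3, where `char_span` accepts an `alignment_mode` argument; the names `align_nlp`, `align_doc`, `strict_span`, and `expanded_span` are hypothetical.

```python
import spacy

align_nlp = spacy.blank("en")
align_doc = align_nlp("Paul Newman was an American actor.")

# Characters 0-8 cover "Paul New", which ends in the middle of the token "Newman"
strict_span = align_doc.char_span(0, 8)
print(strict_span)

# alignment_mode="expand" widens the span to the nearest token boundaries
expanded_span = align_doc.char_span(0, 8, alignment_mode="expand")
print(expanded_span.text, expanded_span.start, expanded_span.end)
```

The default mode is strict, which is why the `if span is not None` check above is needed.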

### Inject the Spans into the `doc.ents`

With that data, we can iterate over each entity and identify where it begins and ends in spaCy. Note, we are using the spaCy Span class. This allows us to create a span object and assign it a custom label. With this data, we can append each Span to original_ents.

In [23]:
for ent in my_ents:
    start, end, name = ent
    per_ent = Span(doc, start, end, label="PERSON")
    original_ents.append(per_ent)

And finally, we set `doc.ents` equal to `original_ents`. This effectively loads the spans back into the spaCy `doc.ents`.

In [24]:
doc.ents = original_ents

Let's iterate over the ents as we normally would.

In [25]:
for ent in doc.ents:
    print (ent.text, ent.label_)

Paul Newman PERSON
Paul Hollywood PERSON


Note that these are now properly identified entities in our `doc.ents` attribute.

Next, let's turn this into a custom pipeline component. We can reuse the code above almost unchanged, wrapping it in a function that takes a `doc` and returns it.

In [26]:
from spacy.language import Language

@Language.component("paul_ner")
def paul_ner(doc):
    pattern = r"Paul [A-Z]\w+"
    # Work on the doc passed in by the pipeline; keep any entities it already has
    original_ents = list(doc.ents)
    my_ents = []
    for match in re.finditer(pattern, doc.text):
        start, end = match.span()
        # char_span returns None if the offsets do not fall on token boundaries
        span = doc.char_span(start, end)
        if span is not None:
            my_ents.append((span.start, span.end, span.text))
    for ent in my_ents:
        start, end, name = ent
        per_ent = Span(doc, start, end, label="PERSON")
        original_ents.append(per_ent)
    doc.ents = original_ents
    return doc

In [27]:
nlp = spacy.blank("en")
nlp.add_pipe("paul_ner")
doc = nlp(text)
print(doc.ents)

(Paul Newman, Paul Hollywood)


## 4. Give priority to longer spans

In [28]:
text = "Paul Newman was an American actor, but Paul Hollywood is a British TV Host."

nlp = spacy.load("en_core_web_sm")

doc = nlp(text)
for ent in doc.ents:
    print (ent.text, ent.label_)

Paul Newman PERSON
American NORP
Paul Hollywood PERSON
British NORP


Let’s say that we create a new entity. Maybe words associated with Cinema. So, we want to classify Hollywood as a tag “CINEMA”. Now, in the above text, Hollywood is clearly associated with Paul Hollywood, but let’s imagine for a moment that it is not. Let’s try and run the same code as above. If we do, we notice that we get an error.

In [29]:
my_ents = []
pattern = r"Hollywood"
original_ents = list(doc.ents)
for match in re.finditer(pattern, doc.text):
    print (match)
    start, end = match.span()
    span = doc.char_span(start, end)
    if span is not None:
        my_ents.append((span.start, span.end, span.text))
for ent in my_ents:
    start, end, name = ent
    per_ent = Span(doc, start, end, label="CINEMA")
    original_ents.append(per_ent)

doc.ents = original_ents

<re.Match object; span=(44, 53), match='Hollywood'>


ValueError: [E1010] Unable to set entity information for token 9 which is included in more than one span in entities, blocked, missing or outside.

This error tells us that the span found by `finditer()` overlaps with an entity that our `ner` component already found; `doc.ents` cannot contain overlapping spans. This problem can be resolved with spaCy's `filter_spans`, which gives priority to **longer spans** (and, among spans of equal length, to the one that starts first). Notice in the output below that Paul Hollywood remains a PERSON rather than Hollywood becoming CINEMA. This is because Hollywood is shorter than Paul Hollywood.

In [30]:
original_ents

[Paul Newman, American, Paul Hollywood, British, Hollywood]

In [31]:
from spacy.util import filter_spans
filtered = filter_spans(original_ents)
doc.ents = filtered
for ent in doc.ents:
    print (ent.text, ent.label_)

Paul Newman PERSON
American NORP
Paul Hollywood PERSON
British NORP
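
The same behaviour can be reproduced without a trained model by building the overlapping spans by hand. This is a minimal sketch assuming spaCy v3 and a blank English pipeline; the names `overlap_nlp`, `overlap_doc`, `person`, `cinema`, `kept`, and `result` are hypothetical.

```python
import spacy
from spacy.tokens import Span
from spacy.util import filter_spans

overlap_nlp = spacy.blank("en")
overlap_doc = overlap_nlp("Paul Hollywood is a British TV host.")

# Two overlapping candidates: both include token 1 ("Hollywood")
person = Span(overlap_doc, 0, 2, label="PERSON")   # "Paul Hollywood"
cinema = Span(overlap_doc, 1, 2, label="CINEMA")   # "Hollywood"

# filter_spans keeps the longest span among overlapping ones,
# regardless of the order in which the candidates are passed in
kept = filter_spans([cinema, person])
overlap_doc.ents = kept

result = [(ent.text, ent.label_) for ent in overlap_doc.ents]
print(result)
```

If the shorter label should win instead, the spans would have to be filtered manually before assigning to `doc.ents`, since `filter_spans` always prefers length.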
