# 8. Using RegEx with spaCy


## 8.1. Key Concepts in this Notebook
1. What is RegEx (Regular Expressions)?

2. The Strengths of RegEx

3. The Weaknesses of RegEx

4. How to use RegEx in Python

5. How to use RegEx in spaCy

## 8.2. What are Regular Expressions (RegEx)?

Regular Expressions, or RegEx for short, are a way of matching strings against patterns, which can be simple or complex. They can be used to find and retrieve the parts of a string that match a pattern, or to replace those parts with something else.
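
As a quick illustration of the three uses named above (finding, retrieving, and replacing), here is a minimal sketch with Python's built-in `re` module. The sample sentence and the pattern `\d+` (one or more digits) are made up for this illustration.

```python
import re

# Made-up sample text for illustration
text = "Order 66 shipped on day 12; order 7 is pending."

# Finding: locate the first place the pattern occurs
first = re.search(r"\d+", text)
print(first.group())

# Retrieving: collect every non-overlapping match as a list of strings
numbers = re.findall(r"\d+", text)
print(numbers)

# Replacing: swap every match for another string
masked = re.sub(r"\d+", "#", text)
print(masked)
```
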

## 8.3. The Strengths of RegEx

There are several strengths to RegEx.

* Thanks to its expressive syntax, it allows programmers to write robust rules in very little space.

* It allows the researcher to capture many kinds of variation in strings.

* It performs remarkably quickly when compared to other methods.

* It is supported in virtually every programming language.

## 8.4. The Weaknesses of RegEx

Despite these strengths, there are a few weaknesses to RegEx.

1. Its syntax is quite difficult for beginners. (I still find myself looking up how to do certain things.)

2. In order to work well, it requires a domain expert to work alongside the programmer to think of all the ways a pattern may vary in texts.

## 8.5. How to Use RegEx in Python

In [47]:
import re

In [48]:
pattern = r"((\d){1,2} (January|February|March|April|May|June|July|August|September|October|November|December))"

text = "This is a date 2 February. Another date would be 14 August."
matches = re.findall(pattern, text)
print (matches)

[('2 February', '2', 'February'), ('14 August', '4', 'August')]


In this bit of code, we see a real-life RegEx pattern at work. While it looks quite complex, its syntax is fairly straightforward. Let’s break it down. The first ( opens a group that runs all the way to the final ). In other words, I’m asking for a match on the whole pattern, not just its components.

Next, we state (\d){1,2}. This means that we are looking for any digit (0-9) that occurs either once or twice ({1,2}).

Next, we have a space to indicate the space in the string that we would expect with a date.

Next, we have (January|February|March|April|May|June|July|August|September|October|November|December). This is another component of the pattern (because it is in parentheses). The | means the same thing as “or” in English, so either January, or February, or March, etc.
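
One detail in the output above is worth a closer look: for 14 August, the second item is '4', not '14'. When the parentheses wrap a single digit and the repetition sits outside them, as in (\d){1,2}, the group keeps only the last digit it matched. The sketch below contrasts that with putting the repetition inside the parentheses. The shortened pattern (only "August") is just for illustration.

```python
import re

text = "Another date would be 14 August."

# Repetition OUTSIDE the group: the group is re-captured for each digit,
# so only the last digit is kept in group 1.
repeated_group = re.search(r"(\d){1,2} August", text)
print(repeated_group.group(0))  # the whole match
print(repeated_group.group(1))  # only the last digit

# Repetition INSIDE the group: group 1 holds the whole number.
whole_number = re.search(r"(\d{1,2}) August", text)
print(whole_number.group(1))
```

The whole match is the same in both cases; only the content of the inner group differs.
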

When we bring it all together, this pattern will match one or two digits followed by a space and a month. What happens when we try this with a date that is written the opposite way?

In [49]:
text = "This is a date February 2. Another date would be 14 August."
matches = re.findall(pattern, text)
print (matches)

[('14 August', '4', 'August')]


It fails. But this is no fault of RegEx: our pattern simply cannot accommodate that variation. We can account for it by adding the reversed order as a second alternative. Alternatives are separated with a |, the same “or” operator we used for the month names.

In [50]:
pattern = r"(((\d){1,2}( (January|February|March|April|May|June|July|August|September|October|November|December)))|(((January|February|March|April|May|June|July|August|September|October|November|December) )(\d){1,2}))"

text = "This is a date February 2. Another date would be 14 August."
matches = re.findall(pattern, text)
print (matches)

[('February 2', '', '', '', '', 'February 2', 'February ', 'February', '2'), ('14 August', '14 August', '4', ' August', 'August', '', '', '', '')]


There are more concise ways to write the same RegEx formula. I have opted here to be more verbose to make it a bit easier to read. You can see that we’ve allowed for two main options for our pattern matcher.

Notice, however, that we have a lot of superfluous information for each match. These are the components (the groups) of each match. There are several ways we can remove them. One way is to use the function finditer, rather than findall.

In [51]:
text = "This is a date February 2. Another date would be 14 August."
iter_matches = re.finditer(pattern, text)
print (iter_matches)

<callable_iterator object at 0x7e5c7b447460>


In [52]:
text = "This is a date February 2. Another date would be 14 August."
iter_matches = re.finditer(pattern, text)
print (iter_matches)
for hit in iter_matches:
    print (hit)

<callable_iterator object at 0x7e5c7b445fc0>
<re.Match object; span=(15, 25), match='February 2'>
<re.Match object; span=(49, 58), match='14 August'>


Within each of these is some very salient information, such as the start and end location (inside the span) and the text itself (match). We can use the start and end location to grab the text within the string.

In [53]:
text = "This is a date February 2. Another date would be 14 August."
iter_matches = re.finditer(pattern, text)
for hit in iter_matches:
    start = hit.start()
    end = hit.end()
    print (text[start:end])

February 2
14 August
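
The text above mentions that there are more concise ways to write the pattern and several ways to drop the superfluous components. One option, sketched here, is to use non-capturing groups, written `(?:...)`. When a pattern has no capturing groups, `re.findall` returns the full matches as plain strings. The names `MONTHS` and `DAY` and the use of `\b` word boundaries are choices made for this sketch, not part of the original pattern.

```python
import re

# Build the pattern from named pieces so the month list appears only once
MONTHS = ("(?:January|February|March|April|May|June|July|August|"
          "September|October|November|December)")
DAY = r"\d{1,2}"

# Either "day month" or "month day"; (?:...) groups without capturing
pattern = rf"\b(?:{DAY} {MONTHS}|{MONTHS} {DAY})\b"

text = "This is a date February 2. Another date would be 14 August."
matches = re.findall(pattern, text)
print(matches)
```

With `finditer`, the same pattern still gives access to the start and end positions of each match.
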


## 8.6. How to Use RegEx in spaCy

Things like dates, times, IP addresses, etc. that have consistent or fairly consistent structures are excellent candidates for RegEx. Fortunately, spaCy has easy ways to implement RegEx in three pipes: Matcher, PhraseMatcher, and EntityRuler. One of the major drawbacks of the Matcher and PhraseMatcher is that they do not store their matches in doc.ents. Because this textbook is about NER and our goal is to store the entities in doc.ents, we will focus on using RegEx with the EntityRuler. In the next notebook, we will examine other methods.

In [54]:
#Import the requisite library
import spacy

#Sample text
text = "This is a sample number 555-5555."

#Create a blank English pipeline
nlp = spacy.blank("en")

#Create the Ruler and Add it
ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns (source: https://spacy.io/usage/rule-based-matching)
patterns = [
                {"label": "PHONE_NUMBER", "pattern": [{"SHAPE": "ddd"},
                {"ORTH": "-", "OP": "?"}, {"SHAPE": "dddd"}]}
            ]
#add patterns to ruler
ruler.add_patterns(patterns)

#create the doc
doc = nlp(text)

#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)

555-5555 PHONE_NUMBER


In [None]:
pattern = r"((\d){3}-(\d){4})"
text = "This is a sample number 555-5555."
matches = re.findall(pattern, text)
print (matches)

In [55]:
#Import the requisite library
import spacy

#Sample text
text = "This is a sample number (555) 555-5555."

#Create a blank English pipeline
nlp = spacy.blank("en")

#Create the Ruler and Add it
ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns (source: https://spacy.io/usage/rule-based-matching)
patterns = [
                {
                    "label": "PHONE_NUMBER", "pattern": [{"TEXT": {"REGEX": r"((\d){3}-(\d){4})"}}
                                                        ]
                }
            ]
#add patterns to ruler
ruler.add_patterns(patterns)


#create the doc
doc = nlp(text)

#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)

In [56]:
#Import the requisite library
import spacy

#Sample text
text = "This is a sample number 5555555."
#Create a blank English pipeline
nlp = spacy.blank("en")

#Create the Ruler and Add it
ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns (source: https://spacy.io/usage/rule-based-matching)
patterns = [
                {
                    "label": "PHONE_NUMBER", "pattern": [{"TEXT": {"REGEX": r"((\d){5})"}}
                                                        ]
                }
            ]
#add patterns to ruler
ruler.add_patterns(patterns)


#create the doc
doc = nlp(text)

#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)

5555555 PHONE_NUMBER
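
The last two cells behave differently because a REGEX inside a token pattern is checked against one token at a time. The first example in this section shows that spaCy splits 555-5555 into three tokens (555, -, 5555), so no single token contains the whole number and the second-to-last cell prints nothing. The number 5555555 is a single token, so the pattern can find a hit inside it. The sketch below imitates this behavior with plain Python. The helper `tokens_matching` and the hand-written token lists are assumptions made for illustration and are not spaCy code.

```python
import re

def tokens_matching(pattern, tokens):
    """Return the tokens in which the pattern is found, searching each token on its own."""
    return [token for token in tokens if re.search(pattern, token)]

# Hand-written token lists (assumption: they mirror the splits seen above)
split_number = ["555", "-", "5555"]
single_number = ["5555555"]

# The full phone pattern never fits inside one token
no_hits = tokens_matching(r"\d{3}-\d{4}", split_number)
print(no_hits)

# Five digits are found inside the seven-digit token
partial_hit = tokens_matching(r"\d{5}", single_number)
print(partial_hit)

# Anchoring with ^ and $ demands that the whole token be exactly five digits
exact_only = tokens_matching(r"^\d{5}$", single_number)
print(exact_only)
```

Note that the five-digit pattern matches a seven-digit token because the search succeeds anywhere inside the token. Anchors such as ^ and $ are one way to require that the entire token fit the pattern.
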


## 9.1. Key Concepts in this Notebook

* Working with Multi-Word Tokens and RegEx in spaCy 3.x

* RegEx’s Finditer

* Spans

## 9.2. Problems with Multi-Word Tokens in spaCy as Entities

As we saw in 01.03: Rules-Based NER, we can use spaCy’s Matcher to grab multi-word tokens, that is, matches that span multiple tokens. The main problem with this, however, is that these multi-word tokens are not placed into doc.ents. This means that we cannot access them the same way we would other entities. In this notebook, we will solve that problem with a simple workflow:

1. Extract Multi-Word Tokens with re.finditer()

2. Reconstruct the spans in the spaCy doc

3. Give priority to longer spans (Optional)

4. Inject the Spans into doc.ents

## 9.3. Extract Multi-Word Tokens

First, we need to grab the multi-word tokens. In this notebook, we are going to grab one kind of multi-word token: a person whose first name is Paul. In the RegEx below, we specify that we are looking for any string that starts with “Paul” and is followed by a space and a capital letter. We then tell it to grab the rest of that second word.

In [31]:
import spacy
import re

In [32]:
text = "Paul Newman was an American actor, but Paul Hollywood is a British TV Host. The name Paul is quite common."

In [33]:
pattern = r"Paul [A-Z]\w+"

In [34]:
matches = re.finditer(pattern, text)
for match in matches:
  print(match)

<re.Match object; span=(0, 11), match='Paul Newman'>
<re.Match object; span=(39, 53), match='Paul Hollywood'>


## 9.4. Reconstruct Spans



In [35]:
from spacy.tokens import Span

In [36]:
nlp = spacy.blank("en")
doc = nlp(text)
# print(doc.ents)
original_ents = list(doc.ents)
mwt_ents = []
for match in re.finditer(pattern, doc.text):
  start, end = match.span()
  span = doc.char_span(start, end)
  if span is not None:
    mwt_ents.append((span.start, span.end, span.text))

for ent in mwt_ents:
  start, end, name = ent
  per_ent = Span(doc, start, end, label="PERSON")
  original_ents.append(per_ent)

doc.ents = original_ents
print(doc.ents)

for ent in doc.ents:
  print(ent.text, ent.label_)

(Paul Newman, Paul Hollywood)
Paul Newman PERSON
Paul Hollywood PERSON


In [37]:
print(mwt_ents)

[(0, 2, 'Paul Newman'), (8, 10, 'Paul Hollywood')]
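
RegEx reports character offsets, while spaCy entities are defined by token indices, which is why the code above passes each match through doc.char_span and skips any result that is None. The sketch below shows the idea of that alignment in plain Python. The toy tokenizer and the helper `char_span_to_token_span` are hypothetical stand-ins for illustration and are not how spaCy is implemented.

```python
import re

def tokenize(text):
    """Toy tokenizer: words and single punctuation marks, with character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]

def char_span_to_token_span(tokens, start_char, end_char):
    """Map character offsets to (start_token, end_token), end exclusive.
    Return None if the offsets do not fall on token boundaries."""
    starts = {start: i for i, (_, start, _) in enumerate(tokens)}
    ends = {end: i + 1 for i, (_, _, end) in enumerate(tokens)}
    if start_char in starts and end_char in ends:
        return starts[start_char], ends[end_char]
    return None

text = ("Paul Newman was an American actor, but Paul Hollywood is a British TV Host. "
        "The name Paul is quite common.")
tokens = tokenize(text)

# Convert each RegEx match from character offsets to token indices
token_spans = []
for match in re.finditer(r"Paul [A-Z]\w+", text):
    token_spans.append(char_span_to_token_span(tokens, *match.span()))
print(token_spans)

# Offsets that stop in the middle of "Newman" do not align with any token
misaligned = char_span_to_token_span(tokens, 0, 8)
print(misaligned)
```

On this sentence the toy tokenizer yields the same token indices as the spaCy output above, though the two tokenizers will differ on other texts.
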


## 9.5. Inject the Spans into the doc.ents

In [38]:
from spacy.language import Language

@Language.component("paul_ner")
def paul_ner(doc):
  pattern = r"Paul [A-Z]\w+"
  original_ents = list(doc.ents)
  mwt_ents = []
  for match in re.finditer(pattern, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    if span is not None:
      mwt_ents.append((span.start, span.end, span.text))

  for ent in mwt_ents:
    start, end, name = ent
    per_ent = Span(doc, start, end, label="PERSON")
    original_ents.append(per_ent)

  doc.ents = original_ents
  return (doc)

In [39]:
nlp2 = spacy.blank("en")
nlp2.add_pipe("paul_ner")


In [40]:
doc2 = nlp2(text)
print(doc2.ents)

(Paul Newman, Paul Hollywood)


In [44]:
from spacy.language import Language
from spacy.util import filter_spans

@Language.component("cinema_ner")
def cinema_ner(doc):
  pattern = r"Hollywood"
  original_ents = list(doc.ents)
  mwt_ents = []
  for match in re.finditer(pattern, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    if span is not None:
      mwt_ents.append((span.start, span.end, span.text))

  for ent in mwt_ents:
    start, end, name = ent
    per_ent = Span(doc, start, end, label="CINEMA")
    original_ents.append(per_ent)

  filtered = filter_spans(original_ents)
  doc.ents = filtered
  return (doc)

In [45]:
nlp3 = spacy.load("en_core_web_sm")
nlp3.add_pipe("cinema_ner")

In [46]:
doc3 = nlp3(text)
for ent in doc3.ents:
  print(ent.text, ent.label_)

Paul Newman PERSON
American NORP
Paul Hollywood PERSON
British NORP
Paul PERSON
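
Notice that Hollywood never appears as CINEMA in this output. The model had already labeled Paul Hollywood as PERSON, the CINEMA span overlaps it, and filter_spans prefers the longest span when spans overlap. This is the optional "give priority to longer spans" step from the workflow. The sketch below illustrates that rule on plain (start, end, label) tuples. The function `keep_longest_spans` is a hypothetical illustration of the idea, not spaCy's own code.

```python
def keep_longest_spans(spans):
    """Keep non-overlapping spans, preferring longer ones (earlier start wins ties).
    Each span is (start_token, end_token, label) with an exclusive end."""
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    taken = set()  # token indices already covered by a kept span
    for start, end, label in ordered:
        if not any(i in taken for i in range(start, end)):
            kept.append((start, end, label))
            taken.update(range(start, end))
    return sorted(kept)  # back into document order

candidates = [
    (0, 2, "PERSON"),   # Paul Newman
    (8, 10, "PERSON"),  # Paul Hollywood
    (9, 10, "CINEMA"),  # Hollywood, overlapping the span above
]
filtered = keep_longest_spans(candidates)
print(filtered)
```

Filtering like this matters because doc.ents does not accept overlapping spans, so some rule has to decide which candidate survives.
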
