# RegEx in spaCy

spaCy offers three rule-based matching tools that can replace or complement regular expressions:

- Matcher
- PhraseMatcher
- EntityRuler

Matcher and PhraseMatcher return the matched spans but do not add them as entities to doc.ents. EntityRuler does, so we use it here to implement regular expressions.
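The difference can be checked with a minimal sketch. It assumes spaCy v3 and a blank English pipeline, and the label GREETING and the sample sentence are made up for illustration: the same token pattern is run once through a Matcher and once through an EntityRuler, and doc.ents is inspected after each.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
pattern = [{"LOWER": "good"}, {"LOWER": "morning"}]

# Matcher: returns (match_id, start, end) tuples and leaves doc.ents untouched
matcher = Matcher(nlp.vocab)
matcher.add("GREETING", [pattern])
doc = nlp("Good morning everyone")
matcher_hits = [doc[start:end].text for _, start, end in matcher(doc)]
ents_after_matcher = list(doc.ents)

# EntityRuler: a pipeline component that writes its matches to doc.ents
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "GREETING", "pattern": pattern}])
doc = nlp("Good morning everyone")
ents_after_ruler = [(ent.text, ent.label_) for ent in doc.ents]

print(matcher_hits)        # the Matcher finds the span ...
print(ents_after_matcher)  # ... but no entity is recorded
print(ents_after_ruler)    # the EntityRuler records it as an entity
```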

### 1. RegEx with EntityRuler

In [1]:
# import required libraries
import spacy

In [2]:
# Let us see an example

nlp = spacy.blank('en')

patterns = [{'label':'PHONE_NUMBER', 'pattern': [{'SHAPE':'ddd'},{'ORTH':'-'},{'SHAPE':'ddd'},{'ORTH':'-'},{'SHAPE':'dddd'}]}]

ruler = nlp.add_pipe('entity_ruler')

ruler.add_patterns(patterns)

In [3]:
text = 'Our phone number is 832-123-5555 and their phone number is 425-123-3829.'

doc = nlp(text)

In [4]:
print([(ent.text, ent.label_) for ent in doc.ents])

[('832-123-5555', 'PHONE_NUMBER'), ('425-123-3829', 'PHONE_NUMBER')]


Another example, this time with a regular expression applied to a single token:

In [5]:
text = "Our phone number is 4251234567."

patterns = [{"label": "PHONE_NUMBERS", "pattern": [{"TEXT": {"REGEX": r"(\d){10}"}}]}]

ruler.add_patterns(patterns)

doc = nlp(text)
print([(ent.text, ent.label_) for ent in doc.ents])

[('4251234567', 'PHONE_NUMBERS')]
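One detail to keep in mind with the REGEX operator: spaCy applies the expression to each token with search semantics, so an unanchored pattern such as (\d){10} also accepts longer tokens that merely contain ten digits. The sketch below is a hedged variant of the example above, not a replacement for it. It assumes a fresh blank pipeline, an invented sample sentence, and anchors (^ and $) to require exactly ten digits.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# ^ and $ force the whole token text to be exactly ten digits
ruler.add_patterns([
    {"label": "PHONE_NUMBER", "pattern": [{"TEXT": {"REGEX": r"^\d{10}$"}}]}
])

# the second number has twelve digits and should be ignored
doc = nlp("Call 4251234567 but not 425123456789 please")
phone_ents = [(ent.text, ent.label_) for ent in doc.ents]
print(phone_ents)
```

Without the anchors, the twelve-digit token would be labelled as well, because it contains a run of ten digits.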


### 2. spaCy Matcher

RegEx patterns are hard to read and debug. For this reason, spaCy provides the Matcher class, a more readable and maintainable alternative that is suitable for production use. The Matcher class matches predefined rules against a sequence of tokens in a Doc.

In [8]:
# import the Matcher class
from spacy.matcher import Matcher

In [10]:
nlp = spacy.load('en_core_web_sm')

doc = nlp('Good Morning, this is our practice session on spaCy Matcher.')

# initialize a matcher object
matcher = Matcher(nlp.vocab)

In [11]:
# create a pattern
pattern = [{'LOWER':'good'},{'LOWER':'morning'}]

# add the pattern to the matcher object
matcher.add('morning_greeting',[pattern])

In [12]:
# let us see if the pattern matches the text
matches = matcher(doc)

for match_id, start, end in matches:
    print('Start token: {0} | End token: {1} | Matched text: {2}'.format(start,end,doc[start:end].text))

Start token: 0 | End token: 2 | Matched text: Good Morning


Patterns can be made more expressive by using an operator dictionary as an attribute's value instead of a plain value. These operators perform extended comparisons and resemble Python's in, not in and comparison operators. The table below lists operators supported by the Matcher class.

| Operator         | Value Type | Description                               |
|------------------|------------|-------------------------------------------|
| IN               | any type   | Attribute value is a member of a list     |
| NOT_IN           | any type   | Attribute value is not a member of a list |
| ==, >=, <=, >, < | int, float | Comparison operators                      |

In [17]:
# Let us see an example with the IN operator
doc = nlp('Good morning and good evening.')

matcher = Matcher(nlp.vocab)

pattern = [{'LOWER':'good'},{'LOWER': {'IN':['morning','evening']}}]

matcher.add('morning_greeting', [pattern])

matches = matcher(doc)

for match_id, start, end in matches:
    print('Start token: {0} | End token: {1} | Matched text: {2}'.format(start,end,doc[start:end].text))

Start token: 0 | End token: 2 | Matched text: Good morning
Start token: 3 | End token: 5 | Matched text: good evening


In [19]:
# The same text with the NOT_IN operator
doc = nlp('Good morning and good evening.')

matcher = Matcher(nlp.vocab)

pattern = [{'LOWER':'good'},{'LOWER': {'NOT_IN':['day','evening']}}]

matcher.add('morning_greeting', [pattern])

matches = matcher(doc)

for match_id, start, end in matches:
    print('Start token: {0} | End token: {1} | Matched text: {2}'.format(start,end,doc[start:end].text))

Start token: 0 | End token: 2 | Matched text: Good morning
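The comparison operators from the table work the same way on numeric attributes. As a small sketch, assuming a blank English pipeline, the LENGTH token attribute (the number of characters in the token text) and a made-up rule name and sentence, the pattern below keeps only tokens of ten or more characters.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# a single-token pattern: token text must be at least 10 characters long
pattern = [{"LENGTH": {">=": 10}}]
matcher.add("long_token", [pattern])

doc = nlp("We discussed tokenization and lemmatization today")
long_tokens = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(long_tokens)
```

Here "discussed" has nine characters and is skipped, while "tokenization" and "lemmatization" pass the threshold.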


### 3. spaCy PhraseMatcher 

When processing unstructured text, we often have long lists or dictionaries of terms that we want to find in a given text. Matcher patterns are handcrafted, and each token must be specified individually, so Matcher is not the best option for a long list of phrases. In that case, the PhraseMatcher class lets us match against large term lists directly.

In [21]:
# import required libraries
from spacy.matcher import PhraseMatcher

In [24]:
# create a PhraseMatcher object
matcher = PhraseMatcher(nlp.vocab)

# create patterns
terms = ['Bill Gates', 'John Smith']
patterns = [nlp.make_doc(term) for term in terms]

matcher.add('PeopleOFInterest', patterns)

In [28]:
doc =  nlp('Bill Gates met John smith for an important discussion regarding importance of AI')

matches = matcher(doc)

for match_id, start, end in matches:
    print('Start token: {0} | End token: {1} | Matched text: {2}'.format(start,end,doc[start:end].text))

Start token: 0 | End token: 2 | Matched text: Bill Gates


In the output above, "John smith" is not matched because, by default, the PhraseMatcher compares the exact token text, and the text spells "smith" in lowercase. To match case-insensitively, or to match on the shape of a token, we can pass the attr (attribute) argument when creating the PhraseMatcher.

In [32]:
# Let us see an example of case insensitive matching

# create a matcher object with attr set to LOWER
matcher = PhraseMatcher(nlp.vocab, attr = 'lower')

# create patterns
terms = ['Bill Gates', 'John Smith']
patterns = [nlp.make_doc(term) for term in terms]

matcher.add('PeopleOFInterest', patterns)

doc =  nlp('Bill Gates met John smith for an important discussion regarding importance of AI')

matches = matcher(doc)

for match_id, start, end in matches:
    print('Start token: {0} | End token: {1} | Matched text: {2}'.format(start,end,doc[start:end].text))

Start token: 0 | End token: 2 | Matched text: Bill Gates
Start token: 3 | End token: 5 | Matched text: John smith


In [33]:
# Another example: matching on token shape

matcher = PhraseMatcher(nlp.vocab, attr = 'shape')

terms = ['9971729811']

patterns = [nlp.make_doc(term) for term in terms]

matcher.add('PhoneNumber', patterns)

doc =  nlp("John's phone number is 9173928731")

matches = matcher(doc)

for match_id, start, end in matches:
    print('Start token: {0} | End token: {1} | Matched text: {2}'.format(start,end,doc[start:end].text))

Start token: 5 | End token: 6 | Matched text: 9173928731
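A caveat when matching on shape: spaCy truncates runs of the same character class after four characters, so a ten-digit number has the shape dddd rather than ten d's. The sketch below assumes a blank English pipeline and an invented sentence. It prints the shapes of a few tokens and shows that the phone-number pattern above would also accept a five-digit number.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")

# inspect the shape of individual tokens
shapes = {text: nlp.make_doc(text)[0].shape_ for text in ["9971729811", "12345", "425"]}
print(shapes)

# same setup as the example above, matching on shape
matcher = PhraseMatcher(nlp.vocab, attr="SHAPE")
matcher.add("PhoneNumber", [nlp.make_doc("9971729811")])

doc = nlp("Order 12345 shipped to extension 425")
shape_hits = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(shape_hits)
```

If an exact digit count matters, a token-level REGEX pattern with anchors is likely the safer choice than a shape-based match.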
