# 1) Rule-Based Matching

[Demo for rule-based matching](https://explosion.ai/demos/matcher)

[spaCy docs](https://spacy.io/usage/rule-based-matching)

spaCy’s rule-based matcher engines and components not only let you find the words and phrases you’re looking for, they also give you access to the tokens within the document and their relationships.

This means you can easily access and analyze the surrounding tokens, merge spans into single tokens, or add entries to the named entities in doc.ents.

In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [2]:
# Import the Matcher class
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)  # create a Matcher object, passing in nlp.vocab

# The matcher is bound to the current Vocab object
# We can add and remove named match rules as needed

## Creating patterns

In [4]:
# A pattern is a list of dictionaries, one dictionary per token

# "Hello World" can appear in the following ways:
# 1) Hello World, hello world, Hello WORLD
# 2) Hello-World

pattern_1 = [{'LOWER': 'hello'}, {'LOWER': 'world'}]
pattern_2 = [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]

# 'LOWER' and 'IS_PUNCT' are token attributes
# The attribute names must be written exactly like this (uppercase)

In [6]:
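# Add patterns to the matcher object

# A match rule consists of:
# 1) an ID key
# 2) an on_match callback (None here, since we do not need one)
# 3) one or more patterns

matcher.add('Hello World', None, pattern_1, pattern_2)
# Note: this is the spaCy v2 call signature. In spaCy v3 the patterns are
# passed as a list: matcher.add('Hello World', [pattern_1, pattern_2])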
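
The second part of a match rule, the on_match callback, is left as None above. As a minimal sketch of what a callback could look like, the block below collects the matched text in a list. It assumes the spaCy v3 `add(key, patterns, on_match=...)` signature and uses a blank English pipeline, which is enough for `LOWER` and `IS_PUNCT`. The names `collect_match` and `matched_texts` are made up for this illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline (tokenizer only) is sufficient,
# because LOWER and IS_PUNCT are lexical attributes.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

matched_texts = []  # hypothetical container filled by the callback


def collect_match(matcher, doc, i, matches):
    """Called once per match; `i` is the index of the current match."""
    match_id, start, end = matches[i]
    matched_texts.append(doc[start:end].text)


pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]

# spaCy v3 signature: patterns go in a list, the callback is a keyword argument
matcher.add("HelloWorld", [pattern], on_match=collect_match)

doc = nlp("Beginners often print Hello-World first.")
matches = matcher(doc)  # running the matcher triggers the callback
print(matched_texts)
```

The callback receives the full list of matches, so it can also inspect neighbouring matches instead of only the current one.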

In [7]:
# create a document

doc = nlp(" 'Hello World' are the first two printed words for Hello WORLD most of the programmers, printing 'Hello-World' is most common for beginners")

In [12]:
doc

 'Hello World' are the first two printed words for Hello WORLD most of the programmers, printing 'Hello-World' is most common for beginners

In [20]:
print("[", end="")
for token in doc:
  print(token, end=',')

[ ,',Hello,World,',are,the,first,two,printed,words,for,Hello,WORLD,most,of,the,programmers,,,printing,',Hello,-,World,',is,most,common,for,beginners,

## Finding the matches

In [10]:
find_matches = matcher(doc)  # pass the doc to the matcher and store the result
print(find_matches)

# Returns a list of tuples:
# (match ID, start token index, end token index); the end index is exclusive

[(8585552006568828647, 2, 4), (8585552006568828647, 12, 14), (8585552006568828647, 21, 24)]


In [11]:
# loop over the matches and print their details

for match_id, start, end in find_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text)

8585552006568828647 Hello World 2 4 Hello World
8585552006568828647 Hello World 12 14 Hello WORLD
8585552006568828647 Hello World 21 24 Hello-World
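
The introduction mentions adding matches to doc.ents and merging spans into single tokens. A rough sketch of both steps follows, assuming spaCy v3 and a blank English pipeline; the label "GREETING" is a made-up example, not a label from any trained model.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

nlp = spacy.blank("en")  # assumption: no trained model is needed here
matcher = Matcher(nlp.vocab)
matcher.add("HelloWorld", [[{"LOWER": "hello"}, {"LOWER": "world"}]])

doc = nlp("Most tutorials start with Hello World")

# Turn every match into a labelled Span ("GREETING" is a hypothetical label)
new_ents = [Span(doc, start, end, label="GREETING")
            for match_id, start, end in matcher(doc)]

# Entities must not overlap, otherwise this assignment raises an error
doc.ents = list(doc.ents) + new_ents
ent_summary = [(ent.text, ent.label_) for ent in doc.ents]

# Merge each matched span into a single token
with doc.retokenize() as retokenizer:
    for span in new_ents:
        retokenizer.merge(span)
token_texts = [token.text for token in doc]

print(ent_summary)
print(token_texts)
```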


In [21]:
# Remove the 'Hello World' rule from the matcher
matcher.remove('Hello World')
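
To check that a rule is really gone, the matcher can be queried with `in` and `len`. This is a small sketch on a fresh matcher, assuming the spaCy v3 `add` signature.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
matcher = Matcher(nlp.vocab)
matcher.add("Hello World", [[{"LOWER": "hello"}, {"LOWER": "world"}]])

present_before = "Hello World" in matcher  # is the rule registered?
rules_before = len(matcher)                # number of rules in the matcher

matcher.remove("Hello World")

present_after = "Hello World" in matcher
rules_after = len(matcher)

print(present_before, rules_before, present_after, rules_after)
```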

## Setting pattern options and quantifiers

In [22]:
# Redefine the patterns:
pattern_3 = [{'LOWER': 'hello'}, {'LOWER': 'world'}]
pattern_4 = [{'LOWER': 'hello'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'world'}]
# 'OP': '*' allows the punctuation token to match zero or more times

# Add the new set of patterns to the 'Hello World' matcher:
matcher.add('Hello World', None, pattern_3, pattern_4)

In [23]:
doc_2 = nlp("You can print Hello World or hello world or Hello-World")

In [24]:
find_matches = matcher(doc_2)
print(find_matches)

[(8585552006568828647, 3, 5), (8585552006568828647, 6, 8), (8585552006568828647, 9, 12)]
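
Besides '*', the 'OP' key also accepts '?' (zero or one) and '+' (one or more). The sketch below compares the two on the same sentence. It assumes spaCy v3, and the helper `matched_texts` is a hypothetical convenience function written for this example.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline


def matched_texts(pattern, text):
    """Return the text of every span that `pattern` matches in `text`."""
    matcher = Matcher(nlp.vocab)
    matcher.add("DEMO", [pattern])
    doc = nlp(text)
    return [doc[start:end].text for _, start, end in matcher(doc)]


# '?' makes the punctuation token optional (zero or one occurrence)
optional_punct = [{"LOWER": "hello"},
                  {"IS_PUNCT": True, "OP": "?"},
                  {"LOWER": "world"}]

# '+' requires at least one punctuation token between the two words
required_punct = [{"LOWER": "hello"},
                  {"IS_PUNCT": True, "OP": "+"},
                  {"LOWER": "world"}]

text = "You can print Hello World or Hello-World"
optional_hits = matched_texts(optional_punct, text)
required_hits = matched_texts(required_punct, text)

print(optional_hits)
print(required_hits)
```

The optional pattern should pick up both "Hello World" and "Hello-World", while the required pattern should only pick up "Hello-World".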


# 2) Phrase Matching - Compound Words

In the section above we used token patterns to perform rule-based matching. An alternative, and more efficient, method for large terminology lists is to match on the phrases directly.

In this case we create a Doc object for each phrase in the list and pass those Doc objects to a PhraseMatcher instead.


In [25]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [26]:
# Import the PhraseMatcher class
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

In [28]:
phrase_list = ["Barack Obama", "Angela Merkel", "Washington, D.C."]

In [29]:
# Convert each phrase to a Doc object
phrase_patterns = [nlp(text) for text in phrase_list]  # list comprehension: one Doc per phrase
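
Calling `nlp(text)` runs the whole pipeline on every phrase. Since PhraseMatcher compares the verbatim token text by default, tokenizing alone should be enough, and `nlp.make_doc` does only that. Below is a sketch under that assumption, using the spaCy v3 `add` signature and a blank English pipeline.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline

phrase_list = ["Barack Obama", "Angela Merkel", "Washington, D.C."]

# make_doc only tokenizes, which is cheaper than running the full pipeline
phrase_patterns = [nlp.make_doc(text) for text in phrase_list]

matcher = PhraseMatcher(nlp.vocab)
matcher.add("TerminologyList", phrase_patterns)  # v3: pass the list itself

doc = nlp("Angela Merkel met Barack Obama in Washington, D.C.")
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```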

In [30]:
phrase_patterns
# the patterns are Doc objects, not strings

[Barack Obama, Angela Merkel, Washington, D.C.]

In [31]:
type(phrase_patterns[0])
# they are spaCy Doc objects,
# which is why they are displayed without quotes

spacy.tokens.doc.Doc

In [33]:
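# pass each Doc object into the matcher
matcher.add("TerminologyList", None, *phrase_patterns)
# the asterisk unpacks the list so that each Doc is passed as a separate argument
# (spaCy v2 signature; in spaCy v3 pass the list itself:
#  matcher.add("TerminologyList", phrase_patterns))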

In [34]:
doc_3 = nlp("German Chancellor Angela Merkel and US President Barack Obama "
          "converse in the Oval Office inside the White House in Washington, D.C.")

In [38]:
print("[", end="")
for token in doc_3:
  print(token, end=',')

[German,Chancellor,Angela,Merkel,and,US,President,Barack,Obama,converse,in,the,Oval,Office,inside,the,White,House,in,Washington,,,D.C.,

In [35]:
find_matches = matcher(doc_3)  # pass the doc to the matcher and store the result
print(find_matches)

[(3766102292120407359, 2, 4), (3766102292120407359, 7, 9), (3766102292120407359, 19, 22)]


In [37]:
# loop over the matches and print their details

for match_id, start, end in find_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc_3[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text)

3766102292120407359 TerminologyList 2 4 Angela Merkel
3766102292120407359 TerminologyList 7 9 Barack Obama
3766102292120407359 TerminologyList 19 22 Washington, D.C.
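
The matches above are case-sensitive, because PhraseMatcher compares the exact token text by default. As a sketch of a case-insensitive variant, the `attr` argument can be set to "LOWER" so that lowercase forms are compared instead. The example sentence is made up for illustration, and the code assumes spaCy v3.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline

# attr="LOWER" compares the lowercase form of each token
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
patterns = [nlp.make_doc(text) for text in ["Barack Obama", "Angela Merkel"]]
matcher.add("TerminologyList", patterns)

doc = nlp("angela merkel and BARACK OBAMA met in Berlin")
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```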
