# MATCHER

### TOKEN / RULE BASED MATCHING

- Compared to using regular expressions on raw text, spaCy’s rule-based matcher engines and components not only let you find the words and phrases you’re looking for – they also give you access to the tokens within the document and their relationships. This means you can easily access and analyze the surrounding tokens.


In [3]:
import spacy

In [4]:
nlp = spacy.load("en_core_web_sm")


In [6]:
from spacy.matcher import Matcher 

In [7]:
# Create a Matcher object; it shares the vocabulary (nlp.vocab) of the loaded English pipeline
matcher = Matcher(nlp.vocab) 

In [9]:
text = nlp("Data Coaster !")
for token in text:
    print(token)

Data
Coaster
!


In [13]:
# Designing patterns: each pattern is a list of dicts, one dict per token
pat1 = [{'LOWER': 'data'}]
# LOWER matches on the lowercase form of the token text, so 'Data' and 'data' both match
pat2 = [{'LOWER': 'data'}, {'LOWER': 'coaster'}]
pat3 = [{'LOWER': 'data'}, {'LOWER': 'coaster'}, {'IS_PUNCT': True}]
pat4 = [{'LOWER': 'data'}, {'LOWER': 'coaster'}, {'IS_PUNCT': True, 'OP': '?'}]
# OP '?' makes the token optional: the punctuation may occur once or not at all

In [14]:
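matcher.add('DataCoaster', None, pat1, pat2, pat3, pat4)  # parameters: rule key, callback function, patterns
# callback function: called whenever a match is found; None means no callback is used
# Note: this is the spaCy v2 signature; in spaCy v3 the call is matcher.add('DataCoaster', [pat1, pat2, pat3, pat4])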
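
A callback, when supplied, is called once per match with the arguments `(matcher, doc, i, matches)`. A minimal sketch, assuming spaCy v3 (patterns passed as a list, callback passed as the keyword `on_match`) and a blank English pipeline so that no model download is needed; the `collect_match` helper and the `seen` list are hypothetical names used only for illustration:

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank pipeline (tokenizer only) is enough for LOWER patterns.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

seen = []  # hypothetical container filled by the callback


def collect_match(matcher, doc, i, matches):
    """Store the text of the i-th match each time the matcher finds one."""
    match_id, start, end = matches[i]
    seen.append(doc[start:end].text)


pattern = [{"LOWER": "data"}, {"LOWER": "coaster"}]
matcher.add("DataCoaster", [pattern], on_match=collect_match)

matcher(nlp("Thanks and regards data coaster !"))
print(seen)  # ['data coaster']
```

Passing `None` (or leaving `on_match` out) simply returns the matches without any side effect, which is what the cell above does.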

In [15]:
doc = nlp('Data Coaster is a place where we are learning together. Thanks and regards data coaster !')

In [16]:
# Find all matches in the doc

found_match = matcher(doc)

In [17]:
found_match  # list of (match_id, start, end) tuples: match_id is the hash of the rule key, start/end are token indices

[(4758860955724850976, 0, 1),
 (4758860955724850976, 0, 2),
 (4758860955724850976, 14, 15),
 (4758860955724850976, 14, 16),
 (4758860955724850976, 14, 17)]

In [20]:
for matches_id,start,end in found_match:
    string_id = nlp.vocab.strings[matches_id]  # look up the readable rule key for the hash
    span = doc[start:end]  # the matched Span (end index is exclusive)
    print(matches_id,string_id,start,end,span.text)
    

4758860955724850976 DataCoaster 0 1 Data
4758860955724850976 DataCoaster 0 2 Data Coaster
4758860955724850976 DataCoaster 14 15 data
4758860955724850976 DataCoaster 14 16 data coaster
4758860955724850976 DataCoaster 14 17 data coaster !
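
Because `pat1` to `pat4` overlap, the same stretch of text is reported several times. If only the longest match at each position is wanted, the tuples can be filtered afterwards. The sketch below is plain Python over the `(match_id, start, end)` tuples printed above; `keep_longest` is a hypothetical helper, not part of spaCy (spaCy itself offers `spacy.util.filter_spans` for `Span` objects).

```python
def keep_longest(matches):
    """Keep the longest match among overlapping (match_id, start, end) tuples."""
    # Visit longer matches first; ties are broken by the earlier start.
    ordered = sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1]))
    kept, used = [], set()
    for match_id, start, end in ordered:
        # Accept a match only if none of its token positions is already taken.
        if not any(i in used for i in range(start, end)):
            kept.append((match_id, start, end))
            used.update(range(start, end))
    return sorted(kept, key=lambda m: m[1])


# The matches returned for the 'DataCoaster' rule above.
found_match = [
    (4758860955724850976, 0, 1),
    (4758860955724850976, 0, 2),
    (4758860955724850976, 14, 15),
    (4758860955724850976, 14, 16),
    (4758860955724850976, 14, 17),
]

longest = keep_longest(found_match)
print(longest)
# [(4758860955724850976, 0, 2), (4758860955724850976, 14, 17)]
```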


# - Adding a new rule does not replace the earlier ones; remove the old rule by its key if it should no longer match

In [23]:
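matcher.remove('DataCoaster')  # remove the old rule by its key; raises an error if the key is not in the matcher (e.g. when this cell is re-run)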

In [24]:
pattern1 = [{'LOWER': 'datascience'}]
pattern2 = [{'LOWER': 'data'}, {'LOWER': 'science'}, {'IS_PUNCT': True, 'OP': '?'}]

In [25]:
matcher.add('DataScience',None,pattern1,pattern2)

In [30]:
txt=nlp('Data Science !! is data science-')
matching_found=matcher(txt)

In [31]:
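matching_found  # only the first 'Data Science' is found; 'science-' presumably stays a single token, so its LOWER is not 'science'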

[(2139155487204529523, 0, 2), (2139155487204529523, 0, 3)]

# PHRASE MATCHER

-  The PhraseMatcher lets you efficiently match large terminology lists. While the Matcher lets you match sequences based on lists of token descriptions, the PhraseMatcher accepts match patterns in the form of Doc objects.

In [13]:
import spacy

In [16]:
nlp = spacy.load("en_core_web_sm")

In [19]:
from spacy.matcher import PhraseMatcher

In [20]:
matcher = PhraseMatcher(nlp.vocab)

In [21]:
doc = nlp("""Cricket is a bat-and-ball game played between two teams of eleven players on a field at the centre of
            which is a 22-yard pitch with a wicket at each end, each comprising two bails balanced on three stumps. 
            The batting side scores runs by striking the ball bowled at the wicket with the bat,
            while the bowling and fielding side tries to prevent this and dismiss each batter (so they are "out"). 
            Means of dismissal include being bowled, when the ball hits the stumps and dislodges the bails, 
            and by the fielding side catching the ball after it is hit by the bat, but before it hits the ground. 
            When ten batters have been dismissed, the innings ends and the teams swap roles.""")
           

In [22]:
doc

Cricket is a bat-and-ball game played between two teams of eleven players on a field at the centre of
            which is a 22-yard pitch with a wicket at each end, each comprising two bails balanced on three stumps. 
            The batting side scores runs by striking the ball bowled at the wicket with the bat,
            while the bowling and fielding side tries to prevent this and dismiss each batter (so they are "out"). 
            Means of dismissal include being bowled, when the ball hits the stumps and dislodges the bails, 
            and by the fielding side catching the ball after it is hit by the bat, but before it hits the ground. 
            When ten batters have been dismissed, the innings ends and the teams swap roles.

In [23]:
phrase_to_find = ['bat-and-ball','22-yard pitch','three stumps','fielding side']

In [24]:
phrase_pattern=[]

In [25]:
# Turn every phrase into a Doc object, because the PhraseMatcher takes Doc patterns
for text in phrase_to_find:
    phrase_pattern.append(nlp(text))
    
    

In [26]:
phrase_pattern

[bat-and-ball, 22-yard pitch, three stumps, fielding side]

In [27]:
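matcher.add('CricketPhrases', None, *phrase_pattern)  # spaCy v2 signature: rule key, callback, then the Doc patterns
# In spaCy v3 the call is matcher.add('CricketPhrases', phrase_pattern)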

In [28]:
match_found=matcher(doc)

In [29]:
match_found

[(4975789261877713877, 3, 8),
 (4975789261877713877, 27, 29),
 (4975789261877713877, 42, 44),
 (4975789261877713877, 68, 70),
 (4975789261877713877, 110, 112)]

In [32]:
for matches_id,start,end in match_found:
    string_id=nlp.vocab.strings[matches_id]  # readable rule key for the hash
    span = doc[start:end]  # the matched phrase as a Span
    print(matches_id,string_id,start,end,span.text)
    

4975789261877713877 CricketPhrases 3 8 bat-and-ball
4975789261877713877 CricketPhrases 27 29 22-yard pitch
4975789261877713877 CricketPhrases 42 44 three stumps
4975789261877713877 CricketPhrases 68 70 fielding side
4975789261877713877 CricketPhrases 110 112 fielding side
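
By default the PhraseMatcher compares the verbatim token text, so a capitalised "Fielding Side" would not match the pattern "fielding side". A minimal sketch of case-insensitive phrase matching, assuming spaCy v3, the `attr="LOWER"` option of `PhraseMatcher`, and a blank English pipeline so that no model download is needed; the example sentence is made up for illustration:

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: tokenizer-only pipeline; LOWER is a lexical attribute and needs no model.
nlp = spacy.blank("en")

# attr="LOWER" compares lowercase token text instead of the verbatim text.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

terms = ["fielding side", "three stumps"]
# nlp.make_doc only tokenizes, which is all the patterns need here.
patterns = [nlp.make_doc(term) for term in terms]
matcher.add("CricketPhrases", patterns)

doc = nlp("The Fielding Side tries to hit the three stumps.")
found = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(found)
```

The matched spans keep the original casing of the text, since only the comparison is done on the lowercase form.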


In [33]:
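for matches_id,start,end in match_found:
    string_id=nlp.vocab.strings[matches_id]
    # Show 4 tokens before and 5 after each match; clamp at 0 because a negative
    # start would be read as an index from the end and give an empty span
    span = doc[max(start-4, 0):end+5]
    print(matches_id,string_id,start,end,span.text)

4975789261877713877 CricketPhrases 3 8 Cricket is a bat-and-ball game played between two teams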
4975789261877713877 CricketPhrases 27 29 
            which is a 22-yard pitch with a wicket at each
4975789261877713877 CricketPhrases 42 44 two bails balanced on three stumps. 
            The batting side
4975789261877713877 CricketPhrases 68 70 while the bowling and fielding side tries to prevent this and
4975789261877713877 CricketPhrases 110 112 
            and by the fielding side catching the ball after it
