## Phrase and vocabulary matching
### With spaCy we can define a phrase and have spaCy locate where it occurs in a Doc (the processed text)
### This can be done in plain Python, but spaCy offers simpler ways to handle the spaces, punctuation, and other symbols that may appear between the words
### In spaCy this is done with the Matcher
### The Matcher tries to match a word or phrase, described as a token pattern, against the tokens of a Doc
### We can register a list of patterns and ask the Matcher to find all of them
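For comparison, here is a rough plain-Python version of the same idea using the standard `re` module. It is only a sketch: the regular expression and the names `SOLAR_POWER` and `regex_hits` are made up for this illustration, and it works on raw characters rather than on spaCy tokens.

```python
import re

# Case-insensitive "solar", then any run of spaces or hyphens
# (possibly empty), then "power" as the end of the word.
SOLAR_POWER = re.compile(r"\bsolar[\s-]*power\b", re.IGNORECASE)

text = ("The Solar Power industry continues to grow as demand "
        "for solarpower increases. Solar-power cars are gaining popularity.")

# Collect the matched substrings in the order they appear.
regex_hits = [m.group(0) for m in SOLAR_POWER.finditer(text)]
print(regex_hits)
```

This covers the three spellings, but every new variation (other punctuation, inflected forms such as "powered") means editing the expression by hand, which is where token-based patterns become more convenient.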

In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')



In [2]:
from spacy.matcher import Matcher
## The Matcher is created with the pipeline's shared vocabulary (nlp.vocab)
matcher= Matcher(nlp.vocab)

#### Now we will find the phrase "solarpower" in a text
#### It can be written as solar power, solar-power, solarpower, or some other variant
#### We will define patterns that describe which token sequences to find
#### Patterns are lists of dictionaries, one dictionary of rules per token

In [3]:

## We will define a pattern for each spelling, starting with "solarpower"
pattern1= [{'LOWER':'solarpower'}]
## "solar power"
pattern2=[{'LOWER':'solar'}, {'LOWER':'power'}]
## "solar-power"
pattern3=[{'LOWER':'solar'}, {'IS_PUNCT':True}, {'LOWER':'power'}]

#### Now we will create a sentence to run the matcher on

In [4]:
doc= nlp(u"The Solar Power industry continues to grow as demand \
for solarpower increases. Solar-power cars are gaining popularity.")

In [5]:
## This is not just text; it is a spaCy Doc object
doc

The Solar Power industry continues to grow as demand for solarpower increases. Solar-power cars are gaining popularity.

#### We add the patterns to the matcher before searching for the phrases

In [6]:
matcher.add('SolarPower', None, pattern1, pattern2, pattern3)

In [7]:
## We can view the patterns
matcher.get('SolarPower')

(None,
 [[{'LOWER': 'solarpower'}],
  [{'LOWER': 'solar'}, {'LOWER': 'power'}],
  [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}]])
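
The `add` call above uses the spaCy 2.x signature (`key, on_match, *patterns`). As far as the spaCy documentation describes it, releases from 2.2.2 onward also accept the patterns as one list in second position, and 3.x accepts only that form. Below is a minimal sketch of the list form. The names `nlp_blank`, `list_matcher`, and `doc_demo` are chosen for this example, and `spacy.blank("en")` is used on the assumption that the English tokenizer alone is enough here, since `LOWER` and `IS_PUNCT` do not need a trained model.

```python
import spacy
from spacy.matcher import Matcher

# Tokenizer-only English pipeline (assumption: no trained model is needed
# because the patterns only use lexical attributes).
nlp_blank = spacy.blank("en")
list_matcher = Matcher(nlp_blank.vocab)

solar_patterns = [
    [{"LOWER": "solarpower"}],
    [{"LOWER": "solar"}, {"LOWER": "power"}],
    [{"LOWER": "solar"}, {"IS_PUNCT": True}, {"LOWER": "power"}],
]
# List form: the patterns are passed as one list in second position.
list_matcher.add("SolarPower", solar_patterns)

doc_demo = nlp_blank("The Solar Power industry continues to grow as demand "
                     "for solarpower increases. Solar-power cars are gaining popularity.")

# Sort by start position so the spans are listed in reading order.
demo_matches = sorted(list_matcher(doc_demo), key=lambda m: m[1])
demo_spans = [doc_demo[start:end].text for _, start, end in demo_matches]
print(demo_spans)
```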

In [8]:
## Let's pass the doc to the matcher and find the matching spans
found_matches= matcher(doc)
found_matches

[(8656102463236116519, 1, 3),
 (8656102463236116519, 10, 11),
 (8656102463236116519, 13, 16)]

### Each tuple in the list above holds the match ID (the hash of the pattern name), followed by the start and end token positions of the match
#### We can expand this a little and display the matched words

In [9]:
for match_id, start, end in found_matches:
    ## Look up the pattern name for this match ID in the StringStore
    string_id= nlp.vocab.strings[match_id]
    #print(string_id)
    ## Slice the doc to get the matched span
    span= doc[start: end]
    print(match_id, string_id, start, end, span.text)

8656102463236116519 SolarPower 1 3 Solar Power
8656102463236116519 SolarPower 10 11 solarpower
8656102463236116519 SolarPower 13 16 Solar-power


In [10]:
for text in doc:
    print(text.lemma_, text.pos_)

the DET
Solar PROPN
Power PROPN
industry NOUN
continue VERB
to PART
grow VERB
as SCONJ
demand NOUN
for ADP
solarpow ADJ
increase NOUN
. PUNCT
solar ADJ
- PUNCT
power NOUN
car NOUN
be AUX
gain VERB
popularity NOUN
. PUNCT


### Now let's try another sentence that is slightly different
### Note: here the wording differs (solar-powered instead of solar power)
### Let's check the result

In [11]:
doc= nlp(u"'Solar-powered energy runs solar-powered cars.")

In [12]:
found_matches= matcher(doc)
found_matches

[]

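### We didn't get any results: none of the patterns accepts "powered" where "power" is expected
### When creating a pattern in spaCy, we can choose among several token attributes
## Other token attributes
Besides `LOWER` and `IS_PUNCT` used so far, there are a variety of token attributes we can use to determine matching rules: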
<table><tr><th>Attribute</th><th>Description</th></tr>

<tr ><td><span >`ORTH`</span></td><td>The exact verbatim text of a token</td></tr>
<tr ><td><span >`LOWER`</span></td><td>The lowercase form of the token text</td></tr>
<tr ><td><span >`LENGTH`</span></td><td>The length of the token text</td></tr>
<tr ><td><span >`IS_ALPHA`, `IS_ASCII`, `IS_DIGIT`</span></td><td>Token text consists of alphabetic characters, ASCII characters, digits</td></tr>
<tr ><td><span >`IS_LOWER`, `IS_UPPER`, `IS_TITLE`</span></td><td>Token text is in lowercase, uppercase, titlecase</td></tr>
<tr ><td><span >`IS_PUNCT`, `IS_SPACE`, `IS_STOP`</span></td><td>Token is punctuation, whitespace, stop word</td></tr>
<tr ><td><span >`LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL`</span></td><td>Token text resembles a number, URL, email</td></tr>
<tr ><td><span >`POS`, `TAG`, `DEP`, `LEMMA`, `SHAPE`</span></td><td>The token's simple and extended part-of-speech tag, dependency label, lemma, shape</td></tr>
<tr ><td><span >`ENT_TYPE`</span></td><td>The token's entity label</td></tr>

</table>

### From this table we can pick LEMMA for the last token, so that "powered" is matched through its lemma "power"
### Let's try!
### The second dict of the pattern below also gets one more key, OP. The value * means the punctuation token may occur zero or more times, not exactly once
### With that, a single pattern covers both two-word forms, with and without the hyphen
### The other available quantifiers are listed below

The following quantifiers can be passed to the `'OP'` key:
<table><tr><th>OP</th><th>Description</th></tr>

<tr ><td><span >\!</span></td><td>Negate the pattern, by requiring it to match exactly 0 times</td></tr>
<tr ><td><span >?</span></td><td>Make the pattern optional, by allowing it to match 0 or 1 times</td></tr>
<tr ><td><span >\+</span></td><td>Require the pattern to match 1 or more times</td></tr>
<tr ><td><span >\*</span></td><td>Allow the pattern to match zero or more times</td></tr>
</table>


In [13]:
patternlemma=[{'LOWER':'solar'}, {'IS_PUNCT':True, 'OP':'*'}, {'LEMMA':'power'}]

In [14]:
matcher.add('SolarPower', None, patternlemma)

In [15]:
matcher.get('SolarPower')

(None,
 [[{'LOWER': 'solarpower'}],
  [{'LOWER': 'solar'}, {'LOWER': 'power'}],
  [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}],
  [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP': '*'}, {'LEMMA': 'power'}]])

### Let's test it now

In [16]:
found_matches= matcher(doc)
found_matches

[(8656102463236116519, 1, 4), (8656102463236116519, 6, 9)]

In [17]:
doc

'Solar-powered energy runs solar-powered cars.

#### Let's check the lemmas of the doc

In [18]:
## The lemma of "powered" is "power", which is why the LEMMA pattern matched
for text in doc:
    print(text.lemma_)

'
Solar
-
power
energy
run
solar
-
power
car
.


### We can even match hashtags

In [19]:
doc2=nlp(u'Hello Bangalore #HelloBglre #Gm #FeelingFit')

### If we are not sure which word follows the hash sign, we can use an empty dict {} as a wildcard that matches any single token

In [20]:
patternhash=[{'ORTH': '#'}, {}]
matcher.add('Hash',None, patternhash)

In [21]:
matcher.get('Hash')

(None, [[{'ORTH': '#'}, {}]])

In [22]:
found_matches= matcher(doc2)
found_matches

[(13232031677015429232, 2, 4),
 (13232031677015429232, 4, 6),
 (13232031677015429232, 6, 8)]

In [23]:
for match_id, start, end in found_matches:
    string_id= nlp.vocab.strings[match_id]
    print(string_id)
    span = doc2[start:end]
    print(span.text)

Hash
#HelloBglre
Hash
#Gm
Hash
#FeelingFit


### Now we will look at the PhraseMatcher
### Instead of token patterns, we pass a list of phrases (as Doc objects) to the PhraseMatcher

In [24]:
from spacy.matcher import PhraseMatcher
phrasematcher=PhraseMatcher(nlp.vocab)

### We will use the text of the Reaganomics article; the link is below for reference
https://en.wikipedia.org/wiki/Reaganomics
### We have a copy of it saved as a plain-text file

In [25]:
with open('reaganomics.txt','r', errors='ignore') as reaga:
    raeganomics=nlp(reaga.read())

In [26]:
## Let's take some phrases from the text and search for them in the file
phrase_list=['voodoo economics', 'supply-side economics', 'trickle-down economics', 'free-market economics']

In [27]:
### We convert every phrase to a spaCy Doc
phrase_nlp= [nlp(text) for text in phrase_list]

In [28]:
phrase_nlp

[voodoo economics,
 supply-side economics,
 trickle-down economics,
 free-market economics]

In [29]:
### Now let's add those phrase Docs to the PhraseMatcher
phrasematcher.add('Economics' , None, *phrase_nlp)

In [30]:
## Now let's run the PhraseMatcher on the document
found_matches= phrasematcher(raeganomics)
found_matches

[(11454562835486586514, 41, 45),
 (11454562835486586514, 49, 53),
 (11454562835486586514, 54, 56),
 (11454562835486586514, 61, 65),
 (11454562835486586514, 673, 677),
 (11454562835486586514, 2986, 2990)]

### Now let's look at the matched phrases

In [31]:
for match_id, start, end in found_matches:
    string_id= nlp.vocab.strings[match_id]
    span= raeganomics[start:end]
    print(string_id, span.text)

Economics supply-side economics
Economics trickle-down economics
Economics voodoo economics
Economics free-market economics
Economics supply-side economics
Economics trickle-down economics


### To see the context in which those phrases occur, we can widen the slice: subtract a few tokens from start and add a few to end to include the words before and after each phrase

In [32]:
for match_id, start, end in found_matches:
    string_id= nlp.vocab.strings[match_id]
    span= raeganomics[start-10:end+5]
    print(span.text)

during the 1980s. These policies are commonly associated with supply-side economics, referred to as trickle
associated with supply-side economics, referred to as trickle-down economics or voodoo economics by political
economics, referred to as trickle-down economics or voodoo economics by political opponents, and
down economics or voodoo economics by political opponents, and free-market economics by political advocates.


At the same time he attracted a following from the supply-side economics movement, which formed in
against institutions.[66] His policies became widely known as "trickle-down economics", due to the


### Now let's try the same with solar power
### We will put a few solar power variants in the phrase list and check which occurrences of solar-powered are detected

In [33]:
doc= nlp(u"'Solar-powered energy runs solar-powered cars.")

In [47]:
phrase_list=['SolarPower','Solar Power', 'solar-powered']
phrase_nlp= [nlp(text) for text in phrase_list]

In [50]:
phrasematcher.add('SolarPower',None, *phrase_nlp)

In [36]:
found_matches = phrasematcher(doc)
found_matches

[(8656102463236116519, 6, 9)]
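
Only the second occurrence (tokens 6 to 9) is returned. By default the PhraseMatcher compares the verbatim token text (`ORTH`), so the capitalized "Solar-powered" does not equal the phrase "solar-powered". The PhraseMatcher accepts an `attr` argument to compare on another attribute; the sketch below uses `attr="LOWER"` for case-insensitive matching. The names `nlp_blank`, `ci_matcher`, and `doc_ci` are chosen for this example, a tokenizer-only pipeline from `spacy.blank("en")` is assumed to be sufficient, and the patterns are passed in the list form accepted by spaCy 2.2.2 and later.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Tokenizer-only English pipeline (assumption: enough for LOWER matching).
nlp_blank = spacy.blank("en")

# attr="LOWER" makes the matcher compare lowercased token text.
ci_matcher = PhraseMatcher(nlp_blank.vocab, attr="LOWER")
ci_matcher.add("SolarPower", [nlp_blank.make_doc("solar-powered")])

doc_ci = nlp_blank("'Solar-powered energy runs solar-powered cars.")

# Keep only the token offsets and sort them in reading order.
ci_matches = sorted((start, end) for _, start, end in ci_matcher(doc_ci))
ci_spans = [doc_ci[start:end].text for start, end in ci_matches]
print(ci_matches)
print(ci_spans)
```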

### Let's try adding a token pattern to the phrase list

In [37]:
phrasematcher.remove('SolarPower')

In [38]:
phrase_list=['SolarPower','Solar Power']
phrase_nlp= [nlp(text) for text in phrase_list]

In [39]:
phrase_nlp.append(patternlemma)

In [40]:
phrasematcher.add('SolarPower',None, *phrase_nlp)

TypeError: unhashable type: 'dict'

### We cannot mix token patterns and phrases: the PhraseMatcher accepts only Doc objects, while list-of-dict patterns belong to the Matcher
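
Token patterns and phrases can still be applied to the same text by keeping two matchers and merging their results. The sketch below shows one possible way to do that, not an established recipe. The sentence, the phrase "solar panels", and the names `token_matcher`, `phrase_only`, and `merged` are invented for this example; it again assumes a tokenizer-only pipeline and the list form of `add` (spaCy 2.2.2 and later).

```python
import spacy
from spacy.matcher import Matcher, PhraseMatcher

nlp_blank = spacy.blank("en")  # tokenizer-only pipeline (assumption)

# Token patterns (lists of dicts) go to the Matcher.
token_matcher = Matcher(nlp_blank.vocab)
token_matcher.add("SolarPower", [
    [{"LOWER": "solar"}, {"IS_PUNCT": True, "OP": "*"}, {"LOWER": "powered"}],
])

# Fixed phrases (Doc objects) go to the PhraseMatcher.
phrase_only = PhraseMatcher(nlp_blank.vocab)
phrase_only.add("SolarPower", [nlp_blank.make_doc("solar panels")])

combined_doc = nlp_blank("Solar-powered cars and solar panels are common.")

# Union of both result lists, de-duplicated and sorted by position.
merged = sorted(set(token_matcher(combined_doc)) | set(phrase_only(combined_doc)),
                key=lambda m: (m[1], m[2]))
merged_spans = [combined_doc[start:end].text for _, start, end in merged]
print(merged_spans)
```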

In [41]:
spacy.__version__

'2.3.2'

## So to wrap it up
### If we want to find a fixed list of phrases, the PhraseMatcher is the simpler choice; it compares the exact token text by default and can ignore case when created with attr='LOWER'
### If we want to catch a phrase in different styles, such as solar powered, solar-powered, or inflected forms via lemmas, we use the Matcher