__Vocabulary and Phrase Matching__

Vocabulary and phrase matching are two techniques used in natural language processing (NLP) to identify words and phrases that are relevant to a specific task, such as information retrieval or sentiment analysis.

Vocabulary matching involves comparing a text document to a predefined list of words or terms that are relevant to the task at hand. This list of words is often referred to as a vocabulary or dictionary. The goal of vocabulary matching is to identify instances of the relevant words or terms in the text. This can be done using simple string matching techniques, such as checking if a word in the text is present in the vocabulary, or more complex methods, such as using regular expressions to match patterns of text.
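
The two approaches described above can be sketched in plain Python before turning to a library. This is a minimal illustration, not a production matcher: the `VOCABULARY` set, the sample sentence, and the helper names `match_vocabulary` and `match_pattern` are all made up for this example.

```python
import re

# Hypothetical vocabulary, for illustration only.
VOCABULARY = {"solar", "wind", "battery"}

def match_vocabulary(text, vocabulary):
    """Return (word, character_offset) for every alphabetic token found in the vocabulary."""
    hits = []
    for m in re.finditer(r"[A-Za-z]+", text):
        # Compare case-insensitively by lowercasing the token.
        if m.group().lower() in vocabulary:
            hits.append((m.group(), m.start()))
    return hits

def match_pattern(text, pattern):
    """Return every substring that matches a regular expression, ignoring case."""
    return [m.group() for m in re.finditer(pattern, text, flags=re.IGNORECASE)]

sample = "Solar panels charge the battery; solar-powered cars and SolarPower plants need no fuel."

# Simple membership test against the vocabulary.
vocab_hits = match_vocabulary(sample, VOCABULARY)

# Regular expression covering "solarpower", "solar power" and "solar-power" variants.
regex_hits = match_pattern(sample, r"\bsolar[-\s]?power\w*")

print(vocab_hits)
print(regex_hits)
```

The set lookup only finds exact single-word entries, while the regular expression also catches spelling variants, at the cost of a pattern that has to be written and maintained by hand.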

Phrase matching, by contrast, identifies sequences of words that are relevant to the task. It can be rule-based or statistical. Rule-based methods define patterns with regular expressions or a similar pattern syntax and return every span of text that fits one of the patterns. Statistical methods train a model on a corpus of text and then use that model to identify relevant phrases in new text.
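
As a rough sketch of the rule-based variant, the snippet below slides a window over the tokens of a text and reports every window that equals a listed phrase. The `PHRASES` list, the regex tokenizer, and the function names are assumptions made for this example. It does not cover the statistical approach.

```python
import re

# Hypothetical phrase list, for illustration only.
PHRASES = ["solar power", "electric car"]

def tokenize(text):
    """Lowercase the text and split it into word tokens (punctuation is dropped)."""
    return re.findall(r"\w+", text.lower())

def match_phrases(text, phrases):
    """Return (phrase, start, end) token spans for every listed phrase; end is exclusive."""
    tokens = tokenize(text)
    targets = {tuple(tokenize(p)) for p in phrases}
    lengths = sorted({len(t) for t in targets})
    hits = []
    for start in range(len(tokens)):
        for n in lengths:
            window = tuple(tokens[start:start + n])
            # Skip windows that run past the end of the text.
            if len(window) == n and window in targets:
                hits.append((" ".join(window), start, start + n))
    return hits

phrase_hits = match_phrases(
    "Solar power charges the electric car. Solar-power is cheap.", PHRASES
)
print(phrase_hits)
```

Because the tokenizer drops punctuation, "Solar-power" is treated the same as "Solar power" here. A tokenizer that keeps the hyphen as its own token would need an explicit rule for it.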

Both techniques make it possible to locate relevant text in large volumes of data, and both are used in applications such as information retrieval, sentiment analysis, and machine translation.

### Vocabulary Matcher

In [None]:
# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')

In [None]:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

In [None]:
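# Creating patterns
pattern1 = [{'LOWER': 'solarpower'}]                                     # "solarpower"
pattern2 = [{'LOWER': 'solar'}, {'LOWER': 'power'}]                      # "solar power"
pattern3 = [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}]  # "solar-power"

# spaCy v2 signature: add(key, on_match, *patterns), so each pattern is passed separately.
# In spaCy v3+ the call is matcher.add('SolarPower', [pattern1, pattern2, pattern3]).
matcher.add('SolarPower', None, pattern1, pattern2, pattern3)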


In [None]:
doc = nlp(u'The Solar Power industry continues to grow as demand for solarpower increases. Solar-power cars are gaining popularity.')

In [None]:
found_matches = matcher(doc)
print(found_matches)

In [None]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id, string_id, start, end, span.text)

In [None]:
# Redefine the patterns
pattern1 = [{'LOWER':'solarpower'}]
pattern2 = [{'LOWER':'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}]

# Remove the old patterns to avoid duplication
matcher.remove('SolarPower')

# Add the new set of patterns to the 'SolarPower' matcher
matcher.add('SolarPower', None, pattern1, pattern2)

In [None]:
found_matches = matcher(doc)
print(found_matches)

In [None]:
doc2 = nlp(u'Solar-powered energy runs solar-powered cars.')

In [None]:
found_matches = matcher(doc2)
print(found_matches)

### Phrase Matcher

In [None]:
# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')

In [None]:
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')

In [None]:
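# Terms to search for; matching is case-insensitive because attr='LOWER'
terms = ['Galaxy Note', 'iPhone 11', 'iPhone XS', 'Google Pixel']
# Convert each term into a Doc object so it can be used as a pattern
patterns = [nlp(text) for text in terms]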
matcher.add('TerminologyList', None, *patterns)

In [None]:
text_doc = nlp("text here you want to match")
matches = matcher(text_doc)
print(matches)