<a href="https://colab.research.google.com/github/drpetros11111/NLP_Portilia/blob/NLP_Spacy_Basics_1/05_Vocabulary_and_Matching.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

# Vocabulary and Matching
So far we've seen how a body of text is divided into tokens, and how individual tokens are parsed and tagged with parts of speech, dependencies and lemmas.

In this section we will identify and label specific phrases that match patterns we can define ourselves.

## Rule-based Matching
spaCy offers a rule-matching tool called `Matcher` that allows you to build a library of token patterns, then match those patterns against a Doc object to return a list of found matches. You can match on any part of the token including text and annotations, and you can add multiple patterns to the same matcher.

In [1]:
# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')

In [2]:
# Import the Matcher library
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

### Importing Matcher
The snippet above imports `Matcher` from spaCy and creates a matcher object from the model's vocabulary (`nlp.vocab`).

* `from spacy.matcher import Matcher` imports the `Matcher` class, which provides rule-based matching of specific token sequences (phrases, names, or other token patterns) in a text.
* `matcher = Matcher(nlp.vocab)` creates a `Matcher` object tied to the vocabulary of the current model. The `nlp.vocab` object holds the lexical information (words and their attributes) that the model uses.

Once the matcher exists, you can define patterns and use it to find them in any text processed by `nlp`.

### How the Matcher works
A pattern describes tokens by their attributes, such as the token's text, lemma (base form) or part-of-speech tag.

You then apply the matcher to a `Doc` object to find all occurrences of the patterns in the text.

### Example: adding and using patterns
Suppose you want to find every occurrence of the phrase "quick brown fox" in a text. Define a pattern for the sequence, add it to the matcher, and apply the matcher to a processed `Doc`:

    # Process a sample sentence
    doc = nlp("The quick brown fox jumped over the lazy dog's back.")

    # Define a pattern: one dictionary per token
    pattern = [{"LOWER": "quick"}, {"LOWER": "brown"}, {"LOWER": "fox"}]

    # Add the pattern to the matcher
    matcher.add("QUICK_BROWN_FOX", [pattern])

    # Apply the matcher to the processed doc
    matches = matcher(doc)

    # Print the matched spans
    for match_id, start, end in matches:
        span = doc[start:end]  # the matched span
        print(f"Matched: {span.text}")

### Explanation of the pattern
The pattern is a list of dictionaries, where each dictionary describes one token in the sequence:

* `{"LOWER": "quick"}` matches a token whose lowercase form is "quick".
* `{"LOWER": "brown"}` matches a token whose lowercase form is "brown".
* `{"LOWER": "fox"}` matches a token whose lowercase form is "fox".

`matcher.add("QUICK_BROWN_FOX", [pattern])` adds the pattern to the matcher under the name "QUICK_BROWN_FOX". The second argument is a list so that several patterns can share one name.

`matches = matcher(doc)` applies the matcher to `doc` and returns a list of `(match_id, start, end)` tuples, where:

* `match_id` is the hash of the pattern name.
* `start` is the index of the first token in the match.
* `end` is the index of the token just after the match (exclusive).

`span = doc[start:end]` extracts the matched tokens from the doc as a `Span`.

### Output
For the sentence "The quick brown fox jumped over the lazy dog's back.", the output is:

    Matched: quick brown fox


<font color=green>Here `matcher` is an object that pairs to the current `Vocab` object. We can add and remove specific named matchers to `matcher` as needed.</font>
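
As a minimal sketch of adding and removing a named rule, the snippet below uses a blank English pipeline (`spacy.blank("en")`) so that it runs without downloading a model. The rule name `Greeting` and its pattern are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline is enough, since we only need the vocab
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Add one named rule (hypothetical name and pattern)
matcher.add("Greeting", [[{"LOWER": "hello"}]])
has_greeting = "Greeting" in matcher   # membership test by rule name
count_before = len(matcher)            # number of named rules

# Remove the rule again by name
matcher.remove("Greeting")
count_after = len(matcher)

print(has_greeting, count_before, count_after)
```

Checking `name in matcher` before calling `remove` is a simple way to avoid an error when the rule was never added.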

### Creating patterns
In literature, the phrase 'solar power' might appear as one word or two, with or without a hyphen. In this section we'll develop a matcher named 'SolarPower' that finds all three:

In [4]:
# Import the Matcher library
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'LOWER': 'power'}]
pattern3 = [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}]

# Pass a list of patterns as the second argument
matcher.add('SolarPower', [pattern1, pattern2, pattern3])

Let's break this down:
* `pattern1` looks for a single token whose lowercase text reads 'solarpower'
* `pattern2` looks for two adjacent tokens that read 'solar' and 'power' in that order
* `pattern3` looks for three adjacent tokens, with a middle token that can be any punctuation.<font color=green>*</font>

<font color=green>\* Remember that single spaces are not tokenized, so they don't count as punctuation.</font>
<br>Once we define our patterns, we pass them to `matcher.add()` as a list under the name 'SolarPower'. The optional `on_match` callback is left at its default of `None` (more on callbacks later).
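
For readers curious about the callback now, here is a small sketch of what an `on_match` function might look like. The function name `collect_match` and the `collected` list are hypothetical; the four-argument signature `(matcher, doc, i, matches)` is the one spaCy passes to callbacks. A blank English pipeline is assumed so the snippet runs without a downloaded model.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline is sufficient
matcher = Matcher(nlp.vocab)

collected = []  # hypothetical container for matched text

def collect_match(matcher, doc, i, matches):
    """Called once per match; `i` is the index of the current match."""
    match_id, start, end = matches[i]
    collected.append(doc[start:end].text)

pattern = [{"LOWER": "solar"}, {"LOWER": "power"}]
matcher.add("SolarPower", [pattern], on_match=collect_match)

doc = nlp("The Solar Power industry continues to grow.")
matcher(doc)  # running the matcher triggers the callback
print(collected)
```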

### Applying the matcher to a Doc object

In [7]:
doc = nlp(u'The Solar Power industry continues to grow as demand \
for solarpower increases. Solar-power cars are gaining popularity.')
print(doc)

The Solar Power industry continues to grow as demand for solarpower increases. Solar-power cars are gaining popularity.


In [8]:
found_matches = matcher(doc)
print(found_matches)

[(8656102463236116519, 1, 3), (8656102463236116519, 10, 11), (8656102463236116519, 13, 16)]


### Applying the matcher

* `found_matches = matcher(doc)` applies the matcher (with the patterns added earlier) to `doc`, a text that has been processed by `nlp`. It returns a list of the matches found in the document.
* `print(found_matches)` prints those matches as a list of tuples.

Each tuple contains three values:

* `match_id`: an integer hash identifying the pattern name that was matched.
* `start`: the index of the token where the match starts (inclusive).
* `end`: the index of the token where the match ends (exclusive).

### Example of output
`found_matches` is a list of tuples of the form:

    [(match_id, start, end), (match_id, start, end), ...]

For example (the IDs here are placeholders):

    [(1234567890123456789, 2, 5), (9876543210987654321, 7, 9)]

* `match_id` is the hash of the name you gave the pattern when adding it to the `Matcher`; this is how spaCy tracks it internally.
* `start` and `end` are the token indices in the doc where the match occurred. Here the first match covers tokens 2 to 4 (`doc[2:5]`) and the second covers tokens 7 and 8 (`doc[7:9]`).

### Extracting and printing the matched text
Use the start and end indices to slice the matched span out of the document:

    for match_id, start, end in found_matches:
        matched_span = doc[start:end]  # the span of matched tokens
        print(f"Match found: {matched_span.text}")

This prints the text of each match.

### Example scenario
Suppose the document is "The quick brown fox jumps over the lazy dog." and a fresh matcher holds only a pattern for the phrase "quick brown fox":

    doc = nlp("The quick brown fox jumps over the lazy dog.")
    pattern = [{"LOWER": "quick"}, {"LOWER": "brown"}, {"LOWER": "fox"}]
    matcher.add("ANIMAL_ACTION", [pattern])
    found_matches = matcher(doc)
    print(found_matches)

Possible output (the ID shown is a placeholder for the hash of the pattern name):

    [(1234567890123456789, 1, 4)]

The match ID identifies the "quick brown fox" pattern. The matched span starts at index 1 ("quick") and ends before index 4 (the token after "fox").

To print the matched text:

    for match_id, start, end in found_matches:
        print(f"Matched: {doc[start:end].text}")

Output:

    Matched: quick brown fox

### In summary
* `matcher(doc)` finds all occurrences of the added patterns in `doc`.
* `found_matches` is a list of tuples, each holding the match ID and the start and end token indices of a matched span.
* The start and end indices let you slice the matched span from the doc and print its text.

`matcher` returns a list of tuples. Each tuple contains an ID for the match, with start & end tokens that map to the span `doc[start:end]`
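
As an alternative to unpacking the tuples by hand, spaCy v3 lets you ask the matcher for `Span` objects directly with `as_spans=True`. The sketch below assumes spaCy v3 and a blank English pipeline (only the tokenizer is needed), and reuses the three 'SolarPower' patterns from above.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline, no model download
matcher = Matcher(nlp.vocab)

pattern1 = [{"LOWER": "solarpower"}]
pattern2 = [{"LOWER": "solar"}, {"LOWER": "power"}]
pattern3 = [{"LOWER": "solar"}, {"IS_PUNCT": True}, {"LOWER": "power"}]
matcher.add("SolarPower", [pattern1, pattern2, pattern3])

doc = nlp("The Solar Power industry continues to grow as demand "
          "for solarpower increases. Solar-power cars are gaining popularity.")

# as_spans=True returns Span objects whose label is the rule name
spans = matcher(doc, as_spans=True)
summary = sorted((span.start, span.end, span.label_, span.text) for span in spans)
for row in summary:
    print(row)
```

Each span carries the rule name in `span.label_`, so no separate lookup in `nlp.vocab.strings` is needed.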

In [9]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text)

8656102463236116519 SolarPower 1 3 Solar Power
8656102463236116519 SolarPower 10 11 solarpower
8656102463236116519 SolarPower 13 16 Solar-power


### Code explanation
`for match_id, start, end in found_matches:` loops over the matches returned by the matcher, where:

* `match_id` is the hash that identifies the matched pattern name.
* `start` is the index of the first token of the matched span in the doc.
* `end` is the index of the token immediately after the last token of the matched span.

`string_id = nlp.vocab.strings[match_id]` looks up the integer `match_id` in `nlp.vocab.strings`, spaCy's string store, to recover the name you gave the pattern when adding it to the matcher.

`span = doc[start:end]` creates a span holding the tokens from `start` up to, but not including, `end`. This is the part of the document where the pattern was found.

`print(match_id, string_id, start, end, span.text)` prints:

* `match_id`: the integer ID of the match.
* `string_id`: the pattern name (retrieved from `nlp.vocab.strings`).
* `start`: the starting token index of the matched span.
* `end`: the ending token index (exclusive) of the matched span.
* `span.text`: the text of the matched span.

### Example scenario
Suppose we have the sentence "The quick brown fox jumps over the lazy dog." and a matcher that looks for the phrase "quick brown fox".

First, define a pattern and add it to the matcher:

    pattern = [{"LOWER": "quick"}, {"LOWER": "brown"}, {"LOWER": "fox"}]
    matcher.add("QUICK_BROWN_FOX_PATTERN", [pattern])

Next, process the document and apply the matcher:

    doc = nlp("The quick brown fox jumps over the lazy dog.")
    found_matches = matcher(doc)

Then run the loop:

    for match_id, start, end in found_matches:
        string_id = nlp.vocab.strings[match_id]  # get string representation
        span = doc[start:end]                    # get the matched span
        print(match_id, string_id, start, end, span.text)

The output would look like this (the ID is a placeholder):

    1234567890123456789 QUICK_BROWN_FOX_PATTERN 1 4 quick brown fox

* `1234567890123456789`: the internal match ID (the hash of the pattern name).
* `QUICK_BROWN_FOX_PATTERN`: the name of the pattern that was matched.
* `1`: the starting index of the match (the token "quick" is at index 1).
* `4`: the ending index of the match (the index after "fox", so the match covers tokens 1 to 3).
* `quick brown fox`: the text of the matched span.

In summary, the loop prints the internal match ID, the pattern name, the start and end token indices, and the matched text, which shows which patterns matched and where they occurred.

The `match_id` is simply the hash value of the `string_id` 'SolarPower'.
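
A short sketch can make the relationship between the name and its hash concrete. It assumes a blank English pipeline and a single one-token pattern; the variable names are illustrative only.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
matcher = Matcher(nlp.vocab)

name = "SolarPower"
matcher.add(name, [[{"LOWER": "solarpower"}]])

doc = nlp("Demand for solarpower increases.")
match_id, start, end = matcher(doc)[0]

# Looking up a string gives its hash; looking up the hash gives the string back
hash_from_name = nlp.vocab.strings[name]
name_from_hash = nlp.vocab.strings[match_id]

print(match_id == hash_from_name, name_from_hash)
```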

### Setting pattern options and quantifiers
You can make token rules optional by passing an `'OP':'*'` argument. This lets us streamline our patterns list:

In [None]:
# Redefine the patterns:
pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}]

# Remove the old patterns to avoid duplication:
matcher.remove('SolarPower')

# Add the new set of patterns to the 'SolarPower' matcher:
matcher.add('SolarPower', [pattern1, pattern2])

In [None]:
found_matches = matcher(doc)
print(found_matches)

[(8656102463236116519, 1, 3), (8656102463236116519, 10, 11), (8656102463236116519, 13, 16)]


This found both two-word patterns, with and without the hyphen!

The following quantifiers can be passed to the `'OP'` key:
<table><tr><th>OP</th><th>Description</th></tr>

<tr ><td><span >\!</span></td><td>Negate the pattern, by requiring it to match exactly 0 times</td></tr>
<tr ><td><span >?</span></td><td>Make the pattern optional, by allowing it to match 0 or 1 times</td></tr>
<tr ><td><span >\+</span></td><td>Require the pattern to match 1 or more times</td></tr>
<tr ><td><span >\*</span></td><td>Allow the pattern to match zero or more times</td></tr>
</table>
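
To see one of the other quantifiers in action, the sketch below swaps `'*'` for `'?'`, so the punctuation token may appear at most once. It assumes a blank English pipeline, and the sample sentence is made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
matcher = Matcher(nlp.vocab)

# '?' makes the punctuation token optional: zero or one occurrence
pattern = [{"LOWER": "solar"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "power"}]
matcher.add("SolarPower", [pattern])

doc = nlp("Solar power and solar-power are the same thing.")
found = sorted((start, end, doc[start:end].text) for _, start, end in matcher(doc))
print(found)
```

Open-ended quantifiers such as `'+'` and `'*'` can produce several overlapping matches for the same stretch of text, so it is worth inspecting the raw results when you use them.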


### Be careful with lemmas!
If we wanted to match on both 'solar power' and 'solar powered', it might be tempting to look for the *lemma* of 'powered' and expect it to be 'power'. This is not always the case! The lemma of the *adjective* 'powered' is still 'powered':

In [None]:
pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LEMMA': 'power'}] # CHANGE THIS PATTERN

# Remove the old patterns to avoid duplication:
matcher.remove('SolarPower')

# Add the new set of patterns to the 'SolarPower' matcher:
matcher.add('SolarPower', [pattern1, pattern2])

In [None]:
doc2 = nlp(u'Solar-powered energy runs solar-powered cars.')

In [None]:
found_matches = matcher(doc2)
print(found_matches)

[(8656102463236116519, 0, 3)]


<font color=green>The matcher found the first occurrence because the lemmatizer treated 'Solar-powered' as a verb, but not the second as it considered it an adjective.<br>For this case it may be better to set explicit token patterns.</font>

In [None]:
pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}]
pattern3 = [{'LOWER': 'solarpowered'}]
pattern4 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'powered'}]

# Remove the old patterns to avoid duplication:
matcher.remove('SolarPower')

# Add the new set of patterns to the 'SolarPower' matcher:
matcher.add('SolarPower', [pattern1, pattern2, pattern3, pattern4])

In [None]:
found_matches = matcher(doc2)
print(found_matches)

[(8656102463236116519, 0, 3), (8656102463236116519, 5, 8)]


## Other token attributes
Besides lemmas, there are a variety of token attributes we can use to determine matching rules:
<table><tr><th>Attribute</th><th>Description</th></tr>

<tr ><td><span >`ORTH`</span></td><td>The exact verbatim text of a token</td></tr>
<tr ><td><span >`LOWER`</span></td><td>The lowercase form of the token text</td></tr>
<tr ><td><span >`LENGTH`</span></td><td>The length of the token text</td></tr>
<tr ><td><span >`IS_ALPHA`, `IS_ASCII`, `IS_DIGIT`</span></td><td>Token text consists of alphabetic characters, ASCII characters, digits</td></tr>
<tr ><td><span >`IS_LOWER`, `IS_UPPER`, `IS_TITLE`</span></td><td>Token text is in lowercase, uppercase, titlecase</td></tr>
<tr ><td><span >`IS_PUNCT`, `IS_SPACE`, `IS_STOP`</span></td><td>Token is punctuation, whitespace, stop word</td></tr>
<tr ><td><span >`LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL`</span></td><td>Token text resembles a number, URL, email</td></tr>
<tr ><td><span >`POS`, `TAG`, `DEP`, `LEMMA`, `SHAPE`</span></td><td>The token's simple and extended part-of-speech tag, dependency label, lemma, shape</td></tr>
<tr ><td><span >`ENT_TYPE`</span></td><td>The token's entity label</td></tr>

</table>

### Token wildcard
You can pass an empty dictionary `{}` as a wildcard to represent **any token**. For example, you might want to retrieve hashtags without knowing what might follow the `#` character:
>`[{'ORTH': '#'}, {}]`
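
A minimal sketch of the hashtag pattern, assuming a blank English pipeline (the tokenizer splits `#` from the word that follows it). The sample sentence and the rule name `Hashtag` are made up.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
matcher = Matcher(nlp.vocab)

# '#' followed by any single token
matcher.add("Hashtag", [[{"ORTH": "#"}, {}]])

doc = nlp("Loving #nlp and #spacy today")
hashtags = [doc[start:end].text for _, start, end in matcher(doc)]
print(hashtags)
```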

___
## PhraseMatcher
In the above section we used token patterns to perform rule-based matching. An alternative, and often more efficient, method is to match on terminology lists. In this case we convert each phrase in a list to a Doc object and pass those Docs into a `PhraseMatcher` instead.

In [None]:
# Perform standard imports, reset nlp
import spacy
nlp = spacy.load('en_core_web_sm')

In [None]:
# Import the PhraseMatcher library
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

For this exercise we're going to import a Wikipedia article on *Reaganomics*<br>
Source: https://en.wikipedia.org/wiki/Reaganomics

In [None]:
with open('../TextFiles/reaganomics.txt', encoding='utf8') as f:
    doc3 = nlp(f.read())

In [None]:
# First, create a list of match phrases:
phrase_list = ['voodoo economics', 'supply-side economics', 'trickle-down economics', 'free-market economics']

# Next, convert each phrase to a Doc object:
phrase_patterns = [nlp(text) for text in phrase_list]

# Pass the list of Doc objects into matcher:
matcher.add('VoodooEconomics', phrase_patterns)

# Build a list of matches:
matches = matcher(doc3)

In [None]:
# (match_id, start, end)
matches

[(3473369816841043438, 41, 45),
 (3473369816841043438, 49, 53),
 (3473369816841043438, 54, 56),
 (3473369816841043438, 61, 65),
 (3473369816841043438, 673, 677),
 (3473369816841043438, 2985, 2989)]

<font color=green>The first four matches are where these terms are used in the definition of Reaganomics:</font>

In [None]:
doc3[:70]

REAGANOMICS
https://en.wikipedia.org/wiki/Reaganomics

Reaganomics (a portmanteau of [Ronald] Reagan and economics attributed to Paul Harvey)[1] refers to the economic policies promoted by U.S. President Ronald Reagan during the 1980s. These policies are commonly associated with supply-side economics, referred to as trickle-down economics or voodoo economics by political opponents, and free-market economics by political advocates.
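
If the text file is not at hand, the same steps can be tried on a short in-memory string. This is only a sketch: the sentence is made up, the rule name `EconTerms` is hypothetical, and a blank English pipeline is assumed. `nlp.make_doc` is used to build the phrase Docs because only tokenization is needed for them.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
matcher = PhraseMatcher(nlp.vocab)

phrase_list = ["voodoo economics", "supply-side economics"]
phrase_patterns = [nlp.make_doc(text) for text in phrase_list]  # tokenize only
matcher.add("EconTerms", phrase_patterns)

# Made-up sample sentence
doc = nlp("Critics called it voodoo economics, while advocates "
          "preferred supply-side economics.")
phrases_found = sorted((start, end, doc[start:end].text)
                       for _, start, end in matcher(doc))
print(phrases_found)
```

By default `PhraseMatcher` compares the verbatim token text, so 'Voodoo Economics' with capitals would not match these patterns.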


## Viewing Matches
There are a few ways to fetch the text surrounding a match. The simplest is to grab a slice of tokens from the doc that is wider than the match:

In [None]:
doc3[665:685]  # Note that the fifth match starts at doc3[673]

same time he attracted a following from the supply-side economics movement, which formed in opposition to Keynesian

In [None]:
doc3[2975:2995]  # The sixth match starts at doc3[2985]

against institutions.[66] His policies became widely known as "trickle-down economics", due to the significant
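
The slicing approach can be wrapped in a small helper so the window size is explicit and the indices never run past the ends of the doc. The function name `context_window` and its `window` parameter are hypothetical, and the sentence is a fragment of the output shown above, tokenized with a blank English pipeline.

```python
import spacy
from spacy.matcher import PhraseMatcher

def context_window(doc, start, end, window=2):
    """Return the match plus `window` tokens on each side, clamped to the doc."""
    left = max(start - window, 0)
    right = min(end + window, len(doc))
    return doc[left:right]

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
matcher = PhraseMatcher(nlp.vocab)
matcher.add("EconTerms", [nlp.make_doc("supply-side economics")])

doc = nlp("At the same time he attracted a following from "
          "the supply-side economics movement")
_, start, end = matcher(doc)[0]
snippet = context_window(doc, start, end, window=2).text
print(snippet)
```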

Another way is to use the sentence boundaries that the pipeline has already set on the Doc (available through `doc3.sents`), then iterate through the sentences to the match point:

In [None]:
# Build a list of sentences
sents = [sent for sent in doc3.sents]

# In the next section we'll see that sentences contain start and end token values:
print(sents[0].start, sents[0].end)

0 35


In [None]:
# Iterate over the sentence list until the sentence end value exceeds a match start value:
for sent in sents:
    if matches[4][1] < sent.end:  # this is the fifth match, that starts at doc3[673]
        print(sent)
        break

At the same time he attracted a following from the supply-side economics movement, which formed in opposition to Keynesian demand-stimulus economics.
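
A span also knows which sentence it belongs to, so the loop can be replaced by `span.sent` when sentence boundaries are available. The sketch below assumes a blank English pipeline with the rule-based `sentencizer` component added to supply those boundaries; the three-sentence text is made up.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # assumption: rule-based sentence boundaries suffice

matcher = PhraseMatcher(nlp.vocab)
matcher.add("EconTerms", [nlp.make_doc("voodoo economics")])

# Made-up sample text
doc = nlp("Taxes were cut. Critics called it voodoo economics. Advocates disagreed.")

sentences = []
for _, start, end in matcher(doc):
    span = doc[start:end]
    sentences.append(span.sent.text)  # the sentence containing the match
print(sentences)
```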


For additional information visit https://spacy.io/usage/linguistic-features#section-rule-based-matching
## Next Up: NLP Basics Assessment