___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

# Vocabulary and Matching
So far we've seen how a body of text is divided into tokens, and how individual tokens are parsed and tagged with parts of speech, dependencies and lemmas.

In this section we will identify and label specific phrases that match patterns we can define ourselves.

## Rule-based Matching
spaCy offers a rule-matching tool called `Matcher` that allows you to build a library of token patterns, then match those patterns against a Doc object to return a list of found matches. You can match on any part of the token including text and annotations, and you can add multiple patterns to the same matcher.

In [3]:
# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')

In [4]:
# Import the Matcher library
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

# Importing Matcher
In this code snippet, you're importing the Matcher from spaCy and creating a matcher object using your NLP model's vocabulary (nlp.vocab).

-----------------------------
#Explanation

##from spacy.matcher import Matcher:

This imports the Matcher class from spaCy, which allows you to create rule-based pattern matching for identifying specific sequences of tokens (e.g., phrases, names, or other token patterns) in a text.

##matcher = Matcher(nlp.vocab):

This creates a Matcher object and ties it to the vocabulary (nlp.vocab) of the current NLP model (nlp).

The nlp.vocab object contains all the lexical information (like words and their attributes) used by the NLP model.

Once you create the matcher object, you can define specific patterns and use it to find these patterns in the text processed by the nlp object.

-------------------------------
----------------------------
#How the Matcher Works:
Matcher allows you to define patterns based on token attributes, such as the token's text, lemma (base form), part-of-speech tag, etc.

You can then apply the matcher to a Doc object (like doc from earlier) to find all occurrences of the patterns in the text.

##Example:
###Adding and Using Patterns
Let's say you want to find all occurrences of the phrase "quick brown fox" in a text. You can define a pattern for this sequence and apply the matcher.

---------------------------

    # Define a pattern
    pattern = [{"LOWER": "quick"}, {"LOWER": "brown"}, {"LOWER": "fox"}]

    # Add the pattern to the matcher
    matcher.add("QUICK_BROWN_FOX", [pattern])

    # Process the text, then apply the matcher to the resulting Doc
    doc = nlp("The quick brown fox jumped over the lazy dog's back.")
    matches = matcher(doc)

    # Print the matched spans
    for match_id, start, end in matches:
        span = doc[start:end]  # The matched span
        print(f"Matched: {span.text}")

-------------------------------
#Explanation of the Pattern:
The pattern is a list of dictionaries, where each dictionary corresponds to a token in the sequence:

{"LOWER": "quick"}: This looks for a token where the lowercase form is "quick".

{"LOWER": "brown"}: This looks for a token where the lowercase form is "brown".

{"LOWER": "fox"}: This looks for a token where the lowercase form is "fox".

matcher.add("QUICK_BROWN_FOX", [pattern]):

Adds the pattern to the matcher with the name "QUICK_BROWN_FOX".

The list around pattern allows you to add multiple patterns under the same name.


##matches = matcher(doc):

Applies the matcher to the doc object to find all occurrences of the pattern.

matches is a list of tuples in the format (match_id, start, end) where:

match_id: The hash ID of the matched pattern's name.

start: The starting token index of the match (inclusive).

end: The token index just after the last matched token (exclusive).

span = doc[start:end]:

Extracts the matched text from the doc using the start and end indices.

--------------------
#Output:
For the sentence "The quick brown fox jumped over the lazy dog's back.", the output would be:

Matched: quick brown fox
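
To see where the start and end indices come from, the sketch below prints every token with its index before running the matcher. It assumes a blank English pipeline (`spacy.blank("en")`, tokenizer only) in place of `en_core_web_sm`, since this pattern relies only on lowercase token text; the rest mirrors the snippets above.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a tokenizer-only pipeline is enough here, so no model download is needed
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

pattern = [{"LOWER": "quick"}, {"LOWER": "brown"}, {"LOWER": "fox"}]
matcher.add("QUICK_BROWN_FOX", [pattern])

doc = nlp("The quick brown fox jumped over the lazy dog's back.")

# Print each token with its index so start/end can be read off directly
for token in doc:
    print(token.i, token.text)

matches = matcher(doc)
for match_id, start, end in matches:
    # end is exclusive, so the last matched token sits at end - 1
    print(f"Matched: {doc[start:end].text} (tokens {start} to {end - 1})")
```

Because `end` is exclusive, `doc[start:end]` behaves like ordinary Python slicing.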


<font color=green>Here `matcher` is an object that pairs to the current `Vocab` object. We can add and remove specific named matchers to `matcher` as needed.</font>

### Creating patterns
In literature, the phrase 'solar power' might appear as one word or two, with or without a hyphen. In this section we'll develop a matcher named 'SolarPower' that finds all three:

In [5]:
# Import the Matcher library
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'LOWER': 'power'}]
pattern3 = [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}]

# Pass a list of patterns as the second argument
matcher.add('SolarPower', [pattern1, pattern2, pattern3])

Let's break this down:
* `pattern1` looks for a single token whose lowercase text reads 'solarpower'
* `pattern2` looks for two adjacent tokens that read 'solar' and 'power' in that order
* `pattern3` looks for three adjacent tokens, with a middle token that can be any punctuation.<font color=green>*</font>

<font color=green>\* Remember that single spaces are not tokenized, so they don't count as punctuation.</font>
<br>Once we define our patterns, we pass them into `matcher` as a list under the name 'SolarPower'. An optional *callback* can be supplied through the `on_match` argument (more on callbacks later).
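
As a preview of callbacks, here is a minimal sketch of passing a function through `on_match`. The function name `collect_match` and the list `found_texts` are hypothetical names chosen for illustration, and a blank English pipeline is assumed in place of `en_core_web_sm`.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
matcher = Matcher(nlp.vocab)

found_texts = []  # hypothetical collector that the callback fills

def collect_match(matcher, doc, i, matches):
    """Called once per match; i is the position of this match in matches."""
    match_id, start, end = matches[i]
    found_texts.append(doc[start:end].text)

patterns = [
    [{"LOWER": "solarpower"}],
    [{"LOWER": "solar"}, {"LOWER": "power"}],
    [{"LOWER": "solar"}, {"IS_PUNCT": True}, {"LOWER": "power"}],
]
matcher.add("SolarPower", patterns, on_match=collect_match)

doc = nlp("The Solar Power industry continues to grow as demand "
          "for solarpower increases. Solar-power cars are gaining popularity.")
matcher(doc)  # running the matcher triggers the callback for each match
print(found_texts)
```

Leaving out `on_match` gives the same matches without the side effect, which is what the rest of this section does.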

### Applying the matcher to a Doc object

In [6]:
doc = nlp(u'The Solar Power industry continues to grow as demand \
for solarpower increases. Solar-power cars are gaining popularity.')
print(doc)

The Solar Power industry continues to grow as demand for solarpower increases. Solar-power cars are gaining popularity.


In [7]:
found_matches = matcher(doc)
print(found_matches)

[(8656102463236116519, 1, 3), (8656102463236116519, 10, 11), (8656102463236116519, 13, 16)]


# Applying Matcher

-------------------
##found_matches = matcher(doc):

matcher(doc) applies the matcher object (which contains pre-defined patterns) to the doc.

The doc is a processed text, typically created by passing a string of text through nlp (the spaCy language model).

This line returns a list of matches found in the document based on the patterns you defined earlier using the Matcher object.

------------------------------
##print(found_matches):

This prints the matches that were found in the doc.

The matches are returned in the form of a list of tuples.

----------------------
Each tuple in the list contains three values:

##match_id:
The integer hash of the name of the pattern that was matched.

##start:

The index of the token where the match starts (inclusive).

##end:

The index of the token where the match ends (exclusive).

-------------------------
---------------------------
#Example of Output
The output of found_matches is typically a list of tuples that look like this:

    [(match_id, start, end), (match_id, start, end), ...]

For example:

    [(1234567890123456789, 2, 5), (9876543210987654321, 7, 9)]

Here’s what each value in the tuple represents:

##match_id:

This is the integer hash of the name you gave the pattern when adding it to the Matcher.

spaCy stores names as hashes internally, so all patterns added under the same name share one match_id.

##start and end:

These are the indices of the tokens in the doc where the match occurred.

In this case:
The first match covers tokens 2, 3 and 4 (start 2, end 5, with the end index excluded).

The second match covers tokens 7 and 8 (start 7, end 9).

Extracting and Printing the Matched Text:
You can extract the matched span from the document using the start and end indices.

------------------------
Here’s an example of how to do this:

    for match_id, start, end in found_matches:
       matched_span = doc[start:end]  # Get the span of matched tokens
       print(f"Match found: {matched_span.text}")

This code will print the actual text that was matched in the document.

-----------------------
##Example Scenario
Let’s assume you have a document "The quick brown fox jumps over the lazy dog.", and you added a pattern to the matcher to detect the phrase "quick brown fox".

    pattern = [{"LOWER": "quick"}, {"LOWER": "brown"}, {"LOWER": "fox"}]
    matcher.add("ANIMAL_ACTION", [pattern])
    found_matches = matcher(doc)
    print(found_matches)

------------------
##Possible Output:

    [(1234567890123456789, 1, 4)]

This output tells you:

The match ID (1234567890123456789) corresponds to the pattern "quick brown fox".

The matched span starts at index 1 (for "quick") and ends at index 4 (the token after "fox").

You can print the actual matched text:

    for match_id, start, end in found_matches:
       print(f"Matched: {doc[start:end].text}")

-----------------
##Output:

Matched: quick brown fox

-----------------
#In Summary
matcher(doc) finds all patterns in the document doc.

found_matches is a list of tuples, where each tuple contains the match ID, start, and end token indices of the matched span.

You can use the start and end indices to extract the matched span from the doc and print the corresponding text.

`matcher` returns a list of tuples. Each tuple contains an ID for the match, with start & end tokens that map to the span `doc[start:end]`

In [8]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text)

8656102463236116519 SolarPower 1 3 Solar Power
8656102463236116519 SolarPower 10 11 solarpower
8656102463236116519 SolarPower 13 16 Solar-power


# Analyzing the results of pattern matching
    for match_id, start, end in found_matches:
       string_id = nlp.vocab.strings[match_id]  # get string representation
       span = doc[start:end]                    # get the matched span
       print(match_id, string_id, start, end, span.text)

----------------------
##for match_id, start, end in found_matches:

This loops over all the matches returned by the matcher, where:

##match_id:

The integer hash of the matched pattern's name.

##start:

The index of the first token of the matched span in the doc.

##end:

The index of the token immediately after the last token in the matched span.

##string_id = nlp.vocab.strings[match_id]:

nlp.vocab.strings is spaCy's StringStore, a two-way lookup table between strings and their hash values. Indexing it with the integer match_id returns the string form (the name you gave the pattern when adding it to the matcher).

This step retrieves the human-readable string name for the matched pattern.

##span = doc[start:end]:

doc[start:end] creates a span from the doc that runs from the token at start up to, but not including, the token at end.

This span represents the portion of the document where the pattern was found.

##print(match_id, string_id, start, end, span.text):

-----------------
##This prints the following:
match_id: The integer ID of the match.

string_id: The string name corresponding to the match (retrieved from nlp.vocab.strings).

start: The starting token index of the matched span.

end: The ending token index (exclusive) of the matched span.

span.text: The actual text of the matched span.

------------------------
#Example Scenario:
Let’s assume we have a document with the sentence "The quick brown fox jumps over the lazy dog." and a matcher that looks for the pattern "quick brown fox".

Here’s what might happen step-by-step:

##1. You define a pattern and add it to the matcher:

    pattern = [{"LOWER": "quick"}, {"LOWER": "brown"}, {"LOWER": "fox"}]
    matcher.add("QUICK_BROWN_FOX_PATTERN", [pattern])

##2. You process the document and apply the matcher:

    doc = nlp("The quick brown fox jumps over the lazy dog.")

    found_matches = matcher(doc)

##3. Now, when you run the loop:

    for match_id, start, end in found_matches:
        string_id = nlp.vocab.strings[match_id]  # get string representation
        span = doc[start:end]                    # get the matched span
        print(match_id, string_id, start, end, span.text)

##4. Expected Output:
The output would look like this (depending on your pattern match):

1234567890123456789 QUICK_BROWN_FOX_PATTERN 1 4 quick brown fox

Here’s what each part of the output means:

##1234567890123456789:

The internal match ID (a hash or integer representing the pattern).

##QUICK_BROWN_FOX_PATTERN:

The string name associated with the pattern that was matched.

1: The starting index of the match (the token "quick" is at index 1).

4: The ending index of the match (this is the index after "fox", so the match spans tokens 1 to 3).

quick brown fox: The text of the matched span from the document.

-------------------
#Summary
This loop extracts and prints details about each matched pattern:

* The internal match ID.
* The name of the pattern (e.g., "QUICK_BROWN_FOX_PATTERN").
* The starting and ending token indices.
* The actual matched text from the document.

This code is useful for analyzing the results of pattern matching and understanding which patterns were matched and where they occurred in the text.

The `match_id` is simply the hash value of the `string_ID` 'SolarPower'
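
The relationship between the name and its hash can be checked directly. This is a small sketch, assuming a blank English pipeline and a one-pattern matcher; the sentence is made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
matcher = Matcher(nlp.vocab)
matcher.add("SolarPower", [[{"LOWER": "solarpower"}]])

doc = nlp("Demand for solarpower increases.")
match_id, start, end = matcher(doc)[0]

name_hash = nlp.vocab.strings["SolarPower"]  # string -> hash
name_text = nlp.vocab.strings[match_id]      # hash -> string

print(match_id == name_hash, name_text)
```

The lookup works in both directions because the `StringStore` maps each string to its hash and each stored hash back to its string.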

### Setting pattern options and quantifiers
You can make token rules optional by passing an `'OP':'*'` argument. This lets us streamline our patterns list:

In [13]:
import spacy

# Load the spaCy English language model
nlp = spacy.load("en_core_web_sm")

# Initialize the matcher with the shared vocabulary
matcher = spacy.matcher.Matcher(nlp.vocab)

# Define the patterns:
pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}]

# Add the set of patterns to the 'SolarPower' matcher as a single list:
matcher.add('SolarPower', [pattern1, pattern2])

# Explain pattern2
    pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}]

This line of code defines a pattern for the spaCy matcher. Let's break down what each part means:

--------------------------
##pattern2 = [...]:

This assigns a list to the variable pattern2.

This list contains the pattern to be matched.

---------------------
##{'LOWER': 'solar'}:

This is the first item in the pattern.

It's a dictionary specifying that the matcher should look for a token whose lowercase form is "solar".

-------------------------
##{'IS_PUNCT': True, 'OP':'*'}:

This is the second item.

It's also a dictionary, and it specifies that the matcher should look for any punctuation mark.

The 'OP': '*' part means that this token can occur zero or more times.

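This allows for cases like "solar power" (no punctuation token) and "solar-power" (one hyphen token). The single token "solarpower" is not covered by this pattern, which is why pattern1 is still needed.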

----------------------
##{'LOWER': 'power'}:

This is the third item, and it's similar to the first. It specifies that the matcher should look for a token whose lowercase form is "power".

-----------------------
#In summary
This pattern will match any occurrence of "solar" followed by zero or more punctuation marks and then "power".

This allows for flexibility in how the words "solar" and "power" are combined, such as with or without hyphens or other punctuation.
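
The division of labour between the two patterns can be sketched as follows. The helper `matched_texts` and the sample sentence are hypothetical, and a blank English pipeline is assumed in place of `en_core_web_sm`.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
doc = nlp("Solar power, solar-power and solarpower.")

pattern1 = [{"LOWER": "solarpower"}]
pattern2 = [{"LOWER": "solar"}, {"IS_PUNCT": True, "OP": "*"}, {"LOWER": "power"}]

def matched_texts(patterns):
    """Build a fresh matcher from the given patterns and return matched texts in document order."""
    matcher = Matcher(nlp.vocab)
    matcher.add("SolarPower", patterns)
    matches = sorted(matcher(doc), key=lambda m: m[1])
    return [doc[start:end].text for _, start, end in matches]

only_pattern2 = matched_texts([pattern2])
both_patterns = matched_texts([pattern1, pattern2])

print(only_pattern2)
print(both_patterns)
```

With `pattern2` alone the one-word spelling is missed; adding `pattern1` picks it up.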

In [14]:
# Redefine the patterns:
pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}]

# Remove the old patterns to avoid duplication:
matcher.remove('SolarPower')

# Add the new set of patterns to the 'SolarPower' matcher:
matcher.add('SolarPower', [pattern1, pattern2]) # Pass a list of patterns as the second argument.

In [15]:
found_matches = matcher(doc)
print(found_matches)

[(8656102463236116519, 1, 3), (8656102463236116519, 10, 11), (8656102463236116519, 13, 16)]


This found both two-word patterns, with and without the hyphen!

The following quantifiers can be passed to the `'OP'` key:
<table><tr><th>OP</th><th>Description</th></tr>

<tr ><td><span >\!</span></td><td>Negate the pattern, by requiring it to match exactly 0 times</td></tr>
<tr ><td><span >?</span></td><td>Make the pattern optional, by allowing it to match 0 or 1 times</td></tr>
<tr ><td><span >\+</span></td><td>Require the pattern to match 1 or more times</td></tr>
<tr ><td><span >\*</span></td><td>Allow the pattern to match zero or more times</td></tr>
</table>
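
To compare the quantifiers side by side, the sketch below applies the same three-token rule with `?`, `+` and `*` to one made-up sentence. The helper `spans_for` is a hypothetical name, and a blank English pipeline is assumed.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline

# Zero, one and two punctuation tokens between 'solar' and 'power'
doc = nlp("solar power, solar-power, solar - - power")

def spans_for(op):
    """Return sorted (start, end) pairs for the pattern with the given quantifier."""
    matcher = Matcher(nlp.vocab)
    pattern = [{"LOWER": "solar"}, {"IS_PUNCT": True, "OP": op}, {"LOWER": "power"}]
    matcher.add("SolarPower", [pattern])
    return sorted((start, end) for _, start, end in matcher(doc))

optional = spans_for("?")      # 0 or 1 punctuation tokens
one_or_more = spans_for("+")   # 1 or more punctuation tokens
zero_or_more = spans_for("*")  # 0 or more punctuation tokens

for op, spans in [("?", optional), ("+", one_or_more), ("*", zero_or_more)]:
    print(op, [doc[s:e].text for s, e in spans])
```

Only `*` covers all three spellings: `?` stops at one punctuation token and `+` requires at least one.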


### Be careful with lemmas!
If we wanted to match on both 'solar power' and 'solar powered', it might be tempting to look for the *lemma* of 'powered' and expect it to be 'power'. This is not always the case! The lemma of the *adjective* 'powered' is still 'powered':

In [18]:
import spacy

# Load the spaCy English language model
nlp = spacy.load("en_core_web_sm")

# Initialize the matcher with the shared vocab
matcher = spacy.matcher.Matcher(nlp.vocab)

pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LEMMA': 'power'}]

# Add the new set of patterns to the 'SolarPower' matcher:
matcher.add('SolarPower', [pattern1, pattern2])

# Create a sample document
doc = nlp("Solarpower is awesome! The solar-powered car is eco-friendly.")

# Match the patterns in the document
found_matches = matcher(doc)
print(found_matches)

[(8656102463236116519, 0, 1), (8656102463236116519, 5, 8)]


In [19]:
doc2 = nlp(u'Solar-powered energy runs solar-powered cars.')

In [20]:
found_matches = matcher(doc2)
print(found_matches)

[(8656102463236116519, 0, 3), (8656102463236116519, 5, 8)]


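<font color=green>In this run the matcher found both occurrences, because the lemmatizer reduced 'powered' to 'power' each time. That is not guaranteed: the lemma depends on the part of speech the model assigns, and when 'powered' is tagged as an adjective its lemma stays 'powered' and the match is missed.<br>For this case it may be better to set explicit token patterns.</font>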

In [24]:
pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}]
pattern3 = [{'LOWER': 'solarpowered'}]
pattern4 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'powered'}]

# Remove the old patterns, if any are registered, to avoid duplication:
if 'SolarPower' in matcher:
    matcher.remove('SolarPower')

# Add the new set of patterns to the 'SolarPower' matcher:
matcher.add('SolarPower', [pattern1, pattern2, pattern3, pattern4]) # Pass all patterns as a single list

In [26]:
import spacy

# Load the spaCy English language model
nlp = spacy.load("en_core_web_sm")

# Initialize the matcher with the shared vocab
matcher = spacy.matcher.Matcher(nlp.vocab)

pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}]
pattern3 = [{'LOWER': 'solarpowered'}]
pattern4 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'powered'}]

# Add the new set of patterns to the 'SolarPower' matcher:
matcher.add('SolarPower', [pattern1, pattern2, pattern3, pattern4]) # Pass all patterns as a single list

# Explanation:
-----------------------
##matcher.add(...):

##matcher:
This is the Matcher object that you created earlier using Matcher(nlp.vocab).

It is used to define and find token-based patterns in the text.

##.add(...):

This method adds a new pattern or set of patterns to the Matcher object and associates them with a name ('SolarPower' in the cell above, "ANIMAL_PATTERN" in the example that follows), which identifies the matched patterns later.

-------------------------
#"ANIMAL_PATTERN" (Pattern Name):

The first argument "ANIMAL_PATTERN" is a string that serves as a label or name for this particular pattern.

When the pattern is matched in the document, this name will be associated with the match.

You can give any name that makes sense for the pattern you are adding (for example, "ANIMAL_PATTERN" to indicate that the pattern matches animal-related phrases).

##[pattern] (Pattern List):

The second argument is a list of patterns that you want to match.

Each pattern is a list of dictionaries, where each dictionary describes one token's attributes (like LOWER, POS, LEMMA, etc.).

In this case, [pattern] means you are adding a single pattern.

If you have multiple patterns, you can add them all at once by passing a list of them.

----------------------------
#Example of the Pattern:
Let’s look at an example pattern and the call to matcher.add in context.

    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load("en_core_web_sm")
    matcher = Matcher(nlp.vocab)

    # Define a pattern that looks for "quick" followed by "brown" followed by "fox"
    pattern = [{"LOWER": "quick"}, {"LOWER": "brown"}, {"LOWER": "fox"}]

    # Add the pattern to the matcher with the name "ANIMAL_PATTERN"
    matcher.add("ANIMAL_PATTERN", [pattern])

    # Now apply the matcher to a document
    doc = nlp("The quick brown fox jumps over the lazy dog.")
    matches = matcher(doc)

    for match_id, start, end in matches:
        matched_span = doc[start:end]
        print(f"Match ID: {match_id}, Pattern Name: {nlp.vocab.strings[match_id]}, Matched Text: {matched_span.text}")

---------------------------
#Detailed Breakdown:
    pattern = [{"LOWER": "quick"}, {"LOWER": "brown"}, {"LOWER": "fox"}]

This is the pattern being added to the matcher.

It is a list of three dictionaries, where each dictionary describes one token.

------------
    {"LOWER": "quick"}

This specifies that the lowercase form of the first token should be "quick".

{"LOWER": "brown"}: The lowercase form of the second token should be "brown".

{"LOWER": "fox"}: The lowercase form of the third token should be "fox".

The pattern matches any sequence of tokens that reads "quick", "brown", "fox" in that order, regardless of capitalization.

matcher.add("ANIMAL_PATTERN", [pattern]):

The add method adds this pattern to the Matcher under the name "ANIMAL_PATTERN".

You can now search for this pattern in any document processed by nlp.

---------------------------
#Output Example:
If the document "The quick brown fox jumps over the lazy dog." is processed, the matcher will find a match because the sequence "quick brown fox" exists in the document. The output would be:

Match ID: 1234567890123456789, Pattern Name: ANIMAL_PATTERN, Matched Text: quick brown fox

-----------------
#What Happens in the Code:
You define a pattern that looks for a specific sequence of tokens (in this case, "quick", "brown", "fox").

You add this pattern to the Matcher under the name "ANIMAL_PATTERN".

You apply the matcher to a doc object (the document).

If the pattern is found in the document, it returns a match. You can then print the matched text and the name of the pattern that matched it.

----------
#Summary
matcher.add("ANIMAL_PATTERN", [pattern]): This line adds a pattern to the Matcher object with the label "ANIMAL_PATTERN".

The pattern is a list of dictionaries, where each dictionary defines the properties of one token (such as its text, part of speech, etc.).

Once the pattern is added, you can use the matcher to find occurrences of that pattern in any document and retrieve the matched spans.


-------------------------
#Note:

    # Add the pattern to the matcher with the name "ANIMAL_PATTERN"
    matcher.add("ANIMAL_PATTERN", [pattern])

-----------------------------
#1. The Pattern Definition:

    pattern = [{"LOWER": "quick"}, {"LOWER": "brown"}, {"LOWER": "fox"}]

Here, we are defining a single pattern.

This pattern is a list of dictionaries, where each dictionary specifies the attributes of a single token.

In this example:

{"LOWER": "quick"}: This dictionary specifies that the lowercase form of the first token should be "quick", so "Quick" and "QUICK" match too.

{"LOWER": "brown"}: The lowercase form of the second token should be "brown".

{"LOWER": "fox"}: The lowercase form of the third token should be "fox".

This structure allows you to match the sequence of tokens "quick brown fox" in a document, whatever its capitalization.

-----------------------
#2. Adding the Pattern to the Matcher:
    matcher.add("ANIMAL_PATTERN", [pattern])

Now, you are adding this pattern to the Matcher object with the label "ANIMAL_PATTERN". Here’s why the syntax has two levels:

"ANIMAL_PATTERN": This is the label or name you assign to the pattern. It helps identify the pattern when a match is found.

[pattern]: This is where the list of patterns comes in.

------------------------
#Why do we need a list here?
The matcher.add method takes a list of patterns as its second argument.

This is because you might want to define multiple patterns that can match the same label.

For example, if you wanted the label "ANIMAL_PATTERN" to match both "quick brown fox" and "lazy dog", you could define multiple patterns like this:

    pattern1 = [{"LOWER": "quick"}, {"LOWER": "brown"}, {"LOWER": "fox"}]
    pattern2 = [{"LOWER": "lazy"}, {"LOWER": "dog"}]

    # Add both patterns to the matcher with the same label
    matcher.add("ANIMAL_PATTERN", [pattern1, pattern2])

In this case, matcher.add("ANIMAL_PATTERN", [pattern1, pattern2]) adds both patterns under the same label "ANIMAL_PATTERN".

So, the matcher will search for both "quick brown fox" and "lazy dog" in the text.
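
A runnable sketch of two patterns sharing one label is shown here. It assumes a blank English pipeline (`spacy.blank("en")`) in place of `en_core_web_sm`, which is sufficient because both patterns use only `LOWER`.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
matcher = Matcher(nlp.vocab)

pattern1 = [{"LOWER": "quick"}, {"LOWER": "brown"}, {"LOWER": "fox"}]
pattern2 = [{"LOWER": "lazy"}, {"LOWER": "dog"}]

# Both patterns are registered under the same label
matcher.add("ANIMAL_PATTERN", [pattern1, pattern2])

doc = nlp("The quick brown fox jumps over the lazy dog.")
matches = sorted(matcher(doc), key=lambda m: m[1])  # order by start index

animal_texts = [doc[start:end].text for _, start, end in matches]
label_ids = {match_id for match_id, _, _ in matches}

print(animal_texts)
print([nlp.vocab.strings[label_id] for label_id in label_ids])
```

Both matches carry the same `match_id`, so the label alone does not tell you which of the two patterns fired.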

#Simplifying:
    pattern = [{"LOWER": "quick"}, {"LOWER": "brown"}, {"LOWER": "fox"}]

This defines one pattern.

##[pattern]:

Since the matcher.add() expects a list of patterns, we pass the single pattern in a list ([pattern]).

If you had multiple patterns to match under the same label, you would define each pattern separately and pass them as a list:

    matcher.add("PATTERN_NAME", [pattern1, pattern2, pattern3])

-------------------
#Summary
pattern = [...]: Defines a single pattern as a list of token attributes (in this case, lowercase words).

matcher.add("LABEL", [pattern]): Adds a list of one or more patterns to the matcher, with "LABEL" as the name for those patterns.

[pattern]: Even if you have only one pattern, you need to wrap it in a list because matcher.add() expects a list of patterns.

If you have multiple patterns, you can pass a list like [pattern1, pattern2, ...] to match different token sequences under the same label.

In [27]:
found_matches = matcher(doc2)
print(found_matches)

[(8656102463236116519, 0, 3), (8656102463236116519, 5, 8)]


## Other token attributes
Besides lemmas, there are a variety of token attributes we can use to determine matching rules:
<table><tr><th>Attribute</th><th>Description</th></tr>

<tr ><td><span >`ORTH`</span></td><td>The exact verbatim text of a token</td></tr>
<tr ><td><span >`LOWER`</span></td><td>The lowercase form of the token text</td></tr>
<tr ><td><span >`LENGTH`</span></td><td>The length of the token text</td></tr>
<tr ><td><span >`IS_ALPHA`, `IS_ASCII`, `IS_DIGIT`</span></td><td>Token text consists of alphabetic characters, ASCII characters, digits</td></tr>
<tr ><td><span >`IS_LOWER`, `IS_UPPER`, `IS_TITLE`</span></td><td>Token text is in lowercase, uppercase, titlecase</td></tr>
<tr ><td><span >`IS_PUNCT`, `IS_SPACE`, `IS_STOP`</span></td><td>Token is punctuation, whitespace, stop word</td></tr>
<tr ><td><span >`LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL`</span></td><td>Token text resembles a number, URL, email</td></tr>
<tr ><td><span >`POS`, `TAG`, `DEP`, `LEMMA`, `SHAPE`</span></td><td>The token's simple and extended part-of-speech tag, dependency label, lemma, shape</td></tr>
<tr ><td><span >`ENT_TYPE`</span></td><td>The token's entity label</td></tr>

</table>
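
As an illustration of combining attributes from this table, the sketch below pairs `LIKE_NUM` with `LOWER` to find percentages. The label "PERCENTAGE" and the sentence are invented for the example, and a blank English pipeline is assumed, which is enough for these lexical attributes.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
matcher = Matcher(nlp.vocab)

# LIKE_NUM accepts digits as well as English number words; LOWER ignores capitalization
pattern = [{"LIKE_NUM": True}, {"LOWER": "percent"}]
matcher.add("PERCENTAGE", [pattern])

doc = nlp("Sales grew 25 percent and costs fell ten percent.")
matches = sorted(matcher(doc), key=lambda m: m[1])  # order by start index
percent_texts = [doc[start:end].text for _, start, end in matches]

print(percent_texts)
```

Attributes such as `POS`, `TAG`, `DEP`, `LEMMA` and `ENT_TYPE` need a trained pipeline like `en_core_web_sm`, because a blank pipeline does not set them.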

### Token wildcard
You can pass an empty dictionary `{}` as a wildcard to represent **any token**. For example, you might want to retrieve hashtags without knowing what might follow the `#` character:
>`[{'ORTH': '#'}, {}]`
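
The hashtag pattern can be tried out with a short sketch like this one. The label "HASHTAG" and the sample text are made up, and a blank English pipeline is assumed.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
matcher = Matcher(nlp.vocab)

# '#' is split off as its own token, and {} then accepts whatever token follows it
matcher.add("HASHTAG", [[{"ORTH": "#"}, {}]])

doc = nlp("Loving #spaCy and #NLP today")
matches = sorted(matcher(doc), key=lambda m: m[1])  # order by start index
hashtags = [doc[start:end].text for _, start, end in matches]

print(hashtags)
```

The wildcard places no condition on the second token, so a stricter rule such as `{'IS_ALPHA': True}` may be preferable when `#` can be followed by punctuation.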

___
## PhraseMatcher
In the above section we used token patterns to perform rule-based matching. An alternative - and often more efficient - method is to match on terminology lists. In this case we use PhraseMatcher to create a Doc object from a list of phrases, and pass that into `matcher` instead.

-------------------------
#Pattern vs. Phrase Matching
The Matcher and PhraseMatcher in spaCy are both used for finding patterns in text, but they work in fundamentally different ways and are designed for different types of tasks. Here’s a detailed comparison to help clarify the differences:

--------------------
#1. Pattern Matcher (Matcher)
The Matcher is a flexible, token-based pattern matcher that allows you to define rules or patterns to match specific combinations of tokens.

It works on token attributes, allowing for highly customizable and complex matching rules.

----------------------
#Key Characteristics:
##Token-based:

It operates on individual tokens (words) and allows you to match specific token attributes, such as text, part-of-speech tags, lemmas, etc.

##Flexible Patterns:
You can define complex patterns that include multiple tokens, optional tokens, and conditions based on token attributes.

##Customizable:
It lets you match tokens based on various attributes like LOWER, LEMMA, POS, TAG, etc.

For example, you can create a pattern that matches a verb followed by a noun, or a sequence of specific words regardless of capitalization.

##Fine-Grained Control:
You can define how each token in a sequence should behave.

------------------------
#Example Usage:

    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load("en_core_web_sm")
    matcher = Matcher(nlp.vocab)

    # Define a pattern: match "quick" followed by "brown" followed by "fox"
    pattern = [{"LOWER": "quick"}, {"LOWER": "brown"}, {"LOWER": "fox"}]
    matcher.add("ANIMAL_PATTERN", [pattern])

    doc = nlp("The quick brown fox jumps over the lazy dog.")
    matches = matcher(doc)

    for match_id, start, end in matches:
        span = doc[start:end]
        print(f"Matched span: {span.text}")

---------------------------
#2. Phrase Matcher (PhraseMatcher)
The PhraseMatcher is designed for fast and efficient matching of large sets of exact phrases.

Instead of matching token attributes or more abstract patterns, it directly matches sequences of tokens (phrases) that are pre-defined.

------------------------
##Key Characteristics:
###Phrase-based:

It matches exact sequences of tokens (phrases) that you have predefined. Patterns are Doc objects rather than lists of token dictionaries, so there are no per-token rules or quantifiers. By default the comparison uses the verbatim token text (ORTH).

###Faster and more efficient:

Since it compares whole phrases on a single attribute instead of evaluating a rule for every token, it is much faster, especially when you need to match a large list of phrases.

###Ideal for Named Entities and Keywords:

It’s particularly useful when you have a predefined set of phrases or entities you want to search for in the text (e.g., company names, product names, or other proper nouns).

----------------------------
#Example Usage:

    import spacy
    from spacy.matcher import PhraseMatcher

    nlp = spacy.load("en_core_web_sm")
    phrase_matcher = PhraseMatcher(nlp.vocab)

    # Define a list of phrases to match and convert each one to a Doc
    terms = ["quick brown fox", "lazy dog"]
    patterns = [nlp(term) for term in terms]
    phrase_matcher.add("ANIMAL_PHRASE", patterns)

    doc = nlp("The quick brown fox jumps over the lazy dog.")
    matches = phrase_matcher(doc)

    for match_id, start, end in matches:
        span = doc[start:end]
        print(f"Matched phrase: {span.text}")

----------------------------
##Key Differences Between Matcher and PhraseMatcher
<table><tr><th>Feature</th><th>Matcher</th><th>PhraseMatcher</th></tr>

<tr ><td>Matching Type</td><td>Token-based pattern matching</td><td>Exact phrase matching</td></tr>
<tr ><td>Customization</td><td>Highly customizable, token attributes like POS, LEMMA, TAG, DEP, etc.</td><td>Matches exact token sequences (phrases)</td></tr>
<tr ><td>Flexibility</td><td>Can define complex patterns with token attributes and logical operators (e.g., optional tokens)</td><td>Matches predefined exact phrases only</td></tr>
<tr ><td>Use Cases</td><td>Matching based on grammatical structure, token attributes, or complex rules</td><td>Finding specific named entities or predefined keywords</td></tr>
<tr ><td>Efficiency</td><td>More flexible, but slower for large sets of phrases</td><td>Very fast and efficient for large sets of phrases</td></tr>
<tr ><td>Typical Use</td><td>Grammatical patterns, custom rules, token sequences</td><td>Named entity recognition, keyword extraction</td></tr>
</table>

--------------------------------
#When to Use Matcher:
You need flexibility to match tokens based on attributes like part of speech, lemma, or specific sequences of tokens.

You want to match grammatical structures, token types, or sequences with optional tokens.

You’re dealing with complex token combinations or patterns.
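To make the "optional tokens" case concrete, here is a small sketch of a `Matcher` pattern whose middle token may or may not be present. It relies only on lexical attributes (`LOWER`, `IS_PUNCT`), so a blank English pipeline is enough. The rule name `SOLAR_POWER` and the sample sentence are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only; no tagger or parser required here
token_matcher = Matcher(nlp.vocab)

# "solar", then an optional punctuation token, then "power"
pattern = [
    {"LOWER": "solar"},
    {"IS_PUNCT": True, "OP": "?"},  # "?" makes this token optional
    {"LOWER": "power"},
]
token_matcher.add("SOLAR_POWER", [pattern])

sample = nlp("Solar power and solar-power are the same thing.")
spans = [sample[start:end].text for _, start, end in token_matcher(sample)]
print(spans)
```

A single pattern covers both spellings. With `PhraseMatcher` each variant would have to be listed as its own phrase.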

----------------
#When to Use PhraseMatcher:
You want to match exact phrases (e.g., named entities, product names, specific keywords).

You have a large set of predefined phrases to look for and need a fast solution.

You don’t need token attributes like POS or LEMMA, just exact token matches.
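When the phrase list is large and only exact text matters, the patterns do not need the full pipeline. The sketch below tokenizes the terms with `nlp.tokenizer.pipe`, which skips tagging and parsing. The term list is borrowed from the exercise later in this notebook; the sample sentence and the names `econ_matcher` and `hits` are assumptions made for the example.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")

terms = [
    "voodoo economics",
    "supply-side economics",
    "trickle-down economics",
    "free-market economics",
]

# Tokenize the terms in one pass; no tagger or parser is run on them
patterns = list(nlp.tokenizer.pipe(terms))

econ_matcher = PhraseMatcher(nlp.vocab)
econ_matcher.add("ECON_TERMS", patterns)

text = "Opponents spoke of trickle-down economics; supporters preferred free-market economics."
sample = nlp(text)
hits = [sample[start:end].text for _, start, end in econ_matcher(sample)]
print(hits)
```

This shortcut only applies to the default text-based matching. If the matcher is created with an `attr` that a pipeline component predicts (for example `POS` or `LEMMA`), the patterns must be processed by that component as well.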

---------------------------
#Summary
##Matcher:
Use this for more complex, flexible patterns that involve specific token attributes.

It gives you granular control over token matching but is slower for large sets of phrases.

##PhraseMatcher:
Use this for fast, exact matching of predefined phrases.

It's ideal for entity recognition or finding specific phrases in text and is much more efficient when working with large sets of phrases.

In [42]:
# Perform standard imports, reset nlp
import spacy
nlp = spacy.load('en_core_web_sm')

In [43]:
# Import the PhraseMatcher library
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

For this exercise we're going to import a Wikipedia article on *Reaganomics*<br>
Source: https://en.wikipedia.org/wiki/Reaganomics

In [44]:
# Import the requests library
import requests

# Define the URL
url = 'https://en.wikipedia.org/wiki/Reaganomics'

# Send a GET request to the URL
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Pass the raw response text (unparsed HTML) straight to spaCy
    doc3 = nlp(response.text)
else:
    print(f"Failed to retrieve data. Status code: {response.status_code}")

# Retrieve data

-----------------------
# 1. Checking the Response Status Code (response.status_code == 200):
response.status_code: When making an HTTP request using libraries like requests, response is an object that holds the server's response.

status_code: This attribute tells you the HTTP status code of the response. HTTP status codes indicate whether the request was successful or not.

The most common codes include:
200: OK (the request was successful, and the server returned the expected data).

404: Not Found (the server couldn't find what was requested).

500: Internal Server Error (the server encountered an error).

if response.status_code == 200: This line checks whether the server returned a 200 OK status code, meaning the request was successful. If so, the following block of code will execute.
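The status codes listed above can be looked up without making a request. The sketch below uses the standard library's `http.HTTPStatus`. The helpers `describe_status` and `is_success` are hypothetical names for this illustration, not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the numeric code with its standard reason phrase."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"


def is_success(code):
    """Treat every 2xx code as success, not only 200."""
    return 200 <= code < 300


for code in (200, 404, 500):
    print(describe_status(code), "| success:", is_success(code))
```

Checking for exactly 200 is fine for a simple page fetch. `requests` also offers `response.ok` and `response.raise_for_status()` for a broader check.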

---------------------------
#2. Parsing the HTML Content:
    doc3 = nlp(response.text)
##response.text:

This retrieves the content of the HTTP response as a string (typically the raw HTML or JSON data that the server returned).

##doc3 = nlp(response.text):

This line uses spaCy's NLP model (nlp) to process the text content of the response. It essentially creates a Doc object from the text, allowing you to perform various NLP tasks (like tokenization, named entity recognition, etc.).

###doc3:

This is the result of processing the response.text through spaCy's NLP pipeline. It contains the parsed document that spaCy can work with (tokens, sentences, etc.).

However, passing raw HTML to spaCy directly is unusual: tags, attributes and scripts get tokenized along with the article text.

If the response.text contains HTML (as most web pages do), it is better to extract the meaningful text with a library like BeautifulSoup before passing it to spaCy.

Normally, it would look more like this:

    from bs4 import BeautifulSoup

    if response.status_code == 200:
        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')
        # Extract the text from the HTML
        clean_text = soup.get_text()
        # Pass the clean text to spaCy for NLP processing
        doc3 = nlp(clean_text)
    else:
        print(f"Failed to retrieve data. Status code: {response.status_code}")

Here, BeautifulSoup is used to clean up the raw HTML and extract the actual text content that you’re interested in before passing it to spaCy.
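To see what the cleanup does without fetching anything, here is a small offline sketch on a hand-written HTML snippet. The snippet and the variable names are invented for the example. Dropping `<script>` and `<style>` elements first is one common choice, not the only one.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for a downloaded page (invented for this example)
html = """<html><head><title>Demo</title><style>p {color: red}</style></head>
<body><p>These policies are called <a href="/wiki/x">supply-side economics</a>.</p>
<script>var x = 1;</script></body></html>"""

soup = BeautifulSoup(html, "html.parser")

# Remove elements whose contents are code rather than prose
for tag in soup(["script", "style"]):
    tag.decompose()

# Extract the remaining text and collapse runs of whitespace
clean_text = " ".join(soup.get_text().split())
print(clean_text)
```

After this step the phrase sits in plain running text, which is the form the phrase patterns expect.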

---------------------
#3. Else Block:

    else:
        print(f"Failed to retrieve data. Status code: {response.status_code}")

If the status_code is not 200 (indicating the request was unsuccessful), the program will execute the else block and print an error message along with the actual status_code.

------------------------
#Summary
The code checks if the HTTP request was successful by verifying if the status code is 200.

If successful, it processes the response's text content using spaCy (doc3 = nlp(response.text)), though this is unconventional because HTML should typically be cleaned first.

If unsuccessful, it prints an error message showing the status code of the failed request.

In [34]:
# First, create a list of match phrases:
phrase_list = ['voodoo economics', 'supply-side economics', 'trickle-down economics', 'free-market economics']

# Next, convert each phrase to a Doc object:
phrase_patterns = [nlp(text) for text in phrase_list]

# Pass the list of Doc objects into matcher (spaCy v3 takes the list directly; v2 used matcher.add('VoodooEconomics', None, *phrase_patterns)):
matcher.add('VoodooEconomics', phrase_patterns)

# Build a list of matches:
matches = matcher(doc3)

In [35]:
# (match_id, start, end)
matches

[(3473369816841043438, 19502, 19504)]
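Each tuple holds a hash of the rule name followed by the start and end token indices. As a minimal sketch, the block below turns such tuples back into readable values on a short made-up sentence. `demo_matcher`, `demo_doc` and `decoded` are illustrative names.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
demo_matcher = PhraseMatcher(nlp.vocab)
demo_matcher.add("VoodooEconomics", [nlp.make_doc("voodoo economics")])

demo_doc = nlp('Opponents called the policy "voodoo economics" at the time.')

decoded = []
for match_id, start, end in demo_matcher(demo_doc):
    rule_name = nlp.vocab.strings[match_id]  # hash -> rule name
    span = demo_doc[start:end]               # token indices -> Span
    decoded.append((rule_name, start, end, span.text))

print(decoded)
```

The `end` index is exclusive, so `doc[start:end]` returns exactly the matched tokens.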

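<font color=green>Only one match is returned here. Because doc3 was built from raw HTML, markup is attached to many of the tokens, which most likely keeps the other phrases from matching, as the first tokens of doc3 show:</font>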

In [36]:
doc3[:70]

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-

## Viewing Matches
There are a few ways to fetch the text surrounding a match. The simplest is to grab a slice of tokens from the doc that is wider than the match:

In [37]:
doc3[665:685]  # A 20-token window (in this HTML-based doc3 the only match starts at 19502, so this slice shows markup)

meta property="og:image:width" content="800">
<meta property="og:image:height" content="529

In [38]:
doc3[2975:2995]  # Another 20-token window, again containing only markup here

class="user-links-collapsible-item mw-list-item user-links-collapsible-item"><a data
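Rather than typing slice bounds by hand, the window can be computed from each match. The helper below is a sketch with a hypothetical name (`context_window`) and an arbitrary default width. It clamps the bounds so that matches near the start or end of the Doc do not produce an invalid slice.

```python
import spacy
from spacy.matcher import PhraseMatcher


def context_window(doc, start, end, width=5):
    """Return the match plus up to `width` tokens on each side."""
    left = max(start - width, 0)
    right = min(end + width, len(doc))
    return doc[left:right]


nlp = spacy.blank("en")
window_matcher = PhraseMatcher(nlp.vocab)
window_matcher.add("VoodooEconomics", [nlp.make_doc("voodoo economics")])

sample = nlp("Critics in both parties dismissed the plan as voodoo economics during the 1980 campaign.")
windows = [context_window(sample, start, end) for _, start, end in window_matcher(sample)]
for window in windows:
    print(window.text)
```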

Another way is to use the Doc's sentence boundaries, which `en_core_web_sm` sets through its dependency parser (a `sentencizer` component would also work), then iterate through the sentences to the match point:

In [39]:
# Build a list of sentences
sents = [sent for sent in doc3.sents]

# In the next section we'll see that sentences contain start and end token values:
print(sents[0].start, sents[0].end)

0 447


In [41]:
# Iterate over the sentence list until the sentence end value exceeds a match start value:
for sent in sents:
    if matches[0][1] < sent.end:  # Access the first match
        print(sent)
        break

These policies are characterized as <a href="/wiki/Supply-side_economics" title="Supply-side economics">supply-side economics</a>, <a href="/wiki/Trickle-down_economics" title="Trickle-down economics">trickle-down economics</a>, or "voodoo economics" by opponents,<sup id="cite_ref-Roubini-1997_5-0" class="reference"><a href="#cite_note-Roubini-1997-5"><span class="cite-bracket">&#91;</span>5<span class="cite-bracket">&#93;</span></a></sup><sup id="cite_ref-Voodoo_economics-2004_6-0" class="reference"><a href="#cite_note-Voodoo_economics-2004-6"><span class="cite-bracket">&#91;</span>6<span class="cite-bracket">&#93;</span></a></sup> including some Republicans, while Reagan and his advocates preferred to call it <a href="/wiki/Free_market_economy" class="mw-redirect" title="Free market economy">free-market economics</a>.
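The loop above can also be replaced by the `sent` attribute of the matched span, which returns the sentence containing it. This sketch assumes sentence boundaries are available. It adds the rule-based `sentencizer` to a blank pipeline, and the sample text is made up.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based sentence boundaries, no model needed

sent_matcher = PhraseMatcher(nlp.vocab)
sent_matcher.add("VoodooEconomics", [nlp.make_doc("voodoo economics")])

sample = nlp("Reagan cut taxes. Critics called this voodoo economics. Supporters disagreed.")

# Span.sent is the sentence that contains the matched span
sentences = [sample[start:end].sent.text for _, start, end in sent_matcher(sample)]
print(sentences)
```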



For additional information visit https://spacy.io/usage/rule-based-matching
## Next Up: NLP Basics Assessment