Notes: In this lesson, we'll take a look at spaCy's matcher, which lets you write rules to find words and phrases in text.

In [None]:
!wget https://www.gutenberg.org/files/11/11-0.txt

--2023-10-16 12:13:16--  https://www.gutenberg.org/files/11/11-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 174313 (170K) [text/plain]
Saving to: ‘11-0.txt’


2023-10-16 12:13:16 (705 KB/s) - ‘11-0.txt’ saved [174313/174313]



The table below lists the token attributes that a spaCy Matcher pattern can refer to:

| ATTRIBUTE       | VALUE TYPE | DESCRIPTION                                      |
|-----------------|------------|--------------------------------------------------|
| ORTH            | unicode    | The exact verbatim text of a token.             |
| TEXT            | unicode    | The exact verbatim text of a token.             |
| LOWER           | unicode    | The lowercase form of the token text.           |
| LENGTH          | int        | The length of the token text.                   |
| IS_ALPHA        | bool       | Token text consists of alphabetic characters.   |
| IS_ASCII        | bool       | Token text consists of ASCII characters.       |
| IS_DIGIT        | bool       | Token text consists of digits.                 |
| IS_LOWER        | bool       | Token text is in lowercase.                    |
| IS_UPPER        | bool       | Token text is in uppercase.                    |
| IS_TITLE        | bool       | Token text is in titlecase.                    |
| IS_PUNCT        | bool       | Token is punctuation.                           |
| IS_SPACE        | bool       | Token is whitespace.                            |
| IS_STOP         | bool       | Token is a stop word.                          |
| IS_SENT_START   | bool       | Token is the start of a sentence.              |
| SPACY           | bool       | Token has a trailing space.                    |
| LIKE_NUM        | bool       | Token text resembles a number.                |
| LIKE_URL        | bool       | Token text resembles a URL.                    |
| LIKE_EMAIL      | bool       | Token text resembles an email address.        |
| POS             | unicode    | The token's coarse-grained part-of-speech tag. |
| TAG             | unicode    | The token's fine-grained part-of-speech tag.   |
| DEP             | unicode    | The token's dependency label.                  |
| LEMMA           | unicode    | The lemma (base form) of the token.            |
| SHAPE           | unicode    | The visual shape of the token.                 |
| ENT_TYPE        | unicode    | The entity label of the token.                |

In [None]:
import spacy

# Load the English language model in spaCy
nlp = spacy.load("en_core_web_sm")

# Process the text
text = "The quick brown fox jumped over the lazy dog."
docs = nlp(text)

# Iterate through the tokens and print their attributes
for token in docs:
    print(f"Token: {token.text}")
    print(f"POS Tag: {token.pos_}")    # Part-of-Speech (POS) Tag
    print(f"Tag: {token.tag_}")        # Detailed POS Tag
    print(f"Dependency Label: {token.dep_}")
    print(f"Lemma: {token.lemma_}")
    print(f"Shape: {token.shape_}")
    print(f"Entity Type: {token.ent_type_}\n")


Token: The
POS Tag: DET
Tag: DT
Dependency Label: det
Lemma: the
Shape: Xxx
Entity Type: 

Token: quick
POS Tag: ADJ
Tag: JJ
Dependency Label: amod
Lemma: quick
Shape: xxxx
Entity Type: 

Token: brown
POS Tag: ADJ
Tag: JJ
Dependency Label: amod
Lemma: brown
Shape: xxxx
Entity Type: 

Token: fox
POS Tag: NOUN
Tag: NN
Dependency Label: nsubj
Lemma: fox
Shape: xxx
Entity Type: 

Token: jumped
POS Tag: VERB
Tag: VBD
Dependency Label: ROOT
Lemma: jump
Shape: xxxx
Entity Type: 

Token: over
POS Tag: ADP
Tag: IN
Dependency Label: prep
Lemma: over
Shape: xxxx
Entity Type: 

Token: the
POS Tag: DET
Tag: DT
Dependency Label: det
Lemma: the
Shape: xxx
Entity Type: 

Token: lazy
POS Tag: ADJ
Tag: JJ
Dependency Label: amod
Lemma: lazy
Shape: xxxx
Entity Type: 

Token: dog
POS Tag: NOUN
Tag: NN
Dependency Label: pobj
Lemma: dog
Shape: xxx
Entity Type: 

Token: .
POS Tag: PUNCT
Tag: .
Dependency Label: punct
Lemma: .
Shape: .
Entity Type: 
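
POS, TAG, DEP, LEMMA and ENT_TYPE are predicted by the trained pipeline, while the IS_* and LIKE_* flags, LOWER and SHAPE are computed from the token text alone. The sketch below shows two of those text-only attributes used in a Matcher pattern. It assumes a blank English pipeline (tokenizer only), a made-up sentence, and an illustrative rule name `chapter_ref`; none of these come from the lesson's data.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline (tokenizer only) is enough here,
# because IS_TITLE and IS_DIGIT depend only on the token text.
nlp = spacy.blank("en")
doc = nlp("Read Chapter 7 and Chapter 12 tonight.")  # made-up sentence

matcher = Matcher(nlp.vocab)
# A titlecase token followed by a token made only of digits
matcher.add("chapter_ref", [[{"IS_TITLE": True}, {"IS_DIGIT": True}]])

chapter_refs = [doc[start:end].text for _, start, end in matcher(doc)]
print("Matches:", chapter_refs)
```

"Read" is titlecase too, but it is followed by "Chapter" rather than a digit token, so it does not start a match.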



In [None]:
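# Set the path to the downloaded file
PATH = "/content/11-0.txt"

# Read the data (the with-block closes the file afterwards)
with open(PATH, encoding="utf-8") as f:
    data = f.read()

# If you get a decoding error, try the following
# data = open('11-0.txt', encoding='cp850').read()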

import spacy

# Import the Matcher
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp(data)

Let’s say we want to find phrases starting with the word Alice followed by a verb.

In [None]:
# Initialize the matcher with the pipeline's shared vocabulary
matcher = Matcher(nlp.vocab)

# Create a pattern matching two tokens: "Alice" and a verb
# TEXT matches the exact token text; POS "VERB" matches any token tagged as a verb
pattern = [{"TEXT": "Alice"}, {"POS": "VERB"}]

# Add the pattern to the matcher.
# The first argument is a unique ID for the rule ("alice").
# The second is a list of patterns; an optional callback can be passed as on_match=...
matcher.add("alice", [pattern])

# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Matches: ['Alice think', 'Alice started', 'Alice had', 'Alice had', 'Alice began', 'Alice opened', 'Alice ventured', 'Alice felt', 'Alice took', 'Alice thought', 'Alice had', 'Alice went', 'Alice went', 'Alice thought', 'Alice kept', 'Alice had', 'Alice thought', 'Alice called', 'Alice replied', 'Alice began', 'Alice guessed', 'Alice said', 'Alice went', 'Alice knew', 'Alice heard', 'Alice thought', 'Alice heard', 'Alice noticed', 'Alice dodged', 'Alice looked', 'Alice looked', 'Alice replied', 'Alice replied', 'Alice felt', 'Alice turned', 'Alice thought', 'Alice replied', 'Alice folded', 'Alice said', 'Alice waited', 'Alice crouched', 'Alice noticed', 'Alice laughed', 'Alice went', 'Alice thought', 'Alice said', 'Alice said', 'Alice glanced', 'Alice caught', 'Alice looked', 'Alice added', 'Alice felt', 'Alice remarked', 'Alice waited', 'Alice coming', 'Alice looked', 'Alice said', 'Alice thought', 'Alice considered', 'Alice replied', 'Alice felt', 'Alice replied', 'Alice sighed', 'Al
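
Each match is a tuple `(match_id, start, end)`, where `match_id` is the hash of the rule ID and `start`/`end` are token indices. As a rough sketch of how the rule name can be recovered and how the optional callback is called, the example below uses a blank English pipeline, a made-up sentence, and the hypothetical names `alice_said` and `record_match`. Only text-based attributes are used, so no trained model is needed.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline for illustration
matcher = Matcher(nlp.vocab)

seen = []  # filled in by the callback

def record_match(matcher, doc, i, matches):
    """Called once per match; i is the index of the current match."""
    match_id, start, end = matches[i]
    rule_name = doc.vocab.strings[match_id]  # hash -> original string ID
    seen.append((rule_name, doc[start:end].text))

pattern = [{"TEXT": "Alice"}, {"LOWER": "said"}]
matcher.add("alice_said", [pattern], on_match=record_match)

doc = nlp("Alice said nothing. Then Alice said hello.")  # made-up text
matches = matcher(doc)
print(seen)
```

The callback is handy when something should happen for every match (logging, merging tokens, collecting spans) without looping over the result afterwards.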

Find adjectives followed by a noun.

In [None]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}]
matcher.add("id1", [pattern])
matches = matcher(doc)
# We will show you the first 20 matches
print("Matches:", set([doc[start:end].text for match_id, start, end in matches][:20]))

Matches: {'large rabbit', 'golden key', 'dreamy sort', 'long passage', 'other parts', 'little girl', 'grand words', 'right distance', 'dry leaves', '* START', 'legged table', 'hot day', 'low hall', 'first thought', 'good opportunity', 'many miles', 'pink eyes', 'several things', 'own mind', 'right word'}


Match any form of the verb “begin” (via its LEMMA) followed by an adposition.

In [None]:
matcher = Matcher(nlp.vocab)
pattern = [{"LEMMA": "begin"},{"POS": "ADP"}]
matcher.add("id1", [pattern])
matches = matcher(doc)
print("Matches:", set([doc[start:end].text for match_id, start, end in matches]))

Matches: {'began in', 'begins with', 'beginning from', 'begin at', 'began by', 'begin with', 'beginning with', 'Begin at'}


Quantifiers

The optional OP key of a token pattern sets how many times that token may match:


| OP  | DESCRIPTION                                                    |
|-----|----------------------------------------------------------------|
| !   | Negate the pattern, requiring it to match exactly 0 times.     |
| ?   | Make the pattern optional, allowing it to match 0 or 1 times.   |
| +   | Require the pattern to match 1 or more times.                   |
| *   | Allow the pattern to match zero or more times.                 |

In [None]:
# For example, match the exact word Alice followed by zero or more punctuation tokens:
matcher = Matcher(nlp.vocab)
pattern = [{"TEXT": "Alice"}, {"IS_PUNCT": True,"OP":"*"}]
matcher.add("id1", [pattern])
matches = matcher(doc)
print("Matches:", set([doc[start:end].text for match_id, start, end in matches]))

Matches: {'Alice.', 'Alice, “', 'Alice,) “', 'Alice: “', 'Alice: “—', 'Alice,)', 'Alice:', 'Alice, (', 'Alice!”', 'Alice; “', 'Alice', 'Alice (', 'Alice,', 'Alice;', 'Alice!', 'Alice. “'}
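
The output above contains 'Alice', 'Alice,' and 'Alice, “' side by side because, with `*` or `+`, the Matcher reports every span the pattern can cover, not only the longest one. If only the longest span is wanted, `Matcher.add` accepts a `greedy` argument. The sketch below assumes a blank English pipeline, a made-up sentence, and illustrative rule names.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline for illustration
doc = nlp("Come here, Alice, (if you can).")  # made-up text
pattern = [{"TEXT": "Alice"}, {"IS_PUNCT": True, "OP": "*"}]

# Default behaviour: every possible span is returned
all_matcher = Matcher(nlp.vocab)
all_matcher.add("alice_punct", [pattern])
all_spans = [doc[s:e].text for _, s, e in all_matcher(doc)]

# greedy="LONGEST" keeps only the longest of the overlapping spans
longest_matcher = Matcher(nlp.vocab)
longest_matcher.add("alice_punct", [pattern], greedy="LONGEST")
longest_spans = [doc[s:e].text for _, s, e in longest_matcher(doc)]

print("All:", all_spans)
print("Longest:", longest_spans)
```

Whether to filter depends on the task: counting mentions usually wants one span per occurrence, while exploring a pattern can benefit from seeing every variant.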


REGEX

Example: match every token starting with a lowercase “a” followed by a token whose part-of-speech tag starts with “V” (such as VERB).

In [None]:
matcher = Matcher(nlp.vocab)
pattern = [{"TEXT": {"REGEX": "^a"}},{"POS": {"REGEX": "^V"}}]
matcher.add("a_verb", [pattern])
matches = matcher(doc)
print("Matches:", set([doc[start:end].text for match_id, start, end in matches][:20]))

Matches: {'and round', 'are located', 'all made', 'away went', 'and burning', 'about stopping', 'all round', 'and saying', 'and see', 'all think', 'and finding', 'and looked', 'and noticed', 'and went', 'and found', 'all locked', 'and make', 'and wander', 'all seemed'}
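
The REGEX operator searches within the attribute value, so an expression without `^` and `$` can hit anywhere inside a token. The sketch below contrasts an anchored and an unanchored expression. It assumes a blank English pipeline and a made-up sentence, and the rule names are hypothetical.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline for illustration
doc = nlp("The color and the colour look colorful.")  # made-up text

# Anchored: the whole lowercase token must be "color" or "colour"
anchored_matcher = Matcher(nlp.vocab)
anchored_matcher.add("anchored", [[{"LOWER": {"REGEX": "^colou?r$"}}]])
anchored = [doc[s:e].text for _, s, e in anchored_matcher(doc)]

# Unanchored: any token that merely contains "color" or "colour"
loose_matcher = Matcher(nlp.vocab)
loose_matcher.add("unanchored", [[{"LOWER": {"REGEX": "colou?r"}}]])
unanchored = [doc[s:e].text for _, s, e in loose_matcher(doc)]

print("Anchored:", anchored)
print("Unanchored:", unanchored)
```

The same applies to the “^a” pattern above: TEXT is case-sensitive, so it finds “and” but not “Alice”, whereas applying the expression to LOWER would find both.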


Add and Remove Patterns

You can add more than one pattern to the Matcher before running it. Each rule only needs its own unique ID.


In [None]:
matcher = Matcher(nlp.vocab)

pattern = [{"TEXT": "Alice"}, {"IS_PUNCT": True,"OP":"*"}]
matcher.add("id1", [pattern])

pattern = [{"POS": "ADJ"},{"LOWER":"rabbit"}]
matcher.add("id2", [pattern])

matches = matcher(doc)
print("Matches:", set([doc[start:end].text for match_id, start, end in matches]))

Matches: {'large rabbit', 'Alice.', 'Alice, “', 'Alice,) “', 'Alice: “', 'Alice: “—', 'Alice,)', 'Alice:', 'Alice, (', 'Alice!”', 'Alice; “', 'Alice', 'Alice (', 'Alice,', 'Alice;', 'Alice!', 'Alice. “'}


In [None]:
matcher.remove('id1')

matches = matcher(doc)
print("Matches:", set([doc[start:end].text for match_id, start, end in matches]))

Matches: {'large rabbit'}
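
Before removing a rule it can be useful to check that it is registered. As a small sketch, the Matcher supports `in` for rule IDs and `len()` for the number of rules. The pipeline is assumed to be a blank English one and the sentence is made up; the IDs mirror the ones used above.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline for illustration
doc = nlp("Alice saw a white rabbit.")  # made-up text

matcher = Matcher(nlp.vocab)
matcher.add("id1", [[{"TEXT": "Alice"}]])
matcher.add("id2", [[{"LOWER": "rabbit"}]])

rules_before = len(matcher)           # number of rule IDs, not patterns
before = {doc[s:e].text for _, s, e in matcher(doc)}

# Only remove the rule if it is actually registered
if "id1" in matcher:
    matcher.remove("id1")

rules_after = len(matcher)
after = {doc[s:e].text for _, s, e in matcher(doc)}
print(rules_before, before)
print(rules_after, after)
```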


Reference:
- https://pythonwife.com/rule-based-matching-with-spacy/