# Token-based matching

spaCy features a **rule-matching engine**, the `Matcher`, that operates over tokens, similar to regular expressions.
The rules can refer to **token annotations** (e.g. the token text or `tag_`) and flags (e.g. `IS_PUNCT`).
The rule matcher also lets you pass in a custom callback to act on matches, for example to merge entities and apply custom labels.
You can also associate patterns with entity IDs, to allow some basic entity linking or disambiguation. To match large terminology lists, you can use the `PhraseMatcher`, which accepts `Doc` objects as match patterns.

In [2]:
import spacy
from spacy.matcher import Matcher

## Adding patterns

Suppose we want to find a sequence of three tokens:

1. A token whose lowercase form matches “hello”, e.g. “Hello” or “HELLO”.
2. A token whose `is_punct` flag is set to `True`, i.e. any punctuation.
3. A token whose lowercase form matches “world”, e.g. “World” or “WORLD”.

In [None]:
[{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
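To make the meaning of this pattern concrete, here is a minimal pure-Python sketch of how a list of per-token dicts can be checked against a token sequence. It is a toy illustration, not spaCy's implementation: `token_attrs` and `find_matches` are hypothetical helpers, the tokens are split by hand, and only the `LOWER` and `IS_PUNCT` attributes are modelled.

```python
import string

def token_attrs(text):
    """Compute the two attributes this pattern uses for one token (toy version)."""
    return {
        "LOWER": text.lower(),
        "IS_PUNCT": all(ch in string.punctuation for ch in text),
    }

def find_matches(tokens, pattern):
    """Return (start, end) pairs where each dict in pattern matches one token."""
    attrs = [token_attrs(t) for t in tokens]
    n = len(pattern)
    matches = []
    for start in range(len(tokens) - n + 1):
        window = attrs[start:start + n]
        # Every key/value in every token spec must agree with the token at that position.
        if all(tok[key] == value
               for tok, spec in zip(window, pattern)
               for key, value in spec.items()):
            matches.append((start, start + n))
    return matches

pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
tokens = ["Hello", ",", "world", "!", "Hello", "world", "!"]  # tokenized by hand
print(find_matches(tokens, pattern))  # [(0, 3)]
```

Only the first “Hello, world” matches: the second “Hello world” has no punctuation token between the two words, so the middle dict has nothing to match.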

First, we initialize the `Matcher` with a **vocab**. The matcher must always share the same vocab with the documents it will operate on.

We can now call `matcher.add()` with an ID and our custom pattern.

The second argument lets you pass in an optional callback function to invoke on a successful match. For now, we set it to `None`.

In [5]:
import en_core_web_sm

# Load the small English pipeline; equivalent to spacy.load("en_core_web_sm")
nlp = en_core_web_sm.load()

In [7]:
matcher = Matcher(nlp.vocab)

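# Add match ID "HelloWorld" with no callback (None) and one pattern
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]

# spaCy v2 signature: add(key, on_match, *patterns).
# In spaCy v3 the equivalent call is matcher.add("HelloWorld", [pattern]).
matcher.add("HelloWorld", None, pattern)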

In [8]:
doc = nlp("Hello, world! Hello world!")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)

15578876784678163569 HelloWorld 0 3 Hello, world
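The callback slot can also be filled with a function instead of `None`. The sketch below assumes spaCy v2.2.2 or later, where patterns are passed as a list and the callback goes in the `on_match` keyword. It uses `spacy.blank("en")` so that no trained model is needed, and `seen` is a hypothetical list used only to show that the callback ran.

```python
import spacy
from spacy.matcher import Matcher

# Tokenizer-only pipeline: LOWER and IS_PUNCT are lexical attributes,
# so no trained model is required for this pattern.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

seen = []  # hypothetical collector, for illustration only

def on_match(matcher, doc, i, matches):
    """Called once per match; i is the index of the current match in matches."""
    match_id, start, end = matches[i]
    seen.append(doc[start:end].text)

pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
# Assumption: list-of-patterns API with keyword callback (spaCy v2.2.2+ / v3).
matcher.add("HelloWorld", [pattern], on_match=on_match)

doc = nlp("Hello, world! Hello world!")
matcher(doc)
print(seen)
```

The callback receives the matcher, the document, the index of the current match, and the full list of matches, so it can inspect or act on each matched span as it is found.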


In [None]:
https://spacy.io/usage/rule-based-matching#matcher