## Purpose of this notebook

An overview of what you can do with phrase and pattern matching in spacy.


Before we get into it, it should be pointed out that
- you can express more complex types of patterns than we use here, see https://spacy.io/api/matcher#patterns
- you could extend this to more complex tasks, such as rule-based phrase extraction and named entity extraction.
  - ...though you might base that on more specific existing code like PhraseMatcher and EntityRuler,
    which may work faster and/or annotate automatically.
  - ...and in the case of NER, rules would probably still be less effective than existing trained NER model components.

We will demonstrate that somewhat in the process.

[The things we can match tokens on](https://spacy.io/usage/rule-based-matching#adding-patterns-attributes) are mostly attributes already on the Token object, so if you've already worked with spacy, you should recognize most as things you can also access via code (e.g. from the [Token documentation](https://spacy.io/api/token#attributes) and our [spacy intro notebook](methods_nlp__spacy_basics.ipynb)):
- `ORTH` or `TEXT` - the text as-is
- `LENGTH` - length of TEXT
- `LOWER` - the lowercase version of the text
- `NORM` - the normalized version (resolves things like contractions and spelling variants where the language data defines them; otherwise it is usually the same as the lowercased text)
- `LEMMA` - the lemmatized form
- `SHAPE` - uppercase and lowercase letters become `X` and `x`, digits become `d`, and runs of the same character are truncated after 4, so e.g. `Katherine80` would become `Xxxxxdd`
- `POS` - coarse tagging  (often following wider conventions), e.g. `NOUN`; many models follow [a relatively universal parts-of-speech set](https://universaldependencies.org/u/pos/)
  - (so often `ADJ`, `ADP`, `ADV`, `AUX`, `CCONJ`, `DET`, `INTJ`, `NOUN`, `NUM`, `PART`, `PRON`, `PROPN`, `PUNCT`, `SCONJ`, `SYM`, `VERB`, `X` -- but you should check for each model)
- `TAG` - finer tagging than POS  
  - (easily model/language specific, e.g. `'N|soort|ev|basis|zijd|stan'`)
- `MORPH` - morphological properties 
  - (easily model/language specific, something like `Gender=Com|Number=Sing`)
  - note that this supports `{"MORPH": {"IS_SUBSET": ["Number=Sing", "Gender=Neut"]}}`

- `DEP` - dependency relation in the parse ([this takes some more typing to specify](https://spacy.io/usage/rule-based-matching#dependencymatcher))
- `IS_SENT_START`
- `IS_LOWER`, `IS_UPPER`, `IS_TITLE`; `IS_PUNCT`, `IS_SPACE`, `IS_STOP`; `IS_ALPHA`, `IS_ASCII`, `IS_DIGIT`; `LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL`
- `_` accesses custom attributes
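
To make the `SHAPE` rule above concrete, here is a plain-Python sketch of it. This only illustrates the rule as described in the list; it is not spacy's own code, and the name `sketch_shape` and the `max_run` parameter are made up for this example.

```python
def sketch_shape(text, max_run=4):
    """Approximate the SHAPE rule: letters -> X/x, digits -> d, runs truncated after max_run."""
    shape = []
    last = None
    run = 0
    for ch in text:
        if ch.isalpha():
            mapped = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            mapped = "d"
        else:
            mapped = ch  # assumption: other characters (e.g. punctuation) are kept as-is
        # count how long the current run of identical shape characters is
        run = run + 1 if mapped == last else 1
        last = mapped
        if run <= max_run:
            shape.append(mapped)
    return "".join(shape)


print(sketch_shape("Katherine80"))  # Xxxxxdd
print(sketch_shape("2024-01-15"))   # dddd-dd-dd
```

In practice you would read `token.shape_` rather than compute this yourself; the sketch is just to show why long words with the same capitalization end up with the same shape.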

Also,
* you can add the optional `OP`, a [quantifier](https://spacy.io/usage/rule-based-matching#quantifiers) (that uses a syntax much like regexes)
  - `?`, `+`, `*`, `{n}`, `{n,m}`, `{n,}`, `{,m}`
  - ...and `!` which means it must _not_ match (match 0 times at this point) 
    - TOFIGURE: (does it consume so mean an "anything but this?" -- or does it _not_ consume and mean more of a negative lookahead?)
  - consider e.g. `{'POS':'ADJ', 'OP':'*'}, {"POS": {"REGEX": "(NOUN|PROPN)"}}`

* instead of the value being one literal string, you can make it a dict and add one of the following:
  - use `==`, `>=`, `<=`, `>`, `<`
    - consider e.g. `{"LENGTH": {">=": 10}}`
  - use `IN`, `NOT_IN`, `IS_SUBSET`, `IS_SUPERSET`, `INTERSECTS` to compare with lists
    - consider e.g. `{"POS": {"IN":["NOUN","PROPN"]} }` or `{"MORPH": {"IS_SUBSET": ["Number=Sing", "Gender=Neut"]}}`

  - invoke [the regex operator](https://spacy.io/usage/rule-based-matching#regex)
    - consider e.g. `{"TEXT": {"REGEX": "deff?in[ia]tely"}}`, or `{"TAG": {"REGEX": "^N"}}`
  - invoke [the fuzzy operator](https://spacy.io/usage/rule-based-matching#fuzzy)
    - which (by default) allows a small [edit distance](https://en.wikipedia.org/wiki/Edit_distance#Example) (at least 2, and up to 30% of the pattern string's length), so e.g. `{"TEXT": {"FUZZY": "favorite"}}` will match "favourite", "favorites", and "gavorite".
    
  - some combinations, e.g. `{"TEXT": {"FUZZY": {"IN": ["awesome", "cool", "wonderful"]}}}`

* 'match _any_ one token' can be done with `{}`
  - TOFIGURE: does that mean `{'OP':'+'}` matches one or more?
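
As a small sketch of how a few of these operators behave, the following uses a blank English pipeline (tokenizer only, so no model download), which is enough because it only matches on lexical attributes. The rule names and the example sentence are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only; fine for LOWER, LENGTH and friends
matcher = Matcher(nlp.vocab)

# a comparison on an integer attribute
matcher.add("LONG_WORD", [[{"LENGTH": {">=": 10}}]])
# a regex on the lowercased token text
matcher.add("SPELLING", [[{"LOWER": {"REGEX": "^deff?in[ia]tely$"}}]])
# an empty token pattern with a quantifier: zero or one token of any kind
matcher.add("HELLO_ANY_WORLD", [[{"LOWER": "hello"}, {"OP": "?"}, {"LOWER": "world"}]])

doc = nlp("Hello, world! This is definately a demonstration.")
found = sorted(
    (nlp.vocab.strings[match_id], doc[start:end].text)
    for match_id, start, end in matcher(doc)
)
for rule_name, text in found:
    print(f"{rule_name:16s} {text!r}")
```

Note that a token can be matched by several rules at once (here the misspelled word is both long and a spelling variant); a bare Matcher reports all of them.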

## Enough words, let's start doing things

In [11]:
import spacy
import spacy.matcher
from spacy.matcher import Matcher
import spacy.displacy

import wetsuite.helpers.spacy
import wetsuite.datasets


# some example data to use later
some_cvdr_text        = wetsuite.datasets.load('cvdr-mostrecent-text').data.random_values(250)
some_rechtspraak_text = list( case.get('bodytext')   for case in wetsuite.datasets.load('rechtspraaknl-struc').data.random_values(250) )

### How do I put patterns in, how do I get matches out?

Both depend a little on whether you are using a separate Matcher, or are adding a Ruler to the model object.

Chances are you'll end up on something more visual while experimenting,
and something more succinct once you're using this as a tool.

#### bare Matcher

A Matcher is a separate object that you can run on a document.

You construct it from the vocab of an existing pipeline, and run it on documents that pipeline produced - in part just because _something_ needs to create the tokens-with-attributes we will match on.

For example:

In [None]:
dutch  = spacy.load('nl_core_news_lg')

matcher = spacy.matcher.Matcher(dutch.vocab)
#            pattern name      pattern 
matcher.add("HW_MetPunct",    [   [{"LOWER": "hallo"}, {"IS_PUNCT": True},           {"LOWER": "wereld"}]  ])
matcher.add("HW_Opt",         [   [{"LOWER": "hallo"}, {"IS_PUNCT": True, "OP":'*'}, {"LOWER": "wereld"}]  ])  # punctuation is optional in this one (so matches both in the example)
# note that the pattern is a list of lists. This is because you can have multiple patterns for one rule (e.g. to specify some variants), so far we have just one.

doc = dutch("De hallo, wereld test.  Eerste patroon matcht niet op hallo wereld.")
print( 'Sentence tokens: ', list( repr(tok) for tok in doc ) ) # (just to point out what the tokens and their texts are)

for match_id, start, end in matcher( doc ):   # running that Matcher will return a sequence of matches   (note: end is exclusive)
    match_str  = dutch.vocab.strings[match_id] # (spacy makes a point of having an integer representation of everything it might store; this gets the string representation of the pattern name, and is optional if you don't care)
    match_span = doc[start:end]  # fetch as span, mostly because it's the easiest way to get the .text again:
    print(f"Pattern {repr(match_str):15s} matches token {start:3d}..{end-1:<3d}  which is the text {repr(match_span.text)}")

Sentence tokens:  ['De', 'hallo', ',', 'wereld', 'test', '.', ' ', 'Eerste', 'patroon', 'matcht', 'niet', 'op', 'hallo', 'wereld', '.']
Pattern 'HW_MetPunct'   matches token   1..3    which is the text 'hallo, wereld'
Pattern 'HW_Opt'        matches token   1..3    which is the text 'hallo, wereld'
Pattern 'HW_Opt'        matches token  12..13   which is the text 'hallo wereld'


In [6]:
# Let's make a slightly more interesting example, and run on some real data
# let's look for   one or more adjectives,  before  <a noun or proper noun>.

an_pattern = [
    [ 
        {"POS":"ADJ",   "OP":"+"},        # "+" means one or more  
        {"POS":{"IN":["NOUN","PROPN"]} } 
    ],
]
matcher = spacy.matcher.Matcher(dutch.vocab)
matcher.add("adjective-noun", an_pattern)

doc = dutch( some_cvdr_text[0] )
for _match_id, start, end in matcher(doc):
    print( doc[start:end].text, end='\n' )

onroerende zaakbelastingen
onroerende zaken
directe belastingen
onroerende zaak
persoonlijk recht
onroerende zaak
onroerende zaak
onroerende zaak
volgtijdig gebruik
onroerende zaak
onroerende zaak
basisregistratie kadaster
onroerende zaak
onroerende zaak
onroerende zaken
onroerende zaak
onroerende zaken
onroerende zaak
onroerende zaak
onroerende zaken
onroerende zaak
onroerende zaak
onroerende zaken
onroerende zaak
overeenkomstige toepassing
tweede lid
onroerende zaken
open grond
onroerende zaken
openbare eredienst
openbare bezinningssamenkomsten
levensbeschouwelijke aard
onroerende zaken
zodanige onroerende zaken
onroerende zaken
volledige rechtsbevoegdheid
openbaar vervoer
publiekrechtelijke rechtspersonen
zodanige werken
ander afvalwater
publiekrechtelijke rechtspersonen
zodanige werken
onroerende zaak
onroerende zaken
publieke dienst
onroerende zaken
zodanige onroerende zaken
onroerende zaken
zodanige onroerende zaken
onroerende zaken
zodanige onroerende zaken
eerste lid
onroerende

#### EntityRuler, or preferably SpanRuler

When using spacy to collect and _combine_ different kinds of annotation,
you might want to visualize all the things you have,
i.e. mark them up on the document object, and show that.

If this is your goal, switch from a Matcher to either a SpanRuler or an EntityRuler,
which work in _almost_ the same way, but store their matches on the document object (in `.spans` and `.ents`, respectively),
and there is out-of-the-box visualization for both.

<!-- -->

Using an EntityRuler makes it easy to visualize exactly as we previously did with entities, but has an already-mentioned problem: 
Entities are assumed to never overlap, so matches will be silently dropped when they overlap
with other matches, or with entities that e.g. existing NER added.
(You can disable existing NER, but that may defeat the point of combining things; there is no way to avoid dropping overlapping matches.)

<!-- -->

...which is why spacy recommends using SpanRuler instead. 
Visualizing spans isn't as compact exactly _because_ it is capable of displaying overlapping spans.

Note that `.spans` is not just a list of matches, but is made to hold spans from different, named sources.
As such, `.spans` is a dict from a useful name to a list of spans. 
SpanRuler by default adds its spans under the key `'ruler'`, which is why we have to explicitly tell displacy that key 
in the second visualization below.

Note that the way you hand in the pattern into Rulers is a _little_ different from Matchers,
and the same between these two Rulers:

In [None]:
# using EntityRuler:
dutch_with_er  = spacy.load('nl_core_news_lg')
dutch_with_er.remove_pipe("ner")  # the existing NER just happens to _NOT_ clash on this specific sentence (so you could comment this to see both) but in general you cannot assume that
ruler = dutch_with_er.add_pipe("entity_ruler")
ruler.add_patterns( [ {'label':'ADJ_N',  "pattern":[ {"POS":"ADJ", "OP":"+"},  {"POS":{"IN":["NOUN","PROPN"]} }  ]} ] )

doc = dutch_with_er("""Onder de naam "afvalstoffenheffing" wordt een directe belasting geheven als bedoeld inartikel 15.33 van de Wet milieubeheer;artikel 15.33 van de Wet milieubeheer; de afvalstoffenheffing als bedoeld in deze verordening en de daarbij behorende tarieventabel wordt naar afzonderlijke grondslagen geheven ter zake van het feitelijk gebruik van een perceel ten aanzien waarvan krachtensartikel 10.21en10.22 van de Wet milieubeheereen verplichting tot het inzamelen van huishoudelijke afvalstoffen geldt.artikel 10.21en10.22 van de Wet milieubeheereen verplichting tot het inzamelen van huishoudelijke afvalstoffen geldt.""")
spacy.displacy.render(doc, style='ent', jupyter=True)

In [11]:
# using SpanRuler and its visualization instead:
dutch_with_sr  = spacy.load('nl_core_news_lg')
ruler = dutch_with_sr.add_pipe("span_ruler")
ruler.add_patterns( [ {'label':'ADJ_N',  "pattern":[  {"POS":"ADJ",  "OP":"+"},  {"POS":{"IN":["NOUN","PROPN"]} }  ],} ] )

doc = dutch_with_sr("""Onder de naam "afvalstoffenheffing" wordt een directe belasting geheven als bedoeld inartikel 15.33 van de Wet milieubeheer;artikel 15.33 van de Wet milieubeheer; de afvalstoffenheffing als bedoeld in deze verordening en de daarbij behorende tarieventabel wordt naar afzonderlijke grondslagen geheven ter zake van het feitelijk gebruik van een perceel ten aanzien waarvan krachtensartikel 10.21en10.22 van de Wet milieubeheereen verplichting tot het inzamelen van huishoudelijke afvalstoffen geldt.artikel 10.21en10.22 van de Wet milieubeheereen verplichting tot het inzamelen van huishoudelijke afvalstoffen geldt.""")
doc.spans

spacy.displacy.render(doc, style='span', options={"spans_key":"ruler"}, jupyter=True) # we have to tell it which source of spans we want; SpanRuler uses 'ruler'

<!--
#### Manually

When you have specific things you want to do, you might get more creative with code.

Say, "I want only adjective-noun pairs from the sentences that mention appellant(/e/es), and want to count how often they occur":

import collections

counts = collections.Counter()
for text in some_rechtspraak_text:
    doc = dutch( text )
    for sent in doc.sents:
        if 'appellant' not in sent.text.lower():
            continue
        for _, start_i, end_i in matcher( sent ):
            counts[ sent[ start_i : end_i ].text.strip() ] += 1 

for phrase, count in counts.most_common():   # most frequent first
    print( f'{count:5d}  {phrase}')
    #if count<3:
    #    break
-->

### Making life a little easier?

These rules are... verbose.   They are not fun to write, which doesn't exactly invite working on improving them.

Could we make our own little rule system to make that less typing?

Yes, though there are some very real limitations to doing this:
- it's shorter, but it's also another thing to learn
- it's just for simpler patterns - it's easily _less_ expressive and it does _not_ allow everything that you can specify at spacy level

...but it could be nice to do some quick tests.

In [None]:
def expand_pats(patstring):
    """ Takes a singular string with some specific pattern names,
        Returns a spacy pattern that has expanded those pattern names to  actual patterns,
        e.g.:: 
            'behandeld op DATELIKE' 
        becomes::
            [{'LOWER': 'behandeld'}, {'LOWER': 'op'}, {'LIKE_NUM': True}, {'POS': 'PROPN'}, {'LIKE_NUM': True}]

        Just a makeshift experiment to see how useful this idea can be.
        The existing patterns are uppercased to make it less likely to clash with a literal word.
        
        To see how this is intended to be used, see the creation of the Matcher below this definition 
    """
    pats = {
        'DPN':      [{'POS':'DET', 'OP':'*'}, {'POS':'PRON', 'OP':'*'}, {"POS": {"REGEX": "(NOUN|PROPN)"}}],
        'ADJN':     [{'POS':'DET', 'OP':'*'}, {'POS':'ADJ', 'OP':'*'}, {"POS": {"REGEX": "(NOUN|PROPN)"}}],
        'ADJ':      [{'POS':'ADJ', 'OP':'+'}],
        'V':        [{"POS": {"IN": ["VERB", "AUX"]}}],
        #'V':        [{"POS": {"REGEX": "(VERB|AUX)"}}],
        #'ANYPLUS':  [{"TEXT": {"REGEX": "."}, 'OP':'+'}], #figure out whether there is a better/faster way to match any token
        #'ANYSTAR':  [{"TEXT": {"REGEX": "."}, 'OP':'*'}],
        'ANYPLUS':  [{'OP':'+'}], #figure out whether there is a better/faster way to match any token
        'ANYSTAR':  [{'OP':'*'}],
        'DATELIKE': [{"LIKE_NUM": True}, {"POS": "PROPN"}, {"LIKE_NUM": True}],
    }
    ret = []
    for part in patstring.split():
        if part == part.lower():
            ret.append( {'LOWER':part} )  # in theory we could also normalize that to use LEMMA, and catch inflections
        else:
            if part in pats:
                ret.extend( pats[part] )
            else:
                print('Do not know pattern %r'%part)
    return ret


dutch  = spacy.load('nl_core_news_lg')

matcher = Matcher(dutch.vocab)

# for comparison, first spacy's own style...   (which is actually necessary for this nontrivial pattern)
matcher.add( 'deze v|u|a|b|v V', [ [
        {"LOWER":   {"IN":["deze", "dit"]}},
        {"LOWER":   {"IN":["verordening", "uitspraak", "arrest", "beschikking", "vonnis"]}},
        {"POS": {"IN":["VERB","AUX"]} },
        {'OP':'+'},
        {"LIKE_NUM": True}, {"POS": "PROPN"}, {"LIKE_NUM": True}
],] )

# ...and then our brief one  (lowercase means words to match, uppercase means a thing that expand_pats probably expands)
for pat in (
        #'ADJN heeft V dat',
        #'zou zijn V door',
        #'gelet op de ADJN',
        #'ADJ rechten',
        #'de beschuldigingen ANYPLUS zijn',
        #'aan de hand van het ADJN',
        #'aan de hand van de ADJN',
        #'volgens ADJN',
        #'gaat ADJN uit van',
        #'gaan ADJN uit van',
        'behandeld op DATELIKE',
        'uitgesproken ANYSTAR op DATELIKE',
        'uitgesproken ANYSTAR van DATELIKE',
        'uitspraak ANYPLUS DATELIKE',
        #'voorlopig karakter', 'niet bindend', #'bodemprocedure',
        #'nadere stukken',
        #'ADJN concludeert',
        #'ADJN voert verweer',
    ):
    ep = expand_pats(pat)
    matcher.add( pat, [ep] )

In [None]:
# Okay, now run that on some actual data - specifically some court cases
for case in wetsuite.datasets.load('rechtspraaknl-struc').data.random_values( 250 ):
    print( 'In document', case['identifier'])

    doc = dutch( case.get('bodytext') )

    for sent in doc.sents:
        for match_id, start, end in matcher( sent ):
            match_str = dutch.vocab.strings[match_id]  # Get string representation, seems to point out the pattern name is added to the vocab too, presumably to have an integer-only representation?
            match_span = sent[start:end]  # The matched span
            print(f"  pattern:{repr(match_str):35s} matches:  {repr(match_span.text)}")    

In document ECLI:NL:GHLEE:2010:BN8169
In document ECLI:NL:PHR:2011:BO5254
  pattern:'uitspraak ANYPLUS DATELIKE'        matches:  'uitspraak een bestuursbesluit in stand gelaten, dan staat dit er in beginsel aan in de weg dat de strafrechter het verweer dat het besluit ten onrechte is genomen zelfstandig onderzoekt en daarop beslist (HR 24 september 2002'
In document ECLI:NL:RBGEL:2022:4102
  pattern:'uitgesproken ANYSTAR op DATELIKE'  matches:  'uitgesproken op 27 juli 2022'
  pattern:'deze v|u|a|b|v V'                  matches:  'Dit vonnis is gewezen door mr. E. Boerwinkel en in het openbaar uitgesproken op 27 juli 2022'
In document ECLI:NL:RBOBR:2016:6592
  pattern:'uitspraak ANYPLUS DATELIKE'        matches:  'uitspraak: 28 november 2016'
  pattern:'deze v|u|a|b|v V'                  matches:  'Dit vonnis is op tegenspraak gewezen naar aanleiding van het onderzoek ter terechtzitting van 14 november 2016'
  pattern:'uitgesproken ANYSTAR op DATELIKE'  matches:  'uitgesproken op 28 n

## Discussion

Pattern matching can be great,
but you probably want to know its limitations before you spend a lot of time
discovering that it _isn't_ enough for your purposes, or at least, how much work it will be.


One of the things to realize is that pattern matching only ever reports what it matches,
and examples that look great are usually the patterns that are _very_ regular.

If what you need is exact and rigid matching, great!

Yet the more varied, the more _natural_ the text is, the harder that is to do,
and **it will not tell you what you are missing**, what you need to improve.

That's your job, as your own technician.
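
One cheap way to do that job is to compare a strict pattern against a much looser probe, and read what only the probe finds. The sketch below uses plain regular expressions as a stand-in for a Matcher, and the example sentences are made up for illustration, not taken from the datasets above.

```python
import re

# made-up sentences in the style of court rulings
sentences = [
    "Dit vonnis is uitgesproken op 27 juli 2022.",
    "Uitgesproken ter openbare terechtzitting van 14 november 2016.",
    "Aldus uitgesproken in het openbaar op 3 maart 2020.",
    "De zaak is behandeld op 1 mei 2019.",
]

# the strict pattern we are happy with, and a loose probe for "probably relevant"
strict = re.compile(r"uitgesproken op \d{1,2} \w+ \d{4}", re.IGNORECASE)
probe = re.compile(r"uitgesproken", re.IGNORECASE)

# sentences the probe flags but the strict pattern does not match: candidates for misses
missed = [s for s in sentences if probe.search(s) and not strict.search(s)]

for sentence in missed:
    print("possibly missed:", sentence)
```

The probe will also flag things you do not care about, so this gives you a reading list rather than a score, but it is a reading list the strict pattern alone would never have produced.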