# Purpose of this notebook

An overview of what you can do with token patterns and phrase matching in spacy.


Before we get into it, it should be pointed out that
- you can express more complex types of patterns than shown here, see https://spacy.io/api/matcher#patterns
- you could extend this to more complex tasks, like rule-based phrase and named entity extraction,
  - ...though you might base that on more specific existing code like PhraseMatcher and EntityRuler,
    which may work faster and/or annotate automatically,
  - and in the case of NER, it would probably still be less effective than existing trained NER model components.

We will demonstrate some of that in the process.

In [1]:
import collections
import spacy.displacy
import spacy.matcher
from spacy.matcher import Matcher

import wetsuite.datasets
import wetsuite.helpers.spacy


In [2]:
# some example data to use later
some_cvdr_text        = wetsuite.datasets.load('cvdr-mostrecent-text').data.random_values(250)
some_rechtspraak_text = list( case.get('bodytext')   for case in wetsuite.datasets.load('rechtspraaknl-struc').data.random_values(250) )

[The things we can match tokens on](https://spacy.io/usage/rule-based-matching#adding-patterns-attributes) are mostly attributes already on the Token object, so if you've already worked with spacy, you should recognize most as things you can also access via code (e.g. from the [Token documentation](https://spacy.io/api/token#attributes) and our [spacy intro notebook](methods_nlp__spacy_basics.ipynb)):
- `ORTH` or `TEXT` - the text as-is
- `LENGTH` - length of TEXT
- `LOWER` - the lowercase version of the text
- `NORM` - the normalized version (can be set by a language's tokenizer exceptions, e.g. to resolve contractions; otherwise it seems to usually be just the lowercased text)
- `LEMMA` - the lemma, i.e. the base form of the word
- `SHAPE` - uppercase letters become `X`, lowercase letters `x`, digits `d`, and runs of the same character are truncated after 4, so e.g. `Katherine80` would become `Xxxxxdd`
- `POS` - coarse tagging  (often following wider conventions), e.g. `NOUN`; many models follow [a relatively universal parts-of-speech set](https://universaldependencies.org/u/pos/)
  - (so often `ADJ`, `ADP`, `ADV`, `AUX`, `CCONJ`, `DET`, `INTJ`, `NOUN`, `NUM`, `PART`, `PRON`, `PROPN`, `PUNCT`, `SCONJ`, `SYM`, `VERB`, `X` -- but you should check for each model)
- `TAG` - finer tagging than POS  
  - (easily model/language specific, e.g. `'N|soort|ev|basis|zijd|stan'`)
- `MORPH` - morphological properties 
  - (easily model/language specific, something like `Gender=Com|Number=Sing`)
  - note that this supports `{"MORPH": {"IS_SUBSET": ["Number=Sing", "Gender=Neut"]}}`

- `DEP` - dependency relation in the parse ([this takes some more typing to specify](https://spacy.io/usage/rule-based-matching#dependencymatcher))
- `IS_SENT_START`
- `IS_LOWER`, `IS_UPPER`, `IS_TITLE`; `IS_PUNCT`, `IS_SPACE`, `IS_STOP`; `IS_ALPHA`, `IS_ASCII`, `IS_DIGIT`; `LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL`
- `_` accesses custom attributes
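
To see what some of these attributes look like on actual tokens, here is a minimal sketch. It assumes only that spacy is installed: it uses a blank Dutch pipeline (`spacy.blank("nl")`) and hand-made tokens, so no trained model is needed. That also means `POS`, `TAG`, `MORPH` and `DEP` are not filled in here. The example words are made up for illustration.

```python
import spacy
from spacy.tokens import Doc

# A blank pipeline has a tokenizer and lexical attributes, but no tagger or parser.
nlp = spacy.blank("nl")

# Build the Doc from explicit words, so that tokenization cannot surprise us.
doc = Doc(nlp.vocab, words=["Katherine80", "betaalde", "250", "euro", "."])

for tok in doc:
    print(
        f"{tok.text!r:15s} LOWER={tok.lower_!r:15s} SHAPE={tok.shape_!r:10s} "
        f"LENGTH={len(tok):2d} LIKE_NUM={tok.like_num!s:5s} IS_PUNCT={tok.is_punct}"
    )
```

The pattern keys are the uppercase names of these attributes, so what you can inspect this way is also what you can match on.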

Also,
* you can add the optional `OP`, a [quantifier](https://spacy.io/usage/rule-based-matching#quantifiers) (that uses a syntax much like regexes)
  - `?`, `+`, `*`, `{n}`, `{n,m}`, `{n,}`, `{,m}`
  - ...and `!`, which negates the token pattern (the documentation describes it as requiring the pattern to match exactly 0 times)
    - TOFIGURE: does it consume a token, meaning "any one token but this"? Or does it _not_ consume, making it more of a negative lookahead?
  - consider e.g. `{'POS':'ADJ', 'OP':'*'}, {"POS": {"REGEX": "(NOUN|PROPN)"}}`

* instead of the value being one literal string, you can make it a dict and add one of the following:
  - use `==`, `>=`, `<=`, `>`, `<`
    - consider e.g. `{"LENGTH": {">=": 10}}`
  - use `IN`, `NOT_IN`, `IS_SUBSET`, `IS_SUPERSET`, `INTERSECTS` to compare with lists
    - consider e.g. `{"POS": {"IN":["NOUN","PROPN"]} }` or `{"MORPH": {"IS_SUBSET": ["Number=Sing", "Gender=Neut"]}}`

  - invoke [the regex operator](https://spacy.io/usage/rule-based-matching#regex)
    - consider e.g. `{"TEXT": {"REGEX": "deff?in[ia]tely"}}`, or `{"TAG": {"REGEX": "^N"}}`
  - invoke [the fuzzy operator](https://spacy.io/usage/rule-based-matching#fuzzy)
    - which (by default) allows an [edit distance](https://en.wikipedia.org/wiki/Edit_distance#Example) of at least 2 (more for longer pattern strings), so e.g. `{"FUZZY": "favorite"}` will match "favourite", "favorites", and "gavorite".
    - you can also set the maximum edit distance explicitly, with `FUZZY1` through `FUZZY9`

  - some combinations, e.g. `{"FUZZY": {"IN": ["awesome", "cool", "wonderful"]}}`

* 'match _any_ one token' can be done with `{}`
  - so `{'OP':'+'}` matches one or more arbitrary tokens, and `{'OP':'*'}` zero or more (the `'deze v|u V'` pattern further down relies on this)
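
A few of these operators in action. This is a minimal sketch that assumes only that spacy is installed: it uses a blank Dutch pipeline and a hand-tokenized sentence, and `find` is a hypothetical helper written for this example, not part of spacy. The result for `!` is only printed, so that you can check how it behaves in your spacy version.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("nl")  # blank pipeline: lexical attributes only, no trained model


def find(pattern, words):
    """Return the matched texts for one pattern on a hand-tokenized sentence."""
    matcher = Matcher(nlp.vocab)
    matcher.add("demo", [pattern])
    doc = Doc(nlp.vocab, words=words)
    return [doc[start:end].text for _, start, end in matcher(doc)]


words = ["Dit", "vonnis", "is", "gewezen", "op", "18", "juli", "2002", "."]

# IN: the token's lowercase form must be one of the listed values
print(find([{"LOWER": {"IN": ["deze", "dit"]}}, {"LOWER": "vonnis"}], words))

# REGEX on the text, combined with a numeric comparison on LENGTH
print(find([{"TEXT": {"REGEX": "^gew"}, "LENGTH": {">=": 5}}], words))

# {} matches any single token: number, anything, number
print(find([{"LIKE_NUM": True}, {}, {"LIKE_NUM": True}], words))

# {} with a quantifier: 'is', one or more arbitrary tokens, then '2002'
print(find([{"LOWER": "is"}, {"OP": "+"}, {"LOWER": "2002"}], words))

# '!' is printed rather than explained, so you can inspect the behaviour yourself
print(find([{"LOWER": "dit"}, {"LOWER": "arrest", "OP": "!"}], words))
```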

### How do I put patterns in, and how do I get matches out?

Both depend a little on whether you are using a separate Matcher, or are adding a Ruler to the model object.

Chances are you'll end up using something more visual while experimenting,
and something more succinct once you're using this as a tool.

#### bare Matcher

A Matcher is a separate thing that you can run on a document. For example:

In [186]:
dutch  = spacy.load('nl_core_news_lg')

matcher = Matcher(dutch.vocab)
matcher.add("HW_MetPunct", [   [{"LOWER": "hallo"}, {"IS_PUNCT": True}, {"LOWER": "wereld"}]            ])
matcher.add("HW_Opt",      [   [{"LOWER": "hallo"}, {"IS_PUNCT": True, "OP":'*'}, {"LOWER": "wereld"}]  ])  # punctuation is optional in this one (so matches both in the example)

doc = dutch("De hallo, wereld test. Eerste patroon matcht niet op hallo wereld.")
print( list( repr(tok) for tok in doc ) ) # (just to point out what the tokens are)

for match_id, start, end in matcher( doc ):   # (end is exclusive)
    match_str  = dutch.vocab.strings[match_id] # (spacy stores everything as integer IDs; this looks up the string for the pattern name, and is optional if you don't care which pattern matched)
    match_span = doc[start:end]  # fetch as a Span, mostly because that is the easiest way to get the .text back
    print(f"Pattern {repr(match_str):15s} matches token {start:3d}..{end-1:<3d}  which is the text {repr(match_span.text)}")


['De', 'hallo', ',', 'wereld', 'test', '.', 'Eerste', 'patroon', 'matcht', 'niet', 'op', 'hallo', 'wereld', '.']
Pattern 'HW_MetPunct'   matches token   1..3    which is the text 'hallo, wereld'
Pattern 'HW_Opt'        matches token   1..3    which is the text 'hallo, wereld'
Pattern 'HW_Opt'        matches token  11..12   which is the text 'hallo wereld'


In [159]:
#As a slightly more interesting example, on some real data, let's look for  
#one or more adjectives,  before  <a noun or proper noun>.

an_pattern = [
    [ # you can have more rules in a matcher, this has just one 
        {"POS":"ADJ",   "OP":"+"},   
        {"POS":{"IN":["NOUN","PROPN"]} } 
    ],
]
matcher = spacy.matcher.Matcher(dutch.vocab)
matcher.add("adjective-noun", an_pattern)

doc = dutch( some_cvdr_text[0] )
for _, start, end in matcher(doc):
    print( doc[start:end].text, end=';   ' )

# 


bijzondere commissie;   derde lid;   bijzondere commissie;   eerste lid;   politieke ambtsdragers;   decentrale politieke ambtsdragers;   openbaar vervoer;   eigen auto;   eigen auto;   functionele beperking;   tijdelijke functionele beperking;   eerste lid;   geschikte vervoersvoorziening;   werkelijke verblijfkosten;   eerste lid;   politieke ambtsdragers;   decentrale politieke ambtsdragers;   eerste lid;   tijdelijk ontslag;   Nadere regels;   politieke ambtsdragers;   decentrale politieke ambtsdragers;   inhoudelijke informatie;   maximale vergoeding;   overlegde stukken;   bijzondere deskundigheid;   zwaarte taak;   eerste lid;   politieke ambtsdragers;   decentrale politieke ambtsdragers;   beroepsmatige deskundigheid;   bijzondere beroepsmatige deskundigheid;   redelijke verhouding;   politieke ambtsdragers;   decentrale politieke ambtsdragers;   eerste lid;   politieke ambtsdragers;   decentrale politieke ambtsdragers;   eerste lid;   tweede lid;   vaste vergoedingen;   politi
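
The output above contains both `politieke ambtsdragers` and `decentrale politieke ambtsdragers`: with `+`, the Matcher reports every span that satisfies the pattern, including shorter ones inside longer ones. If you only want the longest, `Matcher.add` accepts `greedy="LONGEST"`. A minimal sketch, which uses a hand-built Doc with manually assigned POS tags (an assumption made so that it runs without a trained model):

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("nl")

# Hand-assigned POS tags stand in for what a trained tagger would produce.
doc = Doc(
    nlp.vocab,
    words=["de", "decentrale", "politieke", "ambtsdragers"],
    pos=["DET", "ADJ", "ADJ", "NOUN"],
)

an_pattern = [[{"POS": "ADJ", "OP": "+"}, {"POS": {"IN": ["NOUN", "PROPN"]}}]]

# Default behaviour: all spans that fit the pattern
all_matcher = Matcher(nlp.vocab)
all_matcher.add("adjective-noun", an_pattern)
all_matches = sorted(doc[s:e].text for _, s, e in all_matcher(doc))

# greedy="LONGEST": of overlapping matches, keep only the longest
longest_matcher = Matcher(nlp.vocab)
longest_matcher.add("adjective-noun", an_pattern, greedy="LONGEST")
longest_matches = [doc[s:e].text for _, s, e in longest_matcher(doc)]

print(all_matches)
print(longest_matches)
```

Whether you want this depends on the task: for counting phrases, the shorter variants may be exactly what you are after.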

#### EntityRuler, or preferably SpanRuler

When using spacy to collect and _combine_ different kinds of annotation,
you might specifically want it marked up on the document object.

You can do that by using not a Matcher but a SpanRuler (results end up in `.spans`) or an EntityRuler (results end up in `.ents`),
which work in _almost_ the same way, but are added to the model object and mark up each document it processes - and both can also be visualized.

<!-- -->

Using an EntityRuler makes it easy to visualize matches exactly as you would entities,
but has some problems for that same reason: entities are assumed to never overlap,
so matches are silently dropped when they overlap - with other matches, or with entities that the existing NER added.
(You can disable the existing NER, but that may defeat the point of combining things; and there is no way to avoid dropping overlapping matches.)

<!-- -->

...which is why spacy recommends using SpanRuler instead.
Visualizing spans isn't as compact, exactly _because_ it is capable of displaying overlapping spans.

Note that `.spans` is not just a list of matches, but is made to hold spans from different, named sources.
As such, `.spans` is a dict from a useful name to a list of spans.
SpanRuler stores its spans under the key `'ruler'` by default, which is why we have to name that key explicitly
in the second visualization below.

Note that the way you hand patterns to Rulers is a _little_ different from how you hand them to a Matcher,
but is the same for these two Rulers:

In [4]:
# using EntityRuler:
dutch_with_er  = spacy.load('nl_core_news_lg')
dutch_with_er.remove_pipe("ner")  # drop the trained NER so that its entities cannot clash with ours (in this example they happen not to clash, so you _could_ leave it in)
ruler = dutch_with_er.add_pipe("entity_ruler")
ruler.add_patterns( [ {'label':'ADJ_N',  "pattern":[ {"POS":"ADJ", "OP":"+"},  {"POS":{"IN":["NOUN","PROPN"]} }  ]} ] )

doc = dutch_with_er("""Onder de naam "afvalstoffenheffing" wordt een directe belasting geheven als bedoeld inartikel 15.33 van de Wet milieubeheer;artikel 15.33 van de Wet milieubeheer; de afvalstoffenheffing als bedoeld in deze verordening en de daarbij behorende tarieventabel wordt naar afzonderlijke grondslagen geheven ter zake van het feitelijk gebruik van een perceel ten aanzien waarvan krachtensartikel 10.21en10.22 van de Wet milieubeheereen verplichting tot het inzamelen van huishoudelijke afvalstoffen geldt.artikel 10.21en10.22 van de Wet milieubeheereen verplichting tot het inzamelen van huishoudelijke afvalstoffen geldt.""")
spacy.displacy.render(doc, style='ent', jupyter=True)

In [185]:
# using SpanRuler and its visualization instead:
dutch_with_sr  = spacy.load('nl_core_news_lg')
ruler = dutch_with_sr.add_pipe("span_ruler")
ruler.add_patterns( [ {'label':'ADJ_N',  "pattern":[  {"POS":"ADJ",  "OP":"+"},  {"POS":{"IN":["NOUN","PROPN"]} }  ],} ] )

doc = dutch_with_sr("""Onder de naam "afvalstoffenheffing" wordt een directe belasting geheven als bedoeld inartikel 15.33 van de Wet milieubeheer;artikel 15.33 van de Wet milieubeheer; de afvalstoffenheffing als bedoeld in deze verordening en de daarbij behorende tarieventabel wordt naar afzonderlijke grondslagen geheven ter zake van het feitelijk gebruik van een perceel ten aanzien waarvan krachtensartikel 10.21en10.22 van de Wet milieubeheereen verplichting tot het inzamelen van huishoudelijke afvalstoffen geldt.artikel 10.21en10.22 van de Wet milieubeheereen verplichting tot het inzamelen van huishoudelijke afvalstoffen geldt.""")
print( doc.spans )   # dict-like: maps a name (here 'ruler') to the spans stored under it

spacy.displacy.render(doc, style='span', options={"spans_key":"ruler"}, jupyter=True) # we have to tell it which source of spans we want; SpanRuler uses 'ruler'

<!--
#### Manually

When you have specific things you want to do, you might get more creative with code.

Say, "I want only adjective-noun pairs from the sentences that mention appellant(/e/es), and want to count how often they occur":

count = collections.defaultdict(int)
for text in some_rechtspraak_text:
    doc = dutch( text )
    for sent in doc.sents:
        if not 'appellant' in sent.text.lower():
            continue
        for _, start_i, end_i in matcher( sent ):
            count[ sent[ start_i : end_i ].text.strip() ] += 1 

for phrase, n in sorted( count.items(), key=lambda x:x[1], reverse=True):  # names chosen to avoid shadowing str and count
    print( f'{n:5d}  {phrase}')
    #if count<3:
    #    break
-->

### Making life a little easier?

These rules are... verbose.  We can make a little rule system of our own to save some typing.

There are some very real limitations to doing this, as it is easily _less_ expressive - it does _not_ allow everything that you can specify at spacy level.
But it can be nice for some quick tests.

In [6]:
import spacy
from spacy.matcher import Matcher
import wetsuite.datasets

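def expand_pats(patstring):
    """ Expands a space-separated shorthand string into a list of spacy token patterns:
        - all-lowercase words are matched literally (via LOWER)
        - anything else should be a name in the pats table below, and is replaced by that entry's token pattern(s)
        Unknown names are reported and otherwise ignored.
    """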
    pats = {
        'DPN':      [{'POS':'DET', 'OP':'*'}, {'POS':'PRON', 'OP':'*'}, {"POS": {"REGEX": "(NOUN|PROPN)"}}],
        'ADJN':     [{'POS':'DET', 'OP':'*'}, {'POS':'ADJ', 'OP':'*'}, {"POS": {"REGEX": "(NOUN|PROPN)"}}],
        'ADJ':      [{'POS':'ADJ', 'OP':'+'}],
        'V':        [{"POS": {"IN": ["VERB", "AUX"]}}],
        #'V':        [{"POS": {"REGEX": "(VERB|AUX)"}}],
        #'ANYPLUS':  [{"TEXT": {"REGEX": "."}, 'OP':'+'}], #figure out whether there is a better/faster way to match any token
        #'ANYSTAR':  [{"TEXT": {"REGEX": "."}, 'OP':'*'}],
        'ANYPLUS':  [{'OP':'+'}], #figure out whether there is a better/faster way to match any token
        'ANYSTAR':  [{'OP':'*'}],
        'DATELIKE': [{"LIKE_NUM": True}, {"POS": "PROPN"}, {"LIKE_NUM": True}],
    }
    ret = []
    for part in patstring.split():
        if part == part.lower():
            ret.append( {'LOWER':part} )
        else:
            if part in pats:
                ret.extend( pats[part] )
            else:
                print('Do not know pattern %r'%part)
    return ret


dutch  = spacy.load('nl_core_news_lg')

matcher = Matcher(dutch.vocab)
for pat in (
    #'ADJN heeft V dat',
    #'zou zijn V door',
    #'gelet op de ADJN',
    #'ADJ rechten',
    #'de beschuldigingen ANYPLUS zijn',
    #'aan de hand van het ADJN',
    #'aan de hand van de ADJN',
    #'volgens ADJN',
    #'gaat ADJN uit van',
    #'gaan ADJN uit van',
 #   'behandeld op DATELIKE',
 #   'uitgesproken ANYSTAR op DATELIKE',
 #   'uitgesproken ANYSTAR van DATELIKE',
 #   'uitspraak ANYPLUS DATELIKE',
    #'voorlopig karakter', 'niet bindend', #'bodemprocedure',
    #'nadere stukken',
    #'ADJN concludeert',
    #'ADJN voert verweer'
    ):
    #print(pat)
    ep = expand_pats(pat)
    #print(ep)
    matcher.add( pat, [ep] ) 


matcher.add( 'deze v|u V', [ [ 
        {"LOWER":   {"IN":["deze", "dit"]}},   
        {"LOWER":   {"IN":["verordening", "uitspraak", "arrest", "beschikking", "vonnis"]}},   
        {"POS": {"IN":["VERB","AUX"]} },
        {'OP':'+'},
        {"LIKE_NUM": True}, {"POS": "PROPN"}, {"LIKE_NUM": True}
], ])

In [7]:
for case in wetsuite.datasets.load('rechtspraaknl-struc').data.random_values(250):
    print(case['identifier'])

    doc = dutch( case.get('bodytext') )

    for sent in doc.sents:
        for match_id, start, end in matcher( sent ):
            match_str = dutch.vocab.strings[match_id]  # look up the pattern name (spacy stores it in the vocab too, as it refers to everything by integer ID)
            match_span = sent[start:end]  # the matched span (start and end are relative to the sentence we matched on)
            print(f"  {repr(match_str):35s} matches:  {repr(match_span.text)}")

ECLI:NL:GHSGR:2009:BO0472
ECLI:NL:PHR:1997:AA3305
ECLI:NL:GHSGR:2005:AT8681
ECLI:NL:GHSHE:2023:2403
ECLI:NL:RBNHO:2015:3501
ECLI:NL:RBAMS:1999:AA3505
ECLI:NL:RBMAA:2002:AF2556
  'deze v|u V'                        matches:  'Dit vonnis is gewezen door mr. De Kort, rechter, en ter openbare terechtzitting van 18 juli 2002'
ECLI:NL:GHARN:2009:BJ6878
ECLI:NL:RVS:2004:AO1664
ECLI:NL:CRVB:2020:1679
ECLI:NL:RBNHO:2018:7169
ECLI:NL:CRVB:2010:BN3515
ECLI:NL:RBZWB:2024:2796
  'deze v|u V'                        matches:  'Deze uitspraak is gedaan door mr. R.J.H. de Brouwer, kantonrechter, bijgestaan door de griffier mr. C.A. Lequin, en in het openbaar uitgesproken op 18 maart 2024'
ECLI:NL:RBGEL:2019:1920
ECLI:NL:GHAMS:2011:BQ4861
ECLI:NL:RBNNE:2020:571
  'deze v|u V'                        matches:  'Deze uitspraak is gewezen door mr. W.S. Sikkema, voorzitter, mr. L.W. Janssen en mr. C.H. Beuker, rechters, bijgestaan door mr. B.E. Oosterhout, griffier, en uitgesproken ter openbare terechtzittin

## Other ideas

### Finding spelling variants and typos

For example, I have seen both 'lantaarnpalen' and 'lantarenpalen'. The fuzzy operator can help find such variants:


In [11]:
import spacy
from spacy.matcher import Matcher
import wetsuite.datasets
import wetsuite.helpers.notebook

dutch = spacy.load('nl_core_news_lg')
matcher = Matcher(dutch.vocab)
matcher.add( 'fuzzy lantaarnpaal', [ [{"LOWER": {"FUZZY": "lantaarnpaal"}}] ])

for text in wetsuite.datasets.load('cvdr-mostrecent-text').data.random_values(250):
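    if 'paal' in text:          # cheap pre-filter to avoid parsing irrelevant texts (note: this also skips texts that only contain the plural 'palen')
        if len(text) < 500000:  # skip very large documents, which are slow to parse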
            doc = dutch( text )

            for sent in doc.sents:
                for match_id, start, end in matcher( sent ):
                    # assumes it's a single token, so we only need to look at the Token sent[start],  not the Span sent[start:end] 
                    tok = sent[start]
                    print(f" {repr(tok.text)}")    

 'lantaarns'
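
That `'lantaarns'` matches may be surprising, as it is four edits away from `lantaarnpaal`. According to the spacy documentation, the default allowance grows with the pattern length (at least 2 edits, up to roughly 30% of the pattern's length), which would explain it. If that is too loose, `FUZZY1` through `FUZZY9` set the maximum edit distance explicitly. A minimal sketch on made-up tokens, assuming a spacy version that supports `FUZZY`; `fuzzy_hits` is a hypothetical helper for this example:

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("nl")
doc = Doc(nlp.vocab, words=["een", "lantarenpaal", "en", "een", "fiets"])


def fuzzy_hits(operator):
    """Return token texts matching 'lantaarnpaal' under the given fuzzy operator."""
    matcher = Matcher(nlp.vocab)
    matcher.add("fuzzy", [[{"LOWER": {operator: "lantaarnpaal"}}]])
    return [doc[start].text for _, start, _ in matcher(doc)]


default_hits = fuzzy_hits("FUZZY")   # default edit-distance allowance
strict_hits = fuzzy_hits("FUZZY1")   # at most one edit

print(default_hits)  # 'lantarenpaal' is two edits away, so it is found
print(strict_hits)   # with at most one edit allowed, it is not
```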


In [27]:
import spacy
from spacy.matcher import Matcher
import wetsuite.datasets
import wetsuite.helpers.notebook

dutch = spacy.load('nl_core_news_lg')
matcher = Matcher(dutch.vocab)
matcher.add( 'fuzzy', [ [{"LOWER": {"FUZZY": "rechtspraak"}}] ])

for text in wetsuite.datasets.load('cvdr-mostrecent-text').data.random_values(250):
#for case in wetsuite.datasets.load('rechtspraaknl-struc').data.random_values(250):
#    text = case.get('bodytext')

    if len(text) < 500000:
        if 'recht' in text:
            doc = dutch( text )

            for match_id, start, end in matcher( doc ):
                # assumes it's a single token, so we only need to look at the Token doc[start],  not the Span doc[start:end] 
                tok = doc[start]
                print(f" {repr(tok.text)}")    