# Purpose of this notebook

Provide _some_ answer to "so how do you detect interesting terms / phrases?"

Annoyingly, the honest response is "that depends on what you mean by interesting phrases".


There are varied methods, some simple enough that you could implement them yourself in half an hour.
They will return some interesting fragments, so they seem to work,
...yet they often carry assumptions that turn out not to match what you had in mind when you heard 'interesting'.

While each of these methods visibly does useful things, 
each will also miss things you would have wanted it to return. 
Those misses are invisible in the output, and it is often not clear why they happen.


Consider whether your goal was
- "what multi-word phrases appear in this document" 
- "what multi-word phrases make this document interesting" 
- "what multi-word phrases make this document different from others in a set" 
- "can we make lists of words" 
- match known phrases
- match phrases of a specific topic

These may seem like subtle variations, but they ask different things of a method.

Also, if you have not yet thought about what kind of phrases are more interesting, 
or why, then you can't expect a method to prefer those. 


So for the most part, the below is a start on just the first item in that list, 
to introduce some methods. Refined output will need your refined needs (and some refined code).


For example, 
- **tf-idf** is more of an ingredient for larger analysis, search, and other things, yet 
  - combined with n-grams, it might tell you which combinations of words are more common than others, but still little about how they compare. 
  - so _by itself_ it's not useful for much more than helping to make stoplists.
  - there's a [separate notebook that goes into its basics](methods_text_terms_tfidf.ipynb)

- using any language parsing, you could look for patterns. 
  - the output may be clean, but it's unclear what one might miss
  - a basic example follows below

- **Collocation analysis** often refers to a probability-based "does this combination of words appear together more often than its parts would suggest?", which is still simple and works a little better.
  - it might pick up "eigen gebruik", "echtgenoot of geregistreerde partner", "werk en inkomen naar arbeidsvermogen"
  - ...but also just fragments that happen, well, because sentences have structure ("heeft gedaan", "verplichtingen uit"), or have been ripped from their context ("tijdstip zal", "KONING DER").
  - so 'more common together' turns out to not be quite enough for clean output 
  - there's a [separate notebook that goes into its basics](methods_text_terms_collocations.ipynb)

- topic modelling goes further, asking "what sets of words or phrases seem to join and distinguish documents in a set"
  - this adds a goal from further down the list above.
  - there's a [separate notebook that goes into its basics](methods_text_topic_modeling.ipynb)
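
To make the "more often together than its parts would suggest" idea concrete, here is a minimal sketch using pointwise mutual information (PMI) over adjacent word pairs. PMI is just one common scoring choice, not necessarily what the collocations notebook uses. The function name `bigram_pmi`, the `min_count` threshold, and the toy text are all made up for this illustration.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    PMI = log2( P(a,b) / (P(a) * P(b)) ); a positive score means the pair
    occurs more often than the frequencies of its parts would suggest.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:  # rare pairs get unreliable, inflated scores
            continue
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores

# Hypothetical toy input; real use would tokenize actual documents
tokens = ("eigen gebruik is toegestaan . eigen gebruik is niet verboden . "
          "het gebruik is vrij . het is eigen werk").split()
scores = bigram_pmi(tokens)
for pair, score in sorted(scores.items(), key=lambda x: x[1], reverse=True):
    print(f"{score:6.2f}  {' '.join(pair)}")
```

Even on this toy input, a pair that exists only because of sentence structure (a full stop followed by "het") scores at least as high as "eigen gebruik", which is the kind of noise described in the collocation bullet.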


## Spacy pattern matcher

Notes: 
- you can express more complex types of patterns, see https://spacy.io/api/matcher#patterns
- You could extend this to more complex tasks, like maybe rule-based phrase and named entity extraction.
  - ...though you might base that on more specific existing code like PhraseMatcher and EntityRuler,
    which may work faster and/or annotate automatically.
  - and in the case of NER would probably still be less effective than existing trained NER model components
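
As a rough idea of what the `PhraseMatcher` route mentioned in the notes looks like, here is a small sketch for matching a fixed list of known phrases. It uses a blank Dutch pipeline (tokenizer only), and the phrase list and example sentence are made up for illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("nl")  # tokenizer only; exact phrases need no trained model
phrase_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # compare lowercased tokens

# Hypothetical phrase list, for illustration only
terms = ["hoger beroep", "persoonsgebonden budget"]
phrase_matcher.add("KNOWN_PHRASE", [nlp.make_doc(term) for term in terms])

doc = nlp("Appellante heeft Hoger Beroep ingesteld over het persoonsgebonden budget.")
found = [doc[start:end].text for _, start, end in phrase_matcher(doc)]
print(found)
```

Matching on `LOWER` is a design choice here: it makes the lookup case-insensitive, at the cost of also matching capitalised variants you may not have wanted.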


In [2]:
import collections
import spacy.displacy
import spacy.matcher
from spacy.matcher import Matcher

import wetsuite.datasets
import wetsuite.helpers.spacy

In [16]:
dutch  = spacy.load('nl_core_news_lg')

matcher = Matcher(dutch.vocab)
matcher.add("HW_MetPunct", [   [{"LOWER": "hallo"}, {"IS_PUNCT": True}, {"LOWER": "wereld"}]            ])
matcher.add("HW_Opt",      [   [{"LOWER": "hallo"}, {"IS_PUNCT": True, "OP":'*'}, {"LOWER": "wereld"}]  ])  # punctuation is optional in this one, so matches both

doc = dutch("De hallo, wereld test. Eerste patroon matcht niet op hallo wereld.")
print( list( repr(tok) for tok in doc ) ) # (just to point out what the tokens are)

for match_id, start, end in matcher( doc ):
    match_str = dutch.vocab.strings[match_id]  # Get string representation, seems to point out the pattern name is added to the vocab too, presumably to have an integer-only representation?
    span = doc[start:end]  # The matched span
    print(f"Pattern {match_str:15s} matches token {start:3d}..{end:3d} matches text {repr(span.text)}")    

['De', 'hallo', ',', 'wereld', 'test', '.', 'Eerste', 'patroon', 'matcht', 'niet', 'op', 'hallo', 'wereld', '.']
Pattern HW_MetPunct     matches token   1..  4 matches text 'hallo, wereld'
Pattern HW_Opt          matches token   1..  4 matches text 'hallo, wereld'
Pattern HW_Opt          matches token  11.. 13 matches text 'hallo wereld'


[[{'LOWER': 'I'}, {'LOWER': 'like'}, {'LOWER': 'cheese'}],
 [{'LOWER': 'hungry'}, {'LOWER': 'like'}, {'LOWER': 'the'}, {'LOWER': 'wolf'}]]

('NOUN', Gender=Com|Number=Sing, 'N|soort|ev|basis|zijd|stan')

[The things we can match include](https://spacy.io/usage/rule-based-matching#adding-patterns-attributes) attributes on Token -- you will recognize most of these from the [Token documentation](https://spacy.io/api/token#attributes) and our [spacy intro notebook](methods_nlp__spacy_basics.ipynb). They cover the token text, derived forms, annotations assigned by the model/parse, and some existing patterns:
- ORTH or TEXT - the text as-is
- LENGTH - length of TEXT
- LOWER - the lowercase version of the text
- NORM - the normalized version (seems to do things like resolve contractions, and otherwise often be the lemmatizer output?)
- LEMMA - the lemmatized form
- SHAPE - alphabetic characters become X or x, numeric by d, and sequences of the same are truncated after 4, so e.g. Katherine80 would become Xxxxxdd
- POS - coarse tagging  (often following wider conventions), e.g. `NOUN`
- TAG - finer tagging  (more easily model/language specific, e.g. `'N|soort|ev|basis|zijd|stan'`)
- MORPH - morphological properties, something like `Gender=Com|Number=Sing`
- DEP - dependency relation in the parse
- LIKE_NUM
- LIKE_URL
- LIKE_EMAIL
- IS_SENT_START
- IS_ALPHA, IS_ASCII, IS_DIGIT
- IS_LOWER, IS_UPPER, IS_TITLE
- IS_PUNCT, IS_SPACE, IS_STOP
- `_` for values in custom attributes
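
The SHAPE rule is easier to see in code. The sketch below re-implements the rule as it is described in the list above; it is not spaCy's own code, and `approx_shape` is a made-up helper name. In a real pipeline you would simply read `token.shape_`.

```python
def approx_shape(text, max_run=4):
    """Approximate the SHAPE rule: X/x for letters, d for digits,
    other characters kept, runs of the same symbol cut off after max_run."""
    out = []
    last, run = None, 0
    for ch in text:
        if ch.isalpha():
            sym = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            sym = "d"
        else:
            sym = ch
        run = run + 1 if sym == last else 1  # length of the current run
        last = sym
        if run <= max_run:
            out.append(sym)
    return "".join(out)

for word in ["Katherine80", "AWBZ", "2004", "S.A.E."]:
    print(word, "->", approx_shape(word))
```

One consequence of the truncation is that a shape like `dddd` stands for "four or more digits", not exactly four.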


Operators:
- REGEX, e.g.
  - `{"TEXT": {"REGEX": "deff?in[ia]tely"}}` or 
  - `{"TAG": {"REGEX": "^V"}}`
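
A small runnable sketch of the first REGEX example, assuming a blank English pipeline (tokenizer only, since `TEXT` needs no trained model). The test sentence is made up.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only; TEXT needs no trained model
regex_matcher = Matcher(nlp.vocab)
# One single-token pattern: the token text must contain a match for the regex
regex_matcher.add("DEFINITELY", [[{"TEXT": {"REGEX": "deff?in[ia]tely"}}]])

doc = nlp("definitely deffinitely definately defiantly")
hits = [doc[start:end].text for _, start, end in regex_matcher(doc)]
print(hits)
```

Note that the regex is applied per token, so it cannot span several tokens; for that you would combine several token patterns.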



There is also, significantly, `OP`,
which is a regex-like quantifier you can add to a token pattern
- `?`, `+`, `*`, `{n}`, `{n,m}`, `{n,}`, `{,m}`
- ...and `!`, which means the token pattern must _not_ match (match 0 times at this point). Open question: does it consume a token, so mean "anything but this", or does it _not_ consume, and mean more of a negative lookahead?

https://spacy.io/usage/rule-based-matching#quantifiers

In [18]:
rechtspraak = wetsuite.datasets.load('rechtspraaknl-struc')
rechtspraak_items = list( docdict.get('bodytext')  for _, docdict in rechtspraak.data.random_sample(250) )

In [19]:

# As an example, look for 
#  < one or more adjectives>  before  <a noun or proper noun>
# This is too simple to be directly useful, but works fine as an example
an_pattern = [
    [ # you can have more rules in a matcher, this has just one 
        {"POS": "ADJ",                "OP": "+"},   
        #{"POS": {"IN":["ADJ","ADV"]},  "OP": "+"},   
        {"POS": {"IN":["NOUN","PROPN"]} } 
    ],
]
matcher = spacy.matcher.Matcher(dutch.vocab)
matcher.add("adjective-noun", an_pattern)


count_pats = collections.defaultdict(int)

larger_rvs_advice = [] # 
for body in rechtspraak_items:
    #body = '\n'.join( item['body'] )
    doc = dutch( body )

    for sent in doc.sents:
        if 'appellant' not in sent.text.lower():
            continue
        #matches = matcher( doc )

        matches = matcher( sent )
        for match_id, start_i, end_i in matches:
            print( '------------------' )

            if 1:
                # display with some context:
                print( sent[ : start_i ].text,      end=''  ) # sentence before match, as-is
                print( " [[  ",                     end=''  ) # match: bracket
                for tok in sent[ start_i : end_i ]:          # and for each token mention part of speech
                    print(f'{tok.text}/{tok.pos_}', end='  ')
                print( "]] ",                       end=''  )
                print( sent[ end_i : ].text.rstrip() ) # sentence after match, as-is

                # print( doc[ start_i-5 : start_i ].text, 
                #     '[',
                #     doc[ start_i : end_i ].text,
                #     ']',
                #     doc[ end_i : end_i+5 ].text,
                #     )
            if 1:
                # just count for now
                count_pats[ sent[ start_i : end_i ].text ] += 1  # matcher(sent) gives offsets relative to sent, not doc
    #break

for phrase, count in sorted( count_pats.items(), key=lambda x:x[1], reverse=True):  # 'phrase' avoids shadowing the builtin str
    print( f'{count:5d}  {phrase}')

------------------
Namens appellante heeft mr. S.A.E. Vancraeynest [[  hoger/ADJ  beroep/NOUN  ]] ingesteld.
------------------
Bij besluit van 14 januari 2004 heeft CIZ appellante in het kader van de AWBZ voor de periode van 8 december 2003 tot 8 april 2004 geïndiceerd voor [[  huishoudelijke/ADJ  zorg/NOUN  ]] klasse 2 (2 tot 3,9 uur per week). 

1.2.
------------------
Bij het [[  nieuwe/ADJ  besluit/NOUN  ]] op bezwaar van 7 juni 2006 heeft CIZ appellante  geïndiceerd voor hulp bij het huishouden voor de periode van 8 december 2003 tot 12 april 2005.
------------------
Appellante heeft vervolgens zowel het zorgkantoor CZ (zorgkantoor) als CIZ gevraagd om vergoeding van schade die zij heeft geleden doordat zij destijds te weinig [[  persoonsgebonden/ADJ  budget/NOUN  ]] (pgb) heeft ontvangen.
------------------
Appellante heeft zich in [[  hoger/ADJ  beroep/NOUN  ]] gemotiveerd tegen deze uitspraak gekeerd, voor zover daarbij de rechtsgevolgen van het vernietigde besluit van 17 juli
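
One thing to be aware of with the `+` operator in the adjective-noun pattern: by default the Matcher reports every span that fits, so a noun with two adjectives yields both the long and the short variant, and both get counted. The sketch below shows this on a hand-built `Doc` (the POS tags are assigned by hand to avoid needing a trained model, and the words are made up), and uses `spacy.util.filter_spans` as one possible way to keep only the longest of overlapping matches.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.util import filter_spans

nlp = spacy.blank("nl")
# Hand-assigned POS tags stand in for a trained model's output (illustration only)
doc = Doc(nlp.vocab,
          words=["het", "nieuwe", "vernietigde", "besluit"],
          pos=["DET", "ADJ", "ADJ", "NOUN"])

adj_noun = Matcher(nlp.vocab)
adj_noun.add("adjective-noun", [[{"POS": "ADJ", "OP": "+"},
                                 {"POS": {"IN": ["NOUN", "PROPN"]}}]])

all_spans = adj_noun(doc, as_spans=True)  # Span objects instead of (id, start, end)
longest = filter_spans(all_spans)         # keeps the longest of overlapping spans
print([span.text for span in all_spans])
print([span.text for span in longest])
```

Whether you want all variants or only the longest depends on the question: for counting distinct phrases the longest-only view is usually less noisy, but the shorter variant may be the more general term.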

In [None]:

lemma_pattern = [     [ 
        {"LEMMA": "vergrijp"}  
    ],
]
matcher = spacy.matcher.Matcher(dutch.vocab)
matcher.add("lemma-vergrijp", lemma_pattern)  # named after what it matches

count_pats = collections.defaultdict(int)

larger_rvs_advice = [] # 
for body in rechtspraak_items:
    #body = '\n'.join( item['body'] )
    doc = dutch( body )

    for sent in doc.sents:
        if 'appellant' not in sent.text.lower():
            continue
        #matches = matcher( doc )

        matches = matcher( sent )
        for match_id, start_i, end_i in matches:
            print( '------------------' )

            if 1:
                # display with some context

                print( sent[ : start_i ].text, end='' )
                print( " [[  ", end='' )
                for tok in sent[ start_i : end_i ]:
                    print(f'{tok.text}/{tok.pos_}', end='  ')
                    
                print( "]] ", end='' )
                print( sent[ end_i : ].text.rstrip() )

                # print( doc[ start_i-5 : start_i ].text, 
                #     '[',
                #     doc[ start_i : end_i ].text,
                #     ']',
                #     doc[ end_i : end_i+5 ].text,
                #     )
            if 1:
                # just count for now
                count_pats[ sent[ start_i : end_i ].text ] += 1  # matcher(sent) gives offsets relative to sent, not doc
    #break

for phrase, count in sorted( count_pats.items(), key=lambda x:x[1], reverse=True):  # 'phrase' avoids shadowing the builtin str
    print( f'{count:5d}  {phrase}')

In [None]:
import requests
r = requests.get("https://restcountries.com/v2/all")
countries = {c["name"]: c   for c in r.json()}
list( countries ) 

# Slightly less basic

## Extracting patterns with rule-based matching

In [7]:
import wetsuite.helpers.strings

In [4]:
rvs = wetsuite.datasets.load('cvdr-mostrecent-text')


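Candidate phrases to look for (Dutch, with a rough English gloss):
- Deze verordening <verb>  ("This regulation <verb>")
- Deze uitspraak  ("This ruling")
- appellante  ("appellant")
- bevestigt  ("confirms")
- De omgevingsvergunning  ("The environmental permit")
- Het bestemmingsplan  ("The zoning plan")
- met ingang van  ("with effect from")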


In [46]:
texts = []
for _, text in wetsuite.datasets.load('cvdr-mostrecent-text').data.random_sample(50):
    texts.append( text )

rechtspraak_nl = wetsuite.datasets.load('rechtspraaknl-struc')
for _, dd in rechtspraak_nl.data.random_sample(50):
    texts.append( dd.get('bodytext') )

len(texts)

100

In [None]:
dutch  = spacy.load('nl_core_news_lg')


In [47]:
an_pattern = [ 
    [ 
        {"ORTH":   "Deze"},   
        {"ORTH":   {"IN":["verordening", "uitspraak"]}},   
        {"POS": {"IN":["VERB","AUX"]} } 
    ],

]

matcher = spacy.matcher.Matcher(dutch.vocab)
matcher.add("pats", an_pattern)

count_pats = collections.defaultdict(int)

for text in texts:
    doc = dutch(text)
    matches = matcher( doc )
    for match_id, start_i, end_i in matches:
        # we could mark and display them in an existing parse, but for now just count them
        #print( doc[ start_i : end_i ].text )
        count_pats[ doc[ start_i : end_i ].text ] += 1 

print( len(count_pats) )
for phrase, count in sorted( count_pats.items(), key=lambda x:x[1], reverse=True):  # 'phrase' avoids shadowing the builtin str
    print( f'{count:5d}  {phrase}')

6
   23  Deze verordening treedt
   19  Deze verordening wordt
   18  Deze uitspraak is
   11  Deze verordening kan
    7  Deze verordening is
    5  Deze verordening verstaat


In [None]:


The way spaCy handles entities supports this somewhat: 
it treats them as a special thing, but only barely.

You can manually mark up things as entities with
 `doc.set_ents( list of Spans with labels )`

You can make rules that set them, but they tend to be pattern recognition that _ignores_ context.
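
Marking up an entity by hand could look like the sketch below, assuming spaCy v3 (where `Doc.set_ents` exists) and a blank Dutch pipeline. The sentence and the name "Jansen" are made up for illustration.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("nl")
doc = nlp("Namens appellante heeft mr. Jansen hoger beroep ingesteld.")

# Find the token by its text rather than assuming a position,
# since tokenization of abbreviations like "mr." can vary
idx = [tok.text for tok in doc].index("Jansen")
doc.set_ents([Span(doc, idx, idx + 1, label="PERSON")])

ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

This makes the limitation visible: nothing here looked at context, the span was labelled purely because we said so.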



for example, 

patterns = [
    {"label": "PHONE_NUMBER", 
     "pattern": [
       {"ORTH": "("}, 
       {"SHAPE": "ddd"}, 
       {"ORTH": ")"}, 
       {"SHAPE": "ddd"},
       {"ORTH": "-", "OP": "?"},
       {"SHAPE": "dddd"}]}
    ]


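...which, provided the tokenizer splits the text into these tokens, matches:
(123)456-7890
(123)4567890
...and no other formats (though since SHAPE truncates runs after 4, `dddd` also matches longer digit sequences).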
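
To check the pattern logic without depending on how the tokenizer happens to split such strings, the sketch below feeds hand-made token lists to a plain `Matcher` using the same token pattern. The blank English pipeline and the three token lists are assumptions made for this illustration.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")
phone = Matcher(nlp.vocab)
phone.add("PHONE_NUMBER", [[
    {"ORTH": "("}, {"SHAPE": "ddd"}, {"ORTH": ")"},
    {"SHAPE": "ddd"}, {"ORTH": "-", "OP": "?"}, {"SHAPE": "dddd"},
]])

# Tokens are given by hand, so the result does not depend on the tokenizer
with_hyphen = Doc(nlp.vocab, words=["(", "123", ")", "456", "-", "7890"])
without_hyphen = Doc(nlp.vocab, words=["(", "123", ")", "456", "7890"])
too_short = Doc(nlp.vocab, words=["(", "123", ")", "456", "-", "789"])

for d in (with_hyphen, without_hyphen, too_short):
    print([tok.text for tok in d], "->", len(phone(d)), "match(es)")
```

On real text it is worth printing the tokens first, as the earlier cell does, because a string written without spaces may not be split into the pieces the pattern expects.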



You can match entirely-fixed names like:

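from spacy.lang.nl import Dutch

names = ["Jan Jansen", "Piet de Vries"]   # hypothetical names, for illustration only
patterns = []
for name in names:
    patterns.append( {
        'label':   'PERSON',
        'pattern': name,      # a plain string is matched as an exact phrase
    } )


nlp = Dutch()
ruler = nlp.add_pipe("entity_ruler")   # assuming spaCy v3; returns the EntityRuler component
ruler.add_patterns( patterns )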

