## Purpose of this notebook

Explain what collocation analysis does, what it may be good for as a tool, and what it is probably _not good enough_ for.

Collocation analysis is about finding words that appear next to each other more often than chance would predict.

There are many reasons to look for them, and each reason tends to filter for a specific type of collocation.
You may be looking for
- meaningful patterns that are baked into the language, like "in terms of", and nearly-institutionalized phrases like "eigen gebruik" (own use) or "echtgenoot of geregistreerde partner" (spouse or registered partner),
- answers to "what verbs do we use with this noun", e.g. we _make_ sense rather than _have_ sense.

What you will also get is _fragments_ of phrases, like "tijdstip zal" (point in time will), "verplichtingen uit" (obligations from), "heeft gedaan" (has done), which is part of why the raw output is not clean.

Also note that collocation analysis alone only suggests, and does not establish, that something is lexicalized or institutionalized,
and it says even less about whether its meaning is compositional (predictable from the parts) or _not_ predictable from the parts (e.g. idioms).
See also the wider term multi-word expression (MWE).

For this and other reasons, 'statistically more common together' turns out to be a good start for spotting phrases, though not quite enough for clean output.
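
To make "more often than chance" concrete, here is a minimal sketch that scores adjacent word pairs with pointwise mutual information (PMI). It is an illustration on a made-up toy corpus, not the scoring that the wetsuite code in this notebook uses; the names `toy_sents` and `pmi` are hypothetical.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration): each inner list is one tokenized sentence.
toy_sents = [
    "in terms of cost this makes sense".split(),
    "in terms of time this makes sense".split(),
    "the cost of this is high".split(),
    "the time of day is of no importance".split(),
]

unigrams = Counter(tok for sent in toy_sents for tok in sent)
bigrams = Counter(pair for sent in toy_sents for pair in zip(sent, sent[1:]))
n_tokens = sum(unigrams.values())
n_bigrams = sum(bigrams.values())

def pmi(pair):
    """log2 of (observed pair probability) / (probability expected if the words were independent)."""
    a, b = pair
    p_ab = bigrams[pair] / n_bigrams
    p_a = unigrams[a] / n_tokens
    p_b = unigrams[b] / n_tokens
    return math.log2(p_ab / (p_a * p_b))

# Show the five best-scoring adjacent pairs.
for pair in sorted(bigrams, key=pmi, reverse=True)[:5]:
    print("%6.2f  %s" % (pmi(pair), " ".join(pair)))
```

In this toy data "in terms" scores well above "of this", but a pair that occurs only once, such as "no importance", scores even higher. That is one of the reasons a minimum count is useful.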

TODO: add some reading

In [1]:
# For local installs you can install the package once.   In colab you get a disposable environment and will have to start with this install each time. 
!pip3 install wetsuite -U

In [1]:
import re
import random

import bs4  # BeautifulSoup is a handy way of scraping some text from HTML or XML

import wetsuite.helpers.koop_parse
import wetsuite.helpers.etree
import wetsuite.helpers.net
import wetsuite.helpers.strings
import wetsuite.helpers.collocation
import wetsuite.helpers.notebook
import wetsuite.helpers.spacy
import wetsuite.datasets

Fetch Burgerlijk Wetboek 7 from the KOOP repositories, in XML form.

At 60k words you may consider this fairly large, yet for this kind of analysis it is fairly small.

But we get the answer quickly, so we can play with the various parameters.

We want that in a demo because, as we will find out,
this whole concept is somewhat finicky:
there is no zone of parameters with the best results, because we never really specified
what kind of things we find more interesting than others.


In [42]:
bwb7_xml  = wetsuite.helpers.net.download( 'https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0005290/2008-03-26_0/xml/BWBR0005290_2008-03-26_0.xml' )
bwb7_soup = bs4.BeautifulSoup( bwb7_xml, features='xml' ) # bs4 makes it a little easier to pick things out of that XML.
sents     = []
for al in bwb7_soup.select('lid al'):
    sents.append(  ' '.join( al.find_all(string=True) )  )

In [86]:
# split each sentence into words - dumb version
tokenized_sents = []
for sent in sents:
    tokenized_sents.append( wetsuite.helpers.strings.simple_tokenize(sent) )

tokenized_sents[0]

['De',
 'koop',
 'van',
 'een',
 'tot',
 'bewoning',
 'bestemde',
 'onroerende',
 'zaak',
 'of',
 'bestanddeel',
 'daarvan',
 'wordt',
 'indien',
 'de',
 'koper',
 'een',
 'natuurlijk',
 'persoon',
 'is',
 'die',
 'niet',
 'handelt',
 'in',
 'de',
 'uitoefening',
 'van',
 'een',
 'beroep',
 'of',
 'bedrijf',
 'schriftelijk',
 'aangegaan']

Parameters include
 - ***n-gram lengths to include*** - you generally don't want very long n-grams, because the more text you process,
   the more entries there are with a count of just one (frankly a problem with almost any n-gram based analysis).
   We can cheat in the following example, again because there is not a lot of text.
 
 - what this code calls ***connectors*** - n-grams with one of these words at the _edge_ are not entered, but these words are accepted inside an n-gram.
   This is a makeshift measure to reduce how often we pick up a fragment from the middle of a somewhat longer sequence,
   by ignoring anything that starts or ends with these uninteresting words.
   (It also keeps down the unique n-gram count, and thereby memory use.)
 
 - ***minimum count*** - removes things that would score well just because the combination happens once or twice
 
 - ***scoring method*** on top of the basic counting statistics (mostly trying to get the numbers on a more reasonable, readable scale)
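
The first three parameters can be sketched in a few lines of plain Python. This is a simplified illustration of the behaviour described above, not the wetsuite implementation; `count_ngrams` and `toy_sents` are hypothetical names, and the toy sentences are made up.

```python
from collections import Counter

def count_ngrams(tokenized_sents, gramlens=(2, 3), connectors=(), mincount=1):
    """Count n-grams, skipping those that start or end with a connector word.

    Hypothetical helper for illustration only.
    """
    connectors = set(connectors)
    counts = Counter()
    for toks in tokenized_sents:
        for n in gramlens:
            for i in range(len(toks) - n + 1):
                ngram = tuple(toks[i:i + n])
                # connectors are allowed inside the n-gram, but not at either edge
                if ngram[0] in connectors or ngram[-1] in connectors:
                    continue
                counts[ngram] += 1
    # drop rare n-grams, which is what a minimum count does
    return Counter({ng: c for ng, c in counts.items() if c >= mincount})

toy_sents = [
    "uitoefening van een beroep of bedrijf".split(),
    "de uitoefening van een beroep of bedrijf".split(),
    "een beroep of bedrijf".split(),
]
counts = count_ngrams(
    toy_sents,
    gramlens=(2, 3, 4),
    connectors=["de", "een", "van", "of"],
    mincount=2,
)
for ngram, count in counts.most_common():
    print(count, " ".join(ngram))
```

Note that every 2-gram in these toy sentences has a connector at one edge, so only the longer "beroep of bedrijf" and "uitoefening van een beroep" survive.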

In [87]:
connectors = 'De de een het  dat die dit deze  van voor met in op bij om na   en of   is   aan  ook   je ik we hij    door kan zal dan als tot lid te heeft niet worden wordt waarin'.split()
gramlens   = (2,3,4,5,6,7,8)  # we can get away with larger n-grams because this is a fairly small text
mincount   = 10               # remove n-gram sequences that didn't occur very much, for cleaner results
showtop    = 500

# do the calculation
coll = wetsuite.helpers.collocation.Collocation( connectors=connectors )
for tokenized_sent in tokenized_sents: # feed in text
    coll.consume_tokens( tokenized_sent, gramlens=gramlens )
coll.cleanup_ngrams(mincount=mincount) 
print( "Scoring, showing top %d\n"%showtop)
scores = coll.score_ngrams()

# show the result
print( ' %9s   %55s    %12s %20s'%( 'score', 'n-gram', 'n-gram count', 'individual counts' ) )
for strtup, score,  tup_count, uni_counts in scores[-showtop:]:
    print( ' %9.3f   %55s    %12s %30s=%d'%(score, ' '.join(strtup),   tup_count, '*'.join(str(n) for n in uni_counts), wetsuite.helpers.collocation.product(uni_counts)) )

Scoring, showing top 500

     score                                                    n-gram    n-gram count    individual counts
     0.000                  handelt in de uitoefening van een beroep              14   17*1437*5847*57*3537*1759*59=2988591451947501327
     0.000                                 ten nadele van de huurder              10           273*57*3537*5847*188=60501132707652
     0.000                               ten nadele van de werknemer              17           273*57*3537*5847*243=78200932169997
     0.001                     uitoefening van een beroep of bedrijf              20        57*3537*1759*59*1157*60=1452487407525180
     0.001                                     nadele van de huurder              10               57*3537*5847*188=221615870724
     0.001                                       bedoeld in de leden              12              308*1437*5847*118=305367339816
     0.001        arbeidsovereenkomst of bij regeling door of namens           

## Similar, in gensim
Note that collocation analysis is simple enough that many NLP libraries offer something like it, sometimes with their own take on it.

See e.g. [this NLTK example](https://www.nltk.org/howto/collocations.html), 
or the code below for gensim (see also the [docs](https://radimrehurek.com/gensim/models/phrases.html)) 

In [None]:
# gensim separates building a model from applying it.
# Note that the code below uses the same input both to build the model and to then look for phrases in.

from gensim.models.phrases import Phrases

phrase_model = Phrases(tokenized_sents, min_count=3, threshold=0.4, connector_words=connectors, scoring='npmi', delimiter=' ')

for phrase, score in sorted( phrase_model.find_phrases( tokenized_sents ).items(), key=lambda x:x[1], reverse=True)[:500]: 
    firstword,rest = phrase.split(' ',1) # (gensim deals with varied-n-grams with a simpler interface)
    print('%5.2f  %20s %-20s'%(score,firstword,rest))

 1.00             gedateerd ontvangstbewijs     
 1.00                  PbEG L                   
 1.00               geldsom leent               
 1.00                  2002 65                  
 1.00            wederzijds goedvinden          
 1.00              uiterste wilsbeschikking     
 1.00               Wetboek van Burgerlijke     
 1.00                  Onze Minister            
 1.00                 tafel en bed              
 1.00                    II van de Huisvestingswet
 1.00           uitvoerbaar bij voorraad        
 1.00           Burgerlijke Rechtsvordering     
 1.00                 meest gerede              
 1.00                natuur en landschap        
 1.00                   332 333                 
 1.00  Uitvoeringsinstituut werknemersverzekeringen
 1.00             structuur uitvoeringsorganisatie
 1.00                mannen en vrouwen          
 1.00               sociaal fiscaalnummer       
 1.00         verschillende werkgevers          
 1.00        

Something we probably _should_ touch on in more detail is that the scoring is central to what is even lifted out.

Gensim works differently from our code (its approach is based on a little more consideration), so it gives different results.
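
As a rough illustration of how much the scoring choice matters, the sketch below applies plain PMI and normalized PMI (NPMI, the measure selected with `scoring='npmi'` in the gensim cell above) to the same counts. The functions are written out by hand from the textbook formulas and are not taken from either library; all counts are made up.

```python
import math

def pmi(count_a, count_b, count_ab, total):
    """Pointwise mutual information in bits; unbounded, and tends to favour rare pairs."""
    p_a, p_b, p_ab = count_a / total, count_b / total, count_ab / total
    return math.log2(p_ab / (p_a * p_b))

def npmi(count_a, count_b, count_ab, total):
    """Normalized PMI: rescaled to -1..1, where 1 means the words only ever occur together."""
    p_a, p_b, p_ab = count_a / total, count_b / total, count_ab / total
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)

total = 100_000  # hypothetical corpus size in tokens
# (label, count of first word, count of second word, count of the pair), all made-up numbers
candidates = [
    ("always together, frequent", 50, 50, 50),
    ("almost always together, rare", 10, 10, 9),
    ("two common function words", 3000, 5000, 900),
]
for label, a, b, ab in candidates:
    print("%-30s  PMI %6.2f   NPMI %5.2f" % (label, pmi(a, b, ab, total), npmi(a, b, ab, total)))
```

With these numbers PMI puts the rare pair first, while NPMI puts the pair that is always seen together first, at a score of 1. That would also explain why a list sorted by NPMI starts with a long run of 1.00 entries.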

## Now with more text

In [9]:
# Get a bunch more text

# while we're at it, get a spacy model to split sentences, 
# which should reduce the amount of across-sentence-boundary nonsense
# probably doesn't win a lot, but it's simple enough to do

paragraphs  = []
for bwbid, text in wetsuite.datasets.load('bwb-mostrecent-text').data.random_sample(12500):
    paragraphs.extend( re.split(r'[\s]{2,}', text) )


coll = wetsuite.helpers.collocation.Collocation( connectors=connectors )

for paragraph in wetsuite.helpers.notebook.ProgressBar(paragraphs, description='splitting sentences'):
    if len(paragraph) >= 1000000: # spacy refuses texts over its length limit (unless you raise that limit), for memory reasons
        continue
    # Yes, we could do a full spacy parse, but it would be slower.  sentence_split is  faster and good enough for now.
    sents = wetsuite.helpers.spacy.sentence_split( paragraph, as_plain_sents=True )
    for sent in sents:
        toks = wetsuite.helpers.strings.simple_tokenize(sent)
        coll.consume_tokens( toks, gramlens=(2,3,4,5) )

splitting sentences:   0%|          | 0/293188 [00:00<?, ?it/s]

In [10]:
print( "Cleanup")
print( '    before:', coll.counts() )
coll.cleanup_ngrams(mincount=8)
print( '     after:', coll.counts() )

Cleanup
    before: {'from_tokens': 19810501, 'unigrams': 167786, 'ngrams': 10053537}
     after: {'from_tokens': 19810501, 'unigrams': 167786, 'ngrams': 369461}


In [43]:
top = 2000
print( "Scoring, showing top %d\n"%top)
scores = coll.score_ngrams( )
print( ' %9s   %55s    %12s %20s'%('score', 'n-gram', 'n-gram count', 'individual counts') )
for strtup, score,  tup_count, uni_counts in scores[-top:]:
    print( ' %9.3f   %55s    %12s %20s=%d'%(score, ' '.join(strtup),   tup_count, '*'.join(str(n) for n in uni_counts), wetsuite.helpers.collocation.product(uni_counts)) )

Scoring, showing top 2000

     score                                                    n-gram    n-gram count    individual counts
   171.261                polycyclische aromatische koolwaterstoffen              49            51*71*166=601086
   171.304                    preconcurrentieel ontwikkelingsproject              29                62*97=6014
   171.500                                  dynamisch aankoopsysteem              14                56*25=1400
   171.709                     Helicopter landing officer Genormeerd              10          17*68*42*18=873936
   171.721                                 basisregistratie kadaster             169              705*289=203745
   171.742                                      Beroepsopleiding VAM              16                22*83=1826
   171.762                                                   St Goar              12                79*13=1027
   171.818                    arbeidsondersteuning jonggehandicapten              36

## Filtering more?

We could try to use linguistic parsing to filter what even goes into n-grams.

For example, we could consider the part of speech of each word.  
The parsing is going to be slower, but let's see if that's potentially worth it.

Since the collocation class only really cares about counts, not about what the tokens it counts look like,
we can simply hand it "word/POS" strings as if they were the words, and it will work.
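
A minimal sketch of that trick, using a plain `Counter` instead of the collocation class and hand-assigned tags instead of a real tagger (both are assumptions made to keep the example self-contained; `join_pos` and `strip_pos` are hypothetical helper names):

```python
from collections import Counter

# Hand-tagged toy input (tags chosen by hand for illustration, not produced by a tagger).
tagged_sents = [
    [("de", "DET"), ("Tweede", "PROPN"), ("Kamer", "PROPN"), ("vergadert", "VERB")],
    [("de", "DET"), ("Tweede", "PROPN"), ("Kamer", "PROPN"), ("stemt", "VERB")],
    [("een", "DET"), ("tweede", "ADJ"), ("kamer", "NOUN"), ("huren", "VERB")],
]

def join_pos(word, pos):
    """Glue the tag onto the word, so that a counter sees them as a single token."""
    return word + "/" + pos

def strip_pos(token):
    """Remove the tag again; splits on the LAST slash so words containing '/' survive."""
    return token.rsplit("/", 1)[0]

bigram_counts = Counter()
for sent in tagged_sents:
    toks = [join_pos(word, pos) for word, pos in sent]
    bigram_counts.update(zip(toks, toks[1:]))

for (a, b), count in bigram_counts.most_common(3):
    print(count, a, b, "->", strip_pos(a), strip_pos(b))
```

A side effect is that the same word with a different tag is counted as a different token, which splits counts but also lets you filter on the tags afterwards.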

In [None]:
# A fancier tokenizer-and-tagger (e.g. spacy's) is now necessary.
#   Note that this is slower than the simpler splitting we did above.
import spacy

nlp = spacy.load("nl_core_news_lg") # _md or even _sm might be preferable for faster testing

def nl_tokenize(nl_sent, with_pos=True):
    " Tokenize a sentence with spacy. Returns 'text/POS' strings if with_pos is True, otherwise just the token text. "
    doc = nlp( nl_sent.rstrip() ) # this is the slow bit
    if with_pos:
        return [tok.text+'/'+tok.pos_ for tok in doc]
    else:
        return [tok.text              for tok in doc]

In [None]:
# different example text: kamervragen
kv   = wetsuite.datasets.load('tweedekamer-kamervragen-struc')

coll = wetsuite.helpers.collocation.Collocation( ) 
# note that we are not using connectors anymore, we will be imitating that, hopefully more precisely

for kv_id, kv_details in wetsuite.helpers.notebook.ProgressBar( kv.data.random_sample(8000) ):
    vraagdata = kv_details['vraagdata']

    for number in vraagdata:
        try:
            vraag   , _ = vraagdata[number]['vraag']
            antwoord, _ = vraagdata[number]['antwoord']
        except KeyError: # TODO: fix
            continue

        for text in vraag, antwoord:
            for paragraph in re.split(r'[\s]{2,}', text):
                for sentence in wetsuite.helpers.spacy.sentence_split( paragraph, as_plain_sents=True ):
                    coll.consume_tokens(  nl_tokenize( sentence, with_pos=True ), gramlens=(2,3,4,5)  )

In [170]:
print( coll.counts() )
coll.cleanup_ngrams(mincount=8)
print( coll.counts() )

{'from_tokens': 4028753, 'unigrams': 110032, 'ngrams': 8627273}
{'from_tokens': 4028753, 'unigrams': 110032, 'ngrams': 117034}


If you wanted only specific shapes, e.g.
ADJECTIVE ADJECTIVE NOUN or DET PRON or ADVERB VERB, you would remove a lot of interesting things
and never know it, especially since there are _many_ possible sequences of POS tags that are reasonable even for very simple phrases, e.g. 

        opsporing/NOUN en/CCONJ vervolging/NOUN
        Inspectie/NOUN Gezondheidszorg/PROPN en/CCONJ Jeugd/NOUN
        het/DET Landelijk/ADJ Register/NOUN
        de/DET Tweede/PROPN Kamer/PROPN
and maybe you even care about:

        Hebt/AUX u/PRON kennisgenomen/VERB


But, say, we are more interested in things that start with DET (e.g. "the" as part of a name) 
than in things that end with DET (when we care about phrases, we can declare those incomplete).

Similarly, we probably don't care about things that _end_ in ADP or PRON.

Nor about anything containing a PUNCT.



In [171]:
def bad_ngram(ngram):
    flat = ' '.join(ngram)
    if ngram[-1].endswith('/ADP'):
        return True
    if ngram[-1].endswith('/ADV'):
        return True
    if ngram[-1].endswith('/DET'):
        return True
    if ngram[-1].endswith('/AUX'): # but not VERB
        return True
    if ngram[-1].endswith('/CCONJ'):
        return True
    if ngram[-1].endswith('/SCONJ'):
        return True
    if ngram[-1].endswith('/PRON'):
        return True
    if '/PUNCT' in flat:
    #if ngram[-1].endswith('/PUNCT'):
        return True
    return False

coll.cleanup_ngrams_func( bad_ngram )

print( coll.counts() )

{'from_tokens': 4028753, 'unigrams': 110032, 'ngrams': 45855}


In [172]:
top = 2000
print( "Scoring, showing top %d\n"%top)
scores = coll.score_ngrams( )
print( ' %9s   %55s    %12s %20s'%('score', 'n-gram', 'n-gram count', 'individual counts') )
for strtup, score,  tup_count, uni_counts in scores[-top:]:
    print( ' %9.3f   %55s    %12s %20s=%d'%(score, ' '.join(strtup),   tup_count, '*'.join(str(n) for n in uni_counts), wetsuite.helpers.collocation.product(uni_counts)) )
    # or, to remove the POS in this summary:
    #print( ' %9.3f   %55s    %12s %20s=%d'%(score, ' '.join(s.rsplit('/',1)[0] for s in strtup),   tup_count, '*'.join(str(n) for n in uni_counts), wetsuite.helpers.collocation.product(uni_counts)) )


Scoring, showing top 2000

     score                                                    n-gram    n-gram count    individual counts
     7.956                              toenmalige/ADJ Minister/NOUN              46             207*1574=325818
     7.962                           Ondernemend/ADJ Nederland/PROPN              53              53*8154=432162
     7.966                            humanitaire/ADJ principes/NOUN              10               233*66=15378
     7.993                             ontbranding/NOUN brengen/VERB               8               8*1226=9808
     8.000                             biologische/ADJ landbouw/NOUN              12              126*175=22050
     8.023                                       januari/PROPN jl./X             127            1408*1749=2462592
     8.054                          agrarische/ADJ kinderopvang/NOUN              18              140*352=49280
     8.058                                  lagere/ADJ inkomens/NOUN            

This seems viable enough.

