<a href="https://colab.research.google.com/github/WetSuiteLeiden/example-notebooks/blob/main/specific-little-experiments/abbreviations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# (only) in colab, run this first to install wetsuite.
# For your own setup, see wetsuite's install guidelines.
!pip3 install -U wetsuite

# Purpose of this notebook

Try to extract acronyms from text.

This is partly meant as a gentle introduction.
Extracting abbreviations is something of a classic in introductory NLP courses: you can do it from scratch, see some immediate results, and get an idea of the limitations.


But it may also be genuinely useful to get an idea of which abbreviations are common in the legal domain and in some of its subdomains,
as well as terms that come up in commonly discussed topics, not just legal terminology.

Laws in particular might be interesting, though they are _not_ likely to be a very clean source of those.

It _might_ also be a good source of entity names for later training -- after some classification, anyway.


The code that currently backs it is roughly the simple kind that you might find in such a course -- there is a `wetsuite.helpers.patterns.abbrev_find()`
that mainly just looks for text like "Word Combination (WC)" and a few basic variants of that idea. 
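
To make the "Word Combination (WC)" idea concrete, here is a minimal from-scratch sketch of that kind of matcher. It is _not_ wetsuite's implementation: the function name `naive_abbrev_find` and the length limits in the regular expression are assumptions made up for this illustration.

```python
import re

def naive_abbrev_find(text):
    """Find 'Word Combination (WC)' style text; returns a list of (abbreviation, words) pairs.
    Illustration only, not what wetsuite does internally."""
    found = []
    # candidate abbreviation: 2 to 6 letters in brackets, starting with a capital (arbitrary limits)
    for match in re.finditer(r'\(([A-Z][A-Za-z]{1,5})\)', text):
        abbrev = match.group(1)
        # take as many words before the bracket as the abbreviation has letters
        words_before = text[:match.start()].split()
        candidate = words_before[-len(abbrev):]
        if len(candidate) != len(abbrev):
            continue  # not enough words before the bracket
        # accept only if the initials spell out the abbreviation (ignoring case)
        initials = ''.join(word[0] for word in candidate)
        if initials.lower() == abbrev.lower():
            found.append((abbrev, candidate))
    return found

print(naive_abbrev_find('Autoriteit Financiële Markten (AFM)'))
print(naive_abbrev_find('Autoriteit Consument en Markt (ACM)'))  # the 'en' breaks the one-letter-per-word assumption
```

The one-letter-per-word assumption is exactly the sort of limitation such an exercise is meant to show.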

### Basic tests on unreasonably clean wordlist data

That is, take existing lists of abbreviations. 
- Anything we catch can go towards our output
- Anything we do not catch is a pointer towards improvement.
  - for example, the output below already suggests we could ignore-and-consume 'en', 'van', 'voor',
  - and, more interestingly, that we could be looking within compounds
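
The ignore-and-consume idea could look roughly like the sketch below. It walks backwards from the bracket, matches initials, and steps over function words that do not match. The function name and the `SKIPPABLE` list are hypothetical (the list extends 'en', 'van', 'voor' with a few guesses), and this is not how wetsuite handles it.

```python
# hypothetical list: 'en', 'van', 'voor' come from the misses, the rest are guesses
SKIPPABLE = {'en', 'van', 'voor', 'de', 'het', 'ter', 'op'}

def match_skipping_function_words(abbrev, words_before):
    """Match abbreviation letters against word initials, right to left.
    Function words that do not match a letter are consumed without using up a letter.
    Returns the matched words (function words included), or None if there is no match."""
    letters = list(abbrev.lower())
    matched = []
    idx = len(words_before) - 1
    while letters and idx >= 0:
        word = words_before[idx]
        if word[0].lower() == letters[-1]:
            letters.pop()         # this word explains the last unexplained letter
            matched.append(word)
        elif word.lower() in SKIPPABLE:
            matched.append(word)  # consume the function word, keep looking for the same letter
        else:
            return None           # a content word that does not fit
        idx -= 1
    if letters:
        return None               # ran out of words before explaining every letter
    return matched[::-1]

print(match_skipping_function_words('ACM', 'Autoriteit Consument en Markt'.split()))
print(match_skipping_function_words('CBS', 'Centraal Bureau voor de Statistiek'.split()))
print(match_skipping_function_words('ACVG', 'Adviescollege Veiligheid Groningen'.split()))  # compound: still a miss
```

Checking for a match before checking the skip list means 'Blik op Werk (BoW)' still uses 'op' for its 'o'. Looking within compounds would need a different approach, for example a dictionary-based compound splitter.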

In [2]:
import re, pprint, random

import bs4

import wetsuite.helpers.net
import wetsuite.helpers.koop_parse
import wetsuite.helpers.patterns
import wetsuite.helpers.etree
import wetsuite.datasets

In [5]:
# Unreasonably clean data,  that also contains some less usual cases so we can report what we might eventually want to deal with
html = wetsuite.helpers.net.download('https://organisaties.overheid.nl/Zelfstandige_bestuursorganen/') 

soup = bs4.BeautifulSoup( html ) # parse the webpage into something we can query
for link in soup.select('.content .list--linked li a'):   # some scraping magic we might explain elsewhere 
    # we are interested in the link text:
    found = False
    for ab, words in wetsuite.helpers.patterns.abbrev_find( link.text ):
        print( 'FOUND  %s = %s'%( ab, words ) )
        found = True

    # Things we didn't find - more creative things that we _might_ want to consider
    if '(' in link.text and not found:    # (assuming bracket indicates there is an explained abbreviation in that link text)
        print( "MISS  ", link.text )

MISS   Aangewezen/aangemelde instanties (dezelfde) ex art. 1a.5.1 Vuurwerkbesluit
MISS   Airport Coordination Netherlands (ACNL)
MISS   Autoriteit Consument en Markt (ACM)
FOUND  AFM = ['Autoriteit', 'Financiële', 'Markten']
MISS   Autoriteit Nucleaire Veiligheid en Stralingsbescherming (ANVS)
MISS   Autoriteit online Terroristisch en Kinderpornografisch Materiaal (ATKM)
FOUND  AP = ['Autoriteit', 'Persoonsgegevens']
FOUND  BoW = ['Blik', 'op', 'Werk']
FOUND  BA = ['Bureau', 'Architectenregister']
FOUND  BBL = ['Bureau', 'Beheer', 'Landbouwgronden']
FOUND  BFT = ['Bureau', 'Financieel', 'Toezicht']
MISS   CBG (College ter Beoordeling van Geneesmiddelen) (CBG)
FOUND  CBR = ['Centraal', 'Bureau', 'Rijvaardigheidsbewijzen']
MISS   Centraal Bureau voor de Statistiek (CBS)
MISS   Centraal Orgaan opvang asielzoekers (COA)
FOUND  CCD = ['Centrale', 'Commissie', 'Dierproeven']
FOUND  CCMO = ['Centrale', 'Commissie', 'Mensgebonden', 'Onderzoek']
FOUND  CIZ = ['Centrum', 'Indicatiestelling', 'Zo

In [8]:
# Similar idea, different site
html = wetsuite.helpers.net.download('https://web.archive.org/web/20230907212736/https://publications.europa.eu/code/nl/nl-5000400.htm') 

soup = bs4.BeautifulSoup( html ) # parse the webpage into something we can query
for tr in soup.select('table.definitionsTable tr'):  
    tds = tr.findAll('td')

    # TODO: deal with the way it mentions multiple definitions
    text = '%s (%s)'%(tds[0].text.strip(), tds[1].text.strip())  # pretend we don't know this is good data and just put it next to each other

    found = False
    for ab, words in wetsuite.helpers.patterns.abbrev_find( text ):
        print( 'FOUND  %s = %s'%( ab, words ) )
        found = True

    # Things we didn't find - more creative things that we _might_ want to consider
    if '(' in text and not found:    # (assuming bracket indicates there is an explained abbreviation in that text)
        print( "MISS  ", text )

MISS   ABH (Agentschap voor Buitenlandse Handel (voorheen BDBH (Belgische Dienst voor Buitenlandse Handel)))
MISS   ABVV (Algemeen Belgisch Vakverbond)
MISS   ACS (staten in Afrika, het Caribisch gebied en de Stille Oceaan)
FOUND  ACV = ['Algemeen', 'Christelijk', 'Vakverbond']
MISS   ADB (1.
Afrikaanse Ontwikkelingsbank
(African Development Bank)
2.
Arabische Ontwikkelingsbank
(Arab Development Bank)
3.
Aziatische Ontwikkelingsbank
(Asian Development Bank))
MISS   ADN (Europese Overeenkomst betreffende het internationale vervoer van gevaarlijke goederen over de binnenwateren)
MISS   ADR (Europese Overeenkomst betreffende het internationale vervoer van gevaarlijke goederen over de weg)
MISS   Afnor (Frans Normalisatie-instituut
(Association française de normalisation))
MISS   ALO (algemene leningsovereenkomst)
MISS   Altener II (meerjarenprogramma ter bevordering van hernieuwbare energiebronnen in de Gemeenschap)
MISS   AKE (Agentschap voor Kernenergie (OESO))
FOUND  ANP = ['Algemeen',

In [9]:
# Similar idea, different site
html = wetsuite.helpers.net.download( 'https://www.rijksfinancien.nl/memorie-van-toelichting/2021/OWB/XIII/onderdeel/644956' ) 

soup = bs4.BeautifulSoup( html ) # parse the webpage into something we can query
for tr in soup.select('.kio2 tr'):  

    tds = tr.findAll('td')

    text = '%s (%s)'%(tds[0].text.strip(), tds[1].text.strip())  # pretend we don't know this is good data and just put it next to each other

    found = False
    for ab, words in wetsuite.helpers.patterns.abbrev_find( text ):
        print( 'FOUND  %s = %s'%( ab, words ) )
        found = True

    # Things we didn't find - more creative things that we _might_ want to consider
    if '(' in text and not found:    # (assuming bracket indicates there is an explained abbreviation in that text)
        print( "MISS  ", text )

MISS    ()
MISS   ACM (Autoriteit Consument en Markt)
FOUND  ACT = ['Accelerating', 'CCS', 'Technologies']
MISS   ACVG (Adviescollege Veiligheid Groningen)
FOUND  ANBI = ['Algemeen', 'nut', 'beogende', 'instellingen']
FOUND  AT = ['Agentschap', 'Telecom']
FOUND  ATR = ['Adviescollege', 'toetsing', 'regeldruk']
MISS   AWTI (Adviesraad voor Wetenschap, Technologie en Innovatie)
MISS   BBE (Biobased Economy)
FOUND  BBP = ['Bruto', 'Binnenlands', 'Product']
MISS   BES (Bonaire, Sint Eustatius, Saba)
MISS   BIS (Basisinfrastructuur voor cultuur)
MISS   BIPM (Bureau International des Poids en Mesures)
MISS   BMKB (Borgstellingsregeling Midden en Kleinbedrijf)
FOUND  BNP = ['Bruto', 'Nationaal', 'Product']
FOUND  BOM = ['Brabantse', 'Ontwikkelings', 'Maatschappij']
MISS   BPM (Belasting van personenauto's en motorrijwielen)
MISS   BTW (Belasting over de toegevoegde waarde)
MISS   BZ (Ministerie van Buitenlandse Zaken)
MISS   BZK (Ministerie van Binnenlandse Zaken en Koninkrijksrelaties)
MISS 

In [10]:
# Similar idea, different site
html = wetsuite.helpers.net.download('https://juridisch-woordenboek.nl/afkortingen') 

soup = bs4.BeautifulSoup( html ) # parse the webpage into something we can query
for tr in soup.select('table#afkortingen tbody tr'):
    #print(tr)
    tds = tr.findAll('td')

    text = '%s (%s)'%(tds[0].text.strip(), tds[1].text.strip())  # pretend we don't know this is good data and just put it next to each other

    found = False
    for ab, words in wetsuite.helpers.patterns.abbrev_find( text ):
        print( 'FOUND  %s = %s'%( ab, words ) )
        found = True

    # Things we didn't find - more creative things that we _might_ want to consider
    if '(' in text and not found:    # (assuming bracket indicates there is an explained abbreviation in that text)
        print( "MISS  ", text )


FOUND  AA = ['Ars', 'Aequi']
FOUND  AA = ['Accountant', 'Administratieconsulent']
FOUND  AA = ['Advertising', 'Association']
MISS   a.a. (ad acta, bij de akten (wegleggen))
FOUND  AAA = ['American', 'Arbitration', 'Association']
MISS   AAC (Advies- en Arbitragecommissie)
MISS   AAf (Algemeen Arbeidsongeschiktheidsfonds)
MISS   AAR (Algemeen ambtenarenreglement)
MISS   AAR (Algemene Aanwijzingen voor de Rijksdienst)
FOUND  AAV = ['Algemene', 'administratieve', 'voorschriften']
MISS   AAW (Algemene Arbeidsongeschiktheidswet)
FOUND  AB = ['Administratiefrechterlijke', 'Beslissingen']
MISS   AB (Administratieve en Rechterlijke Beslissingen)
MISS   AB (Nederlandse Jurisprudentie Administratiefrechtelijke Beslissingen (sinds 1971))
MISS   AB (Wet Algemene Bepalingen)
FOUND  ABA = ['American', 'Bar', 'Association']
MISS   ABAR (Algemene bepalingen van administratief recht)
MISS   abbb (algemene beginselen van behoorlijk bestuur)
FOUND  ABP = ['Algemeen', 'Burgerlijk', 'Pensioenfonds']
MISS   

In [7]:
# Similar idea, different site
html = wetsuite.helpers.net.download('https://www.eur.nl/esl/campus/sanders-law-library/juridische-afkortingen') 

soup = bs4.BeautifulSoup( html ) # parse the webpage into something we can query
for tr in soup.select('div.accordion table tr'):
    tds = tr.findAll('td')
    if len(tds) != 2:
        print("SKIP %s"%tr)
    else:
        text = '%s (%s)'%(tds[1].text.strip(), tds[0].text.strip())  # pretend we don't know this is good data and just put it next to each other

        found = False
        for ab, words in wetsuite.helpers.patterns.abbrev_find( text ):
            print( 'FOUND  %s = %s'%( ab, words ) )
            found = True

        # Things we didn't find - more creative things that we _might_ want to consider
        if '(' in text and not found:    # (assuming bracket indicates there is an explained abbreviation in that text)
            print( "MISS  ", text )


SKIP <tr><th>Afkorting</th><th>Betekenis</th></tr>
MISS   anno, in het jaar (a°)
MISS   Algemene bepalingen (A)
MISS   Antwoord der regering naar aanleiding van het verslag (A)
MISS   Arbeid; afzonderlijk verschenen van 1946-1953 (A)
MISS   Atlantic Reporter second series (A.2d.)
MISS   Accountancy en Bedrijfskunde (A&B)
MISS   Aansprakelijkheid en Verzekering (A&V)
MISS   Ars Aequi. Juridisch studentenblad (AA of A.A. of AAe)
MISS   Accountant-Administratieconsulent (AAC)
FOUND  AA = ['Advertising', 'Association']
MISS   ad acta, bij de akten (wegleggen) (a.a)
FOUND  AAA = ['American', 'Arbitration', 'Association']
MISS   Algemeen aanduidingenbesluit (AAB)
MISS   Algemene aannemingsvoorwaarden voor bedrijfsgebouwen in de landbouw (AABL)
MISS   Advies- en Arbitragecommissie (AAC)
MISS   Ars Aequi. Juridisch studentenblad (A Ae)
MISS   Algemeen arbeidsongeschiktheidsfonds (AAF of Aaf)
MISS   Adem-alcoholgehalte (AAG)
MISS   Ars Aequi jurisprudentiebundel (AA-Jur)
MISS   Algemene aannemi

### Run on a bunch of free-form document text

And let's try to make the results cleaner
by only reporting explanations that appear in multiple documents,
and counting how often each appears.
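
As a rough picture of what "appears in multiple documents" can mean, here is a small sketch that counts each explanation at most once per document. It is only one way to read that idea: the function name is hypothetical, the lowercasing is a simplification chosen for this example, and it does not claim to reproduce what `abbrev_count_results` does.

```python
from collections import Counter, defaultdict

def count_docs_per_explanation(per_doc_results):
    """per_doc_results: a list with, per document, a list of (abbreviation, words) pairs.
    Returns {abbreviation: Counter({explanation_words_tuple: number_of_documents})}."""
    counts = defaultdict(Counter)
    for doc_results in per_doc_results:
        # a set, so that a document repeating the same explanation still counts once
        seen = {(abbrev, tuple(word.lower() for word in words)) for abbrev, words in doc_results}
        for abbrev, words in seen:
            counts[abbrev][words] += 1
    return counts

# made-up example input, three 'documents'
docs = [
    [('AFM', ['Autoriteit', 'Financiële', 'Markten']), ('AFM', ['Autoriteit', 'Financiële', 'Markten'])],
    [('AFM', ['autoriteit', 'financiële', 'markten'])],
    [('AB', ['algemeen', 'bestuur'])],
]
counts = count_docs_per_explanation(docs)

# keep only explanations seen in two or more documents
frequent = [(abbrev, ' '.join(words), n)
            for abbrev, by_words in counts.items()
            for words, n in by_words.items()
            if n >= 2]
print(frequent)
```

Filtering on document count rather than raw occurrence count keeps one long document that repeats an unusual explanation many times from dominating the report.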

In [19]:
cvdr_per_doc_results = [] # a list of what abbrev_find returns, per document

for _, text in wetsuite.datasets.load('cvdr-mostrecent-text').data.random_sample(50000): # random_sample(smallnumber) for smaller/faster feedback, or .items() for everything
    results = wetsuite.helpers.patterns.abbrev_find(text)
    if len(results) > 0:
        cvdr_per_doc_results.append( results )

In [11]:
### clean the above - report only things that were explained the same way in two or more documents

# make functions for this, because we'll be using them twice
def count_and_filter( per_doc_results, min_doc_occur=2 ):
    report_these = []
    for abbrev, words_count in wetsuite.helpers.patterns.abbrev_count_results( per_doc_results, remove_dots=True, case_insensitive_explanations=True ).items():
        for words, count in words_count.items():
            if count >= min_doc_occur:   # the point of that structure:  being able to ignore rarer explanations
                report_these.append( (abbrev, count, ' '.join(words) ) )
    return report_these

def print_filtered( report_these ):
    ## Print what we now have
    #report_these.sort(key=lambda tup: -tup[1]) # sort by count descending
    report_these.sort(key=lambda tup: (tup[0], -tup[1])) # sort/group by abbreviation alphabetically, then by count descending
    for abbrev, count, expl in report_these:
        print( '%10s   %3d:   %s'%( abbrev, count, expl ) )

In [12]:
explanations = count_and_filter( cvdr_per_doc_results )

print_filtered( explanations )

        AB     5:   algemeen bestuur
       ABP     6:   Algemene Burgerlijke Pensioenwet
       ABZ     3:   Algemeen Bestuurlijke Zaken
      ABdK     2:   Actief Bodembeheer de Kempen
       ACM     2:   Autoriteit Consument Markt
       ADL    31:   algemene dagelijkse levensverrichtingen
       ADL     5:   algemeen dagelijkse levensverrichtingen
       ADL     2:   algemene dagelijkse levensbehoeften
       ADL     2:   activiteiten dagelijks leven
       ADL     2:   Algemeen dagelijkse Levensbehoeften
       AFM    21:   Autoriteit Financiële Markten
       AHN     4:   Actueel Hoogtebestand Nederland
       AIM     3:   Activiteitenbesluit Internet Module
       AIV     7:   Advies instructie voorlichting
       ALS     2:   advanced life support
      AMvB    20:   Algemene Maatregel van Bestuur
      ANBI    13:   Algemeen Nut Beogende Instellingen
      ANBI     6:   algemeen nut beogende instelling
       AOV     2:   ambtenaar openbare veiligheid
       AOV     2:   aanvu

In [20]:
# Similar idea, but from BWB
bwb_per_doc_results = []

for _, text in wetsuite.datasets.load('bwb-mostrecent-text').data.random_sample(50000):
    results = wetsuite.helpers.patterns.abbrev_find(text)
    if len(results) > 0:
        bwb_per_doc_results.append( results )    

In [21]:
explanations = count_and_filter( cvdr_per_doc_results + bwb_per_doc_results ) # combine with the previous results

print_filtered( explanations )

        44     4:   48 44
        55     2:   58 55
        AA     7:   accountant administratieconsulent
        AB    28:   Algemeen Bestuur
        AB     2:   Activerende Begeleiding
        AB     2:   Archeologische Begeleiding
       ABN     2:   Algemeen Beschaafd Nederlands
       ABP    17:   Algemene Burgerlijke Pensioenwet
       ABP     2:   Algemeen Burgerlijk Pensioenfonds
       ABS     2:   Afval Breng Station
       ABU     4:   Algemene Bond Uitzendondernemingen
       ABZ     7:   Administratiebesluit Bijzondere Ziektekostenverzekering
       ABZ     3:   Algemeen Bestuurlijke Zaken
      ABdK     3:   Actief Bodembeheer de Kempen
       ACE     2:   Aanvullend Convenant Erfpacht
       ACM     5:   Autoriteit Consument Markt
      ACOS     4:   Assistant Chief of Staff
       ACS     2:   Ambassadeur Culturele Samenwerking
       ACT     4:   Accelarating CCS Technologies
       ACT     3:   Advance Corporation Tax
       ADL    85:   algemene dagelijkse levensverr

### Experiment: look for 'hierna:' / 'hierna te noemen:'
This is another clean-ish source that should also catch more creative cases.
The catch is that there are also a lot of things like `hierna: de Wet`, `hierna: het college`, and other such shortenings that are not abbreviations.
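
One way to separate the abbreviation-like short forms from the `hierna: de Wet` kind is a small heuristic on the first word. The sketch below is an assumption-laden illustration, not part of wetsuite: the function name, the article list, and the "needs a capital" rule are all made up here.

```python
import re

ARTICLES = {'de', 'het', 'een'}  # assumption: a leading article signals a shortened description

def looks_like_abbreviation(short_form):
    """short_form: the bracketed text, e.g. 'hierna: Awb' or 'hierna te noemen: AVG'."""
    # strip the 'hierna' / 'hierna te noemen' lead-in and the optional colon
    name = re.sub(r'^hierna(?: te noemen)?:?\s*', '', short_form).strip()
    if not name:
        return False
    first_word = name.split()[0]
    if first_word.lower() in ARTICLES:
        return False  # 'de Afdeling', 'het beleidskader'
    # require at least one capital in the first word, as in 'APV', 'Awb', 'AsV'
    return any(ch.isupper() for ch in first_word)

for example in ['hierna: de Afdeling', 'hierna: ABRvS', 'hierna te noemen: AVG', 'hierna Awb']:
    print(example, '->', looks_like_abbreviation(example))
```

This will both over- and under-select (a capitalised ordinary word passes, for instance), so it is better used for sorting results into two piles for inspection than for discarding anything.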

In [24]:

def text_nearby_all( needle, haystack, chars_before=40, chars_after=40 ):
    ret = []
    for mob in re.finditer(needle, haystack):
        st, en = mob.start(0), mob.end(0)
        # max(0, ...) avoids a negative start index, which would slice from the end of the string
        ret.append( (haystack[max(0, st-chars_before):st],  haystack[st:en].upper(),  haystack[en:en+chars_after]  ) )
    return ret


results = []
def find_hierna(text):
    text_res = []
    if 'hierna' in text:
        #for before, match, after in text_nearby_all('hierna', text):
        #    print( 'MATCH ...%s[%s]%s...'%(before, match.upper(), after) )
        #continue
        for match in re.findall( r'(?: de | het )([^.,\(]+)[\(](hierna[.]*?:? [^\)]+)[\)]', text ): # the [^.,\(] is a quick and dirty way to stay within one sentence/phrase
            long, short = match
            long = long.strip()
            text_res.append( [long, [short]] )
    if len(text_res) > 0:
        results.append( text_res )


for _, text in wetsuite.datasets.load('cvdr-mostrecent-text').data.random_sample(50000): # random_sample(smallnumber) for smaller/faster feedback, or .items() for everything
    find_hierna(text)

for _, text in wetsuite.datasets.load('bwb-mostrecent-text').data.random_sample(50000):
    find_hierna(text)

In [27]:
print( len(results) )

explanations = count_and_filter( results ) # same counting as before, now on the 'hierna' matches
print_filtered( explanations )

2380
Afdeling bestuursrechtspraak van de Raad van State    12:   hierna: de Afdeling
Afdeling bestuursrechtspraak van de Raad van State     3:   hierna: ABRvS
Algemeen beleidskader indeplaatsstelling bij taakverwaarlozing     2:   hierna: het beleidskader
Algemene Plaatselijke Verordening     6:   hierna: APV
Algemene Verordening Gegevensbescherming     2:   hierna te noemen: AVG
Algemene Wet Bijzondere Ziektekosten     2:   hierna: AWBZ
Algemene Wet bestuursrecht     2:   hierna: Awb
Algemene plaatselijke verordening    11:   hierna: APV
Algemene plaatselijke verordening Hoorn     3:   hierna: APV
Algemene plaatselijke verordening Purmerend 2003     2:   hierna: APV
Algemene subsidieverordening 2012     2:   hierna Asv 2012
Algemene subsidieverordening Venlo 2020     3:   hierna: AsV Venlo 2020
Algemene wet
                            bestuursrecht     2:   hierna: Awb
Algemene wet bestuursrecht   102:   hierna: Awb
Algemene wet bestuursrecht    10:   hierna Awb
Algemene wet bestuursr