<a href="https://colab.research.google.com/github/WetSuiteLeiden/example-notebooks/blob/main/datasets/dataset_intro_by_doing__bwb__(definitions_example).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# (only) in colab, run this first to install wetsuite from (the most recent) source. 
#    (this should soon simplify to something like   !pip3 install --upgrade wetsuite)
# For your own setup, see wetsuite's install guidelines.
!pip3 install -U wetsuite

## Purpose of this notebook

Explore what is in the Basis WettenBestand dataset(s), and what you could easily do with it. 

You may also want to find the [koop_bwb_docstructure](../specific-experiments/investigate-document-structures/koop_bwb_docstructure.ipynb) notebook 
which is some introduction to the structure of the varied documents in here.

Any real research question is probably going to be fairly specific,
so let's start with something relatively dumb - looking for definition lists.

In [1]:
import collections, random, pprint

import wetsuite.helpers.etree
import wetsuite.helpers.strings
import wetsuite.helpers.koop_parse
import wetsuite.helpers.net
from wetsuite.helpers import lazy
import wetsuite.datasets

In [2]:
bwb_text = wetsuite.datasets.load('bwb-mostrecent-text')
bwb_xml  = wetsuite.datasets.load('bwb-mostrecent-xml')
bwb_meta = wetsuite.datasets.load('bwb-mostrecent-meta-struc')  # we don't end up using this

In [None]:
#print( bwb_text.data.get('BWBR0034320') )
print( wetsuite.helpers.etree.debug_pretty( bwb_xml.data.get('BWBR0034320')) ) 
# or
#wetsuite.helpers.etree.debug_color( bwb_xml.data.get('BWBR0034320'))
# either way it's a bit much to show, though

## Figure out which laws have a definition list.

It might be cleverer to reach into specific parts of the XML, 
and e.g. look for an artikel/kop/titel with text 'Definities' or 'Begripsbepalingen'

But even simpler would be to check whether text like `In deze regeling wordt verstaan onder:` is present.

Such wording happens to often be used literally, and while we _likely_ are missing some cases, 
for a quick test this is plenty.

In [8]:
wordt_verstaan = 'deze regeling wordt verstaan'   # a substring of that that might catch some variant wording

bwbids_with_verstaan = set()
for bwbid, text in bwb_text.data.items():
    if wordt_verstaan in text:
        bwbids_with_verstaan.add( bwbid )
len( bwbids_with_verstaan )

5344

fetch the full XML for each of those BWB-ids,
then fish out just the definitions list -- by looking around the element that contains that same text

The following fishes out two separate things:
- the title of the section we found this in   -- for possible  future refinement of how we're picking this out
- definitions 

In [9]:
definitions = collections.defaultdict(list) # defined_thing -> (in_bwbid, definition), the main thing we're fishing out

def_header_titles = collections.Counter() # the name of the header of sections we select, to see how consistent it is

for test_bwbid in bwbids_with_verstaan:
    xmlbytes = bwb_xml.data.get( test_bwbid ) # get the document for it (again)
    etree = lazy.etree( xmlbytes )
    
    # That next line is syntax-fu via XPath, not easy to understand very quickly, apologies.  
    #   It's a bunch fewer lines of code than expressing the same via node finding and navigation.
    #   It asks for something like "the parent (if it is an artikel), of an alinea node that contains that 'deze regeling wordt verstaan' as text" 
    for node in list( etree.xpath( "//al[contains(text(),'%s')]/parent::artikel"%wordt_verstaan ) ):

        # This fishes out the header of the part we're in.
        kop   = node.find('kop')
        titel = kop.find('titel')
        if titel is not None:
            def_header_titles.update( [titel.text] )
            
        # The rest is picking up the definitions in the section:

        # From looking at some of these documents, most look like::
        #   <al><nadruk type="cur">de minister:</nadruk>de Minister van Binnenlandse Zaken en Koninkrijksrelaties;</al>
        # note: 
        # - a serious investigation would try for completeness, this is just a proof of concept and a test of usefulness.
        # - we currently use that nadruk as the thing to define. That will later prove to be too approximate, but it's simple for an example
        for al in node.xpath('//nadruk/parent::al'):
            al_before = wetsuite.helpers.etree.debug_pretty(al) # reformatting, only for "wait, why did it seem to contain nothing?" human-geared debug 

            nadruk = al.find('nadruk')
            defined_thing = nadruk.text
            if defined_thing is not None  and  len(defined_thing.strip()) >= 2: # skip some empty nodes, and single letters
                defined_thing = defined_thing.rstrip(': ')

                # the further text is often the etree-.tail of the nadruk node, but let's assume there can be markup in there,
                # We can use our own text extractor function on the whole thing 
                #   ...if we remove the term we are defining from the in-memory document 
                #   (specifically nadruk; we just copied it to `what`) before doing so, to avoid it showing up twice
                nadruk.text = ''

                rest_text = (  ' '.join( wetsuite.helpers.etree.all_text_fragments(al) )  ).strip('; ')

                if len(rest_text)==0: # only nadruk, no other text in the alinea -- this is probably wrong and skippable.
                    pass
                    #print('CONFUSED about:')
                    #print( al_before )
                else:
                    definitions[defined_thing].append( (test_bwbid, rest_text) )  # add the BWB-id to signal where it came from

In [11]:
# Count and list the name of the section we just picked these out of
#   (in part to see how well we would have done if we were looking for them by section name)
dht = list( def_header_titles.items() )  # a list of (headertext, count)
dht.sort( key=lambda x:x[1], reverse=True) # most used on top
for header, count in dht:
    if count >= 2: # show only those used more than once
        print( '%5s %s'%(count, header) )
# Turns out there's some variation:

 1230 Begripsbepalingen
  382 Definities
  173 Begripsbepaling
   66 Begripsomschrijvingen
   23 (begripsbepalingen)
   21 Begrippen
   17 Definitiebepalingen
   15 begripsbepalingen
   14 (definities)
   11 Definitiebepaling
   10 Begripsomschrijving
    9 Definitie
    7 (Begripsbepalingen)
    6 Algemene bepalingen
    6 (begripsomschrijving)
    4 definities
    3 
      
    3 Algemene begripsbepalingen
    3 (begripsomschrijvingen)
    2 Begrippen en definities
    2 – definities
    2 – Definities –
    2  Begripsbepalingen
    2 - Begripsbepalingen
    2 Begripsbepalingen 


In [15]:
# CONSIDER: using a case insensitive merge on defined_thing
defdata = list( definitions.items() )  
defdata.sort( key=lambda x:len(x[1]), reverse=True)

# items like:  ('Lucky Bamboo', [('BWBR0025197', 'sierplant met de wetenschappelijke naam Dracaena sanderiana')])
# The below throws away that origin ID for brevity,
# yet you might well want that when digging deeper

for defined_thing, definitions_list in defdata[:25]: # [:25] to print just a handful, otherwise this would be 60K lines
    # there are many cases of shorthands-per-document, e.g. 
    # "where we say minister, this particular one within this document", which are not general definitions. 
    # While always incomplete, We can remove some of the most common cases:
    if defined_thing.lower() in ('minister', 'de minister', 'wet','de wet', 'besluit',  'instelling'):
        continue # so ignore them


    # all mentions
    # if len(definitions_list)>0:
    #print( defined_thing )
    #     for origin_bwb, definition in definitions_list:
    #         print(f'  In {origin_bwb}: {definition}')

    # _not_ all mentions - only things that appear more often
    counts = wetsuite.helpers.strings.count_normalized( 
        list(definition     for origin_bwbid, definition  in definitions_list ),  
        min_count = 2   # show only definitions appearing twice or more
    )
    if len(counts) > 0: #if that leaves anything to print:
        print( defined_thing )
        pprint.pprint(counts)

    print()

school
{'bekostigde basisschool als bedoeld in de Wet op het primair onderwijs': 2,
 'bekostigde school als bedoeld in de Wet op het primair onderwijs of een bekostigde school of instelling als bedoeld in de Wet op de expertisecentra .': 12,
 'bekostigde speciale school voor basisonderwijs als bedoeld in de WPO': 17,
 'bekostigde speciale school voor basisonderwijs als bedoeld in de Wet op het primair onderwijs': 2,
 'een bekostigde basisschool als bedoeld in de WPO': 2,
 'school als bedoeld In artikel 1 van de wet': 3,
 'school als bedoeld in artikel 1 van de Wet op het primair onderwijs , artikel 1 van de Wet op de expertisecentra of artikel 1.1 van de Wet voortgezet onderwijs 2020': 2,
 'school als bedoeld in artikel 1 van de Wet op het primair onderwijs , artikel 1 van de Wet primair onderwijs BES en artikel 1 van de Wet op de expertisecentra': 2,
 'school als bedoeld in artikel 1 van de wet': 34,
 'school als bedoeld in de wet': 2,
 'school voor vwo, havo, mavo, vbo of praktijkond

...okay, that needs more work.