## Goal of this notebook

Inspect what the BWB fetching actually gives us, and how to handle that both when building a dataset (see the corresponding notebook in our datacollect repository) and when using one.

Once you want more structure than plain text, you quickly have to dig deeper, in ways that are specific to each data source.

<!--
This is also a continuation of what we started in the [dataset_docstructure_cvdr]() notebook,
now applied to Basis WettenBestand (BWB).
-->

## BWB

In [3]:
import collections, random
import wetsuite.helpers.net
import wetsuite.datasets
import wetsuite.helpers.etree
import wetsuite.helpers.koop_parse

### What do the documents look like?


Again, let's start with a cherry-picked example, and let's skip ahead to showing only the body (using a path we don't technically know just yet).

In [6]:
url = 'https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0009262/1998-01-14_0/xml/BWBR0009262_1998-01-14_0.xml'
example_tree = wetsuite.helpers.etree.fromstring( wetsuite.helpers.net.download(url) )
example_tree = wetsuite.helpers.etree.strip_namespace( example_tree )
print( wetsuite.helpers.etree.tostring( wetsuite.helpers.etree.indent( example_tree.find('wetgeving/wet-besluit/wettekst') ) ).decode('u8') )

<wettekst>
  <artikel bwb-ng-variabel-deel="/Artikel1" code="b1-1" stam-id="676033" versie-id="983772" id="C983771" label-id="655064" inwerking="1998-01-14" label="Artikel 1" bron="Stb.1998-16" effect="nieuwe-regeling" ondertekening_bron="1997-12-24" publicatie_bron="1998-01-13" publicatie_iwt="1998-01-13" status="goed">
    <kop>
      <label>Artikel</label>
      <nr>1</nr>
    </kop>
    <lid bwb-ng-variabel-deel="/Artikel1/Lid1" label-id="655064L1">
      <lidnr>1</lidnr>
      <al>De verkiezing van de leden van de raden van de gemeenten Deventer, Diepenveen en Bathmen, waarvoor de kandidaatstelling op 20 januari 1998 zou plaatsvinden, blijft achterwege.</al>
      <meta-data>
        <jcis>
          <jci versie="1.3" verwijzing="jci1.3:c:BWBR0009262&amp;artikel=1&amp;lid=1&amp;z=1998-01-14&amp;g=1998-01-14" onderdeel="lid=1"/>
        </jcis>
      </meta-data>
    </lid>
    <lid bwb-ng-variabel-deel="/Artikel1/Lid2" label-id="655064L2">
      <lidnr>2</lidnr>
      <al>De leden

There's [a schema](https://repository.officiele-overheidspublicaties.nl/Schema/BWB-WTI/2016-1/xsd/wti_2016-1.xsd),
which says that there is consistent wrapping:
- ***`toestand`*** node (useful attributes include `bwb-id`)
  - `bwb-inputbestand` (required but may be empty)
  - `bwb-wijzigingen` (required but may be empty)
  - `redactionele-correcties` (optional)
  - ***`wetgeving`***  node (useful attributes include `soort`)
    - `intitule`
    - `citeertitel`
    - ***(general content root)*** node
    - `meta-data`
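
The wrapping described above can be sketched with the standard library's `xml.etree.ElementTree`. This is a minimal illustration, not how wetsuite does it: the sample document, its `bwb-id` value, and the helper name `find_content_root` are made up here, and it assumes namespaces have already been stripped. It picks the content root by excluding the children the schema names, rather than by position or by name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for a BWB toestand (namespaces already stripped).
SAMPLE = """
<toestand bwb-id="BWBR0000000">
  <bwb-inputbestand/>
  <bwb-wijzigingen/>
  <wetgeving soort="wet">
    <intitule>Example</intitule>
    <citeertitel>Example</citeertitel>
    <wet-besluit><wettekst/></wet-besluit>
    <meta-data/>
  </wetgeving>
</toestand>
"""

# The children of <wetgeving> that the schema names explicitly.
KNOWN_WRAPPER_TAGS = {"intitule", "citeertitel", "meta-data"}


def find_content_root(toestand):
    """Return the one child of <wetgeving> that is not a known wrapper tag."""
    wetgeving = toestand.find("wetgeving")
    if wetgeving is None:
        raise ValueError("no <wetgeving> element found")
    candidates = [child for child in wetgeving if child.tag not in KNOWN_WRAPPER_TAGS]
    if len(candidates) != 1:
        raise ValueError("expected exactly one content root, found %d" % len(candidates))
    return candidates[0]


toestand = ET.fromstring(SAMPLE)
content_root = find_content_root(toestand)
print(toestand.get("bwb-id"), toestand.find("wetgeving").get("soort"), content_root.tag)
```

Selecting by exclusion avoids hard-coding the content root's name, and the explicit error makes it obvious when a document does not follow the expected wrapping.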

It turns out that the name of that 'general content root' element varies with the kind of document.

That name correlates well with `wetgeving`'s `soort`, though with just enough exceptions that you really can't assume it.

Let's get that conclusion from actual data:

#### What types of documents are there?

According to [Basiswettenbestand Gebruikersdocumentatie SRU](https://puc.overheid.nl/koop/doc/PUC_234296_13/1/#), `soort` is one of:
- AMvB
- AMvB-BES
- beleidsregel
- beleidsregel-BES
- circulaire
- circulaire-BES
- KB
- ministeriele-regeling
- ministeriele-regeling-archiefselectielijst
- ministeriele-regeling-BES
- pbo
- reglement
- rijksKB
- rijkswet
- verdrag
- wet
- wet-BES
- zbo

Let's see how that relates to the contents.

In [4]:
# load BWB documents -- parse the raw XML.  
#   Once we know our data and have made datasets of it, this will not be necessary -- this is the prep work for that.
#   There are currently roughly 37k active toestanden.
#   Parsing all of them in one go takes a while and a lot of RAM, so let's make a selection.
bwb_parsed = []

bwb_xml = wetsuite.datasets.load('bwb-mostrecent-xml')
bwb_urls = list( bwb_xml.data.keys() )   # random.sample() needs a sequence, not a keys view
bwb_urls_subset = random.sample( bwb_urls, 5000 )

for bwb_url in bwb_urls_subset:
    bytestring = bwb_xml.data.get( bwb_url )
    tree = wetsuite.helpers.etree.fromstring( bytestring )
    tree = wetsuite.helpers.etree.strip_namespace( tree )
    bwb_parsed.append( (bwb_url, tree) )
print('DONE parsing %d items'%len(bwb_parsed))

DONE parsing 5000 items


In [17]:
content_root_count = collections.defaultdict(int)

for url, tree in bwb_parsed:
    wetgeving = tree.find('wetgeving')
    soort     = wetgeving.get('soort')
    # unpacking implicitly also tests that there are always exactly the four nodes the schema describes
    intitule, citeertitel, content_root, metadata = list( wetgeving )   # getchildren() is deprecated
    content_root_count[ (soort, content_root.tag) ] += 1

In [19]:
for path_string in sorted(content_root_count):
    soort,content_root = path_string
    count = content_root_count[path_string]
    print(' %50r  with   %-20r  appeared %s times'%(soort, content_root, count))

                                             'AMvB'  with   'regeling'            appeared 1 times
                                             'AMvB'  with   'wet-besluit'         appeared 439 times
                                         'AMvB-BES'  with   'wet-besluit'         appeared 35 times
                                               'KB'  with   'regeling'            appeared 32 times
                                               'KB'  with   'wet-besluit'         appeared 83 times
                                     'beleidsregel'  with   'circulaire'          appeared 338 times
                                     'beleidsregel'  with   'regeling'            appeared 67 times
                                 'beleidsregel-BES'  with   'circulaire'          appeared 4 times
                                 'beleidsregel-BES'  with   'regeling'            appeared 2 times
                                       'circulaire'  with   'circulaire'          appeared 72 times
 

That is a little messy (and had we included all documents, there would have been a few more combinations).

On the other hand, if you treat rare things as exceptions, it's not so bad:

In [20]:
for path_string in sorted(content_root_count):
    soort,content_root = path_string
    count = content_root_count[path_string]
    if count > 40:
        print(' %50r  with   %-20r  appeared %s times'%(soort, content_root, count))

                                             'AMvB'  with   'wet-besluit'         appeared 439 times
                                               'KB'  with   'wet-besluit'         appeared 83 times
                                     'beleidsregel'  with   'circulaire'          appeared 338 times
                                     'beleidsregel'  with   'regeling'            appeared 67 times
                                       'circulaire'  with   'circulaire'          appeared 72 times
                            'ministeriele-regeling'  with   'regeling'            appeared 2462 times
                        'ministeriele-regeling-BES'  with   'regeling'            appeared 47 times
       'ministeriele-regeling-archiefselectielijst'  with   'regeling'            appeared 189 times
                                              'pbo'  with   'regeling'            appeared 324 times
                                              'wet'  with   'wet-besluit'         appeared 400
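
To make "treat rare things as exceptions" concrete, here is a small sketch that picks the most common content root per `soort` and reports what share of documents it covers. The counts are copied from the sample output above (only three `soort` values, to keep it short); the function name `dominant_root_per_soort` is made up for this illustration.

```python
import collections

# (soort, content root) counts copied from the sample output above.
pair_counts = {
    ("AMvB", "regeling"): 1,
    ("AMvB", "wet-besluit"): 439,
    ("KB", "regeling"): 32,
    ("KB", "wet-besluit"): 83,
    ("beleidsregel", "circulaire"): 338,
    ("beleidsregel", "regeling"): 67,
}


def dominant_root_per_soort(counts):
    """Map each soort to (most common content root, share of documents it covers)."""
    per_soort = collections.defaultdict(collections.Counter)
    for (soort, root_tag), count in counts.items():
        per_soort[soort][root_tag] += count
    result = {}
    for soort, roots in per_soort.items():
        root_tag, top_count = roots.most_common(1)[0]
        result[soort] = (root_tag, top_count / sum(roots.values()))
    return result


dominant = dominant_root_per_soort(pair_counts)
for soort, (root_tag, share) in sorted(dominant.items()):
    print("%-15s mostly %-12s (%.1f%% of sampled documents)" % (soort, root_tag, 100 * share))
```

In this sample the dominant root covers nearly all `AMvB` documents but only about 72% of `KB` documents, which is why a fixed `soort`-to-root mapping is not safe on its own.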

#### What about the document body?

The schema says little about the structure within that main content node.
How would we, for example, find all the structural levels, or work out how to refer to specific parts?

Again, let's look at the data, this time counting every path under the content root.

In [8]:
count_paths = collections.defaultdict( int )

for url, tree in bwb_parsed:
    wetgeving = tree.find('wetgeving')
    #soort = wetgeving.get('soort')
    _, _, content_root, _ = list( wetgeving )   # getchildren() is deprecated

    for path, count in wetsuite.helpers.etree.path_count( content_root ).items():
        count_paths[path] += count

for path, count in sorted( count_paths.items() ): # sort by path as an approximate 'group similar things'
    if count > 100: # just the common stuff  to keep the output _relatively_ short (remove this to see _everything_)
        print( '%6d  %s'%(count,path))

   532  circulaire
   375  circulaire/bijlage
   129  circulaire/bijlage/adres/adresregel
  1158  circulaire/bijlage/al
   220  circulaire/bijlage/al/extref
   113  circulaire/bijlage/al/nadruk
   512  circulaire/bijlage/divisie
  1231  circulaire/bijlage/divisie/al
   160  circulaire/bijlage/divisie/al/extref
   639  circulaire/bijlage/divisie/divisie
  1313  circulaire/bijlage/divisie/divisie/al
   142  circulaire/bijlage/divisie/divisie/al/extref
   283  circulaire/bijlage/divisie/divisie/divisie
   746  circulaire/bijlage/divisie/divisie/divisie/al
   101  circulaire/bijlage/divisie/divisie/divisie/al/extref
   105  circulaire/bijlage/divisie/divisie/divisie/divisie/divisie/table/tgroup/tbody/row/entry
   124  circulaire/bijlage/divisie/divisie/divisie/divisie/table/tgroup/colspec
   283  circulaire/bijlage/divisie/divisie/divisie/divisie/table/tgroup/tbody/row
  1259  circulaire/bijlage/divisie/divisie/divisie/divisie/table/tgroup/tbody/row/entry
   918  circulaire/bijlage/divisie

...that's a lot of output, but to be fair, it covers most paths that ever appear, so you would *expect* it to be long.
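
For readers wondering what such a path counter does, here is a minimal sketch using only the standard library. It is not wetsuite's `path_count` implementation; it only assumes, based on the output above, that paths are slash-joined tag names starting at the node you pass in. The function name `count_paths_sketch` and the XML fragment are made up.

```python
import collections
import xml.etree.ElementTree as ET


def count_paths_sketch(root):
    """Count how often each slash-joined tag path occurs at or below root."""
    counts = collections.Counter()

    def walk(element, parent_path):
        path = element.tag if not parent_path else parent_path + "/" + element.tag
        counts[path] += 1
        for child in element:
            walk(child, path)

    walk(root, "")
    return counts


# Tiny made-up fragment, not real BWB content.
fragment = ET.fromstring(
    "<regeling><regeling-tekst>"
    "<artikel><al>a</al><al>b</al></artikel>"
    "<artikel><lid><al>c</al></lid></artikel>"
    "</regeling-tekst></regeling>"
)
for path, count in sorted(count_paths_sketch(fragment).items()):
    print("%6d  %s" % (count, path))
```

Summing such per-document counters over many documents gives a table like the one above.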

### Structure down to &lt;artikel>?

If we assume for a moment that all interesting content is within `artikel` tags
(ignoring some parts of the structure, like bijlage, for _relative_ brevity),
let's see
- what path lies between that content root and `artikel`, and then
- what the structure is within each `artikel`.

In [22]:
# we could probably alter path_count() to stop at a given node name, but for a one-off this code is not so bad:
how_to_get_to_artikel = collections.defaultdict(int)

for path_string in list( count_paths.keys() ):
    if path_string.endswith('/artikel'):
        how_to_get_to_artikel[ path_string ] += count_paths[path_string]

for path, count in sorted( how_to_get_to_artikel.items() ):
    print( '%6d  %s'%(count,path))

   153  circulaire/circulaire-tekst/artikel
   277  circulaire/circulaire-tekst/circulaire.divisie/artikel
  1195  circulaire/circulaire-tekst/circulaire.divisie/circulaire.divisie/artikel
   222  regeling/bijlage/artikel
   700  regeling/bijlage/divisie/artikel
    94  regeling/bijlage/divisie/divisie/artikel
 16693  regeling/regeling-tekst/artikel
    35  regeling/regeling-tekst/deel/artikel
    30  regeling/regeling-tekst/deel/hoofdstuk/artikel
    39  regeling/regeling-tekst/deel/hoofdstuk/titeldeel/afdeling/artikel
    81  regeling/regeling-tekst/deel/hoofdstuk/titeldeel/afdeling/paragraaf/artikel
    39  regeling/regeling-tekst/deel/hoofdstuk/titeldeel/artikel
   348  regeling/regeling-tekst/hoofdstuk/afdeling/artikel
   288  regeling/regeling-tekst/hoofdstuk/afdeling/paragraaf/artikel
  4276  regeling/regeling-tekst/hoofdstuk/artikel
  2928  regeling/regeling-tekst/hoofdstuk/paragraaf/artikel
   262  regeling/regeling-tekst/hoofdstuk/paragraaf/sub-paragraaf/artikel
    22  regel

That looks reasonably regular, actually.
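
One way to quantify that regularity is to look only at the levels between `regeling-tekst` and `artikel`. The sketch below does that for a handful of paths copied from the output above; the helper name `intermediate_levels` and its default arguments are made up for this illustration.

```python
import collections

# A few (path, count) pairs copied from the output above.
paths_to_artikel = {
    "regeling/regeling-tekst/artikel": 16693,
    "regeling/regeling-tekst/hoofdstuk/artikel": 4276,
    "regeling/regeling-tekst/hoofdstuk/paragraaf/artikel": 2928,
    "regeling/regeling-tekst/hoofdstuk/afdeling/artikel": 348,
    "regeling/regeling-tekst/hoofdstuk/afdeling/paragraaf/artikel": 288,
}


def intermediate_levels(path, prefix="regeling/regeling-tekst", leaf="artikel"):
    """Return the tags between the given prefix and the final leaf tag, as a tuple."""
    parts = path.split("/")
    prefix_parts = prefix.split("/")
    if parts[: len(prefix_parts)] != prefix_parts or parts[-1] != leaf:
        raise ValueError("path %r does not run from %r to %r" % (path, prefix, leaf))
    return tuple(parts[len(prefix_parts):-1])


depth_counts = collections.Counter()   # number of intermediate levels -> artikel count
tag_counts = collections.Counter()     # intermediate tag name -> artikel count
for path, count in paths_to_artikel.items():
    levels = intermediate_levels(path)
    depth_counts[len(levels)] += count
    for tag in levels:
        tag_counts[tag] += count

print(sorted(depth_counts.items()))
print(tag_counts.most_common())
```

Within this excerpt, most `artikel` elements sit directly under `regeling-tekst`, and `hoofdstuk` is by far the most common intermediate level.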


### Structure within artikel?

Additionally assuming all real text is in `al` tags, let's see which paths lead from `artikel` down to an `al`:

In [9]:
within_artikel = collections.defaultdict(int)

for path_string in list( count_paths.keys() ):
    arti = path_string.find('/artikel/')
    if arti != -1  and  path_string.endswith('/al'):
        if '/meta-data'  in  path_string: # focus more on content for a moment
            continue
        within_artikel[ path_string[arti+1:] ] += count_paths[path_string]

for path, count in sorted( within_artikel.items() ):
    #if count>40: # uncomment for 'just the more common stuff'
        print( '%6d  %s'%(count,path))

 43012  artikel/al
   929  artikel/definitielijst/definitie-item/definitie/al
    99  artikel/definitielijst/definitie-item/definitie/lijst/li/al
     8  artikel/definitielijst/definitie-item/definitie/lijst/li/lijst/li/al
 75614  artikel/lid/al
   192  artikel/lid/definitielijst/definitie-item/definitie/al
    38  artikel/lid/definitielijst/definitie-item/definitie/lijst/li/al
     4  artikel/lid/definitielijst/definitie-item/definitie/lijst/li/lijst/li/al
 44206  artikel/lid/lijst/li/al
  5450  artikel/lid/lijst/li/lijst/li/al
   313  artikel/lid/lijst/li/lijst/li/lijst/li/al
    41  artikel/lid/lijst/li/lijst/li/lijst/li/lijst/li/al
   321  artikel/lid/lijst/li/lijst/li/table/tgroup/tbody/row/entry/al
    30  artikel/lid/lijst/li/lijst/li/table/tgroup/thead/row/entry/al
  1716  artikel/lid/lijst/li/table/tgroup/tbody/row/entry/al
    95  artikel/lid/lijst/li/table/tgroup/thead/row/entry/al
    28  artikel/lid/specificatielijst/specificatie-item/specificatie/al
 11941  artikel/lid/ta

Again, not so bad for a free-form document, and it resembles what we had already done for CVDR.
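
To show how paths like `artikel/lid/al` and `artikel/lid/lijst/li/al` translate into extraction, here is a hand-rolled sketch with the standard library. It is only an illustration and not the wetsuite helper: the function name `alineas_with_context` is made up, the first lid is a trimmed copy of the example artikel shown near the top of this notebook, and the second lid is placeholder content added to exercise a nested list.

```python
import xml.etree.ElementTree as ET

# First lid: trimmed copy of the example shown earlier. Second lid: made-up placeholder.
ARTIKEL_XML = """
<artikel label="Artikel 1">
  <kop><label>Artikel</label><nr>1</nr></kop>
  <lid>
    <lidnr>1</lidnr>
    <al>De verkiezing van de leden van de raden van de gemeenten Deventer, Diepenveen en Bathmen, waarvoor de kandidaatstelling op 20 januari 1998 zou plaatsvinden, blijft achterwege.</al>
    <meta-data><jcis/></meta-data>
  </lid>
  <lid>
    <lidnr>2</lidnr>
    <lijst><li><al>placeholder item</al></li></lijst>
  </lid>
</artikel>
"""


def alineas_with_context(artikel):
    """Yield (artikel number, lid number or None, text) for each <al> outside kop and meta-data."""
    artikel_nr = artikel.findtext("kop/nr")

    def walk(element, lid_nr):
        for child in element:
            if child.tag in ("meta-data", "kop"):
                continue  # not body text
            if child.tag == "al":
                yield (artikel_nr, lid_nr, "".join(child.itertext()).strip())
            elif child.tag == "lid":
                yield from walk(child, child.findtext("lidnr"))
            else:
                yield from walk(child, lid_nr)  # e.g. lijst, li, table

    yield from walk(artikel, None)


rows = list(alineas_with_context(ET.fromstring(ARTIKEL_XML)))
for artikel_nr, lid_nr, text in rows:
    print("artikel %s, lid %s: %s" % (artikel_nr, lid_nr, text[:60]))
```

Every extra kind of context (list item numbers, table cells, definitions) adds another special case to a walker like this, which is the argument for reusing one shared helper across sources.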

In fact, let's try to use the same 'text with context' function.

In [27]:
import random, pprint
import wetsuite.helpers.koop_parse

from importlib import reload
reload(wetsuite.helpers.koop_parse)

for url, tree in random.sample( bwb_parsed, 3 ):
    print(' === %s ==='%url )

    alinea_dicts = wetsuite.helpers.koop_parse.alineas_with_selective_path( 
        tree, 
        start_at_path = wetsuite.helpers.etree.path_between(tree, tree.find('wetgeving')[2])   # [2] is the content root
    )
    if 0: # ungrouped, to show you the intermediate form we're working with
        for alinea_dict in alinea_dicts:
            print('-'*80)
            pprint.pprint( alinea_dict )
    else:  # grouped
        # merge_alinea_data() groups the text fragments that share the same values for specific keys
        pprint.pprint( wetsuite.helpers.koop_parse.merge_alinea_data( alinea_dicts ) )


 === https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0009262/1998-01-14_0/xml/BWBR0009262_1998-01-14_0.xml ===
[([],
  ['De tarieven, alle genoemd in EUR en exclusief BTW, zijn te onderscheiden '
   'in: ',
   '1A. Eenmalige bijdragen; ',
   '1B. Jaarlijkse bijdragen: ',
   '1B1. Basisbijdrage; ',
   '1B2. Variabele bijdrage (te betalen na certificatie) op basis van '
   'bedrijfsomvang. ',
   '2A. Eenmalige bijdragen; ',
   '2B. Jaarlijkse bijdragen: ',
   '2B1. Basisbijdrage; ',
   '2B2. Variabele bijdrage (te betalen na certificatie) op basis van de '
   'verkoopwaarde van de onder certificaat bereide producten. ',
   'De tarieven voor Landbouwers gelden voor: ',
   'Aangeslotenen die onverwerkte producten voortbrengen door middel van '
   'productie (landbouw). ',
   'De tarieven voor Bereiders/Importeurs gelden voor: ',
   'Aangeslotenen die verwerkte producten voortbrengen door middel van '
   'bereiding en/of verwerkte producten in de handel brengen; ',
   'Aangeslo



It's missing various BWB-specific details you _could_ be extracting, but not a bad start at all.
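
As an example of such BWB-specific details, the example document shown near the top of this notebook carries a `bwb-ng-variabel-deel` attribute, a `label-id`, and a `jci` reference on each lid. A minimal sketch of collecting those, assuming stripped namespaces; the helper name `bwb_details` and the dictionary keys are made up, and the `jci` string is kept as-is rather than parsed.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the first lid from the example document shown earlier (body text omitted).
LID_XML = """
<lid bwb-ng-variabel-deel="/Artikel1/Lid1" label-id="655064L1">
  <lidnr>1</lidnr>
  <al>(text omitted)</al>
  <meta-data>
    <jcis>
      <jci versie="1.3" verwijzing="jci1.3:c:BWBR0009262&amp;artikel=1&amp;lid=1&amp;z=1998-01-14&amp;g=1998-01-14" onderdeel="lid=1"/>
    </jcis>
  </meta-data>
</lid>
"""


def bwb_details(element):
    """Collect a few BWB-specific reference attributes from one structural element."""
    jci = element.find("meta-data/jcis/jci")
    return {
        "variabel_deel": element.get("bwb-ng-variabel-deel"),
        "label_id": element.get("label-id"),
        "jci": jci.get("verwijzing") if jci is not None else None,
    }


details = bwb_details(ET.fromstring(LID_XML))
for key, value in details.items():
    print("%-14s %s" % (key, value))
```

Attaching a dictionary like this to each text fragment would let you refer back to the exact part of the law a fragment came from.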

#### Other examples you could try

In [28]:
# short example with lists
#https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0006881/1994-09-01_0/xml/BWBR0006881_1994-09-01_0.xml

#https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0034743/2014-02-05_0/xml/BWBR0034743_2014-02-05_0.xml

# definities as examples of valuable text _not_ in an al
# https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0010278/1999-12-01_0/xml/BWBR0010278_1999-12-01_0.xml 

# https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0001827/2022-08-01_0/xml/BWBR0001827_2022-08-01_0.xml