<a href="https://colab.research.google.com/github/WetSuiteLeiden/data-collection/blob/master/api_koop_bwb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Purpose of this notebook

Show how we fetch data from the BWB (Basiswettenbestand) repository, and how we turn that data into our corresponding datasets.

TODO: finish, this is a copy-paste from a script


## some API notes (you can skip this)

Entry types (the `soort` attribute in the toestand) include:
- `ministeriele-regeling` (~18k)
- `AMvB` (~3k)
- `zbo` (~3k)
- `beleidsregel` (~3k) (rijksdienst)
- `wet` (~3k)
- `pbo` (~2k)
- `ministeriele-regeling-archiefselectielijst` (~1k)
- `KB` (~800)
- `circulaire` (~500) (rijksdienst)
- `ministeriele-regeling-BES` (~300)
- `AMvB-BES` (~200)
- `wet-BES` (~150)
- `rijkswet` (~100)
- `rijksKB` (~80)
- `reglement` (~40) (van de Staten-Generaal)
- `beleidsregel-BES` (~30)
- `circulaire-BES` (~4)
- `rijksAMvB` (~1)

#### BWB search results

Keep in mind that a BWB number refers to _all versions_ of a law, not to a specific version of it (so it is not, strictly speaking, an identifier of one specific text).


This is why a [jci](#glossary_j) reference will typically be more specific, adding a date, e.g. `jci1.3:c:BWBR0045754&z=2022-08-01&g=2022-08-01` where
: `g` for geldigheidsdatum (validity)
: `z` for zichtdatum

Validity is useful both to search for a specific version and to refer to a specific consolidation/version. The exception is changes that apply retroactively, where you may technically have two versions that apply at the same time.

That last detail is why zichtdatum was added in jci 1.3 - see [page 4 of this](https://standaarden.overheid.nl/bwb/doc/Juriconnect-Standaard-BWB-1.3.pdf) but also the note in [Traceerbaarheid voor regelbeheersing bij uitvoeringsorganisaties](http://www.knowbility.nl/wp-content/uploads/BoekjeTraceerbaarheidDefPubv2.pdf) that suggests this isn't strictly followed.
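
To make the shape of such a reference concrete, here is a minimal sketch that splits a jci string into its parts. `parse_jci` is a hypothetical helper written for this note (it is not part of wetsuite), and it only handles the simple `jci<version>:<type>:<id>&key=value` form seen in the example above.

```python
from urllib.parse import parse_qsl

def parse_jci(ref):
    """Split a jci reference into version, type, BWB-id and parameters.
    Illustrative only: assumes the simple form shown in the example above."""
    head, _, rest = ref.partition(':')      # 'jci1.3'  and  'c:BWBR...&z=...&g=...'
    if not head.startswith('jci'):
        raise ValueError('not a jci reference: %r' % ref)
    kind, _, rest = rest.partition(':')     # 'c'  and  'BWBR...&z=...&g=...'
    bwb_id, _, query = rest.partition('&')  # the id, then the key=value parameters
    return {
        'version': head[3:],
        'type':    kind,
        'bwb_id':  bwb_id,
        'params':  dict(parse_qsl(query)),  # e.g. z (zichtdatum), g (geldigheidsdatum)
    }

parsed = parse_jci('jci1.3:c:BWBR0045754&z=2022-08-01&g=2022-08-01')
print(parsed['bwb_id'], parsed['params'])
```

Keeping the parameters in a plain dict means other keys (such as the `artikel=6` that appears in WTI references) pass through without special handling.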



That a BWB-id refers to all versions/consolidations is also part of why each BWB-id has
* one **manifest** listing all consolidations/versions
  * [schema](https://repository.officiele-overheidspublicaties.nl/bwb/_manifest.xsd)

* one **wti**, wettechnische informatie, listing relations
  * e.g. related laws, when changes were posted in the Staatsblad  (VERIFY)
  * [schema](https://repository.officiele-overheidspublicaties.nl/Schema/BWB-WTI/2015-2/xsd/wti_2015-2.xsd)

* one or more **toestand** files that are law text, the most recent one being the most relevant
  * [schema](https://repository.officiele-overheidspublicaties.nl/Schema/BWB-toestand/2015-2/xsd/toestand_2015-2.xsd)




SRU searches give one result record for each toestand/revision
- for which the manifest and WTI are identical (VERIFY)
- which, in terms of _metadata_, only really differ in when they are valid

For example, [BWBR0045754 (Wet open overheid, Woo)](http://zoekservice.overheid.nl/sru/Search?operation=searchRetrieve&version=1.2&x-connection=BWB&query=dcterms.identifier%20==%20BWBR0045754) gives two results, the same variants reflected in [its manifest](https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0045754/manifest.xml), whereas [BWBR0035917 (Wet langdurige zorg) has over twenty revisions](https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0035917/manifest.xml) and [BWBR0005290 (Book 7 of the civil code) has over a hundred and sixty](https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0005290/manifest.xml).
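
A small sketch of how a search URL like the one linked above can be assembled. The base URL and parameters are copied from that link; `bwb_sru_url` is a hypothetical helper name, and the notebook itself goes through `wetsuite.datacollect.koop_sru` rather than building URLs by hand.

```python
from urllib.parse import urlencode, quote

SRU_BASE = 'http://zoekservice.overheid.nl/sru/Search'  # taken from the example link

def bwb_sru_url(bwb_id):
    """Build a searchRetrieve URL asking for all records of one BWB-id."""
    params = {
        'operation':    'searchRetrieve',
        'version':      '1.2',
        'x-connection': 'BWB',
        'query':        'dcterms.identifier == %s' % bwb_id,
    }
    # quote_via=quote encodes spaces as %20 (not '+'), as in the example link
    return SRU_BASE + '?' + urlencode(params, quote_via=quote)

print(bwb_sru_url('BWBR0045754'))
```

Note that this also percent-encodes the `==` in the query, which the hand-written link does not; the two forms should decode to the same query string.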

Note also that neither the manifest nor the WTI gives a URL for the toestand XML; the search results themselves do.
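
The manifest and WTI locations, on the other hand, look predictable from the BWB-id alone. The sketch below derives them from the pattern visible in the example links in this notebook. That pattern is an assumption drawn from those examples (the fetching code further down takes the locations from the search results instead), and the function names are made up for illustration.

```python
REPO_BASE = 'https://repository.officiele-overheidspublicaties.nl/bwb'

def bwb_manifest_url(bwb_id):
    """Manifest location, following the pattern of the example links (assumed stable)."""
    return '%s/%s/manifest.xml' % (REPO_BASE, bwb_id)

def bwb_wti_url(bwb_id):
    """WTI location, following the pattern of the example links (assumed stable)."""
    return '%s/%s/%s.WTI' % (REPO_BASE, bwb_id, bwb_id)

print(bwb_manifest_url('BWBR0035917'))
print(bwb_wti_url('BWBR0035917'))
```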


For a sense of size:
- Most manifests are on the order of a few hundred KB at most
- Simple WTIs are around 10 KB, the more complex ones mostly up to 3 MB, with a few in the tens of MB
- Most toestand files are under 100 to 200 KB, with a few at one, two, or ten MB

As a total collection, this amounts to:
- 2.5 GB in _current_ toestand XMLs, about 47 GB in old versions of toestand XMLs (VERIFY)
- 3.5 GB in WTI XMLs
- 100 MB in XML manifests

#### BWB XML notes

* `algemene-informatie`
  * citeertitel, <!--afkorting,-->
  * rechtsgebied(en) (e.g. _Ruimtelijke ordening en milieu - Milieurecht_), overheidsdomein (e.g. _Landbouw, natuur en voedsel_)
  * soort regeling (e.g. 'wet'), identificatienummer (BWB)

* `wijzigingen`

* `gerelateerde-regelgeving`
  * e.g. regulations that are based on this one (delegated legislation)

* `owms` (Overheid.nl Web Metadata Standaard)
  * https://standaarden.overheid.nl/owms/terms
  * mostly Dublin Core?

example (data): https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0035917/BWBR0035917.WTI

example (formatted): https://wetten.overheid.nl/BWBR0035917/2022-07-01/0/informatie


#### Manifest XML notes

TODO

#### Toestand  XML notes 

The outer wrapping is fairly consistent (see also the schema):

**toestand**
* _attributes_
  * inwerkingtreding (optional)
  * bwb-id
  * bwb-ng-vast-deel
* subelements
  * bwb-inputbestand (required but may be empty)
  * bwb-wijzigingen (required but may be empty)
  * redactionele-correcties (optional)
  * **wetgeving** 
    * intitule
    * citeertitel
    * **(general content root)**
    * meta-data

Notes:
* intitule tends to be more descriptive, citeertitel more succinct
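
As a rough illustration of that wrapping, the sketch below reads the attributes and the two titles with the standard library. The XML is a synthetic, heavily trimmed stand-in written for this note, not real data, and it has no XML namespaces; real toestand files may need those stripped first.

```python
import xml.etree.ElementTree as ET

# Synthetic stand-in for a toestand file (made up for illustration)
SAMPLE = '''<toestand bwb-id="BWBR0000000" inwerkingtreding="2020-01-01">
  <bwb-inputbestand/>
  <bwb-wijzigingen/>
  <wetgeving>
    <intitule>Regeling met een lange, beschrijvende naam</intitule>
    <citeertitel>Voorbeeldregeling</citeertitel>
    <regeling/>
    <meta-data/>
  </wetgeving>
</toestand>'''

root      = ET.fromstring(SAMPLE)
wetgeving = root.find('wetgeving')

summary = {
    'bwb-id':           root.get('bwb-id'),
    'inwerkingtreding': root.get('inwerkingtreding'),  # optional attribute, so may be None
    'intitule':         wetgeving.findtext('intitule'),
    'citeertitel':      wetgeving.findtext('citeertitel'),
}
print(summary)
```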


**The actual contents are more varied**
* The name of the general content root tag mentioned above varies: usually `wet-besluit`, but also `regeling` or `circulaire`, depending on context.
  Some of these are effectively determined by `soort`, others correlate more loosely (e.g. `rijksKB` seems to occur with both `wet-besluit` and `regeling`)

* The structure under that is also varied - the content text is mostly in `artikel` tags, but the levels between the content root and artikel tags may apparently be any of at least

        wet-besluit/wettekst
        wet-besluit/wettekst/paragraaf
        wet-besluit/wettekst/hoofdstuk/titeldeel
        wet-besluit/wettekst/afdeling
        wet-besluit/wettekst/titeldeel
        wet-besluit/wettekst/afdeling/paragraaf
        wet-besluit/wettekst/hoofdstuk 
        wet-besluit/wettekst/hoofdstuk/paragraaf
        wet-besluit/wettekst/regeling/regeling-tekst
        wet-besluit/wettekst/deel/hoofdstuk/afdeling
        wet-besluit/wettekst/deel/hoofdstuk/afdeling/paragraaf
: ...again with only moderate correlation to `soort`
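
A list like the one above can be gathered by walking each tree and recording the tag path that leads to every `artikel`. The following is a minimal sketch of that idea on a synthetic document. The helper names are hypothetical, and it assumes namespaces are already stripped and that the content root is the one child of `wetgeving` that is not `intitule`, `citeertitel` or `meta-data`.

```python
import collections
import xml.etree.ElementTree as ET

# Synthetic stand-in for a toestand file (made up for illustration)
SAMPLE = '''<toestand bwb-id="BWBR0000000">
  <wetgeving>
    <intitule>Voorbeeld</intitule>
    <citeertitel>Voorbeeld</citeertitel>
    <wet-besluit>
      <wettekst>
        <hoofdstuk>
          <artikel/>
          <paragraaf><artikel/><artikel/></paragraaf>
        </hoofdstuk>
      </wettekst>
    </wet-besluit>
    <meta-data/>
  </wetgeving>
</toestand>'''

WRAPPER_TAGS = {'intitule', 'citeertitel', 'meta-data'}

def content_root(wetgeving):
    """First child of wetgeving that is not one of the fixed wrapper elements."""
    for child in wetgeving:
        if child.tag not in WRAPPER_TAGS:
            return child
    return None

def artikel_paths(node, path=()):
    """Yield the slash-joined tag path above each artikel found below node."""
    path = path + (node.tag,)
    for child in node:
        if child.tag == 'artikel':
            yield '/'.join(path)
        else:
            yield from artikel_paths(child, path)

root   = ET.fromstring(SAMPLE)
croot  = content_root(root.find('wetgeving'))
counts = collections.Counter(artikel_paths(croot))
print(croot.tag, dict(counts))
```

Summing such counters over all documents, grouped by `soort`, would be one way to check how loose the correlation actually is.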




#### Small entries

A fair number of entries (mostly in wet and AMvB, also in KB, regeling, reglement) have very little content of their own.
Many are also not standalone, because as far as I can tell there is no consolidated content in these repositories.

* most common are modifications of laws - [wijzigingswet](https://nl.wikipedia.org/wiki/Wijzigingswet) or practical variant of the concept
: often with a name (intitule) including wijzigingswet, veegwet, herzieningswet, aanpassingswet, intrekkingswet, verzamelwet, etc. (this does not seem to be mentioned in the metadata, at least not explicitly (VERIFY))
: these may have very little text beyond mentioning _that_ they modify another law, and may not mention what final form that other law takes. As such, the laws they modify will often get a new consolidated version of their text (VERIFY)
: depending on your goals, these modifying laws can be seen as a formality rather than useful content
: they may not have a citeertitel (VERIFY)
: e.g. [BWBR0012984](https://wetten.overheid.nl/BWBR0012984/), [BWBR0013009](https://wetten.overheid.nl/BWBR0013009/)

* there are some other very short laws, e.g. to settle something more precisely, add exceptions - and these seem to be considered standalone augmentations, _not_ things to be consolidated into the law they augment (VERIFY)
: e.g. [BWBR0016470](https://wetten.overheid.nl/BWBR0016470/), [BWBR0017076](https://wetten.overheid.nl/BWBR0017076/), [BWBR0017533](https://wetten.overheid.nl/BWBR0017533/)
: or e.g. [BWBR0014295](https://wetten.overheid.nl/BWBR0014295/) which seems to just say "permits as meant by BWBR0013889 should be done via a form, and look in the Staatscourant for the actual form we're talking about"

* and some things that are arguably in between, like Goedkeuringsbesluiten
: e.g. [BWBR0017028](https://wetten.overheid.nl/BWBR0017028/), [BWBR0017276](https://wetten.overheid.nl/BWBR0017276/)
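
If you want to set such entries aside, one crude option is to flag them on the name fragments listed above plus text length. The sketch below is only a heuristic under those assumptions: the function name is hypothetical, the 500-character threshold is an arbitrary placeholder, and the sample titles are made up.

```python
import re

# Name fragments from the list above; matching on them is a heuristic, not metadata
MODIFYING_WORDS = ('wijzigingswet', 'veegwet', 'herzieningswet',
                   'aanpassingswet', 'intrekkingswet', 'verzamelwet')
_MODIFYING_RE = re.compile('|'.join(MODIFYING_WORDS), re.IGNORECASE)

def small_entry_signals(intitule, text, min_chars=500):
    """Return two rough signals that an entry may be a 'small' one.
    min_chars is an arbitrary placeholder, tune it against real data."""
    return {
        'modifying_name': bool(_MODIFYING_RE.search(intitule or '')),
        'short_text':     len(text.strip()) < min_chars,
    }

print(small_entry_signals('Veegwet voorbeeld', 'Artikel I ...'))
print(small_entry_signals('Wet open overheid', 'x' * 1000))
```

Returning the signals separately, rather than a single yes/no, leaves the decision of what to exclude to whoever builds the dataset.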

## Fetching

In [1]:
import re, collections, datetime, pprint, random

import wetsuite.helpers.etree
import wetsuite.helpers.notebook
import wetsuite.helpers.date
import wetsuite.helpers.localdata
import wetsuite.helpers.koop_parse
import wetsuite.datacollect.koop_sru 
import wetsuite.datasets

In [2]:
# contains toestand, manifest, and wti downloads
bwb_fetched = wetsuite.helpers.localdata.LocalKV( 'bwb_fetched.db', str, bytes )

In [3]:
def bwb_search_callback( search_record_node ):
    ''' Takes a single XML search record node, does something useful with it.
        Specifically:
        - fetches the toestand XML it refers to if we didn't have it already
        - if we fetched a new toestand, also re-fetches the manifest and WTI (it would be updated too)

        Notes
        - BWB records follow http://standaarden.overheid.nl/sru/gzd.xsd
          Right now we merge all the parts of a record into one dict, 
            which throws away some structure (on top of the already removed namespaces)
            but is easier to deal with.
    '''
    #display( wetsuite.helpers.etree.debug_color( search_record_node ) ) # debug line for later reference, if you want to extract more out of these search records
    meta_dict = wetsuite.helpers.koop_parse.bwb_searchresult_meta( search_record_node )

    # fetch toestand XML, if we don't have it already
    _, toestand_came_from_cache = wetsuite.helpers.localdata.cached_fetch( bwb_fetched,  meta_dict['locatie_toestand'],  force_refetch=False )

    # If we got a toestand we didn't previously have, assume their manifest and WTI probably changed, so need to be re-fetched
    #   (note: this implementation would overdo it, in that if we see _multiple_ new versions, we force this refetch each time. TODO: be cleverer than that)
    force_refetch_meta = (not toestand_came_from_cache)
    _, man_cached = wetsuite.helpers.localdata.cached_fetch( bwb_fetched,  meta_dict['locatie_manifest'],  force_refetch=force_refetch_meta )
    _, wti_cached = wetsuite.helpers.localdata.cached_fetch( bwb_fetched,  meta_dict['locatie_wti'],       force_refetch=force_refetch_meta )

    if (not toestand_came_from_cache or not man_cached or not wti_cached): # fetched anything new? Mention that.
        print( "FETCHED new data for %s - %r"%( meta_dict['identifier'], meta_dict ) )
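

The refetch logic in that callback can be tried out in isolation. The sketch below mimics it with a plain dict and a fake fetch function. It follows the `(data, came_from_cache)` return convention seen in the calls above, but it is not wetsuite's implementation, and all names in it are made up for illustration.

```python
def cached_fetch_sketch(store, url, fetch, force_refetch=False):
    """Return (data, came_from_cache), calling fetch only when needed."""
    if not force_refetch and url in store:
        return store[url], True
    data = fetch(url)
    store[url] = data
    return data, False

def update_one(store, fetch, toestand_url, manifest_url, wti_url):
    """Fetch a toestand; if it was new, also force-refetch its manifest and WTI.
    Returns True if anything was fetched."""
    _, toestand_cached = cached_fetch_sketch(store, toestand_url, fetch)
    refetch_meta = not toestand_cached
    _, man_cached = cached_fetch_sketch(store, manifest_url, fetch, force_refetch=refetch_meta)
    _, wti_cached = cached_fetch_sketch(store, wti_url,      fetch, force_refetch=refetch_meta)
    return not (toestand_cached and man_cached and wti_cached)

calls = []
def fake_fetch(url):
    """Stand-in for a network fetch, records which URLs were requested."""
    calls.append(url)
    return b'<xml/>'

store = {}
first  = update_one(store, fake_fetch, 'toestand-1', 'manifest', 'wti')  # nothing cached yet
second = update_one(store, fake_fetch, 'toestand-1', 'manifest', 'wti')  # everything cached
third  = update_one(store, fake_fetch, 'toestand-2', 'manifest', 'wti')  # new version appears
print(first, second, third, len(calls))
```

The third call shows the overdoing mentioned in the comments above: every new toestand triggers a manifest and WTI refetch, even when several new versions of the same BWB-id arrive in one run.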

In [4]:
# This is intended as an "update with recent changes" run  (we previously did a lot more fetching)
sru_bwb       = wetsuite.datacollect.koop_sru.BWB(  )
some_time_ago = wetsuite.helpers.date.date_weeks_ago(10)
_ = sru_bwb.search_retrieve_many('dcterms.modified >= %s'%( some_time_ago.strftime('%Y-%m-%d') ), up_to=20000, at_a_time=500, callback=bwb_search_callback)

## Take that downloaded store, extract useful things into datasets

CONSIDER: smaller subset to start with, e.g. just 2023
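
One way to get such a subset is to filter on the date embedded in each toestand URL. This is a minimal sketch that assumes the URL layout seen in the examples in this notebook; `toestand_year` is a hypothetical helper, and the sample list reuses URLs that already appear in this notebook (hence the year 2005 rather than 2023).

```python
import re

# Matches e.g. .../bwb/BWBR0001840/2002-03-21_0/xml/... and captures the year
_TOESTAND_RE = re.compile(r'/bwb/BWBR[0-9]{7}/([0-9]{4})-[0-9]{2}-[0-9]{2}_[0-9]+/xml/')

def toestand_year(url):
    """Year from the date in a toestand URL, or None if the URL is not a toestand."""
    match = _TOESTAND_RE.search(url)
    return int(match.group(1)) if match else None

urls = [
    'https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0001840/2002-03-21_0/xml/BWBR0001840_2002-03-21_0.xml',
    'https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0017744/2005-01-01_0/xml/BWBR0017744_2005-01-01_0.xml',
    'https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0019805/manifest.xml',
]

subset = [url for url in urls if toestand_year(url) == 2005]  # use 2023 for the subset suggested above
print(subset)
```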

In [5]:
# go through all fetched URLs and group
# - manifest
# - wti
# - all toestanden
# ...per BWB-id.
# We assume the URL structure is consistent, which it seems to be.

bwbr_groups = collections.defaultdict(dict)  #  bwbr -> { toestanden:   latest_toestand:    wti:    manifest:  }

print("Grouping relevant URLs")

for url in wetsuite.helpers.notebook.ProgressBar( bwb_fetched.keys() ):

    # both filters for basic URLs we care about at all (in case other things got dropped in),
    # and filters for URLs with BWBR  - which implies skipping BWBV (verdragen/treaties), BWBW (?)
    # (the matching here and below is a little hacky, though, clean up?)
    bwbr = re.search('/bwb/(BWBR[0-9]{7})', url)
    if bwbr is not None:
        bwbr = bwbr.groups()[0] # the BWBR-and-number text

        if url.endswith('manifest.xml'): # e.g. https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0019805/manifest.xml
            bwbr_groups[bwbr]['manifest_url'] = url
            continue

        if url.endswith('.WTI'):         # e.g.  https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0016700/BWBR0016700.WTI
            bwbr_groups[bwbr]['wti_url'] = url
            continue

        toestand_match =  re.search('/bwb/(BWBR[0-9]{7})(/[0-9].*[.]xml)', url) 
        if toestand_match is not None: # e.g. #https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0001840/2002-03-21_0/xml/BWBR0001840_2002-03-21_0.xml
            _, sortname = toestand_match.groups() # assume that date is lexically sortable
            # those will be something like 'BWBR0001821'  and  '/1998-01-01_0/xml/BWBR0001821_1998-01-01_0.xml'
            if 'toestanden' not in bwbr_groups[bwbr]:
                bwbr_groups[bwbr]['toestanden'] = []
            bwbr_groups[bwbr]['toestanden'].append( (sortname,url) )
            continue
        
        print( "SKIP / LOOKAT   %s"%url )


print( 'We have %d Unique BWB-id groups'%len(bwbr_groups) )

print( "Finding latest versions of each" )
for bwbr, details in wetsuite.helpers.notebook.ProgressBar( list( bwbr_groups.items() ) ): # within each BWB-id
    toestanden = details.get('toestanden', []) # a group may hold only a manifest/WTI and no toestand
    if len(toestanden) > 0:
        # sort names start with the date, so the lexically largest one is the latest
        _, details['latest_toestand_url'] = max( toestanden )

Grouping relevant URLs


  0%|          | 0/212330 [00:00<?, ?it/s]

We have 38513 Unique BWB-id groups
Finding latest versions of each


  0%|          | 0/38513 [00:00<?, ?it/s]

In [6]:
# Now do some extraction and also make the datasets
# ...keep in mind that all the extraction could use some refinement.

bwb_latestonly_xml = wetsuite.helpers.localdata.LocalKV( 'bwb-mostrecent-xml.db', str, bytes ) # bwbr -> xmlbytes
bwb_latestonly_xml._put_meta('description_short',
                             'Raw XML for the latest revision from each BWB-id')
bwb_latestonly_xml._put_meta('description','''
Maps from the BWB-id to the XML file as a bytestring, e.g. 

'BWBR0019090' -> b'<?xml version="1.0" encoding="UTF-8"?><toestand xmlns...'

'''+wetsuite.datasets.generated_today_text())

bwb_latestonly_text = wetsuite.helpers.localdata.LocalKV( 'bwb-mostrecent-text.db', str, str )
bwb_latestonly_text._put_meta('description_short',
                             'Plain text for the latest revision from each BWB-id')
bwb_latestonly_text._put_meta('description','''
Maps from the BWB-id to plain text without any of the structure.
                               
'BWBR0025942': 'De bij dit besluit gevoegde ‘ selectielijst voor de neerslag...'
                              
'''+wetsuite.datasets.generated_today_text())

bwb_latestonly_meta = wetsuite.helpers.localdata.MsgpackKV( 'bwb-mostrecent-meta-struc.db', str, None )
bwb_latestonly_meta._put_meta('description_short',
                             'Metadata structure text for the latest revision from each BWB-id')
bwb_latestonly_meta._put_meta('description','''
Maps from the BWB-id to metadata,
where that metadata comes from the toestand itself, from the manifest file, and from the wettechnische informatie (WTI) file. 

For example:                              
                              
'BWBR0017744':{
  'bwb-id': 'BWBR0017744',
  'intitule': 'Regeling inzake de bijdragen van de gebruikers in de kosten van de landelijk raadpleegbare deelverzameling GBA (Bijdragenregeling LRD)',
  'citeertitel': 'Bijdragenregeling LRD ',
  'soort': 'ministeriele-regeling',
  'inwerkingtredingsdatum': '2005-01-01',
  'latest_toestand_url': 'https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0017744/2005-01-01_0/xml/BWBR0017744_2005-01-01_0.xml',
  'wti_url': 'https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0017744/BWBR0017744.WTI',
  'wti': {'algemene_informatie': {'citeertitels_withdate': [['2005-01-01',
      '9999-12-31',
      'Bijdragenregeling LRD ']],
    'citeertitels_distinct': ['Bijdragenregeling LRD '],
    'eerstverantwoordelijke': 'Binnenlandse Zaken en Koninkrijksrelaties',
    'identificatienummer': 'BWBR0017744',
    'rechtsgebieden': [['Openbare orde en veiligheidsrecht', None]],
    'overheidsdomeinen': ['Openbare orde en veiligheid']},
   'related': [['grondslagen',
     'BWBR0006933',
     'jci1.3:c:BWBR0006933&artikel=6',
     'Artikel 6, achtste lid'],
  ...

'''+wetsuite.datasets.generated_today_text())


print("Writing latest-toestand-XML dataset")

for bwbr, details in wetsuite.helpers.notebook.ProgressBar( bwbr_groups.items() ): # within each BWB-id
    if 'latest_toestand_url' not in details: # no toestand was fetched for this BWB-id, so nothing to store
        continue
    bwb_latestonly_xml.put(bwbr, bwb_fetched.get( details['latest_toestand_url'] ), commit=False) # postponed commit makes this much faster
bwb_latestonly_xml.commit()


print("Parsing further metadata, writing meta and text datasets")

#for bwbr, details in wetsuite.helpers.notebook.ProgressBar( random.sample( list(bwbr_groups.items()), 100) ): # debug: test on a few
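for bwbr, details in wetsuite.helpers.notebook.ProgressBar( bwbr_groups.items() ): # within each BWB-id
    if 'latest_toestand_url' not in details: # no toestand was fetched for this BWB-id, so nothing to parse
        continue

    toestand_tree = wetsuite.helpers.etree.fromstring( bwb_fetched.get( details['latest_toestand_url'] ) )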
    text          = wetsuite.helpers.koop_parse.bwb_toestand_text(toestand_tree)

    meta_dict     = wetsuite.helpers.koop_parse.bwb_toestand_usefuls(toestand_tree)

    meta_dict['latest_toestand_url'] = details['latest_toestand_url']

    wti_url       = details.get('wti_url')       # None if no WTI was fetched for this BWB-id
    if wti_url is not None:
        meta_dict['wti_url']       = wti_url
        wti_tree                   = wetsuite.helpers.etree.fromstring( bwb_fetched.get( wti_url ) )
        meta_dict['wti']           = wetsuite.helpers.koop_parse.bwb_wti_usefuls(wti_tree)

    manifest_url  = details.get('manifest_url')  # None if no manifest was fetched for this BWB-id
    if manifest_url is not None:
        meta_dict['manifest_url']  = manifest_url
        manifest_tree              = wetsuite.helpers.etree.fromstring( bwb_fetched.get( manifest_url ) )
        meta_dict['manifest']      = wetsuite.helpers.koop_parse.bwb_manifest_usefuls(manifest_tree)

        # redundant, but sometimes nice to have more accessible
        version_dates = []
        for expression in manifest_tree.findall('expression'):
            version_dates.append( expression.find('metadata/datum_inwerkingtreding').text )
        meta_dict['version_dates'] = version_dates # set once, after the loop (also when there are no expressions)

    bwb_latestonly_text.put( bwbr, text      )
    bwb_latestonly_meta.put( bwbr, meta_dict )

Writing latest-toestand-XML dataset


  0%|          | 0/38097 [00:00<?, ?it/s]

Parsing further metadata, writing meta and text datasets


  0%|          | 0/38097 [00:00<?, ?it/s]