<a href="https://colab.research.google.com/github/WetSuiteLeiden/data-collection/blob/masterapi_koop_repos.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## This notebook's goal

Showing you how to address KOOP's SRU repositories, to access BWB, CVDR, and some others,
aiming to get machine-readable data and/or human-readable documents.

Whether you will need this depends on your needs.
- if you can find the items you need via the respective website (wetten.overheid.nl and e.g. lokaleregelgeving.overheid.nl), then this adds nothing for you
- if you need a multitude of results, then this may be much more effective than clicking 'save as' many times
- if you want to do your own experiments with the text and/or metadata, you may wish to create partial or even whole copies, and this is one way to do that

If you use something like colab you don't even need to install anything on your own computer.

That does _not_ take away that this is on the low-level technical side, 
so you need to be comfortable with python 
and prepared to work with a new query language,
and it can't hurt to think about bulk requests and perhaps caching.

This may be a little overwhelming, and if you mainly want to experiment with metadata/text,
**you may well prefer to load datasets we made** (based on a lot of fetching we did earlier).

In [None]:
# For local installs you can install the wetsuite package once.  
# In colab you get a disposable environment each time,  and will have to start with this install each time. 
!pip3 --quiet install -U wetsuite

In [1]:
# imports we'll be using
import pprint, datetime, random

import dateutil.parser

import wetsuite.datasets
import wetsuite.datacollect.koop_sru
import wetsuite.helpers.koop_parse
import wetsuite.helpers.date

## On SRU

SRU ([Search/Retrieval via URL](https://en.wikipedia.org/wiki/Search/Retrieve_via_URL)) was created as a search API with simple, standardized formats and interchange, so it is easier to implement than specialized protocols.

You could _almost_ use it without a library, but not quite.

It has just two operations:
* `explain` - "hello server, please describe yourself"
* `searchRetrieve` - "I would like result items for this query please"

Exactly how flexible a search can be depends on the choices made in the server you are talking to: it varies how many metadata fields are exposed usefully and which query operations are supported, which is what `explain` summarizes. 
So some servers will let you do "I want all documents belonging to dossier X edited in the last year that contain the word Y in their text", others little more than "I want recent documents".
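
To make the two operations concrete, here is a minimal sketch of what such requests look like as URLs. The base URL and parameter names are taken from the request URLs that appear elsewhere in this notebook; `build_sru_url` is a hypothetical helper for illustration, not part of wetsuite, and it only builds the URL without fetching anything.

```python
from urllib.parse import urlencode, quote

# Base URL as it appears in this notebook's explain output; treat it as an assumption that it stays valid.
SRU_BASE = 'http://zoekservice.overheid.nl/sru/Search'

def build_sru_url(connection, operation, query=None, start_record=1, maximum_records=10):
    ' Hypothetical helper: build the URL for an SRU explain or searchRetrieve request. '
    params = {'version': '1.2', 'x-connection': connection, 'operation': operation}
    if operation == 'searchRetrieve':
        if query is None:
            raise ValueError('searchRetrieve needs a query')
        params['startRecord'] = start_record        # 1-based offset, used for paging
        params['maximumRecords'] = maximum_records  # page size
        params['query'] = query
    # quote_via=quote encodes spaces as %20 rather than +
    return SRU_BASE + '?' + urlencode(params, quote_via=quote)

explain_url = build_sru_url('BWB', 'explain')
search_url = build_sru_url('BWB', 'searchRetrieve', query='dcterms.modified >= 2024-01-01')
print(explain_url)
print(search_url)
```

Fetching many results then comes down to repeating the `searchRetrieve` request with an increasing `start_record`, which is the kind of bookkeeping the class introduced next takes care of.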

We provide a basic class to interact with the repositories we are most interested in.
We'll introduce its functions by example, and programmers may also care to read e.g. the output of `help( wetsuite.datacollect.koop_sru.BWB() )`.

# Searching KOOP's repositories via their SRU API

KOOP's SRU repositories give access to the data behind `wetten.overheid.nl`, `lokaleregelgeving.overheid.nl`, and others.

## Example: Query basics, on the BWB (Basis WettenBestand) 

KOOP's BWB repository can be seen as the data equivalent of https://wetten.overheid.nl

There is some [technical introduction to searching the BWB with SRU](https://www.overheid.nl/sites/default/files/pdf/Handleiding%2BSRU%2BBWB.pdf).

Notably, this route to the BWB does not seem to allow searching the body text ([wetten.overheid.nl](https://wetten.overheid.nl/) does).
As such, this may mainly be useful for [known-item searches](https://en.wikipedia.org/wiki/Known-item_search), date ranges, and such.

In [9]:
sru_bwb = wetsuite.datacollect.koop_sru.BWB()  # object that knows where to fetch from and how

In [10]:
# Ask for a self-description of the API. 
# This is mainly useful to figure out the names of indices that you can search in (e.g. titel, modified, etc.)
pprint.pprint( sru_bwb.explain_parsed() )

{'database/numRecs': '135488',
 'description': 'Gemeenschappelijke zoekdienst van overheid.nl voor BWB Online',
 'explain_url': 'http://zoekservice.overheid.nl/sru/Search?&version=1.2&x-connection=BWB&operation=explain',
 'extent': 'Dutch national legislation',
 'host': 'zoekservice.overheid.nl',
 'indices': [('dcterms', 'identifier'),
             ('dcterms', 'modified'),
             ('dcterms', 'type'),
             ('overheid', 'authority'),
             ('overheidbwb', 'rechtsgebied'),
             ('overheidbwb', 'overheidsdomein'),
             ('overheidbwb', 'onderwerpVerdrag'),
             ('overheidbwb', 'titel'),
             ('overheidbwb', 'afkorting'),
             ('overheidbwb', 'wetsfamilie'),
             ('overheidbwb', 'geldigheidsdatum'),
             ('overheidbwb', 'zichtdatum'),
             ('overheidbwb', 'bekendmaking'),
             ('overheidbwb', 'dossiernummer')],
 'port': '80',
 'sets': [('dcterms',
           'http://purl.org/dc/terms/',
           'I

**The query syntax** is Common Query Language.

You can get fairly far copying examples, or guessing variations based on them. 

For the more technically minded:
- parts of a query are ***`indexname operator term`***, e.g. 
  - `overheidbwb.titel = kip` 
  - `dcterms.modified > 2022-01-01`
  - `body any woning` 
  - use doublequotes when there's a space or one of `<>=/()` in the term, e.g. `dcterms.title = "wet kip"`
- the operators supported vary per field and per server (some servers get fancy, many do not), so unless you stick to the very basics, check the server's `explain` and/or its documentation
  - dates and numbers mostly have: `<`, `<=`, `>`, `>=`, `=` 
  - text fields often have most of:
    - `any`:      you can see `body any "foo bar"`  as short for   `body any foo  OR  body any bar`
    - `all`:      you can see `body all "foo bar"`  as short for   `body any foo  AND  body any bar`
    - `==`:      exact match
    - `exact`:    exact match
    - `adj`:      exact phrase search - these words should appear adjacent as specified
    - `=`:        server choice, e.g. for text might be `==` or `adj` if present
- you can combine multiple of those `index operator term` chunks, using AND and OR, and brackets, see e.g. the CVDR example below

**Some details vary per server, and per repository in it**, e.g. 
  - which indexes (mostly named fields) are searchable, and what they are called 
    - you can do an `explain` to find out.
  - how search results actually point to the actual documents that they describe
  - there may be shorthands for index names, e.g. BWB allows 'titel' meaning 'overheidbwb.titel'
  - The more detailed your question, the more repository details you have to figure out. We try to provide helper functions.
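
As a minimal sketch of the quoting and combining rules above, the following builds query strings from parts. The helper names (`cql_quote`, `cql_clause`, `cql_and`) are hypothetical and not part of wetsuite, and backslash-escaping of embedded doublequotes is an assumption based on the CQL specification rather than something verified against these servers.

```python
import re

def cql_quote(term):
    ' Wrap a term in doublequotes when it is empty or contains whitespace, a quote, a backslash, or one of <>=/() '
    term = str(term)
    if term == '' or re.search(r'[\s<>=/()"\\]', term):
        # assumption: embedded backslashes and doublequotes are escaped with a backslash
        return '"' + term.replace('\\', '\\\\').replace('"', '\\"') + '"'
    return term

def cql_clause(index, operator, term):
    ' One "indexname operator term" chunk '
    return '%s %s %s' % (index, operator, cql_quote(term))

def cql_and(*clauses):
    ' Combine chunks with AND, bracketing each so that precedence is explicit '
    return ' AND '.join('(%s)' % clause for clause in clauses)

query = cql_and(
    cql_clause('overheidbwb.titel', 'any', 'wet kip'),
    cql_clause('dcterms.modified', '>=', '2022-01-01'),
)
print(query)
# (overheidbwb.titel any "wet kip") AND (dcterms.modified >= 2022-01-01)
```

Whether a given server accepts a particular combination still depends on its indices and supported operators, so a string built like this is only a candidate query until the server has accepted it.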

In [13]:
# we will try a few queries, and repeat "now print a brief summary of a result", 
#  so let's define the following to reuse it:
def print_bwb_results(records):
    ' takes a list of etree objects '
    for i, record in enumerate( records ): 
        print('\n***  Record %d of %d  ***'%(i+1, sru_bwb.num_records()))
        # each record is an ElementTree style object - which is clunky whenever you just want to pick out a few values to show
        #   we provide functions that parse that into python data structures, in this case: 
        meta = wetsuite.helpers.koop_parse.bwb_searchresult_meta(record)
        pprint.pprint(meta)

### Known-item search

In [14]:
# Other examples

# Grondwet
#print_bwb_results( sru_bwb.search_retrieve_many( 'dcterms.identifier==BWBR0001840', up_to=5 ) )

# Reglement verkeersregels en verkeerstekens, e.g. if you wanted to see how images work
#print_bwb_results( sru_bwb.search_retrieve_many( 'dcterms.identifier==BWBR0004825', up_to=5 ) )


print_bwb_results( sru_bwb.search_retrieve_many( 'dcterms.identifier = BWBR0045754', up_to=5 ) ) 
# up_to=5:  if we happen to have a lot of results, don't fetch and print all



***  Record 1 of 4  ***
{'authority': 'Binnenlandse Zaken en Koninkrijksrelaties',
 'created': '2022-05-01',
 'creator': 'Ministerie van Binnenlandse Zaken en Koninkrijksrelaties',
 'geldigheidsperiode_einddatum': '2022-07-31',
 'geldigheidsperiode_startdatum': '2022-05-01',
 'identifier': 'BWBR0045754',
 'language': 'nl',
 'locatie_manifest': 'https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0045754/manifest.xml',
 'locatie_toestand': 'https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0045754/2022-05-01_0/xml/BWBR0045754_2022-05-01_0.xml',
 'locatie_wti': 'https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0045754/BWBR0045754.WTI',
 'modified': '2023-02-01',
 'overheidsdomein': 'Overheid, bestuur en koninkrijk',
 'rechtsgebied': 'Bestuursrecht',
 'title': 'Wet open overheid',
 'toestand': 'http://wetten.overheid.nl/id/BWBR0045754/2022-05-01/0',
 'type': 'wet',
 'zichtperiode_einddatum': '9999-12-31',
 'zichtperiode_startdatum': '2022-05-01'}

***  Rec

It turns out that each version of a law over time gets its own search result.

This also means that, to get the current version, you might want to filter on things
such as `geldigheidsperiode_einddatum` (we do something similar in the CVDR example below), or perhaps pick it yourself based on the metadata of what is often just a handful of results.
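
Doing that selection yourself could look roughly like the sketch below, which works on metadata dicts shaped like the ones printed above. `versions_valid_on` is a hypothetical helper, the sample records are made up (with a placeholder identifier), and it assumes that the end date is inclusive and that an open-ended period is written as `9999-12-31`, as the `zichtperiode_einddatum` values above suggest.

```python
import datetime

def versions_valid_on(metas, day):
    ' Keep the metadata dicts whose geldigheidsperiode includes the given date. '
    ret = []
    for meta in metas:
        start = datetime.date.fromisoformat(meta['geldigheidsperiode_startdatum'])
        end = datetime.date.fromisoformat(meta['geldigheidsperiode_einddatum'])
        if start <= day <= end:  # assumption: both ends are inclusive
            ret.append(meta)
    return ret

# Made-up sample records, only to show the shape of the data
sample = [
    {'identifier': 'BWBR0000000',
     'geldigheidsperiode_startdatum': '2022-05-01', 'geldigheidsperiode_einddatum': '2022-07-31'},
    {'identifier': 'BWBR0000000',
     'geldigheidsperiode_startdatum': '2022-08-01', 'geldigheidsperiode_einddatum': '9999-12-31'},
]

current = versions_valid_on(sample, datetime.date(2024, 1, 1))
print([meta['geldigheidsperiode_startdatum'] for meta in current])
# ['2022-08-01']
```

Passing `datetime.date.today()` instead of a fixed date gives the version that is valid now, and any other date gives the version that applied at that time.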

### Title search

In [15]:
print_bwb_results( sru_bwb.search_retrieve_many( 'overheidbwb.titel any textiel', up_to=5 ) )


***  Record 1 of 6  ***
{'authority': 'Volksgezondheid, Welzijn en Sport',
 'created': '2015-07-02',
 'creator': 'Ministerie van Binnenlandse Zaken en Koninkrijksrelaties',
 'geldigheidsperiode_einddatum': '2022-04-13',
 'geldigheidsperiode_startdatum': '2001-04-13',
 'identifier': 'BWBR0012348',
 'language': 'nl',
 'locatie_manifest': 'https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0012348/manifest.xml',
 'locatie_toestand': 'https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0012348/2001-04-13_0/xml/BWBR0012348_2001-04-13_0.xml',
 'locatie_wti': 'https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0012348/BWBR0012348.WTI',
 'modified': '2022-04-15',
 'overheidsdomein': 'Economie en ondernemen',
 'rechtsgebied': 'Ondernemingspraktijk',
 'title': 'Warenwetbesluit formaldehyde in textiel',
 'toestand': 'http://wetten.overheid.nl/id/BWBR0012348/2001-04-13/0',
 'type': 'AMvB',
 'zichtperiode_einddatum': '9999-12-31',
 'zichtperiode_startdatum': '2001-04-1

### Changes this year

In [18]:
first_of_this_year = wetsuite.helpers.date.date_first_day_in_year( datetime.date.today().year )

query = 'dcterms.modified >= %s'%wetsuite.helpers.date.yyyy_mm_dd( first_of_this_year )
print(query)
print_bwb_results( sru_bwb.search_retrieve_many( query, up_to=5 ) )

dcterms.modified >= 2024-01-01

***  Record 1 of 11534  ***
{'authority': 'Veiligheid en Justitie',
 'created': '2015-07-01',
 'creator': 'Ministerie van Binnenlandse Zaken en Koninkrijksrelaties',
 'geldigheidsperiode_einddatum': '2002-06-30',
 'geldigheidsperiode_startdatum': '2002-01-01',
 'identifier': 'BWBR0001827',
 'language': 'nl',
 'locatie_manifest': 'https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0001827/manifest.xml',
 'locatie_toestand': 'https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0001827/2002-01-01_0/xml/BWBR0001827_2002-01-01_0.xml',
 'locatie_wti': 'https://repository.officiele-overheidspublicaties.nl/bwb/BWBR0001827/BWBR0001827.WTI',
 'modified': '2024-04-27',
 'overheidsdomein': 'Rechtspraak',
 'rechtsgebied': 'Burgerlijk procesrecht',
 'title': 'Wetboek van Burgerlijke Rechtsvordering (geldt in geval van '
          'digitaal procederen)',
 'toestand': 'http://wetten.overheid.nl/id/BWBR0001827/2002-01-01/0',
 'type': 'wet',
 'zichtperi

## CVDR
The CVDR repository can be seen as the data equivalent of https://lokaleregelgeving.overheid.nl

Checking what we can search:

In [19]:
sru_cvdr = wetsuite.datacollect.koop_sru.CVDR()

pprint.pprint( sru_cvdr.explain_parsed() ) # seeing which indexes are here. 
# This one has a more complex information model, so you can dig a little deeper to see what you can do with it.

{'database/numRecs': '292130',
 'description': 'Gemeenschappelijke zoekdienst van overheid.nl voor Centrale '
                'Voorziening Decentrale Regelgeving',
 'explain_url': 'http://zoekservice.overheid.nl/sru/Search?&version=1.2&x-connection=cvdr&operation=explain',
 'extent': 'Lokale regelingen of the Dutch government',
 'host': 'zoekservice.overheid.nl',
 'indices': [('dcterms', 'identifier'),
             ('dcterms', 'title'),
             ('dcterms', 'language'),
             ('dcterms', 'creator'),
             ('dcterms', 'modified'),
             ('dcterms', 'isFormatOf'),
             ('dcterms', 'alternative'),
             ('dcterms', 'source'),
             ('dcterms', 'isRatifiedBy'),
             ('dcterms', 'subject'),
             ('dcterms', 'issued'),
             (None, 'workid'),
             (None, 'bronformaat'),
             (None, 'organisatieType'),
             (None, 'sorteerTitel'),
             (None, 'gemeente'),
             (None, 'provincie'),
   

### Damocles

There is a continuation in the wetsuite-notebooks repository that builds on this to try to find implementations of the [Damocleswet](https://nl.wikipedia.org/wiki/Wet_Damocles).

## Other related code

In [20]:
# there are a bunch of helper functions to help you deal with search results (e.g. parsing metadata and identifiers) 
# ...and to some degree the documents.  One or two are used above.    
# TODO: document, explain, demonstrate more

# there are also some more specific tools, like:

# "given a CVDR work id (or specific expression ID implying the work), find all known expression IDs for that work ID"
wetsuite.helpers.koop_parse.cvdr_versions_for_work( 'CVDR165982' ) 
#   will also accept expression IDs, e.g. CVDR165982_1, which it treats as its work ID.
#   Note that this does a search, so will not be fast to do for a large list of them.

['CVDR165982_1', 'CVDR165982_2']
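
The distinction between work IDs and expression IDs can be sketched with plain string handling. This assumes that identifiers always look like `CVDR` plus digits, optionally followed by an underscore and a version number, which is inferred only from the examples above; `split_cvdr_id` is a hypothetical helper and not the wetsuite implementation.

```python
import re

def split_cvdr_id(cvdr_id):
    ' Split "CVDR165982_1" into ("CVDR165982", 1). A bare work ID gives (work_id, None). '
    match = re.fullmatch(r'(CVDR\d+)(?:_(\d+))?', cvdr_id.strip())
    if match is None:
        raise ValueError('does not look like a CVDR id: %r' % cvdr_id)
    work_id, version = match.groups()
    return work_id, (int(version) if version is not None else None)

print(split_cvdr_id('CVDR165982_2'))   # ('CVDR165982', 2)
print(split_cvdr_id('CVDR165982'))     # ('CVDR165982', None)

# e.g. picking the highest-numbered expression from a list like the one returned above
versions = ['CVDR165982_1', 'CVDR165982_2']
latest = max(versions, key=lambda expression_id: split_cvdr_id(expression_id)[1])
print(latest)                          # CVDR165982_2
```

Comparing the version as an integer rather than as text matters once a work has ten or more expressions, because `'CVDR165982_10'` sorts before `'CVDR165982_2'` as a string.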

# Officiele publicaties

There is some more technical detail in https://www.koopoverheid.nl/binaries/koop/documenten/instructies/2021/02/09/handleiding-voor-het-uitvragen-van-de-collectie-officiele-publicaties/Handleiding+SRU2.0+v1.2+28052021.pdf

## PLOOI

https://kia.pleio.nl/file/download/e7fad70c-b2f6-4fd3-ac85-e3b8d0cd9ead/plooi-technische-handreiking.pdf

https://kia.pleio.nl/file/download/a56145c5-89be-4445-8e10-ecbc5458c895/plooi-handreiking-voor-informatie.pdf

In [None]:
sru_plooi = wetsuite.datacollect.koop_sru.PLOOI( verbose=False )
pprint.pprint( sru_plooi.explain_parsed() ) # seeing which indexes are here. 

In [None]:
def handle_plooi_record(rec):
    print( rec )


#sru_plooi.search_retrieve_many( 'plooi.informatiecategorie any Wob', up_to=5, callback=handle_plooi_record )

sru_plooi.search_retrieve_many( 'dcterms.type = "beslissing op verzoek art. 3 Wob"', up_to=5, callback=handle_plooi_record )


# It seems that individual results are documents that can be part of a larger request, e.g. 
#  https://open.overheid.nl/Details/ronl-0347ff50b0d03b10060fe4bf242431d97d85a3ad/1
#  https://open.overheid.nl/Details/ronl-5719be0c0a840f7c63413cf1c00d8d5eab3177c1/1
# are part of
#  https://open.overheid.nl/Details/ronl-6d82bce1a0afdcbc0d5640b9992bac1631d830c5/1

# The inventaris also refers to the things already public.



# When things cross sites

If you've stared at the raw XML of SRU responses then you might have noticed that, 
when search results point at the documents they describe,
they link to different places. 


For example, while some seem like singular collections...
* [BWB (SRU) seems to link to](http://zoekservice.overheid.nl/sru/Search?&version=1.2&x-connection=BWB&operation=searchRetrieve&startRecord=1&maximumRecords=500&query=dcterms.modified%20%3E%3D%202024-04-04) a few related files under `https://repository.officiele-overheidspublicaties.nl/bwb/`

* [CVDR (SRU) seems to link to](http://zoekservice.overheid.nl/sru/Search?&version=1.2&x-connection=cvdr&operation=searchRetrieve&query=dcterms.modified%20%3E%3D%202024-03-16) a few related files (HTML and XML for each?) under e.g. `https://repository.officiele-overheidspublicaties.nl/cvdr/`


...beyond that, links frequently cross sites a little more:
* [POD (SRU) seems to link to](https://repository.overheid.nl/sru?&version=1.2&x-connection=pod&operation=searchRetrieve&startRecord=1&maximumRecords=10&query=dt.title=%20water)
  - `https://zoek.officielebekendmakingen.nl/wsb-2015-3811.html` but also, sort-of-equivalently, to:
  - `https://repository.overheid.nl/frbr/officielepublicaties/wsb/2015/wsb-2015-3811/` (a set of documents under)

* [OEP (SRU) seems to link to](https://zoek.officielebekendmakingen.nl/sru/Search?version=1.2&operation=searchRetrieve&x-connection=oep&startRecord=1&maximumRecords=10&query=title=%20algemene) both e.g.:
  - `https://zoek.officielebekendmakingen.nl/gmb-2024-216934.html` but also, mostly-equivalently, to:
  - `ftps://bestanden.officielebekendmakingen.nl/2024/05/17/gmb/gmb-2024-216934/`

* The [zoek.officielebekendmakingen.nl](https://zoek.officielebekendmakingen.nl) site links
  - mainly to itself, e.g. https://zoek.officielebekendmakingen.nl/gmb-2024-213033 to https://zoek.officielebekendmakingen.nl/gmb-2024-213033.pdf
  - but sometimes also to FRBR, [e.g. for kamerstukken](https://zoek.officielebekendmakingen.nl/0000221773)

* The wetten.overheid.nl site [links to bekendmakingen](https://wetten.overheid.nl/BWBR0002303/1959-01-01/0/informatie#tab-wijzigingenoverzicht) on zoek.officielebekendmakingen.nl (to the [info page](https://zoek.officielebekendmakingen.nl/stb-1958-557), not a specific document)
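
If you want to see for yourself how much a set of result links crosses sites, grouping them by scheme and host is a quick check. The sketch below uses only the example links quoted in the list above and the standard library.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# the example links from the POD and OEP items above
links = [
    'https://zoek.officielebekendmakingen.nl/wsb-2015-3811.html',
    'https://repository.overheid.nl/frbr/officielepublicaties/wsb/2015/wsb-2015-3811/',
    'https://zoek.officielebekendmakingen.nl/gmb-2024-216934.html',
    'ftps://bestanden.officielebekendmakingen.nl/2024/05/17/gmb/gmb-2024-216934/',
]

by_site = defaultdict(list)
for link in links:
    parts = urlsplit(link)
    by_site[(parts.scheme, parts.hostname)].append(parts.path)

for (scheme, host), paths in sorted(by_site.items()):
    print('%-6s %-40s %d link(s)' % (scheme, host, len(paths)))
```

For real use you would collect the links from the parsed search results rather than typing them in.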

Notes:
* alternative links are not necessarily equivalent. 
  - e.g. for the OEP example, the first is an HTML file, the latter is a directory that also contains, in this case, metadata, a PDF, an ODT, an XML, and a zipped version of that HTML document. 
  - its SFTP URL is _wrong_: it is written as `ftps://`, but FTPS is classic FTP over TLS, a completely different protocol from SFTP, so it will _not_ work as given

* For reference -- these aren't the only places to get that content either. Consider e.g. 
  - `https://repository.overheid.nl/frbr/officielepublicaties/gmb/2024/gmb-2024-216934/` (for the same set of files)
  - `https://repository.overheid.nl/frbr/officielepublicaties/gmb/2024/gmb-2024-216934/1/html/gmb-2024-216934.html` (for the HTML in that set)
  And _different parts_ of the SRU interface (e.g. POD) link to this FRBR store instead
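
The FRBR store paths shown in this section follow a visible pattern: collection, then year, then the full identifier. As a sketch, you could derive the directory URL from an identifier as below. This pattern is inferred from just the two examples here (`wsb-2015-3811` and `gmb-2024-216934`), so it is an assumption that it holds for other collections, and `frbr_directory_url` is a hypothetical helper rather than something wetsuite provides.

```python
import re

FRBR_BASE = 'https://repository.overheid.nl/frbr/officielepublicaties/'

def frbr_directory_url(identifier):
    ' "gmb-2024-216934" -> URL of the directory that holds that publication\'s set of files '
    match = re.fullmatch(r'([a-z]+)-(\d{4})-(\d+)', identifier)
    if match is None:
        raise ValueError('identifier does not have the collection-year-number shape: %r' % identifier)
    collection, year, _number = match.groups()
    return '%s%s/%s/%s/' % (FRBR_BASE, collection, year, identifier)

print(frbr_directory_url('gmb-2024-216934'))
# https://repository.overheid.nl/frbr/officielepublicaties/gmb/2024/gmb-2024-216934/
print(frbr_directory_url('wsb-2015-3811'))
# https://repository.overheid.nl/frbr/officielepublicaties/wsb/2015/wsb-2015-3811/
```

Identifiers with a different shape, such as kamerstukken, raise a `ValueError` here rather than producing a guessed URL.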