<a href="https://colab.research.google.com/github/knobs-dials/wetsuite-dev/blob/main/examples/datacollect_koop_repos.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## This notebook's goal

This is a continuation of the koop_sru_repos notebook in the wetsuite-datacollect repository,
focused on a somewhat more realistic example.

### Damocles
Let's try looking for Amsterdam's policy around the [Wet Damocles](https://nl.wikipedia.org/wiki/Wet_Damocles).

As [the relevant SRU manual](https://data.overheid.nl/sites/default/files/dataset/d0cca537-44ea-48cf-9880-fa21e1a7058f/resources/Handleiding%2BSRU%2B2.0.pdf) mentions in passing, `dt.spatial` refers to where a document applies, and `dt.creator` refers to who is responsible for creating it. For this case we assume they are the same. This repository also lets us write `creator` instead of `dt.creator`, etc., which makes these examples a little more readable.

In [13]:
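import datetime
import pprint
import random

import dateutil.parser   # the parser submodule has to be imported explicitly

import wetsuite.datasets
import wetsuite.datacollect.koop_sru
import wetsuite.helpers.koop_parse
import wetsuite.helpers.net      # only used by the optional text-extraction sketch further down
import wetsuite.helpers.etree    # idem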

In [16]:
sru_cvdr = wetsuite.datacollect.koop_sru.CVDR()

#pprint.pprint( sru_cvdr.explain_parsed() ) # seeing which indexes are here. 
# This one has a more complex information model, so you can dig a little deeper to see what you can do with it.

In [17]:
# we'll be playing with queries, so make 'show results' a minimal amount of typing away
def print_cvdr_results(records):  
    ' takes a list of etree objects, one per record '
    print('fetched %d records\n'%len(records))
    for i, record in enumerate( records ):
        print('***  Record %d of %d  ***'%(i+1, sru_cvdr.number_of_records) )
        meta = wetsuite.helpers.koop_parse.cvdr_meta(record, flatten=True) # flatten smushes down possibly-repeated fields into a single value. Good enough (only) for presentation.
        pprint.pprint( meta )

In [18]:
# See if we can search for amsterdam
results = sru_cvdr.search_retrieve_many( '(creator any Amsterdam) ', up_to=1 )
print_cvdr_results( results ) # show just one of many results, we only check whether it works

fetched 1 records

***  Record 1 of 4166  ***
{'alternatieveIdentifier': '',
 'alternative': 'Verordening op de vastgoedregistratie',
 'betreft': 'nieuwe regeling',
 'creator': 'Amsterdam (overheid:Gemeente)',
 'gedelegeerdeRegelgeving': '<al>Geen</al>',
 'identifier': 'CVDR108223_1',
 'indeling': 'overig',
 'inwerkingtredingDatum': '2008-10-01',
 'isFormatOf': 'Gemeenteblad 2008, afd. 3A, nr. 182/461',
 'isRatifiedBy': 'gemeenteraad (overheid:BestuursorgaanGemeente)',
 'issued': '2008-10-01',
 'kenmerk': 'Gemeenteblad 2008, afd. 1, nr. 461',
 'language': 'nl',
 'modified': '2018-01-30',
 'omgevingswet': 'nee',
 'onderwerp': 'Ruimtelijke ordening, grondbeleid en bouwen',
 'opvolgerVan': '',
 'organisatietype': 'Gemeente',
 'preferred_url': 'https://lokaleregelgeving.overheid.nl/CVDR108223/1',
 'publicatieurl_xhtml': 'https://repository.officiele-overheidspublicaties.nl/cvdr/CVDR108223/1/html/CVDR108223_1.html',
 'publicatieurl_xml': 'https://repository.officiele-overheidspublicaties.nl

In [19]:
# seems to work.  Now also require 'damocles' in the body text
print_cvdr_results( sru_cvdr.search_retrieve_many( '(creator any Amsterdam) AND (body any damocles)', up_to=5 ) ) 

fetched 0 records



Nothing. Hm. Maybe it's called 'damoclesbeleid'?

In [20]:
print_cvdr_results( sru_cvdr.search_retrieve_many( '(creator any Amsterdam) AND (body any damocles  OR  body any damoclesbeleid)', up_to=5 ) )

fetched 1 records

***  Record 1 of 1  ***
{'alternatieveIdentifier': '',
 'alternative': 'Verzamelbesluit van de burgemeester van de gemeente Amsterdam '
                'verband houdende met de herindeling van de gemeenten '
                'Amsterdam en Weesp',
 'betreft': 'nieuwe regeling',
 'creator': 'Amsterdam (overheid:Gemeente)',
 'identifier': 'CVDR674918_1',
 'indeling': 'overig',
 'inwerkingtredingDatum': '2022-03-25',
 'isFormatOf': 'gmb-2022-138618 '
               '(https://zoek.officielebekendmakingen.nl/gmb-2022-138618)',
 'isRatifiedBy': 'burgemeester (overheid:BestuursorgaanGemeente)',
 'issued': '2022-03-07',
 'kenmerk': 'Onbekend.',
 'language': 'nl',
 'modified': '2022-03-25',
 'omgevingswet': 'nee',
 'onderwerp': '',
 'opvolgerVan': '',
 'organisatietype': 'Gemeente',
 'preferred_url': 'https://lokaleregelgeving.overheid.nl/CVDR674918/1',
 'publicatieurl_xhtml': 'https://repository.officiele-overheidspublicaties.nl/cvdr/CVDR674918/1/html/CVDR674918_1.html',
 'pub

Not actually what we want - it's about the merger of the municipalities Amsterdam and Weesp, and just happens to mention [Damoclesbeleid gemeente Weesp](https://lokaleregelgeving.overheid.nl/CVDR622223/1). 

If it exists, it probably isn't ***called*** damocles. 

Let's widen that to also include things that mention one of `drugs softdrugs harddrugs handelshoeveelheid opiumwet 13b` AND mention one of `sluiting herstelsanctie bestuursdwang`.

This is a practical trade-off: we _will_ get too many results, but what we want should at least be among them, and filtering out false hits can be easier than continuing to guess at names.
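
The widened query described above combines OR and AND, so its grouping matters. The CQL specification gives boolean operators equal precedence and evaluates them left to right, which means an unparenthesized `A OR B AND C` reads as `(A OR B) AND C`; how this particular endpoint treats it is not verified here. A minimal sketch that builds the query with explicit parentheses, so the intent does not depend on precedence (the helper names `any_of`, `all_of` and `one_of` are made up for this illustration and are not part of wetsuite):

```python
def any_of(index, terms):
    """CQL clause: the given index contains at least one of the terms."""
    return '(%s any "%s")' % (index, ' '.join(terms))


def all_of(*clauses):
    """AND the clauses together, wrapped in parentheses."""
    return '(' + ' AND '.join(clauses) + ')'


def one_of(*clauses):
    """OR the clauses together, wrapped in parentheses."""
    return '(' + ' OR '.join(clauses) + ')'


named = any_of('body', ['damoclesbeleid', 'damocles'])
topic = any_of('body', ['drugs', 'softdrugs', 'harddrugs', 'handelshoeveelheid', 'opiumwet', '13b'])
action = any_of('body', ['sluiting', 'herstelsanctie', 'bestuursdwang'])

# Either it is called damocles(beleid), or it mentions both a drugs term and a sanction term.
query = all_of(
    any_of('creator', ['Amsterdam']),
    one_of(named, all_of(topic, action)),
)
print(query)
```

The resulting string can be passed to `search_retrieve_many` like any hand-written query. Because the grouping differs from a left-to-right reading, the number of hits may differ from a query written without the extra parentheses.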

In [13]:
print_cvdr_results( sru_cvdr.search_retrieve_many( '(creator any "Amsterdam") AND ( (body any "damoclesbeleid damocles") OR (body any "drugs softdrugs harddrugs handelshoeveelheid opiumwet 13b") AND (body any "sluiting herstelsanctie bestuursdwang"))', up_to=5 ) )



fetched 5 records

***  Record 1 of 114  ***
{'alternatieveIdentifier': '',
 'alternative': 'Beleidsregels sluitingen en heropeningen Amsterdam',
 'betreft': 'nieuwe regeling',
 'creator': 'Amsterdam (overheid:Gemeente)',
 'identifier': 'CVDR640125_1',
 'inwerkingtredingDatum': '2020-05-08',
 'isFormatOf': 'gmb-2020-115757 '
               '(https://zoek.officielebekendmakingen.nl/gmb-2020-115757)',
 'isRatifiedBy': 'burgemeester (overheid:BestuursorgaanGemeente)',
 'issued': '2020-04-24',
 'kenmerk': 'Onbekend.',
 'language': 'nl',
 'modified': '2020-05-08',
 'onderwerp': '',
 'opvolgerVan': '',
 'organisatietype': 'Gemeente',
 'preferred_url': 'https://lokaleregelgeving.overheid.nl/CVDR640125/1',
 'publicatieurl_xhtml': 'https://repository.officiele-overheidspublicaties.nl/cvdr/CVDR640125/1/html/CVDR640125_1.html',
 'publicatieurl_xml': 'https://repository.officiele-overheidspublicaties.nl/cvdr/CVDR640125/1/xml/CVDR640125_1.xml',
 'redactioneleToevoeging': '<al>Deze regeling vervangt

There it is, plus a bunch of unrelated and expired entries.  We'll get to the expiry part of that in the next section.

### Damocles per municipality

We have a list of municipalities:

In [None]:
gem = wetsuite.datasets.load('gemeentes')
print( gem.description )

In [18]:
# Showing one random example of gemeente data.
#   in this example we only actually care about 'Namen', though
pprint.pprint( random.choice( gem.data ) )

{'Aantal inwoners': '74298',
 'Bevat plaatsen': ['Blokker', 'Hoorn NH', 'Zwaag'],
 'CBSCode': '0405',
 'Namen': ['Hoorn', 'Gemeente Hoorn'],
 'OWMS URI': 'http://standaarden.overheid.nl/owms/terms/Hoorn_(gemeente)',
 'Oppervlakte': [52, 'km2'],
 'Organisatiecode': 'gm0405',
 'Overlaps with': [['Hoogheemraadschap Hollands Noorderkwartier'],
                   ['Noord-Holland']],
 'Predecessors': [],
 'Raad': [['Fractie Tonnaer', 6],
          ['Hoorn lokaal', 4],
          ['ÉénHoorn', 4],
          ['GroenLinks', 4],
          ['VVD', 3],
          ['D66', 3],
          ['PvdA', 3],
          ['CDA', 2],
          ['Liberaal Hoorn', 2],
          ['Sociaal Hoorn', 2],
          ['De Realistische Partij', 1],
          ['ChristenUnie', 1]],
 'Service area of': [['Afvalbeheer Westfriesland', 'ABWF'],
                     ['Gemeentelijke Gezondheidsdienst Hollands Noorden',
                      'GGD HN',
                      'GGD Hollands Noorden',
                      '1620'],
       

For each municipality, we pick out 'Namen' and put it in a query:

In [20]:
for gemeente_dict in gem.data[65:70]:
    # The 65:70 slice was chosen to include Den Haag, which also goes by another name,
    # to check that the code and the search do not trip over that.
    #   (-35:-30  exposes a current repo bug)

    # We probably want to search in the index called 'creator'.
    # When there are multiple names, we accept any of them.
    # Double quotes because some names contain spaces.
    query_gemeente_names = ' OR '.join( '(creator = "%s")'%naam   for naam in gemeente_dict['Namen'] )

    # this is the name requirement, plus the query we settled on earlier
    query = '(%s) AND ( (body any "damoclesbeleid damocles")  OR  (body any "drugs softdrugs harddrugs handelshoeveelheid opiumwet 13b") AND (body any "sluiting herstelsanctie bestuursdwang"))'%(
        query_gemeente_names
    )

    ## search and fetch only the first page, just so that num_records() is filled in for the report
    cvdr = wetsuite.datacollect.koop_sru.CVDR()
    cvdr.search_retrieve( query )
    print( "\n == %3d  hits for   %s == "%(cvdr.num_records(), ' / '.join(gemeente_dict['Namen'])) )

    ## search and fetch all, summarizing each record as we go  (callback style this time)
    def show_brief( record ):
        meta = wetsuite.helpers.koop_parse.cvdr_meta( record, flatten=True )
        uit = meta.get('uitwerkingtredingDatum', None)

        # Old policies are still in here. We can reasonably assume that expired ones were replaced
        # by another entry in the results, so we just hide them.
        # (Side note: the expiry data doesn't look 100% correct.)  Yes, this can also be done in the query.
        # TODO: push newer code that avoids the need for that + nonsense
        expired = uit not in (None,'')  and  (dateutil.parser.parse(uit.split('+')[0]).date() < datetime.date.today())
        if not expired:
            print( "  %15s  %10s..%-10s  %s"%( meta.get('identifier'), meta.get('inwerkingtredingDatum'),  meta.get('uitwerkingtredingDatum',''),  meta.get('title')) )
            #print('    URL: %s'%meta.get('publicatieurl_xml') )     # 'publicatieurl_xml' points to text in structured XML.  There is also 'publicatieurl_xhtml' (more browser-presentable),  and 'preferred_url' (a link to the page that lokaleregelgeving.overheid.nl would also send you to)

            if False: # If you wanted to extract the text, this would be a (very crude) start:
                xml_data = wetsuite.helpers.net.download( meta.get('publicatieurl_xml') )
                tree = wetsuite.helpers.etree.strip_namespace( wetsuite.helpers.etree.fromstring( xml_data ) )
                for al in tree.find('body/regeling/regeling-tekst').iter('al'):
                    print(  ''.join( wetsuite.helpers.etree.all_text_fragments(al) )  )

    cvdr.search_retrieve_many( query, callback=show_brief ) # all results, and show brief summary, mainly just titles


 ==  56  hits for   Den Haag / Gemeente Den Haag / 's-Gravenhage == 
     CVDR645629_1  2020-11-10..            Beleidsregel toezicht bedrijfsmatige activiteiten 2020
     CVDR674619_1  2022-03-24..            Beleidsregel bestuurlijke boete, sluiting en beheerovername op grond van de Woningwet Den Haag 2022
     CVDR690428_1  2023-01-01..            Beleidsregel beoordeling levensgedrag Den Haag 2023
     CVDR11313_53  2022-12-01..            Algemene plaatselijke verordening voor de gemeente Den Haag

 ==  25  hits for   Den Helder / Gemeente Den Helder == 
     CVDR657606_1  2021-05-15..            Beleidsregel van de burgemeester van de gemeente Den Helder, houdende regels over sluiting van lokalen en woningen op grond van artikel 13b Opiumwet (Damoclesbeleid Den Helder 2021)
     CVDR674768_1  2022-03-26..            Beleidsregels van de burgemeester van de gemeente Den Helder, houdende regels omtrent coffeeshops (Beleid coffeeshops Den Helder 2022)
     CVDR627607_1  2019-09-20.
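
The expiry check inside `show_brief` can be pulled out into a small function that is easier to test on its own. This is a sketch using only the standard library, under the assumption that `uitwerkingtredingDatum` values start with an ISO date such as `2019-09-20`, possibly followed by a timezone suffix like `+02:00`; the function name `is_expired` is made up for this illustration.

```python
import datetime


def is_expired(uitwerkingtreding, today=None):
    """True if an end date is recorded and it lies before `today`."""
    if not uitwerkingtreding:   # None or '' means no end date is recorded
        return False
    if today is None:
        today = datetime.date.today()
    # Assumption: the first ten characters are the calendar date (YYYY-MM-DD);
    # anything after that (e.g. a '+02:00' suffix) is ignored.
    end = datetime.date.fromisoformat(uitwerkingtreding[:10])
    return end < today


reference_day = datetime.date(2023, 1, 1)
for value in ('2019-09-20', '2019-09-20+02:00', '2030-01-01', '', None):
    print(repr(value), is_expired(value, today=reference_day))
```

Passing `today` explicitly keeps the check reproducible. A policy that ends today is still treated as current, matching the `<` comparison in the loop above.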

Always think and check, rather than trust automation blindly.

In this case, consider:
- The above search doesn't have a good hit for Den Haag. 
  - They do actually have a policy, but [on their website](https://denhaag.raadsinformatie.nl/modules/13/Overige_bestuurlijke_stukken/113642) rather than in CVDR, because they're not quite part of it yet. There are other cases like this, which you will probably only really find out by hand and/or with experience.
<!-- -->

- [CVDR19959/1](https://lokaleregelgeving.overheid.nl/CVDR19959/1) and [CVDR375267/1](https://lokaleregelgeving.overheid.nl/CVDR375267/1) look to me like the same thing, for Deventer, and both mention they are current. 
  - I can't tell offhand whether that's correct, or they e.g. forgot to mark the older one as ended when the newer one was introduced. There are a handful more cases like these, so it might instead have some practical or legal reason I am not aware of.
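
Cases like the Deventer pair can be flagged for manual review by grouping the current results on their title. A minimal sketch, assuming you have collected the flattened metadata dicts (with `identifier` and `title` keys, as printed by `show_brief`) into a list; the function name `group_by_title` and the example records are made up for this illustration.

```python
import collections


def group_by_title(metas):
    """Map each normalized title to the identifiers sharing it (only titles seen more than once)."""
    groups = collections.defaultdict(list)
    for meta in metas:
        # lowercase and collapse whitespace so trivial differences do not hide a match
        title = ' '.join((meta.get('title') or '').lower().split())
        if title:   # records without a title cannot be compared this way
            groups[title].append(meta.get('identifier'))
    return {title: ids for title, ids in groups.items() if len(ids) > 1}


# Made-up metadata, shaped like the flattened cvdr_meta() output used above.
example_metas = [
    {'identifier': 'EXAMPLE_A_1', 'title': 'Voorbeeld  beleidsregel'},
    {'identifier': 'EXAMPLE_B_1', 'title': 'voorbeeld beleidsregel'},
    {'identifier': 'EXAMPLE_C_1', 'title': 'Iets anders'},
]
print(group_by_title(example_metas))
```

Identical titles only suggest a duplicate; whether both documents are really meant to be current still needs a human look.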
<!-- -->

- Municipality mergers mean names change over time, e.g. `Kollumerland en Nieuwkruisland` (a.k.a. `Kollumerland ca.`), `Dongeradeel`, and `Ferwerderadeel` are now `Noardeast-Fryslân`.
  - Presumably they don't re-issue all policy on that day - which probably means most active policy is still under the old name? 
    - TODO: actually look into that - it might be worth finding all historic names and mentioning them in the gemeente list/dataset - somehow.
<!-- -->

- Municipality naming conventions may throw you off. Consider e.g.:
  - `Den Haag` is also known as `'s-Gravenhage`, and `Den Bosch` is also known as `'s-Hertogenbosch`
  - abbreviations, e.g. `Nuenen, Gerwen en Nederwetten` may appear as `Nuenen c.a.` - and I would assume also often just `Nuenen`
  - somewhat less officially, Frisian towns should be assumed to have two equivalent names. The difference may be subtle (`Dantumadeel` versus `Dantumadiel`) or less so (`Leeuwarden` versus `Ljouwert`). Either form may be used in documents. The Frisian forms do not seem to be part of the mentioned gemeente list/dataset. This merits some investigation as well.
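
One way to cope with such variants is a small hand-maintained table of extra names that gets merged into the `creator` part of the query. This is only a sketch: the table below lists just two pairs mentioned above, the names `EXTRA_NAMES` and `creator_query` are made up for this illustration, and whether the repository's `creator` index ever contains these variants is untested.

```python
# Hand-maintained extra names, keyed by a name as it appears in the dataset's 'Namen'.
# Assumption: which spelling the dataset uses as the key has to be checked per case.
EXTRA_NAMES = {
    'Leeuwarden': ['Ljouwert'],
    'Dantumadeel': ['Dantumadiel'],
}


def creator_query(namen, extra_names=EXTRA_NAMES):
    """OR together all known names for one municipality, without duplicates, keeping order."""
    names = []
    for naam in namen:
        for candidate in [naam] + extra_names.get(naam, []):
            if candidate not in names:
                names.append(candidate)
    # Double quotes because some names contain spaces.
    return ' OR '.join('(creator = "%s")' % n for n in names)


print(creator_query(['Leeuwarden', 'Gemeente Leeuwarden']))
print(creator_query(['Hoorn', 'Gemeente Hoorn']))
```

The result can replace `query_gemeente_names` in the loop above; municipalities without an entry in the table produce the same clause as before.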

<!-- -->

- There is a `Bergen` (municipality _and_ town) in Noord-Holland and a `Bergen` (municipality _and_ town) in Limburg. 
  - In [this government list](https://organisaties.overheid.nl/export/Gemeenten.csv) the municipalities are called `Bergen (L)` and `Bergen NH`, but it seems a poor idea to assume that is precisely how they appear in all use. You should probably assume searches by name will mix these two, for you to resolve manually (it would be nice if we could search by gemeentecode/organisatiecode, here gm0893 and gm0373 respectively - again, see that mentioned gemeente list/dataset).
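
To find out up front which names will mix municipalities, you can check the list itself for names claimed by more than one organisation code. A sketch under the assumption that each record has `Namen` and `Organisatiecode` keys as in the Hoorn example above; the Bergen records below are made-up stand-ins shaped like the dataset (including the assumed bare `Bergen` entries), and `ambiguous_names` is a name invented for this illustration.

```python
import collections


def ambiguous_names(gemeentes):
    """Map each name used by more than one municipality to the organisation codes using it."""
    codes_by_name = collections.defaultdict(set)
    for gemeente in gemeentes:
        for naam in gemeente['Namen']:
            codes_by_name[naam].add(gemeente['Organisatiecode'])
    return {naam: sorted(codes) for naam, codes in codes_by_name.items() if len(codes) > 1}


# Stand-in records; only the keys used here are filled in.
example_gemeentes = [
    {'Namen': ['Bergen (L)', 'Bergen'], 'Organisatiecode': 'gm0893'},
    {'Namen': ['Bergen NH', 'Bergen'],  'Organisatiecode': 'gm0373'},
    {'Namen': ['Hoorn', 'Gemeente Hoorn'], 'Organisatiecode': 'gm0405'},
]
print(ambiguous_names(example_gemeentes))
```

Run against `gem.data`, any name this reports is one whose search results need to be separated by hand.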
