<a href="https://colab.research.google.com/github/WetSuiteLeiden/example-notebooks/blob/main/datasets/dataset_intro_by_doing__cvdr__docstructure_part1_inspect.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# (only) in colab, run this first to install wetsuite.
# For your own setup, see wetsuite's install guidelines.
!pip3 install -U wetsuite

## Purpose of this notebook

Explore what is in the CVDR dataset(s), and what you could easily do with it. 

In [1]:
import collections, random, pprint

import wetsuite.helpers.etree
import wetsuite.helpers.net
import wetsuite.datasets
import wetsuite.helpers.koop_parse
import wetsuite.helpers.notebook

In [2]:
ds = wetsuite.datasets.load('cvdr-mostrecent-xml').data
    
for ident, data in ds.random_sample(1000):
    try:
        meta = wetsuite.helpers.koop_parse.cvdr_meta(data, flatten=True)
        #print(meta)
    except Exception as e:
        print( 'ERR', e, ident, data[:300] )

ERR got XML that seems to be neither a document or a search result record 708574 b'<lvbbu:Consolidaties schemaversie="1.2.0" xmlns:map="http://marklogic.com/xdmp/map" xmlns:xsi="http:'
ERR got XML that seems to be neither a document or a search result record 706852 b'<lvbbu:Consolidaties schemaversie="1.2.0" xmlns:map="http://marklogic.com/xdmp/map" xmlns:xsi="http:'
ERR got XML that seems to be neither a document or a search result record 696552 b'<lvbbu:Consolidaties schemaversie="1.2.0" xmlns:map="http://marklogic.com/xdmp/map" xmlns:xsi="http:'
ERR got XML that seems to be neither a document or a search result record 696170 b'<lvbbu:Consolidaties schemaversie="1.2.0" xmlns:map="http://marklogic.com/xdmp/map" xmlns:xsi="http:'


## For an idea of what is contained: the XML form

### Quick sidetrack: Raw document

Before we move on to the more convenient helpers, let's get a feel for what is in the lower-level XML.

In [2]:
url = 'https://repository.officiele-overheidspublicaties.nl/CVDR/CVDR621050/2/xml/CVDR621050_2.xml' # a cherry-picked example
example_tree = wetsuite.helpers.etree.fromstring( wetsuite.helpers.net.download( url ) ) # fetch and parse

# for ease of walking, and consistency with other tutorials, let's remove XML namespaces
example_tree = wetsuite.helpers.etree.strip_namespace( example_tree )

# print the document _mostly_ as we fetched it, but reindented to make the structure more visual 
#   (note: you shouldn't reindent during data _processing_, because it changes the document)
#print( wetsuite.helpers.etree.debug_pretty( example_tree ) ) 

# If we care about the content, we could skip the metadata and focus just on the body
#print( wetsuite.helpers.etree.debug_pretty( example_tree.find('body') ) )
wetsuite.helpers.etree.debug_color( example_tree )
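
To make concrete why stripping namespaces eases walking the tree, here is a minimal sketch using only the standard library. It is not wetsuite's implementation: `strip_namespace_sketch`, the sample document and its namespace URI are made up for illustration, and it only touches element tags (not attributes).

```python
import xml.etree.ElementTree as ET

def strip_namespace_sketch(root):
    """Remove the '{namespace-uri}' prefix from every element tag, in place."""
    for element in root.iter():
        if isinstance(element.tag, str) and element.tag.startswith('{'):
            element.tag = element.tag.split('}', 1)[1]
    return root

# Made-up document; the namespace URI is a placeholder, not the real CVDR one
sample = (
    '<cvdr:doc xmlns:cvdr="http://example.com/ns">'
    '<cvdr:body><cvdr:regeling/></cvdr:body>'
    '</cvdr:doc>'
)
root = ET.fromstring(sample)
tag_before = root.tag                 # the parser stores the URI inside the tag name
strip_namespace_sketch(root)
tag_after = root.tag
found = root.find('body/regeling') is not None   # plain paths now work
print(tag_before, '->', tag_after, '| body/regeling found:', found)
```

Without the stripping step, the same `find()` would need the namespace URI spelled out in every path step.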

### Back on track: Metadata

The metadata header contains various useful things, but it's in a few different places, and it is varied in structure.

A lot of items are plain name-value combinations, including:
* owmskern's    `<identifier>CVDR641872_2</identifier>`
* owmskern's    `<title>Nadere regels jeugdhulp gemeente Pijnacker-Nootdorp 2020</title>`
* owmskern's    `<language>nl</language>`
* owmskern's    `<modified>2022-02-17</modified>`
* owmsmantel's  `<alternative>Nadere regels jeugdhulp gemeente Pijnacker-Nootdorp 2020</alternative>`
* owmsmantel's  `<subject>maatschappelijke zorg en welzijn</subject>`
* owmsmantel's  `<issued>2022-02-08</issued>`
* owmsmantel's  `<rights>De tekst in dit document is vrij van auteursrecht en databankrecht</rights>`

However, they are allowed to have attributes, including e.g.:
* owmskern's    `<type scheme="overheid:Informatietype">regeling</type>`  (except there's no variation in that value anyway)
* owmskern's    `<creator scheme="overheid:Gemeente">Pijnacker-Nootdorp</creator>`
* owmsmantel's  `<isRatifiedBy scheme="overheid:BestuursorgaanGemeente">college van burgemeester en wethouders</isRatifiedBy>`
* owmsmantel's  `<isFormatOf resourceIdentifier="https://zoek.officielebekendmakingen.nl/gmb-2022-66747">gmb-2022-66747</isFormatOf>`
* owmsmantel's  `<source resourceIdentifier="https://lokaleregelgeving.overheid.nl/CVDR641839">Verordening jeugdhulp gemeente Pijnacker-Nootdorp 2020</source>`

With the default `flatten=False`, our `cvdr_meta()` function gives you structured data close to the original XML:

        'creator': [{'attr': {'scheme': 'overheid:Gemeente'}, 'text': 'Zuidplas'}],


Yet if you just want to show a person a moderately readable summary, like here,
you can use `flatten=True` to ask it to creatively smush those into a single string, 
yielding something like:

        'creator': 'Zuidplas (overheid:Gemeente)',

To see the difference of this flattening in practice
(and see why using flatten in real data processing may prove messier than doing it properly)...

In [7]:
# First, let's switch to real data, from a dataset - picking one random document from it
cvdr_xml = wetsuite.datasets.load('cvdr-mostrecent-xml')
from_url, xml_bytes = cvdr_xml.data.random_choice() # you could re-run this until you get an interesting one.
parsed_example = wetsuite.helpers.etree.fromstring( xml_bytes )

# metadata in flattened form, for readability
display( wetsuite.helpers.koop_parse.cvdr_meta(parsed_example, flatten=True) )

{'identifier': 'CVDR663205_1',
 'title': 'Regeling ambtelijke organisatie provincie Groningen 2022',
 'language': 'nl',
 'type': 'regeling (overheid:Informatietype)',
 'creator': 'Groningen (overheid:Provincie)',
 'modified': '2022-01-01',
 'isFormatOf': 'prb-2021-9780 (https://zoek.officielebekendmakingen.nl/prb-2021-9780)',
 'alternative': 'Regeling ambtelijke organisatie provincie Groningen 2022',
 'source': 'artikel 158, eerste lid, van de Provinciewet (1.0:c:BWBR0005645&artikel=158&lid=1&g=2021-07-10),  artikel 100 van de Provinciewet (1.0:c:BWBR0005645&artikel=100&g=2021-07-10),  artikel 103 van de Provinciewet (1.0:c:BWBR0005645&artikel=103&g=2021-07-10)',
 'isRatifiedBy': 'gedeputeerde staten (overheid:BestuursorgaanProvincie)',
 'subject': 'bestuur en recht',
 'issued': '2021-10-12',
 'rights': 'De tekst in dit document is vrij van auteursrecht en\n                    databankrecht',
 'inwerkingtredingDatum': '2022-01-01',
 'betreft': 'nieuwe regeling',
 'kenmerk': 'K11882',
 

In [8]:
# non-flattened original, may e.g. be preferable for more structured ingest into something else
display( wetsuite.helpers.koop_parse.cvdr_meta(parsed_example, flatten=False) )

{'identifier': [{'text': 'CVDR663205_1', 'attr': {}}],
 'title': [{'text': 'Regeling ambtelijke organisatie provincie Groningen 2022',
   'attr': {}}],
 'language': [{'text': 'nl', 'attr': {}}],
 'type': [{'text': 'regeling', 'attr': {'scheme': 'overheid:Informatietype'}}],
 'creator': [{'text': 'Groningen', 'attr': {'scheme': 'overheid:Provincie'}}],
 'modified': [{'text': '2022-01-01', 'attr': {}}],
 'isFormatOf': [{'text': 'prb-2021-9780',
   'attr': {'resourceIdentifier': 'https://zoek.officielebekendmakingen.nl/prb-2021-9780'}}],
 'alternative': [{'text': 'Regeling ambtelijke organisatie provincie Groningen 2022',
   'attr': {}}],
 'source': [{'text': 'artikel 158, eerste lid, van de Provinciewet',
   'attr': {'resourceIdentifier': '1.0:c:BWBR0005645&artikel=158&lid=1&g=2021-07-10'}},
  {'text': 'artikel 100 van de Provinciewet',
   'attr': {'resourceIdentifier': '1.0:c:BWBR0005645&artikel=100&g=2021-07-10'}},
  {'text': 'artikel 103 van de Provinciewet',
   'attr': {'resourceIden
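
To illustrate why the structured form is easier to process, here is a small sketch that works on a hand-copied excerpt of the output above. The `flatten_item_sketch` helper is hypothetical and only approximates what `flatten=True` seems to produce; it is not the library's code.

```python
# Hand-copied excerpt of the structured (flatten=False) output shown above
meta = {
    'creator': [{'text': 'Groningen', 'attr': {'scheme': 'overheid:Provincie'}}],
    'source': [
        {'text': 'artikel 158, eerste lid, van de Provinciewet',
         'attr': {'resourceIdentifier': '1.0:c:BWBR0005645&artikel=158&lid=1&g=2021-07-10'}},
        {'text': 'artikel 100 van de Provinciewet',
         'attr': {'resourceIdentifier': '1.0:c:BWBR0005645&artikel=100&g=2021-07-10'}},
    ],
}

def flatten_item_sketch(item):
    """Hypothetical helper: 'text (attribute values)', roughly like the flattened form."""
    extras = ', '.join(item['attr'].values())
    return f"{item['text']} ({extras})" if extras else item['text']

# Structured form: each identifier is a plain dictionary lookup
source_ids = [item['attr']['resourceIdentifier'] for item in meta['source']]
flat_creator = flatten_item_sketch(meta['creator'][0])
print(source_ids)
print(flat_creator)
```

Going the other way, from the flattened string back to separate values, would mean splitting on commas and parentheses that can also occur inside the text itself.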

### Source references

That XML metadata also explicitly contains some source references. 

Do not expect these to be complete in the legal-basis sense,
but they can be quite useful when they are there.

While listed in the metadata above, they benefit from some cleanup and interpretation
(in fact a little more than what we currently do).

In [9]:
for typ, origref, specref, parts, source_text in  wetsuite.helpers.koop_parse.cvdr_sourcerefs(parsed_example):
    print(f"""
    type:                     {typ}
    URL-like reference:       {origref}
    more specific reference:  {specref}
    parts:                    {parts}
    source_text:              {source_text}""")


    type:                     BWB
    URL-like reference:       1.0:c:BWBR0005645&artikel=158&lid=1&g=2021-07-10
    more specific reference:  BWBR0005645
    parts:                    OrderedDict([('artikel', ['158']), ('lid', ['1']), ('g', ['2021-07-10'])])
    source_text:              artikel 158, eerste lid, van de Provinciewet

    type:                     BWB
    URL-like reference:       1.0:c:BWBR0005645&artikel=100&g=2021-07-10
    more specific reference:  BWBR0005645
    parts:                    OrderedDict([('artikel', ['100']), ('g', ['2021-07-10'])])
    source_text:              artikel 100 van de Provinciewet

    type:                     BWB
    URL-like reference:       1.0:c:BWBR0005645&artikel=103&g=2021-07-10
    more specific reference:  BWBR0005645
    parts:                    OrderedDict([('artikel', ['103']), ('g', ['2021-07-10'])])
    source_text:              artikel 103 van de Provinciewet
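
For a feel of the interpretation involved, the URL-like reference can be taken apart with the standard library alone. This is a sketch, not what `cvdr_sourcerefs` does internally: `split_reference_sketch` is a hypothetical helper, and it assumes the `1.0:c:<identifier>&key=value&...` shape seen in the output above.

```python
from urllib.parse import parse_qs

def split_reference_sketch(reference):
    """Hypothetical helper: split '1.0:c:<id>&key=value&...' into (id, parts)."""
    head, _, query = reference.partition('&')
    identifier = head.rsplit(':', 1)[-1]   # drop the leading '1.0:c:' prefix
    parts = parse_qs(query)                # each key maps to a list of values
    return identifier, parts

identifier, parts = split_reference_sketch(
    '1.0:c:BWBR0005645&artikel=158&lid=1&g=2021-07-10'
)
print(identifier, parts)
```

References that do not follow this shape (for example plain URLs) would need separate handling.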


### Text

If you do not care about structure, you can flatten everything into one string.

Some creativity is involved, so do not expect this to be entirely regular, or pretty.

In [10]:
print( wetsuite.helpers.koop_parse.cvdr_text(parsed_example) )

Begripsbepalingen 
Begripsomschrijvingen
In deze regeling wordt verstaan onder:
Basisteam: een team dat samenhangende reguliere werkzaamheden uitvoert;
Bijzonder organisatieonderdeel: organisatieonderdeel dat is ingesteld op basis van samenwerking met andere partijen;
Domein: organisatieonderdeel met een samenhangende functie voor de organisatie.
Multiteam: een tijdelijk team dat een project of programma uitvoert en specifiek met dat doel is ingesteld.
Opgave: het samenstel van activiteiten voortvloeiende uit een of meerdere doelstellingen van het coalitieprogramma of coalitieakkoord.
Programma: een cluster van meerdere projecten gericht op een specifieke opgave.
Project: een in tijd en middelen begrensde activiteit om specifieke doelen te realiseren op basis van een vastgesteld projectplan.

De ambtelijke organisatie bestaat hiërarchisch uit een directie, domeinen en basisteams.

Er worden vier domeinen onderscheiden, te weten: uitvoering, organisatieondersteuning, bestuurszaken en be
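
The "creativity" is mostly about deciding where line breaks go. A rough sketch of such flattening, on a made-up fragment and with the standard library only, could look like the following. `flatten_text_sketch` and the choice of `BLOCK_TAGS` are assumptions for illustration, not how `cvdr_text` works.

```python
import xml.etree.ElementTree as ET

BLOCK_TAGS = {'kop', 'al'}   # assumption: tags we choose to put on their own line

def flatten_text_sketch(root):
    """Emit one line per block-level element, with whitespace collapsed."""
    lines = []
    for element in root.iter():
        if element.tag in BLOCK_TAGS:
            # join every text fragment under this element, then collapse whitespace
            # (block tags nested inside each other would be emitted twice)
            text = ' '.join(''.join(element.itertext()).split())
            if text:
                lines.append(text)
    return '\n'.join(lines)

# Made-up fragment in the style of a CVDR artikel
sample = (
    '<artikel><kop><label>Artikel</label> <nr>1</nr></kop>'
    '<al>Basisteam: een team dat <vet>reguliere</vet>\n   werkzaamheden uitvoert;</al>'
    '<al>Tweede alinea.</al></artikel>'
)
flat_text = flatten_text_sketch(ET.fromstring(sample))
print(flat_text)
```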

## HTML form

There is an almost-equivalent dataset in HTML form. 
Some will find this easier to ingest, though it is also a little less complete and precise than the XML.


In [13]:
cvdr_html = wetsuite.datasets.load('cvdr-mostrecent-html')

key, html_bytes = cvdr_html.data.random_choice() 

#render that HTML in notebook (we generally wouldn't)
from IPython.display import HTML
HTML( html_bytes.decode('utf8') )

#display( html_bytes )

Organisatie,Amsterdam
Organisatietype,Gemeente
Officiële naam regeling,Verordening binnentreden woningen ter handhaving van voorschriften
Citeertitel,Verordening binnentreden woningen ter handhaving van voorschriften
Vastgesteld door,gemeenteraad
Onderwerp,bestuur en recht
Eigen onderwerp,

Datum inwerkingtreding,Terugwerkende kracht tot en met,Datum uitwerkingtreding,Betreft,Datum ondertekening / Bron bekendmaking,Kenmerk voorstel
07-06-1997,,,nieuwe regeling,"Gemeenteblad 1997, afd. 3, nr. 42","Gemeenteblad 1997, afd. 1, nr. 261"
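
If you want those name/value rows as data rather than as rendered HTML, a table walker from the standard library goes a long way. The sketch below runs on made-up markup that merely resembles the table above; the real CVDR HTML may nest or label things differently, and `TableRows` is a hypothetical helper.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of each table cell, grouped per table row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None          # list of text fragments while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])
        elif tag in ('td', 'th') and self.rows:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            # collapse whitespace and store the finished cell in the current row
            self.rows[-1].append(' '.join(''.join(self._cell).split()))
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

# Made-up markup, assumed to resemble the metadata table
sample_html = (
    '<table>'
    '<tr><td>Organisatie</td><td>Amsterdam</td></tr>'
    '<tr><td>Organisatietype</td><td>Gemeente</td></tr>'
    '</table>'
)
parser = TableRows()
parser.feed(sample_html)
metadata = {row[0]: row[1] for row in parser.rows if len(row) == 2}
print(metadata)
```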


## (quite optional:) Diving deeper into the XML

When you want more structure than plain text, you quickly have to dive deeper, in a way specific to a data source.

This gives an overview of the structure of the text documents (in XML form) in the CVDR repository,
and some ideas of how to process that as more than flat text.


### Why do this?

**What do these XML documents actually give us?**

In terms of natural language, legal texts are fairly precise and decently structured, by virtue of needing to be unambiguous. What XML can add to that, and often does, is grouping natural sentences into ever narrower portions, like chapters and paragraphs, and structuring lists and tables.


**What might we want to do?**

How much you care about this structure varies with your research question:
  - to summarize the subjects mentioned, you might only care about the words and phrases in it

  - to categorize portions of a document (say, policy into introduction, motivation, and implementation details, or case law into claim, court judgment, etc.), you may only care to separate sections

  - you might want to group adjacent paragraphs (well, alineas - there's some loss in translation here) when they belong to the same portion/argument - this would certainly be easier to do if you still have the document structure
  
  - to find the most normative statements, and extract their meaning, you care more about structure and context

  - in laws, you might very much like to keep "this text comes from article 1, lid 4"
  

**What can we add?**

We should certainly make an effort to make typical tasks easier.

It should also be pointed out that we can only do so much.
The more precise your research question, the more you may have to dig into the details.

Regardless of your question, you may run into the fact that while _in theory_ this all gives you fragments of well-defined and self-contained text,
in practice things are messier. People are quite pragmatic and gloss over a lot (consider e.g. how you would interpret the list in artikel 2 lid 2 in the first example below), but code tends to be stupidly literal.

So let's look at some documents, and see what is necessary.

#### Some side questions - paths to specific elements (you can skip this)

First question: does that structure always look like that, or does it act more like a free-form document in the sense of allowing whatever you put in?

There is an XML schema (e.g. [included in the following PDF; TODO: find in XML form](https://www.koopoverheid.nl/binaries/koop/documenten/instructies/2017/10/23/cvdr-handleiding-deel-6-deel-6-metadata-xml-schema-en-webservices/IPM_dr_4_0_deel_6-Metadata_XML-schema_Webservices-1.pdf)) that should help answer that
(if you don't know, a schema specifies which elements (and attributes) are allowed to appear within which others).

It's long and _seems_ detailed, but it turns out that it is actually relatively forgiving in what you can place where.
So it may be more useful to inspect how real documents are made.

From the above, we have the question of how consistent the parts under `regeling` are - are things always nested the same way, or not?

To find out, let's look at the path you have to walk *from* `regeling` to each node under it, to get a feel for both common and uncommon nesting.
And let's do that to a _lot_ of documents.

In [15]:
cvdr_xml = wetsuite.datasets.load('cvdr-mostrecent-xml')

cvdr_parsed = [] # list of (source url, etree object) tuples

cvdr_urls = cvdr_xml.data.keys()
cvdr_urls_subset = random.sample(list(cvdr_urls), 20000) # 160K is a bit much in RAM, and a sizeable random selection should be enough

for cvdr_id in cvdr_urls_subset: 
    bytestring = cvdr_xml.data.get( cvdr_id )
    tree = wetsuite.helpers.etree.fromstring( bytestring )
    tree = wetsuite.helpers.etree.strip_namespace( tree )
    cvdr_parsed.append( (cvdr_id, tree) )

In [16]:
# Run the path counter.
#  First questions: how common are each of the broad parts in the regeling?
count_paths = collections.defaultdict( int )

for url, tree in cvdr_parsed:
    # We stop two-deep to only get the direct things under regeling, to avoid verbosity
    for path, count in wetsuite.helpers.etree.path_count( tree.find('body/regeling'), max_depth=2 ).items():
        count_paths[path] += count

for path, count in sorted( count_paths.items() ):
    print( '%6d  %s'%(count,path))

 19898  regeling
 19898  regeling/aanhef
  9351  regeling/bijlage
  4053  regeling/nota-toelichting
     3  regeling/officiele-inhoudsopgave
 14322  regeling/regeling-sluiting
 19898  regeling/regeling-tekst
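
For a sense of what a path counter like the one used above does, here is a minimal standard-library sketch run on a made-up `regeling`. It is an approximation for illustration: `path_count_sketch` is a hypothetical stand-in, and wetsuite's own `path_count` may differ in details such as how depth is counted.

```python
import collections
import xml.etree.ElementTree as ET

def path_count_sketch(root, max_depth=None):
    """Count how often each tag path occurs, relative to (and including) root."""
    counts = collections.Counter()

    def walk(element, path, depth):
        counts[path] += 1
        # only descend while we are above the requested depth
        if max_depth is None or depth < max_depth:
            for child in element:
                walk(child, f'{path}/{child.tag}', depth + 1)

    walk(root, root.tag, 1)
    return counts

# Made-up, heavily simplified regeling
sample = (
    '<regeling>'
    '<aanhef><preambule><al/><al/></preambule></aanhef>'
    '<regeling-tekst><artikel><al/></artikel><artikel><al/></artikel></regeling-tekst>'
    '</regeling>'
)
root = ET.fromstring(sample)
shallow = path_count_sketch(root, max_depth=2)   # only direct children of regeling
full = path_count_sketch(root)                   # every path in the document

for path, count in sorted(full.items()):
    print('%6d  %s' % (count, path))
```

Summing such counters over many documents gives the kind of table shown above, where rare paths point at unusual nesting.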


In [17]:
# Out of curiosity, what's in the aanhef?
count_paths = collections.defaultdict( int )
for url, tree in cvdr_parsed:
    for path, count in wetsuite.helpers.etree.path_count( tree.find('body/regeling/aanhef'), max_depth=4 ).items():
        count_paths[path] += count

for path, count in sorted( count_paths.items() ):
    print( '%6d  %s'%(count,path))

 19898  aanhef
   748  aanhef/afkondiging
  2619  aanhef/afkondiging/al
    11  aanhef/afkondiging/al/cursief
    12  aanhef/afkondiging/al/extref
     1  aanhef/afkondiging/al/noot
    10  aanhef/afkondiging/al/onderstreept
   678  aanhef/afkondiging/al/vet
    37  aanhef/afkondiging/lijst
    89  aanhef/afkondiging/lijst/li
     1  aanhef/context
     1  aanhef/context/context.lijst
     1  aanhef/context/context.lijst/li
 19893  aanhef/preambule
163824  aanhef/preambule/al
  1402  aanhef/preambule/al/cursief
  1857  aanhef/preambule/al/extref
     8  aanhef/preambule/al/inf
    40  aanhef/preambule/al/noot
   610  aanhef/preambule/al/onderstreept
   131  aanhef/preambule/al/plaatje
   209  aanhef/preambule/al/sup
 19979  aanhef/preambule/al/vet
     5  aanhef/preambule/kop
     6  aanhef/preambule/kop/label
  5336  aanhef/preambule/lijst
 17935  aanhef/preambule/lijst/li
    25  aanhef/wie
     1  aanhef/wij


So broadly, the body seems to contain
* `intitule`
* `regeling`
  * `aanhef` (apparently always there), mostly `preambule` text
  * `regeling-tekst` is the main text (always there), more on that below
  * `regeling-sluiting` is some formalities (usually there)
  * `bijlage` may be additional useful stuff (there half the time)

#### The main question - body structure

Our initial question was about the paths under `regeling-tekst`.

In [18]:
count_paths = collections.defaultdict( int )

for url, tree in cvdr_parsed:
    for path, count in wetsuite.helpers.etree.path_count( tree.find('body/regeling/regeling-tekst') ).items():
        count_paths[path] += count

for path, count in sorted( count_paths.items() ):
    if count > 100:
        print( '%6d  %s'%(count,path))

 19898  regeling-tekst
   119  regeling-tekst/afdeling/artikel/lid
   132  regeling-tekst/afdeling/artikel/lid/al
   119  regeling-tekst/afdeling/artikel/lid/lidnr
119538  regeling-tekst/artikel
296243  regeling-tekst/artikel/al
  8446  regeling-tekst/artikel/al/cursief
   199  regeling-tekst/artikel/al/cursief/onderstreept
  1541  regeling-tekst/artikel/al/extref
   262  regeling-tekst/artikel/al/extref/onderstreept
   156  regeling-tekst/artikel/al/noot
   171  regeling-tekst/artikel/al/noot/noot.al
   156  regeling-tekst/artikel/al/noot/noot.nr
  3496  regeling-tekst/artikel/al/onderstreept
   578  regeling-tekst/artikel/al/plaatje
   578  regeling-tekst/artikel/al/plaatje/illustratie
   453  regeling-tekst/artikel/al/sup
 38079  regeling-tekst/artikel/al/vet
  1888  regeling-tekst/artikel/al/vet/cursief
   849  regeling-tekst/artikel/al/vet/onderstreept
119538  regeling-tekst/artikel/kop
115064  regeling-tekst/artikel/kop/label
110072  regeling-tekst/artikel/kop/nr
108390  regeling

#### Say you want to process this text in a more structured way than completely flat

What if we selected all the `<al>` tags and extracted their `.text` ?

Let's use a helper that visualizes a selection within an etree, to see if our selection makes sense
(note: this works only if we're in an ipython style notebook)


In [4]:
wetsuite.helpers.notebook.etree_visualize_selection( 
    example_tree, 
    '//body/regeling/regeling-tekst//al', # single slash is directly under, double slash is anywhere under. See XPath syntax
    reindent=True, mark_subtree=True
)
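
One thing to watch for when extracting `.text` from `<al>` elements: the path counts above show inline markup such as `<vet>` and `<cursief>` nested inside `al`, and in etree-style APIs `.text` holds only the text before the first child element. A minimal sketch on a made-up fragment, using the standard library rather than wetsuite's helpers:

```python
import xml.etree.ElementTree as ET

# Made-up fragment with inline markup inside an <al>
sample = (
    '<doc><body><regeling><regeling-tekst><artikel>'
    '<al>Het <vet>college</vet> stelt nadere regels.</al>'
    '</artikel></regeling-tekst></regeling></body></doc>'
)
root = ET.fromstring(sample)
regeling_tekst = root.find('body/regeling/regeling-tekst')

first_al = next(regeling_tekst.iter('al'))   # any <al> anywhere under regeling-tekst
only_text = first_al.text                    # stops at the first child element
all_text = ''.join(first_al.itertext())      # includes nested elements and their tails
print(repr(only_text))
print(repr(all_text))
```

So selecting the `al` elements is a reasonable start, but collecting all text under each one is usually closer to what you want than `.text` alone.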

#### What about the structure?

In [part 2](dataset_intro_by_doing__cvdr__docstructure_part2_use.ipynb) we use what we have learned to do some useful extraction.