Continuing from the rechtspraak datacollection notebook in our datacollect repository...

## What does the data XML look like, and what can I easily do with it?

In [1]:
import collections
import pprint

import wetsuite.datasets
import wetsuite.helpers.net
import wetsuite.helpers.localdata
import wetsuite.helpers.etree
import wetsuite.datacollect.rechtspraaknl

In [2]:
rechtspraak_sample = wetsuite.datasets.load('rechtspraaknl-sample-xml')

#if False: # flip to True to show a cherry-picked example instead
#xmlbytes = wetsuite.helpers.net.download('https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:RBZWB:2020:5807')
#else: # or a random example   (note that a lot of them will be without text, that's normal)
_, xmlbytes = rechtspraak_sample.data.random_choice()

display( wetsuite.helpers.etree.debug_color( xmlbytes ) ) # print indented and colored, as an indication what the XML looks like

## tl;dr: our function that extracts the broadly interesting content

In [5]:
pprint.pprint(   wetsuite.datacollect.rechtspraaknl.parse_content( xmlbytes )   )

{'bodytext': '\n'
             '\n'
             'Bij besluit van 10 april 2024 (het bestreden besluit) heeft '
             'verweerder aan eiser de maatregel van bewaring op grond van '
             'artikel 59, eerste lid, aanhef en onder a, van de '
             'Vreemdelingenwet 2000 (Vw) opgelegd.\n'
             '\n'
             'Eiser heeft tegen het bestreden besluit beroep ingesteld. Dit '
             'beroep moet tevens worden aangemerkt als een verzoek om '
             'toekenning van schadevergoeding.\n'
             '\n'
             'De rechtbank heeft het beroep op 24 april 2024 op zitting te '
             'Breda behandeld. Daarbij is gebruik gemaakt van een '
             'videoverbinding. Eiser is verschenen, bijgestaan door zijn '
             'gemachtigde. Als tolk is verschenen A. Khabote. Verweerder heeft '
             'zich laten vertegenwoordigen door zijn gemachtigde.\n'
             '\n'
             'Overwegingen\n'
             '\n'
             '1. Eis

For now, though, let's explore what we have a little longer.

## How we got there: inspect the fetched documents, looking for their text

As in the exploration of the BWB and CVDR data, there are [schemas](https://www.rechtspraak.nl/SiteCollectionDocuments/Schema-Open-Data-voor-de-Rechtspraak.zip)
describing the text's structure, but we should check how closely the documents actually follow them.

Regardless of that, we also need to decide how to flatten that text when we want plain text,
which we do for this dataset.
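
To give a rough idea of what "flattening" means here, the following is a minimal sketch using only the standard library. It is not how `parse_content()` is implemented; the example fragment, the `flatten_text` helper, and the choice of `para` and `title` as block-level tags are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, much-simplified fragment; real documents are namespaced and far larger.
EXAMPLE = b"""<uitspraak>
  <section>
    <title>Overwegingen</title>
    <para>Eerste <emphasis>alinea</emphasis>.</para>
    <para>Tweede alinea.</para>
  </section>
</uitspraak>"""

# Assumption: these tags each hold one block of text.
BLOCK_TAGS = {'para', 'title'}

def flatten_text(node):
    """Join the text of each block-level element, separated by blank lines."""
    blocks = []
    for element in node.iter():
        if element.tag in BLOCK_TAGS:
            # itertext() also picks up text inside inline children such as <emphasis>
            text = ' '.join(''.join(element.itertext()).split())
            if text:
                blocks.append(text)
    return '\n\n'.join(blocks)

flat = flatten_text(ET.fromstring(EXAMPLE))
print(flat)
```

Note that this naive version would emit text twice if block-level tags were nested inside each other, which is one reason to look at the paths that actually occur before settling on an approach.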

One of the things we do is counting paths (like in the [bwb_docstructure](koop_bwb_docstructure.ipynb) notebook).

As of this writing, those path counts have guided the `rechtspraaknl.parse_content()` implementation shown above, though it needs more work.
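
For readers unfamiliar with path counting, here is a small standalone sketch of the idea, using only the standard library. The `count_paths_in` function and the example fragment are hypothetical, and the details of `wetsuite.helpers.etree.path_count` may well differ.

```python
import collections
import xml.etree.ElementTree as ET

def count_paths_in(node, prefix=''):
    """Count how often each slash-separated tag path occurs at or under node."""
    counts = collections.Counter()
    path = prefix + node.tag
    counts[path] += 1
    for child in node:
        # recurse, extending the path with this node's tag
        counts.update(count_paths_in(child, path + '/'))
    return counts

# Hypothetical fragment, only to have something to count
example = ET.fromstring(
    b'<conclusie><bridgehead><nr>1</nr></bridgehead><para>a</para><para>b</para></conclusie>'
)
example_counts = count_paths_in(example)

for path, count in sorted(example_counts.items()):
    print('%7d   %s' % (count, path))
```

Summing such counts over many documents shows which nestings are common and which are rare exceptions.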

In [None]:
count_paths = collections.defaultdict(int)

for key, xmldoc_bytes in rechtspraak_sample.data.random_sample( 10000 ): # a smallish selection is nice for review and debugging
#for key, xmldoc_bytes in rechtspraak_sample.data.items(): # actually, all ~150K items in the sample dataset only takes ~five minutes, so we could just do that
    tree = wetsuite.helpers.etree.fromstring( xmldoc_bytes )
    tree = wetsuite.helpers.etree.strip_namespace( tree )

    # if it contains a body of text, it will be either in a <uitspraak> tag or a <conclusie> tag
    uitspraak = tree.find('uitspraak')
    conclusie = tree.find('conclusie')
    
    for node in (uitspraak, conclusie):
        if node is None: # one or both of those will be None (not there)
            continue

        # do the counting we just mentioned we'd be doing
        for path, count in wetsuite.helpers.etree.path_count( node ).items():
            count_paths[path] += count

        # also, as debug, double-check the parse_content function (that we've already written by now) to see if it complains at all
        try:
            parsed = wetsuite.datacollect.rechtspraaknl.parse_content( tree )
            #print( parsed['bodytext'] )
        except Exception as e:
            print( 'ERROR', e )
            print( wetsuite.helpers.etree.tostring( node ).decode('u8') )
            #raise

In [None]:
path_count = list( count_paths.items() ) # will be a list of (path, count)

# sort by path
path_count.sort( key=lambda x:x[0] ) 

# print out   count, path
for path, count in path_count: 
    print('%7d   %s'%(count, path))

   2758   conclusie
    224   conclusie/bridgehead
     17   conclusie/bridgehead/nr
   1431   conclusie/conclusie.info
     98   conclusie/conclusie.info/bridgehead
      2   conclusie/conclusie.info/bridgehead/nr
      4   conclusie/conclusie.info/informaltable
      4   conclusie/conclusie.info/informaltable/tgroup
     26   conclusie/conclusie.info/informaltable/tgroup/colspec
      4   conclusie/conclusie.info/informaltable/tgroup/tbody
     86   conclusie/conclusie.info/informaltable/tgroup/tbody/row
    327   conclusie/conclusie.info/informaltable/tgroup/tbody/row/entry
    327   conclusie/conclusie.info/informaltable/tgroup/tbody/row/entry/para
      7   conclusie/conclusie.info/informaltable/tgroup/tbody/row/entry/para/emphasis
      2   conclusie/conclusie.info/mediaobject
      2   conclusie/conclusie.info/mediaobject/imageobject
      2   conclusie/conclusie.info/mediaobject/imageobject/imagedata
     21   conclusie/conclusie.info/orderedlist
     77   conclusie/conclusie.i

It seems there are specific and varied ideas about what should go in which structure (e.g. para in parablock in paragroup, overview material in uitspraak.info).

But at the same time, that structure is missing from a lot of documents.
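
One way to put a number on that would be to check, per document, whether any of the structural elements occur at all. The sketch below assumes namespaces have already been stripped (as in the loop above); the three inline documents are made up, and which tags count as "structure" is our own choice.

```python
import xml.etree.ElementTree as ET

# Made-up stand-ins for real documents, keyed like the dataset is
docs = {
    'doc-structured': b'<uitspraak><parablock><para>tekst</para></parablock></uitspraak>',
    'doc-flat':       b'<uitspraak><para>tekst</para><para>meer tekst</para></uitspraak>',
    'doc-empty':      b'<uitspraak/>',
}

# Assumption: the presence of any of these tags signals "stronger structure"
STRUCTURE_TAGS = {'parablock', 'paragroup', 'uitspraak.info'}

def has_structure(xmlbytes):
    """True if the document contains at least one of the structural tags."""
    root = ET.fromstring(xmlbytes)
    return any(element.tag in STRUCTURE_TAGS for element in root.iter())

with_structure = [key for key, xmlbytes in docs.items() if has_structure(xmlbytes)]
print('%d of %d documents use structural tags' % (len(with_structure), len(docs)))
```

Run over the sample dataset instead of these stand-ins, this would give a rough fraction of documents that are effectively plain text in blocks.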

----

Others have noticed this, and have tried to do something about it. For example, [this thesis](https://digitalheir.github.io/java-rechtspraak-library/) noticed that stronger structure and metadata only appeared around 2012 or so.

It takes the documents before then, which amount to a variant of plain text (in blocks), and attempts to assign reasonable semantic markup.

## 