Continuing from the rechtspraak datacollection notebook in our datacollect repository...

## What does the data XML look like, and what can I easily do with it?

In [10]:
import collections
import pprint

import wetsuite.datasets
import wetsuite.helpers.net
import wetsuite.helpers.localdata
import wetsuite.helpers.etree
import wetsuite.datacollect.rechtspraaknl

In [27]:
rechtspraak_sample = wetsuite.datasets.load('rechtspraaknl-sample-xml')

#if False: # let's show a cherry-picked example
#xmlbytes = wetsuite.helpers.net.download('https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:RBZWB:2020:5807') 
#else: # or a random example   (note that a lot of them will be without text, that's normal)
_, xmlbytes = rechtspraak_sample.data.random_choice()

display( wetsuite.helpers.etree.debug_color( xmlbytes ) ) # print indented and colored, as an indication what the XML looks like

It seems there are specific ideas about what should go in which structure (e.g. para in parablock in paragroup, overview material in uitspraak.info).

At the same time, that structure is missing from a lot of documents.

TODO: see whether that changes over time.

In [28]:
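# parse the example fetched above into a tree first, the same way the loop further down does
example_tree = wetsuite.helpers.etree.strip_namespace( wetsuite.helpers.etree.fromstring( xmlbytes ) )
pprint.pprint(   wetsuite.datacollect.rechtspraaknl.parse_content( example_tree )   )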

{'bodytext': '\n'
             '\n'
             '1\n'
             'Procesverloop\n'
             '\n'
             '1.1.\n'
             '\n'
             'Het verloop van de procedure blijkt uit het verzoekschrift van '
             'het CIZ, ingekomen ter griffie op 28 februari 2024.\n'
             'Bij het verzoekschrift zijn de volgende bijlagen gevoegd:\n'
             '\n'
             'het indicatiebesluit op grond van artikel 3.2.3 van de Wet '
             'langdurige zorg van 8 februari 2024;\n'
             '\n'
             'de medische verklaring, opgesteld en ondertekend door [naam 1], '
             'specialist ouderengeneeskunde, van 19 februari 2024;\n'
             '\n'
             'de aanvraag voor een rechterlijke machtiging van 23 februari '
             '2024.\n'
             '\n'
             '1.2.\n'
             'Op 19 maart 2024 is de mondelinge behandeling aangehouden omdat '
             '[naam 2], casemanager bij Careyn, niet aanwezig was. Gebleken is '

## Inspect the fetched documents, looking for their text

As in the exploration of the BWB and CVDR data, note that there are [schemas](https://www.rechtspraak.nl/SiteCollectionDocuments/Schema-Open-Data-voor-de-Rechtspraak.zip)
describing the text's structure, but we should look at how closely the documents actually follow them.

Regardless of that, we also need to decide how to flatten that text when we want plain text,
which we do for this dataset.
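
To make "flattening" concrete, here is a minimal sketch using only the standard library. It is not how `parse_content()` works; the sample XML is made up, and the choice to treat `bridgehead` and `para` as separate lines is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a document body (not real rechtspraak.nl data)
SAMPLE = """<uitspraak>
  <bridgehead>Procesverloop</bridgehead>
  <parablock>
    <para>Het verzoekschrift is <emphasis>tijdig</emphasis> ingediend.</para>
    <para>De behandeling is aangehouden.</para>
  </parablock>
</uitspraak>"""

# Assumption: these element names each become one line of output
BLOCK_TAGS = {'bridgehead', 'para'}

def flatten_text(element):
    """Return the text of block-level elements, one per line, with whitespace collapsed."""
    lines = []
    for node in element.iter():
        if node.tag in BLOCK_TAGS:
            # itertext() also picks up text inside inline children such as <emphasis>
            text = ' '.join(''.join(node.itertext()).split())
            if text:
                lines.append(text)
    return '\n'.join(lines)

flat = flatten_text(ET.fromstring(SAMPLE))
print(flat)
```

A real flattener has more to decide, for example what to do with footnotes and tables, and a block nested inside another block would be emitted twice by this sketch.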

One of the things we do is count element paths, as in the [bwb_docstructure](koop_bwb_docstructure.ipynb) notebook.

As of this writing, that has guided how rechtspraaknl.parse_content() is implemented, though this needs more work.
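
For readers unfamiliar with the idea, path counting can be sketched with the standard library alone. This is an illustration of the technique, not the implementation of `wetsuite.helpers.etree.path_count`; the function name `count_element_paths` and the sample XML are hypothetical.

```python
import collections
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a document body (not real rechtspraak.nl data)
SAMPLE = """<conclusie>
  <parablock>
    <para>Eerste alinea met <emphasis>nadruk</emphasis>.</para>
    <para>Tweede alinea.</para>
  </parablock>
  <para>Losse alinea.</para>
</conclusie>"""

def count_element_paths(root):
    """Count how often each slash-separated element path occurs under root."""
    counts = collections.Counter()

    def walk(node, prefix):
        path = node.tag if not prefix else prefix + '/' + node.tag
        counts[path] += 1
        for child in node:
            walk(child, path)

    walk(root, '')
    return counts

sample_counts = count_element_paths(ET.fromstring(SAMPLE))
for path in sorted(sample_counts):
    print('%7d   %s' % (sample_counts[path], path))
```

Summing such counts over many documents shows which nestings are common and which are rare exceptions, which is what the next cell does over a sample.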

In [35]:
count_paths = collections.defaultdict(int)

for key, xmldoc_bytes in rechtspraak_sample.data.random_sample( 1000 ): # a small random sample keeps the amount to review reasonable
    tree = wetsuite.helpers.etree.fromstring( xmldoc_bytes )
    tree = wetsuite.helpers.etree.strip_namespace( tree )

    #print(  )
    #print( '-----------------------------------' )
    #print( key )

    # the body is either an <uitspraak> or a <conclusie>; prefer uitspraak when present
    body = tree.find('uitspraak')
    if body is None:
        body = tree.find('conclusie')
    if body is None: # neither is present, nothing to count
        continue

    # add this document's path counts to the running totals
    for path, count in wetsuite.helpers.etree.path_count( body ).items():
        count_paths[path] += count

    # also check that parsing works, and show the offending XML when it does not
    try:
        parsed = wetsuite.datacollect.rechtspraaknl.parse_content( tree )
        #print( parsed['bodytext'] )
    except Exception as e:
        print( 'ERROR', e )
        print( wetsuite.helpers.etree.tostring(body).decode('utf-8') )
        #raise


In [37]:
# Show those counted paths
path_count = list( count_paths.items() ) # will be a list of (path, count)
path_count.sort( key=lambda x:x[0] ) # sort by path
for path, count in path_count:
    print('%7d   %s'%(count, path))

     21   conclusie
      2   conclusie/bridgehead
     13   conclusie/conclusie.info
    179   conclusie/conclusie.info/para
      3   conclusie/conclusie.info/para/emphasis
    112   conclusie/conclusie.info/parablock
    156   conclusie/conclusie.info/parablock/para
     38   conclusie/conclusie.info/parablock/para/emphasis
    370   conclusie/footnote
    371   conclusie/footnote/para
    169   conclusie/footnote/para/emphasis
      1   conclusie/informaltable
      1   conclusie/informaltable/tgroup
      3   conclusie/informaltable/tgroup/colspec
      1   conclusie/informaltable/tgroup/tbody
      2   conclusie/informaltable/tgroup/tbody/row
      6   conclusie/informaltable/tgroup/tbody/row/entry
      6   conclusie/informaltable/tgroup/tbody/row/entry/para
      3   conclusie/informaltable/tgroup/tbody/row/entry/para/emphasis
      1   conclusie/orderedlist
      2   conclusie/orderedlist/listitem
      2   conclusie/orderedlist/listitem/para
     68   conclusie/para
     10  

## 