If you want to start playing with this without installation, try: &nbsp; 
<a href="https://colab.research.google.com/github/WetSuiteLeiden/data-collection/blob/main/split-experiments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# (only) in colab, run this first to install wetsuite from (the most recent) source.   For your own setup, see wetsuite's install guidelines.
#!pip3 install -U --no-cache-dir --quiet https://github.com/knobs-dials/wetsuite-dev/archive/refs/heads/main.zip

# Purpose of this notebook

Figure out how to open various formats of documents, and split them into smaller pieces of text.

Probably at a paragraph-like level, and providing document-structural hints where we can -- 
while explaining why we can _not_ go as far as guaranteeing anything like the natural structure of a document.


Note: this notebook is more about the _development_ of this idea, discussion of the tradeoffs, and such. 
There will be a simpler notebook in wetsuite-notebooks later that gives some examples as to its _use_ on existing datasets
(and the sources they come from).

This entire notebook may disappear into a short summary later.

## Extracting, but also splitting

It is probably useful if this project tries to handle each kind of raw document that it
points to (and/or provides in datasets),
as well as other kinds you are likely to use.

The combination of opening varied data formats, and splitting them as well,
seems counter to the 'each piece of code does one thing well' philosophy,
and in part it is, but other notebooks already address how to dig into specific structures of XML, PDF, and others.




Here, we aim for a certain ease of use. 
There are also varied methods that would benefit from receiving such text in bite-sized chunks,
and/or from those chunks being split in a way that is at least _somewhat_ useful.

You might e.g.
- try to ignore introductory things like the "Wij Beatrix, bij de gratie Gods" preamble in laws,
  - because there are a bunch of named entities there, and few to none of them are even relevant to the law.
  - seeing that within the text, and knowing we have a chunk of text no larger than a paragraph, makes it easy to ignore.

- try to find similar court decisions based on some content analysis, e.g.
   - focus on the introduction to extract the topic, and on the decision to estimate what was done with that topic,
   - possibly ignoring discussion/argumentation in the middle
   - possibly ignoring the definitions (can be a good indicator, but only useful in comparisons if both documents have them)

- split text whenever it seems to be switching to different topic 

  - e.g. by trying to figure out what section each paragraph belongs to 
    (whether hinted at by the document or even analysed)

- try tricks like feeding smaller chunks at a time into a "nearby words" type method
  to see which relations come out most _consistently_ throughout a document.
  - (you may then care more about the amount of distance and less about the split being natural)

...meaning that an automated way to suggest which parts to ignore and/or what the parts are,
even if crude, could be a start. 

If we can, we want to support such approaches, at least a little.
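
As a rough illustration of the "smaller chunks at a time" idea in the list above, here is a minimal sketch that greedily joins paragraph-sized fragments into chunks of at most a given number of words. It assumes you already have a list of plain-text fragments; the name `chunk_by_words` and its `max_words` parameter are made up for this example and are not part of wetsuite.

```python
def chunk_by_words(fragments, max_words=200):
    """Greedily join text fragments into chunks of at most max_words words.

    Hypothetical helper for illustration, not part of wetsuite.
    A single fragment longer than max_words is kept whole rather than split.
    """
    chunks = []
    current = []        # fragments collected for the chunk being built
    current_words = 0   # number of words currently in `current`

    for text in fragments:
        n_words = len(text.split())
        if n_words == 0:            # skip empty fragments
            continue
        # start a new chunk if adding this fragment would exceed the budget
        if current and current_words + n_words > max_words:
            chunks.append(' '.join(current))
            current, current_words = [], 0
        current.append(text)
        current_words += n_words

    if current:                     # do not forget the last, partial chunk
        chunks.append(' '.join(current))
    return chunks


# made-up example input, roughly what the start of a law might split into
example_fragments = [
    'Wij Beatrix, bij de gratie Gods, Koningin der Nederlanden',
    '',
    'Artikel 1',
    'In deze wet wordt verstaan onder ...',
]
for chunk in chunk_by_words(example_fragments, max_words=10):
    print(repr(chunk))
```

Joining on whole fragments only means chunk boundaries always coincide with boundaries the document already had, which keeps the split crude but never cuts through a paragraph.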

### Why we can't go very far in this

For one, goals vary.

It depends a little on who our userbase is.
* Legal researchers will often focus on one topic and one data source
  - A mix of sources isn't necessarily a problem for them, in that documents should be fairly consistent within that one source

* NLP researchers, on the other hand, will probably just take the "more text is better" approach
  - A mix of sources may or may not be an issue, because they may care _only_ about having a lot of text,
    and do any labeling themselves
  - but e.g. mixing output from different sources will mean a mix of quality

You can often do somewhat better once you focus on a specific kind of document, source, research topic, and such.

<!-- -->

Also, input varies.

Consider that you will expect something sensible for _every_ document.
Maybe one kind of input (e.g. XML) gives you very well-labeled sections, citations, and vocabulary references for one document type.
Maybe another (e.g. PDF) gives nothing more than typeset text, and it takes work to guarantee as little as unbroken sentences.
The common denominator between the two isn't a lot.

Inventing a middle way and wrangling each format into that probably has value,
yet that kind of creativity, if present at all, should probably be a tool in your hands as a researcher,
and should be verifiable rather than just trusted, not invisibly and quietly decided for you.

<!-- -->

So this notebook offers tools and examples, in part to give you a sense of how well, or how poorly, this works on the documents you work on.

Even _the way_ it currently works should be considered just one preliminary way of _maybe_ doing it -- you should treat it as provisional,
as an "I'll take out all the bits I need and do it myself (and in the cases it happens to already do what I want, great)" kind of thing.

<!--
## On an implementation level

How we envision this is:
 * in part about getting text out of different formats - HTML, XML, PDF, possibly document formats
 * in part about getting small fragments of text out of each, with some supporting information
 * and let _you_ decide when to join or split those fragments, based on the added information.

We would probably end up with a list of handlers like
 - if you say you recognize this format
 - read it yourself and hand out the parts
 - suggest how to split it


The "I recognize this" should probably have a specificity, e.g. 
 - I know this is a PDF, from Officiele Publicaties, and a specific waterschappen-specific template
 - I know this is a PDF, from Officiele Publicaties
 - I know this is a PDF, I'll give you ''something''

One reason for this approach is that this can be extended and refined over time.
-->

<!--
This still raises a number of questions
- how to do that with different input formats?
  - beyond the "detect type how?" and "open file how?" level, also...

- how granular should the output be? Sections? Paragraphs? Sentences? Blocks that might actually be split at arbitrary points belong together?
  - it's probably easier to join later than to split later, so smallish is good. Paragraphs?

- can we provide intermediate data that leaves some decisions up to you? It would e.g. be nice if you could use the same  thing to 
  - get sections
  - get paragraphs
  - break up the thing to get e.g. ~200 words at a time, almost regardless of structure


- how much are users expected to do, how much smartness could be merged in later?
  - e.g. "hey this header says 1.  and the next one says 2." needs some refinement but can later be quite useful
  - and does that imply that the meta could also use hints like "hint:[ ('feature', 'new-section'), ('probability', 0.7) ]"

- can we have any document-type-specific handling? (e.g. remove headers and footers from PDF)

- can we have any document-set-specific handling? (e.g. "kamervraag PDFs use a neat template, we can ease _just_ the text out of it fairly easily")
  - and have that be extensible, so we can incrementally improve it?

- how much of that is for future projects because it really is too much now?

- how useful is it to point back to the original?
  - in XML laws this might be useful. In most others not so much. This might be out of scope, really.


Indeed, to provide a relatively universal intermediate document format is a nontrivial exercise even when you care only
about the _aesthetics_ of the result, let alone when seeming to make any promises about the structure you hand over.
-->

## Some experiments

TODO
<!--

is a scope of 'how good/bad is it to split here',
which you can force into smaller and larger chunks via some parameters.


What you can expect
- a stream of (metadata, text)

- a function that handles that at a somewhat higher level, like
  - get section-sized things
  - get paragraph-sized fragments
  - break up the thing to get e.g. ~200 words at a time, almost regardless of structure

- ...which suggests that the intended length of those text fragments should be on the order of a paragraph (or similar, often-larger unbroken text chunk), 
  - because it's easier to join later than to split later.
  - if you wanted sentence splitting, you might want to do that in your own post-processing; this seems out of scope for our cruder goals

- metadata will probably 
  - not be things like header, section, paragraph, sentence
  - be more like "document seems to indicate this is three-deep, and the last header was 'intro'; do with that information what you will"

- for this to always be crude.
  - Do not expect this to be very structured, unless you can restrict yourself to a document set that is uniform enough in format that the _output_ is similarly uniform.
  -->

In [45]:
import random, pprint, collections, warnings

import wetsuite.helpers.koop_parse
import wetsuite.helpers.localdata
import wetsuite.helpers.etree
import wetsuite.helpers.escape
import wetsuite.helpers.util
import wetsuite.helpers.split
import wetsuite.datasets
import wetsuite.extras.pdf
import wetsuite.helpers.notebook
import wetsuite.helpers.net       # used below via wetsuite.helpers.net.download()

# Collect some varied documents

...primarily to see how many get handled decently

In [34]:
example_docs = {} # some_indicative_id -> docbytes

if 1: # BWB XML
    bwb  = wetsuite.datasets.load('bwb-mostrecent-xml')
    for bwbid, docbytes in bwb.data.random_sample(100):
        example_docs['xml:'+bwbid] = docbytes

if 1: # CVDR XML
    cvdr = wetsuite.datasets.load('cvdr-mostrecent-xml')
    for cvdrid, docbytes in cvdr.data.random_sample(100):
        example_docs['xml:CVDR'+cvdrid] = docbytes

if 1: # CVDR HTML
    cvdr = wetsuite.datasets.load('cvdr-mostrecent-html')
    for cvdrid, docbytes in cvdr.data.random_sample(100):
        example_docs['html:CVDR'+cvdrid] = docbytes

if 1: # Rechtspraak XML
    rechtspraak_xml = wetsuite.helpers.localdata.LocalKV('rechtspraak_fetched.db', key_type=str, value_type=bytes, read_only=True)
    for rsurl, xmlbytes in rechtspraak_xml.random_sample(100):
        example_docs[rsurl] = xmlbytes
    # TODO:
    #rechtspraak  = wetsuite.datasets.load('rechtspraaknl-xml')
    ##for bwbid, docbytes in bwb.data.random_sample(100):
    ##    example_docs[bwbid] = docbytes
    #rechtspraak.data.random_choice()
    #or maybe cache-fetching the URLs mentioned in rechtspraaknl-struc ?

if 1:
    # these should soon be datasets, but for now are internal collection stores
    bus_data = wetsuite.helpers.localdata.LocalKV( 'bus_data.db', key_type=str, value_type=bytes )

    for path in bus_data.random_keys(2000):
        if 'metadata' in path or 'changelog' in path: # (should filter this out above, actually)
            continue
        if '.xml' in path and random.uniform(0,1) < 0.5: # much of this is xml (data or metadata), try to bring up the HTML and PDF
            continue
        if 'gmb' in path and random.uniform(0,1) < 0.9: # roughly three quarters of this store is gmb, try to balance that a little bit
            continue
        bytedoc = bus_data.get( path )
        example_docs[ 'bus:'+path ] = bytedoc

if 0: # CONSIDER: Woo PDF ?
    pass

if 1: # some cherry-picked PDFs
    example_docs[ 'simple2page' ] = wetsuite.helpers.net.download('https://zoek.officielebekendmakingen.nl/wsb-2022-9718.pdf')
    example_docs[ '2023D51633' ] = wetsuite.helpers.net.download('https://www.tweedekamer.nl/downloads/document?id=2023D51633')
    #example_docs[ 'stb-1952-10' ] = wetsuite.helpers.net.download('https://repository.overheid.nl/frbr/officielepublicaties/stb/1952/stb-1952-10/1/pdf/stb-1952-10.pdf')
    #example_docs[ 'stb-1975-102' ] = wetsuite.helpers.net.download('https://repository.overheid.nl/frbr/officielepublicaties/stb/1975/stb-1975-102/1/pdf/stb-1975-102.pdf')
    example_docs[ '3col' ] = wetsuite.helpers.net.download('https://zoek.officielebekendmakingen.nl/stcrt-1995-28-p9-SC1944.pdf')

### See what kind of mix of documents we have

In [35]:
doc_pairs = list( example_docs.items() )
random.shuffle( doc_pairs )
for key, by in doc_pairs[:20]:
    print( '%-70s %s'%(key, by[:60]) )

bus:/2020/02/14/stb/stb-2020-54/stb-2020-54.xml                        b'\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?>\r\n<officiele-public'
bus:/2022/06/02/ah/ah-tk-20212022-2945/ah-tk-20212022-2945.xml         b'\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?>\r\n<officiele-public'
xml:CVDR686207                                                         b'<?xml version="1.0" encoding="utf-8"?><cvdr xmlns="http://st'
xml:BWBR0043402                                                        b'<?xml version="1.0" encoding="UTF-8"?><toestand xmlns:xsi="h'
https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:GHSHE:2011:557 b'<?xml version="1.0" encoding="utf-8"?>\r\n<open-rechtspraak>\r\n'
xml:CVDR713038                                                         b'<?xml version="1.0" encoding="utf-8"?><cvdr xmlns="http://st'
bus:/1998/03/03/ah/ah-tk-19971998-770/ah-tk-19971998-770.xml           b'<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE vraagdoc PUB'
xml:CVDR662450             

### See what portion gets detected as something known

No fragment output shown yet,
just seeing how many of the documents we gave it seem to be covered by the parsers/splitters currently registered.

In [37]:
count = collections.defaultdict(int)

for someid,docbytes in wetsuite.helpers.notebook.ProgressBar( doc_pairs ):
    options = wetsuite.helpers.split.decide( docbytes )
    if len(options)==0:
        #print( '\n------- %s -------'%someid )
        #print( 'NO PARSER' )        
        count['no-parser-applied'] += 1

    else:
        score, splitter = options[0]
        if score > 100:
            count['only-lowscore'] += 1
            continue

        with warnings.catch_warnings(): # temporarily disable warnings
            warnings.simplefilter("ignore")
            text = ' '.join( txt  for _,_,txt in splitter.fragments() )
        if len( text.strip() ) == 0:
            count['no-output'] += 1
            continue

        count['seems-okay'] += 1

dict( count )

  0%|          | 0/646 [00:00<?, ?it/s]

{'seems-okay': 532, 'no-output': 102, 'no-parser-applied': 12}
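
To make the idea of "registered splitters that each say whether they apply" a little more concrete, here is a toy sketch. It is not how `wetsuite.helpers.split.decide` is implemented: the class names, the score values, and the assumption that a lower score means a more specific match are all invented for illustration (the last is merely suggested by the cell above treating `score > 100` as a weak match).

```python
class ToyXmlSplitter:
    """Hypothetical splitter that claims anything that starts like an XML document."""
    def __init__(self, docbytes):
        self.docbytes = docbytes

    def score(self):
        # in this toy, lower is better, and None means "does not apply"
        head = self.docbytes[:200]
        if head.startswith(b'\xef\xbb\xbf'):   # skip a UTF-8 byte order mark, if present
            head = head[3:]
        return 50 if head.lstrip().startswith(b'<?xml') else None


class ToyFallbackSplitter:
    """Hypothetical last resort: applies to any non-empty input, with a poor score."""
    def __init__(self, docbytes):
        self.docbytes = docbytes

    def score(self):
        return 1000 if self.docbytes.strip() else None


TOY_REGISTRY = [ToyXmlSplitter, ToyFallbackSplitter]


def toy_decide(docbytes, thresh=1500):
    """Return (score, splitter) pairs that apply and score under thresh, best first."""
    options = []
    for splitter_class in TOY_REGISTRY:
        splitter = splitter_class(docbytes)
        score = splitter.score()
        if score is not None and score < thresh:
            options.append((score, splitter))
    options.sort(key=lambda pair: pair[0])
    return options


for doc in (b'\xef\xbb\xbf<?xml version="1.0"?><doc/>', b'%PDF-1.4 ...', b''):
    print([(score, type(s).__name__) for score, s in toy_decide(doc)])
```

A registry like this can be extended one handler at a time, with more specific handlers given better scores than generic ones.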

### See what the splitter gives us

Start with just one random one, already split

In [44]:
# re-run until it gives output -- the no-output ones are likely metadata-only rechtspraak documents (but there are also things we have yet to implement)
key, docbytes = random.choice( doc_pairs )
score, splitter = wetsuite.helpers.split.decide( docbytes )[0] # assumes we'll find one

display( wetsuite.helpers.split.SplitDebug( splitter.fragments() ) )

| meta | intermediate | len | text |
|---|---|---|---|
| `{'hints': [], 'part': '/open-rechtspraak/uitspraak/uitspraak.info', 'path': '/open-rechtspraak/uitspraak/uitspraak.info/bridgehead'}` | `{'raw': b'<bridgehead role="bold">RECHTBANK ZE' b'ELAND-WEST-BRABANT</bridgehead>\n ', 'rawtype': 'xml'}` | 30 | `'RECHTBANK ZEELAND-WEST-BRABANT'` |
| `{'hints': [], 'part': '/open-rechtspraak/uitspraak/uitspraak.info', 'path': '/open-rechtspraak/uitspraak/uitspraak.info/para[1]'}` | `{'raw': b'<para/>\n ', 'rawtype': 'xml'}` | 0 | `''` |
| `{'hints': [], 'part': '/open-rechtspraak/uitspraak/uitspraak.info', 'path': '/open-rechtspraak/uitspraak/uitspraak.info/parablock[1]'}` | `{'raw': b'<parablock>\n <para>Cluster II H' b'andelszaken</para>\n </parablock>\n' b' ', 'rawtype': 'xml'}` | 23 | `'Cluster II Handelszaken'` |
| `{'hints': [], 'part': '/open-rechtspraak/uitspraak/uitspraak.info', 'path': '/open-rechtspraak/uitspraak/uitspraak.info/para[2]'}` | `{'raw': b'<para/>\n ', 'rawtype': 'xml'}` | 0 | `''` |
| `{'hints': [], 'part': '/open-rechtspraak/uitspraak/uitspraak.info', 'path': '/open-rechtspraak/uitspraak/uitspraak.info/parablock[2]'}` | `{'raw': b'<parablock>\n <para>Breda</para>' b'\n </parablock>\n ', 'rawtype': 'xml'}` | 5 | `'Breda'` |
| `{'hints': [], 'part': '/open-rechtspraak/uitspraak/uitspraak.info', 'path': '/open-rechtspraak/uitspraak/uitspraak.info/para[3]'}` | `{'raw': b'<para/>\n ', 'rawtype': 'xml'}` | 0 | `''` |
| `{'hints': [], 'part': '/open-rechtspraak/uitspraak/uitspraak.info', 'path': '/open-rechtspraak/uitspraak/uitspraak.info/para[4]'}` | `{'raw': b'<para/>\n ', 'rawtype': 'xml'}` | 0 | `''` |
| `{'hints': [], 'part': '/open-rechtspraak/uitspraak/uitspraak.info', 'path': '/open-rechtspraak/uitspraak/uitspraak.info/parablock[3]'}` | `{'raw': b'<parablock>\n <para>zaaknummer /' b' rekestnummer: C/02/402562 / HA RK 2' b'2-205</para>\n </parablock>\n ', 'rawtype': 'xml'}` | 53 | `'zaaknummer / rekestnummer: C/02/402562 / HA RK 22-205'` |
| `{'hints': [], 'part': '/open-rechtspraak/uitspraak/uitspraak.info', 'path': '/open-rechtspraak/uitspraak/uitspraak.info/para[5]'}` | `{'raw': b'<para/>\n ', 'rawtype': 'xml'}` | 0 | `''` |
| `{'hints': [], 'part': '/open-rechtspraak/uitspraak/uitspraak.info', 'path': '/open-rechtspraak/uitspraak/uitspraak.info/parablock[4]'}` | `{'raw': b'<parablock>\n <para>\n <em' b'phasis role="bold">Beschikking van 1' b'5 februari 2023</emphasis>\n </p' b'ara>\n </parablock>\n ', 'rawtype': 'xml'}` | 32 | `'Beschikking van 15 februari 2023'` |

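Since the fragments are plain `(meta, intermediate, text)` tuples, post-processing them is ordinary Python. The sketch below drops empty fragments and groups the remaining text by the `part` value in the metadata. The sample data is hand-copied from the first few rows of the output above (with the intermediate value left out), and the helper name `group_by_part` is made up for this example.

```python
def group_by_part(fragments):
    """Group non-empty fragment texts by meta['part'], preserving document order.

    Hypothetical helper for illustration; expects (meta, intermediate, text) tuples.
    """
    grouped = {}
    for meta, _intermediate, text in fragments:
        if not text.strip():        # e.g. the empty <para/> elements
            continue
        grouped.setdefault(meta.get('part'), []).append(text)
    return grouped


part = '/open-rechtspraak/uitspraak/uitspraak.info'
sample_fragments = [
    ({'hints': [], 'part': part, 'path': part + '/bridgehead'},   None, 'RECHTBANK ZEELAND-WEST-BRABANT'),
    ({'hints': [], 'part': part, 'path': part + '/para[1]'},      None, ''),
    ({'hints': [], 'part': part, 'path': part + '/parablock[1]'}, None, 'Cluster II Handelszaken'),
    ({'hints': [], 'part': part, 'path': part + '/para[2]'},      None, ''),
    ({'hints': [], 'part': part, 'path': part + '/parablock[2]'}, None, 'Breda'),
]

for part_path, texts in group_by_part(sample_fragments).items():
    print(part_path, texts)
```

Whether `part` is a useful grouping key will depend on the document type; for other splitters a different piece of metadata may make more sense.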

Give a summary of all the stuff we just selected:

In [39]:
for key, docbytes in doc_pairs:
    if 'metadata' in key: # we care about documents; skip metadata-only, in case the above accidentally filtered them in
        continue

    thresh = 1500
    splitters = wetsuite.helpers.split.decide(docbytes, thresh=thresh)
    if len(splitters)==0:
        print( 'WARN for  %-30s - no splitter says it applies  (under threshold %s)'%(key, thresh) )
        # "stop on the first problem case and display it so we can add it" style logic:
        #if b'<html' in docbytes:
        #    display( wetsuite.helpers.etree.debug_color(docbytes) )
        #    break
        continue
    
    for score, fragproc in splitters: # use each processor that said they would be useful
        with warnings.catch_warnings(): # temporarily disable warnings
            warnings.simplefilter("ignore")
            frags = fragproc.fragments()
        if len(frags) == 0:
            print( 'WARN for  %-30s - no output from splitter %s'%(key, type(fragproc).__name__ ) )
            #if 'html.zip' in key:
            #    print(wetsuite.helpers.util.get_ziphtml(docbytes))
            #    print( fragproc.soup )
        else:
            textlen = sum( len(fragtext)  for _,_,fragtext in frags )
            if textlen < 100:
                print( f'WARN for  {key:^50s} - {type(fragproc).__name__} gave {textlen} chars of text' )
            else:
                print( f'INFO for  {key:^50s} - {type(fragproc).__name__} gave {textlen} chars of text' )

            #for o1,o2,o3 in frags:
            #    print( type(o2) )

            if 0: # you may want to disable this print-everything when doing "are we missing anything" debug because it is a _lot_ of output
                #display( wetsuite.helpers.etree.debug_color(docbytes) )
                display( wetsuite.helpers.split.SplitDebug( frags ) )
                break
    #else:
    #    break

    #break

INFO for   bus:/2020/02/14/stb/stb-2020-54/stb-2020-54.xml   - Fragments_XML_OP_Stb gave 525 chars of text
INFO for  bus:/2022/06/02/ah/ah-tk-20212022-2945/ah-tk-20212022-2945.xml - Fragments_XML_BUS_Kamer gave 10295 chars of text
INFO for                    xml:CVDR686207                   - Fragments_XML_CVDR gave 101245 chars of text
INFO for                   xml:BWBR0043402                   - Fragments_XML_BWB gave 18317 chars of text
WARN for  https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:GHSHE:2011:557 - no output from splitter Fragments_XML_Rechtspraak
INFO for                    xml:CVDR713038                   - Fragments_XML_CVDR gave 111380 chars of text
INFO for  bus:/1998/03/03/ah/ah-tk-19971998-770/ah-tk-19971998-770.xml - Fragments_XML_BUS_Kamer gave 4201 chars of text
INFO for                    xml:CVDR662450                   - Fragments_XML_CVDR gave 49660 chars of text
INFO for                   xml:BWBR0039397                   - Fragments_XML_BWB gav