<a href="https://colab.research.google.com/github/WetSuiteLeiden/data-collection/blob/master/web_kansspelautoriteit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# (only) in colab, run this first to install wetsuite.
# For your own setup, see wetsuite's install guidelines.
!pip3 install -U wetsuite

## Purpose of this notebook

Collect data from the contents of the PDFs under https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/

We are not aware of any API, so are currently collecting information based on scraping web pages.

Many of these PDFs do contain a text layer, but in the process of writing this we found it is incomplete in some cases, 
so instead we uniformly apply OCR to all documents, and then spend some time polishing that OCR for better-flowing text output.

We combine some specific information, from distinct parts of this process, to produce augmented data.

### Consider you may not want to read this

This is how a dataset was created.
If you only care about how to _use_ that dataset, see [using_dataset_kansspelautoriteit.ipynb](../using_dataset_kansspelautoriteit.ipynb) instead.

You would only read the below if you care about the nitty-gritty of how it was made,
and want to pick out some ideas or code to do something similar yourself in terms of 
extracting data from a website,
reorganizing that data, 
applying OCR,
and doing some specific polishing and extraction.

### Preparation
Imports we will use, and some helper functions.



In [1]:
#imports and helpers we'll use later.
# Here in part to signal that you may not have some of the additional libraries this uses that are not part of core wetsuite

import hashlib, urllib, json, textwrap, re, time, pprint, random

import bs4
import numpy

import wetsuite.helpers.localdata
import wetsuite.helpers.meta
import wetsuite.helpers.net
import wetsuite.helpers.format
import wetsuite.helpers.notebook
import wetsuite.helpers.strings
import wetsuite.helpers.date
import wetsuite.extras.pdf # the pdf and ocr helpers are thin and specific wrappers around other libraries
import wetsuite.extras.ocr # that you could easily use more directly, but they still make code a little simpler
import wetsuite.datasets

from wetsuite.extras.ocr import doc_extent, page_fragment_filter, bbox_max_x, bbox_max_y, bbox_min_x, bbox_min_y, bbox_height

In [2]:
def hash(data: bytes):
    ' calculate SHA1 hash of some bytestring data '
    s1h = hashlib.sha1()
    s1h.update( data )
    return s1h.hexdigest()


def find_eur_money(s:str, minimum=0):
    ''' Given a string, returns a list of substrings that look like money amounts, e.g.
            find_eur_money('   EUR 10,-  ')       == ['10']
            find_eur_money('   EUR 10.    ')      == ['10']
            find_eur_money('   EUR100.000,-')     == ['100000']
            find_eur_money('   EUR100000  ')      == ['100000']
            find_eur_money('   \u20ac100.000')    == ['100000']   # dots are stripped as thousands separators; commas are not, because Dutch uses the comma as the decimal separator.  That should probably be configurable
            find_eur_money('   \u20ac 100000')    == ['100000']
            find_eur_money('   \u20ac 100000  ')  == ['100000']
        (where \u20ac is the euro symbol's unicode codepoint, expressed how python takes it)
    '''
    ret = []
    # the with-text-context was partially for debug, but might actually be useful to return
    for _before, _match_str, match_object, _after in wetsuite.helpers.strings.findall_with_context(r"(?:EUR|\u20ac)\s*([0-9.]+)\b", s, 20):
        cap = match_object.groups()[0].replace('.','')
        try:
            if int(cap) < minimum:
                continue
        except ValueError: # not parseable as a number (e.g. only dots)? Skip it
            continue
        #print( '[%s]%s[%s] -> %r'%(before, match_str, after, cap) )
        ret.append(cap)
    return ret
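
The matching idea in `find_eur_money` can be tried in isolation. The following is a minimal sketch using only the standard library, assuming the same regular expression; the name `eur_amounts` is hypothetical and it returns integers rather than strings, which is a choice made for this illustration and not what the helper above does.

```python
import re

# Hypothetical stand-alone variant for illustration; not the notebook's helper.
EUR_AMOUNT = re.compile(r"(?:EUR|\u20ac)\s*([0-9.]+)\b")

def eur_amounts(text: str, minimum: int = 0) -> list[int]:
    """Return euro amounts found in text as integers, dropping those below minimum."""
    amounts = []
    for match in EUR_AMOUNT.finditer(text):
        digits = match.group(1).replace(".", "")   # '100.000' -> '100000'
        if not digits:                             # the capture was only dots
            continue
        value = int(digits)
        if value >= minimum:
            amounts.append(value)
    return amounts

# The second amount is below the minimum, so only the first is kept.
print(eur_amounts("boete van EUR 350.000,- en kosten van \u20ac 1.200", minimum=5001))
```

The `minimum` filter is what keeps small amounts (fees, stakes mentioned in passing) out of a case's list of likely fines.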

### Fetching the data

In [3]:
# Use a local store so that we only need to fetch the PDFs once, only render PDF pages once 
pdfstore   = wetsuite.helpers.localdata.LocalKV('kansspel_pdfstore.db', key_type=str, value_type=bytes )     # URL -> PDF file bytestring
ocrstore   = wetsuite.helpers.localdata.LocalKV('kansspel_ocrstore.db', key_type=str, value_type=bytes )     # page-specific key -> json as bytes

#### Fetching list of cases and their basic summary

What we have is a website we can look at. 

What we want is metadata and text per case.

Let's start by scraping the webpage to figure out all cases, all documents, and what cases they relate to.

These document fetches are cached but the list-of-case and detail-page fetches are not, because the amount of cases will change, and case text and details can change as they progress.
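
To illustrate the fetch-once idea behind the document cache, here is a minimal sketch built on `sqlite3` from the standard library. `TinyKV`, `fetch_cached`, and `fake_download` are hypothetical names for this example; this is not how wetsuite's `LocalKV` or `cached_fetch` are implemented, and the returned `(data, came_from_cache)` pair is an assumption of the sketch.

```python
import sqlite3
from typing import Callable

class TinyKV:
    """Hypothetical minimal key-value store on SQLite; not wetsuite's LocalKV."""
    def __init__(self, path: str = ":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value BLOB)")

    def get(self, key: str):
        row = self.conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
        return None if row is None else row[0]

    def put(self, key: str, value: bytes) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", (key, value))
        self.conn.commit()

def fetch_cached(store: TinyKV, url: str, download: Callable[[str], bytes]):
    """Return (data, came_from_cache); call download(url) only on a cache miss."""
    data = store.get(url)
    if data is not None:
        return data, True
    data = download(url)
    store.put(url, data)
    return data, False

# A stand-in for a real download, so the example runs without network access.
calls = []
def fake_download(url: str) -> bytes:
    calls.append(url)
    return b"%PDF-1.4 example bytes"

store = TinyKV()
first, first_cached = fetch_cached(store, "https://example.com/a.pdf", fake_download)
second, second_cached = fetch_cached(store, "https://example.com/a.pdf", fake_download)
print(first_cached, second_cached, len(calls))
```

Keying on the URL suits documents that do not change once published; pages that do change, such as the case list, are better fetched fresh each run.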

In [None]:
# The purpose of this section is to fill the following variable:
extracted_cases = []  # list of  (casename, [list of document dicts], [same-sized list of ocr results])

# You're not expected to grasp all the following nested code - 
#   the main reason it is hard to read is that we spend a lot of code 
#   trying to get structured information from a not-so-structured HTML page.


maxpage  = 9999999   # will be set to the actual number of pages by the first (well, every) page we fetch
cur_page = 0         # zero-based counting in the pagination

print( "FETCHING CASE SUMMARIES" )
while cur_page <= maxpage:
    page_url = 'https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/?pager_page=%d'%cur_page
    print( page_url )
                
    page_data = wetsuite.helpers.net.download(page_url)
    time.sleep(2) # be somewhat nice to the server
    soup = bs4.BeautifulSoup( page_data, 'lxml' )

    # get the amount of pages, from the pagination links
    pagelinks = soup.select('a[class~="pager_step"][class~="pagina"]')   # CSS selector looking for both of those classes set in a whitespace-token list
    maxpage = int( pagelinks[-1].get('data-page') )                      # -1: last of those we see on the page.  Actually wrong on the last page?

    print( "page %d of %d"%( cur_page+1, maxpage+1 ) ) # numbering is zero-based,  print out one-based for humans

    # fetch all links to specific case detail pages
    for detail_page_a in soup.select('#results a[class~="siteLink"]'): # pick out the links (URLs) of the detail page of each case
        detail_page_url = detail_page_a.get('href') # these are already absolute  (otherwise we'd have to urljoin them)
        case_name = detail_page_a.text.replace('/','_')
        case_dict = {
            'name':case_name,
            'date_range':[],
            'money':[],
            'ecli':[],
            'case_detail_url':detail_page_url,
        }

        print( f'  Case: {repr(case_name):40s}', end='' )

        # Note: the date shown here and on the detail page may be the start date?   It may still be useful to distinguish cases for repeat offenders


        ### fetch that case's detail page, and find all PDF links on it ##########################################################
        detail_page_data = wetsuite.helpers.net.download( detail_page_url )
        detail_soup = bs4.BeautifulSoup( detail_page_data, 'lxml' )

        # This section used to be three lines long, 
        #   until we decided that hey, maybe that status would be nice to have.
        #   Then we discovered this is a free-form mess,
        #   and this is the hand-crafted combination of exception cases that will probably break in the future.


        #print( detail_page_url )
        
        ## Construct a list of dicts, one for each document,  with keys 'title', 'status', and 'pdf_url'
        doc_dicts = [] #  will be that list of dicts
        cur       = {} # 'currently collecting into this' dict 

        def check_add_clear():
            ''' if there's sensible content, we add it to our list of docs.
                clears cur for next document
            '''
            global doc_dicts, cur # works because this notebook cell runs at module level
            # if effectively empty, do nothing at all
            if len(cur.keys())==1  and  'title' in cur  and  cur['title']=='': # effectively empty. Also one of a handful of exception cases
                pass
            elif len(cur)>0: # there is some content
                # skip (and report) incomplete entries, so that I fix the scraping   (or accept it as an edge case)
                if 'title' not in cur:
                    print('SKIP: scraping code is missing title, cur=%r'%cur)
                
                elif 'pdf_url' not in cur:
                    print( 'SKIP: scraping code is missing pdf_url, cur=%r'%cur)
                
                # a missing 'status' is accepted, because that happens, 
                #   see e.g. https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/n1-interactive/ or https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/1x-corp-exinvest/
                
                #print("ADD %r"%cur)
                else:
                    doc_dicts.append(cur)
            cur = {}


        # this whole part used to be short, and readable(ish), but the pages are a bit messy so it ended up with a lot of exception cases.
        pdflinks = list( detail_soup.select('a[class~="importLink"][class~="pdf"]') )   # (now) used only to find the content block that contains these

        # (and things like select('div[class~="grid-blok"] div[class~="grid-element"] div[class~="grid-inside"] div[class~="iprox-content"]') is not specific enough, that's a general template
        if len(pdflinks)==0:
            print("WARNING - no content?")
        else:
            # We expect a sequence of one or more of: 
            # - h2                    (title)
            # - p with an <a> inside  (link)
            # - h2-or-h3              (the header saying "status")
            # - p                     (the actual status text)            

            # ...but I've seen an initial paragraph in front of it. 
            # ...and a header and a paragraph in front of it.
            child = pdflinks[0].parent.previous_element # try to position on the first document's title header. Parent would be the p
            while child.name is None and child.previous_sibling is not None: # find previous non-text node.   ...there must be a better way of doing this.
                child = child.previous_sibling
            #print( 'Chosen starting spot: ', child )

            while child is not None:
                #print( child )
                if child.name: # filter out text nodes (iirc)
                    pdflinks = list( child.select('a[class~="importLink"][class~="pdf"]') )
                    has_pdflinks = len(pdflinks) > 0
                    alltext = (''.join(child.findAll(string=True))).strip().lower().strip(':')

                    # augmenting some things we find
                    case_dict['money'].extend( find_eur_money(alltext, minimum=5001) )
                    case_dict['ecli'].extend( wetsuite.helpers.meta.findall_ecli(alltext, rstrip_dot=True) )

                    #print( "LOOKING AT %r"% str(child).strip() )
                    if child.name in ('h2','h3'):
                        if has_pdflinks: # weird case in https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/vriendenloterij/
                            cur['pdf_url'] = urllib.parse.urljoin( detail_page_url, pdflinks[0].get('href') )   # make relative URLs absolute
                            if 'title' not in cur: # would be overwritten in almost all cases
                                cur['title'] = alltext # but helps deal with https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/1x-corp-exinvest/
                        else:
                            if alltext == 'status': # header that just says 'status'
                                #print('HEADER - "status"')
                                pass 
                            else: # probably a title
                                #print('HEADER - title?')
                                check_add_clear() # starts a new one, so:
                                cur['title'] = child.text
                    elif child.name == 'p':
                        if len(cur)==0: # dict empty?
                            pass # probably an initial paragraph
                        else:
                            if has_pdflinks: #elif 'pdf_url' not in cur:
                                #print('FIRST P - PDF URL')
                                cur['pdf_url'] = urllib.parse.urljoin( detail_page_url, pdflinks[0].get('href') )   # make relative URLs absolute
                            else: #elif 'pdf_url' in cur: # status text
                                #print('SECOND P - STATUS TEXT')
                                cur['status'] = child.text
                child = child.next_sibling
            check_add_clear()

            #pprint.pprint( doc_dicts )
            print( '    # documents: %d'%len(doc_dicts) )


        extracted_cases.append( (case_dict, doc_dicts, []) )
        #break  #  during debug: stop after first case on page

    cur_page += 1
    #break  #  during debug: stop after first page
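
The pagination pattern used above (start with an unknown last page, and let each fetched page tell us what it is) can be shown without the HTML parsing around it. This is a minimal sketch; `collect_all_pages`, `fetch_page`, and the fake site data are hypothetical names made up for this illustration.

```python
def collect_all_pages(fetch_page):
    """fetch_page(page_index) -> (items, last_page_index).

    Both the callable and its return shape are assumptions of this sketch.
    """
    items = []
    last_page = None      # unknown until the first page has been fetched
    page = 0              # zero-based, like the pager_page URL parameter
    while last_page is None or page <= last_page:
        page_items, last_page = fetch_page(page)
        items.extend(page_items)
        page += 1
    return items

# A stand-in for the website: three pages of case names.
FAKE_SITE = [["case A", "case B"], ["case C"], ["case D", "case E"]]

def fake_fetch_page(page_index):
    return FAKE_SITE[page_index], len(FAKE_SITE) - 1

all_cases = collect_all_pages(fake_fetch_page)
print(all_cases)
```

Using `None` for "not known yet" avoids the large placeholder number, at the cost of one extra condition in the loop.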

In [7]:
# Each case is a structure like  (case_dict, doc_dicts, will_explain_below)  
# and, at this point in the notebook, is _incomplete_
pprint.pprint( random.choice(extracted_cases) )

({'case_detail_url': 'https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/hillside-new-media-malta-plc/',
  'date_range': [],
  'ecli': [],
  'money': [],
  'name': 'Hillside New Media Malta Plc'},
 [{'pdf_url': 'https://kansspelautoriteit.nl/publish/library/32/15402_01-304-549_beslissing_op_bezwaar.pdf',
   'status': 'In deze zaak is de beslissing op bezwaar genomen en kan beroep '
             'worden ingesteld bij de rechtbank.',
   'title': 'Beslissing op bezwaar Hillside'},
  {'pdf_url': 'https://kansspelautoriteit.nl/publish/library/32/15402_bac_advies_hillside.pdf',
   'title': 'Advies BAC\xa0Hillside'},
  {'pdf_url': 'https://kansspelautoriteit.nl/publish/library/32/15402_01-304-539_openbaarmakingsbesluit_woo_bob.pdf',
   'status': 'Tegen dit openbaarmakingsbesluit kan bezwaar worden gemaakt.',
   'title': 'Openbaarmakingsbesluit Hillside'},
  {'pdf_url': 'https://kansspelautoriteit.nl/publish/library/32/15402_01-285-941_besluit-signed_openbare_versie.pdf',
   'tit

### Fetch the PDFs, OCR them

What we have is 
- the basic fetched metadata 
- and links to the PDF.  

We want the PDF documents, and their contents.

So in the below, we do three tasks: 
- fetch the PDFs
- render each page to an image, to be able to OCR it
- OCR those images

(and later a fourth, 'take the raw OCR output and try to do smart things')

<span style="opacity:0.6">
We also do much of it at once, entangled somewhat.
We _could_ separate the fetch, render, and OCR parts more,
which is good practice for modular, maintainable, and adaptable code,
yet doing more at once means we don't have to use quite as many temporary variables,
and this notebook is single-purpose anyway - it just needs to get the job done.
</span>

This produces OCR results in a rawer form, namely a list of (bounding box, text, certainty) fragments per page, and does not use those contents yet.
It stores this data in the same data structure (the third tuple item, initialized as []); the section below is what actually uses it.
That is perhaps somewhat confusing.

Doing the OCR for all of a hundred documents takes a few hours, which is why we also cache the raw OCR results.

That means that if you run this entire notebook every month or two, 
it should only take ten minutes or so, because it's only fetching and OCRing documents that are new.
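
The per-page caching that makes re-runs fast comes down to a stable key per (document, page) and a JSON round-trip. Below is a minimal sketch, assuming a plain `dict` as the store and a caller-supplied `run_ocr` function; both are stand-ins for illustration and not the wetsuite API.

```python
import json

def ocr_page_cached(store: dict, pdf_url: str, page_index: int, page_image, run_ocr):
    """Return OCR fragments for one page, calling run_ocr(page_image) only on a cache miss."""
    key = "ocr::%s::%s" % (page_index, pdf_url)   # should be unique and stable
    if key in store:
        return json.loads(store[key])
    fragments = run_ocr(page_image)
    store[key] = json.dumps(fragments).encode("utf8")
    return fragments

# A stand-in OCR function that records how often it was called.
ocr_calls = []
def fake_ocr(image):
    ocr_calls.append(image)
    return [[[[0, 0], [10, 0], [10, 5], [0, 5]], "tekst", 0.9]]

cache = {}
a = ocr_page_cached(cache, "https://example.com/a.pdf", 0, "page image 0", fake_ocr)
b = ocr_page_cached(cache, "https://example.com/a.pdf", 0, "page image 0", fake_ocr)
print(a == b, len(ocr_calls))
```

Note that JSON has no tuple type, so anything read back from the cache consists of lists; code that consumes the fragments should not depend on the difference.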

In [None]:
print( "FETCHing PDFs,  OCRing pages" )

for case_i, case_tuple in wetsuite.helpers.notebook.ProgressBar( enumerate(extracted_cases), description="cases..." ):
    case_dict, case_doc_dicts, _ = case_tuple
    case_name = case_dict['name']

    print()
    #print( "ENUM: %s"%case_i)
    print( "NAME: %s"%case_name)
    #print( "DICTS: %s"%pprint.pformat(case_doc_dicts))
    for case_doc_dict in case_doc_dicts:   # for each PDF document in the case
        pdf_url = case_doc_dict['pdf_url']

        print( "== %s =="%wetsuite.helpers.format.url_basename( pdf_url ) )
        pdfbytes, _ = wetsuite.helpers.localdata.cached_fetch( pdfstore, pdf_url )
        doc_page_fragments = [] # list of lists:   [   [page1fragment1,page1fragment2], [page2fragment1,page2fragment2], etc.  ]   

        ## Render PDF pages as images
        page_images = list( wetsuite.extras.pdf.pages_as_images(pdfbytes, dpi=200) )  # TODO: cache these too
        # high DPI and antialiasing does a _little_ better on things like periods and colons, but less than you'd think.

        ## For each page image, get OCR. This is cached because this is sloooow, and PDFs are unlikely to change
        for page_i, page_image in enumerate(page_images):
            page_key = 'ocr::%s::%s'%(page_i, pdf_url) # should be unique and stable

            if page_key in ocrstore:
                #print('     OCR - CACHED  for page %d of %d'%(page_i+1, len(page_images))) # uncomment to convince yourself cached results are working
                page_ocr_results = json.loads( ocrstore.get(page_key) )
            else: # generate and cache
                print('     OCR - DOING  for page %d of %d'%(page_i+1, len(page_images)))
                page_ocr_results = wetsuite.extras.ocr.easyocr( page_image, use_gpu=True ) # CONSIDER: prefer but don't require gpu?
                ocrstore.put( page_key, json.dumps(page_ocr_results).encode('utf8') )

            # optional debug: Draw OCR results on the page it came from, and save as PNG, for some basic inspection:
            #eval_image = wetsuite.extras.ocr.easyocr_draw_eval( page_image, page_ocr_results )
            #eval_image.save('%s__%s__page_%03d-boxes.png'%(case_name,  hash(pdfbytes), page_i+1))

            # (the following is more lines than it needs to be, because it kept open the option of merging PDF text and OCR results into the same sort of structure)
            page_fragments = []   # fragments of text in a page,  which in the case of EasyOCR will typically be lines at a time
            for bbox, text, cert in page_ocr_results:
                page_fragments.append( (bbox, text, cert) ) 
            doc_page_fragments.append( page_fragments ) # yes, this is currently just page_ocr_results, the idea was that the above might augment/simplify that

        extracted_cases[case_i][2].append( doc_page_fragments )

In [None]:
# A case now looks like:
extracted_cases[0]

# (cut from this notebook output because it's fairly verbose)

### Analyse raw OCR output into structured text

Let's take those OCR fragments and do some slightly clever things:
separate off headers, group into paragraphs and such.

A lot of this isn't necessary to produce the text,
**but** with a bit of tweaking and creative analysis _specific to this document layout_ -- 
they all seem to be based on the same template -- we can give cleaner output. 

<!-- -->

This code is long because a bunch of specific augmentation goes on here.
It's separate so that we can tweak and re-run it easily.

Yet it's mostly simple math on numbers, so should take less than a minute.
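
As an illustration of that "simple math", the following is a simplified sketch of the core idea: drop fragments above a header line or below a footer line, then start a new paragraph whenever the vertical gap to the previous fragment is large relative to the median box height. The function names and the `gap_factor` default are choices made for this example; the actual cell below has more rules (font-size changes, list numbers) than this.

```python
from statistics import median

def bbox_top(bbox):
    return min(y for _x, y in bbox)

def bbox_bottom(bbox):
    return max(y for _x, y in bbox)

def group_paragraphs(fragments, head_bottom_y=None, foot_top_y=None, gap_factor=0.6):
    """fragments: list of (bbox, text, certainty) in reading order,
    where bbox is four (x, y) corners. Returns a list of paragraph strings."""
    if not fragments:
        return []
    typical_height = median(bbox_bottom(b) - bbox_top(b) for b, _t, _c in fragments)
    paragraphs, current, prev_bottom = [], [], None
    for bbox, text, _cert in fragments:
        top, bottom = bbox_top(bbox), bbox_bottom(bbox)
        if head_bottom_y is not None and top < head_bottom_y:
            continue                                  # header fragment
        if foot_top_y is not None and bottom > foot_top_y:
            continue                                  # footer fragment
        if prev_bottom is not None and top - prev_bottom > gap_factor * typical_height:
            paragraphs.append(" ".join(current))      # large gap: close the paragraph
            current = []
        current.append(text)
        prev_bottom = bottom
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs

def box(y0, y1):
    """Helper for the example: a full-width box between two y positions."""
    return [(0, y0), (100, y0), (100, y1), (0, y1)]

example_page = [
    (box(10, 30),   "Kansspelautoriteit", 0.99),
    (box(100, 120), "Eerste regel",       0.95),
    (box(124, 144), "tweede regel.",      0.95),
    (box(180, 200), "Nieuwe alinea.",     0.95),
    (box(900, 920), "Pagina 1 van 3",     0.90),
]
result = group_paragraphs(example_page, head_bottom_y=40, foot_top_y=890)
print(result)
```

Working relative to the median box height, rather than a fixed pixel distance, keeps the rule independent of the DPI the pages were rendered at.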

In [11]:
# At the end, our results go into this.
dataset_cases = []


verbose = False # when debugging this, say more about what we are doing


for case_i, (case_dict, case_doc_dicts, case_ocrdata) in enumerate(extracted_cases):  # for each case...

    case_name = case_dict['name']
    print( 'CASE %d: %s'%(case_i, case_name) )

    # we are about to add one dict per case document (with keys like 'url', 'pages')
    case_docs = [] 
    # also, we want an idea of the length of time this case has dragged on, 
    # so we collect the dates we see from case documents into the case in general
    case_doc_dates = set()    # (maybe counting can be a little more robust)

    # sanity check of our own code above - because the 'for each document' code is about to assume these are matching lists.
    if len(case_doc_dicts) != len(case_ocrdata):
        raise ValueError("len(case_doc_dicts) != len(case_ocrdata) - did you forget to run the OCR cell, or run it multiple times?")


    ## for each document in this case, analyze the OCR fragments and sort it into more directly useable data
    for doc_i in range(len(case_doc_dicts)): # (case_ocrdata should have the same length)
        pdf_dict = case_doc_dicts[doc_i]
        doc_ocr  = case_ocrdata[doc_i]

        pdf_url = pdf_dict['pdf_url']
        print( '  DOC %d:  %s'%(doc_i,   wetsuite.helpers.format.url_basename( pdf_url ) ) )

        doc_pages = [] # the main output for a document - a list of dicts that each detail a page
        doc_dates = [] # picks out dates from the headers

        doc_wide_extent = doc_extent( doc_ocr ) # the area outside there is no text at all, throughout all of the document's pages
        #print( 'doc_min_x, doc_max_x, doc_min_y, doc_max_y', doc_wide_extent )

        for page_i, page in enumerate( doc_ocr ): # page is now all text fragments on a page, a list of all (bbox, text, cert) 
            # we aim to split header, body, and footer.
            page_contents = {
                'head_fragments':[],
                'body_fragments':[],
                'misc_fragments':[],
                'foot_fragments':[],
                'body_text':[],
            }

            if verbose:
                print( '   PAGE %s ---------------------------------------------------------------------------'%(page_i) )

            ### Determine header/footer positions,  so that we can later have logic that extracts text that hopefully flows between pages
            head_y_ary, foot_y_ary = [], []
            # - Top margin defined by  
            #   lowest box extent of "Kansspelautoriteit" (not really necessary), "OPENBAAR", and "Ons kenmerk" plus one extra box's worth
            matches = page_fragment_filter( page, r'^Kansspelautoriteit$', q_max_y=0.25, extent=doc_wide_extent )
            for bbox, text, cert in matches:
                #print ('    [page %d] "Kansspelautoriteit" MATCH: %s %s %s'%(page_i, bbox, text, cert))
                head_y_ary.append( bbox_max_y(bbox) )
            matches = page_fragment_filter( page, r'^OPENBAAR$', q_max_y=0.3, q_max_x=0.35, extent=doc_wide_extent )
            for bbox, text, cert in matches:
                #print ('    [page %d] "OPENBAAR" MATCH: %s %s %s'%(page_i, bbox, text, cert))
                head_y_ary.append( bbox_max_y(bbox) )
            matches = page_fragment_filter( page, r'Ons kenmerk', q_max_y=0.25, extent=doc_wide_extent )
            for bbox, text, cert in matches:
                #print ('    [page %d] "Ons kenmerk" MATCH: %s %s %s'%(page_i, bbox, text, cert))
                head_y_ary.append( bbox_max_y(bbox) + 1.2*bbox_height(bbox) ) # we expect one more line of the same height below it (and a little more, for expected whitespace)
            
            # - Bottom margin defined by highest box y of "agina [0-9]+ van [0-9]+" in the bottom right
            matches = page_fragment_filter( page, r'Pagina', q_min_x=0.7, q_min_y=0.75, extent=doc_wide_extent ) # the rest, e.g. (\s*[0-9]+\s*van\s*[0-9]+)?, is optional because it's sometimes detected separately, or not at all
            for bbox, text, cert in matches: # 
                #print ('    [page %d] Pagina MATCH: %s %s %s'%(page_i, bbox, text, cert))
                foot_y_ary.append( bbox_min_y(bbox)  )
            if len(foot_y_ary)==0: # look harder for page - a lone number to the bottom right that matches roughly with the page number we think it is is probably also the page number
                pages_around = '|'.join( str(pag)  for pag in range(page_i-1, page_i+2) )
                matches = page_fragment_filter( page, r'^(%s)$'%pages_around, q_min_y=0.75, q_min_x=0.7, extent=doc_wide_extent )
                for bbox, text, cert in matches: # (\s*[0-9]+\s*van\s*[0-9]+)?
                    #print ('    [page %d] Bare pagina MATCH: %s %s %s'%(page_i, bbox, text, cert))
                    foot_y_ary.append( bbox_min_y(bbox)  )

            if len(head_y_ary)==0:
                head_bot_y = None
            else:
                head_bot_y = max(head_y_ary)

            if len(foot_y_ary)==0:
                foot_top_y = None
            else:
                foot_top_y = min(foot_y_ary)

            if verbose:
                print("    page head_bot_y: ", head_bot_y)
                print("    page foot_top_y: ", foot_top_y)


            ### Figure out some things about the page
            # - list-item X position:
            #   Most of these documents have numbering on their headers and paragraphs.
            #   The numbers in those are _not_ detected consistently by OCR
            #   (nor are they linguistic information), so we attempt to remove them.
            #   We like to be sure (to not remove such things from actual text), so 
            #   we only trust the detected position when a page has at least four such numbers.
            lnum_righty = []
            matches = page_fragment_filter( page, r'^[0-9.]+$', q_max_x=0.2, q_min_y=head_bot_y,q_max_y=foot_top_y, extent=doc_wide_extent ) 
            for bbox, text, cert in matches:
                if verbose:
                    print ('    [page %d] LI NUM MATCH: %s %s %s'%(page_i, bbox, text, cert))
                lnum_righty.append( bbox_max_x(bbox)  )            
            if len(lnum_righty) < 4: # not sure enough - be more conservative
                lnum_righty = doc_wide_extent[0] + 20 # TODO: avoid that constant
            else:
                lnum_righty = max(lnum_righty)

            # - the median (not average) text-box height lets us estimate what is normal text size and what is probably headers
            box_heights = []
            for bbox, text, cert in page:
                box_heights.append( bbox_height( bbox ) )
            median_boxheight = numpy.median(box_heights)



            ### Group and process fragments   (in passes, to make logic like 'is this the last thing in the body' easier)
            # - separate into head, body, foot
            # - polish the body, e.g. 
            #   replace '-' at the end of a fragment with '.'
            #   (doing this after the footer is separated off makes 'is this the last thing on the page' logic easier)

            ## Sort fragments into header, body, and footer, based on those "bottom Y of header" and "top Y of footer" we figured out earlier
            prev_topy, prev_boty = 0,0 # 'position of last box' can be useful for "is this same line" and "was there space indicating a new paragraph" sort of logic
            for frag_i, (bbox, text, cert) in enumerate(page):
                topleft, topright, botright, botleft = bbox
                topy, boty = topleft[1], botright[1]

                text = re.sub(r'[_-]\s*$','.', text) # a trailing '_' or '-' is sometimes a misread period. Could be more thorough, but this is a start

                if head_bot_y is not None and topy < head_bot_y :
                    page_contents['head_fragments'].append( (bbox, text, cert) )

                elif foot_top_y is not None and boty > foot_top_y: # CONSIDER: also remove numbers from left of boxes that start at the -- IF we think it's such a number.
                    page_contents['foot_fragments'].append( (bbox, text, cert) )

                elif wetsuite.helpers.strings.is_numeric(text)  and  topleft[0] < lnum_righty+5: # looks like a header number, we'd like to remove that
                    pass
                    #page_contents['misc_fragments'].append( (bbox, text, cert) )

                else: # anything else is probably useful body
                    #print( '      %12s %-12s  fs:%-5s  %s '%( topleft, botright, boxheight, text ) )
                    page_contents['body_fragments'].append( (bbox, text, cert) )

                    # look, in this body text, for things that look like money amounts
                    case_dict['money'].extend( find_eur_money(text, minimum=5001) )

                prev_topy, prev_boty = topy, boty

            ## Header stuff. Little smartness yet.
            head_text = [] 
            for body_frag_i, (bbox, text, cert) in enumerate(page_contents['head_fragments']):
                head_text.append(text)
                # date in a header is a good indication it is the _document_ date and not that of some event
                _, dts = wetsuite.helpers.date.find_dates_in_text(text) 
                for dt in dts:
                    if dt is not None:
                        case_doc_dates.add( dt )
                        doc_dates.append( wetsuite.helpers.date.format_date(dt) )
                # CONSIDER: getting out kenmerk

            ## Footer stuff. Nothing smart.
            foot_text = [] 
            for body_frag_i, (bbox, text, cert) in enumerate(page_contents['foot_fragments']):
                foot_text.append(text)

            ## Figure out body's paragraphs, separate where sensible
            body_text = [] 
            temp_par  = []
            # this is part of a "collect fragments into sentences and split paragraphs" logic, ignore if you wish
            prev_topy, prev_boty, prev_boxheight, prev_text = 0,0, 0, ''
            def flush_par():
                global temp_par, body_text # global works because this cell runs at module level; inside a function this would need nonlocal
                if len(temp_par)>0:
                    body_text.append( ' '.join(temp_par) )
                    temp_par=[] 

            for body_frag_i, (bbox, text, cert) in enumerate(page_contents['body_fragments']):
                topleft, topright, botright, botleft = bbox
                topy, boty = topleft[1], botright[1]
                boxheight = bbox_height(bbox) # is a good indicator of font size

                same_line            = (boty - prev_boty) < 0.6*boxheight
                current_line_shorter = len(text) < 0.5 * len(prev_text) 
                prev_line_shorter    = len(prev_text) < 0.5 * len(text)  # mirrors current_line_shorter: previous line less than half as long as this one

                #if topy-prev_boty > -5:
                #    print( "                                                   [ydist %d (%s->%s)]"%(topy-prev_boty, prev_boty, topy) )

                #if topy < prev_boty by roughly boxheight it's the same line

                if topy < prev_boty-200:
                    if verbose:
                        print( "LARGE DECREASE IN Y HUH?")
                        print( '      %12s %-12s  fs:%-5s  %s '%( topleft, botright, boxheight, text ) )
                    #continue
                    break
                    
                if prev_boty!=0  and  topy-prev_boty > median_boxheight: 
                    if verbose:
                        print( "                                                   [YSEP %d (%s->%s)]"%(topy-prev_boty, prev_boty, topy) )
                    flush_par()

                elif prev_boty!=0  and  topy-prev_boty > 0.6*median_boxheight: 
                    if verbose:
                        print( "                                                   [YSEP %d (%s->%s)]"%(topy-prev_boty, prev_boty, topy) )
                    flush_par()

                elif (boxheight > 1.25 * prev_boxheight)  and  current_line_shorter  and  not same_line:  # larger text, and shorter
                    # A large title of a new section is generally  caught by YSEP, actually
                    #print( 'size diff;   line diff?  botdiff is %d,  relative to 0.5*boxheight=%d'%( (boty-prev_boty),  0.5*boxheight ) )

                    if verbose:
                        print( "                                                   [LARGER_TEXT %s->%s]"%(prev_boxheight, boxheight) )
                    flush_par()

                elif boxheight < 0.8 * prev_boxheight  and  prev_line_shorter  and not same_line: # font smaller than the previous line, and the previous line was shorter
                    if verbose:
                        print( "                                                   [SMALLER_TEXT %s->%s]"%(prev_boxheight, boxheight) )
                    flush_par()

                temp_par.append( text )

                if verbose:
                    print( '      %12s %-12s  fs:%-5s  %s '%( topleft, botright, boxheight, text ) )

                prev_topy, prev_boty, prev_boxheight, prev_text = topy, boty,  boxheight, text

            flush_par()

            del page_contents['body_fragments']
            del page_contents['head_fragments']
            del page_contents['foot_fragments']
            del page_contents['misc_fragments']

            page_contents['head_text'] = head_text
            page_contents['body_text'] = body_text
            page_contents['foot_text'] = foot_text

            doc_pages.append( page_contents )

            #print( 'body_text' )
            #pprint.pprint( page_contents['body_text'] )
            #for par in page_contents['body_text']:
            #    for line in textwrap.wrap(par):
            #        print( '[%s]'%line )
            #    print()

        case_docs.append( 
            {
                'url':pdf_url, 
                'pages':doc_pages,
                'header_dates':doc_dates,
                'status':pdf_dict.get('status'),
            }
        )

        # summarize document
        #pprint.pprint( doc_contents )

        if 0:
            for page in doc_pages:
                for temp_par in page['body_text']:
                    for line in textwrap.wrap(temp_par):
                        print( '%s'%line )
                    print()


    date_range = ()
    if len(case_doc_dates)>0:
        date_range = (
            wetsuite.helpers.date.format_date( min(case_doc_dates) ), 
            wetsuite.helpers.date.format_date( max(case_doc_dates) )
        )

    dataset_cases.append( { 
                'name': case_name,
                'docs': case_docs,
          'date_range': date_range,

        # CONSIDER: maybe just start with case_dict so we don't have to manually do:
               'money': case_dict['money'],
                'ecli': case_dict['ecli'],
     'case_detail_url': case_dict['case_detail_url'],
    } )

CASE 0: Organisator illegale bingo
  DOC 0:  20241127_01-329-549-besluit_openbaarmaking_lod_illegale_bingo.pdf
CASE 1: Stichting het Pad der Natuurlijke Energie
  DOC 0:  20241022_01-327-052_lod_stichting_pne_besluit_tot_openbaarmaking.pdf
CASE 2: Alimaniere Sociedad de Responsabilidad
  DOC 0:  openbaarmakingsbesluit_boete_alimaniere.pdf
  DOC 1:  15575_01-328-681_besluit_openbaarmaking_ov_.pdf
CASE 3: Optdeck Service Limited
  DOC 0:  18052_01-330-231_openbaarmakingsbesluit_optdeck_woo_ov.pdf
CASE 4: Anonieme aanbieder
  DOC 0:  01-325-424_besluit_openbaarmaking_ov-gelakt.pdf
CASE 5: FBC B.V.
  DOC 0:  20241211_01-331-289_besluit_lod_fbc_b-v_ov_.pdf
  DOC 1:  20241211_01-331-290_besluit_openbaarmaking_fbc_b-v-.pdf
CASE 6: Vriendenloterij N.V.
  DOC 0:  bob_lod_vl_openbare_versie.pdf
  DOC 1:  besluit_openbaarmaking_bob_vl_openbare_versie.pdf
  DOC 2:  20230523_01-291-294_openbaar_besluit_last_onder_dwangsom_vl_spellen.pdf
  DOC 3:  20230523_01-291-296_openbare_versie_openbaarmakingsb
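
The vertical-gap rule in the cell above can be tried out in isolation. The following is a minimal sketch, not the notebook's actual code: it assumes each OCR line has already been reduced to a `(top_y, bottom_y, text)` tuple in reading order, it joins lines with a single space, and the names `split_paragraphs` and `gap_factor` are made up for this illustration. It leaves out the font-size and line-length checks.

```python
import statistics

def split_paragraphs(lines, gap_factor=0.6):
    """Group OCR lines into paragraphs based on vertical gaps.

    lines: list of (top_y, bottom_y, text) tuples, in reading order (assumed input shape).
    A new paragraph starts when the gap to the previous line's bottom
    exceeds gap_factor times the median line height.
    """
    if not lines:
        return []
    median_height = statistics.median(bottom - top for top, bottom, _ in lines)
    paragraphs, current = [], []
    prev_bottom = None
    for top, bottom, text in lines:
        if prev_bottom is not None and top - prev_bottom > gap_factor * median_height:
            # gap is large enough: close the paragraph collected so far
            paragraphs.append(' '.join(current))
            current = []
        current.append(text)
        prev_bottom = bottom
    if current:
        paragraphs.append(' '.join(current))
    return paragraphs

# Made-up coordinates and text, purely for illustration
example_lines = [
    (100, 120, 'Besluit'),
    (160, 180, 'De raad van bestuur van de'),
    (184, 204, 'Kansspelautoriteit heeft besloten.'),
]
paragraphs = split_paragraphs(example_lines)
```

Using the median height as the yardstick keeps the threshold independent of scan resolution, which is presumably why the cell above compares against `median_boxheight` rather than a fixed pixel count.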

In [12]:
print( len(dataset_cases) )

136


In [23]:
# a case now looks like:
pprint.pprint( random.choice( dataset_cases ) )

{'case_detail_url': 'https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/bingoal-nederland-0/',
 'date_range': ('2023-04-04', '2023-10-31'),
 'docs': [{'header_dates': ['2023-10-31', '2023-10-31'],
           'pages': [{'body_text': ['Besluit van de raad van bestuur van de '
                                    'Kansspelautoriteit op de bezwaren van '
                                    'Bingoal Nederland B.V. tegen het '
                                    'sanctiebesluit van 4 april 2023 tot het '
                                    'opleggen van een bestuurlijke boete van € '
                                    '400.000,. (kenmerk 15404/01.288.553) en '
                                    'het openbaarmakingsbesluit van 4 april '
                                    '2023 (kenmerk 15404/01.288.554) .',
                                    'Zaak: 15404 Kenmerk: 15404.001 01.300.880 '
                                    '15404.002 01.300.880',
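
Given the nesting visible in the output above (case, then docs, then pages, then `body_text`), walking a case to get at its text takes three loops. A minimal sketch, assuming only the keys shown above; the helper name `iter_body_paragraphs` and the example case are made up for illustration:

```python
def iter_body_paragraphs(case):
    """Yield (doc_url, page_number, paragraph) for every body paragraph in a case dict.

    Uses .get() with defaults so that a document or page missing a key is skipped
    rather than raising. Page numbers are counted from 1.
    """
    for doc in case.get('docs', []):
        for page_number, page in enumerate(doc.get('pages', []), start=1):
            for paragraph in page.get('body_text', []):
                yield doc.get('url'), page_number, paragraph

# Made-up case with the same shape as the real ones
example_case = {
    'name': 'Example case (made-up)',
    'docs': [
        {
            'url': 'https://example.com/a.pdf',
            'pages': [
                {'body_text': ['Besluit', 'Inleiding']},
                {'body_text': ['Eerste alinea.']},
            ],
        },
    ],
}
rows = list(iter_body_paragraphs(example_case))
for url, page_number, paragraph in rows:
    print(page_number, paragraph)
```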
                               

# Write out

Now write that augmented structure into something we can call a dataset.

In [24]:
kse = wetsuite.helpers.localdata.MsgpackKV('kansspelautoriteit-sancties-struc.db', key_type=str)


kse._put_meta('description_short', 'Metadata and plain text form of the set of PDFs you can find under https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/ ' )

kse._put_meta('description', '''This is a plaintext form of the set of documents you can find under https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/ as PDFs.

        Since almost half of those PDFs do not have a text stream, this data is entirely OCR'd,
        so expect some typical OCR errors.  The OCR quality seems fairly decent, and some effort was made to remove headers and footers,
        yet there are some leftovers  like _ instead of . and = instead of :


        The data is a fairly nested structure of python objects (or JSON, before it's parsed).
        - .data is a list of cases.

        - each case is a dict, with
            - 'name', 
            - 'docs' (a list),
            - 'case_detail_url' and 'ecli',
            - and some extracted information: 'money' (mentioned money amounts) and 'date_range' (the apparent date span of the case)

        - each document in that list is a dict, with keys like
            - 'url' - to the PDF it came from
            - 'status' - from the detail page (if we could find it - not 100%) 
            - extracted information like 'header_dates' (comes from PDF contents)
            - 'pages' (a list)

        - each page in that list is a dict, which has keys:
            - 'body_text' - a list, which contains text fragments that are _almost_ like paragraphs 
                except that text may continue between pages anyway - currently still up to you to detect - 
                plus the post-OCR processing isn't perfect.
            - 'foot_text' - generally just [ "Pagina 1 van 27" ]
            - 'head_text' - fragments like ["Kansspelautoriteit", "OPENBAAR"] but also the date and kenmerk lines


        (TODO: update this example)
        For example (body text edited for brevity), one case's dict, with one document:
            { # dict for a case
                'name': 'Toto Online B.V.',  # case's name
                'docs': [                    # list of PDF documents in this case
                    { # dict detailing first document in case
                        'url': 'https://kansspelautoriteit.nl/publish/library/32/01_278_071_15091_sanctiebesluit_toto_ov.pdf',
                        'pages': [
                            {  # first page's dict   (currently contains only body's text fragments; idea was to split off header contents)
                                'body_text':[  # first page's text fragments
                                    'Besluit van de raad van [more sentence]',
                                    'Zaak: 15091 Kenmerk: 15091 [more kenmerk]',
                                    'Besluit',
                                    'Inleiding',
                                    'De raad van bestuur van de Kansspelautoriteit [more paragraph]'
                                ]
                            },
                            { # second page's dict
                                'body_text': [   # second page text fragment
                                'heeft heeft ontvangen sinds hij daar [more paragraph]',
                                'De toezichthouders zijn naar aanleiding [more paragraph]'
                                ]
                            } 
                            # ...more pages
                        ], 
                    }, # end of document dict
                    # ...more documents
                ]
            }
        '''+wetsuite.datasets.generated_today_text())

for d in dataset_cases:
    kse.put( d.get('case_detail_url'), d)
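
The description above leaves it to the user to detect paragraphs that continue across a page break. One rough way to do that is sketched below. It is a heuristic under stated assumptions, not part of the dataset code: the first fragment of a page is glued to the last fragment of the previous page when that previous fragment does not end in sentence-final punctuation and the new fragment starts with a lowercase letter. The name `merge_across_pages` and the example pages are made up.

```python
import re

# Fragment ends with sentence-final punctuation, optionally followed by closing quotes/brackets
SENTENCE_END = re.compile(r'[.!?:;]["\')\]]*\s*$')

def merge_across_pages(pages):
    """Return one flat list of paragraphs for a document, joining page-spanning ones.

    pages: list of page dicts, each with a 'body_text' list of strings.
    """
    merged = []
    for page in pages:
        for index, paragraph in enumerate(page.get('body_text', [])):
            continues = (
                index == 0                      # only the first fragment on a page can continue
                and bool(merged)                # there must be a previous fragment
                and not SENTENCE_END.search(merged[-1])
                and paragraph[:1].islower()
            )
            if continues:
                merged[-1] = merged[-1] + ' ' + paragraph
            else:
                merged.append(paragraph)
    return merged

# Made-up pages, purely for illustration
example_pages = [
    {'body_text': ['Inleiding', 'De toezichthouder heeft vastgesteld dat de aanbieder']},
    {'body_text': ['geen vergunning had.', 'Besluit']},
]
merged_paragraphs = merge_across_pages(example_pages)
```

This will misfire when a page ends on a heading without punctuation and the next page happens to start in lowercase, and OCR leftovers such as `_` instead of `.` will also trigger joins, so treat the result as a starting point.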