<a href="https://colab.research.google.com/github/knobs-dials/wetsuite-dev/blob/main/notebooks/extras/extras_methods_ocr.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# (only) in colab, run this first to install wetsuite from (the most recent) source.   For your own setup, see wetsuite's install guidelines.
!pip3 install -U --no-cache-dir --quiet https://github.com/knobs-dials/wetsuite-dev/archive/refs/heads/main.zip

## Purpose of this notebook


## When you want to put more work into PDF

### More work into checking PDFs




In [25]:
import fitz

import wetsuite.helpers.net
import wetsuite.helpers.localdata
import wetsuite.extras.pdf

from importlib import reload
reload(wetsuite.helpers.net)

<module 'wetsuite.helpers.net' from '/usr/local/lib/python3.8/dist-packages/wetsuite/helpers/net.py'>

In [26]:
ff = wetsuite.helpers.localdata.LocalKV( 'frbr_fetched.db', str, bytes )

#example_pdf_bytes = wetsuite.helpers.net.download('https://www.tweedekamer.nl/downloads/document?id=2023D37764')
#example_pdf_bytes = ff.get('https://repository.overheid.nl/frbr/officielepublicaties/kst/20487-32/kst-20487-32-b2/1/pdf/kst-20487-32-b2.pdf')
#example_pdf_bytes = ff.get('https://repository.overheid.nl/frbr/officielepublicaties/kst/17050-284/kst-17050-284-b1/1/pdf/kst-17050-284-b1.pdf')
example_pdf_bytes = wetsuite.helpers.net.download('https://open.overheid.nl/documenten/dea3e2bb-4792-4e92-8cc3-61214e81a2fc/file')


In [28]:
#reload(wetsuite.extras.pdf)
with fitz.open( stream=example_pdf_bytes, filetype="pdf") as document:
    if wetsuite.extras.pdf.do_page_sizes_vary( document)[0]:
        print( "There is variation in page sizes")
    else:
        print( "No variation in page sizes")
    
    for pagenum, page in enumerate(document):
        print( "page %d sized %s with %d characters"%(
            pagenum,
            wetsuite.extras.pdf.closest_paper_size_name(page.cropbox)[0:2],
            len(page.get_text().strip()),
        ) )

There is variation in page sizes
page 0 sized ('Letter', 'portrait') with 2294 characters
page 1 sized ('Letter', 'portrait') with 3519 characters
page 2 sized ('Letter', 'portrait') with 214 characters
page 3 sized ('Letter', 'portrait') with 2112 characters
page 4 sized ('Letter', 'portrait') with 3640 characters
page 5 sized ('Letter', 'portrait') with 1542 characters
page 6 sized ('A4', 'landscape') with 600 characters
page 7 sized ('A4', 'landscape') with 1036 characters
page 8 sized ('A4', 'landscape') with 711 characters
page 9 sized ('A4', 'landscape') with 2087 characters
page 10 sized ('A4', 'landscape') with 1829 characters
page 11 sized ('A4', 'landscape') with 535 characters
page 12 sized ('A4', 'landscape') with 969 characters
page 13 sized ('A4', 'landscape') with 502 characters
page 14 sized ('A4', 'landscape') with 471 characters
page 15 sized ('A4', 'landscape') with 2119 characters
page 16 sized ('A4', 'landscape') with 200 characters
page 17 sized ('Letter', 'portra
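
To give some idea of what naming a page size involves, here is a minimal sketch that matches a page's width and height (in pt) against a small table of known sizes. It is not wetsuite's implementation: `KNOWN_SIZES_PT` and `guess_paper_size` are hypothetical names made up for this illustration, and only A4 and Letter are included.

```python
# Hypothetical illustration, not the wetsuite implementation.
# Sizes are (short side, long side) in pt, where 1 pt = 1/72 inch.
KNOWN_SIZES_PT = {
    "A4": (595.276, 841.89),    # 210 x 297 mm
    "Letter": (612.0, 792.0),   # 8.5 x 11 inch
}


def guess_paper_size(width_pt, height_pt):
    """Return (name, orientation, distance) for the closest known paper size."""
    orientation = "portrait" if height_pt >= width_pt else "landscape"
    short_side, long_side = sorted((width_pt, height_pt))

    def distance(name):
        known_short, known_long = KNOWN_SIZES_PT[name]
        return abs(short_side - known_short) + abs(long_side - known_long)

    name = min(KNOWN_SIZES_PT, key=distance)
    return name, orientation, distance(name)


print(guess_paper_size(612, 792)[0:2])
print(guess_paper_size(841.9, 595.3)[0:2])
```

The returned distance is there so that a caller can reject pages that are not close to any known size, rather than always getting some name back.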

### Putting more work into text extraction

The underlying library we're leveraging for PDFs, `pymupdf`,
is willing to give you differently-processed views on page data.
We can use this to our advantage.

<!-- -->

[As its documentation mentions](https://pymupdf.readthedocs.io/en/latest/app1.html), it views extraction in terms of:
- page - largely consists of blocks (that tend to be roughly paragraphs)
- block - consists of either lines and their characters, or an image or such
- line - consists of spans.
- span - adjacent characters with identical font properties (name, size, flags, color)
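
As a rough sketch of how that nesting can be walked, the function below goes over a structure shaped like the output of `page.get_text('dict')`. The helper name `iter_spans` and the hand-written `sample_page` are assumptions made for this illustration; the sample needs no PDF, and real output carries more keys (bbox, flags, color and such).

```python
def iter_spans(page_dict):
    """Yield (block_number, line_index, span) for every text span in a
    structure shaped like the output of page.get_text('dict')."""
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:   # type 0 is a text block; image blocks have no 'lines'
            continue
        for line_index, line in enumerate(block.get("lines", [])):
            for span in line.get("spans", []):
                yield block.get("number"), line_index, span


# A trimmed-down, hand-written sample with the same nesting as real output
sample_page = {
    "width": 595.276,
    "height": 841.89,
    "blocks": [
        {
            "number": 0,
            "type": 0,
            "lines": [
                {"spans": [{"size": 12.0, "font": "UniversLT-Bold",
                            "text": "Scheepvaartvergunning voor het overschrijden "}]},
                {"spans": [{"size": 12.0, "font": "UniversLT-Bold",
                            "text": "(riviervakken VI en VII)"}]},
            ],
        },
        {"number": 1, "type": 1},   # an image block, which has no 'lines' key
    ],
}

for block_number, line_index, span in iter_spans(sample_page):
    print(block_number, line_index, span["font"], span["size"], repr(span["text"]))
```

Because each span carries its own font name and size, this level is where you could start guessing at which text is a heading.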

Different extraction functions/parameters give you
- relatively raw data (down to characters),
- said blocks,
- or output that involved a little more creative decision-making to arrive at a document-like structure

<!-- -->

If you just care about the words it contains, almost any of those will do. 
The 'text' one just flattens into one string.

If on the other hand you want 
- natural reading order
- flowing text 
  - not broken up by 
    - headers and footers, 
    - page breaks,
  - split into paragraphs,
- tables and images,
- estimating which parts are headers and which parts of the text fall under which headers

...you may need to go down to a moderately raw form to control what it really does.
Try uncommenting each of the commented-out options in the code block below
to see what each gives you, and see [the documentation for that functionality](https://pymupdf.readthedocs.io/en/latest/textpage.html) for a little more explanation.
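
As an example of working from one of the rawer forms: the `words` option returns tuples of the form `(x0, y0, x1, y1, word, block_no, line_no, word_no)`. The sketch below regroups such tuples into lines of text. The function name `words_to_lines` is made up for this illustration, and `sample_words` is a hand-made sample with rounded coordinates, so no PDF is needed to try it.

```python
from collections import defaultdict


def words_to_lines(words):
    """Group word tuples into lines of text.

    Each tuple is assumed to look like
    (x0, y0, x1, y1, word, block_no, line_no, word_no).
    """
    grouped = defaultdict(list)
    for x0, y0, x1, y1, word, block_no, line_no, word_no in words:
        grouped[(block_no, line_no)].append((word_no, word))

    lines = []
    for key in sorted(grouped):              # order by block, then by line within the block
        ordered = sorted(grouped[key])       # order by word number within the line
        lines.append(" ".join(word for _, word in ordered))
    return lines


# Hand-made sample, deliberately out of order
sample_words = [
    (257.6, 109.4, 283.4, 123.7, "voor", 0, 0, 1),
    (119.1, 109.4, 254.3, 123.7, "Scheepvaartvergunning", 0, 0, 0),
    (286.7, 109.4, 305.4, 123.7, "het", 0, 0, 2),
    (119.1, 166.2, 200.0, 178.0, "Besluitnummer", 1, 0, 0),
]

print(words_to_lines(sample_words))
```

Note that this trusts the block and line numbering that the library assigned, which is not necessarily the natural reading order.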

In [3]:
import fitz
import bs4

from IPython.display import HTML

import wetsuite.helpers.net

In [4]:
example_pdf_bytes = wetsuite.helpers.net.download('https://zoek.officielebekendmakingen.nl/wsb-2022-9718.pdf') # simple, ~one page

In [6]:
import html

for option, flags in (
    # block level
    ('blocks',None),
    ('html',  fitz.TEXTFLAGS_HTML & ~fitz.TEXT_PRESERVE_IMAGES),
    ('dict',  None),
    #('json',  None),

    # more analysed
    ('text',  None),
    ('xhtml', fitz.TEXTFLAGS_XHTML & ~fitz.TEXT_PRESERVE_IMAGES),

    # more low level
    ('words',None),
    #('xml',None),

    #('rawdict',None),
    #('rawjson',None),
):
    with fitz.open( stream=example_pdf_bytes, filetype="pdf") as document:
        for page in document:
            display( HTML( f'<h1>{option}</h1>' ) )
            res = page.get_text( option=option, flags=flags )
            if isinstance(res, str):
                # string results: escape, and show as preformatted text
                display( HTML( '<pre>%s</pre>'% html.escape(res).replace('\n','<br/>') ) )
            else:
                # lists and dicts: show as python data
                display( res )

            break # only show the first page

[(119.05500030517578,
  109.42204284667969,
  556.0970458984375,
  157.27801513671875,
  'Scheepvaartvergunning voor het overschrijden van de op de Mark en Dintel \n(riviervakken VI en VII)) op traject vanaf de Mandersluis - Stampersgat (Suiker \nUnie) en retour ter plaatse toegestane afmetingen van een schip. \n',
  0,
  0),
 (119.05500030517578,
  166.2091827392578,
  555.284912109375,
  208.31101989746094,
  'Besluitnummer 560746 ingevolge de Scheepvaartverkeerswet bekend gemaakt op 23 augustus 2022 \nvoor het overschrijden van de op de Mark en Dintel (riviervakken VI en VII) ter plaatse toegestane afme- \ntingen, met het motorbeunschip "Klazina" met scheepsnummer 02300591 tussen 1 september 2022 en \n30 januari 2023. \n',
  1,
  0),
 (119.05500030517578,
  217.8852081298828,
  124.09235382080078,
  228.7300262451172,
  '  \n',
  2,
  0),
 (119.05500030517578,
  238.4037628173828,
  353.51007080078125,
  249.1670379638672,
  'Bezwaarmogelijkheden met betrekking tot het besluit \n',


{'width': 595.276,
 'height': 841.89,
 'blocks': [{'number': 0,
   'type': 0,
   'bbox': (119.05500030517578,
    109.42204284667969,
    556.0970458984375,
    157.27801513671875),
   'lines': [{'spans': [{'size': 12.0,
       'flags': 20,
       'font': 'UniversLT-Bold',
       'color': 0,
       'ascender': 0.9380000233650208,
       'descender': -0.25,
       'text': 'Scheepvaartvergunning voor het overschrijden van de op de Mark en Dintel ',
       'origin': (119.05500030517578, 120.67803955078125),
       'bbox': (119.05500030517578,
        109.42204284667969,
        551.04296875,
        123.67803955078125)}],
     'wmode': 0,
     'dir': (1.0, 0.0),
     'bbox': (119.05500030517578,
      109.42204284667969,
      551.04296875,
      123.67803955078125)},
    {'spans': [{'size': 12.0,
       'flags': 20,
       'font': 'UniversLT-Bold',
       'color': 0,
       'ascender': 0.9380000233650208,
       'descender': -0.25,
       'text': '(riviervakken VI en VII)) op traject van

[(119.05500030517578,
  109.42204284667969,
  254.25900268554688,
  123.67803955078125,
  'Scheepvaartvergunning',
  0,
  0,
  0),
 (257.5950012207031,
  109.42204284667969,
  283.3590087890625,
  123.67803955078125,
  'voor',
  0,
  0,
  1),
 (286.69500732421875,
  109.42204284667969,
  305.36700439453125,
  123.67803955078125,
  'het',
  0,
  0,
  2),
 (308.7030029296875,
  109.42204284667969,
  384.85498046875,
  123.67803955078125,
  'overschrijden',
  0,
  0,
  3),
 (388.1910095214844,
  109.42204284667969,
  408.3869934082031,
  123.67803955078125,
  'van',
  0,
  0,
  4),
 (411.7229919433594,
  109.42204284667969,
  425.72698974609375,
  123.67803955078125,
  'de',
  0,
  0,
  5),
 (429.06298828125,
  109.42204284667969,
  443.72698974609375,
  123.67803955078125,
  'op',
  0,
  0,
  6),
 (447.06298828125,
  109.42204284667969,
  461.0669860839844,
  123.67803955078125,
  'de',
  0,
  0,
  7),
 (464.40301513671875,
  109.42204284667969,
  493.26300048828125,
  123.67803955078125

Say we looked at that and liked `xhtml` for its apparently great analysis of paragraphs and its markup of headers and such,
but noticed that the header that is _visually_ at the top of the page ends up structurally _below_ the rest of the page's content: it seems this does _not_ care about natural reading order.

Okay, so then we looked at the `html` form, and noticed it is one step more raw and includes positions that we can sort by.
How hard could it be?

In [None]:
# Say we wanted to sort a page's elements from top to bottom, and skip likely headers and footers


def deal_with_page(pagenum, page):
    res = page.get_text( option='html', flags=fitz.TEXTFLAGS_HTML & ~fitz.TEXT_PRESERVE_IMAGES )

    #print( res )
    soup = bs4.BeautifulSoup(res, features='lxml')

    l = []
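    for thing in soup.select('div>*'): # looks for p; keep in mind that spans inside sometimes have e.g. colors
        #print('-',thing)

        # parse the inline CSS (e.g. "top:109.4pt;left:119.1pt") into a dict
        d = {'pagenum':pagenum}
        for kv in (thing.get('style') or '').split(';'):
            if ':' not in kv:   # skip empty or malformed declarations
                continue
            k, v = kv.split(':', 1)
            d[k.strip()] = v.strip()
        if 'top' not in d:      # without a position we cannot sort this element
            continue
        # remove styling so that the printed markup is easier to read
        thing.attrs.pop('style', None)
        for span in thing.find_all('span'):
            span.attrs.pop('style', None)
        l.append( (float(d['top'].rstrip('pt')), d, thing) )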

    l.sort( key=lambda item: item[0] ) # sort by top only; comparing whole tuples would fail on the dicts whenever two tops are equal
    for top, style, thing in l:

        # Ignore everything in the top inch and the bottom inch (these are pt, which are by definition 1/72 inch).
        # Note that this is a somewhat dangerous assumption, and should involve more checks.
        if top < 72  or  top > page.cropbox.y1-72:
            continue
        # For some idea, PDF allows about five different box sizes, but most of them are for print production,
        # and we mostly care about mediabox and cropbox, which are _usually_ the same; if there is a difference we probably care about cropbox.
        # See also: https://pymupdf.readthedocs.io/en/latest/glossary.html#MediaBox

        print( style, thing )

    # natural reading order is a little more complex - the list numbering actually appears _after_,
    # and _slightly_ below, the text to the right of it.
    # You can't just sort by y, then x position,  because we are now in the land of typesetting.
    #   y position can be offset a little from other things that are, to us, on the same line.
    # If it's less than some fraction of line-height away it's probably the same line.


with fitz.open( stream=example_pdf_bytes, filetype="pdf") as document:
    for pagenum, page in enumerate(document):
        deal_with_page(pagenum, page)
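
The "same line if within some fraction of line-height" idea from the comments above could be sketched roughly as follows. This is a minimal illustration rather than a tested layout algorithm: the function name `group_into_lines`, the `(top, left, line_height, text)` item shape, the default fraction of 0.5, and the sample values are all assumptions made for this example.

```python
def group_into_lines(items, tolerance_fraction=0.5):
    """Group positioned text items into visual lines.

    items: iterable of (top, left, line_height, text), positions in pt.
    Items whose top differs from the line's first item by less than
    tolerance_fraction * line_height are treated as being on the same line.
    Returns a list of lines, each a list of texts in left-to-right order.
    """
    lines = []   # each entry: [reference_top, [(left, text), ...]]
    for top, left, line_height, text in sorted(items):
        if lines and abs(top - lines[-1][0]) < tolerance_fraction * line_height:
            lines[-1][1].append((left, text))
        else:
            lines.append([top, [(left, text)]])
    return [[text for _, text in sorted(members)] for _, members in lines]


# Invented sample: the list number sits slightly below the text to its right
sample_items = [
    (300.0, 140.0, 12.0, "text of the first list item"),
    (301.2, 119.0, 12.0, "1."),
    (314.0, 140.0, 12.0, "which continues on the next line"),
]

for line in group_into_lines(sample_items):
    print(" ".join(line))
```

Within a line, items are ordered by their left position, so the list number ends up in front of its text even though it was positioned slightly lower.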

In [None]:
import wetsuite.helpers.shellcolor as sc

test_urls = [
    'https://repository.overheid.nl/frbr/officielepublicaties/kst/31700-VIII/kst-31700-VIII-77-b1/1/pdf/kst-31700-VIII-77-b1.pdf', # skewed, low-quality image of text that OCR makes a bunch of mistakes on

    'https://kansspelautoriteit.nl/publish/library/32/last_onder_dwangsom_slots_dev.pdf', # 1 page of text, the rest is images-of-text
    #'https://kansspelautoriteit.nl/publish/pages/5492/00_082_720_openbare_versie_last_onder_bestuursdwang.pdf', # 5 pages of images-of-text
    #'https://kansspelautoriteit.nl/publish/pages/5491/sanctiebesluit_wedwinkel.pdf', # 25 pages of images-of-text

    #'https://zoek.officielebekendmakingen.nl/trb-2022-72.pdf',
    #'https://zoek.officielebekendmakingen.nl/stb-2022-1.pdf',
    #'https://zoek.officielebekendmakingen.nl/stb-2000-5.pdf',
    #'https://zoek.officielebekendmakingen.nl/gmb-2022-385341.pdf',
    #'https://zoek.officielebekendmakingen.nl/stcrt-2019-42172.pdf',
    #'https://zoek.officielebekendmakingen.nl/prb-2022-10190.pdf',
    #'https://zoek.officielebekendmakingen.nl/wsb-2022-9718.pdf',
]