<a href="https://colab.research.google.com/github/knobs-dials/wetsuite-dev/blob/main/notebooks/extras/extras_methods_ocr.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# (only) in colab, run this first to install wetsuite from (the most recent) source.   For your own setup, see wetsuite's install guidelines.
!pip3 install -U --no-cache-dir --quiet https://github.com/knobs-dials/wetsuite-dev/archive/refs/heads/main.zip

## Purpose of this notebook


## When you want to put more work into PDF

### More work into checking PDFs




In [7]:
import fitz
import bs4
from IPython.display import HTML

import wetsuite.helpers.net
import wetsuite.helpers.localdata
import wetsuite.extras.pdf

In [2]:
ff = wetsuite.helpers.localdata.LocalKV( 'frbr_fetched.db', str, bytes )

#example_pdf_bytes = wetsuite.helpers.net.download('https://www.tweedekamer.nl/downloads/document?id=2023D37764')
#example_pdf_bytes = ff.get('https://repository.overheid.nl/frbr/officielepublicaties/kst/20487-32/kst-20487-32-b2/1/pdf/kst-20487-32-b2.pdf')
#example_pdf_bytes = ff.get('https://repository.overheid.nl/frbr/officielepublicaties/kst/17050-284/kst-17050-284-b1/1/pdf/kst-17050-284-b1.pdf')

In [5]:
# You can sometimes make a good guess at whether a PDF was concatenated from separate pieces
# by looking for variation between pages, e.g. in page size and orientation:
example_pdf_bytes = wetsuite.helpers.net.download('https://open.overheid.nl/documenten/dea3e2bb-4792-4e92-8cc3-61214e81a2fc/file')

with fitz.open( stream=example_pdf_bytes, filetype="pdf") as document:
    if wetsuite.extras.pdf.do_page_sizes_vary( document)[0]:
        print( "There IS variation in page sizes")
    else:
        print( "NO variation in page sizes")
    
    for pagenum, page in enumerate(document):
        print( "page %d sized %s with %d characters"%(
            pagenum,
            wetsuite.extras.pdf.closest_paper_size_name(page.cropbox)[0:2],
            len(page.get_text().strip()),
        ) )

There IS variation in page sizes
page 0 sized ('Letter', 'portrait') with 2294 characters
page 1 sized ('Letter', 'portrait') with 3519 characters
page 2 sized ('Letter', 'portrait') with 214 characters
page 3 sized ('Letter', 'portrait') with 2112 characters
page 4 sized ('Letter', 'portrait') with 3640 characters
page 5 sized ('Letter', 'portrait') with 1542 characters
page 6 sized ('A4', 'landscape') with 600 characters
page 7 sized ('A4', 'landscape') with 1036 characters
page 8 sized ('A4', 'landscape') with 711 characters
page 9 sized ('A4', 'landscape') with 2087 characters
page 10 sized ('A4', 'landscape') with 1829 characters
page 11 sized ('A4', 'landscape') with 535 characters
page 12 sized ('A4', 'landscape') with 969 characters
page 13 sized ('A4', 'landscape') with 502 characters
page 14 sized ('A4', 'landscape') with 471 characters
page 15 sized ('A4', 'landscape') with 2119 characters
page 16 sized ('A4', 'landscape') with 200 characters
page 17 sized ('Letter', 'portra
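
To give an idea of what such a check involves, here is a minimal sketch that guesses a paper size from a page's width and height in points, and then groups consecutive pages of the same size into runs. This is not how `wetsuite.extras.pdf` implements it; the names `PAPER_SIZES_PT`, `guess_paper_size` and `runs_of_same_size` are made up for this illustration, only two paper sizes are known to it, and the page sizes at the bottom are invented to mimic the pattern in the output above.

```python
from itertools import groupby

# Nominal portrait sizes in points (1 pt = 1/72 inch). Only two sizes, to keep the sketch short.
PAPER_SIZES_PT = {
    "A4": (595.28, 841.89),      # 210 x 297 mm
    "Letter": (612.0, 792.0),    # 8.5 x 11 inch
}

def guess_paper_size(width_pt, height_pt):
    """Return (name, orientation) of the known size that is closest to the given page size."""
    orientation = "portrait" if height_pt >= width_pt else "landscape"
    short_side, long_side = sorted((width_pt, height_pt))

    def distance(name):
        w, h = PAPER_SIZES_PT[name]
        return abs(short_side - w) + abs(long_side - h)

    return min(PAPER_SIZES_PT, key=distance), orientation

def runs_of_same_size(page_sizes):
    """Group consecutive pages with the same guess into (first_page, last_page, guess) tuples."""
    guesses = [guess_paper_size(w, h) for (w, h) in page_sizes]
    runs, pagenum = [], 0
    for guess, group in groupby(guesses):
        count = len(list(group))
        runs.append((pagenum, pagenum + count - 1, guess))
        pagenum += count
    return runs

# Invented (width, height) pairs that mimic the pattern in the output above
sizes = [(612, 792)] * 6 + [(842, 595)] * 11 + [(612, 792)]
for first, last, guess in runs_of_same_size(sizes):
    print("pages %d-%d: %s" % (first, last, guess))
```

With a real document you could feed it `(page.cropbox.width, page.cropbox.height)` for each page. Each boundary between runs is then a candidate for "a different source document starts here", though that remains a guess.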

### Putting more work into text extraction

The underlying library we're leveraging for PDFs, `pymupdf`,
is willing to give you differently-processed views on page data.
We can use this to our advantage.

[As its documentation mentions](https://pymupdf.readthedocs.io/en/latest/app1.html), its view of extraction involves concepts like:
- _page_ - largely consists of blocks (which tend to be roughly paragraphs)
- _block_ - consists of either lines and their characters, or an image or such
- _line_ - consists of spans
- _span_ - adjacent characters with identical font properties (name, size, flags, color)
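
To make that nesting concrete, here is a minimal sketch that walks a structure shaped like what `page.get_text('dict')` returns (the cell outputs further down show a real one). The helper name `iter_spans` is made up for this illustration, and `sample_page` is hand-written with rounded, illustrative values rather than taken from a real extraction.

```python
def iter_spans(page_dict):
    """Yield (block_number, line_index, span) for each text span in a 'dict'-style page."""
    for block in page_dict["blocks"]:
        if block.get("type") != 0:   # type 0 is a text block; image blocks have no 'lines'
            continue
        for line_index, line in enumerate(block["lines"]):
            for span in line["spans"]:
                yield block["number"], line_index, span

# Hand-written sample in the shape of page.get_text('dict'); the values are illustrative
sample_page = {
    "width": 612.0,
    "height": 792.0,
    "blocks": [
        {"number": 0, "type": 0, "bbox": (6.5, 1.8, 36.0, 12.9),
         "lines": [{"spans": [{"size": 10.1, "flags": 0, "font": "ArialMT", "color": 0,
                               "text": "Docnrl", "bbox": (6.5, 1.8, 36.0, 12.9)}],
                    "wmode": 0, "dir": (1.0, 0.0), "bbox": (6.5, 1.8, 36.0, 12.9)}]},
        {"number": 1, "type": 1, "bbox": (0.0, 0.0, 10.0, 10.0)},   # an image block
    ],
}

for block_number, line_index, span in iter_spans(sample_page):
    print(block_number, line_index, span["font"], round(span["size"], 1), repr(span["text"]))
```

Because each span carries its own font name and size, this level is where you would start if you wanted to guess, say, which text is a heading.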


Different extraction functions/parameters give you
- relatively raw data (down to characters),
- said blocks,
- or output where it did a little more creative decision-making to arrive at a document-like structure

If you just care about the words it contains, most of those will do,
and 'text' (the default) is simplest because it just flattens everything into one string.

If on the other hand you want 
- natural reading order
- flowing text 
  - not broken up by 
    - headers and footers, 
    - page breaks,
  - split into paragraphs,
- tables and images,
- estimating which parts are headers and which parts of the text fall under which headers

...then you may need to go down to a moderately raw form to control what it really does.
Try commenting and uncommenting the options in the `get_text` loop further below
to see what each gives you, and see [the documentation for that functionality](https://pymupdf.readthedocs.io/en/latest/textpage.html) for a little more explanation.
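
As one illustration of working from a rawer form: the `words` option gives tuples of `(x0, y0, x1, y1, word, block_no, line_no, word_no)`, and the sketch below stitches those back into one string per line. The function name `words_to_lines` is made up for this illustration, and the sample tuples are copied (with rounded coordinates) from the `words` output shown further down. Note that this follows the block and line numbering that the extraction assigned, which is not necessarily natural reading order.

```python
from collections import defaultdict

def words_to_lines(words):
    """Join 'words' tuples (x0, y0, x1, y1, word, block_no, line_no, word_no) into one string per line."""
    grouped = defaultdict(list)
    for x0, y0, x1, y1, word, block_no, line_no, word_no in words:
        grouped[(block_no, line_no)].append((word_no, word))
    return [
        " ".join(word for _, word in sorted(grouped[key]))   # words in word_no order
        for key in sorted(grouped)                           # lines in (block_no, line_no) order
    ]

# Sample in the shape of page.get_text('words'), coordinates rounded
sample_words = [
    (6.5,   1.8,  36.0, 12.9, 'Docnrl',   0, 0, 0),
    (129.6, 34.8, 150.8, 46.0, 'Mon,',    1, 0, 0),
    (155.1, 34.8, 165.1, 46.0, '22',      1, 0, 1),
    (168.0, 34.9, 181.9, 46.0, 'Jan',     1, 0, 2),
    (185.8, 34.8, 207.1, 46.0, '2024',    1, 0, 3),
    (210.7, 34.8, 249.2, 46.0, '14:39:28', 1, 0, 4),
    (252.0, 34.9, 279.0, 46.0, '+0100',   1, 0, 5),
    (179.5, 49.2, 191.7, 60.4, '"<',      2, 0, 0),
    (135.4, 50.7, 150.5, 56.3, '5.1.2.e', 2, 1, 0),
]

for line in words_to_lines(sample_words):
    print(line)
```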

In [4]:
example_pdf_bytes = wetsuite.helpers.net.download('https://zoek.officielebekendmakingen.nl/wsb-2022-9718.pdf') # simple, ~one page

In [9]:
import html
from IPython.display import display

options = (
    # block level
    ('blocks', None),
    ('html',   fitz.TEXTFLAGS_HTML & ~fitz.TEXT_PRESERVE_IMAGES),
    ('dict',   None),
    #('json',   None),

    # more analysed
    ('text',   None),
    ('xhtml',  fitz.TEXTFLAGS_XHTML & ~fitz.TEXT_PRESERVE_IMAGES),

    # more low level
    ('words',  None),
    #('xml',    None),

    #('rawdict', None),
    #('rawjson', None),
)

with fitz.open( stream=example_pdf_bytes, filetype="pdf") as document:
    page = document[0] # only look at the first page
    for option, flags in options:
        display( HTML( f'<h1>{option}</h1>' ) )
        res = page.get_text( option=option, flags=flags )
        if isinstance(res, str): # string output: escape it, so that we see e.g. the HTML source rather than a rendering of it
            display( HTML( '<pre>%s</pre>'% html.escape(res).replace('\n','<br/>') ) )
        else:                    # python structures (blocks, dict, words): show as-is
            display( res )

[(6.5,
  1.7655305862426758,
  36.030765533447266,
  12.914850234985352,
  'Docnrl\n',
  0,
  0),
 (129.60000610351562,
  34.82000732421875,
  278.9750061035156,
  46.025550842285156,
  'Mon, 22 Jan 2024 14:39:28 +0100\n',
  1,
  0),
 (135.35000610351562,
  49.165199279785156,
  407.0798034667969,
  60.37664794921875,
  '"<\n5.1.2.e\n5.1.2.e\n5.1.2.e\n',
  2,
  0),
 (129.85000610351562,
  75.16998291015625,
  348.6018981933594,
  86.37552642822266,
  'FW: Verzoek op grond van de Wet open overheid\n',
  3,
  0),
 (494.6499938964844,
  138.32620239257812,
  525.7505493164062,
  149.46234130859375,
  '(eerste\n',
  4,
  0),
 (148.10000610351562,
  149.31582641601562,
  386.7203674316406,
  160.5265350341797,
  '(tweede aanspreekpunt). Graag uitzetten bij de NLA.\n',
  5,
  0),
 (30.950000762939453,
  171.17965698242188,
  528.666015625,
  182.37326049804688,
  'Ik schat in dat het geen complex Woo verzoek is, maar het is wel omvangrijk.\n5.1.2.e\n',
  6,
  0),
 (20.899999618530273,
  193.

{'width': 612.0,
 'height': 792.0,
 'blocks': [{'number': 0,
   'type': 0,
   'bbox': (6.5, 1.7655305862426758, 36.030765533447266, 12.914850234985352),
   'lines': [{'spans': [{'size': 10.099783897399902,
       'flags': 0,
       'font': 'ArialMT',
       'color': 0,
       'ascender': 0.9052734375,
       'descender': -0.2119140625,
       'text': 'Docnrl',
       'origin': (6.5, 10.79998779296875),
       'bbox': (6.5,
        1.7655305862426758,
        36.030765533447266,
        12.914850234985352)}],
     'wmode': 0,
     'dir': (1.0, 0.0),
     'bbox': (6.5,
      1.7655305862426758,
      36.030765533447266,
      12.914850234985352)}]},
  {'number': 1,
   'type': 0,
   'bbox': (129.60000610351562,
    34.82000732421875,
    278.9750061035156,
    46.025550842285156),
   'lines': [{'spans': [{'size': 9.76195240020752,
       'flags': 0,
       'font': 'ArialMT',
       'color': 0,
       'ascender': 0.9052734375,
       'descender': -0.2119140625,
       'text': 'Mon,',
     

[(6.5,
  1.7655305862426758,
  36.030765533447266,
  12.914850234985352,
  'Docnrl',
  0,
  0,
  0),
 (129.60000610351562,
  34.84366989135742,
  150.77418518066406,
  46.02001190185547,
  'Mon,',
  1,
  0,
  0),
 (155.0500030517578,
  34.84229278564453,
  165.1241912841797,
  46.02033615112305,
  '22',
  1,
  0,
  1),
 (168.0,
  34.85087585449219,
  181.91653442382812,
  46.01832580566406,
  'Jan',
  1,
  0,
  2),
 (185.75,
  34.841678619384766,
  207.1478271484375,
  46.020477294921875,
  '2024',
  1,
  0,
  3),
 (210.6999969482422,
  34.84751510620117,
  249.22552490234375,
  46.01911163330078,
  '14:39:28',
  1,
  0,
  4),
 (252.0,
  34.85091018676758,
  278.9750061035156,
  46.01831817626953,
  '+0100',
  1,
  0,
  5),
 (179.5,
  49.165199279785156,
  191.6602325439453,
  60.37664794921875,
  '"<',
  2,
  0,
  0),
 (135.35000610351562,
  50.67775344848633,
  150.5413818359375,
  56.25862121582031,
  '5.1.2.e',
  2,
  1,
  0),
 (192.25,
  50.670284271240234,
  207.37979125976562,
 

Say we looked at that and liked `xhtml` for its apparently useful analysis of paragraphs and its markup of headers and such,
but noticed that the header that is _visually_ at the top of the page sits structurally _below_ the rest of the page's text. It seems this output does _not_ care about natural reading order.

Okay, then we looked at the `html` form and noticed it is one step more raw, and comes with positions that we can sort by.
How hard could it be?

In [13]:
# Say we wanted to put the paragraphs of the 'html' output into natural reading order.

# One complication: things we consider to be on the same line may easily be offset by 0.5pt in y (maybe up to half the text height),
#  so when we sort, we should also try to group by "probably on the same line"
def line_sort():
    pass


def deal_with_page(pagenum, page):
    res = page.get_text( option='html', flags=fitz.TEXTFLAGS_HTML & ~fitz.TEXT_PRESERVE_IMAGES )

    #print( res )
    soup = bs4.BeautifulSoup(res, features='lxml')

    positioned = []
    for thing in soup.select('div>*'): # looks for p; keep in mind that spans inside sometimes have e.g. colors
        #print('-',thing)

        # parse the inline style (e.g. "top:76.5pt;left:20.1pt;line-height:10.0pt") into a dict
        d = {'pagenum':pagenum}
        for kv in thing.attrs.pop('style', '').split(';'):
            if ':' not in kv: # e.g. the empty string after a trailing semicolon
                continue
            k, v = kv.split(':', 1)
            d[k.strip()] = v.strip()
        if 'top' not in d: # without a position we cannot sort it
            continue
        # the styles on the spans inside are just noise for our purposes
        for span in thing.find_all('span'):
            span.attrs.pop('style', None)
        positioned.append( (float(d['top'].rstrip('pt')), d, thing) )

    positioned.sort( key=lambda tup:tup[0] ) # sort by top
    for top, d, thing in positioned:

        # ignore everything in the top inch and bottom inch (these are pt, which are by definition 1/72 inch).
        # Note that this is a somewhat dangerous assumption, and should involve more checks.
        if top < 72  or  top > page.cropbox.y1-72:
            continue
        # For context: PDF allows about five different page boxes, but most of them are for print production,
        # and we mostly care about mediabox and cropbox, which are _usually_ the same, but if there is a difference we probably care about cropbox.
        # See also: https://pymupdf.readthedocs.io/en/latest/glossary.html#MediaBox

        print( d, thing )

    # natural reading order is a little more complex - the list numbering actually appears _after_,
    # and _slightly_ below, the text to the right of it.
    # You can't just sort by y, then x position,  because we are now in the land of typesetting.
    #   y position can be offset a little from other things that are, to us, on the same line.
    # If it's less than some fraction of line-height away it's probably the same line.


with fitz.open( stream=example_pdf_bytes, filetype="pdf") as document:
    for pagenum, page in enumerate(document):
        deal_with_page(pagenum, page)

        

{'pagenum': 0, 'top': '76.5pt', 'left': '20.1pt', 'line-height': '10.0pt'} <p><span>Subject:</span></p>
{'pagenum': 0, 'top': '76.7pt', 'left': '129.9pt', 'line-height': '9.4pt'} <p><span>FW:</span><span> </span><span>Verzoek</span><span> </span><span>op</span><span> </span><span>grond</span><span> </span><span>van</span><span> </span><span>de</span><span> </span><span>Wet</span><span> </span><span>open</span><span> </span><span>overheid</span></p>
{'pagenum': 0, 'top': '118.4pt', 'left': '20.9pt', 'line-height': '7.3pt'} <p><span>HOi</span><span> </span><span>5J2e</span></p>
{'pagenum': 0, 'top': '139.4pt', 'left': '494.6pt', 'line-height': '10.0pt'} <p><span>(eerste</span></p>
{'pagenum': 0, 'top': '139.5pt', 'left': '20.4pt', 'line-height': '9.8pt'} <p><span>Zou</span><span> </span><span>je</span><span> </span><span>ook</span><span> </span><span>dit</span><span> </span><span>Woo-verzoek</span><span> </span><span>willen</span><span> </span><span>registreren</span><span> </span><span>
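
The output above shows the problem: `(eerste` (top 139.4pt, far to the right) is listed before the `Zou je ook` paragraph (top 139.5pt, at the left margin), even though to a reader they sit on the same line. Below is a minimal sketch of the "group by probably-same-line, then sort left to right" idea from the comments. The names `pt` and `sort_into_lines` are made up for this illustration, the items are `(position_dict, text)` pairs simplified from the output above, and the choice of half a line-height as the threshold is an assumption you would want to tune.

```python
def pt(value):
    """Turn a CSS-style length such as '76.5pt' into a float."""
    value = value.strip()
    return float(value[:-2] if value.endswith("pt") else value)

def sort_into_lines(items, fraction=0.5):
    """Sort (position_dict, payload) pairs top to bottom, and left to right within a line.

    An item joins the current line if its top is within `fraction` of that
    line's first item's line-height of that first item's top.
    """
    ordered = sorted(items, key=lambda item: pt(item[0]["top"]))
    lines = []
    for item in ordered:
        if lines:
            first = lines[-1][0][0]   # position dict of the first item on the current line
            if pt(item[0]["top"]) - pt(first["top"]) <= fraction * pt(first["line-height"]):
                lines[-1].append(item)
                continue
        lines.append([item])
    result = []
    for line in lines:
        result.extend(sorted(line, key=lambda item: pt(item[0]["left"])))
    return result

# Positions as in the output above; the last text is cut off where that output is cut off
items = [
    ({'top': '76.5pt',  'left': '20.1pt',  'line-height': '10.0pt'}, 'Subject:'),
    ({'top': '76.7pt',  'left': '129.9pt', 'line-height': '9.4pt'},  'FW: Verzoek op grond van de Wet open overheid'),
    ({'top': '118.4pt', 'left': '20.9pt',  'line-height': '7.3pt'},  'HOi 5J2e'),
    ({'top': '139.4pt', 'left': '494.6pt', 'line-height': '10.0pt'}, '(eerste'),
    ({'top': '139.5pt', 'left': '20.4pt',  'line-height': '9.8pt'},  'Zou je ook dit Woo-verzoek willen registreren'),
]

for position, text in sort_into_lines(items):
    print(position['top'], position['left'], text)
```

This handles small vertical offsets, but not multiple columns, rotated text, or tables, which would each need their own logic.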

In [None]:

test_urls = [
    'https://repository.overheid.nl/frbr/officielepublicaties/kst/31700-VIII/kst-31700-VIII-77-b1/1/pdf/kst-31700-VIII-77-b1.pdf', # skewed, not-the-best-quality image of text, which OCR makes a bunch of mistakes on

    'https://kansspelautoriteit.nl/publish/library/32/last_onder_dwangsom_slots_dev.pdf', # 1 page of text, the rest is images-of-text
    #'https://kansspelautoriteit.nl/publish/pages/5492/00_082_720_openbare_versie_last_onder_bestuursdwang.pdf', # 5 pages of images-of-text
    #'https://kansspelautoriteit.nl/publish/pages/5491/sanctiebesluit_wedwinkel.pdf', # 25 pages of images-of-text

    #'https://zoek.officielebekendmakingen.nl/trb-2022-72.pdf',
    #'https://zoek.officielebekendmakingen.nl/stb-2022-1.pdf',
    #'https://zoek.officielebekendmakingen.nl/stb-2000-5.pdf',
    #'https://zoek.officielebekendmakingen.nl/gmb-2022-385341.pdf',
    #'https://zoek.officielebekendmakingen.nl/stcrt-2019-42172.pdf',
    #'https://zoek.officielebekendmakingen.nl/prb-2022-10190.pdf',
    #'https://zoek.officielebekendmakingen.nl/wsb-2022-9718.pdf',
]