<a href="https://colab.research.google.com/github/WetsuiteLeiden/example-notebooks/blob/main/research-methods/methods_technical__pdf_part2__more_extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# In Colab (only), run this first to install the most recent wetsuite release.   For your own setup, see wetsuite's install guidelines.
!pip3 install wetsuite -U

## Purpose of this notebook


## When you want to put more work into PDF

### More work into checking PDFs




In [None]:
import html
import pprint

import fitz
import bs4
from IPython.display import HTML

import wetsuite.helpers.net
import wetsuite.helpers.localdata
import wetsuite.extras.pdf

In [3]:
# You can sometimes make a good guess at whether a PDF was concatenated from separate pieces
# by looking at specific kinds of variation, e.g. in page size and orientation:
example_pdf_bytes = wetsuite.helpers.net.download('https://open.overheid.nl/documenten/dea3e2bb-4792-4e92-8cc3-61214e81a2fc/file')

with fitz.open( stream=example_pdf_bytes, filetype="pdf") as document:
    if wetsuite.extras.pdf.do_page_sizes_vary( document)[0]:
        print( "There IS variation in page sizes")
    else:
        print( "NO variation in page sizes")
    
    for pagenum, page in enumerate(document):
        print( "page %d sized %s with %d characters"%(
            pagenum,
            wetsuite.extras.pdf.closest_paper_size_name(page.cropbox)[0:2],
            len(page.get_text().strip()),
        ) )

There IS variation in page sizes
page 0 sized ('Letter', 'portrait') with 2294 characters
page 1 sized ('Letter', 'portrait') with 3519 characters
page 2 sized ('Letter', 'portrait') with 214 characters
page 3 sized ('Letter', 'portrait') with 2158 characters
page 4 sized ('Letter', 'portrait') with 3640 characters
page 5 sized ('Letter', 'portrait') with 1542 characters
page 6 sized ('A4', 'landscape') with 600 characters
page 7 sized ('A4', 'landscape') with 1036 characters
page 8 sized ('A4', 'landscape') with 711 characters
page 9 sized ('A4', 'landscape') with 2087 characters
page 10 sized ('A4', 'landscape') with 1829 characters
page 11 sized ('A4', 'landscape') with 535 characters
page 12 sized ('A4', 'landscape') with 969 characters
page 13 sized ('A4', 'landscape') with 502 characters
page 14 sized ('A4', 'landscape') with 471 characters
page 15 sized ('A4', 'landscape') with 2119 characters
page 16 sized ('A4', 'landscape') with 200 characters
page 17 sized ('Letter', 'portra
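
As a rough illustration of how such variation can point at concatenated pieces, the sketch below groups consecutive pages of (nearly) the same size into runs. It is not part of wetsuite: `runs_of_equal_size` and its `tolerance` parameter are hypothetical names, and the sizes are hand-written values in pt that mimic the output above rather than values read from the PDF.

```python
def runs_of_equal_size(sizes, tolerance=2.0):
    """Group consecutive pages whose width and height differ by at most `tolerance` pt.

    sizes: list of (width, height) tuples in pt, one per page.
    Returns a list of (first_pagenum, last_pagenum, (width, height)) tuples.
    """
    runs = []
    for pagenum, (width, height) in enumerate(sizes):
        if runs:
            first, _, (run_w, run_h) = runs[-1]
            if abs(width - run_w) <= tolerance and abs(height - run_h) <= tolerance:
                runs[-1] = (first, pagenum, (run_w, run_h))  # extend the current run
                continue
        runs.append((pagenum, pagenum, (width, height)))     # start a new run
    return runs


# Hand-written sizes (assumption): 6 Letter portrait pages, 11 A4 landscape pages, 1 Letter portrait page.
# With a real document you could build this list with something like
#   [(page.cropbox.width, page.cropbox.height) for page in document]
sizes = [(612, 792)] * 6 + [(842, 595)] * 11 + [(612, 792)]

for first, last, size in runs_of_equal_size(sizes):
    print("pages %d..%d sized %s" % (first, last, size))
```

Each run boundary is a candidate for where one source document ends and the next begins. It is only a hint, since separately produced pieces can share a page size.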

### Putting more work into text extraction

The underlying library we use for PDFs, `pymupdf`,
can give you differently processed views of page data.
We can use this to our advantage.


[As its documentation mentions](https://pymupdf.readthedocs.io/en/latest/app1.html), its view of extraction involves concepts like:
- _page_ - largely consists of blocks (that tend to be roughly paragraphs)
- _block_ - consists of either lines and their characters, or an image or such
- _line_ - consists of spans
- _span_ - adjacent characters with identical font properties (name, size, flags, color)
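
To make that nesting concrete, here is a minimal sketch that walks a page's blocks, lines, and spans and summarizes each text block. The input is a hand-trimmed imitation of what `page.get_text('dict')` returns (most keys removed), `summarize_text_blocks` is a hypothetical helper name, and treating bit 4 of a span's `flags` as "bold" follows pymupdf's documentation of font flags, so verify it against the version you use.

```python
BOLD = 1 << 4  # assumption: bit 4 of a span's 'flags' marks bold text


def summarize_text_blocks(page_dict):
    """Return one summary dict per non-empty text block: its text, largest font size, and whether it is all bold."""
    summaries = []
    for block in page_dict.get('blocks', []):
        if block.get('type') != 0:  # type 0 is a text block; other types (e.g. images) have no 'lines'
            continue
        spans = [span for line in block['lines'] for span in line['spans']]
        text = ''.join(span['text'] for span in spans).strip()
        if not text:                # skip whitespace-only blocks
            continue
        summaries.append({
            'text': text,
            'size': max(span['size'] for span in spans),
            'bold': all(span['flags'] & BOLD for span in spans),
        })
    return summaries


# Hand-trimmed sample in the shape of get_text('dict') output
sample_page = {'blocks': [
    {'number': 0, 'type': 0, 'lines': [
        {'spans': [{'size': 12.0, 'flags': 20, 'font': 'UniversLT-Bold',
                    'text': 'Scheepvaartvergunning voor het overschrijden van de op de Mark en Dintel '}]},
    ]},
    {'number': 1, 'type': 0, 'lines': [
        {'spans': [{'size': 9.06, 'flags': 4, 'font': 'UniversLT',
                    'text': 'Besluitnummer 560746 ingevolge de Scheepvaartverkeerswet bekend gemaakt op 23 augustus 2022 '}]},
    ]},
    {'number': 2, 'type': 0, 'lines': [
        {'spans': [{'size': 9.06, 'flags': 4, 'font': 'UniversLT', 'text': '  '}]},
    ]},
]}

summaries = summarize_text_blocks(sample_page)
for summary in summaries:
    print(summary)
```

A block that is entirely bold and set in a larger size than its neighbours is a plausible heading candidate, though that is a heuristic and not something the library decides for you.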


Different extraction functions/parameters give you
- relatively raw data (down to characters),
- said blocks,
- or output where it made somewhat more creative decisions to arrive at a document-like structure

If you just care about the words it contains, most of those will do,
and 'text' (the default) is simplest because it just flattens everything into one string.

If on the other hand you want 
- natural reading order
- flowing text 
  - not broken up by 
    - headers and footers, 
    - page breaks,
  - split into paragraphs,
- tables and images,
- estimating which parts are headers and which parts of the text fall under which headers

...then you may need to go down to a moderately raw form to control what it really does.
Try commenting and uncommenting the options in the code block below
to see what each gives you, and see [the documentation for that functionality](https://pymupdf.readthedocs.io/en/latest/textpage.html) for a little more explanation.

In [26]:
example_pdf_bytes = wetsuite.helpers.net.download('https://zoek.officielebekendmakingen.nl/wsb-2022-9718.pdf') # simple, ~one page

In [None]:
for option, flags in (
    # block level
    ('blocks',None),
    ('html',  fitz.TEXTFLAGS_HTML & ~fitz.TEXT_PRESERVE_IMAGES),
    ('dict',  None), # may also want to ignore blocks that are images,  if we can?
    #('json',  None),

    # more analysed
    ('text',  None),
    ('xhtml', fitz.TEXTFLAGS_XHTML & ~fitz.TEXT_PRESERVE_IMAGES),

    # more low level
    ('words', None),
    #('xml',   None),

    #('rawdict', None),
    #('rawjson', None),
):
    with fitz.open( stream=example_pdf_bytes, filetype="pdf") as document:
        for page in document:
            display( HTML( f'<h1>{option}</h1>' ) )
            res = page.get_text( option=option, flags=flags )
            if isinstance(res, str):
                display('HTML')
                display( HTML( '<pre>%s</pre>'%(
                    html.escape(res).replace('\n','<br/>')
                    #[:5000] # limit output
                ) ) )

            elif isinstance(res, dict):  # just a few blocks, because this is verbose
                for block in res['blocks'][:3]:
                    display(block)

            elif isinstance(res, list):  # just a few items
                for block in res[:10]:
                    display(block)

            else:
                display( res )

            break # stop after just one page, actually

(119.05500030517578,
 109.42204284667969,
 556.0970458984375,
 157.27801513671875,
 'Scheepvaartvergunning voor het overschrijden van de op de Mark en Dintel \n(riviervakken VI en VII)) op traject vanaf de Mandersluis - Stampersgat (Suiker \nUnie) en retour ter plaatse toegestane afmetingen van een schip. \n',
 0,
 0)

(119.05500030517578,
 166.2091827392578,
 555.284912109375,
 208.31101989746094,
 'Besluitnummer 560746 ingevolge de Scheepvaartverkeerswet bekend gemaakt op 23 augustus 2022 \nvoor het overschrijden van de op de Mark en Dintel (riviervakken VI en VII) ter plaatse toegestane afme- \ntingen, met het motorbeunschip "Klazina" met scheepsnummer 02300591 tussen 1 september 2022 en \n30 januari 2023. \n',
 1,
 0)

(119.05500030517578,
 217.8852081298828,
 124.09235382080078,
 228.7300262451172,
 '  \n',
 2,
 0)

(119.05500030517578,
 238.4037628173828,
 353.51007080078125,
 249.1670379638672,
 'Bezwaarmogelijkheden met betrekking tot het besluit \n',
 3,
 0)

(119.05500030517578,
 258.795166015625,
 555.2759399414062,
 321.73504638671875,
 'Op grond van de Algemene wet bestuursrecht (Awb) kunnen belanghebbenden tegen dit besluit een \nbezwaarschrift indienen. De termijn voor het indienen van een bezwaarschrift is 6 weken, ingaande op \n24 augustus 2022. Het bezwaarschrift moet gericht zijn aan het dagelijks bestuur van waterschap Bra- \nbantse Delta, Postbus 5520, 4801 DZ te Breda. U dient op de envelop het woord ‘bezwaarschrift’ te \nvermelden. Wij verzoeken u om in het bezwaarschrift ook uw telefoonnummer en e-mailadres te ver- \nmelden. Het bezwaarschrift moet de volgende inhoud hebben: \n',
 4,
 0)

(119.05500030517578,
 331.3091735839844,
 249.29373168945312,
 342.3800354003906,
 '1. \nnaam en adres indiener; \n',
 5,
 0)

(119.05500030517578,
 342.18017578125,
 201.12936401367188,
 353.25103759765625,
 '2. \ndagtekening; \n',
 6,
 0)

(119.05500030517578,
 353.0511779785156,
 284.0110778808594,
 364.1220397949219,
 '3. \nhet nummer van de vergunning; \n',
 7,
 0)

(119.05500030517578,
 363.92218017578125,
 414.9588317871094,
 374.9930419921875,
 '4. \nde reden(en) waarom u zich niet met het besluit kan verenigen; \n',
 8,
 0)

(119.05500030517578,
 374.7931823730469,
 243.92835998535156,
 385.8640441894531,
 '5. \nhandtekening indiener. \n',
 9,
 0)

'HTML'

{'number': 0,
 'type': 0,
 'bbox': (119.05500030517578,
  109.42204284667969,
  556.0970458984375,
  157.27801513671875),
 'lines': [{'spans': [{'size': 12.0,
     'flags': 20,
     'font': 'UniversLT-Bold',
     'color': 0,
     'ascender': 0.9380000233650208,
     'descender': -0.25,
     'text': 'Scheepvaartvergunning voor het overschrijden van de op de Mark en Dintel ',
     'origin': (119.05500030517578, 120.67803955078125),
     'bbox': (119.05500030517578,
      109.42204284667969,
      551.04296875,
      123.67803955078125)}],
   'wmode': 0,
   'dir': (1.0, 0.0),
   'bbox': (119.05500030517578,
    109.42204284667969,
    551.04296875,
    123.67803955078125)},
  {'spans': [{'size': 12.0,
     'flags': 20,
     'font': 'UniversLT-Bold',
     'color': 0,
     'ascender': 0.9380000233650208,
     'descender': -0.25,
     'text': '(riviervakken VI en VII)) op traject vanaf de Mandersluis - Stampersgat (Suiker ',
     'origin': (119.05500030517578, 137.47802734375),
     'bbox': 

{'number': 1,
 'type': 0,
 'bbox': (119.05500030517578,
  166.2091827392578,
  555.284912109375,
  208.31101989746094),
 'lines': [{'spans': [{'size': 9.0600004196167,
     'flags': 4,
     'font': 'UniversLT',
     'color': 0,
     'ascender': 0.9470000267028809,
     'descender': -0.25,
     'text': 'Besluitnummer 560746 ingevolge de Scheepvaartverkeerswet bekend gemaakt op 23 augustus 2022 ',
     'origin': (119.05500030517578, 174.78900146484375),
     'bbox': (119.05500030517578,
      166.2091827392578,
      548.5830688476562,
      177.0540008544922)}],
   'wmode': 0,
   'dir': (1.0, 0.0),
   'bbox': (119.05500030517578,
    166.2091827392578,
    548.5830688476562,
    177.0540008544922)},
  {'spans': [{'size': 9.0600004196167,
     'flags': 4,
     'font': 'UniversLT',
     'color': 0,
     'ascender': 0.9470000267028809,
     'descender': -0.25,
     'text': 'voor het overschrijden van de op de Mark en Dintel (riviervakken VI en VII) ter plaatse toegestane afme- ',
     'ori

{'number': 2,
 'type': 0,
 'bbox': (119.05500030517578,
  217.8852081298828,
  124.09235382080078,
  228.7300262451172),
 'lines': [{'spans': [{'size': 9.0600004196167,
     'flags': 4,
     'font': 'UniversLT',
     'color': 0,
     'ascender': 0.9470000267028809,
     'descender': -0.25,
     'text': '  ',
     'origin': (119.05500030517578, 226.46502685546875),
     'bbox': (119.05500030517578,
      217.8852081298828,
      124.09235382080078,
      228.7300262451172)}],
   'wmode': 0,
   'dir': (1.0, 0.0),
   'bbox': (119.05500030517578,
    217.8852081298828,
    124.09235382080078,
    228.7300262451172)}]}

'HTML'

'HTML'

(119.05500030517578,
 109.42204284667969,
 254.25900268554688,
 123.67803955078125,
 'Scheepvaartvergunning',
 0,
 0,
 0)

(257.5950012207031,
 109.42204284667969,
 283.3590087890625,
 123.67803955078125,
 'voor',
 0,
 0,
 1)

(286.69500732421875,
 109.42204284667969,
 305.36700439453125,
 123.67803955078125,
 'het',
 0,
 0,
 2)

(308.7030029296875,
 109.42204284667969,
 384.85498046875,
 123.67803955078125,
 'overschrijden',
 0,
 0,
 3)

(388.1910095214844,
 109.42204284667969,
 408.3869934082031,
 123.67803955078125,
 'van',
 0,
 0,
 4)

(411.7229919433594,
 109.42204284667969,
 425.72698974609375,
 123.67803955078125,
 'de',
 0,
 0,
 5)

(429.06298828125,
 109.42204284667969,
 443.72698974609375,
 123.67803955078125,
 'op',
 0,
 0,
 6)

(447.06298828125,
 109.42204284667969,
 461.0669860839844,
 123.67803955078125,
 'de',
 0,
 0,
 7)

(464.40301513671875,
 109.42204284667969,
 493.26300048828125,
 123.67803955078125,
 'Mark',
 0,
 0,
 8)

(496.5989990234375,
 109.42204284667969,
 510.6029968261719,
 123.67803955078125,
 'en',
 0,
 0,
 9)
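
The `words` tuples shown above end in a block number, a line number, and a word number, so lines can be reassembled without any geometry. The following is a small sketch of that idea; `words_to_lines` is a hypothetical helper, the first three tuples are rounded from the output above, and the fourth is an invented word on a second line added only for illustration.

```python
from itertools import groupby


def words_to_lines(words):
    """Join word tuples (x0, y0, x1, y1, word, block_no, line_no, word_no) into one string per line."""
    ordered = sorted(words, key=lambda w: (w[5], w[6], w[7]))
    lines = []
    # groupby needs its input sorted by the grouping key, which the sort above ensures
    for _, group in groupby(ordered, key=lambda w: (w[5], w[6])):
        lines.append(' '.join(w[4] for w in group))
    return lines


sample_words = [
    (286.7, 109.4, 305.4, 123.7, 'het', 0, 0, 2),
    (119.1, 109.4, 254.3, 123.7, 'Scheepvaartvergunning', 0, 0, 0),
    (257.6, 109.4, 283.4, 123.7, 'voor', 0, 0, 1),
    (119.1, 126.2, 190.0, 140.5, '(riviervakken', 0, 1, 0),  # invented coordinates, illustration only
]

lines = words_to_lines(sample_words)
print(lines)
```

This follows the order pymupdf assigned, which is not necessarily the natural reading order.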

Say we looked at the output above and liked `xhtml` for its apparently useful analysis of paragraphs and its markup of headers and such,
but noticed that the header that is _visually_ at the top of the page ends up structurally _below_ the rest of the page: it seems this does _not_ care about natural reading order.

Okay, so then we looked at the `html` form and noticed it is one step more raw, and includes positions we could sort by.
How hard could it be?

In [28]:
# Say we wanted to sort elements by their position on the page.

# One complication is that things we consider to be on the same line may easily be offset by 0.5pt in y (or maybe up to half the text height),
#  so when we sort, we should also try to group by "probably on the same line"
def line_sort():
    pass


def deal_with_page(pagenum, page):
    res = page.get_text( option='html', flags=fitz.TEXTFLAGS_HTML & ~fitz.TEXT_PRESERVE_IMAGES )
    soup = bs4.BeautifulSoup(res, features='lxml') # TODO: prefer lxml.html

    l = []
    for thing in soup.select('div>*'): # looks for p; keep in mind that spans inside sometimes have e.g. colors
        #print('-',thing)
        
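        d = {'pagenum':pagenum}
        # the style attribute looks like 'top:72.6pt;left:119.1pt;line-height:9.1pt'
        for kv in (thing.get('style') or '').split(';'):
            if ':' not in kv: # skip empty pieces, e.g. from a missing style or a trailing semicolon
                continue
            k, v = kv.split(':', 1)
            d[k.strip()] = v.strip()
        # remove style attributes so that the printed elements are easier to read
        thing.attrs.pop('style', None)
        for span in thing.find_all('span'):
            span.attrs.pop('style', None)
        if 'top' not in d: # no position, so nothing to sort by
            continue
        l.append( (float(d['top'].rstrip('pt')), d, thing) )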

    l.sort( key=lambda tup:tup[0] ) # sort by top 
    for top, style, thing in l: # named style rather than dict, to avoid shadowing the builtin

        # Ignore everything in the top inch and bottom inch (these are in pt, which is by definition 1/72 inch).
        # Note that this is a somewhat dangerous assumption, and should involve more checks.
        if top < 72  or  top > page.cropbox.y1-72:
            continue
        # For context: PDF defines five different page boxes, but most of them matter mainly for print production,
        # and we mostly care about mediabox and cropbox, which are _usually_ the same, but if there is a difference we probably care about cropbox.
        # See also: https://pymupdf.readthedocs.io/en/latest/glossary.html#MediaBox

        print( style, thing )

    # natural reading order is a little more complex - the list numbering actually appears _after_,
    # and _slightly_ below, the text to the right of it.
    # You can't just sort by y, then x position,  because we are now in the land of typesetting.
    #   y position can be offset a little from other things that are, to us, on the same line.
    # If it's less than some fraction of line-height away it's probably the same line.


with fitz.open( stream=example_pdf_bytes, filetype="pdf") as document:
    for pagenum, page in enumerate(document):
        deal_with_page(pagenum, page)
        

{'pagenum': 0, 'top': '72.6pt', 'left': '119.1pt', 'line-height': '9.1pt'} <p><span>Officiële uitgave van het dagelijks bestuur van het Waterschap Brabantse Delta </span></p>
{'pagenum': 0, 'top': '111.1pt', 'left': '119.1pt', 'line-height': '12.0pt'} <p><b><span>Scheepvaartvergunning voor het overschrijden van de op de Mark en Dintel </span></b></p>
{'pagenum': 0, 'top': '127.9pt', 'left': '119.1pt', 'line-height': '12.0pt'} <p><b><span>(riviervakken VI en VII)) op traject vanaf de Mandersluis - Stampersgat (Suiker </span></b></p>
{'pagenum': 0, 'top': '144.7pt', 'left': '119.1pt', 'line-height': '12.0pt'} <p><b><span>Unie) en retour ter plaatse toegestane afmetingen van een schip. </span></b></p>
{'pagenum': 0, 'top': '167.5pt', 'left': '119.1pt', 'line-height': '9.1pt'} <p><span>Besluitnummer 560746 ingevolge de Scheepvaartverkeerswet bekend gemaakt op 23 augustus 2022 </span></p>
{'pagenum': 0, 'top': '178.0pt', 'left': '119.1pt', 'line-height': '9.1pt'} <p><span>voor het overschri
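
The "probably on the same line" grouping that the comments above describe could look roughly like the following. This is a minimal sketch under stated assumptions, not the notebook's implementation: `group_into_lines` and `tolerance_fraction` are hypothetical names, the items are plain dicts with positions already parsed to floats, and the sample values are invented to mimic a numbered list whose number sits slightly below the text to its right.

```python
def group_into_lines(items, tolerance_fraction=0.5):
    """Group items whose 'top' differs by less than a fraction of line height, then sort each group left to right.

    items: list of dicts with float 'top', 'left', and 'line_height' (all in pt), plus 'text'.
    Returns a list of lines, each a list of items.
    """
    lines = []
    for item in sorted(items, key=lambda it: it['top']):
        if lines:
            reference = lines[-1][0]  # topmost item of the line currently being built
            if abs(item['top'] - reference['top']) <= tolerance_fraction * reference['line_height']:
                lines[-1].append(item)
                continue
        lines.append([item])
    for line in lines:
        line.sort(key=lambda it: it['left'])
    return lines


# Invented sample values (assumption): each list number is 0.4pt lower than its text
items = [
    {'top': 332.6, 'left': 147.4, 'line_height': 9.1, 'text': 'naam en adres indiener;'},
    {'top': 333.0, 'left': 119.1, 'line_height': 9.1, 'text': '1.'},
    {'top': 343.5, 'left': 147.4, 'line_height': 9.1, 'text': 'dagtekening;'},
    {'top': 343.9, 'left': 119.1, 'line_height': 9.1, 'text': '2.'},
]

grouped = [[it['text'] for it in line] for line in group_into_lines(items)]
print(grouped)
```

Sorting by `top` alone would put each number after its text. Grouping first and then sorting by `left` restores the order a reader expects, at the cost of a tolerance that will need tuning for documents with tight line spacing or multiple columns.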

## Other suggested test cases

...in case you want to experience more of the variation:

In [None]:
test_urls = [
    'https://repository.overheid.nl/frbr/officielepublicaties/kst/31700-VIII/kst-31700-VIII-77-b1/1/pdf/kst-31700-VIII-77-b1.pdf', # skewed, lower-quality image of text that OCR makes a bunch of mistakes on

    'https://kansspelautoriteit.nl/publish/library/32/last_onder_dwangsom_slots_dev.pdf', # 1 page of text, the rest is images-of-text

    #'https://www.tweedekamer.nl/downloads/document?id=2023D37764',
    #'https://repository.overheid.nl/frbr/officielepublicaties/kst/20487-32/kst-20487-32-b2/1/pdf/kst-20487-32-b2.pdf',
    #'https://repository.overheid.nl/frbr/officielepublicaties/kst/17050-284/kst-17050-284-b1/1/pdf/kst-17050-284-b1.pdf',

    #'https://kansspelautoriteit.nl/publish/pages/5492/00_082_720_openbare_versie_last_onder_bestuursdwang.pdf', # 5 pages of images-of-text
    #'https://kansspelautoriteit.nl/publish/pages/5491/sanctiebesluit_wedwinkel.pdf', # 25 pages of images-of-text

    #'https://zoek.officielebekendmakingen.nl/trb-2022-72.pdf',
    #'https://zoek.officielebekendmakingen.nl/stb-2022-1.pdf',
    #'https://zoek.officielebekendmakingen.nl/stb-2000-5.pdf',
    #'https://zoek.officielebekendmakingen.nl/gmb-2022-385341.pdf',
    #'https://zoek.officielebekendmakingen.nl/stcrt-2019-42172.pdf',
    #'https://zoek.officielebekendmakingen.nl/prb-2022-10190.pdf',
    #'https://zoek.officielebekendmakingen.nl/wsb-2022-9718.pdf',
]