<a href="https://colab.research.google.com/github/WetSuiteLeiden/data-collection/blob/master/web_woobesluit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Purpose of this notebook



Fetch documents from [rijksoverheid.nl/documenten](https://www.rijksoverheid.nl/documenten?type=Woo-besluit),
focusing first and foremost on [Woo](https://nl.wikipedia.org/wiki/Wet_openbaarheid_van_bestuur#Wet_open_overheid) requests.

We abstract the "scrape result items from this site's pagination" part into a module, 
but each type of document (Woo is just one) deserves its own handling, 
both because of what you want to do with the contents, and because you probably want to cache the results.

Whatever document type you search for, that website broadly gives you:
* Browse/search result pages 
  - these look the same for most document types: a title-and-link, and a summary
  - for Woo looks like https://www.rijksoverheid.nl/documenten?type=Woo%2Dbesluit&pagina=45

* A case's detail page 
  - the contents of these vary a lot with the requested document type, and a little more within some document types. For some, it contains everything. For others, just a link to a download. For some, a mix.
  - for Woo looks like https://www.rijksoverheid.nl/documenten/woo-besluiten/2023/05/22/besluit-op-woo-verzoek-over-alle-eendenhouderijen-nederland - that is, primarily a link to an 'attached' PDF

* additionally, various types of detail page downloads, embeds, and/or links elsewhere
  - for Woo this is typically a PDF, e.g. https://open.overheid.nl/documenten/76f1a787-3f8a-452c-bf58-5911e1a89bcd/file 
  - e.g. https://www.rijksoverheid.nl/documenten/mediateksten/2025/01/10/letterlijke-tekst-persconferentie-na-ministerraad-10-januari-2025 has a link to youtube
  - e.g. https://www.rijksoverheid.nl/documenten/geluidsfragmenten/2024/10/15/gesproken-versie-van-hoe-voorkom-ik-koolmonoxidevergiftiging-in-mijn-huis has an embed of, and a download link to, an MP3 file hosted elsewhere
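
To make the result-page URL shape concrete, here is a minimal sketch that builds one from a document type and a page number. It only mirrors the `type` and `pagina` query parameters visible in the example URL above; the helper name `results_page_url` is made up for illustration, and this is not how the wetsuite module necessarily does it. Note that `%2D` in the example URL is just an escaped `-`, so both spellings are equivalent.

```python
import urllib.parse

BASE_URL = 'https://www.rijksoverheid.nl/documenten'

def results_page_url(doctype, page):
    """Build the URL of one browse/search result page (hypothetical helper).

    Assumes the site only needs the 'type' and 'pagina' query parameters,
    as seen in the example URL; other parameters may exist.
    """
    query = urllib.parse.urlencode({'type': doctype, 'pagina': page})
    return '%s?%s' % (BASE_URL, query)

print(results_page_url('Woo-besluit', 45))
# the escaped form used in the example URL decodes to the same document type
print(urllib.parse.unquote('Woo%2Dbesluit'))
```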

In [1]:
import time, datetime, random, warnings
import urllib.parse, re, pprint

import bs4
import dateutil.parser

import wetsuite.datacollect.rijksoverheid_nl_documenten
import wetsuite.helpers.net
import wetsuite.helpers.strings
import wetsuite.helpers.notebook
import wetsuite.helpers.localdata
import wetsuite.helpers.patterns
import wetsuite.extras.pdf

## Before we get to Woo, let's start with an easier example 

The case of Woo has a few hairy details, so if you are interested in fetching other things from [rijksoverheid.nl/documenten](https://www.rijksoverheid.nl/documenten?type=Woo-besluit), keep reading this section. If you are only interested in Woo, skip straight to that part.

Let's start with the `Mediatekst` document type, which seems to be press conferences, press releases, and such.

As mentioned, the pagination is handled for you, but you have to figure out what to do with the detail page each result links to.

For the mentioned press pages within this set, those pages seem to be just a single HTML page with all the text,
so just downloading them and packing them into a small dataset should already be useful.

In [4]:
# Okay, almost easy.  Let's also make this a dataset we will provide. That means we have to create the store in a slightly specific way:
mediatekst_html_store = wetsuite.helpers.localdata.LocalKV('mediatekst-html.db',        key_type=str,value_type=bytes) # url to htmlbytes
mediatekst_html_store._put_meta('description_short','A dataset with press conferences and releases from www.rijksoverheid.nl/documenten ')
mediatekst_html_store._put_meta('description','''
This dataset stores all documents of type 'Mediatekst' from https://www.rijksoverheid.nl/documenten
which seems to be primarily conferences and press releases.

Each value is the HTML file as a bytestring, unprocessed from how we found it.
''')

In [None]:
def handle_mediatekst_detail_page(soup, url):
    ''' scrape_pagination wants to call something for each detail page. 
        In this case that page _is_ the content, so we only care to fetch it.
    '''
    print('ITEM',url)
    _, came_from_cache = wetsuite.helpers.localdata.cached_fetch(mediatekst_html_store, url)
    if not came_from_cache:
        time.sleep(2)


# currently under 300 items, should take maybe 20 minutes
wetsuite.datacollect.rijksoverheid_nl_documenten.scrape_pagination(
    doctype='Mediatekst',
    from_date=datetime.date(year=2010, month=1, day=1), # we know this store currently contains nothing before then 
    to_date  =datetime.date.today(), 
    detail_page_callback=handle_mediatekst_detail_page, 
    debug=True)

In [14]:
mediatekst_html_store.summary(get_num_items=True)

{'size_bytes': 15044608,
 'size_readable': '14.3MiB',
 'num_items': 276,
 'avgsize_bytes': 54509,
 'avgsize_readable': '53KiB'}

## Moving on to Woo

We can fetch responses to Woo requests from [rijksoverheid.nl/documenten](https://www.rijksoverheid.nl/documenten?type=Woo-besluit) if we filter for `type=Woo-besluit`.

It should be clearly pointed out that, due to the way Woo requests are handled, **there is no singular complete source of Woo requests** (yet?).

...that the government provides. The third-party [woogle](https://woogle.wooverheid.nl/search?q=*&page=1&type=2i) is probably your best bet for... well, at least [_more_](https://woogle.wooverheid.nl/overview).

### Notes on detail pages

This is also one of the more complex examples of fetching things from this site,
in that **there is variation both in how detail pages work, and how the fetched documents are structured**.

Consider:
* https://www.rijksoverheid.nl/documenten/woo-besluiten/2023/06/20/besluit-op-woo-verzoek-over-cites-2-b-soorten
  - is an overall rejection so only has a decision document

* https://www.rijksoverheid.nl/documenten/woo-besluiten/2023/06/22/besluit-op-woo-verzoek-over-de-vergaderingen-van-de-kerngroep-bloembollen
  - has a single document that is decision + inventory + contents

* https://www.rijksoverheid.nl/documenten/woo-besluiten/2023/06/14/besluit---woo-verzoek-ongebruikelijke-transacties-estland-letland-en-litouwen
  - has a separate decision document, and a document that is the contents of the response, here a bunch of tables

* https://www.rijksoverheid.nl/documenten/woo-besluiten/2023/06/15/besluit-op-woo-verzoek-over-correspondentie-naar-aanleiding-van-fouten-in-de-lijst-met-top-100-ammoniakuitstoters
  - has a separate decision, and the bijlage/documents is actually a link to _another_ detail page first -- which breaks our otherwise-mostly-correct assumption that every page is a case

* https://www.rijksoverheid.nl/documenten/woo-besluiten/2023/06/21/besluit-op-woo-verzoek-over-vertraging-van-de-bouw-van-stikstofinstallatie-zuidbroek-ii
  - has a separate decision. It also has a separate inventaris and seven separate content documents, each via their own detail page.

* https://open.overheid.nl/documenten/a645fe8c-6c58-4e75-b355-aa3a97023eb8/file
  - is a document that is the decision + inventory, and points out that the real data consists of large files that should be requested via mail

* https://open.overheid.nl/documenten/107e45ab-6533-4bda-bfc2-816ce107906e/file
  - is images of a besluit document. The PDF contains no text layer / OCR.

* https://open.overheid.nl/documenten/ronl-439dfebe8cffecb9a385633cb757ced59de469ee/pdf
  - has one page of OCR-less image-of-text, then goes on to actual text in the middle of a sentence
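
With this much variation, it helps to decide per link title whether it looks like the decision document or like an attachment. The sketch below isolates the whitelist idea that the detail-page handler further down uses: clean up titles of the form `Download 'title'`, then test them against a pattern of known decision-like prefixes. The function names and the example titles are made up for illustration; the pattern itself is the one from the handler.

```python
import re

# prefixes that we treat as 'this is the decision document' (same whitelist as the handler below)
DECISION_RE = re.compile(
    r'^(besluit|([12345]e )?deelbesluit|woo-besluit|aanvullend besluit|herstelbesluit|beslissing)'
)

def clean_link_title(name):
    """Turn link text like  "Download 'title'"  into  "title"  (hypothetical helper)."""
    if name.startswith("Download '"):
        name = name[10:].rstrip("'")
    return name

def looks_like_decision(name):
    """True if the (cleaned) title starts with one of the whitelisted prefixes."""
    return DECISION_RE.search(name.lower()) is not None

# illustrative titles, not taken from the site
for title in ("Download 'Besluit op Woo-verzoek'", '2e deelbesluit Woo-verzoek', 'Bijlage 1 - inventarislijst'):
    cleaned = clean_link_title(title)
    print(repr(cleaned), looks_like_decision(cleaned))
```

A whitelist like this errs towards calling things attachments, which means unexpected titles show up as complaints rather than being silently misfiled.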

### Fetch case-detail pages

As such, creating a dataset with more consistency than provided to us in the first place
will take more work, and some creativity.

For now, let's only care about the motivation behind the decision, and not the document(s) that the request/decision implies,
if only to keep our task reasonable for now.

Now we can get a little more practical:
- each result page contains a number of cases, each case link is to a detail page; this is handled for us.
- that detail page typically contains a short paragraph and links to one or more PDFs.
  - let's always cache the linked document - saves time and resources (~2500 cases, times a bunch of MByte per document, times a couple per case, is maybe 40GByte and hours to days of fetching)
  - let's assume these are always _finished_ cases, not evolving ones, so that the detail pages themselves can also be cached; TODO: check that.
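
The caching idea in the list above can be sketched in a few lines. This is only an illustration of the pattern, assuming a dict-like store; it is not the implementation of `wetsuite.helpers.localdata.cached_fetch`, and `cached_fetch_sketch` and `fake_fetch` are made-up names. It returns the same kind of `(data, came_from_cache)` pair that the cells in this notebook rely on, so that a caller can be polite and sleep only after real fetches.

```python
def cached_fetch_sketch(store, url, fetch):
    """Return (data, came_from_cache); only call fetch(url) on a cache miss.

    store is anything dict-like (a plain dict here, standing in for a LocalKV),
    fetch is any function that takes a URL and returns bytes.
    """
    if url in store:
        return store[url], True
    data = fetch(url)
    store[url] = data
    return data, False

# a stand-in for a real download, so this runs without network access
calls = []
def fake_fetch(url):
    calls.append(url)
    return b'<html>example</html>'

store = {}
first = cached_fetch_sketch(store, 'https://example.com/a', fake_fetch)
second = cached_fetch_sketch(store, 'https://example.com/a', fake_fetch)
print(first[1], second[1], len(calls))
```

Because the second call is answered from the store, the fetch function is only called once, which is exactly why caching finished cases saves both time and load on the server.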

In [2]:
# fetch and cache data as it is provided - both the detail pages and the documents that are linked to
woo_detail_pages     = wetsuite.helpers.localdata.LocalKV(  'woo-besluiten_detailpages.db', key_type=str,value_type=bytes)    # url -> page_bytes
woo_linked_docs      = wetsuite.helpers.localdata.LocalKV(  'woo-besluiten_docs.db',        key_type=str,value_type=bytes)    # url -> content_bytes

In [3]:
# These are headed out to be datasets, so create the stores that will be those datasets.

## Metadata
# It was a choice to collect the metadata and text into separate datasets.
# As things are, we do this during detail-page handling
woo_metadata         = wetsuite.helpers.localdata.MsgpackKV('woo-besluiten_meta.db',        key_type=str,value_type=None)     # case -> metadata_dict
#woo_metadata.truncate() # when redoing that dataset, ensure there's no data from a previous run
woo_metadata._put_meta('description_short','A dataset focusing on the reaction documents to Woo requests; this gives basic metadata. ')
woo_metadata._put_meta('description','''
This dataset tries to focus on the reactions to Woo requests.

its .data is a map 
- from a unique case identifier (not necessarily meaningful, currently happens to be the case's URL on the website, looks like 'https://www.rijksoverheid.nl/documenten/woo-besluiten/2023/08/31/besluit-op-woo-verzoek-geen-documenten-over-gewasbeschermingsregistratie')
- to a dict like:
    {'detail_page_url':       'https://www.rijksoverheid.nl/documenten/woo-besluiten/2023/08/31/besluit-op-woo-verzoek-geen-documenten-over-gewasbeschermingsregistratie',
     'title':                 'Besluit op Woo-verzoek geen documenten over gewasbeschermingsregistratie',
     'response_document_url': 'https://open.overheid.nl/documenten/ef04ac22-1eb9-4b5c-a439-39fb3b636fdf/file',
     'attachments':           [],
     'dates':                 ['2023-08-31'],
     'onderwerpen':           ['Bestrijdingsmiddelen'],
     'responsible':           'Ministerie van Landbouw, Natuur en Voedselkwaliteit'
    }

You are probably also interested in wetsuite.datasets.load()-ing the related dataset that has the text for the response documents (fetched by response_document_url)

Both could use some refinement.   TODO: clean up both the woo metadata and woo text datasets                       
''')


## Text
# somewhere later, we generate text from the PDFs via OCR, which is slow so let's also cache that 
woo_linked_docs_txt  = wetsuite.helpers.localdata.LocalKV(  'woo-besluiten_docs_txt.db',    key_type=str,value_type=str)      # url -> text
woo_linked_docs_txt._put_meta('description_short','A dataset focusing on the reaction documents to Woo requests; this is a plain text variant of the PDF text. ')
woo_linked_docs_txt._put_meta('description','''
This dataset tries to focus on the reactions to Woo requests.
                                                     
You will probably want the Woo metadata dataset first.
It will have `response_document_url` keys linking to a PDF document.

This is a map from that URL to a plaintext string of the text that PDF contains.
                              
Notes:
- if we figure it was something other than a decision, it may not be present
- the extraction is currently 'the text the PDF itself reports'
    
Both could use some refinement.   TODO: clean up both the woo metadata and woo text datasets
''')

In [4]:
# summarizing metadata into a dataset

def handle_woobesluit_detail_page(soup, url):
    ''' This function only fetches the detail page and fills in metadata.
        We could also fetch and parse PDFs at the same time -- but we can also do that later.
    '''
    print('ITEM',url)

    ### Fetch that detail HTML page and parse various metadata out of that detail-page HTML.   
    try:   
        # cached fetch, we assume it's answered once and won't change over time -- TODO: check that's true, we could in theory adhere to HTTP caching rules better
        #detail_page_bytes = wetsuite.helpers.net.download( detail_page_absurl )
        detail_page_bytes, came_from_cache = wetsuite.helpers.localdata.cached_fetch( woo_detail_pages, url )
    except ValueError as ve:
        print('SKIP, error  %s  while fetching detail page  %r'%(ve, url))
        return
    
    soup = bs4.BeautifulSoup( detail_page_bytes )
    entry_metadata = {
        'detail_page_url':       url,
        'title':                 soup.find('h1').text.strip(),
        'response_document_url': None,                                    # the decision to the request, with reasoning
        'attachments':           [],   # - zero or more documents (each downloading on their own HTML page). See notes above on the variation of what is in here
        'dates':                 [],   
        'onderwerpen':           [],   # subject       (div.linkBlock)
        'responsible':           None, # who responded (div.belongsto).   not a list, only ever one
    }

    # fish out the dates,
    for mt in soup.select('p.article-meta, p.meta'):
        mtt = mt.text.strip()
        if 'pagina' in mtt:
            continue
        if '|' in mtt:
            mtd = mtt.split('|',1)[1].strip()
            try:
                entry_metadata['dates'].append( dateutil.parser.parse(mtd).strftime('%Y-%m-%d') ) # reformat ISO style 
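            except (ValueError, OverflowError): # what dateutil raises for unparseable input (its ParserError is a ValueError)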
                print("WARN: didn't understand %r as date"%mtd)
    # subject and answerer(?),
    for lb in soup.select('div.linkBlock a'):
        entry_metadata['onderwerpen'].append( lb.text.strip() )
    for bt in soup.select('div.belongsTo a'):
        entry_metadata['responsible'] = bt.text.strip()

    ## Many pages seem to follow this format
    #  main document, often the decision 
    alist = soup.select('div.article.content div.intro a')
    if len(alist) == 0:
        pass
        #print( "no div.intro a" )
    else: # assume it's len 1, CONSIDER: check
        besluit_a = alist[0]
        besluit_absurl = urllib.parse.urljoin( url, besluit_a.get('href') ) # urljoin in case they're relative (I think they're not)
        entry_metadata['response_document_url'] = besluit_absurl
    #  additional attachment documents (note: links to the respective download pages, not to the documents themselves)
    for attachment_li in soup.select('div.results ul.common li'):
        attachment_a = attachment_li.find('a')
        entry_metadata['attachments'].append( (urllib.parse.urljoin( url, attachment_a.get('href') ),
                                                attachment_a.find('h3').text.strip()) )
    ## ...unless the page is different.
    #   in that case neither of the above blocks will have collected anything (so no real need to make it conditional)
    #   and this probably will instead
    for adlc in soup.select('div.download a.download-chunk'):
        adlc_absurl = urllib.parse.urljoin( url, adlc.get('href') ) 
        name = adlc.find('h2').text.strip()

        if name.startswith("Download '"): # clean up link text like  "Download 'title'"  to  "title"
            name = name[10:].rstrip("'")

        #if name.lower().startswith('bijlage') or 'inventaris' in name.lower():
        #    pass

        # whitelist style to get complaints rather than silently ignoring
        if re.search('^(besluit|([12345]e )?deelbesluit|woo-besluit|aanvullend besluit|herstelbesluit|beslissing)', name.lower()) is not None:
            if entry_metadata.get('response_document_url', None) is None:
                entry_metadata['response_document_url'] = adlc_absurl
            else: # assume it's an attachment?
                entry_metadata['attachments'].append( (adlc_absurl, name) )

        else: # assume it's an attachment?
            entry_metadata['attachments'].append( (adlc_absurl, name) )

    ## complain  if we still didn't find any document links
    if entry_metadata['response_document_url'] is None and len(entry_metadata['attachments']) == 0:
        if '[ingetrokken]' in entry_metadata['title'].lower():
            print('NOTE: no document, seems fine because [ingetrokken], on %r'%url)
            # actually there may still be a link if retracted, see https://www.rijksoverheid.nl/documenten/woo-besluiten/2023/07/05/besluit-op-woo-verzoek-over-documentatie-tussen-ministerie-van-ezk-en-energiebedrijven-rwe-en-uniper
        else:
            print("WARN: no document links (%s, %s) found on %r"%(
                entry_metadata['response_document_url'], 
                len(entry_metadata['attachments']),
                url,
                )) # there are just a few cases like this, that's probably fine

    # store the entry's metadata into a store
    woo_metadata.put( url, entry_metadata )
    

In [None]:
# currently ~5000 items, might take a few hours
wetsuite.datacollect.rijksoverheid_nl_documenten.scrape_pagination(
    doctype='Woo-besluit',
    from_date=datetime.date(year=2019, month=1, day=1), # we know this store currently contains nothing before then
    to_date  =datetime.date.today(), 
    detail_page_callback=handle_woobesluit_detail_page, 
    debug=True)

In [6]:
woo_metadata.summary(get_num_items=True)

{'size_bytes': 6496256,
 'size_readable': '6.2MiB',
 'num_items': 5116,
 'avgsize_bytes': 1270,
 'avgsize_readable': '1.2KiB'}

In [10]:
# what does that information look like at this stage?
random.sample( list( woo_metadata.items() ), 2 ) # explicit list() because random.sample wants a sequence

[('https://www.rijksoverheid.nl/documenten/woo-besluiten/2023/08/17/besluit-op-woo-verzoek-over-aanvragen-cites-in-of-uitvoervergunningen-voor-marokkaanse-stekelige-staarthagedis',
  {'detail_page_url': 'https://www.rijksoverheid.nl/documenten/woo-besluiten/2023/08/17/besluit-op-woo-verzoek-over-aanvragen-cites-in-of-uitvoervergunningen-voor-marokkaanse-stekelige-staarthagedis',
   'title': 'Besluit op Woo-verzoek over aanvragen CITES in- of uitvoervergunningen voor Marokkaanse stekelige staarthagedis',
   'response_document_url': 'https://open.overheid.nl/documenten/2131fdec-390c-419c-acfe-c463adf7e990/file',
   'attachments': [],
   'dates': ['2023-08-17'],
   'onderwerpen': ['Herziening Woo-besluit over aanvragen CITES in- of uitvoervergunningen Marokkaanse stekelige staarthagedis',
    'Natuur en biodiversiteit'],
   'responsible': 'Ministerie van Landbouw, Natuur en Voedselkwaliteit'}),
 ('https://www.rijksoverheid.nl/documenten/woo-besluiten/2023/09/04/besluit-op-woo-verzoek-over

#### Documents and their text contents

The above stored each case's metadata; now let's look at the documents it links to.

The response PDFs can be relatively large, and contain a lot of images of text.
Giving you those PDFs as-is would lead to a gigabytes-large dataset, that wouldn't compress well.

Interesting to some, certainly, yet let's assume for a moment that our research interest 
is just the arguments made about releasing the data.

This is a simpler start, in that it should be manageable to wrangle into plain text.
It should be relatively little work because, unlike the requested data, the decision text is likely in PDFs with extractable text.
(Even _if_ they are part of a larger franken-PDF, that part should be moderately easy to find).
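
One small cleanup such text tends to need is removing page-number lines like "Pagina 3 van 12". As a sketch of that step in isolation, using the same pattern as the extraction cell in this notebook: the function name `strip_page_numbers` and the example page texts are made up, and real extracted text will be messier.

```python
import re

# a newline, then 'Pagina' or 'Paginanummer', a number, and optionally 'van <number>'
PAGE_NUMBER_RE = re.compile(r'\n[Pp]agina(?:nummer)?\s+[0-9]+(?:\s+van\s[0-9]+)?')

def strip_page_numbers(page_text):
    """Replace page-number lines with a single space (hypothetical helper)."""
    return PAGE_NUMBER_RE.sub(' ', page_text)

# illustrative page texts, as a PDF text extractor might report them per page
pages = [
    'Besluit\nGeachte heer, mevrouw,\nPagina 1 van 2',
    'Hoogachtend,\npagina 2 van 2',
]
cleaned = [strip_page_numbers(page) for page in pages]
all_text = '\n'.join(cleaned)
print(repr(all_text))
```

The pattern does not check that the match sits in a footer, so a body line that happens to start with "pagina 3" would be altered as well; that trade-off is accepted here for simplicity.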

In [None]:
#woo_linked_docs_txt.truncate() # wipe it whenever you change the extraction
# this should take roughly twenty(?) minutes for 5K items

# Go through those metadata items, and sift through the attachments, mostly PDFs, for what we need
for detail_page_url, entry_metadata in wetsuite.helpers.notebook.ProgressBar( woo_metadata.items(), description='cases' ):
    response_document_url = entry_metadata['response_document_url']

    # these are some known huge (500MByte to 1GByte) PDFs that we might as well not fetch or store
    if response_document_url in (
        'https://open.overheid.nl/documenten/3ab07a67-c8cd-4ef5-897c-a73cf9db36ae/file',
        'https://open.overheid.nl/documenten/c2299374-1f63-424a-aa16-28ece8f0cb43/file',
        'https://open.overheid.nl/documenten/6998c113-a776-4442-ba21-160285727e98/file',
        'https://open.overheid.nl/documenten/b02410c1-0286-4c72-81c0-d3081fc2954e/file',
        'https://open.overheid.nl/documenten/172a4167-5a6b-48b6-bb63-be4b0c284ad7/file',
    ):
        continue
 
    # when updating (you commented that truncate) we can do just the new documents quickly
    # (note: would still rediscover skippable problem cases each time)
    if response_document_url in woo_linked_docs_txt: 
        continue

    ### Check that the detail page actually did link to a response document  (arguably can/should be done earlier)        
    if response_document_url is None: # (no need here anymore?)
        print('SKIP CASE; detail page seems to have no response doc?   %s'%entry_metadata)
        # TODO: deal with the HTML variation for the cases that actually, there definitely is.
        continue

    # Fetch the document that we thought was the decision
    #   note this is sometimes dozens of megabytes - there are some 300-page documents in there, and one 970MByte, 2190-page PDF
    try:
        doc_bytes, came_from_cache = wetsuite.helpers.localdata.cached_fetch( woo_linked_docs, response_document_url, sleep_sec=5, timeout=60 )
    except ValueError as ve:
        print( 'SKIP CASE; response doc failed to fetch: %s for %r   on detail page %r'%(ve, response_document_url, detail_page_url ) )
        continue

    ### Check that it's a PDF
    if not doc_bytes.startswith(b'%PDF-'):
        print( "SKIP CASE; not sure what kind of response document %r is, first bytes are %r, on detail page %r"%(
            response_document_url, doc_bytes[:25], detail_page_url) )
        continue
    
    # Okay, we have a document and it's a PDF.
    # Extract page text as reported by the PDF itself   
    #   (no OCR necessary for **most** of the written responses - other contents are a mess in and of itself, that we luckily chose to not address yet)
    pages_text = list( wetsuite.extras.pdf.page_text( doc_bytes ) ) # (explicit list() because generator)

    # The smallest of cleanup:  try to remove the page number lines on each page  (which should only appear in the footer - not that we're testing for that)
    pages_temp = []
    for page_text in pages_text:
        ptemp = page_text
        ptemp = re.sub( r'\n[Pp]agina(?:nummer)?\s+[0-9]+(?:\s+van\s[0-9]+)?', ' ', page_text ) 
        #ptemp = re.sub( r'\n\s*[1-9][0-9]?}[\s\n]*\Z', ' ', ptemp, flags=re.M ) # seems a little too fuzzy without also using location
        pages_temp.append( ptemp )
    pages_text = pages_temp

    all_text = '\n'.join( pages_text )

    woo_linked_docs_txt.put(response_document_url, all_text)

In [11]:
for url, text in woo_linked_docs_txt.random_sample(2):
    print( url )
    print( repr(text)[:200] )
    print( )    

https://open.overheid.nl/documenten/ec4703fb-26cb-462e-be65-107fd9d0049f/file
'> Retouradres Postbus 90801 2509 LV Den Haag\nDirectie Wetgeving,\nBestuurlijke en Juridische\nAangelegenheden\nPostbus 90801\n2509 LV Den Haag\nParnassusplein 5\nT 070 333 44 44\nwww.rijksoverheid.n

https://open.overheid.nl/documenten/f6bc1a3d-3df4-4f83-a18c-5ab81b93ef0a/file
'\n \n \nDirectie Openbaarmaking \n \nDatum \n11 november 2024 \n \nOnze referentie \n \nBesluit \nDe gevraagde documenten zijn reeds (gedeeltelijk) openbaar. Op reeds openbare \ninformatie is de Woo 



In [12]:
# Count is roughly the number of cases:
woo_metadata.summary(get_num_items=True)

{'size_bytes': 6496256,
 'size_readable': '6.2MiB',
 'num_items': 5116,
 'avgsize_bytes': 1270,
 'avgsize_readable': '1.2KiB'}

In [13]:
# Count is roughly the number of decision document texts (that those cases link to) that we actually got:
woo_linked_docs_txt.summary(get_num_items=True)

{'size_bytes': 309526528,
 'size_readable': '295MiB',
 'num_items': 4899,
 'avgsize_bytes': 63182,
 'avgsize_readable': '62KiB'}
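
The two counts differ (5116 cases versus 4899 texts, so we have text for very roughly 96% of cases), which is expected given the skips in the extraction loop. To see which cases are affected, you could compare the two stores. The sketch below does that on plain dicts standing in for the metadata and text stores; `cases_without_text` and the example.com URLs are made up for illustration.

```python
def cases_without_text(metadata, texts):
    """Return the case keys whose response document has no stored text.

    metadata maps case URL -> dict with a 'response_document_url' entry,
    texts maps document URL -> extracted text (both dict-like).
    """
    missing = []
    for case_url, entry in metadata.items():
        doc_url = entry.get('response_document_url')
        if doc_url is None or doc_url not in texts:
            missing.append(case_url)
    return missing

# toy stand-ins for woo_metadata and woo_linked_docs_txt
metadata = {
    'https://example.com/case-1': {'response_document_url': 'https://example.com/doc-1'},
    'https://example.com/case-2': {'response_document_url': 'https://example.com/doc-2'},
    'https://example.com/case-3': {'response_document_url': None},
}
texts = {'https://example.com/doc-1': 'Besluit ...'}

missing = cases_without_text(metadata, texts)
print(missing)
```

Here case-2 is missing because its document was never turned into text, and case-3 because no response document was found on its detail page; in the real data those correspond to the different SKIP messages printed during extraction.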