<a href="https://colab.research.google.com/github/knobs-dials/wetsuite-datacollect/blob/main/koop_frbr.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Purpose of this notebook

Fetch and use the data provided at https://repository.overheid.nl/frbr/

**WARNING: WE WOULD RECOMMEND AGAINST RUNNING THIS YOURSELF, OR MAYBE AT ALL**
- Know its limitations.
- Know that it is what you want and nothing else.
- Know just how many files it is, and that it's more than most people will be interested in dealing with.
- Know that your ISP may block you; mine did (presumably suspecting I had been infected by a botnet or such).

<!-- -->

Since you cannot really fetch much more selectively than 'for just this year',
you may be better off doing more selective searches in other systems that point into the same data.
As of this writing, [POD](koop_pod.ipynb) seems to work.

For context, and as an alternative: it seems that much the same content is provided via KOOP's Bulk Uitlever Systeem (BUS),
an anonymous SFTP service at `bestanden.officielebekendmakingen.nl`.

It's a metric ton of files and data, and not quite geared to keeping up to date, but still a better option than this.

See also [koop_bulkuitleversysteem](koop_bulkuitleversysteem.ipynb) (TODO: finish and upload).
(It may be harder to automate, though, as 'anonymous SFTP' seems a creative configuration
and not a thing most SSH libraries understand.)


### Notes on the contents


#### File-folder nature

This repository looks a lot like a frontend over old-school file-and-folder data (particularly when you look at the contents of things like `cga`).

Note that in many repositories, the top level has a lot of pages, and the levels under it do not.

The code below suggests that this is always true. It is not: for example, in `officielepublicaties` there are more pages at the second level than at the first, which also breaks the progress bars.
The distinction is still useful to make, to be able to dive into the details,
which makes more sense when we don't cache the pagination.


##### Areas 
Distinct areas of documents include:
- `lokalebekendmakingen`
- `officielepublicaties`

- `vd` - Verdragenbank
- `sgd` - Staten-Generaal Digitaal

- `datacollecties` - (varied documents?)
- `samenwerkendecatalogi`
- `tuchtrecht`

- `cga` - ?
- `cvdr` (empty)

Perhaps most interesting are `lokalebekendmakingen` and `officielepublicaties`.




URLs look something like:
    https://repository.overheid.nl/frbr/sgd/1840/0000448441/

which are mentioned to be:

    https://repository.overheid.nl/frbr/[area]/[subarea]/[work]/

...but many areas seem to play relatively loose with what `[work]` means or groups, or even with what `[subarea]` is, e.g.
* https://repository.overheid.nl/frbr/officielepublicaties/ah-tk/19961997/ah-tk-19961997-100/1/html
  * seems to be `officielepublicaties/[subarea]/[year]/[identifier]/1/[filetype]`

* https://repository.overheid.nl/frbr/lokalebekendmakingen/0001e23e9094c1a01b15624d2a877073/1/html
  * seems to be `lokalebekendmakingen/[identifier]/1/[filetype]/`

* `cga/[folder]/1/[filetype]/`
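
Because the part after the area varies this much, it can be safer to split a URL into "area" plus "whatever follows" and interpret the rest per area. A minimal sketch of that, using only the standard library; the function name `split_frbr_url` is made up for this illustration and is not part of wetsuite:

```python
from urllib.parse import urlparse

def split_frbr_url(url):
    """Split an FRBR repository URL into (area, list of remaining path parts).

    Hypothetical helper: it only assumes the path starts with /frbr/[area]/
    and makes no claim about what the later parts mean.
    """
    parts = [part for part in urlparse(url).path.split('/') if part]
    if len(parts) < 2 or parts[0] != 'frbr':
        raise ValueError('does not look like an FRBR URL: %r' % url)
    return parts[1], parts[2:]

# The two example URLs mentioned above
for example in (
    'https://repository.overheid.nl/frbr/sgd/1840/0000448441/',
    'https://repository.overheid.nl/frbr/officielepublicaties/ah-tk/19961997/ah-tk-19961997-100/1/html',
):
    area, rest = split_frbr_url(example)
    print(area, rest)
```

Unlike indexing into `url.split('/')`, this does not depend on the scheme and host taking up a fixed number of parts, and it ignores a trailing slash.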



#### Other notes

There are other systems / searches that point to resources stored here

(TODO: mention which )

## Fetching the data

Such a browse-only interface does not allow search, 
and even complete fetches are a relatively manual task.

In [2]:
import time
import pprint
import collections
import random
from urllib.parse import urljoin

import requests
import bs4

import wetsuite.helpers.localdata
import wetsuite.helpers.net
import wetsuite.helpers.koop_parse
import wetsuite.helpers.notebook
import wetsuite.datacollect.koop_repositories

### stores

In [5]:
frbr_fetched    = wetsuite.helpers.localdata.LocalKV('frbr_fetched.db',     key_type=str, value_type=bytes )
temp_fol        = wetsuite.helpers.localdata.LocalKV('frbr_fetched_fol.db', key_type=str, value_type=bytes )

# We can continue a scrape within a session by caching the intermediate pages, to avoid a lot of fetches.
# Probably don't store this long-term, because it is volatile state (caching it at all is already arguable)
temp_page_cache = wetsuite.helpers.localdata.LocalKV(':memory:', key_type=str, value_type=bytes )

In [4]:
## Summarize what we have fetched already
# could get amount of items, size per item by handing in True, but
display( frbr_fetched.summary() )

# summary per area; what we summarize in an area varies, 
# and is roughly why some conditions are necessary to make this part more useful:
types = collections.defaultdict( lambda: collections.defaultdict(int) )
count = 0
for url in frbr_fetched.keys():
    count += 1
    urlparts = url.split('/')
    area, subarea = urlparts[4:6]
    #print(area, subarea, l)
    if area in ('officielepublicaties',):
        types[area][subarea] +=1
    elif area in ('lokalebekendmakingen',): # e.g. fishes 'gmb' out of URL like https://repository.overheid.nl/frbr/lokalebekendmakingen/000005650c5a5a581c37ce17fcfc5207/1/html/gmb-2022-314336.html
        types[area][ urlparts[-1].split('-')[0] ] +=1
    elif area in ('tuchtrecht','sgd',):
        types[area][ subarea ] +=1 # actually year
    elif area in ('vd',):
        types[area][ urlparts[7] ] +=1 
    elif area in ('samenwerkendecatalogi',):
        types[area][ urlparts[7] ] +=1 
    elif area in ('cga',):
        types[area][ urlparts[5] ] +=1 
    elif area in ('datacollecties',):
        types[area][ subarea ] +=1 
    else:
        print(area, subarea, urlparts)

pprint.pprint( { k:dict(v) for k,v in types.items() } )  # the syntax-fu turns defaultdict nesting to just dicts, for slightly cleaner printing
print(f"{count} items in total")

{'size_bytes': 288593489920, 'size_readable': '269GiB'}

{'lokalebekendmakingen': {'bgr': 919,
                          'gmb': 308505,
                          'metadata.xml': 333773,
                          'prb': 8673,
                          'stcrt': 238,
                          'wsb': 8557},
 'officielepublicaties': {'ag': 204,
                          'ag-ek': 3896,
                          'ag-tk': 4596,
                          'ag-vv': 120,
                          'ah': 333,
                          'ah-ek': 1748,
                          'ah-tk': 334152,
                          'bgr': 40055,
                          'blg': 250370,
                          'gmb': 2849061,
                          'h': 20,
                          'h-ek': 37926,
                          'h-tk': 88110,
                          'h-vv': 208,
                          'kst': 563572,
                          'kv': 93974,
                          'kv-ek': 460,
                          'kv-tk': 153632,
                          'nds

### helpers

In [4]:
def simpler_with_progressbar(seed_page_url, estimated_pages=0, progress_prefix='', verbose=0):
    ''' Use FRBRFetcher to recursively fetch a set of folders in one go, with a progress bar '''
    fetcher = wetsuite.datacollect.koop_repositories.FRBRFetcher( frbr_fetched, temp_fol, verbose=verbose, waittime_sec=1.0 )
    # CONSIDER: using estimated_pages for 'add a bunch of page seeds'
    fetcher.add_page( seed_page_url )
    pb = wetsuite.helpers.notebook.progress_bar(estimated_pages)
    for _ in fetcher.work():
        pb.value = fetcher.count_pages
        pb.description = '%s fetches:%d / cached:%d  -  items:%d  folders:%d (skipped: %d)  pages:%d '%(
            progress_prefix,
            fetcher.count_fetches, fetcher.count_cacheds,    fetcher.count_items, fetcher.count_folders,fetcher.count_skipped, fetcher.count_pages )

## Fetch cga

Not entirely sure what this is intended to store, 
but it's tiny and so a good test of the fetching code.

In [None]:
simpler_with_progressbar( 'https://repository.overheid.nl/frbr/cga?start=1', 6 )

## Fetch lokalebekendmakingen


The [lokalebekendmakingen](https://repository.overheid.nl/frbr/lokalebekendmakingen) section
contains mostly `gmb`, `prb`, and `wsb` items (gemeente, provincie, waterschap).

These items are probably also linked from elsewhere (e.g. previously KOOP SRU searches - see a different notebook).

Browsing here mostly presents many pages of identifiers, e.g.

    0000900e8da1023017290084212dad28/
        1/
            html/
                gmb-2023-521695.html
            metadata/
                metadata.xml
            (...each folder-like thing at this level has one file)
        (...you can download a zip with the contents at this level)

HOWEVER, be aware this is on the order of 100000 pages, and the number of fetches will be a few times that.
Don't run this unless you really know you want to.
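
The file names in the layout above carry the publication type and year, which is handy when summarizing what was fetched. A minimal sketch that pulls those out; the pattern is only inferred from the `gmb-2023-521695.html` example, and `parse_item_filename` is a made-up name, so treat both as assumptions:

```python
import re

# Assumed shape, based on the single example above: type-year-number.html
ITEM_NAME = re.compile(r'^(?P<type>[a-z]+)-(?P<year>\d{4})-(?P<number>\d+)\.html$')

def parse_item_filename(name):
    """Return (type, year, number) for names like 'gmb-2023-521695.html', else None."""
    match = ITEM_NAME.match(name)
    if match is None:
        return None  # e.g. metadata.xml, or a naming scheme this sketch does not cover
    return match.group('type'), int(match.group('year')), int(match.group('number'))

print(parse_item_filename('gmb-2023-521695.html'))
print(parse_item_filename('metadata.xml'))
```

Returning `None` for anything unexpected keeps the caller in charge of deciding what to do with names that do not fit the assumed shape.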

In [None]:
fetcher = wetsuite.datacollect.koop_repositories.FRBRFetcher( frbr_fetched, temp_fol, verbose=0 )

#fetcher.add_page( 'https://repository.overheid.nl/frbr/lokalebekendmakingen?start=1' )
for i in range(2000):
    fetcher.add_page( 'https://repository.overheid.nl/frbr/lokalebekendmakingen?start=%d'%random.randint(100,108000) )

# the progress bar is one way to get constant feedback, while avoiding a load of output in a cell
pb = wetsuite.helpers.notebook.progress_bar( 110000 )
for _ in fetcher.work():
    pb.value = fetcher.count_pages
    pb.description = 'fetches:%d / cached:%d  -  items:%d  folders:%d (skipped: %d)  pages:%d '%(
        fetcher.count_fetches, fetcher.count_cacheds,    fetcher.count_items, fetcher.count_folders,fetcher.count_skipped, fetcher.count_pages )
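
The random seeding in the cell above can draw the same `start` offset more than once. Whether `FRBRFetcher` deduplicates such seeds is not checked here; a sketch that avoids the question by sampling without replacement (the helper name `sample_page_urls` and its parameters are made up for this illustration):

```python
import random

def sample_page_urls(base_url, lowest, highest, count, seed=None):
    """Return `count` distinct '?start=N' URLs with lowest <= N <= highest.

    Hypothetical helper; passing a seed makes the selection repeatable.
    """
    rng = random.Random(seed)
    offsets = rng.sample(range(lowest, highest + 1), count)  # no repeats
    return ['%s?start=%d' % (base_url, offset) for offset in offsets]

page_urls = sample_page_urls(
    'https://repository.overheid.nl/frbr/lokalebekendmakingen', 100, 108000, 2000, seed=0 )
print(len(page_urls), len(set(page_urls)))
```

Each of these URLs could then be handed to `fetcher.add_page()` in place of the `random.randint` loop.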

## Fetch officielepublicaties


The [officielepublicaties](https://repository.overheid.nl/frbr/officielepublicaties) section is split up into subareas.

The subareas are a mix of parliamentary and other publications.

It may help us read the list to split off the non-parliament ones...
 - `gmb`   (Gemeenteblad,     approx 300K per year)
 - `stcrt` (Staatscourant,    approx 70K per year)
 - `wsb`   (Waterschapsblad,  approx 13K per year)
 - `prb`   (Provinciaal blad, approx 8000 per year)
 - `bgr`   (Blad gemeenschappelijke regeling, approx 1000 per year)
 - `stb`   (Staatsblad,       approx 500 per year)
 - `trb`   (Tractatenblad,    approx 200 per year)
 
...from the parliament ones:
 - `ag`    (agenda)
 - `ag-ek` (...agenda eerste kamer)
 - `ag-tk` (...agenda tweede kamer)
 - `ag-vv` (...agenda verenigde vergadering)
 - `ah`    ('Aanhangsel van de Handelingen')
 - `ah-ek`
 - `ah-tk`
 - `blg`   ('Bijlage')
 - `h`     ('Handelingen')
 - `h-ek`
 - `h-tk`
 - `h-vv`
 - `kv`    (Kamervragen (zonder antwoord))
 - `kv-ek`
 - `kv-tk`
 - `kst`   (kamerstuk)
 - `nds`   (niet-dossierstuk)
 - `nds-tk`


Each of these subareas has nesting that seems ad hoc, though it typically looks like:

    ah-tk/                       (subarea)
        19941995/                (in most but not all subareas of officielepublicaties, this layer is the year; for dossier it seems the vergaderjaar)
            ah-tk-19941995-100/  (identifier)
                1/
                    html/
                        ...each
                    metadata/
                        ...with
                    metadataowms/
                        ...one
                    pdf/
                        ...file
                    xml/
                        ...in it,
                    odt/
                        ...which ones
                    coordinaten/
                        ...varies
                ...though you can download a zip of all such files at this level


Keep in mind that the available formats vary per case, though there are strong patterns per set.
For example, it seems that `blg`, aside from metadata and metadataowms, only has pdf? (VERIFY)
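
One way to check a claim like the `blg` one is to tally the format folder per subarea over the URLs already in the store. A minimal sketch, assuming stored keys follow the `subarea/year/identifier/1/format/filename` nesting shown above; the function name is made up, and the sample URLs are invented purely to exercise the code, they are not real items:

```python
import collections
from urllib.parse import urlparse

def formats_per_subarea(urls):
    """Count format folders (html, pdf, metadata, ...) per officielepublicaties subarea."""
    tally = collections.defaultdict(collections.Counter)
    for url in urls:
        parts = [part for part in urlparse(url).path.split('/') if part]
        # assumed: frbr / officielepublicaties / subarea / year / identifier / 1 / format / filename
        if len(parts) < 8 or parts[:2] != ['frbr', 'officielepublicaties']:
            continue  # other areas, or shallower URLs, are out of scope for this sketch
        tally[parts[2]][parts[6]] += 1
    return {subarea: dict(counts) for subarea, counts in tally.items()}

base = 'https://repository.overheid.nl/frbr/'
sample_urls = [  # invented identifiers, for illustration only
    base + 'officielepublicaties/blg/2020/blg-000001/1/pdf/blg-000001.pdf',
    base + 'officielepublicaties/blg/2020/blg-000001/1/metadata/metadata.xml',
    base + 'officielepublicaties/blg/2020/blg-000002/1/pdf/blg-000002.pdf',
    base + 'officielepublicaties/ah-tk/19961997/ah-tk-19961997-100/1/html/ah-tk-19961997-100.html',
    base + 'lokalebekendmakingen/0000900e8da1023017290084212dad28/1/html/gmb-2023-521695.html',
]
sample_tally = formats_per_subarea(sample_urls)
print(sample_tally)
```

With the real store, `formats_per_subarea( frbr_fetched.keys() )` would give the same kind of tally for everything fetched so far, subject to the nesting assumption holding for that subarea.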

In [None]:
#for subarea in ('ag', 'ag-ek', 'ag-tk', 'ag-vv',):  # agenda
#    simpler_with_progressbar( 'https://repository.overheid.nl/frbr/officielepublicaties/%s?start=1'%subarea, verbose=0 )

In [None]:
#for subarea in ('ah', 'ah-ek', 'ah-tk',):
#    simpler_with_progressbar( 'https://repository.overheid.nl/frbr/officielepublicaties/%s?start=1'%subarea )

In [None]:
#for subarea in ('bgr',):
#    simpler_with_progressbar( 'https://repository.overheid.nl/frbr/officielepublicaties/%s?start=1'%subarea )

In [None]:
for subarea in ('blg',):
    simpler_with_progressbar( 'https://repository.overheid.nl/frbr/officielepublicaties/%s?start=1'%subarea )

In [None]:
for subarea in ('gmb',):  # fairly large. Note you can do per-year by handing in something like gmb/2024
    simpler_with_progressbar( 'https://repository.overheid.nl/frbr/officielepublicaties/%s?start=1'%subarea )
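
The per-year variant mentioned in the comment above can be sketched as building one seed URL per year. This assumes that appending the year to the subarea path (as in `gmb/2024`) gives a browsable page, which is only what that comment suggests; `seed_url` is a made-up helper name:

```python
BASE = 'https://repository.overheid.nl/frbr/'

def seed_url(*path_parts, start=1):
    """Build a seed page URL such as .../officielepublicaties/gmb/2024?start=1"""
    return '%s%s?start=%d' % (BASE, '/'.join(path_parts), start)

# One seed per year, so that a single year can be fetched or re-run on its own
gmb_seeds = [seed_url('officielepublicaties', 'gmb', str(year)) for year in range(2022, 2025)]
for url in gmb_seeds:
    print(url)
```

Each resulting URL can be passed to `simpler_with_progressbar()` the same way as the subarea-wide URL.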

In [None]:
#for subarea in ('trb',):
#    simpler_with_progressbar( 'https://repository.overheid.nl/frbr/officielepublicaties/%s?start=1'%subarea )   

In [None]:
for subarea in ('wsb', 'prb',):
    simpler_with_progressbar( 'https://repository.overheid.nl/frbr/officielepublicaties/%s?start=1'%subarea )

In [None]:
for subarea in ('stb',):
    simpler_with_progressbar( 'https://repository.overheid.nl/frbr/officielepublicaties/%s?start=1'%subarea )

In [None]:
for subarea in ('stcrt',): # fairly large
    simpler_with_progressbar( 'https://repository.overheid.nl/frbr/officielepublicaties/%s?start=1'%subarea )

In [None]:
for subarea in ('kst',):
    simpler_with_progressbar( 'https://repository.overheid.nl/frbr/officielepublicaties/%s?start=1'%subarea )

In [None]:
for subarea in ('kv', 'kv-ek', 'kv-tk',):
    simpler_with_progressbar( 'https://repository.overheid.nl/frbr/officielepublicaties/%s?start=1'%subarea )

In [None]:
for subarea in ('nds', 'nds-tk',):
    simpler_with_progressbar( 'https://repository.overheid.nl/frbr/officielepublicaties/%s?start=1'%subarea, verbose=0 )

## Fetch sgd

In [None]:
fetcher = wetsuite.datacollect.koop_repositories.FRBRFetcher( frbr_fetched, temp_fol, verbose=0 )

fetcher.add_page( 'https://repository.overheid.nl/frbr/sgd?start=1')

pb = wetsuite.helpers.notebook.progress_bar(22)
for chunk in fetcher.work():
    pb.value = fetcher.count_pages
    pb.description = 'fetches:%d / cached:%d  -  items:%d  folders:%d (skipped: %d)  pages:%d '%(
        fetcher.count_fetches, fetcher.count_cacheds,    fetcher.count_items, fetcher.count_folders,fetcher.count_skipped, fetcher.count_pages )

## Fetch verdragenbank

In [None]:
fetcher = wetsuite.datacollect.koop_repositories.FRBRFetcher( frbr_fetched, temp_fol, verbose=0 )

fetcher.add_page( 'https://repository.overheid.nl/frbr/vd?start=1' )

pb = wetsuite.helpers.notebook.progress_bar(858)
for chunk in fetcher.work():
    pb.value = fetcher.count_pages
    pb.description = 'fetches:%d / cached:%d  -  items:%d  folders:%d (skipped: %d)  pages:%d '%(
        fetcher.count_fetches, fetcher.count_cacheds,    fetcher.count_items, fetcher.count_folders,fetcher.count_skipped, fetcher.count_pages )

## Fetch samenwerkendecatalogi

In [None]:
fetcher = wetsuite.datacollect.koop_repositories.FRBRFetcher( frbr_fetched, temp_fol, verbose=0 )

fetcher.add_page( 'https://repository.overheid.nl/frbr/samenwerkendecatalogi?start=1' )

pb = wetsuite.helpers.notebook.progress_bar(54111)
for chunk in fetcher.work():
    pb.value = fetcher.count_pages
    pb.description = 'fetches:%d / cached:%d  -  items:%d  folders:%d (skipped: %d)  pages:%d '%(
        fetcher.count_fetches, fetcher.count_cacheds,    fetcher.count_items, fetcher.count_folders,fetcher.count_skipped, fetcher.count_pages )