### Purpose of this notebook

Fetch data from internetconsultatie.nl and make it easier to consume as text.

Notes:
 * there are a few huge cases
 * the results of internetconsultaties sometimes get added to kamervragen
 * the responses may be anonymized in the metadata, but companies using signed PDFs with letterheads, or people just adding names to a text field, mean that this anonymization is not absolute

In [1]:
import re, time, datetime, collections, urllib.parse, random, pprint

import bs4

import wetsuite.datasets
from wetsuite.helpers import net, localdata, notebook
from wetsuite.extras import pdf#, ocr

In [2]:
# landing page for each case (/CASENAME), value is HTML page
case_landingdetail_store    = localdata.LocalKV('internetconsultaties_fetched.db',        key_type=str, value_type=bytes )    

# pagination (/CASENAME/reacties, /CASENAME/reacties/datum/PAGE), reaction page (/CASENAME/reactie/UUID),   value for both is HTML page
reacties_store              = localdata.LocalKV('internetconsultaties_reacties.db' ,      key_type=str, value_type=bytes )    

# linked documents (/CASENAME/document/NUMBER),  value is bytestring (all PDFs so far)
reacties_store_pdf          = localdata.LocalKV('internetconsultaties_reacties_pdf.db' ,  key_type=str, value_type=bytes )    

### Fetch list of cases (no reactions yet), via the pagination

We start at a URL like https://www.internetconsultatie.nl/geslotenconsultaties/1/10

We fetch 
- all pagination pages like it, based on the page link,
- each case's basic landing page - look like https://www.internetconsultatie.nl/klimaatmaatregelenfinancielesector

More notes after we're done with that.

---

This should take at most five or ten minutes when it comes mostly from your own cache,
and a bunch longer if all contents still need to be fetched (i.e. the first time).

We can't easily show a progress bar because we fetch the pagination and the cases in the same loop,
so we only discover how much there is along the way.
...though we _could_ extract a very good estimate from   `<p class="active">Er zijn in totaal 2566 consultaties`
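
A minimal sketch of that estimate, assuming the sentence keeps the form shown above. `estimate_total_cases` is a hypothetical helper (not part of wetsuite), and it expects the already-decoded HTML text:

```python
import re

# Assumption: the page contains text like 'Er zijn in totaal 2566 consultaties'
_RE_TOTAL = re.compile(r'Er zijn in totaal\s+([0-9][0-9.]*)\s+consultaties')

def estimate_total_cases(html_text):
    """ Return the total number of cases mentioned on a pagination page,
        or None if the sentence is not found.
    """
    match = _RE_TOTAL.search(html_text)
    if match is None:
        return None
    # A thousands separator (e.g. '2.566') is an assumption; strip it just in case
    return int(match.group(1).replace('.', ''))

example = '<p class="active">Er zijn in totaal 2566 consultaties</p>'
print(estimate_total_cases(example))
```

Dividing that total by `FETCH_PER_PAGE` (rounding up) would then give the expected number of pagination pages.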

In [3]:
# Control how chunked the pagination fetches are.  Options on the page are 10, 25, 100.
#    Arbitrary numbers seem to work, but this site is slow even with small numbers, and slower with larger numbers
FETCH_PER_PAGE = 100

# result pages URLs we have fetched, and have still to fetch, in this particular crawl
pagination_pages_to_fetch  = set()  # our TODO list
pagination_pages_fetched   = set()  # URLs of pagination pages we have fetched and handled, and shouldn't add to fetch again
case_landing_pages_fetched = set()  # URLs of case landing pages we have fetched

pagination_pages_to_fetch.add( f'https://www.internetconsultatie.nl/geslotenconsultaties/1/{FETCH_PER_PAGE}') # seed with the first page

while len(pagination_pages_to_fetch) > 0: # we keep adding pages on the way, and will eventually exhaust them
    fetching_page_url = pagination_pages_to_fetch.pop()
    print( f' ========== PAGE: {fetching_page_url} ============ ')

    # pagination pages are not cached, the pages will change with each new case
    pagebytes = net.download( fetching_page_url, timeout=30 ) # needs a moderately high timeout
    pagination_pages_fetched.add( fetching_page_url )


    # parse the HTML so we can find the cases and further pagination
    soup = bs4.BeautifulSoup(pagebytes, features='lxml')

    ### find all page links to closed cases, add them to the "to fetch" if we didn't before
    for page_a in soup.find_all('a', attrs={'href':re.compile(r'/geslotenconsultaties/([0-9]+)/([0-9]+)')}):
        abs_linked_page_url = urllib.parse.urljoin( fetching_page_url, page_a.get('href') )  # resolve href relative to the page it's on
        # don't add to fetch again if did already:
        if abs_linked_page_url not in pagination_pages_fetched:
            pagination_pages_to_fetch.add( abs_linked_page_url )

    ### extract all case links, fetch the HTML page they point to
    for case_li in soup.select("div[class*='result--list'] > ul > li"): # CSS-style selector is succinct here
        case_a = case_li.find('a', attrs={'class':re.compile(r'\bresult--title\b')})
        abs_caselanding_url = urllib.parse.urljoin( fetching_page_url, case_a.get('href') )
        print( '      ', abs_caselanding_url )
        
        _, fromstore = localdata.cached_fetch( case_landingdetail_store, abs_caselanding_url ) # fetch detail page into store
        case_landing_pages_fetched.add( abs_caselanding_url )
        if not fromstore: # if we fetched it?
            time.sleep(1) # be somewhat nice to the server

       https://www.internetconsultatie.nl/slimmerstraffen/b1
       https://www.internetconsultatie.nl/antislapp/b1
       https://www.internetconsultatie.nl/terugwerkendekracht/b1
       https://www.internetconsultatie.nl/natuurinclusiefisoleren/b1
       https://www.internetconsultatie.nl/besluit_vrijstelling_rijbewijs_c/b1
       https://www.internetconsultatie.nl/paspoortuitvoeringsregelingenjanuari2025/b1
       https://www.internetconsultatie.nl/alcoholwet/b1
       https://www.internetconsultatie.nl/verzamelwetklimaatengroenegroei/b1
       https://www.internetconsultatie.nl/reparatiekadasterwet/b1
       https://www.internetconsultatie.nl/eerste_wijzigingsregeling_ienw_omgevingsregeling/b1
       https://www.internetconsultatie.nl/rbk2022_zwelklei_schuimglas/b1
       https://www.internetconsultatie.nl/anvs_vergunningenbeleid_2025/b1
       https://www.internetconsultatie.nl/pgb/b1
       https://www.internetconsultatie.nl/vezelteelten/b1
       https://www.internetconsultatie.

In [4]:
len(case_landing_pages_fetched)

2847

### Fetch details of each case

For each case's basic detail page, go fetch the actual responses.

#### notes: each reaction

While we get to this only later, it helps to already know that each reaction is on its own page, linked like:
- `/CASENAME/reactie/UUID`
Each reaction can
- have its actual text as plain text on that HTML page,         e.g.  https://www.internetconsultatie.nl/vuurwerkverbod/reactie/397bc781-c36e-4aa4-9cc1-0813643828ad
- have its actual text be in a PDF linked from that HTML page,  e.g.  https://www.internetconsultatie.nl/vuurwerkverbod/reactie/3fdd34b2-8a29-48eb-a7eb-e40506fce114

#### notes: case landing-detail-page

You get into a case (what we here call landing/detail page) like: https://www.internetconsultatie.nl/vuurwerkverbod/b1
- TODO: check what the /b1 is; it's not part of the identifier, in that leaving it off redirects to add that /b1, and other links (e.g. /CASENAME/reacties) omit it
- TODO: check variation of landing page detail - it seems different templates were used over time, and the sections and table are not highly controlled to start with

It contains some potentially interesting information, such as
- topic labels (span with `label--large` class)
- introduction text
- start date, end date
- organisation
- type (e.g. AMvB, Wet)
- topics (which are also searchable)
- relevant documents (PDFs) (`/CASENAME/document/NUMBER`)
- further documents (PDFs) (`/CASENAME/document/NUMBER`)
- various other things (e.g. WGK number)


There may be
- up to three reactions listed there, with such a UUID-URL.
- a button "Bekijk alle reacties" (a styled `<a>` with `id="mainContentPlaceHolder_alleReactiesHyperLink"`), to a URL like `https://www.internetconsultatie.nl/vuurwerkverbod/reacties`
- neither, e.g. https://www.internetconsultatie.nl/energylabelling

Because the button isn't always there, whether there are reacties or not (VERIFY), we should probably
- collect reactie UUID links from both the page and the /reacties
- always try `/reacties` (without the `/b1` if present), even if it's not linked within the content block on the right (it is mentioned at the left, 'Reacties op consultatie')
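
A minimal sketch of the second point, assuming the case name is always the first path segment (as in the URLs above). `reacties_url_for` is a hypothetical helper name:

```python
import urllib.parse

def reacties_url_for(case_landing_url):
    """ Given a case landing URL (with or without a trailing /b1),
        return that case's /reacties URL.
    """
    parts = urllib.parse.urlsplit(case_landing_url)
    # The case name is the first path segment; anything after it (like 'b1') is dropped
    case_name = parts.path.strip('/').split('/')[0]
    return urllib.parse.urlunsplit(
        (parts.scheme, parts.netloc, f'/{case_name}/reacties', '', '')
    )

print(reacties_url_for('https://www.internetconsultatie.nl/vuurwerkverbod/b1'))
```

This avoids depending on whether the "Bekijk alle reacties" button happens to be present on the landing page.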

#### notes: /reacties and its pagination

A `/reacties` page like `https://www.internetconsultatie.nl/vuurwerkverbod/reacties` is a list of reactions, 
each of which has its own page, linked via a UUID, like `/vuurwerkverbod/reactie/5b1da716-e920-4cc3-81e4-0695725c3b04`.

<!-- -->

(Only) if there are more than 10 reactions, this has pagination links.
The pagination links look like either `/CASENAME/reacties/datum/PAGENUM` or `/CASENAME/reacties/naam/PAGENUM`;
the default sorting is by date so we can look for only `/CASENAME/reacties/datum/PAGENUM`.
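
As a sketch of how those two link forms could be told apart (assuming the hrefs end exactly as described above; `parse_pagination_href` is a hypothetical helper):

```python
import re

# Assumption: pagination hrefs end in /reacties/datum/N or /reacties/naam/N
_RE_PAGINATION = re.compile(
    r'/(?P<case>[^/]+)/reacties/(?P<order>datum|naam)/(?P<page>[0-9]+)$'
)

def parse_pagination_href(href):
    """ Return (case name, sort order, page number) for a reactie pagination link,
        or None for any other link.
    """
    match = _RE_PAGINATION.search(href)
    if match is None:
        return None
    return match.group('case'), match.group('order'), int(match.group('page'))

hrefs = [
    '/vuurwerkverbod/reacties/datum/2',
    '/vuurwerkverbod/reacties/naam/2',
    '/vuurwerkverbod/reacties',
]
parsed = [parse_pagination_href(href) for href in hrefs]
# Keep only the by-date pagination, so each page of reactions is visited once
by_date = [p for p in parsed if p is not None and p[1] == 'datum']
print(by_date)
```

Following only the `datum` variant means the same reactions are not fetched a second time in name order.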

<!-- -->

Note that we can probably cache the reactie pagination,
if we don't mind baking in the probably-fair assumption that closed cases will not change anymore.

#### Fetching all reacties (by going through the pagination)

...or, turning those observations into code.

In [None]:
# Note that while in theory we could remember what pages we got from what page, 
# all reactions and documents can be traced back to a case via the name in its URL.

fetchtype_counter = collections.defaultdict(int)
fromcache_counter = collections.defaultdict(int)

# To fish out the (relative) URLs on the page  (these match within href attributes)
#_RE_REACTIES_LIST             = re.compile( r'.*/reacties$' )                                             # all-reactions link, e.g. "Bekijk alle reacties" button
_RE_REACTIES_PAGINATION_DATUM = re.compile( r'.*/reacties/datum/[0-9]+$' )                # "page X when sorting by date" (which is the default; the other is name)
_RE_REACTIE_UUID              = re.compile( r'.*/reactie/[0-9A-Fa-f]{8}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{12}$' ) # link to specific reactie


verbose = 0

def fetch_case(case_landingpage_url):
    """ Starting at the case's basic detail page, fetch all the parts.
        (this is often little work, but sometimes a lot)
    """
    global pb

    if verbose:
        print( f'CASE: {case_landingpage_url}' )
    case_landingpage_bytedata, from_cache = localdata.cached_fetch( case_landingdetail_store, case_landingpage_url ) # should all be cached due to the above; could also be a .get
    fromcache_counter[ from_cache ] += 1
    case_landingpage_soup                 = bs4.BeautifulSoup(case_landingpage_bytedata, features='lxml')

    reactiepages_to_fetch = set() # set of URLs we saw but haven't stored;  we also test against the store that stores things we have
    reactiepages_fetched  = set() # reactie pagination URLs we have already fetched during this call

    # The detail page may link to some reactions
    # - in other cases the pagination has the complete list and this will be redundant with that
    # - in some cases this is the complete list (if it's a few) and there is no separate pagination of reactions.
    # To be sure, we add from both sources.
    for reactie_a in case_landingpage_soup.find_all('a', attrs={'href':_RE_REACTIE_UUID}):
        reactie_abs_url = urllib.parse.urljoin( case_landingpage_url, reactie_a.get('href'))
        #print('   CASEPAGE REACTION', reactie_abs_url)
        _, from_cache = localdata.cached_fetch( reacties_store, reactie_abs_url )
        fromcache_counter[from_cache] += 1


    # If there is a link (button at the bottom, or item in the menu on the left)
    #   that seems to view/paginate the reactions, we used to follow that link
    #   (and if there was none, assume that the reactions above were all of them):
    #for reactie_pages_a in case_page_soup.find_all('a', attrs={'href':_RE_REACTIES_LIST}):
    #    reactie_page_abshref = urllib.parse.urljoin( case_page_url, reactie_pages_a.get('href'))
    #    if reactie_page_abshref not in reactiepages_to_fetch:
    #        reactiepages_to_fetch.add( reactie_page_abshref )
    # CHANGED TO just unconditionally looking there
    reactiepages_to_fetch.add( 'https://www.internetconsultatie.nl/%s/reacties'%case_landingpage_url.split('/')[3] )

    for reactie_pages_a in case_landingpage_soup.find_all('a', attrs={'href':_RE_REACTIES_PAGINATION_DATUM}):
        reactie_page_abshref = urllib.parse.urljoin( case_landingpage_url, reactie_pages_a.get('href'))
        if reactie_page_abshref not in reactiepages_to_fetch:
            reactiepages_to_fetch.add( reactie_page_abshref )

    # this "do pagination until we're done" logic is similar to the one we did to discover the case list
    while len(reactiepages_to_fetch)>0: # while there is reactie paging we haven't visited yet
        reactiepage_url = reactiepages_to_fetch.pop()
        if verbose:
            print( f"   REACTIE PAGING: {reactiepage_url}")
        reactiepaging_bytedata, from_cache = localdata.cached_fetch( reacties_store, reactiepage_url )
        fetchtype_counter[ 'paging'   ] += 1
        fromcache_counter[ from_cache ] += 1

        reactiepages_fetched.add( reactiepage_url )

        # Process the pagination page
        reactiepaging_soup = bs4.BeautifulSoup(reactiepaging_bytedata, features='lxml')

        # - add previously unseen pagination page links
        for reactie_pages_a in reactiepaging_soup.find_all('a', attrs={'href':_RE_REACTIES_PAGINATION_DATUM}):
            reactie_page_abshref = urllib.parse.urljoin( case_landingpage_url, reactie_pages_a.get('href'))
            if reactie_page_abshref not in reactiepages_fetched:#  and   reactie_page_abshref not in reactiepage_store:
                reactiepages_to_fetch.add( reactie_page_abshref )

        # - find links to actual reactie details, and fetch those
        for reactie_a in reactiepaging_soup.find_all('a', attrs={'href':_RE_REACTIE_UUID}):
            reactie_abs_url = urllib.parse.urljoin( case_landingpage_url, reactie_a.get('href'))

            reactie_bytes, from_cache = localdata.cached_fetch( reacties_store, reactie_abs_url ) # also do above
            fetchtype_counter[ 'reactie'  ] += 1
            fromcache_counter[ from_cache ] += 1

            # if that reactie page mentions a document (probably PDF) attachment, fetch that too (because it, rather than the page, will include the actual response text)
            reactie_soup     = bs4.BeautifulSoup( reactie_bytes, features='lxml' )
            for pdf_a in reactie_soup.select( "#content ul[class*='result--actions'] a[class*='icon--download']" ):
                pdf_abs_url = urllib.parse.urljoin( reactie_abs_url, pdf_a.get('href'))
                _, from_cache = localdata.cached_fetch( reacties_store_pdf, pdf_abs_url )
                fetchtype_counter[ 'document' ] += 1
                fromcache_counter[ from_cache ] += 1

            # update bar after every reactie
            cntstr = ' '.join("%s:%s"%(k,v) for k,v in fetchtype_counter.items())
            pb.description = f'cached:{fromcache_counter[True]}/fetched:{fromcache_counter[False]} -- {cntstr} -- cases:' 
    pb.value +=1


pb = notebook.progress_bar( len(case_landingdetail_store) ) # (assumes that is indeed the only content of that store)

for case_landingpage_url in sorted( case_landingdetail_store.keys() ):
    try:
        fetch_case( case_landingpage_url )
    except Exception as e:
        print(f'ERROR {str(e)} for {repr(case_landingpage_url)}')
        raise

### Process into something more useful

We can tell what it is by the URL, 
which means that we didn't really need to save into separate stores,
though it does make the following code slightly more _readable_.
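
To illustrate "telling what it is by the URL", here is a sketch that classifies a URL by its path. The patterns are assumptions based on the URL shapes noted earlier in this notebook, and `url_kind` is a hypothetical helper, not something the code below relies on:

```python
import re
import urllib.parse

_UUID = r'[0-9A-Fa-f]{8}(?:-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}'

# (kind, pattern matched against the URL path); each pattern is anchored on both ends
_KINDS = [
    ('reactie_bestand', re.compile(r'^/[^/]+/reactie/' + _UUID + r'/bestand$')),
    ('reactie',         re.compile(r'^/[^/]+/reactie/' + _UUID + r'$')),
    ('pagination',      re.compile(r'^/[^/]+/reacties/(?:datum|naam)/[0-9]+$')),
    ('reacties',        re.compile(r'^/[^/]+/reacties$')),
    ('document',        re.compile(r'^/[^/]+/document/[0-9]+$')),
    ('landing',         re.compile(r'^/[^/]+(?:/b1)?$')),
]

def url_kind(url):
    """ Return a short label for what kind of page a URL points to, or 'unknown'. """
    path = urllib.parse.urlsplit(url).path
    for kind, pattern in _KINDS:
        if pattern.match(path):
            return kind
    return 'unknown'

print(url_kind('https://www.internetconsultatie.nl/vuurwerkverbod/b1'))
print(url_kind('https://www.internetconsultatie.nl/vuurwerkverbod/reacties/datum/3'))
```

With something like this, a single store could hold everything and still be filtered by kind when reading it back.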

In [26]:
def url_to_idname( case_url ):
    ' Turn a case URL into a more digestible identifier, still likely unique, and hopefully filesystem-safe '
    t = case_url.split('/')[3] #  (unique is fairly easy, as the URL path starts as unique, basically by definition)
    # there are some percent-escaped characters and doublequote. Rather than deal with their encoding,
    # just treat the numbers as part of the identifier (probably helps it stay filesystem-safe too)
    t = t.replace('%','')
    return t

In [27]:
cases = collections.defaultdict( dict ) # one dict to collect all information

In [None]:
# This should take a minute, to fill the cases dict with the landing page's information. 
# It adds some as-of-yet empty elements, that the section after this one will fill

def parse_basic_details( case_url ):
    case_idname = url_to_idname(case_url)

    html_bytes = case_landingdetail_store.get( case_url )
    if b'Wegens technische problemen is de website tijdelijk niet beschikbaar' in html_bytes: # single case - and why didn't that 404 out while fetching?
        print("SKIP %r"%case_url)
        return

    if '/reacties/datum' in case_url:
        print("SKIP, pagination handed into parse_basic_details (assuming a problem in past fetching)")
    elif case_url.endswith('/reacties'):
        cases[case_idname]['reacties_pages_url'] = case_url
    else:
        cases[case_idname]['entry_page_url'] = case_url
        cases[case_idname]['details'] = {}

        soup = bs4.BeautifulSoup( html_bytes, features='lxml' ) # name the parser explicitly, as in the fetching code
        content = soup.find('div', attrs={'id':'content'})

        try:
            cases[case_idname]['details']['title'] = content.find('h1').text.strip()
        except AttributeError: # no content div, or no h1 in it
            print('NOTITLE', case_url)
            #print( content)

        cases[case_idname]['details']['labels'] = list( label_large.text.strip()  for label_large in content.find_all('span', attrs={'class':re.compile(r'\blabel--large\b')}) if len(label_large.text.strip()) > 0 )

        cases[case_idname]['details']['overview'] = {}
        overview_table = content.find('table', attrs={'class':re.compile(r'\btable__data-overview\b')}) 
        if overview_table is not None:
            for tr in overview_table.find_all('tr'):
                # it seems like all interesting rows are a th and td (and there are some other things without a th)
                th = tr.find('th')
                td = tr.find('td')
                if th is not None and td is not None:
                    cases[case_idname]['details']['overview'][ th.text.strip() ] = td.text.strip()

        cases[case_idname]['details']['paragraphs'] = {}
        for h2 in content.find_all('h2'): # see if there's an h2 with a paragraph directly after it
            sib = h2.find_next_sibling(True) # next sibling that is a tag (skips whitespace strings), or None if there is none
            if sib is not None and sib.name == 'p':
                text = ('\n'.join( sib.find_all(string=True) ) ).strip()
                cases[case_idname]['details']['paragraphs'][h2.text.strip()] = text # TODO: put back in

        cases[case_idname]['details']['documenten'] = []
        # the in-document structure is a little weird (double links for many things)
        for dr in content.find_all('div', attrs={'class':re.compile(r'\bresult--list\b')}):
            a = dr.find('a', attrs={'href':re.compile(r'\b/document/\b')}) # first only
            if a is None: # no document link in this block
                pass
                #print(dr)
            else:
                abs_href = urllib.parse.urljoin( case_url, a.get('href'))
                cases[case_idname]['details']['documenten'].append( (a.text.strip(), abs_href) )

        cases[case_idname]['reacties'] = []



for case_url in notebook.ProgressBar(case_landingdetail_store,   description="Parsing basic case details: "):
    parse_basic_details( case_url )

Parsing basic case details:   0%|          | 0/2771 [00:00<?, ?it/s]

SKIP, pagination handed into parse_basic_details (assuming a problem in past fetching)

## Now extract the text from reacties

Notes: 
- there are currently ~150K reacties
- Reactions with their text directly on the page are pretty fast: we can process ~100K items in maybe ten minutes.
- The other ~30K are PDF
  - The PDFs that report their own text will add maybe twenty minutes
  - the cases that require OCR are much slower, but also relatively rare.


### On anonymization

People had the option to react anonymously.

Yet if they did not, as far as I understand, we still have fewer rights to use people's names than the government site does.
Also, no text analysis really needs personal names. So we can give a decent effort to forget names.

Both text-reactie and link-to-PDF-reactie pages have a table on the webpage with the name. 
- PDF reactions often come from companies, where the name is the company name and a personal name.
- text reactions often come from people, where the name is either `Anoniem` or an actual name.


Personal reactions are easy enough to anonymize: just don't extract that name.

There are some cases where people still sign with their name in the text.
We could make more effort to remove those.

Company reactions are presumably less pressing, 
in that they _mean_ to put the weight of their name behind it.
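
As one possible extra effort for signed text reactions, here is a rough heuristic sketch that cuts a reaction at a closing phrase. The phrase list is an assumption and far from complete, `strip_signature` is a hypothetical helper, and the name in the example is made up. It will miss names mentioned elsewhere in the text, and it can remove legitimate text that follows a closing phrase.

```python
import re

# Assumption: a handful of common Dutch closing phrases; illustrative, not complete
_RE_CLOSING = re.compile(
    r'^\s*(met vriendelijke groet(en)?|vriendelijke groet(en)?|hoogachtend|mvg)\b',
    re.IGNORECASE,
)

def strip_signature(text):
    """ Cut the text at the last line that starts with a closing phrase,
        dropping that line and everything after it.
        Returns the text unchanged if no closing phrase is found.
    """
    lines = text.splitlines()
    # Search from the end, so we cut as little as possible
    for index in range(len(lines) - 1, -1, -1):
        if _RE_CLOSING.match(lines[index]):
            return '\n'.join(lines[:index]).rstrip()
    return text

example = 'Ik ben het eens met het voorstel.\n\nMet vriendelijke groet,\nJ. Jansen'
print(strip_signature(example))
```

Something like this could be applied to the `'plain'` texts before they are stored, though it would need checking against real reactions first.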

In [29]:
plain_count, pdfdoc_count = 0, 0

def add_reactie_text( reactie_url ):
    global plain_count, pdfdoc_count
    case_idname = url_to_idname(reactie_url)
    soup = bs4.BeautifulSoup( reacties_store.get( reactie_url ), features='lxml' ) # name the parser explicitly, as in the fetching code
    content = soup.find('div', attrs={'id':'content'})
    if content is None:
        print( 'SKIP NOCONTENT', reactie_url)
        return

    # Right now we are intentionally NOT picking up the name from the table 
    # because this is a case that likely needs to be anonymized
    # in the case of documents we might want to pick up the name temporarily _to look for it in that document._
    # TODO: place and date

    reactie_bestand = content.find('a', attrs={'href':re.compile(r'.*/[^/]+/reactie/[^/]+/bestand$')})
    if reactie_bestand is not None:
        if 0: # no PDF extraction until we figure out they're okay in terms of anonymization
            pdf_url = 'https://www.internetconsultatie.nl'+reactie_bestand.get('href')
            pdf_bytes = reacties_store_pdf.get( pdf_url )

            try:            
                txt = pdf.page_text( pdf_bytes )
                pages_txt = list( txt )
                all_pdf_text = ( '\n\n'.join( pages_txt ) ).strip()
                if len(all_pdf_text) > 0:
                    cases[case_idname]['reacties'].append( 
                        (reactie_url, 'pdf-text', all_pdf_text),
                    )
            except Exception as e:
                print( f'SKIP - {str(e)} for {repr(pdf_url)}' )

            #else:
            #    all_ocr_text = pdf.pdf_text_ocr( pdf_bytes )
            #    cases[case_idname]['reacties'].append( 
            #        (reactie_url, 'pdf-ocr', all_ocr_text),
            #    )
        pdfdoc_count   += 1
    else: # no linked PDF, look for text on the page
        #print(reactie_url)
        plain_count += 1

        # There can be multiple questions, each with their own answer.
        # this isn't overly common but still something we should deal with.

        blockquotes = content.find_all('blockquote')
        if len(blockquotes)==0:
            # there seem to be a small portion of reactions without any content. ...okay.
            #print( 'NO CONTENTS?', reactie_url )
            pass
        elif len(blockquotes) > 1:
            pass
            #print("TODO: Deal with me " + reactie_url)
        else:
            text = ('\n'.join( blockquotes[0].find_all(string=True) ) ).strip()

            cases[case_idname]['reacties'].append(  (reactie_url, 'plain', text)  )

#for reactie_url in notebook.ProgressBar( reacties_store.random_keys(1500) ):
#for url in notebook.ProgressBar( reacties_store ):
for reactie_url in notebook.ProgressBar( sorted( list( reacties_store ) ) ):
    if reactie_url.endswith('/reacties')  or  '/reacties/datum' in reactie_url:
        pass # reaction pagination
    else:
        if '/reactie/' not in reactie_url:
            print( 'TODO FIX -- '+reactie_url )
            continue
        add_reactie_text( reactie_url )

plain_count, pdfdoc_count

  0%|          | 0/149103 [00:00<?, ?it/s]

SKIP NOCONTENT https://www.internetconsultatie.nl/klimaatplan/reactie/04840f57-f0e4-4b9e-b8ce-f92e90729e81
SKIP NOCONTENT https://www.internetconsultatie.nl/klimaatplan/reactie/171536db-1a88-4ea4-8db7-1861fe479067
SKIP NOCONTENT https://www.internetconsultatie.nl/klimaatplan/reactie/248842c6-3b23-41df-a281-0c3ea46e62e9
SKIP NOCONTENT https://www.internetconsultatie.nl/klimaatplan/reactie/40158712-c7cf-47a7-8c8d-9c6708ecfa39
SKIP NOCONTENT https://www.internetconsultatie.nl/klimaatplan/reactie/4b86fc84-a93a-486d-8cdc-3df947889d07
SKIP NOCONTENT https://www.internetconsultatie.nl/klimaatplan/reactie/5a4f2ab0-8ef4-4967-9a11-b6eefff3169c
SKIP NOCONTENT https://www.internetconsultatie.nl/klimaatplan/reactie/f16e33e4-327f-4ba6-b9fd-edea6c72c7d2
SKIP NOCONTENT https://www.internetconsultatie.nl/knelpuntenregelingschadeafhandelinggroningen/reactie/1b4d3ede-6204-4df7-9ef0-19b0cb942f23
SKIP NOCONTENT https://www.internetconsultatie.nl/knelpuntenregelingschadeafhandelinggroningen/reactie/2dbb3322

(104382, 29662)

In [78]:
# Show which keys randomly sampled cases ended up with, as a basic test
for case_id, details in random.sample( list( cases.items() ), 1000 ):
    print( details.keys() )
    #'entry_page_url', 'details', 'reacties

dict_keys(['entry_page_url', 'details', 'reacties'])

In [76]:
# Display a random case's reaction, as a basic test
for case_id, details in random.sample( list( cases.items() ), 1000 ): # case_id rather than id, to avoid shadowing the builtin
    if 'reacties' in details and len(details['reacties'])>5:
        pprint.pprint( details['reacties'])
        break 
    #pprint.pprint( details )
    #break

[('https://www.internetconsultatie.nl/kleineklassen/reactie/01489ba0-d8e5-47a2-bc0d-097bd65e4a95',
  'plain',
  'Goed voorstel om goed onderwijs te kunnen waarborgen bij kleinere klassen '
  'en meer zicht en inzicht in leerprocessen van kinderen en meer overzicht en '
  'motivatie bij de lerkrachten. En de ouders kunnen ervaren dat de '
  'leerkrachten ook meer te vertellen hebben over hun kind wua prestaties, '
  'leren in groepjes en optrekken met leeftijdgenoten, plezirig en soms '
  'gepest.'),
 ('https://www.internetconsultatie.nl/kleineklassen/reactie/02055ecc-8ae9-4e6d-90a5-be357db8c4c7',
  'plain',
  'Uitstekend initiatief om klassen te verkleinen en daardoor het onderwijs te '
  'verbeteren en de werkdruk voor leraren te verkleinen. Als net '
  'gepensioneerde leraar weet ik waar ik het over heb!'),
 ('https://www.internetconsultatie.nl/kleineklassen/reactie/03fd6572-f24d-40d9-a2f3-b9cd8abdfbaf',
  'plain',
  'Als we passend onderwijs willen geven zullen de klassen kleiner mo

In [77]:
# put all these cases into a store so we can make it a dataset
icp = localdata.MsgpackKV('internetconsultaties-partial-struc.db', key_type=str)
icp.truncate()

icp._put_meta('description_short', 'The text contents of a good portion of the answers at https://www.internetconsultatie.nl')
icp._put_meta('description','''
The text contents of a good portion of the answers at https://www.internetconsultatie.nl
The focus is the reactions themselves, but varied

More time should be spent on anonymizing this, 
and on dealing with the PDFs (where the anonymization is likely more relevant).

For example:                              
  {
    'entry_page_url': 'https://www.internetconsultatie.nl/registratieregeling/b1',
    'details': {'title': 'Wijziging Regeling periodieke registratie Wet BIG (orthopedagoog-generalist en klinisch technoloog)',
      'labels': ['Consultatie gesloten', 'Zorg en gezondheid'],
      'overview': {'Startdatum consultatie': '16-05-2024',
        'Einddatum consultatie': '13-06-2024',
        'Status': 'Gesloten',
        'Type consultatie': 'Ministeriële regeling',
        'Organisatie': 'Ministerie van Volksgezondheid, Welzijn en Sport',
        'Onderwerpen': 'Organisatie en beleid'},
      'paragraphs': {'In het kort': 'In 2020 zijn twee nieuwe beroepen opgenomen in artikel 3 van de Wet op de [...]'},
      'documenten': []},
    'reacties': [
      ('https://www.internetconsultatie.nl/registratieregeling/reactie/32214e32-91f8-42be-bb2e-d456a8453f66', 
       'plain',
       'goed! akkoord')]
  }              

'''+wetsuite.datasets.generated_today_text())

for case_id, details in cases.items():
    icp.put( case_id, details)

In [None]:
len(icp)

2757