<a href="https://colab.research.google.com/github/WetSuiteLeiden/data-collection/blob/master/tweede_kamer_part3_both_apis_kamerdossiers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Purpose of this notebook

See how we might work towards an actual dataset from the data at the [Tweede Kamer open data portal](https://opendata.tweedekamer.nl/).


We restrict ourselves to kamerstukdossiers and their documents.

Kamerstukken are documents exchanged between government and parliament,
and are organized into dossiers (a.k.a. kamerdossiers, kamerstukdossiers).

## How do numbers work?

One useful thing to grasp first is how the numbering/identifiers/references for dossiers and their contained documents work.

This turns out to be _interesting_ -- see the relevant notes in the [identifiers-and-references notebook (in the wetsuite-notebook repo)](https://github.com/WetSuiteLeiden/example-notebooks/blob/main/notes/notes__legal_identifiers_and_references.ipynb).

In [36]:
# various things we'll end up using
import pprint, collections, random, time

import wetsuite.helpers.localdata
import wetsuite.helpers.notebook
import wetsuite.helpers.etree      # used below for debug_pretty()
import wetsuite.datacollect.tweedekamer_nl

In [None]:
# we will eventually download some content
tkapi_docs = wetsuite.helpers.localdata.LocalKV('tkapi_docs.db', key_type=str, value_type=bytes)

## Kamerstukdossiers with the SyncFeed API

In [2]:
# Fetch all kamerstukdossiers.
# may take half a minute or so to fetch all, because that's ~30 fetches amounting to ~6MByte of XML.
ks_tree = wetsuite.datacollect.tweedekamer_nl.merge_etrees( wetsuite.datacollect.tweedekamer_nl.fetch_all( 'Kamerstukdossier' ) )

In [3]:
# Just the first entry:
print( wetsuite.helpers.etree.debug_pretty( ks_tree.find('entry') ))

# parse that into dicts, as before
ks_dicts = list( wetsuite.datacollect.tweedekamer_nl.entry_dicts( ks_tree ) )

pprint.pprint( ks_dicts[0] )

<entry>
  <title>1f031e16-cb3b-45b5-b3c9-a8abd27c913a</title>
  <id>https://gegevensmagazijn.tweedekamer.nl/SyncFeed/2.0/Entiteiten/1f031e16-cb3b-45b5-b3c9-a8abd27c913a</id>
  <author>
    <name>Tweede Kamer der Staten-Generaal</name>
  </author>
  <updated>2019-06-28T20:34:35Z</updated>
  <category term="kamerstukdossier"/>
  <link rel="next" href="https://gegevensmagazijn.tweedekamer.nl/SyncFeed/2.0/Feed?category=Kamerstukdossier&amp;skiptoken=53457"/>
  <content type="application/xml">
    <kamerstukdossier id="1f031e16-cb3b-45b5-b3c9-a8abd27c913a" bijgewerkt="2008-08-26T12:13:04.6270000" verwijderd="false">
      <titel>Wijziging van de Wet ruimtelijke ordening inzake de grondexploitatie</titel>
      <citeertitel nil="true"/>
      <alias nil="true"/>
      <nummer>30218</nummer>
      <toevoeging nil="true"/>
      <hoogsteVolgnummer>25</hoogsteVolgnummer>
      <afgesloten>false</afgesloten>
      <kamer>Tweede Kamer</kamer>
    </kamerstukdossier>
  </content>
</entry>

{'categ
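As a rough illustration of what that parsing step amounts to, the sketch below pulls the fields out of an entry like the one printed above, using only the standard library. It assumes the namespace-free XML shown in that output (the live feed may well carry XML namespaces), and `kamerstukdossier_fields` is a made-up helper name, not part of wetsuite.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the entry printed above (namespaces already stripped).
ENTRY_XML = """
<entry>
  <title>1f031e16-cb3b-45b5-b3c9-a8abd27c913a</title>
  <category term="kamerstukdossier"/>
  <content type="application/xml">
    <kamerstukdossier id="1f031e16-cb3b-45b5-b3c9-a8abd27c913a" bijgewerkt="2008-08-26T12:13:04.6270000" verwijderd="false">
      <titel>Wijziging van de Wet ruimtelijke ordening inzake de grondexploitatie</titel>
      <nummer>30218</nummer>
      <toevoeging nil="true"/>
      <hoogsteVolgnummer>25</hoogsteVolgnummer>
      <afgesloten>false</afgesloten>
      <kamer>Tweede Kamer</kamer>
    </kamerstukdossier>
  </content>
</entry>
"""

def kamerstukdossier_fields(entry):
    """Hypothetical helper: flatten one <entry> into a dict of strings (None for nil elements)."""
    node = entry.find('content/kamerstukdossier')
    fields = dict(node.attrib)                # id, bijgewerkt, verwijderd
    for child in node:
        is_nil = child.get('nil') == 'true'   # nil="true" marks an absent value
        fields[child.tag] = None if is_nil else child.text
    return fields

fields = kamerstukdossier_fields(ET.fromstring(ENTRY_XML.strip()))
print(fields['nummer'], fields['toevoeging'], fields['hoogsteVolgnummer'])
```

Note that every value stays a string here; whether `nummer` and `hoogsteVolgnummer` should become integers is a choice left to whatever consumes the dict.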

### Inspecting those suffixes

In [4]:
# if you've read the bit about the toevoegingen / suffixes, you may care about how many there are

# get an idea of these toevoegingen that exist on certain numbers
num_toe = collections.defaultdict( list )
for kd in ks_dicts:
    num_toe[ kd['content'].get('nummer') ].append( kd['content'].get('toevoeging','') )

In [34]:
len( num_toe )

5509

In [5]:
# all with more than a few - most of which seem to be begroting-related
for nummer in num_toe:
    if num_toe[nummer] != [None]:
        if len(num_toe[nummer]) >= 3:
            print(nummer, num_toe[nummer])

30800 ['VIII', 'XVI', 'XIV']
31444 ['IXB', 'X', None, 'XIV', 'VII', 'XII']
31200 ['XV', 'XIV', 'A', None, 'B', 'I', 'D', 'IXB', 'IV', 'VII', 'XVIII', 'XVII', 'VIII', 'III', 'XIII', 'XI', 'XVI', 'VI', 'XII', 'V', 'X']
31792 ['F', 'IIA', 'C', 'B', 'XII', 'A', 'XIII', 'VII', 'IXA', 'D', 'XVIII', 'IXB', 'XI', 'X', 'IV', None, 'V', 'XVI', 'XVII', 'XV', 'XIV', 'VI', 'VIII', 'III', 'G', 'IIB']
31965 ['XIII', 'III', 'E', 'VI', 'XVII', 'IXA', 'VIII', 'D', 'IXB', 'XVIII', 'V', 'B', 'C', 'VII', 'X', 'A', 'G', 'IIA', 'IV', 'XVI', 'IIB', 'XIV', 'XV', 'F', 'XII', None, 'XI']
31700 ['IXA', 'IIA', 'I', 'G', 'F', 'IXB', 'D', 'XIII', 'XII', 'XV', 'V', 'A', 'C', 'B', 'IV', None, 'VII', 'XVII', 'VI', 'III', 'XVIII', 'VIII', 'XVI', 'XI', 'XIV', 'E', 'IIB', 'X']
32123 ['E', 'G', 'IXA', 'IIA', 'F', 'IXB', 'D', 'C', 'III', 'VI', 'XIII', 'XVII', 'I', 'XI', 'X', None, 'XVIII', 'XVI', 'B', 'IV', 'XV', 'V', 'XIV', 'XII', 'A', 'VIII', 'VII', 'IIB']
32222 ['XIII', 'XVI', 'G', 'VI', 'III', 'XVIII', 'XV', 'XIV', 'X',
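The `None` entries in the lists above presumably come from dossiers without a toevoeging: `.get('toevoeging', '')` only falls back to `''` when the key is absent, and here the key seems to be present with the value `None`. A small sketch with toy dicts shows the effect. The numbers and suffixes are borrowed from the output above, and the dict layout is assumed to mirror `ks_dicts`.

```python
import collections

# Toy stand-ins for ks_dicts; numbers and suffixes are borrowed from the output above.
sample_dicts = [
    {'content': {'nummer': '30800', 'toevoeging': 'VIII'}},
    {'content': {'nummer': '30800', 'toevoeging': 'XVI'}},
    {'content': {'nummer': '30218', 'toevoeging': None}},   # key present, value None
    {'content': {'nummer': '31444', 'toevoeging': 'X'}},
    {'content': {'nummer': '31444', 'toevoeging': None}},
]

num_toe = collections.defaultdict(list)
for kd in sample_dicts:
    # the '' default is only used when the key is missing entirely
    num_toe[kd['content'].get('nummer')].append(kd['content'].get('toevoeging', ''))

# dossier numbers that occur both with and without a toevoeging
mixed = sorted(n for n, toes in num_toe.items() if None in toes and len(toes) > 1)
print(dict(num_toe))
print(mixed)
```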

## Kamerstukdossiers with tkapi

As it turns out, the more relational nature of this part of the data is served well by tkapi
doing more things for us out of the box.

In [12]:
# !pip3 install tkapi
import tkapi   
from tkapi.dossier import Dossier
from tkapi.document import DocumentSoort

In [13]:
api = tkapi.TKApi()

In [15]:
# again, fetch all dossiers
all_dossiers = api.get_dossiers() # takes ~30 seconds

# tkapi knows about properties, and relationships to other objects, 
# and allows you to fetch them via attributes and functions it puts on each object  (for the programmers: it's much like an ORM)
# the following (while not a clean list) helps illustrate that point, 
# in particular see  
# - properties like  id, nummer
# - relations like   documenten,  zaken (which are its relations mentioned in the diagram from earlier)
first_dossier = all_dossiers[0]
list( name  for name in dir(first_dossier)  if not name.startswith('_'))

['afgesloten',
 'begin_date_key',
 'create_filter',
 'documenten',
 'end_date_key',
 'expand_params',
 'filter_param',
 'get_date_from_datetime_or_none',
 'get_date_or_none',
 'get_datetime_or_none',
 'get_param_expand',
 'get_params_default',
 'get_property_enum_or_none',
 'get_property_or_empty_string',
 'get_property_or_none',
 'get_resource_url_or_none',
 'get_year_or_none',
 'gewijzigd_op',
 'id',
 'nummer',
 'orderby_param',
 'organisatie',
 'print_json',
 'related_item',
 'related_items',
 'related_items_deep',
 'titel',
 'toevoeging',
 'type',
 'url',
 'zaken']

In [40]:
# Let's get a basic summary of dossiers.

# fetch all; the sorting by dossier number (and toevoeging) happens a few cells down
all_dossiers      = api.get_dossiers() # ~30 seconds; this repeats the fetch from above

In [41]:
# pick just a few, to keep the output short.
selected_dossiers = random.sample( all_dossiers, 5) 

# the sorting is for when you select a sub-range (e.g. use the following line) and want to see some of the patterns in the numbering and suffixes
#selected_dossiers = all_dossiers[:200]
# ...in particular, we find out apparently  dossier nummers  are not unique without the  toevoeging

sorted_dossiers   = sorted(  selected_dossiers,   key=lambda dossier:str(dossier.nummer)+(dossier.toevoeging or '')  )

In [42]:
for dossier in sorted_dossiers:
    print('\n\n')

    # you could e.g. figure out other zaken that refer to the same documents
    #zaaknrs = set()
    #for zaak in dossier.zaken:
    #    zaaknrs.add( zaak.nummer ) # zaak.onderwerp)

    nummer_and_toevoeging = ('%s-%s'%(dossier.nummer, dossier.toevoeging or '')).rstrip('-')
    print( f"== Dossier {nummer_and_toevoeging} == {dossier.titel} ==" )
    #print( '  ',dossier.url.replace(')','%29') ) # the replace is to make the notebook's url include the final bracket

    # It seems that many documenten have a related zaak, but it's not one-to-one;  TODO:
    for zaak in dossier.zaken:
        print(f'   ZAAK      {zaak.nummer}  {str(zaak.soort).split(".",1)[1]:18s} {zaak.onderwerp}')
        #print('         ',zaak.url)
    
    for document in sorted(dossier.documenten, key=lambda doc:doc.volgnummer):
        print( f'   DOC #{str(document.volgnummer):3s}  {str(document.nummer):10s} - {str(document.datum):12s} - {document.onderwerp:100s} - {document.bestand_url:30s}')
        
        #for zaak in document.zaken:
        #    print(f'     DOCZAAK      {zaak.nummer}  {str(zaak.soort).split(".",1)[1]:18s} {zaak.onderwerp}')




== Dossier 28102 == Voorstel van wet van het lid Arib tot instelling van een Kinderombudsman ==
   ZAAK      2007Z02007  INITIATIEF_WETGEVING Voorstel van wet van het lid Arib tot instelling van een Kinderombudsman
   DOC #1    2007D02031 - 2001-12-06   - Geleidende brief                                                                                     - https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0/Document(a5ca5e82-06e7-42b0-9bc7-f0509ef0e13b)/TK.DA.GGM.OData.Resource()
   DOC #2    2007D02032 - 2001-12-06   - Voorstel van Wet                                                                                     - https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0/Document(28c9b096-e95e-43d1-ba0b-dddd793a3780)/TK.DA.GGM.OData.Resource()
   DOC #3    2007D02033 - 2001-12-06   - Memorie van Toelichting                                                                              - https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0/Document(7c93cf39-7df5-4d7c-a1ed-1eeaca57d8
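The sort key used a few cells up concatenates strings, so dossier numbers are compared character by character rather than numerically. A minimal sketch of a tuple-based key and a label helper follows. `DossierStub` is a stand-in for tkapi's dossier objects, both helper names are made up, the number 9202 is invented for the example, and `nummer` is assumed to be something `int()` accepts.

```python
from collections import namedtuple

# Stand-in for tkapi dossier objects: only the two attributes used here.
DossierStub = namedtuple('DossierStub', ['nummer', 'toevoeging'])

def dossier_label(dossier):
    """'21501-07' when there is a toevoeging, plain '28102' otherwise."""
    if dossier.toevoeging:
        return '%s-%s' % (dossier.nummer, dossier.toevoeging)
    return str(dossier.nummer)

def dossier_sort_key(dossier):
    """Numeric nummer first, then toevoeging (a missing toevoeging sorts first)."""
    return (int(dossier.nummer), dossier.toevoeging or '')

stubs = [
    DossierStub(28102, None),
    DossierStub(21501, '20'),
    DossierStub(9202, None),     # invented four-digit number, to show numeric ordering
    DossierStub(21501, '07'),
]
labels = [dossier_label(d) for d in sorted(stubs, key=dossier_sort_key)]
print(labels)
```

With the string-concatenation key, the four-digit number would end up last instead of first, because `'9'` sorts after `'2'`.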

In [43]:
# Out of interest, and for a short example, look for larger dossiers.  Many of them will be more thematic.
# Note: the implied fetches would take 15 minutes
for dossier in sorted( all_dossiers, key=lambda d:d.nummer ): # sort to show the same-nummer-different-toevoeging cases together
    if len( dossier.documenten ) > 100:
        toe_str = '-%s'%dossier.toevoeging  if dossier.toevoeging is not None  else  '   '
        print( "%5s%-5s  is a large dossier with %4d documents, titled:  %s" %(
                dossier.nummer, toe_str, len(dossier.documenten), dossier.titel))

17050       is a large dossier with  254 documents, titled:  Misbruik en oneigenlijk gebruik op het gebied van belastingen, sociale zekerheid en subsidies
19637       is a large dossier with 2043 documents, titled:  Vreemdelingenbeleid
20454       is a large dossier with  113 documents, titled:  Voortgangsrapportage uitvoering wetten oorlogsgetroffenen
21501-07    is a large dossier with 1431 documents, titled:  Raad voor Economische en Financiële Zaken
21501-32    is a large dossier with 1361 documents, titled:  Landbouw- en Visserijraad
21501-20    is a large dossier with 1696 documents, titled:  Europese Raad
21501-02    is a large dossier with 2073 documents, titled:  Raad Algemene Zaken en Raad Buitenlandse Zaken
21501-33    is a large dossier with  902 documents, titled:  Raad voor Vervoer, Telecommunicatie en Energie
21501-34    is a large dossier with  318 documents, titled:  Raad voor Onderwijs, Jeugd, Cultuur en Sport
21501-31    is a large dossier with  597 documents, titled

In [44]:
# Out of a different interest, let's get a count of the document types
# Note: the implied fetching of document objects will take on the order of fifteen minutes
soorten = collections.defaultdict(list)

for dossier in sorted( all_dossiers, key=lambda d:d.nummer ):
    for document in sorted( dossier.documenten, key=lambda doc:doc.volgnummer ):
        try:
            soorten[ str(document.soort).split(".",1)[1] ].append( document ) # takes the enum name (rather than value) and splits off the DocumentSoort.
        except Exception as e:
            print("SKIP invalid soort: %s"%(e))

In [45]:
# make a list of (soort, doclist) pairs, sort it by number of documents, descending
soorten_by_count = sorted( list( soorten.items() ), key=lambda pair: len(pair[1]), reverse=True )

for soort, doclist in soorten_by_count:
    print( f'{len(doclist):7d}  {soort}')

  73489  BRIEF_REGERING
  49458  MOTIE
   8274  AMENDEMENT
   5248  MOTIE_GEWIJZIGDNADER
   5037  VERSLAG_VAN_EEN_ALGEMEEN_OVERLEG
   4657  AMENDEMENT_GEWIJZIGD_NADER_VERVANGEND
   4178  VERSLAG_VAN_EEN_SCHRIFTELIJK_OVERLEG
   3979  VOORSTEL_VAN_WET
   3978  MEMORIE_VAN_TOELICHTING
   3216  VERSLAG_INITIATIEFWETSVOORSTEL_NADER
   2907  LIJST_VAN_VRAGEN_EN_ANTWOORDEN
   2499  NOTA_NAV_HET_NADERTWEEDE_NADERENZ_VERSLAG
   2400  NOTA_VAN_WIJZIGING
   2299  KONINKLIJKE_BOODSCHAP
   2156  ADVIES_AFDELING_ADVISERING_RAAD_VAN_STATE_EN_NADER_RAPPORT
   1170  VERSLAG_HOUDENDE_EEN_LIJST_VAN_VRAGEN_EN_ANTWOORDEN
    918  BRIEF_COMMISSIE
    903  BRIEF_ALGEMENE_REKENKAMER
    789  VERSLAG_VAN_EEN_COMMISSIEDEBAT
    617  VERSLAG_VAN_EEN_WETGEVINGSOVERLEG
    528  GELEIDENDE_BRIEF
    463  BRIEF_LID__FRACTIE
    419  MEMORIE_VAN_TOELICHTING_INITIATIEFVOORSTEL
    394  VOORSTEL_VAN_WET_INITIATIEFVOORSTEL
    388  JAARVERSLAG
    380  VERSLAG_COMMISSIE_VERZOEKSCHRIFTEN_EN_DE_BURGERINITIATIEVEN
    361 
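When only the counts matter (and not the document objects themselves), `collections.Counter` does the tallying and sorting in one go, and an enum member's `.name` gives the same text as the string splitting above. The sketch below uses `SoortStub`, a stand-in enum with made-up values, not tkapi's `DocumentSoort`.

```python
import collections
import enum

class SoortStub(enum.Enum):
    """Stand-in for tkapi's DocumentSoort; the values here are made up."""
    MOTIE = 'Motie'
    BRIEF_REGERING = 'Brief regering'
    AMENDEMENT = 'Amendement'

observed = [SoortStub.MOTIE, SoortStub.BRIEF_REGERING, SoortStub.MOTIE,
            SoortStub.AMENDEMENT, SoortStub.BRIEF_REGERING, SoortStub.BRIEF_REGERING]

# .name gives the member name directly; for a plain Enum that is the same
# text as str(soort).split('.', 1)[1]
soort_counts = collections.Counter(soort.name for soort in observed)

for name, count in soort_counts.most_common():   # sorted by count, descending
    print(f'{count:7d}  {name}')
```

Keeping the lists of documents, as the notebook does, is the better choice when those documents get fetched later on.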

In [None]:
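# count how often each volgnummer appears
# (all_docs is only built further down, under 'Downloading a bunch of documents', so run that cell first)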
vn = collections.defaultdict(int)
for doc in all_docs:
    vn[ doc.volgnummer ] += 1

for volgnummer, count in sorted( vn.items(), key=lambda x:x[0] ): # sort by volgnummer
    print("%-5s  %s"%(volgnummer, count))

### TODO: figure out why things seem to be missing
I ask because I know of cases like 

    https://zoek.officielebekendmakingen.nl/kst-35302-F

as part of 
    
    https://zoek.officielebekendmakingen.nl/dossier/35302


TODO: figure out whether that's the API or tkapi

...but it seems that the Tweede Kamer API is literally just that -- it _excludes_ things from the Eerste Kamer.

***That means the kamerdossiers may be incomplete*** whenever the two cooperated.

In [55]:
# To demonstrate, let's get that specific dossier, which has 139 documents via that zoek.officielebekendmakingen.nl link
dossier_filter = Dossier.create_filter()
dossier_filter.filter_nummer('35302')
for dossier in api.get_dossiers( dossier_filter ):
    print('-')
    print( len(dossier.documenten) ) # should be 139, see  https://zoek.officielebekendmakingen.nl/dossier/35302
    for document in sorted( dossier.documenten, key=lambda x:x.volgnummer ):
        print(document.volgnummer, end=' ')

-
85
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 

...or even more visibly,
    https://zoek.officielebekendmakingen.nl/dossier/34211
exists but just seems to have no documents:

In [58]:
dossier_filter = Dossier.create_filter()
dossier_filter.filter_nummer('34211')
for dossier in api.get_dossiers( dossier_filter ):
    print( len(dossier.documenten) )
    for document in sorted( dossier.documenten, key=lambda x:x.volgnummer ):
        print(document.volgnummer)

0
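The identifier `kst-35302-F` mentioned earlier has a letter where the documents returned by the API have a number. A minimal sketch for separating the two kinds, and for listing numeric ones the API did not return, could look like this. Only `kst-35302-F` comes from the text above; the other identifiers are invented, `split_by_suffix` is a made-up helper, and dossiers whose identifier includes a toevoeging are not handled.

```python
def split_by_suffix(identifiers):
    """Split 'kst-<dossier>-<suffix>' identifiers into numeric and non-numeric suffixes."""
    numeric, other = [], []
    for ident in identifiers:
        suffix = ident.rsplit('-', 1)[1]   # the part after the last dash
        if suffix.isdigit():
            numeric.append(int(suffix))
        else:
            other.append(suffix)
    return sorted(numeric), sorted(other)

# 'kst-35302-F' is the case mentioned above; the other identifiers are invented for the example.
found_on_site = ['kst-35302-1', 'kst-35302-2', 'kst-35302-F', 'kst-35302-3', 'kst-35302-A']
from_api      = {1, 2, 3}   # volgnummers as the API would report them

numeric, other = split_by_suffix(found_on_site)
only_on_site = [n for n in numeric if n not in from_api]
print(numeric, other, only_on_site)
```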


### Example - fetching all documents of type VERSLAG_VAN_EEN_COMMISSIEDEBAT

...yes, just like in part 2. This is here as a slightly different take. 

In [48]:

for typ, doclist in soorten_by_count:
    if 'VERSLAG' in typ: # mention what other kinds of verslagen there are
        print( f'{len(doclist):<10d} {typ}' )
    
    # fetch contents for a specific type
    if typ == 'VERSLAG_VAN_EEN_COMMISSIEDEBAT':
        count_fetched, count_cached = 0, 0
        pb = wetsuite.helpers.notebook.progress_bar( len(doclist), description=str(typ) )

        for doc in doclist:
            try:
                for dossier in doc.dossiers:
                    toe_s = dossier.toevoeging is not None  and  '-%s'%dossier.toevoeging  or  '   ' # apologies for the syntax-fu
                    #print( 'Document %s belongs to %5s%-5s  (%s)'%(doc.nummer,  dossier.nummer, toe_s, dossier.titel),  )
                    #print( '  ', doc.url )
                    #print( '  ',doc.bestand_url )

                    bytestring, came_from_cache = wetsuite.helpers.localdata.cached_fetch( tkapi_docs, doc.url)
            except Exception as e:
                print('ERR: '+str(e) )
            
            #time.sleep(0.1)
            pb.value += 1

5037       VERSLAG_VAN_EEN_ALGEMEEN_OVERLEG
4178       VERSLAG_VAN_EEN_SCHRIFTELIJK_OVERLEG
3216       VERSLAG_INITIATIEFWETSVOORSTEL_NADER
2499       NOTA_NAV_HET_NADERTWEEDE_NADERENZ_VERSLAG
1170       VERSLAG_HOUDENDE_EEN_LIJST_VAN_VRAGEN_EN_ANTWOORDEN
789        VERSLAG_VAN_EEN_COMMISSIEDEBAT


VERSLAG_VAN_EEN_COMMISSIEDEBAT:   0%|          | 0/789 [00:00<?, ?it/s]

617        VERSLAG_VAN_EEN_WETGEVINGSOVERLEG
388        JAARVERSLAG
380        VERSLAG_COMMISSIE_VERZOEKSCHRIFTEN_EN_DE_BURGERINITIATIEVEN
361        VERSLAG_VAN_EEN_BIJEENKOMST
322        VERSLAG_VAN_EEN_NOTAOVERLEG
247        VERSLAG_VAN_EEN_HOORZITTING__RONDETAFELGESPREK
144        VERSLAG_VAN_EEN_WERKBEZOEK
32         VERSLAG_VAN_EEN_RAPPORTEUR
8          VERSLAG_VAN_EEN_POLITIEKE_DIALOOG
4          INBRENG_VERSLAG_SCHRIFTELIJK_OVERLEG
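The fetch loop above leans on `cached_fetch` handing back both the data and whether it came from the local store. The sketch below illustrates that fetch-or-reuse idea with a plain dict as the store and a fake fetch function, so it runs offline. It is not wetsuite's implementation, and all names and keys in it are hypothetical.

```python
def cached_fetch_sketch(store, url, fetch):
    """Return (data, came_from_cache); call fetch(url) only when url is not in store yet."""
    if url in store:
        return store[url], True
    data = fetch(url)
    store[url] = data
    return data, False

calls = []
def fake_fetch(url):
    """Pretend download, so the example needs no network."""
    calls.append(url)
    return ('contents of %s' % url).encode('utf-8')

store = {}
count_fetched, count_cached = 0, 0
for url in ['doc/1', 'doc/2', 'doc/1']:   # made-up keys; the repeat is served from the store
    data, came_from_cache = cached_fetch_sketch(store, url, fake_fetch)
    if came_from_cache:
        count_cached += 1
    else:
        count_fetched += 1

print(count_fetched, count_cached)
```

Because the store is keyed by URL, re-running an interrupted loop only fetches what is still missing.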


### Example - finding Raad van State advice

Let's say that our interest is more specific:
finding what the Raad van State has to say about proposed laws (wetsvoorstellen).

...and, in the process, also learn what kinds of documents there are in each dossier.
 
There is also the [advice on the Raad van State site](https://www.raadvanstate.nl/adviezen/)
(for a more data-like form, see also our [extras_datacollect_raadvanstate](extras_datacollect_raadvanstate.ipynb)),
but there it is not placed in the context of the law it refers to.
This interface should at least give us the law's name.

In [49]:
# We start by selecting dossiers where there already _is_ RvS advice.
#  - this is a decent filter for wetsvoorstellen
#  - and filters out wetsvoorstellen that don't need this advice (e.g. begroting)
# ...but we are about to find out that
# - there are other things that RvS advises on, like finances (see e.g. 36200) 
# - there are law changes that RvS does not advise on (e.g. TODO)

sorted_dossiers = sorted(all_dossiers,  key=lambda d:d.nummer,  reverse=True )

count = 0
for i, dossier in enumerate( sorted_dossiers ):
    nummer_and_toevoeging = ('%s-%s'%(dossier.nummer, dossier.toevoeging or '')).rstrip('-')

    #if (dossier.nummer%100) == 0: # ignore a few specific special cases for now,   just because they're large to print
    #    continue

    ## In our stated interest:  first see if it has RvS advice
    sorted_docs      = sorted(dossier.documenten,  key=lambda d:d.volgnummer )
    has_raadvanstate = False
    for document in sorted_docs:
        try:
            # these come from an enum, try  list( tkapi.document.DocumentSoort )  to see a list
            if document.soort in (DocumentSoort.ADVIES_AFDELING_ADVISERING_RAAD_VAN_STATE, 
                                  #DocumentSoort.ADVIES_AFDELING_ADVISERING_RAAD_VAN_STATE_EN_NADER_RAPPORT, # seems to be begrotingstuff?  (TODO: check)
                                  DocumentSoort.ADVIES_AFDELING_ADVISERING_RAAD_VAN_STATE_EN_REACTIE_VAN_DE_INITIATIEFNEMERS,
                                ):
                has_raadvanstate = True
        except ValueError: # there's some invalid / non-covered soort values in the data
            pass # ignore
        # we can filter on more, but we may not need to?

    if not has_raadvanstate:
        continue
    # if execution gets here, it's probably interesting to us.
    
    count += 1
    #if len(sorted_docs)>500:
    #    print( "\n\n== Dossier %s == %s =="%( dossier.nummer, dossier.titel) )
    #    print(' LARGE: %d documents'%len(sorted_docs))
    #    print(' %s ({{kamerdossier|%d}}'%(dossier.titel, dossier.nummer))
    #    continue

    print( "\n== %r == Dossier %s == %d docs == %s =="%( dossier.id, nummer_and_toevoeging, len(dossier.documenten), dossier.titel) )
    for document in sorted_docs:
        try:
            if 0: # just to make the summaries a little easier to read
                if document.soort in (DocumentSoort.MOTIE, DocumentSoort.AMENDEMENT, DocumentSoort.BRIEF_REGERING, DocumentSoort.VERSLAG_VAN_EEN_ALGEMEEN_OVERLEG,
                                    DocumentSoort.MEMORIE_VAN_TOELICHTING_INITIATIEFVOORSTEL,
                                    ):
                    continue
        except ValueError:
            print( "soort not known by tkapi")
            continue

        try:
            docsoort = document.soort
        except ValueError: # this seems to be internal inconsistency
            continue

        show_all_docs = False
        if show_all_docs or docsoort in (DocumentSoort.ADVIES_AFDELING_ADVISERING_RAAD_VAN_STATE, 
                                DocumentSoort.ADVIES_AFDELING_ADVISERING_RAAD_VAN_STATE_EN_REACTIE_VAN_DE_INITIATIEFNEMERS):

            print( '#%s'%(document.volgnummer, ), document.soort.name)
            #print( 'soort', document.soort.name, '(%s)'%document.soort.value )
            print( '  onderwerp    ', document.onderwerp )     # for wetsvoorstel-dossiers, seems to often be the same as soort plus some detail (e.g. who a letter is from)
            print( '  citeertitel  ', document.titel_citeer ) # for wetsvoorstel-dossiers, this often seems to name the law. Or a related one, see e.g. 36195
            print( '  titel        ', document.titel )              # for wetsvoorstel-dossiers, this seems to often name the law, plus sometimes some reason
            #print( 'versies', document.versies )
            print( '  url          ', document.bestand_url )
            if 0:        # It may be interesting to know the document is part of multiple dossiers and/or multiple zaken
                print( '  zaken         ', document.zaken )
                #nums = document.dossier_nummers
                #nums.pop(dossier.nummer)
                #if len(nums)>0:
                #  print( "  also in dossiers: %s"%nums )

            print()

    #if i > 1000: # show only a bunch, not all
    #    print("break %d"%i)
    #    break
print( 'Interesting cases: %d'%count )


== '6abacd05-acdf-4938-ab33-36f88d8ea469' == Dossier 36468 == 6 docs == Voorstel van wet van het lid Dijk houdende verandering in de Grondwet, strekkende tot opneming van bepalingen inzake het correctief referendum ==
#5 ADVIES_AFDELING_ADVISERING_RAAD_VAN_STATE_EN_REACTIE_VAN_DE_INITIATIEFNEMERS
  onderwerp     Advies Afdeling advisering Raad van State en Reactie van de initiatiefnemer
  citeertitel   
  titel         Voorstel van wet van het lid Dijk houdende verandering in de Grondwet, strekkende tot opneming van bepalingen inzake het correctief referendum
  url           https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0/Document(87f1dfa6-1f5a-4d6a-96da-72e52ad0ae45)/TK.DA.GGM.OData.Resource()


== '74a0986e-e2bb-4ea9-b752-7011fd53c3e1' == Dossier 36353 == 7 docs == Voorstel van wet van de leden Diederik van Dijk, Erkens, Boswijk, Dassen, Kahraman, Tuinman, Paternotte, Eerdmans en Ceder houdende vaststelling van regels inzake het voldoen aan verplichtingen voor de defensie van h
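The "does this dossier have RvS advice" check can also be written as a small helper around `any()`, with the `ValueError` that the comments above describe treated as "no match". The sketch below uses `DocStub`, a stand-in class that mimics that behaviour, and plain strings where the notebook uses `DocumentSoort` members. The helper names are made up.

```python
class DocStub:
    """Stand-in for a tkapi document; .soort raises ValueError for values the enum does not cover."""
    def __init__(self, soort, valid=True):
        self._soort, self._valid = soort, valid

    @property
    def soort(self):
        if not self._valid:
            raise ValueError('%r is not a known soort' % self._soort)
        return self._soort

def soort_or_none(document):
    """Read document.soort, mapping the ValueError case to None."""
    try:
        return document.soort
    except ValueError:
        return None

def has_any_soort(documents, wanted):
    """True if at least one document has a soort in `wanted`."""
    return any(soort_or_none(doc) in wanted for doc in documents)

# plain strings here; the notebook uses enum members
WANTED = {'ADVIES_AFDELING_ADVISERING_RAAD_VAN_STATE'}
with_advice    = [DocStub('MOTIE'), DocStub('???', valid=False),
                  DocStub('ADVIES_AFDELING_ADVISERING_RAAD_VAN_STATE')]
without_advice = [DocStub('MOTIE'), DocStub('???', valid=False)]
print(has_any_soort(with_advice, WANTED), has_any_soort(without_advice, WANTED))
```

Unlike the loop above, `any()` stops at the first match, which saves a little work on large dossiers.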

### Example: inspect specific dossier

In [25]:
#If you only wanted a specific dossier, use something like:
from tkapi.dossier import Dossier
dossier_filter = Dossier.create_filter()
dossier_filter.filter_nummer('35302')
dossiers = api.get_dossiers( dossier_filter )

for document in dossiers[0].documenten:
    print( document.id, document.bestand_url)

f424dd8a-71c7-49f9-8c41-01ddb9d603b7 https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0/Document(f424dd8a-71c7-49f9-8c41-01ddb9d603b7)/TK.DA.GGM.OData.Resource()
860126d6-fcd6-4d02-800f-07cc63e98b93 https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0/Document(860126d6-fcd6-4d02-800f-07cc63e98b93)/TK.DA.GGM.OData.Resource()
7c3441c9-e3e1-44d2-9bfc-0822d3e27571 https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0/Document(7c3441c9-e3e1-44d2-9bfc-0822d3e27571)/TK.DA.GGM.OData.Resource()
83c01e2d-2314-4d39-8f92-0912d95640af https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0/Document(83c01e2d-2314-4d39-8f92-0912d95640af)/TK.DA.GGM.OData.Resource()
3b5606db-4279-47b8-9793-10a524bd4b91 https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0/Document(3b5606db-4279-47b8-9793-10a524bd4b91)/TK.DA.GGM.OData.Resource()
ee4dcb5e-a102-4a63-b64b-10df63260559 https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0/Document(ee4dcb5e-a102-4a63-b64b-10df63260559)/TK.DA.GGM.OData.Resource()
adf73905-0660-48
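The URLs in this output visibly wrap the document id. If only a URL is at hand (for example as a key in the local store), the id can be recovered with a regular expression. The pattern below is inferred from the URLs printed above, not taken from the API documentation, and `document_id_from_url` is a made-up helper name.

```python
import re

# Pattern inferred from the URLs printed above; treat it as an assumption, not a documented contract.
RESOURCE_RE = re.compile(r'/Document\(([0-9a-f-]{36})\)/TK\.DA\.GGM\.OData\.Resource\(\)$')

def document_id_from_url(url):
    """Return the document id embedded in a bestand_url, or None if the URL looks different."""
    match = RESOURCE_RE.search(url)
    return match.group(1) if match else None

url = ('https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0/'
       'Document(f424dd8a-71c7-49f9-8c41-01ddb9d603b7)/TK.DA.GGM.OData.Resource()')
print(document_id_from_url(url))
print(document_id_from_url('https://example.com/other'))
```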

### Downloading a bunch of documents

Note: You probably don't want to do this.

This was intended as a way to dig deeper into these documents. We might provide the result as a dataset,
but this is one of the nicer APIs, one that _should_ let you translate most wishes
into queries that fetch everything you need in minutes. 

In [51]:
# If we wanted to download _all_ dossier-related documents, try the following.
# Note that 
# - This would take maybe fifteen minutes - most of it the fetching of .documenten
# - this is still just their object metadata. You can inspect it now, but you don't have the contents yet
#   - you might prefer to do this per soort, to avoid gigabytes of what you don't want

all_docs = []
for dossier in all_dossiers:
    all_docs.extend( dossier.documenten )

In [None]:
print( "Documents to fetch: %d"%len(all_docs)) # ~180K as of this writing

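# fetching all the actual content takes a long time, also depending on how nice we are being to the servers:
# with the 5-second sleep below, ~180K uncached documents means on the order of 250 hours of sleeping alone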
count_cached, count_fetched = 0, 0
pb = wetsuite.helpers.notebook.progress_bar( len(all_docs) )
for document in all_docs: # will likely have
    bytestring, came_from_cache = wetsuite.helpers.localdata.cached_fetch( tkapi_docs, document.url )            # json metadata (odata style)
    bytestring, came_from_cache = wetsuite.helpers.localdata.cached_fetch( tkapi_docs, document.bestand_url )    # the document
    if came_from_cache:
        count_cached += 1
    else:
        count_fetched += 1
        time.sleep(5) # be somewhat nice to the server
    pb.description = f"fetched {count_fetched}, cached {count_cached}  ({(100.*count_cached)/(count_cached+count_fetched):.0f}% cached)  "    
    pb.value += 1

In [61]:
# How much do we have now?
tkapi_docs.summary( get_num_items=True )

{'size_bytes': 12690366464,
 'size_readable': '11.8GiB',
 'num_items': 369202,
 'avgsize_bytes': 34372,
 'avgsize_readable': '34KiB'}

In [53]:
# Out of interest (and the need for us to parse), what kind of documents are these?
import magic # file magic refers to detecting file type by its contents

filetypes = collections.defaultdict(int)

for url, value in tkapi_docs.items(): # ~180K documents will take at least a minute to inspect
    # The store contains both the fetched metadata JSON and the fetched document,
    #   we are interested only in the document right now, which will be the ones with Resource() at the end of their URL
    if 'Resource()' in url:
        descr = magic.from_buffer( value )
        if 'Composite Document File' in descr: # group a few earlier word variants
            descr = 'Earlier MS Office'
        if 'PDF document' in descr:            # group PDF versions into just one
            descr = 'PDF document'
        filetypes[ descr ] +=1

dict( filetypes )

# Note:  it seems that the same URL will be an office document while it is not final,
#        and becomes a PDF once it is. 
#        If so 
#        - that's an abuse of URL logic
#        - that also means that if we care about the latest,
#           we need to poke the server a lot more to actually get it.
#           (the quick and dirty way around it is to remove all officey documents and hope the re-fetch is different)

{'PDF document': 183221,
 'Microsoft Word 2007+': 325,
 'Earlier MS Office': 274,
 'Zip archive data, at least v2.0 to extract': 2}
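Where python-magic is not available, a rough split by leading bytes gives a similar coarse grouping. The sketch below checks three well-known file signatures: `%PDF` for PDF, `PK\x03\x04` for zip containers (which include Word 2007+ files), and the OLE compound-file header used by earlier MS Office formats. It is coarser than libmagic, since it cannot tell a .docx from any other zip, and the sample byte strings are made up.

```python
import collections

# (leading bytes, label) pairs; the first matching prefix wins
SIGNATURES = [
    (b'%PDF',                              'PDF document'),
    (b'PK\x03\x04',                        'Zip-based (includes Word 2007+)'),
    (b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1',  'Earlier MS Office'),
]

def rough_filetype(data):
    """Classify a bytestring by its first few bytes; 'unknown' if nothing matches."""
    for prefix, label in SIGNATURES:
        if data.startswith(prefix):
            return label
    return 'unknown'

# made-up sample contents, just enough bytes to carry a signature
samples = [
    b'%PDF-1.4 ...',
    b'PK\x03\x04....',
    b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1...',
    b'%PDF-1.7 ...',
    b'<html>',
]
filetypes = collections.Counter(rough_filetype(data) for data in samples)
print(dict(filetypes))
```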