<a href="https://colab.research.google.com/github/knobs-dials/wetsuite-datacollect/blob/main/tweede_kamer_dossiers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Purpose of this notebook

This notebook explores how we might build a dataset from the data on the [Tweede Kamer open data portal](https://opendata.tweedekamer.nl/).

We restrict ourselves to kamerstukdossiers and their documents.

You may want to skip over part 1, as it



## How do numbers work?

One useful thing to grasp first is how the numbering, identifiers and references for dossiers and the documents they contain work.
See the relevant notes in the [identifiers-and-references notebook (in the wetsuite-notebooks repo)](https://github.com/knobs-dials/wetsuite-notebooks/blob/main/notes__legal_identifiers_and_references.ipynb)
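
As a small illustration of what such an identifier looks like when taken apart, here is a minimal sketch that splits something like `kst-35302-F` (which comes up later in this notebook) into its dossier number and volgnummer. The pattern is my assumption based on that one example, and `parse_kst` is a name made up for this sketch; dossier numbers with a toevoeging are deliberately not handled.

```python
import re

# Assumed shape: "kst-<dossier number>-<volgnummer>", as in kst-35302-F.
# Dossier numbers that carry a toevoeging (suffix) are not handled here.
KST_RE = re.compile(r'^kst-(?P<dossier>\d+)-(?P<volgnummer>[0-9A-Za-z]+)$')

def parse_kst(identifier):
    """Return (dossier number, volgnummer) as strings, or None if the shape does not match."""
    m = KST_RE.match(identifier)
    if m is None:
        return None
    return m.group('dossier'), m.group('volgnummer')

print(parse_kst('kst-35302-F'))
print(parse_kst('something else'))
```

Returning `None` instead of raising keeps it easy to run over a long list of identifiers and count the ones that do not fit.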

## Kamerstukdossiers with the SyncFeed API

In [9]:
import pprint
import collections

import wetsuite.datacollect.tweedekamer_nl
import wetsuite.helpers.etree   # used below for debug_pretty

In [6]:
# Fetch all kamerstukdossiers.
# This may take half a minute or so, because it is ~30 fetches amounting to ~6MByte of XML.
ks_tree = wetsuite.datacollect.tweedekamer_nl.merge_etrees( wetsuite.datacollect.tweedekamer_nl.fetch_all( 'Kamerstukdossier' ) )
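
To give an idea of why this takes multiple fetches: I'm assuming the feed is paged, and that you get the next page by following a `rel="next"` link like the one visible in the entry output further down. The sketch below shows that loop with a stand-in for the network. The function name, the fake pages and the placement of the link at feed level are all assumptions for illustration, not how `fetch_all` is necessarily implemented.

```python
import xml.etree.ElementTree as ET

ATOM = '{http://www.w3.org/2005/Atom}'

def fetch_pages(first_url, fetch):
    """Yield parsed feed pages, following feed-level rel="next" links until there is none.
    fetch is any callable that takes a URL and returns XML bytes."""
    url = first_url
    while url is not None:
        root = ET.fromstring(fetch(url))
        yield root
        url = None
        for link in root.findall(ATOM + 'link'):
            if link.get('rel') == 'next':
                url = link.get('href')
                break

# Stand-in for the network: two tiny fake pages (hypothetical URLs and content)
FAKE = {
    'page1': b'<feed xmlns="http://www.w3.org/2005/Atom">'
             b'<link rel="next" href="page2"/><entry><title>a</title></entry></feed>',
    'page2': b'<feed xmlns="http://www.w3.org/2005/Atom">'
             b'<entry><title>b</title></entry></feed>',
}

pages = list(fetch_pages('page1', FAKE.__getitem__))
titles = [entry.findtext(ATOM + 'title')
          for page in pages
          for entry in page.findall(ATOM + 'entry')]
print(titles)
```

Passing the fetch function in as an argument means the paging logic can be tried out without touching the real server.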

In [7]:
# Just the first entry:
print( wetsuite.helpers.etree.debug_pretty( ks_tree.find('entry') ))

# parse that into dicts, as before
ks_dicts = list( wetsuite.datacollect.tweedekamer_nl.entry_dicts( ks_tree ) )

pprint.pprint( ks_dicts[0] )

<entry>
  <title>1f031e16-cb3b-45b5-b3c9-a8abd27c913a</title>
  <id>https://gegevensmagazijn.tweedekamer.nl/SyncFeed/2.0/Entiteiten/1f031e16-cb3b-45b5-b3c9-a8abd27c913a</id>
  <author>
    <name>Tweede Kamer der Staten-Generaal</name>
  </author>
  <updated>2019-06-28T20:34:35Z</updated>
  <category term="kamerstukdossier"/>
  <link rel="next" href="https://gegevensmagazijn.tweedekamer.nl/SyncFeed/2.0/Feed?category=Kamerstukdossier&amp;skiptoken=53457"/>
  <content type="application/xml">
    <kamerstukdossier id="1f031e16-cb3b-45b5-b3c9-a8abd27c913a" bijgewerkt="2008-08-26T12:13:04.6270000" verwijderd="false">
      <titel>Wijziging van de Wet ruimtelijke ordening inzake de grondexploitatie</titel>
      <citeertitel nil="true"/>
      <alias nil="true"/>
      <nummer>30218</nummer>
      <toevoeging nil="true"/>
      <hoogsteVolgnummer>25</hoogsteVolgnummer>
      <afgesloten>false</afgesloten>
      <kamer>Tweede Kamer</kamer>
    </kamerstukdossier>
  </content>
</entry>

{'categ
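
If you want to see what turning such an entry into a dict amounts to, here is a minimal sketch using only the standard library, on an abbreviated copy of the entry printed above. `dossier_fields` is a name made up for this sketch, and it is not meant to mirror what `entry_dicts` does internally; treating `nil="true"` as `None` is my own choice here.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the entry printed above (no namespaces, as in that output)
ENTRY_XML = '''
<entry>
  <title>1f031e16-cb3b-45b5-b3c9-a8abd27c913a</title>
  <content type="application/xml">
    <kamerstukdossier id="1f031e16-cb3b-45b5-b3c9-a8abd27c913a" verwijderd="false">
      <titel>Wijziging van de Wet ruimtelijke ordening inzake de grondexploitatie</titel>
      <nummer>30218</nummer>
      <toevoeging nil="true"/>
      <hoogsteVolgnummer>25</hoogsteVolgnummer>
      <afgesloten>false</afgesloten>
    </kamerstukdossier>
  </content>
</entry>
'''

def dossier_fields(entry):
    """Flatten the children of <kamerstukdossier> into a dict; nil="true" becomes None."""
    dossier = entry.find('content/kamerstukdossier')
    fields = {}
    for child in dossier:
        fields[child.tag] = None if child.get('nil') == 'true' else child.text
    return fields

fields = dossier_fields(ET.fromstring(ENTRY_XML))
print(fields['nummer'], fields['toevoeging'], fields['hoogsteVolgnummer'])
```

Note that all values come out as strings, so `hoogsteVolgnummer` would still need an `int()` before you compare it to anything numeric.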

In [None]:
# if you've read the bit about the toevoegingen / suffixes, you may care about how many there are

# get an idea of these toevoegingen that exist on certain numbers
num_toe = collections.defaultdict( list )
for kd in ks_dicts:
    num_toe[ kd['content'].get('nummer') ].append( kd['content'].get('toevoeging','') )

In [None]:
# all with more than a few - all seem to be begroting-related
for nummer in num_toe:
    if num_toe[nummer] != [None]:
        if len(num_toe[nummer]) >= 3:
            print(nummer, num_toe[nummer])
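
One thing to be aware of: comparing against `[None]` only filters out numbers that appear exactly once without a toevoeging. A number that appeared twice without one would give `[None, None]` and slip through. A slightly more careful variant, sketched on made-up sample pairs (the number `99999` is hypothetical), could look like this:

```python
import collections

# Hypothetical stand-ins for ks_dicts: (nummer, toevoeging) pairs, None meaning "no toevoeging"
sample = [('30218', None), ('31314', '(R1843)'), ('99999', None), ('99999', None)]

num_toe = collections.defaultdict(list)
for nummer, toevoeging in sample:
    num_toe[nummer].append(toevoeging)

# Keep only the numbers where at least one real toevoeging occurs
with_toevoeging = {
    nummer: toes
    for nummer, toes in num_toe.items()
    if any(toe is not None for toe in toes)
}
print(with_toevoeging)
```

Whether such repeated numbers actually occur in the real data is something I have not checked.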


In [15]:
# with just a few
for nummer in num_toe:
    if num_toe[nummer] != [None]:
        if len(num_toe[nummer]) < 3:
            print(nummer, num_toe[nummer])

31314 ['(R1843)']
31725 ['(R1867)']
31449 ['(R1857)']
31754 ['(R1869)']
31740 ['(R1868)']
31422 ['(R1853)']
31429 ['(R1855)']
31900 ['(R1879)']
31846 ['(R1875)']
31882 ['(R1878)']
32167 ['(R1895)']
32148 ['(R1893)']
32049 ['(R1891)']
32227 ['(R1904)']
32170 ['(R1896)']
31969 ['(R1881)']
32142 ['(R1892)']
31970 ['(R1882)']
32166 ['(R1894)']
31808 ['(R1872)']
31797 ['(R1871)']
32020 ['(R1887)']
32178 ['(R1898)']
32406 ['(R1913)']
32251 ['(R1905)']
32329 ['(R1909)']
32485 ['(R1917)']
32179 ['(R1899)']
32510 ['(R1918)']
32313 ['(R1908)']
32330 ['(R1910)']
32041 ['(R1890)']
31879 ['(R1877)']
32482 ['(R1916)']
32354 ['(R1911)']
32407 ['(R1914)']
32511 ['(R1919)']
31872 ['(R1876)']
32028 ['(R1889)']
32365 ['(R1912)']
32816 ['(R1955)']
32686 ['(R1941)']
32691 ['(R1946)']
32682 ['(R1937)']
32724 ['(R1951)']
32662 ['(R1933)']
32690 ['(R1945)']
32628 ['(R1925)']
32535 ['(R1923)']
32689 ['(R1944)']
32630 ['(R1927)']
32737 ['(R1952)']
32629 ['(R1926)']
32684 ['(R1939)']
32688 ['(R1943)']
32663 ['(R

## Kamerstukdossiers with tkapi

Kamerstukken are documents exchanged between government and parliament,
and are organized into dossiers (a.k.a. kamerdossiers, kamerstukdossiers).

# So

In [46]:
# !pip3 install tkapi
import wetsuite.helpers.localdata
import wetsuite.helpers.notebook

import tkapi   
from tkapi.dossier import Dossier
from tkapi.document import DocumentSoort

In [21]:
api = tkapi.TKApi()

tkapi_docs = wetsuite.helpers.localdata.LocalKV('tkapi_docs.db', key_type=str, value_type=bytes)

In [22]:
# If we wanted to download _all_ documents' actual contents, try the following
# (you might prefer to do this per soort, to avoid gigabytes of what you don't want)
#
# This combination will take maybe fifteen minutes - most of it the fetching of .documenten

all_dossiers = api.get_dossiers()

all_docs = []
for dossier in all_dossiers:
    all_docs.extend( dossier.documenten )

In [35]:
print( "Documents to fetch: %d"%len(all_docs))

# fetching all the actual content would probably take hours, also depending on how nice we are being to the servers
count_cached, count_fetched = 0, 0
pb = wetsuite.helpers.notebook.progress_bar( len(all_docs) )
for document in all_docs: # will likely have
    bytestring, came_from_cache = wetsuite.helpers.localdata.cached_fetch( tkapi_docs, document.url )            # json metadata (odata style)
    bytestring, came_from_cache = wetsuite.helpers.localdata.cached_fetch( tkapi_docs, document.bestand_url )    # the document
    if came_from_cache:
        count_cached += 1
    else: 
        count_fetched += 1
    pb.description = f"fetched {count_fetched}, cached {count_cached}  ({(100.*count_cached)/(count_cached+count_fetched):.0f}% cached)  "    
    pb.value += 1

Documents to fetch: 182516


  0%|          | 0/182516 [00:00<?, ?it/s]
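
The reason the loop above can be interrupted and restarted cheaply is the cache: anything already in the store is not fetched again. Conceptually that comes down to something like the sketch below. This is a stand-in written for illustration with a plain dict and a fake fetch function, and `cached_fetch_sketch` is not wetsuite's implementation, which may well differ in its details.

```python
def cached_fetch_sketch(store, url, fetch):
    """Return (data, came_from_cache).
    store is any dict-like object, fetch any callable that maps a URL to bytes."""
    if url in store:
        return store[url], True
    data = fetch(url)
    store[url] = data
    return data, False

calls = []
def fake_fetch(url):
    """Pretend network fetch that records which URLs were requested."""
    calls.append(url)
    return b'content of ' + url.encode('utf8')

store = {}
first = cached_fetch_sketch(store, 'https://example.com/doc', fake_fetch)
second = cached_fetch_sketch(store, 'https://example.com/doc', fake_fetch)
print(first[1], second[1], len(calls))
```

The second call is answered from the store, so the fake fetch function is only called once.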

In [None]:
# for typ, doclist in soorten_by_count:
#     if 'VERSLAG' in typ:
#         print( f'{len(doclist):<10d} {typ}' )
    
#     if typ == 'VERSLAG_VAN_EEN_COMMISSIEDEBAT':
#         count_fetched, count_cached = 0, 0
#         pb = wetsuite.helpers.notebook.progress_bar( len(doclist), description=str(typ) )

#         for doc in doclist:
#             try:
#                 for dossier in doc.dossiers:
#                     toe_s = dossier.toevoeging is not None  and  '-%s'%dossier.toevoeging  or  '   ' # apologies for the syntax-fu
#                     #print( 'Document %s belongs to %5s%-5s  (%s)'%(doc.nummer,  dossier.nummer, toe_s, dossier.titel),  )
#                     #print( '  ', doc.url )
#                     #print( '  ',doc.bestand_url )

#                     bytestring, came_from_cache = wetsuite.helpers.localdata.cached_fetch( tkapi_docs, doc.url)
#             except Exception as e:
#                 print('ERR: '+str(e) )
            
#             #time.sleep(0.1)
#             pb.value += 1

In [40]:
# Out of interest, what kind of documents are they in the first place?
# 
filetypes = collections.defaultdict(int)
import magic

for url, value in tkapi_docs.items(): # ~180K documents will take at least a minute to inspect
#for url, value in tkapi_docs.random_sample(1000):
    # The store contains both the fetched metadata JSON and the fetched document,
    #   we are interested only in the document right now
    if 'Resource()' in url: 
        descr = magic.from_buffer( value )
        if 'Composite Document File' in descr:
            descr = 'Earlier MS Office'
        filetypes[ descr ] +=1

dict( filetypes )

{'PDF document, version 1.4': 161204,
 'PDF document, version 1.3': 20498,
 'PDF document, version 1.2': 409,
 'Microsoft Word 2007+': 230,
 'Earlier MS Office': 106,
 'PDF document, version 1.6': 48,
 'Zip archive data, at least v2.0 to extract': 2,
 'PDF document, version 1.5': 13,
 'PDF document, version 1.7': 6}
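
Since the PDF version is rarely what we care about, it can be easier to read this with the versions folded together. A minimal sketch, working on the counts copied from the output above rather than on the store itself:

```python
import collections

# Counts copied from the output above
filetypes = {
    'PDF document, version 1.4': 161204,
    'PDF document, version 1.3': 20498,
    'PDF document, version 1.2': 409,
    'Microsoft Word 2007+': 230,
    'Earlier MS Office': 106,
    'PDF document, version 1.6': 48,
    'Zip archive data, at least v2.0 to extract': 2,
    'PDF document, version 1.5': 13,
    'PDF document, version 1.7': 6,
}

# Fold all PDF versions into one bucket, leave the other descriptions alone
grouped = collections.Counter()
for descr, count in filetypes.items():
    kind = 'PDF' if descr.startswith('PDF document') else descr
    grouped[kind] += count

total = sum(grouped.values())
for kind, count in grouped.most_common():
    print(f'{count:>7d}  {100.0 * count / total:6.2f}%  {kind}')
```

PDFs of any version add up to 182178 of the 182516 stored documents, so for most purposes a PDF text extractor covers nearly everything.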

In [44]:
# count how often each volgnummer appears
vn = collections.defaultdict(int)
for doc in all_docs:
    vn[ doc.volgnummer ] += 1

for volgnummer, count in sorted( vn.items(), key=lambda x:x[0] ): # sort by volgnummer
    print("%-5s  %s"%(volgnummer, count))

1      6117
2      5636
3      5206
4      4316
5      3957
6      3585
7      3054
8      2647
9      2423
10     2203
11     2026
12     1895
13     1753
14     1640
15     1532
16     1461
17     1385
18     1318
19     1244
20     1182
21     1118
22     1065
23     1012
24     973
25     931
26     898
27     878
28     846
29     815
30     787
31     766
32     741
33     726
34     716
35     699
36     688
37     677
38     668
39     655
40     641
41     634
42     636
43     626
44     621
45     608
46     602
47     596
48     589
49     586
50     583
51     582
52     579
53     563
54     550
55     544
56     536
57     532
58     532
59     524
60     521
61     511
62     507
63     507
64     505
65     501
66     496
67     494
68     488
69     484
70     477
71     472
72     470
73     461
74     461
75     456
76     455
77     448
78     440
79     436
80     426
81     427
82     424
83     418
84     413
85     408
86     401
87     399
88     396
89     39

I ask because I know of documents that are numbered with a letter rather than a number, like
    https://zoek.officielebekendmakingen.nl/kst-35302-F
as part of 
    https://zoek.officielebekendmakingen.nl/dossier/35302
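
If letters like that `F` do end up in the same list as numeric volgnummers, a plain sort on the value would either fail or sort `10` before `2`, depending on the types involved. I have not checked what type tkapi gives for `volgnummer`, so the sketch below makes no assumption about it; `volgnummer_key` and the mixed sample list are made up for illustration.

```python
def volgnummer_key(volgnummer):
    """Sort key: plain numbers first, in numeric order, then anything else by its text."""
    text = '' if volgnummer is None else str(volgnummer)
    if text.isdecimal():
        return (0, int(text), '')
    return (1, 0, text)

mixed = [10, 'F', 2, 'A', 1, None]   # hypothetical mix of values
print(sorted(mixed, key=volgnummer_key))
```

This could be passed as `key=volgnummer_key` wherever the notebook sorts on volgnummer.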


In [49]:
# Let's get that specific dossier:

dossier_filter = Dossier.create_filter()
dossier_filter.filter_nummer('35302')
for dossier in api.get_dossiers( dossier_filter ):
    print('-')
    print( len(dossier.documenten) ) # should be 139, see  https://zoek.officielebekendmakingen.nl/dossier/35302
    for document in sorted( dossier.documenten, key=lambda x:x.volgnummer ):
        print(document.volgnummer, end=' ')
# TODO: figure out whether that's the API or tkapi

-
85
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 
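
One way to start narrowing down that TODO might be to compare the volgnummers we got against the `hoogsteVolgnummer` that the SyncFeed data reports for the same dossier. A minimal sketch of that comparison, on made-up values (`missing_volgnummers` is a name invented here, and documents numbered with a letter are simply ignored):

```python
def missing_volgnummers(seen, highest):
    """Return the volgnummers in 1..highest that do not occur in seen, in order.
    Non-numeric values in seen (such as letters) are ignored."""
    seen_numbers = {int(v) for v in seen if str(v).isdecimal()}
    return [n for n in range(1, highest + 1) if n not in seen_numbers]

print(missing_volgnummers([1, 2, 4, '5', 'F'], highest=7))
```

If this returns an empty list while the document count is still lower than on officielebekendmakingen.nl, the difference is presumably in documents that are not numbered this way.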

Or, even more visibly,
    https://zoek.officielebekendmakingen.nl/dossier/34211
seems to have no documents at all:

In [50]:
dossier_filter = Dossier.create_filter()
dossier_filter.filter_nummer('34211')
for dossier in api.get_dossiers( dossier_filter ):
    print( len(dossier.documenten) )
    for document in sorted( dossier.documenten, key=lambda x:x.volgnummer ):
        print(document.volgnummer)

0
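
To get a feel for how common this is, you could list every dossier that comes back without documents. The sketch below only relies on the `nummer` and `documenten` attributes used earlier, and runs on hypothetical stand-in objects rather than on real tkapi dossiers.

```python
from types import SimpleNamespace

def dossiers_without_documents(dossiers):
    """Return the nummer of every dossier whose documenten list is empty."""
    return [d.nummer for d in dossiers if len(d.documenten) == 0]

# Hypothetical stand-ins that mimic only the two attributes used above
fake_dossiers = [
    SimpleNamespace(nummer=35302, documenten=['doc'] * 85),
    SimpleNamespace(nummer=34211, documenten=[]),
]
print(dossiers_without_documents(fake_dossiers))
```

On the real data this would be `dossiers_without_documents(all_dossiers)`, which is as slow as the `.documenten` fetching was earlier.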
