<a href="https://colab.research.google.com/github/WetSuiteLeiden/data-collection/blob/master/api_rechtspraaknl_many.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Purpose of this notebook

Understanding what you can get out of [rechtspraak.nl](https://www.rechtspraak.nl/).

Note that this is mostly just showing our work. If the rechtspraak [dataset](../../intro/wetsuite_datasets.ipynb) we provide suits your needs,
then running this notebook would just be a slower and more cumbersome way to get basically the same data.

Somewhat related: [extras_datacollect_rechtspraak_codes](extras_datacollect_rechtspraak_codes.ipynb)

# Website, Open data API, and some other notes

You are probably familiar with the [rechtspraak.nl](https://rechtspraak.nl) website, and possibly its [search that has a number of filters](https://uitspraken.rechtspraak.nl/#!/) (and some [extra query logic for the text](https://www.rechtspraak.nl/Uitspraken/Paginas/Hulp-bij-zoeken.aspx#1ab85aa0-e737-4b56-8ad5-d7cb7954718d77a998be-3c73-40e3-90f7-541fceeb00fd3)), which gives webpage results, with text where present.


There is also [Open Data van de Rechtspraak](https://www.rechtspraak.nl/Uitspraken/Paginas/Open-Data.aspx), an API that exposes much the same in data form.

As [its documentation](https://www.rechtspraak.nl/SiteCollectionDocuments/Technische-documentatie-Open-Data-van-de-Rechtspraak.pdf) ([this intro](https://www.rechtspraak.nl/Uitspraken/paginas/open-data.aspx) may also be useful) mentions,
- you stick query parameters on the base URL of http://data.rechtspraak.nl/uitspraken/zoeken 
- the results mainly mention ECLIs, which you can fetch details for via e.g. https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:PHR:2011:BP5608 (more notes below)


Worthy of note:
- the fields you can search mostly match those of the 'Uitgebreid zoeken' at [uitspraken.rechtspraak.nl](https://uitspraken.rechtspraak.nl), such as:
  - instantie / court code (basically that third element in the ECLI)
  - rechtsgebied
  - procedure

- **You can't search in the body text**.  In a practical sense that largely limits the API to a 'keep updated with new cases' feed, specific to your interests, or generally.
  - Worthy of note: the website search (queried via `https://uitspraken.rechtspraak.nl/api/zoek`) does support this, and even seems like a better data API than the _actual_ data API, but it doesn't look like it's supposed to be used externally.

- there are plenty of cases where there is no text / document.  You can filter for this in the search.

- the **identifiers** used are **ECLI** (European Case Law Identifier)
  - which in this case will be mainly Dutch ECLIs  (`ECLI:NL:`...), used since 2013 or so, and which absorbed the previously used LJN identifiers.
  - court code XX (`ECLI:NL:XX:`...) is used for things not (yet?) assigned to a court, and/or non-Dutch ECLIs,
    - rechtspraak.nl may later resolve such ECLIs to a different ECLI. That makes their site the most up-to-date source, which a mirror may not (yet) have caught up with.
    - Example: ECLI:NL:XX:2009:BJ4574 
      - in [XML metadata](https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:XX:2009:BJ4574) mentions it's now (isReplacedBy) ECLI:CE:ECHR:2009:0528JUD002671305
      - in [webpage form](https://uitspraken.rechtspraak.nl/#!/details?id=ECLI:NL:XX:2009:BJ4574) also seems to link to [a place to find it](https://hudoc.echr.coe.int/eng#%7B%22ecli%22:%5B%22ECLI:CE:ECHR:2009:0528JUD002671305%22%5D%7D) (e.g. `hudoc.echr.coe.int` or `e-justice.europa.eu`).
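
Since several of the notes above hinge on the parts of an ECLI (country, court code, year), a minimal sketch of splitting one up may help. It assumes the usual five colon-separated fields; `EcliParts` and `split_ecli` are names made up for this illustration, not wetsuite functions.

```python
from typing import NamedTuple


class EcliParts(NamedTuple):
    country: str   # e.g. 'NL'
    court: str     # e.g. 'PHR', 'RBAMS', or 'XX'
    year: str
    number: str    # the case-specific part


def split_ecli(ecli: str) -> EcliParts:
    """Split e.g. 'ECLI:NL:PHR:2011:BP5608' into the four parts after the 'ECLI' prefix."""
    fields = ecli.strip().split(':')
    if len(fields) != 5 or fields[0].upper() != 'ECLI':
        raise ValueError('does not look like an ECLI: %r' % ecli)
    return EcliParts(*fields[1:])


print(split_ecli('ECLI:NL:PHR:2011:BP5608').court)   # PHR
print(split_ecli('ECLI:NL:XX:2009:BJ4574').court)    # XX
```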

For each ECLI you might consider various URLs, including
  - XML, e.g. at https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:PHR:2011:BP5608
    - most interesting when you want case details as data
    - (there is a different XML form at https://uitspraken.rechtspraak.nl/api/document/?id=ECLI:NL:PHR:2011:BP5608 but this is for the webpage view, and is less interesting as data)
  - the case on the website
    - is linked by the website as   https://uitspraken.rechtspraak.nl/InzienDocument?id=ECLI:NL:PHR:2011:BP5608
    - it seems the slightly shorter https://deeplink.rechtspraak.nl/uitspraak?id=ECLI:NL:PHR:2011:BP5608 is equivalent
    - both of the above redirect to a URL like https://uitspraken.rechtspraak.nl/#!/details?id=ECLI:NL:PHR:2011:BP5608
      - which is a general page with scripting that picks up that identifier and then does another request to https://uitspraken.rechtspraak.nl/api/document/?id=ECLI:NL:PHR:2011:BP5608 (the webpage-view variant mentioned earlier)
  - The website also often links to LiDo, e.g.  https://linkeddata.overheid.nl/document/ECLI:NL:PHR:2011:BP5608
    - If you want that as data, consider http://linkeddata.overheid.nl/service/get-links?ext-id=ECLI:NL:PHR:2011:BP5608&output=xml
    - though as https://linkeddata.overheid.nl/front/portal/services notes, this is not part of public LiDo, so you'll need to request an account first
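
The URL variants listed above differ only in their prefix, so they can be generated from the ECLI with plain string formatting. A small sketch follows; `urls_for_ecli` and its dictionary keys are made up here, and nothing is fetched.

```python
def urls_for_ecli(ecli: str) -> dict:
    """Collect the URL variants listed above for one ECLI (string formatting only)."""
    return {
        'xml':      'https://data.rechtspraak.nl/uitspraken/content?id=%s' % ecli,
        'web':      'https://uitspraken.rechtspraak.nl/InzienDocument?id=%s' % ecli,
        'deeplink': 'https://deeplink.rechtspraak.nl/uitspraak?id=%s' % ecli,
        'details':  'https://uitspraken.rechtspraak.nl/#!/details?id=%s' % ecli,
        'lido':     'https://linkeddata.overheid.nl/document/%s' % ecli,
    }


for name, url in urls_for_ecli('ECLI:NL:PHR:2011:BP5608').items():
    print('%-10s %s' % (name, url))
```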


Because there are over three million Dutch ECLIs on record, getting a lot of data from here would take a while, and they ask you to be nice to their server.

There used to be a ZIP file (linked from [https://www.rechtspraak.nl/Uitspraken/paginas/open-data.aspx](https://www.rechtspraak.nl/Uitspraken/paginas/open-data.aspx)) that you could download to bootstrap your own copy. That link seems to have been removed, as [this open data request](https://data.overheid.nl/community/datarequest/zip-bestand-alle-uitspraken) seems to confirm, which implies that you should get the data by hammering the API instead.
(As of this writing the URL to the ZIP file still works, but they probably removed the link for a reason.)

In [1]:
import collections, time, random, datetime

import wetsuite.datacollect.rechtspraaknl
import wetsuite.datasets
import wetsuite.helpers.date
import wetsuite.helpers.etree
import wetsuite.helpers.koop_parse
import wetsuite.helpers.meta
import wetsuite.helpers.net
import wetsuite.helpers.localdata
import wetsuite.helpers.notebook

## Example of API search, its results, and fetching

### Querying 

The base URL for data is http://data.rechtspraak.nl/uitspraken/.

Browsing to just that will link you to some identifier/value lists.

The **search** base is http://data.rechtspraak.nl/uitspraken/zoeken

Search parameters include: (again, see [the documentation](https://www.rechtspraak.nl/SiteCollectionDocuments/Technische-documentatie-Open-Data-van-de-Rechtspraak.pdf))
* `type` - `Uitspraak` or `Conclusie`
* `return` - if you specify `return=DOC` you only get entries for which there is a document; if not you also get entries for which there is only metadata
* `from`, `max`  (from is 0-based, max value of max is documented as 1000)
* `sort` - default is by modification date, ascending. `DESC` lets you do descending instead.

* `date` - date of this uitspraak / conclusie
* `uitspraakdatum` - (date, or date range; optional)
* `instantie` - as mentioned in https://data.rechtspraak.nl/Waardelijst/Instanties 
* `subject` - rechtsgebied as mentioned in https://data.rechtspraak.nl/Waardelijst/Rechtsgebieden
* `modified` - last change of the metadata and/or text (with some subtleties, e.g. not necessarily of the uitspraak or conclusie document)
* `replaces` - previous ECLI, or LJN, for this case.  Meant for backwards compatibility (of searches?).

For example: http://data.rechtspraak.nl/uitspraken/zoeken?modified=2023-01-01&max=50
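
Such a URL can be put together with the standard library. The sketch below only builds URLs and does not fetch anything; `build_search_url` is a name made up for this illustration. It takes a list of pairs rather than a dict so that a parameter can be repeated, as the code further down does with two `modified` values for a date range.

```python
import urllib.parse

SEARCH_BASE = 'http://data.rechtspraak.nl/uitspraken/zoeken'   # the search base mentioned above


def build_search_url(params):
    """params: list of (name, value) pairs; names may repeat."""
    return SEARCH_BASE + '?' + urllib.parse.urlencode(params)


print(build_search_url([('modified', '2023-01-01'), ('max', '50')]))
# http://data.rechtspraak.nl/uitspraken/zoeken?modified=2023-01-01&max=50

# paging through a larger result: same query, increasing 'from'
for start in (0, 1000, 2000):
    print(build_search_url([('modified', '2023-01-01'), ('modified', '2023-01-08'),
                            ('from', str(start)), ('max', '1000')]))
```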

The response format is [Atom](https://en.wikipedia.org/wiki/Atom_(standard)) and entries are minimal: identifier, title, summary (empty in the following example), an `updated` timestamp, and a link (in the following example, to the case's page on the website).

        <entry>
            <id>ECLI:NL:RBAMS:2021:8211</id>
            <title type="text">ECLI:NL:RBAMS:2021:8211, Rechtbank Amsterdam, 29-06-2021, C/13/702999 / KG ZA 21-458</title>
            <summary type="text"/>
            <updated>2023-01-04T15:21:26Z</updated>
            <link rel="alternate" type="text/html" href="https://uitspraken.rechtspraak.nl/details?id=ECLI:NL:RBAMS:2021:8211"/>
        </entry>
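
To make concrete how little each entry carries, here is a standard-library sketch that pulls those fields out of such an entry. It assumes the feed uses the standard Atom namespace; the wrapping `<feed>` element and the function name `atom_entries` are made up for this example.

```python
import xml.etree.ElementTree as ET

ATOM = '{http://www.w3.org/2005/Atom}'   # assumption: the standard Atom namespace

feed_xml = b'''<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>ECLI:NL:RBAMS:2021:8211</id>
    <title type="text">ECLI:NL:RBAMS:2021:8211, Rechtbank Amsterdam, 29-06-2021, C/13/702999 / KG ZA 21-458</title>
    <summary type="text"/>
    <updated>2023-01-04T15:21:26Z</updated>
    <link rel="alternate" type="text/html" href="https://uitspraken.rechtspraak.nl/details?id=ECLI:NL:RBAMS:2021:8211"/>
  </entry>
</feed>'''


def atom_entries(xml_bytes):
    """Return one dict per <entry>, with the handful of fields shown above."""
    root = ET.fromstring(xml_bytes)
    entries = []
    for entry in root.iter(ATOM + 'entry'):
        link = entry.find(ATOM + 'link')
        entries.append({
            'ecli':    entry.findtext(ATOM + 'id'),
            'title':   entry.findtext(ATOM + 'title'),
            'summary': entry.findtext(ATOM + 'summary') or '',   # empty element -> ''
            'updated': entry.findtext(ATOM + 'updated'),
            'link':    link.get('href') if link is not None else None,
        })
    return entries


print(atom_entries(feed_xml))
```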

...but let's get code to help:

In [None]:
## First get the search-result metadata

search_result_entries = {}

# Note: since paging with at_a_time is necessary anyway (there can be more than 1000 results per day), the exact value of increment_days is barely relevant
for range_from, range_to in wetsuite.helpers.date.date_ranges( from_date=wetsuite.helpers.date.date_weeks_ago(6), # or e.g. '2024-01-01',
                                                               to_date  =wetsuite.helpers.date.date_today(),      
                                                               increment_days=7, strftime_format="%Y-%m-%d" ):
    from_position, at_a_time = 0, 1000   # note: 'max' seems capped at 1000, so probably don't touch that;
    
    while True: # the code below does multiple fetches until we've fetched that range.

        ## The query
        # You can ask for fairly specific things like...
        #query = [
        #    ('from', str(from_position)),  
        #    ('max',  str(at_a_time)),     # max seems capped at 1000, so we have to do more in multiple fetches
        #    ('creator', 'http://standaarden.overheid.nl/owms/terms/Rechtbank_Den_Haag'),   #('creator', 'http://psi.rechtspraak.nl/buitenlandseInstantie'),
        #    ('subject', 'http://psi.rechtspraak.nl/rechtsgebied#civielRecht_intellectueeleigendomsrecht'),
        #    ('return', 'DOC'),                                          # DOC asks for things with body text only
        #    #('modified', '2023-12-28'),
        #]
        #
        # ...but this notebook happens to be more about updating in bulk, so:
        query = []
        query.extend( [('modified',range_from), ('modified',range_to)] )

        ## Fetch more than the first hits
        # We are only allowed to fetch 1000 search results at a time -- fair enough -- so when we do want all, we need to do some extra queries.
        # The following fragment, the while above, and the break below that stops it, are the way we do that here.
        query.extend( [ 
            ('from', str(from_position)),
            ('max',  str(at_a_time)    ),
        ] )
        print('Querying:', query)

        search_results = wetsuite.datacollect.rechtspraaknl.search( query )
        # That function returns a parsed etree object. We could show that relatively raw like...
        #print( wetsuite.helpers.etree.debug_pretty(search_results) ) 
        # ...yet our parsed form (each entry as a dict) is a little simpler to read:

        search_entries = wetsuite.datacollect.rechtspraaknl.parse_search_results( search_results )
        # we could make search_result_entries a list and just extend() it, but there would probably be a lot of overlap,
        # so instead we use the ECLI to deduplicate
        #print('adding %d entries'%len(search_entries))
        for entry_dict in search_entries:
            search_result_entries[ entry_dict.get('ecli') ] = entry_dict 

        if len(search_entries) < at_a_time: # last bunch? (counts the parsed entries rather than the children of the etree object)
            break
        # otherwise go on to fetch next page
        from_position += at_a_time
        time.sleep(2)

    print( "Cumulative query results: %d"%len(search_result_entries) )

### Fetching the documents the search refers to

In [4]:
rechtspraak_fetched = wetsuite.helpers.localdata.LocalKV('rechtspraak_fetched.db', key_type=str, value_type=bytes)

In [None]:
## Now get the text belonging to each
paths            = collections.defaultdict(int)
count_variations = collections.defaultdict(int)

pbar = wetsuite.helpers.notebook.progress_bar( len(search_result_entries) )

for entry in search_result_entries.values():
    pbar.description = f"Fetched:{count_variations['fetched']}, cached:{count_variations['cached']}, errors:{count_variations['error']}"
    pbar.value += 1
    
    entry_xml_url = 'https://data.rechtspraak.nl/uitspraken/content?id=%s'%entry['ecli']
    # TODO: timeout catch
    try:
        xmlbytes, came_from_cache = wetsuite.helpers.localdata.cached_fetch( rechtspraak_fetched, entry_xml_url )
        if came_from_cache:
            count_variations['cached']  +=1
        else:
            count_variations['fetched'] += 1
            time.sleep( 2 ) # be somewhat nice to the servers

    except ValueError as ve:
        count_variations['error'] += 1
        print( '%s for %r'%(ve, entry_xml_url) )
        time.sleep( 30 ) # be somewhat nicer to the servers

print( f"Fetched:{count_variations['fetched']}, cached:{count_variations['cached']}, errors:{count_variations['error']}\n" ) # because the progress bar doesn't update after iterating

Note:
See also [extras_diagnose_rechtspraak_docstructure (in the notebook repository)](https://github.com/WetsuiteLeiden/example-notebooks/blob/main/specific-experiments/investigate-document-structures/rechtspraak_docstructure.ipynb), an exploration of the documents that informs some of the choices made below,
and in particular some of the helper code in the `wetsuite.datacollect.rechtspraaknl` module.

# Start making a dataset

## Sample dataset of raw XML documents

First something easy: a small sample of raw documents

In [7]:
rechtspraaknl_sample_xml = wetsuite.helpers.localdata.LocalKV('rechtspraaknl-sample-xml.db', str, None)
#rechtspraaknl_sample_xml.truncate()
rechtspraaknl_sample_xml._put_meta('description_short', '''A small sample of the XML form available at rechtspraak.nl: documents from 2022 on''' )
rechtspraaknl_sample_xml._put_meta('description',       '''rechtspraak.nl XML documents from 2022 on.
The key is the URL it came from, the value is the raw XML document as a bytestring.''' )

In [None]:
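# Copy documents whose URL mentions 2022, 2023, or 2024, and that have body text, into the sample store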
for url in wetsuite.helpers.notebook.ProgressBar( rechtspraak_fetched.keys() ):
    if ':2022:' in url  or  ':2023:' in url   or  ':2024:' in url: # cheaper than parsing
        xmlbytes = rechtspraak_fetched.get( url )
        if b'<conclusie' in xmlbytes or b'<uitspraak' in xmlbytes: # cheaper than parsing
            rechtspraaknl_sample_xml.put( url, xmlbytes, commit=False ) # the commit thing and the next line makes the writing faster
rechtspraaknl_sample_xml.commit()

rechtspraaknl_sample_xml.summary(True)

## Fuller dataset, more parsed

In [10]:
rechtspraaknl_struc = wetsuite.helpers.localdata.MsgpackKV('rechtspraaknl-struc.db', str, None) # 12GByte uncompressed right now
#rechtspraaknl_struc.truncate()
rechtspraaknl_struc._put_meta('description_short', '''Cases from rechtspraak.nl, in a more pre-parsed form.''' )
rechtspraaknl_struc._put_meta('description',       '''Cases from rechtspraak.nl, in a more pre-parsed form.

A key is an URL like
    'https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:RBAMS:2012:BY7448'

And the values (with [...] cutting off longer text fields)
    {'identifier': 'ECLI:NL:RBAMS:2012:BY7448',
    'issued': '2013-04-05',
    'publisher': 'Raad voor de Rechtspraak',
    'replaces': 'ECLI:NL:RBAMS:2012:6574',
    'date': '2012-11-07',
    'type': 'Uitspraak',
    'modified': '2022-03-10T09:10:56',
    'zaaknummer': '1320690 \\ HA EXPL  12-35',
    'creator': 'Rechtbank Amsterdam',
    'subject': 'Civiel recht',
    'inhoudsindicatie': '\nHet vragen van een persoonlijke garantstelling door een advocaat van een bestuurder voor openstaande [...]',
    'bodytext': "\nvonnis\nRECHTBANK  AMSTERDAM\n\nSector kanton\nlocatie: Amsterdam\n\nZaaknummer en rolnummer: 1320690 \\ HA EXPL  12-35\nUitspraak: [...]"}

'''+wetsuite.datasets.generated_today_text() )

### Efficiently skipping the majority we want to skip

A `len(rechtspraak_fetched)` would show us that we've fetched 3.3 million things. 

It turns out that the majority contain only metadata and no text. 
Our interest is only cases with text.

That would mean some two and a half million parses would be done _only_ to figure out we won't use that parse.
So we create a store that remembers that, so we can look it up and skip parsing.
- This could go at 10000 items per second when skipping cases we know have no text (so a few minutes total)
- and may be at ~100/s if we check everything (so hours the first time)

That difference is why we spend time on this optimization.

(Note that we are currently not sure whether cases that were fetched without text may have text added later.
If cases do change like that _and_ the fetching actually picks up that change, then that update should also _remove_ the case from the following store.)
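
The idea can be sketched without any of the wetsuite stores: a cheap byte-substring test decides whether a document has body text, and a set remembers the ones that do not, so that later runs skip them without even loading the document. The names `has_body_text` and `urls_to_parse`, and the miniature data, are made up for this illustration; a plain dict and set stand in for the real stores.

```python
def has_body_text(xml_bytes: bytes) -> bool:
    """Cheap substring test, no XML parsing: is there a <conclusie> or <uitspraak> element?"""
    return b'<conclusie' in xml_bytes or b'<uitspraak' in xml_bytes


def urls_to_parse(fetched: dict, known_notext: set) -> list:
    """Return the URLs worth parsing; newly found no-text URLs are added to known_notext."""
    todo = []
    for url in fetched:
        if url in known_notext:          # fast path: decided on an earlier run
            continue
        if has_body_text(fetched[url]):
            todo.append(url)
        else:
            known_notext.add(url)        # remember, so the next run can skip it
    return todo


# made-up miniature stand-ins for the real stores
fetched = {
    'url-a': b'<doc><metadata/></doc>',
    'url-b': b'<doc><metadata/><uitspraak>tekst</uitspraak></doc>',
}
known_notext = set()
print(urls_to_parse(fetched, known_notext))   # ['url-b']
print(known_notext)                           # {'url-a'}
```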

In [5]:
rechtspraaknl_knownnotext = wetsuite.helpers.localdata.LocalKV('rechtspraaknl_knownnotext.db', str, None)

In [7]:
# This updates that 'does it contain text?' store before we actually start parsing.
# Should take a few minutes -- except the first time you run this, when every document still needs checking
known_notext_test = set(rechtspraaknl_knownnotext.keys()) # fetched once and placed into a set for somewhat faster lookups

for url in wetsuite.helpers.notebook.ProgressBar( rechtspraak_fetched.keys() ):
    if url in known_notext_test: #rechtspraaknl_knownnotext:
        continue
    else:
        xmlbytes = rechtspraak_fetched.get( url )
        # it's a case we didn't previously check
        if b'<conclusie' not in xmlbytes and b'<uitspraak' not in xmlbytes: # cheaper than parsing before deciding based on the parse?
            rechtspraaknl_knownnotext.put(url, 'y')
        #else:
        #    rechtspraaknl_knownnotext.put(url, 'n')

  0%|          | 0/3345186 [00:00<?, ?it/s]

### (incremental) parsing and storing

In [12]:
# these are used in the update section.
known_notext_urls      = set( rechtspraaknl_knownnotext.keys() )  # put those into a set, not list, so that the 'in' test is fast.
already_extracted_urls = set( rechtspraaknl_struc.keys() )        # assuming you didn't just truncate(), we can skip the things we have and only update

print( f'Known to have no text: {len(known_notext_urls)};   already extracted: {len(already_extracted_urls)}' )

Known to have no text: 2564616;   already extracted: 771860


In [13]:
# When updating just a few hundred items, this may take only a minute

count_variations = collections.defaultdict(int)

selected_keys = list(rechtspraak_fetched.keys()) # all keys, the real version. You might do a  random.sample( selected_keys, 5000 )  while debugging this
pbar = wetsuite.helpers.notebook.progress_bar( len(selected_keys), description='parsing...')

for url in selected_keys:
    pbar.value += 1

    pbar.description = f"{count_variations['conclusies']} new conclusies, {count_variations['uitspraken']} new uitspraken, {count_variations['notext']} notext, {count_variations['present']} already present " 

    if url in known_notext_urls: # we previously figured it had no text, no sense trying to parse it
        count_variations['notext'] += 1
        continue

    if url in already_extracted_urls:  # we previously extracted text (won't update). This is most of the speed increase, probably
        count_variations['present'] += 1
        continue


    ## Load and parse
    xmlbytes = rechtspraak_fetched.get(url)
    # skip parse if we think it's not worth it -- this test is a little cheaper than parsing before deciding based on the parse?
    if b'<conclusie' not in xmlbytes and b'<uitspraak' not in xmlbytes: 
        count_variations['notext'] += 1
        print("missed earlier check?", url) # (or have fetched new things we haven't run through that yet)
        continue

    # actually parse XML
    tree = wetsuite.helpers.etree.fromstring( xmlbytes )
    tree = wetsuite.helpers.etree.strip_namespace( tree )

    content = wetsuite.datacollect.rechtspraaknl.parse_content( tree )

    rechtspraaknl_struc.put(url, content )

    if tree.find('uitspraak') is not None:
        count_variations['uitspraken'] += 1
    elif tree.find('conclusie') is not None:
        count_variations['conclusies'] += 1
    else: # actually shouldn't happen, in that the above should have caught that
        print("PROBLEM", url)
        count_variations['notext'] += 1

#print(f"{count_variations['conclusies']} conclusies and {count_variations['uitspraken']} uitspraken   (and {count_variations['notext']} that have no text)")

parsing...:   0%|          | 0/3345186 [00:00<?, ?it/s]

## Some debug

In [16]:
display( rechtspraaknl_struc.summary(True) )

#rechtspraaknl_struc.random_sample(2)

{'size_bytes': 13874950144,
 'size_readable': '12.9GiB',
 'num_items': 780570,
 'avgsize_bytes': 17775,
 'avgsize_readable': '17KiB'}