If you want to use this notebook online without installing Python on your computer, try:
<a href="https://colab.research.google.com/github/WetSuiteLeiden/example-notebooks/blob/main/wetsuite-nlp-crash-course/3-a-first-nlp-experiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Google Colab"/></a> (do note however that this requires a Google account).

# WetSuite NLP crash course
# Part 3: A first NLP experiment

In part 3 of this crash course, we will finally build our first NLP experiment! In this experiment, we will count how often a reference is made to Dutch parliamentary documents (specifically, to the Kamerstukken) in court decisions.

# Sources
This notebook is based on Martijn Staal's bachelor thesis [_De memorie van toelichting overvraagd?_](https://martijn-staal.nl/nl/blog/de-memorie-van-toelichting-overvraagd) and the accompanying [digital appendix](https://github.com/mastaal/de-mvt-overvraagd).

# Step 1: Pre-processing our data
Before we can analyze anything, we of course need to get access to some data! Luckily for us, all published court decisions in the Netherlands are available via an online API (see ["Open Data van de Rechtspraak"](https://www.rechtspraak.nl/Uitspraken/Paginas/Open-Data.aspx)). This makes it relatively straightforward to collect court decisions to analyze. To make our first experiment even easier, we will use the WetSuite sample dataset `rechtspraaknl-sample-xml`, which contains a selection of court decisions that we have already downloaded for you.

So, just like in part 2, we start with installing the wetsuite library and importing it:

In [1]:
!pip3 install -U wetsuite
import wetsuite.datasets



Then we can download the dataset (if you've already downloaded it before and it's still on your PC, it will just load the dataset without re-downloading it):

In [2]:
# rd, short for rechtspraak-dataset
# Note that this download can take a while, as it is about 450 MB.
# Make sure you have enough space available on your device, as the uncompressed
# dataset is nearly 4GB.
rd = wetsuite.datasets.load("rechtspraaknl-sample-xml")

Now, let us see what format the data from the Open Data Rechtspraak API comes in:

In [3]:
# Pick an example key and look up the corresponding value in the dataset
key = "https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:CBB:2022:1"
value = rd.data.get(key)

# The value is stored as bytes, so to print it readably we decode it to a proper string (UTF-8 is the default).
value_str: str = value.decode()

print(value_str)

<?xml version="1.0" encoding="utf-8"?>
<open-rechtspraak>
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:ecli="https://e-justice.europa.eu/ecli" xmlns:tr="http://tuchtrecht.overheid.nl/" xmlns:eu="http://publications.europa.eu/celex/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:bwb="bwb-dl" xmlns:cvdr="http://decentrale.regelgeving.overheid.nl/cvdr/" xmlns:psi="http://psi.rechtspraak.nl/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
    <rdf:Description>
      <dcterms:identifier>ECLI:NL:CBB:2022:1</dcterms:identifier>
      <dcterms:format>text/xml</dcterms:format>
      <dcterms:accessRights>public</dcterms:accessRights>
      <dcterms:modified>2022-03-18T10:07:19</dcterms:modified>
      <dcterms:issued rdfs:label="Publicatiedatum">2022-01-07</dcterms:issued>
      <dcterms:publisher resourceIdentifier="http://rechtspraak.nl/">Raad voor de Rechtspraak</dcterms:publisher>
      <dcterms:language>nl</dcterms:language>
      <dcterms:creator rdfs:la

Well, that is clearly not just the text of the decision. What's with all the `<>`-brackets? It turns out the Open Data Rechtspraak API makes the court decisions available in the machine-readable [XML](https://en.wikipedia.org/wiki/XML) format. Together with [JSON](https://en.wikipedia.org/wiki/JSON), XML is one of the two major general-purpose formats that computers use to exchange information. Many more specific file standards are based on XML.

Generally, the provider of an API also provides documentation on the _structure_ of the XML (or JSON) that is sent by the API. The Dutch Judicial Council does this as well (although it is only available in Dutch): [Open Data van de Rechtspraak, versie 1.15](https://www.rechtspraak.nl/SiteCollectionDocuments/Technische-documentatie-Open-Data-van-de-Rechtspraak.pdf). Starting at page 16, it defines exactly how the XML that we have is structured and what we can find where. The XML for each decision consists of metadata and the actual text with paragraph structures.

In our simple experiment, we just want to count the number of citations per decision, so we can ignore the metadata. In a real experiment, you probably do not want to do that, since the metadata can help you break down your results. For example: maybe specific (types of) courts cite parliamentary documents more often, or there is a specific time frame in which this occurs more often. To research such questions, you will need the metadata.

According to the specification, we can find the text of the decision in the `<uitspraak></uitspraak>` elements. Python has a built-in [XML library](https://docs.python.org/3/library/xml.etree.elementtree.html#tutorial) which we can use to select just the part we need.

So first, let's import the XML library and define the required namespaces:

In [4]:
# Import the XML library
# Note that normally you would put all the imports at the top of your Python file.
import xml.etree.ElementTree as ET

# Define the XML namespaces that we can expect in our XML
# See for more information: https://en.wikipedia.org/wiki/XML_namespace
XML_NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "psi": "http://psi.rechtspraak.nl/",
    "rs": "http://www.rechtspraak.nl/schema/rechtspraak-1.0",
    "ecli": "https://e-justice.europa.eu/ecli"
}

Now we can load the XML data:

In [5]:
value_xml = ET.fromstring(value)

print(value_xml)

<Element 'open-rechtspraak' at 0x7f6d4c092520>


Now we can use Python's [xml library to find the appropriate XML element](https://docs.python.org/3/library/xml.etree.elementtree.html#finding-interesting-elements), which was `<uitspraak></uitspraak>` according to the documentation. Since the namespace for this element is `http://www.rechtspraak.nl/schema/rechtspraak-1.0`, we can use the shorthand `rs` we defined in our `XML_NS` dictionary above. So we need to look for the `rs:uitspraak` element:

In [6]:
uitspraakxml = value_xml.find("rs:uitspraak", XML_NS)

print(uitspraakxml)

<Element '{http://www.rechtspraak.nl/schema/rechtspraak-1.0}uitspraak' at 0x7f6d4c093c40>
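
To make the namespace lookup a bit more concrete, here is a minimal sketch on a small hand-written XML fragment. The fragment is only a stand-in modelled on the output shown earlier, not a real API response, and the names `SAMPLE_XML` and `NS` are made up for this illustration. It shows that searching for `rs:uitspraak` with a namespace dictionary is the same as writing out the full namespace in curly braces, and that the metadata fields we set aside earlier can be read in the same way.

```python
import xml.etree.ElementTree as ET

# Hand-written, heavily shortened stand-in for an API response (assumption).
SAMPLE_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<open-rechtspraak>
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:dcterms="http://purl.org/dc/terms/"
           xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
    <rdf:Description>
      <dcterms:identifier>ECLI:NL:CBB:2022:1</dcterms:identifier>
      <dcterms:issued rdfs:label="Publicatiedatum">2022-01-07</dcterms:issued>
    </rdf:Description>
  </rdf:RDF>
  <uitspraak xmlns="http://www.rechtspraak.nl/schema/rechtspraak-1.0">
    <para>Voorbeeldtekst</para>
  </uitspraak>
</open-rechtspraak>
"""

# Prefix -> namespace mapping, a subset of the XML_NS dictionary in the notebook
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "rs": "http://www.rechtspraak.nl/schema/rechtspraak-1.0",
}

root = ET.fromstring(SAMPLE_XML)

# Two equivalent ways to address the same element
by_prefix = root.find("rs:uitspraak", NS)
by_full_name = root.find("{http://www.rechtspraak.nl/schema/rechtspraak-1.0}uitspraak")
print(by_prefix is by_full_name)

# Metadata: ".//" searches all descendants of the root element
identifier = root.findtext(".//dcterms:identifier", namespaces=NS)
issued = root.findtext(".//dcterms:issued", namespaces=NS)
print(identifier, issued)
```

The prefix itself (`rs`, `dcterms`) is only a local shorthand; what has to match is the namespace URL it stands for.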


Okay, neat. We've found the correct element. But we want to search through the text within the element, preferably without all the XML things. Luckily, Python provides the [xml.etree.ElementTree.tostring](https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.tostring) function, which we can use to convert the content of a parsed XML element into just plain text:

In [7]:
uitspraak_bitstring = ET.tostring(uitspraakxml, encoding="utf-8", method="text")
# Because we pass encoding="utf-8", tostring returns bytes rather than a string, so we need to decode it to get a proper string:
uitspraak_str = uitspraak_bitstring.decode()

print(uitspraak_str)


  
    uitspraak 
    
      
        
          
        
      
      
        
          
        
      
    
    COLLEGE VAN BEROEP VOOR HET BEDRIJFSLEVEN
    
    
      Zaaknummer: 20/1063
    
    
    uitspraak van de meervoudige kamer van 11 januari 2022 in de zaak tussen
    
      [naam 1] , te [woonplaats] , appellant,
    
    
      en
    
    
    de minister van Economische Zaken en Klimaat, verweerder
    (gemachtigde: mr. M. Wullink).
    
    
  
  
    Procesverloop 
    
    
      Bij besluit van 2 oktober 2020 (het primaire besluit) heeft verweerder beslist op de aanvraag van appellant om een investeringssubsidie duurzame energie (ISDE) voor een warmtepomp in het kader van de Regeling nationale EZ-subsidies (Regeling).
    
    
    
      Bij besluit van 2 november 2020 (het bestreden besluit) heeft verweerder het bezwaar tegen het primaire besluit ongegrond verklaard en het primaire besluit gehandhaafd. 
    
    
    
      Appellant heeft tegen het bestred
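
The plain text above still contains a lot of blank lines and indentation left over from the XML layout. That does not matter for simply counting a word, but it can get in the way when you read or print the text. The sketch below shows one possible way to tidy this up; the function name `normalize_whitespace` and the short `raw` sample are made up for this illustration.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces within each line and drop empty lines."""
    lines = (re.sub(r"\s+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# A short sample that mimics the indentation of the output above
raw = "\n  \n    uitspraak \n    \n      Zaaknummer: 20/1063\n    \n"
cleaned = normalize_whitespace(raw)
print(cleaned)
```

Whether you want this step depends on your research question: it makes the text easier to read, but it also throws away the little layout information that was left.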

Now we have managed to convert this one court decision into some neat plaintext, in which we can search for our citations. However, we've only preprocessed this one decision. We will probably want to do that to a lot more decisions. So let us convert this procedure into a re-usable function.

In [8]:
def preprocess_decision(rd: wetsuite.datasets.Dataset, key: str) -> str:
    """Retrieve the value corresponding to the given key and return the decision text in plaintext"""

    value = rd.data.get(key)

    value_xml = ET.fromstring(value)
    uitspraakxml = value_xml.find("rs:uitspraak", XML_NS)

    uitspraak_bitstring = ET.tostring(uitspraakxml, encoding="utf-8", method="text")
    # Because we pass encoding="utf-8", tostring returns bytes rather than a string, so we need to decode it to get a proper string:
    uitspraak_str = uitspraak_bitstring.decode()

    return uitspraak_str

We can now simply call this function with any key to do our pre-processing:

In [9]:
print(preprocess_decision(rd, "https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:RVS:2022:1"))



  
    202101105/1/V1.
    Datum uitspraak: 3 januari 2022
    AFDELING
    BESTUURSRECHTSPRAAK
    Uitspraak met toepassing van artikel 8:54, eerste lid, van de Algemene wet bestuursrecht op het hoger beroep van:
    de staatssecretaris van Justitie en Veiligheid,
    appellant,
    tegen de uitspraak van de rechtbank Den Haag, zittingsplaats Roermond, van 9 februari 2021 in zaak nr. NL19.28807 in het geding tussen:
    [de vreemdeling]
    en
    de staatssecretaris.
    Procesverloop
    Bij besluit van 22 november 2019 heeft de staatssecretaris een aanvraag van de vreemdeling om hem een verblijfsvergunning asiel voor bepaalde tijd te verlenen, afgewezen.
    Bij uitspraak van 9 februari 2021 heeft de rechtbank het daartegen door de vreemdeling ingestelde beroep gegrond verklaard, dat besluit vernietigd en de staatssecretaris opgedragen een nieuw besluit op de aanvraag te nemen met inachtneming van de uitspraak.
    Tegen deze uitspraak heeft de staatssecretaris hoger beroep ingest
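
When you run the pre-processing over many documents, it is worth keeping in mind that `find` returns `None` when it cannot find a matching element, and that `ET.tostring` will then fail. The sketch below is one possible way to guard against that. It works on raw XML bytes rather than on the dataset, and the function name `extract_decision_text` and the two tiny sample documents are assumptions made for this illustration. It also passes `encoding="unicode"`, which makes `ET.tostring` return a string directly, so no decoding step is needed.

```python
import xml.etree.ElementTree as ET
from typing import Optional

RS_NS = {"rs": "http://www.rechtspraak.nl/schema/rechtspraak-1.0"}


def extract_decision_text(xml_bytes: bytes) -> Optional[str]:
    """Return the plain text of the rs:uitspraak element, or None if it is absent."""
    root = ET.fromstring(xml_bytes)
    uitspraak = root.find("rs:uitspraak", RS_NS)
    if uitspraak is None:
        return None
    # encoding="unicode" yields a str instead of bytes
    return ET.tostring(uitspraak, encoding="unicode", method="text")


# Two hand-written sample documents: one with and one without a decision text
with_text = (
    b'<open-rechtspraak>'
    b'<uitspraak xmlns="http://www.rechtspraak.nl/schema/rechtspraak-1.0">'
    b'<para>Procesverloop</para>'
    b'</uitspraak>'
    b'</open-rechtspraak>'
)
without_text = b"<open-rechtspraak></open-rechtspraak>"

results = [extract_decision_text(doc) for doc in (with_text, without_text)]
print(results)
```

Returning `None` leaves it to the calling code to decide whether to skip such a document or to report it.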

# Step 2: Searching for citations

We want to know how many citations there are to Kamerstukken in a court decision. Some examples of such citations are:

```
Kamerstukken II 2022/23, 36 229, nr. 1
Kamerstukken II 2020/21, 35 791, nr. 3
Kamerstukken II 2005/06, 30 316, nr. 3, p. 7–8
Kamerstukken II 2021/22, 29 628, nr. 1051
Kamerstukken II 2019/20, 35 300-XV, nr. 28
Kamerstukken I 2003/04, 27 484 (R 1669), nr. 289
Kamerstukken I 2020/21, 35 570, C
Kamerstukken I 2021/22, 35 925, nr. E
```

So how can we look for them? The first thing that comes to mind is to search for what the examples all have in common: they start with _Kamerstukken_. The simplest way to test if some string is contained in a longer string is using `in`:

In [11]:
print("Kamerstukken" in "False test")
print("Kamerstukken" in "Kamerstukken I 2021/22, 35 925, nr. E")

False
True


However, this only gives a binary result. When there are multiple citations in the same text, it will say _True_ all the same. We need something better to find patterns in text.
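
As a small intermediate step, Python strings also have a `count` method, which does return a number instead of just _True_ or _False_. The sketch below uses a made-up sentence containing two of the example citations. Note that `count` can only look for one fixed piece of text, so it cannot describe the variable parts of a citation such as the year or the dossier number.

```python
# Made-up sentence with two of the example citations
text = (
    "Zie Kamerstukken II 2020/21, 35 791, nr. 3 en "
    "Kamerstukken I 2020/21, 35 570, C."
)

contains = "Kamerstukken" in text          # only tells us whether the word occurs
occurrences = text.count("Kamerstukken")   # counts non-overlapping occurrences

print(contains, occurrences)
```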

One approach to this is to use [_regular expressions_](https://en.wikipedia.org/wiki/Regular_expression) (often abbreviated to just _regexp_ or _regex_). A regex is a description of some textual pattern. In many programming languages, such regexes can be used to search for the patterns which they describe. In Python, the [re](https://docs.python.org/3/library/re.html) library provides this functionality.

For a more thorough introduction to regexes, there's a free Codecademy course: [Learn the Basics of Regular Expressions](https://www.codecademy.com/learn/introduction-to-regular-expressions), and also [RegexOne's interactive online course](https://www.regexone.com/).

If we just want to count citations, we can start out with a rough pattern that is just `Kamerstukken`.

In [13]:
import re

regex_pattern = r"Kamerstukken"

sample_text = preprocess_decision(rd, "https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:HR:2022:394")

re.findall(regex_pattern, sample_text)

['Kamerstukken', 'Kamerstukken']

Great, it seems that we've found two citations! However, our pattern was quite rough. We need to take into account that our pattern might also match _other_ snippets of text that are not the citations we are looking for.
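
One way to check whether a rough pattern also matches text that is not a citation is to look at each match together with a bit of surrounding text. The sketch below does this with `re.finditer`, which gives the position of every match. The function name `matches_with_context` and the sample text are made up for this illustration.

```python
import re


def matches_with_context(pattern: str, text: str, width: int = 30) -> list[str]:
    """Return each match together with up to `width` characters on either side."""
    snippets = []
    for match in re.finditer(pattern, text):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        snippets.append(text[start:end])
    return snippets


# Made-up text: the first mention is not a citation, the second one is
text = (
    "De Kamerstukken zijn openbaar. "
    "Zie Kamerstukken II 2005/06, 30 316, nr. 3, p. 7."
)

snippets = matches_with_context(r"Kamerstukken", text, width=10)
for snippet in snippets:
    print(repr(snippet))
```

Reading through a sample of such snippets by hand gives a first impression of how many false matches a pattern produces.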

Depending on the goals of your research, you might need a more precise pattern, or a rough approach might be good enough.
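
To give an idea of what a more precise pattern could look like, here is a sketch based only on the example citations listed at the start of this step. It assumes that a citation consists of the word _Kamerstukken_, the chamber (`I` or `II`), a parliamentary year such as `2022/23`, and a five-digit dossier number, and it ignores everything that follows. The name `KAMERSTUKKEN_PATTERN` is made up for this illustration, and citations written in a different style would be missed.

```python
import re

# Assumed structure: "Kamerstukken", chamber (I or II), parliamentary year,
# and a five-digit dossier number. Everything after that is ignored.
KAMERSTUKKEN_PATTERN = re.compile(
    r"Kamerstukken\s+(I{1,2})\s+(\d{4}/\d{2}),\s*(\d{2}\s?\d{3})"
)

# The example citations from the start of this step
citations = """\
Kamerstukken II 2022/23, 36 229, nr. 1
Kamerstukken II 2020/21, 35 791, nr. 3
Kamerstukken II 2005/06, 30 316, nr. 3, p. 7–8
Kamerstukken II 2021/22, 29 628, nr. 1051
Kamerstukken II 2019/20, 35 300-XV, nr. 28
Kamerstukken I 2003/04, 27 484 (R 1669), nr. 289
Kamerstukken I 2020/21, 35 570, C
Kamerstukken I 2021/22, 35 925, nr. E
"""

# Add one made-up sentence that mentions the word without citing anything
text = citations + "De Kamerstukken zijn online te raadplegen.\n"

rough_count = len(re.findall(r"Kamerstukken", text))
precise = KAMERSTUKKEN_PATTERN.findall(text)

print(rough_count, len(precise))
print(precise[0])
```

On this text the rough pattern counts nine matches, while the stricter pattern finds only the eight actual citations. Because the pattern contains groups, `findall` returns a tuple of (chamber, year, dossier number) per citation, which can be useful if you later want to count which dossiers are cited most often.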