If you want to use this notebook online without installing Python on your computer, try:
<a href="https://colab.research.google.com/github/WetSuiteLeiden/example-notebooks/blob/main/wetsuite-nlp-crash-course/3-a-first-nlp-experiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Google Colab"/></a> (do note however that this requires a Google account).

# WetSuite NLP crash course
# Part 3: A first NLP experiment

In part 3 of this crash course, we will finally build our first NLP experiment! In this experiment, we will count how often a reference is made to Dutch parliamentary documents (specifically, to the Kamerstukken) in court decisions.

# Sources
This notebook is based on Martijn Staal's bachelor thesis [_De memorie van toelichting overvraagd?_](https://martijn-staal.nl/nl/blog/de-memorie-van-toelichting-overvraagd) and the accompanying [digital appendix](https://github.com/mastaal/de-mvt-overvraagd).

# Step 1: Pre-processing our data
Before we can analyze anything, we of course need to get access to some data! Luckily for us, all the published court decisions in the Netherlands are also available via an online API (see ["Open Data van de Rechtspraak"](https://www.rechtspraak.nl/Uitspraken/Paginas/Open-Data.aspx)). This means that it is relatively straightforward to collect court decisions to analyze. To make our first experiment even easier, we will use the WetSuite sample dataset `rechtspraaknl-sample-xml`, for which some of the court decisions have already been downloaded.

So, just like in part 2, we start with installing the wetsuite library and importing it:

In [None]:
!pip3 install -U wetsuite
import wetsuite.datasets

Then we can download the dataset (if you've already downloaded it before and it's still on your PC, it will just load the dataset without re-downloading it):

In [None]:
# rd, short for rechtspraak-dataset
# Note that this download can take a while, as it is about 450 MB.
# Make sure you have enough space available on your device, as the uncompressed
# dataset is nearly 4GB.
rd = wetsuite.datasets.load("rechtspraaknl-sample-xml")

Now, let us see what format the data from the Open Data Rechtspraak API comes in:

In [None]:
# Pick an example key and look up its value in the dataset
key = "https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:CBB:2022:1"
value = rd.data.get(key)

# The value is stored as a bytes object, so to print it readably we decode it (as UTF-8, the default) to a regular string.
value_str: str = value.decode()

print(value_str)

Well, that is clearly not just the text of the decision. What's with all the `<>`-brackets? It turns out the Open Data Rechtspraak API makes the court decisions available in the machine-readable [XML](https://en.wikipedia.org/wiki/XML) format. Together with [JSON](https://en.wikipedia.org/wiki/JSON), XML is one of the two major general-purpose formats that computers use to exchange information. Many more specific file standards are based on XML.

Generally, the provider of an API also provides documentation on the _structure_ of the XML (or JSON) that is sent by the API. The Dutch Judicial Council does this as well (although it is only available in Dutch): [Open Data van de Rechtspraak, versie 1.15](https://www.rechtspraak.nl/SiteCollectionDocuments/Technische-documentatie-Open-Data-van-de-Rechtspraak.pdf). Starting at page 16, this document defines exactly how the XML is structured and what we can find where. The XML for each decision consists of metadata and the actual text with paragraph structures.

In our simple experiment, we just want to count the number of citations per decision, so we can ignore the metadata. In a real experiment, you probably do not want to do that, since the metadata can help you break down your results. For example: maybe specific (types of) courts cite parliamentary documents more often, or there is a specific time frame in which this occurs more often. To research such questions, you will need the metadata.
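
As a rough sketch of what reading metadata could look like, the example below parses a small hand-written document, not real API output. The element names `dcterms:identifier`, `dcterms:creator` and `dcterms:date`, their position inside `rdf:RDF/rdf:Description`, and the helper name `read_metadata` are assumptions made for illustration; check the specification for the fields that are actually available.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used in this sketch
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
}

# A hand-written toy document (an assumption, NOT real API output)
TOY_XML = b"""<open-rechtspraak>
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:dcterms="http://purl.org/dc/terms/">
    <rdf:Description>
      <dcterms:identifier>ECLI:NL:XX:2022:1</dcterms:identifier>
      <dcterms:creator>Voorbeeldrechtbank</dcterms:creator>
      <dcterms:date>2022-01-04</dcterms:date>
    </rdf:Description>
  </rdf:RDF>
</open-rechtspraak>"""


def read_metadata(xml_bytes: bytes) -> dict:
    """Hypothetical helper: collect a few metadata fields, or {} if there is no description block."""
    root = ET.fromstring(xml_bytes)
    description = root.find("rdf:RDF/rdf:Description", NS)
    if description is None:
        return {}
    fields = {}
    for name in ("identifier", "creator", "date"):
        # findtext returns the element's text, or the default if the element is missing
        fields[name] = description.findtext(f"dcterms:{name}", default=None, namespaces=NS)
    return fields


metadata = read_metadata(TOY_XML)
print(metadata)
```

With fields like these next to each decision text, citation counts could later be grouped per court or per year.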

According to the specification, we can find the text of the decision in the `<uitspraak></uitspraak>` elements. Python has a built-in [XML library](https://docs.python.org/3/library/xml.etree.elementtree.html#tutorial) which we can use to select just the part we need.

So first, let's import the XML library and define the required namespaces:

In [None]:
# Import the XML library
# Note that you would normally put all the imports at the top of your Python file.
import xml.etree.ElementTree as ET

# Define the XML namespaces that we can expect in our XML
# For more information, see: https://en.wikipedia.org/wiki/XML_namespace
XML_NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "psi": "http://psi.rechtspraak.nl/",
    "rs": "http://www.rechtspraak.nl/schema/rechtspraak-1.0",
    "ecli": "https://e-justice.europa.eu/ecli"
}
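
To illustrate why such a prefix dictionary is useful, here is a minimal sketch on a hand-written toy document (not real API output). ElementTree stores a namespaced tag as `{namespace-uri}localname`, so a search for the bare name finds nothing, while the prefix form and the fully spelled-out form find the same element. The names `RS_URI`, `NS` and `toy` are only used for this illustration.

```python
import xml.etree.ElementTree as ET

RS_URI = "http://www.rechtspraak.nl/schema/rechtspraak-1.0"
NS = {"rs": RS_URI}

# Toy document: the <uitspraak> element lives in the rechtspraak namespace
toy = ET.fromstring(f'<root><uitspraak xmlns="{RS_URI}">tekst</uitspraak></root>')

# Three ways of searching for the same element
by_prefix = toy.find("rs:uitspraak", NS)            # prefix, resolved via the dictionary
by_full_name = toy.find(f"{{{RS_URI}}}uitspraak")   # full "{uri}localname" notation
without_namespace = toy.find("uitspraak")           # bare name: does not match

print(by_prefix.tag)
print(by_prefix is by_full_name)
print(without_namespace)
```

The prefix itself is just a local shorthand: any key could be used in the dictionary, as long as it maps to the right namespace URI.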

Now we can load the XML data:

In [None]:
value_xml = ET.fromstring(value)

print(value_xml)
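
Printing an `Element` only shows its tag and a memory address, which says little about what is inside. A small sketch of how the structure can be inspected, using a hand-written toy document whose layout is an assumption and not real API output: iterating over an element yields its direct children, and each child's `tag` includes the namespace in curly braces.

```python
import xml.etree.ElementTree as ET

# Toy document with one metadata-like child and one text-like child
toy_root = ET.fromstring(
    '<open-rechtspraak>'
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>'
    '<uitspraak xmlns="http://www.rechtspraak.nl/schema/rechtspraak-1.0">'
    '<para>Tekst</para>'
    '</uitspraak>'
    '</open-rechtspraak>'
)

# Iterating over an element gives its direct children
child_tags = [child.tag for child in toy_root]

print(toy_root.tag)
for tag in child_tags:
    print(" ", tag)
```

The same loop can be run on `value_xml` to see which top-level elements a real decision contains.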

Now we can use Python's [xml library to find the appropriate XML element](https://docs.python.org/3/library/xml.etree.elementtree.html#finding-interesting-elements), which was `<uitspraak></uitspraak>` according to the documentation. Since the namespace for this element is `http://www.rechtspraak.nl/schema/rechtspraak-1.0`, we can use the shorthand `rs` we defined in our `XML_NS` dictionary above. So we need to look for the `rs:uitspraak` element:

In [None]:
uitspraakxml = value_xml.find("rs:uitspraak", XML_NS)

print(uitspraakxml)

Okay, neat. We've found the correct element. But we want to search through the text within the element, preferably without all the XML markup. Luckily, Python provides the [xml.etree.ElementTree.tostring](https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.tostring) function, which, when called with `method="text"`, converts the content of a parsed XML element into just plain text:

In [None]:
uitspraak_bitstring = ET.tostring(uitspraakxml, encoding="utf-8", method="text")
# Because we pass encoding="utf-8", tostring returns a bytes object, so we need to decode it to get a proper string:
uitspraak_str = uitspraak_bitstring.decode()

print(uitspraak_str)
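
To make clearer what `method="text"` does, the sketch below runs it on a hand-written toy element (the `para` and `emphasis` tags are only illustrative). It concatenates the text of all nested elements and keeps the original whitespace, which is the same result as joining `itertext()`. Passing `encoding="unicode"` is a documented alternative that returns a `str` directly, so no decoding step is needed. Collapsing whitespace afterwards is an optional extra step, shown here on the assumption that it makes later searching easier.

```python
import re
import xml.etree.ElementTree as ET

# Toy element with nested markup and a line break between paragraphs
toy = ET.fromstring(
    "<uitspraak><para>Zie <emphasis>Kamerstukken II</emphasis> 2019/20,</para>\n"
    "   <para>35 300, nr. 3.</para></uitspraak>"
)

# encoding="unicode" makes tostring return a str instead of bytes
raw_text = ET.tostring(toy, encoding="unicode", method="text")

# Joining all text fragments by hand gives the same result
same_text = "".join(toy.itertext())

# Optional: collapse runs of whitespace (including newlines) into single spaces
normalized = re.sub(r"\s+", " ", raw_text).strip()

print(repr(raw_text))
print(normalized)
```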

Now we have managed to convert this one court decision into some neat plaintext, in which we can search for our citations. However, we've only preprocessed this one decision. We will probably want to do that to a lot more decisions. So let us convert this procedure into a re-usable function.

In [None]:
def preprocess_decision(rd: wetsuite.datasets.Dataset, key: str) -> str:
    """Retrieve the value corresponding to the given key and return the decision text in plaintext"""

    value = rd.data.get(key)

    value_xml = ET.fromstring(value)
    uitspraakxml = value_xml.find("rs:uitspraak", XML_NS)

    uitspraak_bitstring = ET.tostring(uitspraakxml, encoding="utf-8", method="text")
    # Because we pass encoding="utf-8", tostring returns a bytes object, so we need to decode it to get a proper string:
    uitspraak_str = uitspraak_bitstring.decode()

    return uitspraak_str

We can now simply call this function with any key to do our pre-processing:

In [None]:
print(preprocess_decision(rd, "https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:RVS:2022:1"))
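
When pre-processing many decisions at once, it may be worth guarding against entries that lack an `rs:uitspraak` element: in that case `find` returns `None`, and passing `None` on to `ET.tostring` would fail. The sketch below shows one possible way to handle this. The helper name `preprocess_all`, its `limit` parameter and the toy data are hypothetical, and it assumes the data behaves like a dictionary that maps keys to XML bytes.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Mapping

RS_URI = "http://www.rechtspraak.nl/schema/rechtspraak-1.0"
RS_NS = {"rs": RS_URI}


def preprocess_all(data: Mapping[str, bytes], limit: int = 100) -> Dict[str, str]:
    """Hypothetical helper: plain text for up to `limit` entries that contain rs:uitspraak."""
    texts: Dict[str, str] = {}
    for key, xml_bytes in data.items():
        if len(texts) >= limit:
            break
        element = ET.fromstring(xml_bytes).find("rs:uitspraak", RS_NS)
        if element is None:
            # No decision text in this entry: skip it instead of failing
            continue
        texts[key] = ET.tostring(element, encoding="unicode", method="text")
    return texts


# Toy stand-in for the dataset (an assumption, NOT real API output)
toy_data = {
    "key-1": f'<root><uitspraak xmlns="{RS_URI}"><para>Eerste tekst</para></uitspraak></root>'.encode(),
    "key-2": b"<root><other/></root>",
    "key-3": f'<root><uitspraak xmlns="{RS_URI}"><para>Derde tekst</para></uitspraak></root>'.encode(),
}

texts = preprocess_all(toy_data)
print(texts)
```

With the real dataset, something like `preprocess_all(rd.data, limit=10)` might work, provided `rd.data` supports `.items()`; that is an assumption to verify.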
