To run this notebook online without installing Python on your computer, try:
<a href="https://colab.research.google.com/github/WetSuiteLeiden/example-notebooks/blob/main/wetsuite-nlp-crash-course/3-a-first-nlp-experiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Google Colab"/></a> (note that this requires a Google account).

# WetSuite NLP crash course
# Part 3: A first NLP experiment

In part 3 of this crash course, we will finally build our first NLP experiment. In this experiment, we will count how often a reference is made to Dutch parliamentary documents (specifically, to the Kamerstukken) in court decisions.

# Sources
This notebook is based on Martijn Staal's bachelor thesis [_De memorie van toelichting overvraagd?_](https://martijn-staal.nl/nl/blog/de-memorie-van-toelichting-overvraagd) and the accompanying [digital appendix](https://github.com/mastaal/de-mvt-overvraagd).

In [1]:
!pip3 install -U wetsuite nllegalcit

Collecting nllegalcit
  Downloading nllegalcit-0.0.4-py3-none-any.whl (17 kB)
Collecting pypdf>=3.17.2
  Downloading pypdf-4.3.1-py3-none-any.whl (295 kB)
Collecting cryptography>=41.0.7
  Downloading cryptography-43.0.0-cp39-abi3-manylinux_2_28_x86_64.whl (4.0 MB)
Collecting lark>=1.1.8
  Downloading lark-1.2.2-py3-none-any.whl (111 kB)
Installing collected packages: pypdf, lark, cryptography, nllegalcit
Successfully installed cryptography-43.0.0 lark-1.2.2 nllegalcit-0.0.4 pypdf-4.3.1


In [34]:
import wetsuite.datasets
from nllegalcit import parse_citations

import xml.etree.ElementTree as ET

# XML namespaces
XML_NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "psi": "http://psi.rechtspraak.nl/",
    "rs": "http://www.rechtspraak.nl/schema/rechtspraak-1.0",
    "ecli": "https://e-justice.europa.eu/ecli"
}
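
To illustrate how a prefix map such as `XML_NS` is used, here is a minimal sketch. The XML document in it is made up and heavily simplified (it is not taken from the dataset), and the names `SAMPLE_XML` and `SAMPLE_NS` are only used for this illustration. ElementTree's `find` and `findall` accept such a dictionary as their second argument, so that a path like `rs:uitspraak` is expanded to the full namespace URI.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for a rechtspraak.nl document.
SAMPLE_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<open-rechtspraak>
  <uitspraak xmlns="http://www.rechtspraak.nl/schema/rechtspraak-1.0">
    <para>Eerste alinea.</para>
    <para>Tweede alinea.</para>
  </uitspraak>
</open-rechtspraak>"""

# Prefix map for this example only; "rs" is an arbitrary local prefix.
SAMPLE_NS = {"rs": "http://www.rechtspraak.nl/schema/rechtspraak-1.0"}

root = ET.fromstring(SAMPLE_XML)

# Without the prefix map, the namespaced element is not found.
not_found = root.find("uitspraak")

# With the prefix map, "rs:uitspraak" expands to the full namespace URI.
found = root.find("rs:uitspraak", SAMPLE_NS)

# Child elements inherit the default namespace, so they need the prefix too.
paragraphs = [p.text for p in found.findall("rs:para", SAMPLE_NS)]

print(not_found is None)
print(paragraphs)
```

The prefix in the dictionary does not have to match the prefix used in the document itself; only the namespace URI has to be the same.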

In [4]:
# rd, short for rechtspraak-dataset
# Note that this download can take a while, as it is about 450 MB.
# Make sure you have enough space available on your device, as the uncompressed
# dataset is nearly 4GB.
rd = wetsuite.datasets.load("rechtspraaknl-sample-xml")

Downloading 'https://wetsuite.knobs-dials.com/datasets/rechtspraaknl-sample-xml.db.xz' to '/home/martijn/.wetsuite/datasets/4e7748c1c84ff581fb303a540447a5e8a6287bd4'
Decompressing... 3.9GiB    


In [6]:
keys_list = list(rd.data.keys())

In [18]:
# This dataset uses URLs containing the ECLIs as keys, and the raw XML data as values:
print(keys_list[0])
print(rd.data.get(keys_list[0]))
# The structure of this XML data can be found here: https://www.rechtspraak.nl/SiteCollectionDocuments/Technische-documentatie-Open-Data-van-de-Rechtspraak.pdf

# See how many court decisions there are in this dataset:
print(len(keys_list))

https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:CBB:2022:1
b'<?xml version="1.0" encoding="utf-8"?>\r\n<open-rechtspraak>\r\n  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:ecli="https://e-justice.europa.eu/ecli" xmlns:tr="http://tuchtrecht.overheid.nl/" xmlns:eu="http://publications.europa.eu/celex/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:bwb="bwb-dl" xmlns:cvdr="http://decentrale.regelgeving.overheid.nl/cvdr/" xmlns:psi="http://psi.rechtspraak.nl/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">\r\n    <rdf:Description>\r\n      <dcterms:identifier>ECLI:NL:CBB:2022:1</dcterms:identifier>\r\n      <dcterms:format>text/xml</dcterms:format>\r\n      <dcterms:accessRights>public</dcterms:accessRights>\r\n      <dcterms:modified>2022-03-18T10:07:19</dcterms:modified>\r\n      <dcterms:issued rdfs:label="Publicatiedatum">2022-01-07</dcterms:issued>\r\n      <dcterms:publisher resourceIdentifier="http://rechtspraak.nl/">Raad voor de Rechtspraa
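
Since the keys are URLs rather than bare ECLIs, it can be handy to extract the ECLI from a key. The sketch below assumes that every key has the same shape as the one printed above, with the ECLI in the `id` query parameter. The helper names `ecli_from_key` and `court_code` are hypothetical and not part of `wetsuite`.

```python
from urllib.parse import urlparse, parse_qs


def ecli_from_key(key: str) -> str:
    """Return the ECLI stored in the `id` query parameter of a dataset key."""
    query = parse_qs(urlparse(key).query)
    return query["id"][0]


def court_code(ecli: str) -> str:
    """Return the court code of an ECLI.

    An ECLI has five colon-separated parts:
    the literal "ECLI", the country, the court, the year and a number.
    """
    return ecli.split(":")[2]


example_key = "https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:CBB:2022:1"
ecli = ecli_from_key(example_key)

print(ecli)
print(court_code(ecli))
```

With the real dataset, the same helper could be applied to each entry of `keys_list`, for example to group the decisions by court.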

In [51]:
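from pprint import pprint

key = "https://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:CBB:2022:1"
value = rd.data.get(key)
xml = ET.fromstring(value)
# The decision itself is in the <uitspraak> element (rechtspraak namespace).
uitspraakxml = xml.find("rs:uitspraak", XML_NS)

# Print the raw XML bytes of the whole document.
pprint(value)

# Print only the text content of the decision, with all XML tags stripped.
pprint(ET.tostring(uitspraakxml, encoding="utf-8", method="text").decode(encoding="utf-8"))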

(b'<?xml version="1.0" encoding="utf-8"?>\r\n<open-rechtspraak>\r\n  <rdf:RDF x'
 b'mlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:ecli="https://e'
 b'-justice.europa.eu/ecli" xmlns:tr="http://tuchtrecht.overheid.nl/" xmlns:eu='
 b'"http://publications.europa.eu/celex/" xmlns:dcterms="http://purl.org/dc/ter'
 b'ms/" xmlns:bwb="bwb-dl" xmlns:cvdr="http://decentrale.regelgeving.overheid.n'
 b'l/cvdr/" xmlns:psi="http://psi.rechtspraak.nl/" xmlns:rdfs="http://www.w3.or'
 b'g/2000/01/rdf-schema#">\r\n    <rdf:Description>\r\n      <dcterms:identifie'
 b'r>ECLI:NL:CBB:2022:1</dcterms:identifier>\r\n      <dcterms:format>text/xm'
 b'l</dcterms:format>\r\n      <dcterms:accessRights>public</dcterms:accessRi'
 b'ghts>\r\n      <dcterms:modified>2022-03-18T10:07:19</dcterms:modified>\r\n '
 b'     <dcterms:issued rdfs:label="Publicatiedatum">2022-01-07</dcterms:issued'
 b'>\r\n      <dcterms:publisher resourceIdentifier="http://rechtspraak.nl/">'
 b'Raad voor de Rechtspraak</dcter