If you want to use this notebook online without installing Python on your computer, try:
<a href="https://colab.research.google.com/github/WetSuiteLeiden/example-notebooks/blob/main/datasets-use/datasets-introduction-part-1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> (do note however that this requires a Google account).

# WetSuite datasets: introduction (part 1)

WetSuite provides tools that make existing legal datasets easier to use. In this series of example notebooks, we explain which datasets can be accessed via the `wetsuite-core` library, and how you can use them to apply NLP tools in your legal research.

## Types of legal datasets
Many different legal datasets exist (TODO: reference the website here). The `wetsuite-core` library provides an easier interface to some of them. For each of these datasets, WetSuite also provides ready-made sample datasets that let you practice your technical skills and see what is possible. The sample datasets are incomplete, as most full datasets are quite large. They are also static: they will not be updated. If you want to use the `wetsuite-core` library with more complete or up-to-date data, you will need to download it yourself (TODO: Reference how you can download other parts of the datasets yourself).

## WetSuite sample datasets
The sample datasets that are provided by WetSuite and accessible through the `wetsuite-core` library exist to help you get started with programming and NLP research. They are not complete and thus not fit for use in actual research. However, they are based on larger, existing datasets which are complete and can be used for legal research. The things you learn in our notebooks should help you create the tools you need for your own research.

The WetSuite sample datasets are:
* Parliamentary data: a subset of Dutch parliamentary data, in the XML format as provided by the Dutch government.
* Court decisions: about 2.5 years of court decisions as provided by the Dutch Judicial Council (Raad voor de rechtspraak), in the original XML format.
* Decisions by the Dutch Gambling Authority (Kansspelautoriteit), in a plaintext format.
* Dutch legislation, in the XML format as provided by the Dutch government through the APIs of wetten.overheid.nl.

## Purpose of this notebook
In this first part, we show how to install the `wetsuite-core` library for use in your Python Notebook (or project). Then we show how to interact with one of the ready-made sample datasets provided by WetSuite.

## Target audience
The primary target audience for this series of notebooks is legal scholars with little or no programming experience.

## What you need
You can read this notebook on GitHub as a reference. However, it is better to run it in order to get some actual programming experience. To run this notebook [TODO: link to how-to-run explanation].

Besides this, the [Python documentation](https://docs.python.org/3/) can prove very helpful. If you want a more step-by-step and guided introduction to Python, plenty of online courses exist; for example [Codecademy's "Learn Python 3" course](https://www.codecademy.com/learn/learn-python-3).

## Summary
* WetSuite provides code to access some legal datasets in the `wetsuite-core` library.
* WetSuite provides some small ready-made sample datasets which can also be accessed directly via the `wetsuite-core` library.
* This series of notebooks aims to illustrate how to use Python to apply NLP methods for legal research.


# Step 0: What is a notebook?
A Python Notebook, such as this file, is an interactive programming environment. It consists of blocks of text (such as this) and blocks of runnable Python code. When you run the notebook on your computer (or online in a service such as Google Colab), you can change the code and run it! The output of each code block is shown below the code block.

# Step 1: Using the `wetsuite-core` library

In computer programming, a library is a collection of code that is meant to be re-used by other programs. The `wetsuite-core` library provides, among other things, practical interfaces to some existing legal datasets that are available online. For example, you can easily download all published judgments of a specific court in the Netherlands, as made available by the Raad voor de rechtspraak.

In order to use the `wetsuite-core` library, download and install the library by running the following code block:

(TODO: also add documentation on how to run this locally?)

In [None]:
# Only needed in Colab: run this first to install (or upgrade) wetsuite. For your own setup, see wetsuite's install guidelines.
!pip3 install -U wetsuite

Once the wetsuite library is installed, we can import it in our Python Notebook or file. Importing a library allows you to use the functions defined in the library in your program. In this notebook, we will use the datasets part of `wetsuite-core`:

In [1]:
import wetsuite.datasets

To learn more about what features are made available through the `wetsuite-core` library, you can visit the [WetSuite API documentation](https://wetsuite.knobs-dials.com/apidocs/index.html).

# Step 2: Finding a sample dataset

Once a library is imported, we can use it in our program. By calling the following function, we can see which sample datasets are currently available. Note that the list of available sample datasets may change over time.

In [2]:
wetsuite.datasets.list_datasets()

['bwb-mostrecent-meta-struc',
 'bwb-mostrecent-text',
 'bwb-mostrecent-xml',
 'cvdr-mostrecent-html',
 'cvdr-mostrecent-meta-struc',
 'cvdr-mostrecent-text',
 'cvdr-mostrecent-xml',
 'eurlex-dir-nl-struc',
 'eurlex-judg-nl-struc',
 'eurlex-reg-nl-struc',
 'gemeentes-struc',
 'internetconsultaties-partial-struc',
 'kansspelautoriteit-sancties-struc',
 'parliament-sample-xml',
 'raadvanstate-adviezen-struc',
 'rechtspraaknl-sample-xml',
 'rechtspraaknl-struc',
 'tweedekamer-fractie-membership-struc',
 'tweedekamer-fracties-struc',
 'tweedekamer-kamervragen-struc',
 'wetnamen',
 'woo_besluiten_docs_text',
 'woo_besluiten_meta']
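The result of `list_datasets()` is an ordinary Python list of strings, so you can process it with plain Python. As a minimal sketch, the code below groups a hand-copied subset of the names above by the part after the last hyphen. The variables `names` and `by_suffix` are made up for this example, and the idea that the suffix hints at the form of the data (`xml`, `text`, `struc`) is an assumption based on the names and summaries shown in this notebook, not a documented rule.

```python
from collections import defaultdict

# A hand-copied subset of the names printed above, so this runs without wetsuite.
# Assumption: the part after the last hyphen hints at the form of the data.
names = [
    "bwb-mostrecent-text",
    "bwb-mostrecent-xml",
    "cvdr-mostrecent-text",
    "cvdr-mostrecent-xml",
    "kansspelautoriteit-sancties-struc",
    "rechtspraaknl-struc",
]

# Map each suffix to the list of dataset names that end with it
by_suffix = defaultdict(list)
for name in names:
    suffix = name.rsplit("-", 1)[-1]  # split once, from the right
    by_suffix[suffix].append(name)

for suffix, group in sorted(by_suffix.items()):
    print(suffix, group)
```

In the notebook itself, you could replace the hand-copied list with the list returned by the function call above.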

Each sample dataset also has a short and a longer description, as well as information about its size.

You can get this information in a machine-readable form by running:

In [3]:
# Get all the index information in the form of a dictionary
datasets_index = wetsuite.datasets.fetch_index()

# Get the information of this specific dataset.
datasets_index["rechtspraaknl-sample-xml"]

{'name': 'rechtspraaknl-sample-xml',
 'url': 'https://wetsuite.knobs-dials.com/datasets/rechtspraaknl-sample-xml.db.xz',
 'version': '(preliminary)',
 'type': 'xz-sqlite3',
 'description_short': ' A small sample of the XML form available at rechtspraak.nl: documents from 2022 on ',
 'description': ' (TODO) ',
 'download_size': 461388184,
 'real_size': 4241199104,
 'download_size_human': '440MiB',
 'real_size_human': '3.9GiB',
 'uncompressed_sha1hex': 'e60a32ad72d5068e991ad7f05233dd9fe0846af1',
 'first100000_sha1hex': None}
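Because an index entry is a regular dictionary, you can compute with its values. The sketch below is only an illustration: it uses a hand-copied subset of the entry shown above (the `entry` variable and the `to_mib` helper are made up for this example) to convert the sizes from bytes to more readable units and to estimate how much larger the unpacked file is than the download.

```python
# Hand-copied from the index entry shown above, so this runs without downloading anything.
entry = {
    "name": "rechtspraaknl-sample-xml",
    "download_size": 461388184,  # size of the compressed download, in bytes
    "real_size": 4241199104,     # size after unpacking, in bytes
}


def to_mib(num_bytes):
    """Convert a size in bytes to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 * 1024)


download_mib = to_mib(entry["download_size"])
real_gib = to_mib(entry["real_size"]) / 1024  # 1 GiB = 1024 MiB
ratio = entry["real_size"] / entry["download_size"]

print(f"{entry['name']}: download {download_mib:.0f} MiB, unpacked {real_gib:.1f} GiB")
print(f"The unpacked file is about {ratio:.1f} times larger than the download")
```

A check like this can be useful before loading a dataset, to see whether you have enough free disk space.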

There's also a separate function to give a human-readable overview of all the available datasets:

In [4]:
wetsuite.datasets.print_dataset_summary()

bwb-mostrecent-meta-struc               	  186MiB	Metadata structure text for the latest revision from each BWB-id
bwb-mostrecent-text                     	  393MiB	Plain text for the latest revision from each BWB-id
bwb-mostrecent-xml                      	  2.9GiB	Raw XML for the latest revision from each BWB-id
cvdr-mostrecent-html                    	 14.3GiB	Raw HTML for the latest expression within each CVDR work set
cvdr-mostrecent-meta-struc              	  259MiB	Metadata for the latest expression within each CVDR work set
cvdr-mostrecent-text                    	  3.8GiB	Flattened plain text for the latest expression within each CVDR work set
cvdr-mostrecent-xml                     	  8.7GiB	Raw XML for the latest expression within each CVDR work set
eurlex-dir-nl-struc                     	  213MiB	The dutch translation of metadata and text for European DIRectives in EUR-LEX. (preliminary version)
eurlex-judg-nl-struc                    	  0.9GiB	The dutch translation of met

For each dataset available through the wetsuite library, there is also an extended description. This is also part of the index, and can be accessed like this:

In [5]:
# Note that we've defined the datasets_index variable before.

print(datasets_index["kansspelautoriteit-sancties-struc"]["description"])

This is a plaintext form of the set of documents you can find under https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/ as PDFs.

        Since almost half of those PDFs do not have a text stream, this data is entirely OCR'd,
        so expect some typical OCR errors.  The OCR quality seems fairly decent, and some effort was made to remove headers and footers,
        yet there are some leftovers  like _ instead of . and = instead of :


        The data is a fairly nested structure of python objects (or JSON, before it's parsed).
        - .data is a list of cases.

        - each case is a dict, with a 
            - 'name', 
            - 'docs' (a list) 
            - and some extracted information like mentioned money amounts, the apparent date span of the case

        - each document in that mentioned list is is a dict, with keys like
            - 'url' - to the PDF it came from
            - 'status' - from the detail page (if we could find it - not 100%) 
      

# Step 3: Downloading and accessing a sample dataset

Now that we know which sample datasets are available, we can choose to download one. For this notebook, we'll delve deeper into the `kansspelautoriteit-sancties-struc` dataset. This dataset contains all decisions published by the Dutch Gambling Authority (Kansspelautoriteit, abbreviated to _Ksa_). These documents have already been pre-processed: the text has been extracted from the PDF files published by the Ksa, and some metadata has been added.

In [6]:
import pprint

# The load function returns an object of the custom Dataset class, the documentation
# of which can be found here: https://wetsuite.knobs-dials.com/apidocs/wetsuite.datasets.Dataset.html
ksa = wetsuite.datasets.load("kansspelautoriteit-sancties-struc")
type(ksa)

wetsuite.datasets.Dataset

The WetSuite library provides a small abstraction layer that makes it easier to handle your data, so you can focus on your experiment. However, you are not required to access datasets this way. When you are building your own datasets, or using datasets from other sources, you might need to interface with your data in a different way.

For each `Dataset` object, the actual data can be found in its `.data` attribute, which contains a simple key-value store (specifically, it's of type [`wetsuite.helpers.localdata.LocalKV`](https://wetsuite.knobs-dials.com/apidocs/wetsuite.helpers.localdata.LocalKV.html)). You can think of a key-value store as essentially a dictionary: given a specific key, you get the associated data (the value).
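If the dictionary comparison is new to you, the following sketch shows the idea with a plain Python `dict`. The keys and values are made up for illustration, and a plain dictionary is only a stand-in here: `LocalKV` may behave differently in its details (for example for missing keys), so check its documentation for specifics.

```python
# A plain Python dictionary as a stand-in for a key-value store.
# The keys and values here are made up for illustration.
store = {
    "case-a": {"name": "Example case A"},
    "case-b": {"name": "Example case B"},
}

# Look up one value by its key
value = store.get("case-a")
print(value["name"])

# With a plain dict, .get() returns None for a key that does not exist
missing = store.get("case-z")
print(missing)

# Iterate over all keys and their values
for key, val in store.items():
    print(key, "->", val["name"])
```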

In [7]:
# Convert the keys iterable to a list
keys_list = list(ksa.data.keys())

# Print the first ten keys in the ksa dataset
for k in keys_list[0:10]:
    print(k)

https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/1x-corp-exinvest/
https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/7red-com/
https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/artikel/
https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/aulon/
https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/automaten/
https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/bankgiro-loterij/
https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/belhuis-internetcafe/
https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/bet-at-home/
https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/betent-0/
https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/betent-aanwijzing/


The `LocalKV` class also provides some helper functions, for example to get a random sample of keys.
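The names of those helper functions are listed in the `LocalKV` documentation. As a library-independent sketch of the same idea, you can draw a random sample from a list of keys with `random.sample` from Python's standard library. The `example_keys` list below is hand-copied from the output above so that the example runs on its own; it is not the library's own sampling helper.

```python
import random

# A few keys hand-copied from the output above; in the notebook you could
# use the full keys_list instead.
example_keys = [
    "https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/1x-corp-exinvest/",
    "https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/7red-com/",
    "https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/aulon/",
    "https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/bankgiro-loterij/",
    "https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/bet-at-home/",
]

random.seed(0)  # fixed seed, so the same keys are picked on every run

# Pick two different keys at random
sample = random.sample(example_keys, k=2)
for key in sample:
    print(key)
```

Random samples are a quick way to get an impression of a dataset without reading everything in order.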

Now that we know which keys are available in our dataset, we can see what values are actually available for each key.

In [8]:
v = ksa.data.get(
    "https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/bankgiro-loterij/"
)
# The output is pretty long, so it's commented out here. Uncomment it and run it to check it out!
# pprint.pprint(v)
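To give an impression of how you might walk through such a value, here is a minimal sketch. It assumes that a case has the shape outlined in the dataset description earlier (a dict with a `'name'` and a list of `'docs'`, where each document is a dict with keys like `'url'` and `'status'`). The `example_case` value is a made-up stand-in, not real data; the real values contain more keys.

```python
# A made-up stand-in with the shape described in the dataset description.
# The URLs and status texts are placeholders, not real Ksa data.
example_case = {
    "name": "Example case",
    "docs": [
        {"url": "https://example.org/decision-1.pdf", "status": "example status"},
        {"url": "https://example.org/decision-2.pdf"},  # no 'status' found
    ],
}

print(example_case["name"])

doc_urls = []
for doc in example_case["docs"]:
    # .get() returns None instead of raising a KeyError when 'status' is absent
    status = doc.get("status")
    doc_urls.append(doc["url"])
    print(doc["url"], "| status:", status)

print("Number of documents:", len(doc_urls))
```

Once you have inspected the real output of `pprint.pprint(v)`, you can adapt the key names in this loop to what you actually find there.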