If you want to use this notebook online without installing Python on your computer, try:
<a href="https://colab.research.google.com/github/WetSuiteLeiden/example-notebooks/blob/main/wetsuite-nlp-crash-course/2-introduction-to-datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Google Colab"/></a> (do note however that this requires a Google account).

# WetSuite NLP crash course
# Part 2: Introduction to sample datasets

Welcome to part 2 of the WetSuite NLP crash course. Make sure that you've read the [full introduction](https://github.com/WetSuiteLeiden/example-notebooks/tree/main/wetsuite-nlp-crash-course) and worked through part 1 before you start on this notebook.

## Purpose of this notebook
In part two of this series, we show how to install the `wetsuite-core` library for use in your Python Notebook (or project). Then we show how to interact with one of the ready-made sample datasets provided by WetSuite.

## Types of legal datasets
Many different legal datasets exist (TODO: reference the website here). The `wetsuite-core` library provides an easier interface to some of them. For each of these datasets, WetSuite also provides a ready-made sample dataset, which can help you practice your technical skills and shows what is possible. The sample datasets are incomplete, because most of the full datasets are quite large. They are also static: they won't be updated. If you want to use the `wetsuite-core` library with more complete or up-to-date data, you will need to download it yourself (TODO: Reference how you can download other parts of the datasets yourself).

## WetSuite sample datasets
The sample datasets provided by WetSuite and accessible through the `wetsuite-core` library exist to help you get started with programming and NLP research. They are not complete and therefore not fit for use in actual research. However, they are based on larger, existing datasets that are complete and can be used for legal research. What you learn in our notebooks should help you create the tools you need for your own research.

The WetSuite sample datasets are:
* Parliamentary data: a subset of Dutch parliamentary data, in the XML format as provided by the Dutch government.
* Court decisions: about 2.5 years of court decisions as provided by the Dutch Judicial Council (Raad voor de rechtspraak), in the original XML format.
* Decisions by the Dutch Gambling Authority (Kansspelautoriteit), in a plain-text format.
* Dutch legislation, in the XML format as provided by the Dutch government through the APIs of wetten.overheid.nl.

## Summary
* WetSuite provides code to access some legal datasets in the `wetsuite-core` library.
* WetSuite provides some small ready-made sample datasets which can also be accessed directly via the `wetsuite-core` library.
* This series of notebooks aims to illustrate how to use Python to apply NLP methods to legal research.


# Step 1: Using the `wetsuite-core` library

In computer programming, a library is a collection of code that is meant to be re-used by other programs. Among other things, the `wetsuite-core` library provides practical interfaces to some existing legal datasets that are available online. For example, you can easily download all published judgments by a specific court in the Netherlands, as made available by the Raad voor de rechtspraak.

In order to use the `wetsuite-core` library, download and install the library by running the following code block:

(TODO: also add documentation on how to run this locally?)

In [None]:
# (only) in colab, run this first to install wetsuite from (the most recent) source.   For your own setup, see wetsuite's install guidelines.
!pip3 install -U wetsuite

Once the wetsuite library is installed, we can import it in our Python Notebook or file. Importing a library allows you to use the functions defined in the library in your program. In this notebook, we will use the datasets part of `wetsuite-core`:

In [1]:
import wetsuite.datasets
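If the import fails, the library is probably not installed in the environment your notebook is running in. As a minimal sketch, the standard-library function `importlib.util.find_spec` can check whether a module is available without importing it. The helper name `is_installed` is our own and is not part of `wetsuite`.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# "json" ships with Python; "wetsuite" is only found after the install cell has run.
for name in ("json", "wetsuite"):
    status = "available" if is_installed(name) else "missing (run the install cell first)"
    print(f"{name}: {status}")
```

If `wetsuite` is reported as missing, run the install cell above again and restart the notebook kernel if needed.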

To learn more about what features are made available through the `wetsuite-core` library, you can visit the [WetSuite API documentation](https://wetsuite.knobs-dials.com/apidocs/index.html).

# Step 2: Finding a sample dataset

Once a library is imported, we can use it in our program. By calling the following function, we can see which sample datasets are currently available. Note that the list of available sample datasets may change over time.

In [None]:
wetsuite.datasets.list_datasets()

Each sample dataset also has a short and a longer description, as well as a recorded size.

You can get this information in a machine-readable format by running:

In [None]:
# Get all the index information in the form of a dictionary
datasets_index = wetsuite.datasets.fetch_index()

# Get the information of this specific dataset.
datasets_index["rechtspraaknl-sample-xml"]

There's also a separate function to give a human-readable overview of all the available datasets:

In [None]:
wetsuite.datasets.print_dataset_summary()

For each dataset available through the wetsuite library, there is also an extended description. This is also part of the index, and can be accessed like this:

In [None]:
# Note that we've defined the datasets_index variable before.

print(datasets_index["kansspelautoriteit-sancties-struc"]["description"])
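Because the index is a dictionary of dictionaries, you can loop over it like any other nested Python data. The sketch below uses a made-up stand-in for the index so that it runs without a download. Only the `description` field name comes from this notebook; the dataset names, the `size_bytes` field, and all values are hypothetical, so inspect the real index to see which fields it actually contains.

```python
# A toy stand-in for the index: dataset name -> dictionary of properties.
# "size_bytes" is a hypothetical field name and all values are made up.
toy_index = {
    "dataset-a": {"description": "First example dataset.", "size_bytes": 1_500_000},
    "dataset-b": {"description": "Second example dataset.", "size_bytes": 300_000},
}


def summarize(index: dict) -> list[str]:
    """Build one summary line per dataset, sorted by dataset name."""
    lines = []
    for name in sorted(index):
        entry = index[name]
        size_mb = entry["size_bytes"] / 1_000_000
        lines.append(f"{name}: {entry['description']} ({size_mb:.1f} MB)")
    return lines


summary_lines = summarize(toy_index)
for line in summary_lines:
    print(line)
```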

# Step 3: Downloading and accessing a sample dataset

Now that we know which sample datasets are available, we can choose to download one. For this notebook, we'll delve deeper into the `kansspelautoriteit-sancties-struc` dataset. This dataset contains all decisions published by the Dutch Gambling Authority (Kansspelautoriteit, abbreviated to _Ksa_). These documents have already been pre-processed: the text has been extracted from the PDF files published by the Ksa, and some metadata has been added.

In [None]:
ksa = wetsuite.datasets.load("kansspelautoriteit-sancties-struc")
type(ksa)
# The object returned by this load function is an object of the custom Dataset class, the documentation
# of which can be found here: https://wetsuite.knobs-dials.com/apidocs/wetsuite.datasets.Dataset.html

The WetSuite library provides a small abstraction layer to make it easier to handle your data and focus more on creating your experiment. However, using this method of accessing certain datasets is certainly not required. When you are building your own datasets, or using datasets from other sources, you might need to interface in a different way with your data.

For each `Dataset` object, the actual data can be found in its `.data` attribute, which contains a simple key-value store (specifically, it's of type [`wetsuite.helpers.localdata.LocalKV`](https://wetsuite.knobs-dials.com/apidocs/wetsuite.helpers.localdata.LocalKV.html)). You can think of a key-value store as essentially a dictionary: given a specific key, you get the associated data (the value).
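To make the dictionary analogy concrete, here is a small sketch that uses a plain Python dictionary as a stand-in for a key-value store. The keys and values are made up for illustration. It only shows the general idea of looking things up by key; it does not claim that `LocalKV` behaves identically in every detail.

```python
# A toy key-value store: a plain Python dictionary with made-up keys and values.
toy_store = {
    "https://example.org/decisions/1": {"title": "Decision one", "year": 2021},
    "https://example.org/decisions/2": {"title": "Decision two", "year": 2022},
}

# List the available keys.
toy_keys = list(toy_store.keys())
print(toy_keys)

# Look up the value that belongs to one key.
first_value = toy_store.get("https://example.org/decisions/1")
print(first_value["title"])

# For a plain dictionary, .get() returns None when the key is absent.
missing_value = toy_store.get("https://example.org/decisions/999")
print(missing_value)
```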

In [None]:
# Convert the keys iterable to a list
keys_list = list(ksa.data.keys())

# Print the first ten keys in the ksa dataset
for k in keys_list[0:10]:
    print(k)

The `LocalKV` class also provides some helper functions, for example to get a random sample of keys.
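If you prefer not to rely on a library-specific helper, a random sample of keys can also be drawn with Python's standard `random` module. The sketch below uses made-up keys and is an alternative to the `LocalKV` helpers, not a description of how they work.

```python
import random

# Made-up keys standing in for the keys of a real dataset.
toy_keys = [f"https://example.org/decisions/{n}" for n in range(1, 21)]

# A seeded generator makes the sample reproducible between runs.
rng = random.Random(42)

# Draw three distinct keys at random.
sampled_keys = rng.sample(toy_keys, k=3)
for key in sampled_keys:
    print(key)
```

The same call should work on the `keys_list` variable from the cell above, since that is an ordinary Python list.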

Now that we know which keys are available in our dataset, we can see what values are actually available for each key.

In [8]:
# pprint is a Python "pretty print" library, which aims to show Python data structures in a more
# human-readable manner. For more information, see https://docs.python.org/3/library/pprint.html.
import pprint

v = ksa.data.get(
    "https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/bankgiro-loterij/"
)

pprint.pprint(v)
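When a value is large, the pretty-printed output can be hard to scan. One option is to print only an outline of the nesting: the keys and the type of each item. The sketch below does this for a made-up record; the helper `outline` and the field names `name`, `docs`, `url` and `pages` are hypothetical and do not describe the real structure of the Ksa data.

```python
def outline(value, indent=0, max_items=3):
    """Return lines that describe the nesting of dicts and lists, not their full content."""
    pad = "  " * indent
    lines = []
    if isinstance(value, dict):
        for key, item in value.items():
            lines.append(f"{pad}{key}: {type(item).__name__}")
            if isinstance(item, (dict, list)):
                lines.extend(outline(item, indent + 1, max_items))
    elif isinstance(value, list):
        lines.append(f"{pad}[{len(value)} items]")
        # Only descend into the first few items to keep the outline short.
        for item in value[:max_items]:
            if isinstance(item, (dict, list)):
                lines.extend(outline(item, indent + 1, max_items))
    return lines


# A made-up record with hypothetical field names.
toy_record = {
    "name": "Example case",
    "docs": [{"url": "https://example.org/doc.pdf", "pages": ["page one", "page two"]}],
}

outline_lines = outline(toy_record)
print("\n".join(outline_lines))
```

Calling `outline(v)` on the value retrieved above would give a similar overview, provided that value is built from dictionaries and lists.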

# Conclusion

In this second notebook, we introduced the WetSuite sample datasets and showed you the basics of interacting with them via the `wetsuite` library.

# Done! [Click here to go to part 3](https://github.com/WetSuiteLeiden/example-notebooks/blob/main/wetsuite-nlp-crash-course/3-a-first-nlp-experiment.ipynb)