If you want to start playing with this without installation, try:
<a href="https://colab.research.google.com/github/knobs-dials/wetsuite-notebooks/blob/main/dataset_intro_by_doing__0intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# WetSuite datasets: introduction (part 1)

WetSuite provides tools to more easily use existing legal datasets. In this series of example notebooks, we explain which datasets can be accessed via the `wetsuite-core` library, and how you can use this to speed up your research.

## Types of legal datasets
A lot of different legal datasets exist (TODO: reference the website here). The `wetsuite-core` library provides an easier interface to interact with some datasets. For each of these datasets, WetSuite also provides ready-made sample datasets which can help you practice your technical skills and show what's possible. The sample datasets are incomplete, as most datasets are quite big. Furthermore, the sample datasets won't be updated: they remain static.

## Purpose of this notebook
In this first part, we show how to install the `wetsuite-core` library for use in your Python Notebook (or project). Then we show how to interact with the ready-made datasets provided by WetSuite.

## Summary
* WetSuite provides code to access some legal datasets in the `wetsuite-core` library.
* WetSuite provides some small ready-made datasets which can also be accessed via the `wetsuite-core` library.


# Step 1: Using the `wetsuite-core` library

The `wetsuite-core` library provides, among other things, interfaces to interact in a practical manner directly with some existing legal datasets which are made available online. For example, you can easily download all published judgments by a specific court in the Netherlands as made available by the Raad voor de rechtspraak.

In order to use the `wetsuite-core` library, download and install the library by running the following code block:

(TODO: also add documentation on how to run this locally?)

In [1]:
# (only) in colab, run this first to install wetsuite from (the most recent) source.   For your own setup, see wetsuite's install guidelines.
!pip3 install -U --no-cache-dir --quiet https://github.com/WetSuiteLeiden/wetsuite-core/archive/refs/heads/main.zip
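If you are working outside Colab and are unsure whether the install succeeded, a quick check is to ask Python whether the module can be found at all. This is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper name, not part of `wetsuite-core`.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


# 'wetsuite' is the import name used in the cells of this notebook
print("wetsuite importable:", is_installed("wetsuite"))
```

If this prints `False`, the imports in the next cell will fail and the install step needs another look.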

Once the library is installed, we can import the modules we need:

In [2]:
# glob is used further down, to list the files written by an export
import glob
import wetsuite.datasets
import wetsuite.helpers.format
import json
from pprint import pprint

# Step 2: Finding a sample dataset

Once the library is imported, we can use it. Calling the following function lists the sample datasets that are currently available. Note that this list may change over time.

In [3]:
wetsuite.datasets.list_datasets()

['bwb-mostrecent-meta-struc',
 'bwb-mostrecent-text',
 'bwb-mostrecent-xml',
 'cvdr-mostrecent-html',
 'cvdr-mostrecent-meta-struc',
 'cvdr-mostrecent-text',
 'cvdr-mostrecent-xml',
 'gemeentes-struc',
 'kansspelautoriteit-sancties-struc',
 'raadvanstate-adviezen-struc',
 'rechtspraaknl-struc',
 'tweedekamer-fractie-membership-struc',
 'tweedekamer-fracties-struc',
 'tweedekamer-kamervragen-struc',
 'wetnamen',
 'woo_besluiten_docs_text',
 'woo_besluiten_meta']
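Many of these names come in families that differ only in their suffix (`-text`, `-xml`, `-meta-struc`). As a small sketch of how you might pick out one family, the snippet below filters a hand-copied subset of the names above with a shell-style pattern. In a live session the list would come from `wetsuite.datasets.list_datasets()`; `select` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# A few names copied from the list_datasets() output above
names = [
    'bwb-mostrecent-meta-struc',
    'bwb-mostrecent-text',
    'bwb-mostrecent-xml',
    'cvdr-mostrecent-text',
    'rechtspraaknl-struc',
    'wetnamen',
]


def select(names, pattern):
    """Return the names matching a shell-style pattern such as 'bwb-*'."""
    return [name for name in names if fnmatchcase(name, pattern)]


bwb_names = select(names, 'bwb-mostrecent-*')
print(bwb_names)
```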

Each sample dataset also has a short description and a size, which we can print alongside its name:

In [4]:
# ...with a little more description?
for name, details in wetsuite.datasets.fetch_index().items():
    real_size_human     = wetsuite.helpers.format.kmgtp( details.get("real_size") )
    print( f"{name:<40}\t{real_size_human:>8s}\t{details.get('description_short')}" )

bwb-mostrecent-meta-struc               	    195M	Metadata structure text for the latest revision from each BWB-id
bwb-mostrecent-text                     	    412M	Plain text for the latest revision from each BWB-id
bwb-mostrecent-xml                      	    3.1G	Raw XML for the latest revision from each BWB-id
cvdr-mostrecent-html                    	     15G	Raw HTML for the latest expression within each CVDR work set
cvdr-mostrecent-meta-struc              	    272M	Metadata for the latest expression within each CVDR work set
cvdr-mostrecent-text                    	    4.1G	Flattened plain text for the latest expression within each CVDR work set
cvdr-mostrecent-xml                     	    9.3G	Raw XML for the latest expression within each CVDR work set
gemeentes-struc                         	    434K	
kansspelautoriteit-sancties-struc       	    9.2M	Metadata and plain text form of the set of PDFs you can find under https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebeslui
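The size column above is a byte count turned into a short human-readable string. To illustrate the idea, here is a minimal stand-in written from scratch. It is not the library's `kmgtp` implementation; it assumes decimal (1000-based) units, and the byte counts passed in are made-up inputs chosen to resemble the sizes shown.

```python
def human_size(num_bytes, units=("", "K", "M", "G", "T", "P")):
    """Format a byte count as a short string such as '3.1G' (1000-based)."""
    if num_bytes is None:
        return "?"  # size unknown
    value = float(num_bytes)
    for unit in units:
        if value < 1000 or unit == units[-1]:
            # one decimal for small scaled values, none otherwise
            if unit and value < 10:
                return f"{value:.1f}{unit}"
            return f"{value:.0f}{unit}"
        value /= 1000


for size in (434_000, 195_000_000, 3_100_000_000, None):
    print(f"{human_size(size):>8s}")
```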

# Step 3: Downloading a sample dataset

Now that we know which sample datasets are available, we can choose to download one.


In [5]:

# QUESTION MS: is a sample dataset re-downloaded each time?
kamervragen = wetsuite.datasets.load('tweedekamer-kamervragen-struc')

data = kamervragen.data.get("ah-tk-20122013-511")
pprint(data)

{'available': '2012-11-12',
 'category': 'Zorg en gezondheid | Geneesmiddelen en medische hulpmiddelen',
 'identifier': 'ah-tk-20122013-511',
 'indiener': ['L.T. Bouwmeester'],
 'issued': '2012-11-09',
 'ontvanger': ['E.I. Schippers'],
 'type': ['officiële publicatie', 'Antwoord', 'Aanhangsel van de Handelingen'],
 'urls': ['https://repository.overheid.nl/frbr/officielepublicaties/ah-tk/20122013/ah-tk-20122013-511/1/metadata/metadata.xml',
          'https://repository.overheid.nl/frbr/officielepublicaties/ah-tk/20122013/ah-tk-20122013-511/1/xml/ah-tk-20122013-511.xml'],
 'vergaderjaar': '2012-2013',
 'vraagdata': {'1': {'antwoord': ['Ja, ik heb het IMS rapport met veel '
                                  'interesse gelezen. Dit IMS rapport diende '
                                  'als technisch document bij de Ministers '
                                  'Summit dat 3 oktober jl. in Amsterdam '
                                  'plaats had. Het bedrag is opgebouwd uit de '
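Each item is a plain nested structure of dicts, lists and strings, so ordinary Python is enough to work with it. The sketch below uses a trimmed, hand-copied version of the record above. It contains only fields visible in that output, with the answer text cut short. It parses the two dates and walks the numbered entries under `vraagdata`.

```python
from datetime import date

# Trimmed copy of the record printed above (answer text shortened)
record = {
    'identifier': 'ah-tk-20122013-511',
    'indiener': ['L.T. Bouwmeester'],
    'ontvanger': ['E.I. Schippers'],
    'issued': '2012-11-09',
    'available': '2012-11-12',
    'vraagdata': {
        '1': {'antwoord': ['Ja, ik heb het IMS rapport met veel interesse gelezen.']},
    },
}

# The dates are ISO-formatted strings, so they parse directly
issued = date.fromisoformat(record['issued'])
available = date.fromisoformat(record['available'])
days_until_available = (available - issued).days
print(record['identifier'], 'became available after', days_until_available, 'days')

# Each numbered entry holds its answer as a list of text fragments
for number, parts in record['vraagdata'].items():
    answer = ' '.join(parts.get('antwoord', []))
    print(number, answer[:60])
```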
        

## Example datasets

There are some fairly targeted examples, including

* `cvdr-mostrecent-*`

* `bwb-mostrecent-*`

* `kansspelautoriteit-sancties-struc` - for example use, see [using_dataset_kansspelautoriteit notebook](dataset_intro_by_doing__kansspelautoriteit__(OCR_example).ipynb)

* `tweedekamer-fracties-struc` and `tweedekamer-fractie-membership-struc` - for example use, see [using_dataset_tweedekamer notebook](dataset_intro_by_doing__tweedekamer.ipynb)

* `woobesluit` - for example use, see [using_dataset_woobesluit notebook](dataset_intro_by_doing__woobesluit.ipynb)


And some things that are little more than lists, including
* `gemeentes`: [using_dataset_gemeentes notebook](dataset_intro_by_doing__gemeentes.ipynb)

* `wetnamen`: [using_dataset_wetnamen notebook](dataset_intro_by_doing__woobesluit.ipynb)



Aside: We have an ongoing discussion of how to provide more varied data without making life harder for you.

For example, in theory the `-xml` variant would be enough, in that `-meta` and `-text` are only a handful of extra function calls away,
but making you do that is an unnecessary hurdle.

Providing these in a merged, composite way, e.g. with dataset items providing attributes like
`item.xml`/`item.raw`, `item.metadata`, `item.text`, would be confusing wherever those attributes differ per dataset,
and might prove inflexible in the long run, in that datasets _or_ code changing is likely to break all previous uses.

So for now, we provide distinct datasets for different views on the same data,
which also means you have at least some chance of loading this data in other, non-Python contexts.
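To give a feel for what "a handful of function calls" from XML to text can look like, here is a sketch using only the standard library. The XML snippet and its element names are invented for illustration and do not follow the real BWB or CVDR schemas. The real documents are considerably messier, which is part of why the `-text` variants exist.

```python
import xml.etree.ElementTree as ET

# Invented, simplified XML; NOT the actual BWB/CVDR structure
xml_source = """<document>
  <meta><title>Voorbeeldregeling</title></meta>
  <body>
    <article><heading>Artikel 1</heading><p>Eerste  lid.</p></article>
    <article><heading>Artikel 2</heading><p>Tweede lid.</p></article>
  </body>
</document>"""

root = ET.fromstring(xml_source)

# "metadata" view: pick out a specific field
title = root.findtext('meta/title')

# "text" view: gather all text nodes in the body and normalise whitespace
fragments = root.find('body').itertext()
text = ' '.join(' '.join(fragments).split())

print(title)
print(text)
```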

## "What if I want to feed files into something else?"

The datasets as downloaded are indeed in our own format,
only directly readable by our own code.

While _some_ sort of database tends to scale better than files,
there are plenty of existing things that just happen to take files.

So we give you some code to export to files (or to a single ZIP file, which is a little more transportable).
We will guess at what to name those files, though those names are not always very helpful.

In [None]:
kans_meta = wetsuite.datasets.load('kansspelautoriteit-sancties-struc')
kans_meta.export_files(to_zipfile_path='/tmp/kans.zip')

bwb_text  = wetsuite.datasets.load('bwb-mostrecent-text')
bwb_text.export_files(in_dir_path='/tmp/bwb-text' )
print( glob.glob('/tmp/bwb-text/*')[:10] )
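Once you have a ZIP file, any tool that understands ZIP archives can take it from there. The sketch below shows the reading side in plain Python. So that it runs without downloading anything, it first builds a small stand-in archive; the member names and contents are made up and do not reflect what `export_files` actually writes.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, 'example.zip')

    # Stand-in for an exported archive (hypothetical names and contents)
    with zipfile.ZipFile(zip_path, 'w') as zf:
        zf.writestr('doc-0001.txt', 'eerste document')
        zf.writestr('doc-0002.txt', 'tweede document')

    # Reading side: list the members and load each one as text
    with zipfile.ZipFile(zip_path) as zf:
        contents = {name: zf.read(name).decode('utf-8') for name in zf.namelist()}

for name in sorted(contents):
    print(name, len(contents[name]), 'characters')
```

For a real export, replace `zip_path` with the path you passed as `to_zipfile_path`, and check the member names with `namelist()` before assuming an extension or encoding.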


## Why ready-made datasets at all?

A few different reasons. Some of our considerations follow:

The most up-to-date information usually comes from fetching data directly from the source. We have notebooks elsewhere that detail how that may be done (TODO: link).

If you can express what you want, then such direct fetching is also the most elegant way to get exactly what you want.


Such 'if's matter. When you have a research question, and have gone to a system that probably has the data you are looking for, ask yourself:


**Does it let you _express_ what you want?**

Questions like "give me all parliamentary reports related to animal management" may be very reasonable, but for various data sources, you will not find it easy to be sure the search results are complete.


**Does it let you _fetch_ what you want?**

Even if a site lets you search for what you need, if there is no "download however many matching documents that is", you can't _do_ anything with it (beyond spending the week pressing Save as).

Put another way: is there an interface to fetch data?

Also, is that interface at least as expressive as the site? For example
- the damocles example (TODO: link) may be an example of "hey, we managed to refine our search to download a fairly precise set of documents", but don't count on that always working.
- whereas the KOOP BWB interface doesn't let you search in the document body, as the website will.


**Maybe you just wanted a lot of data**

If you want to test a text processing method,
maybe you just wanted a lot of text of some different types,
no matter yet what it is exactly. Both site and API would be slow.


Even if we provide code to make finding-and-fetching easier,
_to end users_ that may only really move the problem around.

When you just wanted to focus on the _documents_, chances are you now need to stare at our code for an afternoon before you get it to work -- or discover that it would never have done what you expected.

Or maybe you now have PDF or XML, and now have a different question to answer before you can start working on _text_.


**tl;dr: these datasets represent certain amounts of prep work.**