If you want to start playing with this without installation, try:
<a href="https://colab.research.google.com/github/knobs-dials/wetsuite-notebooks/blob/main/dataset_intro_by_doing__0intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# WetSuite datasets: introduction (part 1)

WetSuite provides tools to more easily use existing legal datasets. In this series of example notebooks, we explain which datasets can be accessed via the `wetsuite-core` library, and how you can use this to speed up your research.

## Types of legal datasets
A lot of different legal datasets exist (TODO: reference the website here). The `wetsuite-core` library provides an easier interface to interact with some datasets. For each of these datasets, WetSuite also provides ready-made sample datasets which can help you practice your technical skills and show what's possible. The sample datasets are incomplete, as most datasets are quite big. Furthermore, the sample datasets won't be updated: they remain static.

## WetSuite sample datasets
The sample datasets that are provided by WetSuite and accessible through the `wetsuite-core` library exist to help you get started with programming and NLP research. They are not complete and thus not fit for use in actual research. However, they are based on larger, existing datasets which are complete and can be used for legal research. The things you learn in our notebooks should help you create the tools you need for your research.

The WetSuite sample datasets are:
* Parliamentary data: a subset of Dutch parliamentary data, in the XML format as provided by the Dutch government.
* Court decisions: about 2.5 years of court decisions as provided by the Dutch Judicial Council (Raad voor de rechtspraak), in the original XML format.
* Decisions by the Dutch Gambling Authority (Kansspelautoriteit), in a plaintext format.
* Dutch legislation, in the XML format as provided by the Dutch government through the APIs of wetten.overheid.nl.

## Purpose of this notebook
In this first part, we show how to install the `wetsuite-core` library for use in your Python Notebook (or project). Then we show how to interact with the ready-made sample datasets provided by WetSuite.

## What you need
You can read this notebook on GitHub as a reference. However, it is better to run it in order to get some actual programming experience. To run this notebook [TODO: link to how-to-run explanation].

Besides this, the [Python documentation](https://docs.python.org/3/) can prove very helpful. If you want a more step-by-step and guided introduction to Python, plenty of online courses exist; for example [Codecademy's "Learn Python 3" course](https://www.codecademy.com/learn/learn-python-3).

## Summary
* WetSuite provides code to access some legal datasets in the `wetsuite-core` library.
* WetSuite provides some small ready-made datasets which can also be accessed via the `wetsuite-core` library.


# Step 1: Using the `wetsuite-core` library

The `wetsuite-core` library provides, among other things, interfaces to interact in a practical manner directly with some existing legal datasets which are made available online. For example, you can easily download all published judgments by a specific court in the Netherlands as made available by the Raad voor de rechtspraak.

In order to use the `wetsuite-core` library, download and install the library by running the following code block:

(TODO: also add documentation on how to run this locally?)

In [None]:
# (only) in colab, run this first to install wetsuite from (the most recent) source.   For your own setup, see wetsuite's install guidelines.
!pip3 install -U wetsuite

Once the wetsuite library is installed, we can import it in our Python Notebook or file. Importing a library allows you to use the functions defined in the library in your program. In this notebook, we will use the datasets part of `wetsuite-core`:

In [None]:
import wetsuite.datasets

# Step 2: Finding a sample dataset

Once a library is imported, we can use it in our program. By calling the following function, we can see which sample datasets are currently available. Please note that which sample datasets are available may change over time.

In [None]:
wetsuite.datasets.list_datasets()

Each sample dataset also has a short and a longer description, and we can see the size of each dataset.

You can get that in data form by running: 

In [None]:
wetsuite.datasets.fetch_index() # which gives the same names, together with details about each dataset
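
If you want to pick details out of that index yourself, a minimal sketch could look like the following. It uses a small stand-in dictionary rather than the real index, so that it runs without a download. The key names `description_short` and `download_size`, the helper `summarize`, and all values are assumptions for illustration; inspect the real output of `fetch_index()` to see what is actually there.

```python
# Stand-in for a name -> details mapping such as an index might contain.
# The dataset names come from this notebook; keys and values are invented.
example_index = {
    'kansspelautoriteit-sancties-struc': {
        'description_short': 'placeholder short description',
        'download_size': 2_000_000,
    },
    'gemeentes': {
        'description_short': 'another placeholder description',
        'download_size': 100_000,
    },
}


def summarize(index):
    """Return (name, size in MB, short description) tuples, smallest download first."""
    rows = []
    for name, details in index.items():
        size_mb = round(details.get('download_size', 0) / 1e6, 1)
        rows.append((name, size_mb, details.get('description_short', '')))
    return sorted(rows, key=lambda row: row[1])


for name, size_mb, short in summarize(example_index):
    print(f'{name:40s} {size_mb:6.1f} MB  {short}')
```

Using `.get()` with a default keeps the sketch from failing if a dataset happens to lack one of the assumed keys.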

...or you could pick out those details and print a summary. Because you will probably want that fairly often, there is a function for it:

In [None]:
wetsuite.datasets.print_dataset_summary() 
# that function actually uses fetch_index(), which gives more detailed data,
# ...and prints just a few parts of it, including the short description

You can read a longer description if you care.

You can pick it out of the data that `fetch_index()` gives, 
but it's less typing to use our function that does just that:  

In [None]:
print(wetsuite.datasets.description('kansspelautoriteit-sancties-struc'))

# Step 3: Downloading a sample dataset

Now that we know which sample datasets are available, we can choose to download one.

In [None]:
import pprint

# QUESTION MS: is a sample dataset re-downloaded each time?
kans = wetsuite.datasets.load('kansspelautoriteit-sancties-struc')

# give us a random item from it, as an (identifier, details) pair
case_id, case_details = kans.data.random_choice()
print(f'=== {case_id} ===\n')
pprint.pprint( case_details )
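
The `random_choice()` call above hands you one identifier with its details. If you would rather look at the first few items in order, and assuming `.data` can be walked like a dictionary (an assumption, as this notebook only shows `random_choice()`), a sketch could look like this. The stand-in dictionary, its identifiers and fields, and the helper `first_items` are all invented so that the example runs on its own.

```python
import itertools

# Stand-in for `dataset.data`: a plain dict with invented identifiers and fields.
example_data = {
    'case-001': {'name': 'Example A', 'year': 2021},
    'case-002': {'name': 'Example B', 'year': 2022},
    'case-003': {'name': 'Example C', 'year': 2023},
}


def first_items(store, n=2):
    """Return the first n (key, value) pairs without listing the whole store."""
    return list(itertools.islice(store.items(), n))


for case_id, case_details in first_items(example_data):
    print(case_id, case_details)
```

`itertools.islice` stops after `n` items, which matters once a store holds many thousands of documents.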

## Example datasets

There are some fairly targeted examples, including

* `cvdr-mostrecent-*`

* `bwb-mostrecent-*`

* `kansspelautoriteit-sancties-struc` - for example use, see [using_dataset_kansspelautoriteit notebook](dataset_intro_by_doing__kansspelautoriteit.ipynb)

* `tweedekamer-fracties-struc` and `tweedekamer-fracties-membership-struc`

* `woobesluit` - for example use, see [using_dataset_woobesluit notebook](dataset_intro_by_doing__woobesluit.ipynb)


And some things that are little more than lists, including
* `gemeentes`: [using_dataset_gemeentes notebook](dataset_intro_by_doing__gemeentes.ipynb)

* `wetnamen`



Aside: We have an ongoing discussion about how to provide more varied data without making life harder for you.

For example, in theory the -xml datasets would be enough, in that -meta and -text are only a handful of extra function calls away,
but making you do that is an unnecessary hurdle.

Providing these in a merged, composite way, e.g. with dataset items providing attributes like
item.xml/item.raw, item.metadata, item.text, would be confusing in that they differ per dataset,
and might be inflexible in the long run, in that datasets _or_ code changing is likely to break all previous uses.

So for now, we provide distinct datasets for different views on the same data,
which also means you have at least some chance of loading this data in other, non-Python contexts.
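
To give an idea of what "a handful of extra function calls" can mean, the sketch below pulls plain text out of a small XML fragment using Python's standard library. The fragment, its element names, and the helper `xml_to_text` are invented for illustration, and this is not necessarily how our -text datasets are produced.

```python
import xml.etree.ElementTree as ET

# Invented fragment; real documents are larger and structured differently.
example_xml = (
    '<artikel>'
    '<kop>Artikel 1</kop>'
    '<al>Eerste lid.</al>'
    '<al>Tweede lid.</al>'
    '</artikel>'
)


def xml_to_text(xml_string):
    """Join all non-empty text fragments in the XML, one per line."""
    root = ET.fromstring(xml_string)
    fragments = (fragment.strip() for fragment in root.itertext())
    return '\n'.join(fragment for fragment in fragments if fragment)


print(xml_to_text(example_xml))
```

Real legal XML tends to need more care than this (footnotes, tables, metadata elements), which is part of why ready-made text views save you work.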

## "What if I want to feed files into something else?"

The datasets as downloaded are indeed in our own format,
only directly readable by our own code.

While the use of _some_ sort of database tends to scale better than many files,
there are plenty of existing things that just happen to take files.

So we give you some code to export to files
(or to a single ZIP file, which is more transportable).

Note: We will estimate what to name those files, and while this will give unique names,
they will not always be very _helpful_ names.

In [None]:
import glob

# export one dataset to a single ZIP file...
kans_meta = wetsuite.datasets.load('kansspelautoriteit-sancties-struc')
kans_meta.export_files(to_zipfile_path='/tmp/kans.zip')

# ...and another to a directory of separate files
bwb_text = wetsuite.datasets.load('bwb-mostrecent-text')
bwb_text.export_files(in_dir_path='/tmp/bwb-text')

# show a few of the filenames we made
print(glob.glob('/tmp/bwb-text/*')[:10])
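
Once exported, a ZIP file can be read with Python's standard `zipfile` module, without any WetSuite code. The sketch below first builds a tiny stand-in archive so that it runs on its own; the member names and the helpers `list_zip_members` and `read_zip_member` are made up for illustration. For a real export you would pass the path you gave to `export_files`, and the contents may need different decoding than assumed here.

```python
import os
import tempfile
import zipfile


def list_zip_members(zip_path, limit=10):
    """Return up to `limit` member names from a ZIP archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.namelist()[:limit]


def read_zip_member(zip_path, member_name, encoding='utf-8'):
    """Read one member from the archive and decode it as text."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.read(member_name).decode(encoding)


# Build a small stand-in archive (not a real WetSuite export).
demo_dir = tempfile.mkdtemp()
demo_zip = os.path.join(demo_dir, 'demo.zip')
with zipfile.ZipFile(demo_zip, 'w') as archive:
    archive.writestr('item-1.txt', 'first document')
    archive.writestr('item-2.txt', 'second document')

print(list_zip_members(demo_zip))
print(read_zip_member(demo_zip, 'item-1.txt'))
```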


## Why readymade datasets at all?

A few different reasons. Some of our considerations follow:

If you want the most up-to-date information, this argues _against_ datasets,
because the best way to get the most up-to-date information is often to fetch data directly from the source.

We have notebooks elsewhere that detail how that may be done (TODO: add some links),
for varied sources.

If you can express what you want to such sources,
then directly fetching is also the most elegant way to get exactly what you want.

Such 'if's matter.
When you have a research question, 
and have gone to a system that probably has the data you are looking for, ask yourself:


**Does it let you _express_ what documents you want?**

Questions like "give me all parliamentary reports related to animal management" may be very reasonable,
but for various data sources, you will not find it easy to fetch exactly that, or to be sure the search results are complete.

Chances are you will have to cast a much wider net, and sort out the results.


One example is the damocles test case (TODO: link to it), where we needed
a good amount of expressiveness, fetching, and refinement.
The interface we had to use for that case happened to work fairly well, but
don't count on that always being the case.
The KOOP BWB interface, for example, doesn't let you search in the document body, as the website will.


**Does it let you _fetch_ what you want?**

Say that you do manage to tell a system your precise needs.

Almost all website searches will tell you how many tens of thousands of results there are.

Almost no website searches will let you fetch them.

Put another way: is there an interface to fetch data so you can _do_ something with it 
-- that does not come down to spending the week pressing 'Save as'?


**Maybe you just wanted a lot of data**

Legal researchers often have a specific interest that is served from a single system,
from a specific broad area. The net they cast tends to be... manageable.

NLP researchers, on the other hand, may just wish to test a text processing method
on various different kinds of documents, and lots of them, no matter what they are exactly.

For that, both website and API would be a lot slower than one beefy download,
even if it is a fairly bite-sized sample rather than everything from a source.


Even if we provide code to make finding-and-fetching easier,
_to end users_ that may only really move the problem around.

When you just wanted to focus on the _documents_, chances are you now need to stare at our code for an afternoon
before you get it to work, discover that it would never have done what you were expecting,
or discover that fetching will take weeks.


**tl;dr: these datasets represent certain amounts of prep work.**