# ORNL
This notebook outlines how the ORNL, ORNL8 and ORNL26 datasets were created. The finished datasets can be loaded directly:

```python
import datasets
df_ornl8  = datasets.load_dataset('Rodekool/ornl8')
df_ornl26 = datasets.load_dataset('Rodekool/ornl26')
```
<br>

## Gathering the raw data

1. We start by collecting the complete set of judgments, structured as a nested ZIP from [https://static.rechtspraak.nl/PI/OpenDataUitspraken.zip](https://static.rechtspraak.nl/PI/OpenDataUitspraken.zip).

2. We recursively extract them into a single folder. This script inflates nested zips into the current directory (this takes a while):

```shell
while [ "`find . -type f -name '*.zip' | wc -l`" -gt 0 ]
do
    find . -type f -name "*.zip" -exec unzip -- '{}' \; -exec rm -- '{}' \;
done
```

3. We find and remove all tiny files (matched with `-size -3k` in the command below); these only contain the boilerplate XML and have no content. Any leftover zips are deleted as well.

```shell
find . -name "*.xml" -type f -size -3k -delete
find . -name "*.zip" -type f -delete
```

4. We then move the XML files into subfolders per year. Since they are all individual files, this takes longer than you may wish. _Note: we do this because the whole package is too much data for a normal computer to handle at once. Since the file count increases with the years, a better option would perhaps be to split the files into chunks that fit into memory._

```shell
setopt extended_glob
zmodload zsh/files
setopt +o nomatch

for y in {2022..1905}
do
        # mkdir -p $y   # uncomment if the year folders do not exist yet
        mv *_${y}_* $y
        echo "Moved year $y"
done
```
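
For readers who prefer to stay in Python, steps 2 to 4 can be sketched with the standard library alone. This is a rough equivalent of the shell snippets above, not the code that was actually run: the function names are hypothetical, the `max_bytes=2048` default is an assumption meant to mirror `-size -3k`, and the `_YYYY_` filename pattern is inferred from the `*_${y}_*` glob.

```python
import re
import shutil
import zipfile
from pathlib import Path

# assumption: the year appears in each filename as _YYYY_, as the mv glob suggests
YEAR_IN_NAME = re.compile(r"_(\d{4})_")


def extract_nested_zips(root: Path) -> None:
    """Unzip every archive under root into root until no .zip files remain."""
    while True:
        archives = list(root.rglob("*.zip"))
        if not archives:
            break
        for archive in archives:
            with zipfile.ZipFile(archive) as zf:
                zf.extractall(root)
            archive.unlink()  # remove the archive once it has been extracted


def delete_small_xmls(root: Path, max_bytes: int = 2048) -> int:
    """Delete XML files of at most max_bytes bytes; return how many were removed."""
    removed = 0
    for xml in list(root.rglob("*.xml")):
        if xml.stat().st_size <= max_bytes:
            xml.unlink()
            removed += 1
    return removed


def move_into_year_folders(root: Path) -> None:
    """Move each XML file directly under root into a subfolder named after its year."""
    for xml in list(root.glob("*.xml")):
        match = YEAR_IN_NAME.search(xml.name)
        if match is None:
            continue  # no recognisable year in the filename, leave the file alone
        year_dir = root / match.group(1)
        year_dir.mkdir(exist_ok=True)
        shutil.move(str(xml), str(year_dir / xml.name))
```

To use it, call the three functions in order on the folder that holds the downloaded archive, for example `extract_nested_zips(Path('.'))`. Unlike the shell loop, this version only creates a year folder when at least one file belongs in it.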

## Pre-cleaning the data

### XML to CSV
All XML documents are mapped into CSV files in the next code cells. We keep only the relevant features (*identifier, date, case, title, short content, verdict, conclusion, category, location*). We tried different approaches to parse the files (lxml, Beautiful Soup, regular expressions, and more) and, after running some tests, went with Beautiful Soup for speed. A loop over all documents parses them one by one and saves each as a row of the table.

In [None]:
import os
from bs4 import BeautifulSoup as bs
import re

import pandas as pd
import numpy as np

In [None]:
BASE = '/point/this/at/the/folder/from/the/previous/step'

dir_xmls  = os.path.join(BASE, 'data','OpenDataUitspraken')
dir_save  = os.path.join(BASE, 'data','odu_csv')
dir_ORNL  = os.path.join(BASE, 'data','ORNL')
dir_ORNL8  = os.path.join(BASE, 'data','ORNL8')
dir_ORNL26  = os.path.join(BASE, 'data','ORNL26')

In [None]:
def soup_get_text(soup, tag):
    text = soup.find(tag)
    text = text.text if text else ""
    return text

def main_sub_split(t, part):
    split = t.split('; ')
    if part != 'sub':
        return split[0]
    else: 
        return split[1] if len(split) == 2 else None
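
As a quick illustration of what `main_sub_split` returns, the sketch below repeats the helper so it runs on its own and applies it to a few subject strings. The strings are illustrative assumptions about the `'<main>; <sub>'` format, not rows taken from the data.

```python
def main_sub_split(t, part):
    # same logic as the helper above, repeated so the example is self-contained
    split = t.split('; ')
    if part != 'sub':
        return split[0]
    return split[1] if len(split) == 2 else None


examples = [
    'Bestuursrecht; Ambtenarenrecht',  # main and sub category
    'Strafrecht',                      # main category only, so no sub category
    'A; B; C',                         # more than two parts also yields no sub category
]
for subject in examples:
    main = main_sub_split(subject, 'main')
    sub = main_sub_split(subject, 'sub')
    print(f'{subject!r}: main={main!r}, sub={sub!r}')
```

Only subjects with exactly two parts get a sub category; everything else returns `None`, which is what the later filtering step relies on.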

In [None]:
# turn all year folders with xmls into CSVs per year
os.makedirs(dir_save, exist_ok=True)

for year in sorted(os.listdir(dir_xmls)):
    # only process the per-year folders (e.g. '2010')
    if not year.isdigit():
        continue
    print(year)
    l = []
    dir_year = os.path.join(dir_xmls, year)
    for doc in sorted(os.listdir(dir_year)):
        if not doc.endswith('.xml'):
            continue
        filename = os.path.join(dir_year, doc)
        with open(filename) as f:
            soup = bs(f, 'xml')
        # one row per judgment; missing tags become empty strings
        d = {
            'zaak'         : soup_get_text(soup, 'psi:zaaknummer'),
            'identifier'   : soup_get_text(soup, 'dcterms:identifier'),
            'location'     : soup_get_text(soup, 'dcterms:spatial'),
            'category'     : soup_get_text(soup, 'dcterms:subject'),
            'shortcontent' : soup_get_text(soup, 'inhoudsindicatie'),
            'title'        : soup_get_text(soup, 'dcterms:title'),
            'date'         : soup_get_text(soup, 'dcterms:date'),
            'verdict'      : soup_get_text(soup, 'uitspraak'),
            'conclusion'   : soup_get_text(soup, 'conclusie'),
        }
        l.append(d)

    df = pd.DataFrame(l)
    save_path = os.path.join(dir_save,f'{year}.csv')
    df.to_csv(save_path, index=False)

In [None]:
# tests for reading and cleaning
# note: simple_clean_text is defined in the section "Making the dataset more compact"; run that cell first

year = 2010

df = pd.read_csv(os.path.join(dir_save,f'{year}.csv'), index_col=False)
df = df.astype(str)
df = df[df.category != 'nan']
# split the category field into a main and a sub category
df['main_category'] = df.category.apply(lambda x: main_sub_split(x, 'main'))
df['sub_category']  = df.category.apply(lambda x: main_sub_split(x, 'sub'))

df['shortcontent'] = df['shortcontent'].apply(simple_clean_text)
df['verdict'] = df['verdict'].apply(simple_clean_text)
df['conclusion'] = df['conclusion'].apply(simple_clean_text)

df_parts = []
df

### Retaining only instances with subcategories

Next we go through all CSVs again and keep only the cases that have both a main category and a subcategory.

In [None]:
df_parts = []  # one filtered DataFrame per year

for year in range(1905,2022+1):
    year_csv = os.path.join(dir_save,f'{year}.csv')
    if year % 10 == 0:
        print()
    if os.path.exists(year_csv):
        print('year', end=' ')
        df = pd.read_csv(year_csv, index_col=False)
        print(f'{year}', end=' | ')
        df = df.astype(str)
        df = df[df['category'] != 'nan']
        # the sub category is None when the category field has no '; ' separated second part
        df['sub_category']  = df['category'].apply(lambda x: main_sub_split(x, 'sub'))
        df = df[df['sub_category'].notna()]
        df_parts.append(df)

df = pd.concat(df_parts)
df.head()

In [None]:
save_path = os.path.join(dir_save,'all_subcategories_only.csv')
df.to_csv(save_path, index=False)

### Some graphs and statistics first, nothing really happens here

In [None]:
ez = df.copy(deep=True)
ez = ez[['sub_category', 'shortcontent', 'verdict', 'conclusion']]
ez['shc_c'] = ez.shortcontent.str.count(' ') + 1
ez['ver_c'] = ez.verdict.str.count(' ') + 1
ez['con_c'] = ez.conclusion.str.count(' ') + 1
ez = ez[['shc_c', 'ver_c', 'con_c']]

In [None]:
# text statistics

tab = ez.copy(deep=True)
tab = tab.replace(1, np.nan)
tab = tab.replace(0, np.nan)
tab = tab.replace(2, np.nan)
tab = tab.agg({
    'shc_c' : ['mean', 'std', 'max', 'min', 'count'],
    'ver_c' : ['mean', 'std', 'max', 'min', 'count'],
    'con_c' : ['mean', 'std', 'max', 'min', 'count'],
}).round(1)
tab

# # if you want them for a report uncomment
# tab = tab.to_latex()
# tab = tab.replace(".0","")
# print(tab)

# # as percentages
# tab.loc['in %'] = (tab.loc['count']/229172*100).round(1)
# tab

In [None]:
min_occurrences = 5_000
max_occurrences = 30_000

print(f'\ntotal samples with a sub-category:  {df.shape[0]}\n')

ss_count = df.groupby(by='sub_category', as_index=False).size()
# note: 'percentage' holds a fraction between 0 and 1
ss_count['percentage'] = round(ss_count['size'] / df.shape[0], 3)

print('sub-category occurrences:  \n', ss_count)

print()

# .copy() so that adding a column does not trigger a SettingWithCopyWarning
cropped_ss_count = ss_count[ss_count['size'] > min_occurrences].copy()
cropped_ss_count['percentage'] = round(cropped_ss_count['size'] / cropped_ss_count['size'].sum(), 3)

print(f'cropped > {min_occurrences} sub-category occurrences:  \n', cropped_ss_count)
print(f'\ntotal samples with a sub-category that has > {min_occurrences} occurrences:  \n', cropped_ss_count['size'].sum())

print()

lost_samples = ss_count['size'].sum() - cropped_ss_count['size'].sum()

# :.2% multiplies the fraction by 100 and appends the percent sign
print(f'{lost_samples} ({lost_samples / ss_count["size"].sum():.2%}) samples lost by filtering ')

print()

## Making the dataset more compact.

Since the dataset is too large to use reliably, we create two separate, smaller datasets (with overlap):
- `ORNL8`, where we sample up to 30k texts from each subcategory that has more than 5k entries. Texts from smaller subcategories are dropped. This leaves a total of 8 subcategories.
- `ORNL26`, where we sample up to 30k texts from *all* subcategories, leaving us with 26 distinct subcategories.
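
The selection rule behind both variants can be made explicit with a small sketch. It uses plain Python lists instead of a DataFrame so that it runs without pandas; the function name, the toy thresholds, and the `seed` parameter are hypothetical and only serve the illustration.

```python
import random
from collections import defaultdict


def cap_per_category(rows, min_count=0, max_count=30_000, seed=0):
    """Keep categories with more than min_count rows, capped at max_count rows each.

    rows is a list of (sub_category, text) tuples. seed is a hypothetical
    addition that makes the random sampling repeatable.
    """
    by_category = defaultdict(list)
    for category, text in rows:
        by_category[category].append((category, text))

    rng = random.Random(seed)
    kept = []
    for category in sorted(by_category):
        group = by_category[category]
        if len(group) <= min_count:
            continue  # category too small, drop it entirely
        if len(group) > max_count:
            group = rng.sample(group, max_count)  # random subset without replacement
        kept.extend(group)
    return kept


# toy data: 6 rows of 'A', 3 rows of 'B', 1 row of 'C'
rows = [('A', f'a{i}') for i in range(6)] + [('B', f'b{i}') for i in range(3)] + [('C', 'c0')]

ornl8_like = cap_per_category(rows, min_count=2, max_count=4)   # drops 'C', caps 'A' at 4
ornl26_like = cap_per_category(rows, min_count=0, max_count=4)  # keeps every category
print(len(ornl8_like), len(ornl26_like))  # 7 8
```

With the real thresholds, the `ORNL8` rule corresponds to `min_count=5_000` and the `ORNL26` rule to `min_count=0`, both with `max_count=30_000`.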

We also get rid of all columns except `sub_category`, `title`, `text`, and `label` (the numerically encoded `sub_category`). Since a case has either a verdict or a conclusion, and sometimes a short content, we merge them into one text field: `text` becomes the optional `shortcontent` followed by either `conclusion` or `verdict`.

**IMPORTANT** In this step we drop a lot of each case's text. Our experiments can only realistically handle inputs of up to roughly 512 tokens, so we retain only the first 512 words of each `verdict` and `conclusion`.
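
The truncation itself is a split on single spaces followed by a join. The sketch below wraps that expression in a helper with a hypothetical name (`truncate_words`) to show what it does and one caveat of splitting on `' '` rather than on arbitrary whitespace.

```python
def truncate_words(text, max_words=512):
    """Keep the first max_words chunks of text, splitting on single spaces."""
    return ' '.join(text.split(' ')[:max_words])


long_text = ' '.join(f'w{i}' for i in range(1000))
short = truncate_words(long_text)
print(len(short.split(' ')))  # 512

# caveat: consecutive spaces produce empty "words" that count towards the limit
print(repr(truncate_words('a  b c', max_words=2)))  # 'a '
```

Note that 512 space-separated words will usually tokenize to more than 512 model tokens, so a tokenizer may still truncate further downstream.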

In [None]:
newlines = re.compile('(\\n){1,}')

def simple_clean_text(t):
    t = re.sub(newlines, ' \n ', t)
    return t.replace('\t',' ').strip()

### ORNL8

In [None]:
subcats = df.sub_category.unique()
ls_small_df = []

# keep only the big categories
sc_count = df.groupby(by='sub_category', as_index=False).size()
big_subcats = sc_count[sc_count['size'] > 5000].sub_category.tolist()
big_subcats.sort()

# cap every remaining category at 30k randomly sampled rows
max_occurences_cat = 30_000
for subcategory in big_subcats:
    dfx = df[df['sub_category'] == subcategory]
    if dfx.shape[0] > max_occurences_cat:
        dfx = dfx.sample(n=max_occurences_cat)
    print(dfx.shape)
    ls_small_df.append(dfx)
    
df_small = pd.concat(ls_small_df, ignore_index=True)
df_small = df_small.astype(str)

# fix word max and whitespace
df_small['verdict'] = df_small['verdict'].apply(lambda x: ' '.join(x.split(' ')[:512]))
df_small['conclusion'] = df_small['conclusion'].apply(lambda x: ' '.join(x.split(' ')[:512]))
# do some very simple cleaning
df_small['verdict'] = df_small['verdict'].apply(simple_clean_text)
df_small['conclusion'] = df_small['conclusion'].apply(simple_clean_text)

# note: after astype(str) a missing field is the literal string 'nan',
# and the three parts are joined without a separator
df_small['text'] = (df_small.shortcontent + df_small.conclusion + df_small.verdict)

# .copy() so that adding the label column does not trigger a SettingWithCopyWarning
ornl8 = df_small[['sub_category', 'title', 'text']].copy()
ornl8['label'] = pd.Categorical(ornl8.sub_category)
ornl8['label'] = ornl8.label.cat.codes + 1

### ORNL26

In [None]:
subcats = df.sub_category.unique()
ls_small_df = []

# cap every category at 30k randomly sampled rows; no category is dropped
max_occurences_cat = 30_000
for subcategory in subcats:
    dfx = df[df['sub_category'] == subcategory]
    if dfx.shape[0] > max_occurences_cat:
        dfx = dfx.sample(n=max_occurences_cat)
    print(dfx.shape)
    ls_small_df.append(dfx)
    
df_small = pd.concat(ls_small_df, ignore_index=True)
df_small = df_small.astype(str)

# fix word max and whitespace
df_small['verdict'] = df_small['verdict'].apply(lambda x: ' '.join(x.split(' ')[:512]))
df_small['conclusion'] = df_small['conclusion'].apply(lambda x: ' '.join(x.split(' ')[:512]))
# do some very simple cleaning
df_small['verdict'] = df_small['verdict'].apply(simple_clean_text)
df_small['conclusion'] = df_small['conclusion'].apply(simple_clean_text)

# note: after astype(str) a missing field is the literal string 'nan',
# and the three parts are joined without a separator
df_small['text'] = (df_small.shortcontent + df_small.conclusion + df_small.verdict)

# .copy() so that adding the label column does not trigger a SettingWithCopyWarning
ornl26 = df_small[['sub_category', 'title', 'text']].copy()
ornl26['label'] = pd.Categorical(ornl26.sub_category)
ornl26['label'] = ornl26.label.cat.codes + 1
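
The `label` column in the cells above comes from `pd.Categorical(...).cat.codes + 1`. For string values pandas sorts the distinct categories, so each label should be the 1-based position of the subcategory in sorted order. A plain-Python sketch of the same mapping, using a hypothetical helper name and a few made-up inputs:

```python
def encode_labels(sub_categories):
    """Map each distinct name to its 1-based position in sorted order."""
    mapping = {name: i + 1 for i, name in enumerate(sorted(set(sub_categories)))}
    return [mapping[name] for name in sub_categories], mapping


labels, mapping = encode_labels(['Belasting', 'Arbeids', 'Vreemdelingen', 'Arbeids'])
print(labels)   # [2, 1, 3, 1]
print(mapping)  # {'Arbeids': 1, 'Belasting': 2, 'Vreemdelingen': 3}
```

Because the mapping depends on which subcategories are present, the same subcategory can get a different label in `ORNL8` than in `ORNL26`.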

In [None]:
# # classes8 obtained by
# sc_count = df.groupby(by='sub_category', as_index=False).size()
# big_subcats = sc_count[sc_count['size'] > 5000].sub_category.tolist()
# big_subcats.sort()
# big_subcats

classes8 = [
'Ambtenaren',
'Arbeids',
'Belasting',
'Omgevings',
'Personen- en familie',
'Socialezekerheids',
'Verbintenissen',
'Vreemdelingen',
]

classes26 = [
'Aanbestedings',
'Ambtenaren',
'Arbeids',
'Belasting',
'Bestuursproces',
'Bestuursstraf',
'Burgerlijk proces',
'Europees bestuurs',
'Europees civiel',
'Europees straf',
'Goederen',
'Insolventie',
'Intellectueel-eigendoms',
'Internationaal privaat',
'Internationaal straf',
'Materieel straf',
'Mededingings',
'Omgevings',
'Ondernemings',
'Penitentiair straf',
'Personen- en familie',
'Socialezekerheids',
'Strafproces',
'Verbintenissen',
'Volken',
'Vreemdelingen',
]

## Train/Test Split
We have made the split and cleaned datasets available on Hugging Face:

```python
import datasets
df_ornl8  = datasets.load_dataset('Rodekool/ornl8')
df_ornl26 = datasets.load_dataset('Rodekool/ornl26')
```

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
ornl_train, ornl_test = train_test_split(ornl26, train_size = 0.8)
# ornl_train, ornl_test = train_test_split(ornl8, train_size = 0.8)

In [None]:
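os.makedirs(dir_ORNL26, exist_ok=True)

# no header row; columns are written in the order sub_category, title, text, label
save_path = os.path.join(dir_ORNL26,'train.csv')
ornl_train.to_csv(save_path, index=False, header=False)

save_path = os.path.join(dir_ORNL26,'test.csv')
ornl_test.to_csv(save_path, index=False, header=False)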
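
Because the files are written with `header=False` and the cleaned text contains newlines, reading them back needs explicit column names and a real CSV parser rather than line splitting. A minimal sketch with the standard `csv` module follows; the column order is taken from the cells above, and the sample row is made up for illustration.

```python
import csv
import io

# column order as produced by the cells above
COLUMNS = ['sub_category', 'title', 'text', 'label']

# stand-in for train.csv: no header row, and a newline inside the quoted text field
sample = io.StringIO('Arbeids,Some title,"first line \n second line",2\n')

rows = [dict(zip(COLUMNS, record)) for record in csv.reader(sample)]
print(len(rows))                # 1
print(repr(rows[0]['text']))    # 'first line \n second line'
print(rows[0]['label'])         # 2 (as a string)
```

For the real files, open them with `open(path, newline='')` before passing the handle to `csv.reader`, as the `csv` documentation recommends, and convert `label` to `int` if needed.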