# Legal concept-example system

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/apohllo/jurix-2023/blob/main/legal-concepts.ipynb)

If running in Colab:
1. Copy `requirements.txt` to the main dir.
2. Create `data/` dir.
3. Copy `questions.jsonl` to `data/`.
4. Create `data/gdprhub/` dir.

Please note that the sentence search takes approx. 1 h on a V100.

## Stage 1 - processing of decisions from GDPRHub

### Download decisions from GDPRHub

In [2]:
! pip install -r requirements.txt

Collecting inflect==7.0.0 (from -r requirements.txt (line 5))
  Downloading inflect-7.0.0-py3-none-any.whl (34 kB)
Collecting matplotlib==3.7.2 (from -r requirements.txt (line 6))
  Downloading matplotlib-3.7.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.6 MB)
Collecting pandas==2.0.3 (from -r requirements.txt (line 7))
  Downloading pandas-2.0.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.2 MB)
Collecting scikit-learn==1.3.0 (from -r requirements.txt (line 8))
  Downloading scikit_learn-1.3.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (10.9 MB)

Only decisions related to Articles 44-46 of the GDPR are taken into account.

In [10]:
import requests
import tqdm
import re
import os

API_ENPOINT="https://gdprhub.eu/api.php"

category_titles = [("Category:Article_44_GDPR","44"),  
                   ("Category:Article_45_GDPR","45"),
                   ("Category:Article_45(1)_GDPR","45"),
                   ("Category:Article_45(2)_GDPR","45"),
                   ("Category:Article_45(3)_GDPR","45"),
                   ("Category:Article_45(4)_GDPR","45"),
                   ("Category:Article_45(5)_GDPR","45"),
                   ("Category:Article_45(6)_GDPR","45"),
                   ("Category:Article_45(7)_GDPR","45"),
                   ("Category:Article_45(8)_GDPR","45"),
                   ("Category:Article_45(9)_GDPR","45"),
                   ("Category:Article_46_GDPR","46"),
                   ("Category:Article_46(1)_GDPR","46"),
                   ("Category:Article_46(2)_GDPR","46"),
                   ("Category:Article_46(3)_GDPR","46"),
                   ("Category:Article_46(4)_GDPR","46"),
                   ("Category:Article_46(5)_GDPR","46"),
                  ]

In [9]:
def get_text(page_id):
    response = requests.get(API_ENPOINT + f"?action=query&pageids={page_id}&format=json&prop=revisions&rvslots=*&rvprop=content&formatversion=2")
    json_data = response.json()
    return re.sub(r"\\n", "\n", json_data['query']['pages'][0]['revisions'][0]['slots']['main']['content'])

The API paginates the result, so we have to walk through all pages to get all relevant decisions. 

Please note that some of the decisions are duplicated, as they may belong to multiple categories.
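
The pagination and de-duplication described above can be sketched without touching the network. This is a minimal sketch rather than the notebook's code: `fetch_page` is a hypothetical callable standing in for the `requests.get` call, and the canned responses are made up. Only the `query.categorymembers`, `continue.cmcontinue`, `pageid` and `ns` fields mirror what the notebook reads from the API. The notebook itself skips duplicates later, by file name, when sentences are extracted.

```python
def iter_category_members(fetch_page, category_title):
    """Yield every member of a category, following continuation tokens.

    fetch_page(category_title, token) is a hypothetical helper that returns
    one decoded JSON page of a `list=categorymembers` query.
    """
    token = None
    while True:
        data = fetch_page(category_title, token)
        yield from data["query"]["categorymembers"]
        if "continue" not in data:
            break  # last page reached
        token = data["continue"]["cmcontinue"]


def unique_page_ids(fetch_page, category_titles):
    """Collect page ids across categories, keeping each id only once."""
    seen = set()
    ordered = []
    for title in category_titles:
        for member in iter_category_members(fetch_page, title):
            # ns == 0 marks a regular page, as in the notebook
            if member["ns"] == 0 and member["pageid"] not in seen:
                seen.add(member["pageid"])
                ordered.append(member["pageid"])
    return ordered


# Made-up responses: two pages for one category, one page for another.
FAKE_PAGES = {
    ("Category:A", None): {
        "query": {"categorymembers": [{"pageid": 1, "ns": 0}, {"pageid": 2, "ns": 0}]},
        "continue": {"cmcontinue": "page|2"},
    },
    ("Category:A", "page|2"): {
        "query": {"categorymembers": [{"pageid": 3, "ns": 14}]},
    },
    ("Category:B", None): {
        "query": {"categorymembers": [{"pageid": 2, "ns": 0}, {"pageid": 4, "ns": 0}]},
    },
}


def fake_fetch(category_title, token):
    """Stand-in for the HTTP request; looks up a canned response."""
    return FAKE_PAGES[(category_title, token)]


page_ids = unique_page_ids(fake_fetch, ["Category:A", "Category:B"])
print(page_ids)
```

Passing the fetch function in as an argument keeps the continuation logic testable offline; in a real run it would wrap the request to the GDPRHub endpoint.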

In [11]:
for category_title, category_id in category_titles:
    items = []
    continue_query = ""
    while(True):
        print(f"Processing {category_title} {category_id}")
        response = requests.get(API_ENPOINT + f"?action=query&prop=categories&format=json&list=categorymembers&cmtitle={category_title}{continue_query}")
        json_data = response.json()
        for item in json_data['query']['categorymembers']:
            items.append(item)

        if('continue' not in json_data):
            break

        continue_id = json_data['continue']['cmcontinue']
        continue_query = f"&cmcontinue={continue_id}"
    for item in tqdm.tqdm(items):
        if(item['ns'] == 0):
            # regular page
            id = item['pageid']
            directory = f"data/gdprhub/art-{category_id}/"
            # Create the target directory (and any missing parents) if needed
            os.makedirs(directory, exist_ok=True)
            with open(directory + f"{id}.txt", "w") as output:
                output.write(get_text(id))

Processing Category:Article_44_GDPR 44
Processing Category:Article_44_GDPR 44
Processing Category:Article_44_GDPR 44
Processing Category:Article_44_GDPR 44


100%|██████████| 31/31 [00:15<00:00,  2.00it/s]


Processing Category:Article_45_GDPR 45
Processing Category:Article_45_GDPR 45
Processing Category:Article_45_GDPR 45


100%|██████████| 23/23 [00:07<00:00,  3.13it/s]


Processing Category:Article_45(1)_GDPR 45


0it [00:00, ?it/s]


Processing Category:Article_45(2)_GDPR 45


100%|██████████| 3/3 [00:00<00:00, 64860.37it/s]


Processing Category:Article_45(3)_GDPR 45


100%|██████████| 2/2 [00:00<00:00,  2.41it/s]


Processing Category:Article_45(4)_GDPR 45


0it [00:00, ?it/s]


Processing Category:Article_45(5)_GDPR 45


0it [00:00, ?it/s]


Processing Category:Article_45(6)_GDPR 45


0it [00:00, ?it/s]


Processing Category:Article_45(7)_GDPR 45


0it [00:00, ?it/s]


Processing Category:Article_45(8)_GDPR 45


0it [00:00, ?it/s]


Processing Category:Article_45(9)_GDPR 45


0it [00:00, ?it/s]


Processing Category:Article_46_GDPR 46
Processing Category:Article_46_GDPR 46
Processing Category:Article_46_GDPR 46
Processing Category:Article_46_GDPR 46


100%|██████████| 34/34 [00:13<00:00,  2.58it/s]


Processing Category:Article_46(1)_GDPR 46


100%|██████████| 4/4 [00:02<00:00,  1.82it/s]


Processing Category:Article_46(2)_GDPR 46


100%|██████████| 7/7 [00:00<00:00, 13.10it/s]


Processing Category:Article_46(3)_GDPR 46


100%|██████████| 2/2 [00:00<00:00, 22982.49it/s]


Processing Category:Article_46(4)_GDPR 46


0it [00:00, ?it/s]


Processing Category:Article_46(5)_GDPR 46


0it [00:00, ?it/s]


### Extract content of decisions

In [15]:
def extract_structure(text):
    section = []
    structure = {"preamble": section}
    
    for line in text:
        if(re.match(r"^={1,3}[^=]", line)):
            match = re.match(r"^={1,3}([^=]+)={1,3}", line)
            section_name = match[1].strip()
            section = []
            structure[section_name] = section
        else:
            section.append(line)

    return structure
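
To illustrate what the section splitting produces, here is a small self-contained sketch. `split_sections` is a simplified restatement of `extract_structure` written for this example (it treats a malformed heading as ordinary content), and the sample lines are hypothetical, not taken from a real GDPRHub page.

```python
import re


def split_sections(lines):
    """Group lines under the most recent wiki-style heading (= to ===)."""
    current = []
    sections = {"preamble": current}
    for line in lines:
        match = re.match(r"^={1,3}([^=]+)={1,3}", line)
        if match:
            # Start a new section named after the heading text
            current = []
            sections[match[1].strip()] = current
        else:
            current.append(line)
    return sections


# Hypothetical page fragment; real pages are longer.
sample = [
    "The DPA fined the controller.",
    "=== Facts ===",
    "The controller transferred data to a third country.",
    "=== Holding ===",
    "The DPA held that the transfer lacked appropriate safeguards.",
    "It imposed a fine.",
]

sections = split_sections(sample)
print(list(sections))
print(sections["Holding"])
```

Lines that appear before the first heading end up under the `preamble` key, which is why later cells can address sections by name, for example `['description', 'Holding']`.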

In [16]:
def extract_parts(text):
    lines = text.split("\n")

    infobox = []
    description = []
    translation = []
    # 0 init state
    # 1 infobox
    # 2 description
    # 3 translation
    state = 0 

    for line in lines:
        if(len(line) == 0):
            continue

        if(re.match(r"^{{", line)):
            state = 1
        elif(re.match(r"^[|]?}}", line)):
            state = 2
            continue
        elif(re.match(r"^<pre>", line)):
            state = 3
            continue
        elif(re.match(r"<\/pre>", line)):
            state = 0

        if(state == 1):
            infobox.append(line)
        elif(state == 2):
            description.append(line)
        elif(state == 3):
            translation.append(line)


    return {"infobox": infobox, "description": extract_structure(description), "translation": translation}

In [17]:
def get_sentences(path, processor, key_path):
    text = ""
    with open(path) as input:
        text = input.read()
    parts = extract_parts(text)
    
    item = parts
    for key in key_path:
        try:
            item = item[key]
        except KeyError:
            return []
    return processor.extract_sentences(item).sentences

In [8]:
import stanza 

class StanzaProcessor:
    def __init__(self):
        self.pipeline = stanza.Pipeline(lang='en', processors='tokenize')

    def extract_sentences(self, text):
        return self.pipeline(" ".join(text))


Extract individual sentences from the GDPRHub decisions. We take into account only `Holding` and `Facts`.

In [18]:
import glob

files = {}
file_names = set()
processor = StanzaProcessor()

2023-09-28 12:55:59 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json:   0%|   …

2023-09-28 12:55:59 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |

2023-09-28 12:55:59 INFO: Using device: cpu
2023-09-28 12:55:59 INFO: Loading: tokenize
2023-09-28 12:55:59 INFO: Done loading processors!


In [19]:

for category_id in ["44", "45", "46"]:
    for idx, fname in enumerate(glob.glob(f"data/gdprhub/art-{category_id}/*.txt")):
        file_name = fname.split("/")[-1]
        if file_name in file_names:
            continue
        file_names.add(file_name)
        files[fname] = []
        files[fname][0:0] = list(get_sentences(fname, processor, ['description', 'Holding']))
        files[fname][0:0] = list(get_sentences(fname, processor, ['description', 'Facts']))

sentence_objects = {}
sentences = []
for idx,fname in enumerate(files):
    print(fname, len(files[fname]))
    sentences[0:0] = [s.text for s in files[fname]]
    sentence_objects.update({(s.text, (s,fname,i)) for i,s in enumerate(files[fname]) if s.text not in sentence_objects})
    
    
print(len(sentences))
print(len(sentence_objects))

data/gdprhub/art-44/4253.txt 33
data/gdprhub/art-44/5996.txt 24
data/gdprhub/art-44/6091.txt 20
data/gdprhub/art-44/4186.txt 10
data/gdprhub/art-44/6174.txt 26
data/gdprhub/art-44/5093.txt 44
data/gdprhub/art-44/3241.txt 29
data/gdprhub/art-44/6092.txt 18
data/gdprhub/art-44/3408.txt 14
data/gdprhub/art-44/5938.txt 18
data/gdprhub/art-44/3233.txt 16
data/gdprhub/art-44/5399.txt 18
data/gdprhub/art-44/5914.txt 34
data/gdprhub/art-44/5359.txt 64
data/gdprhub/art-44/3423.txt 26
data/gdprhub/art-44/6198.txt 19
data/gdprhub/art-44/4486.txt 41
data/gdprhub/art-44/4122.txt 20
data/gdprhub/art-44/5110.txt 30
data/gdprhub/art-44/6087.txt 22
data/gdprhub/art-44/2804.txt 44
data/gdprhub/art-44/4628.txt 32
data/gdprhub/art-44/5716.txt 49
data/gdprhub/art-44/3180.txt 27
data/gdprhub/art-44/3240.txt 7
data/gdprhub/art-44/5526.txt 36
data/gdprhub/art-44/6090.txt 20
data/gdprhub/art-44/5627.txt 28
data/gdprhub/art-44/5743.txt 52
data/gdprhub/art-44/5028.txt 25
data/gdprhub/art-44/5953.txt 12
data/gdpr

## Stage 2 - find sentences that match the questions (slow!)

Load the questions generated by ChatGPT.

In [23]:
import json

path = "data/"

items = []

with open(f"{path}questions.jsonl") as input:
    for idx, line in enumerate(input):
        if(len(line.strip()) == 0):
            continue
        items.append(json.loads(line))

print(f"Number of questions: {len(items)}")
        

Number of questions: 55


Download and load the model trained in the experiment. This is ALBERT-xxlarge-v1 fine-tuned on SQuAD 2.0 sentences.

In [25]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer


model = AutoModelForSequenceClassification.from_pretrained("apohllo/albert-xxl-squad-sentences", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("apohllo/albert-xxl-squad-sentences")

Downloading (…)lve/main/config.json:   0%|          | 0.00/914 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/890M [00:00<?, ?B/s]

Downloading spiece.model:   0%|          | 0.00/760k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.31M [00:00<?, ?B/s]

In [26]:
from transformers import pipeline

# Add device=0 if you want to use GPU!
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer, batch_size=16) #, device=0)

In [24]:
def top_results(question, sentences, classifier, top_k=5):
    samples = [{"text": s, "text_pair": question} for s in sentences]
    results = classifier(samples)
    
    results = [(idx, r["score"]) if r["label"] == 'LABEL_1' else (idx, 1 - r["score"]) 
            for idx, r in enumerate(results)]
    
    keys_values = sorted(results, key=lambda e: -e[1])[:top_k]
    return [(v,sentences[k]) for k,v in keys_values]
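
The score handling in `top_results` can be checked in isolation. By default a text-classification pipeline reports only the winning label and its score, so with two labels the probability of the other label is `1 - score`. The sketch below assumes, as the cell above does, that `LABEL_1` means "the sentence answers the question". The function names and the classifier outputs are made up for illustration.

```python
def positive_probability(result):
    """Map a {'label', 'score'} dict to the probability of LABEL_1."""
    if result["label"] == "LABEL_1":
        return result["score"]
    # Only the winning label is reported; the other one gets the remainder.
    return 1.0 - result["score"]


def rank_sentences(results, sentences, top_k=5):
    """Return (probability, sentence) pairs, best first."""
    scored = [(positive_probability(r), s) for r, s in zip(results, sentences)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


# Made-up outputs, shaped like text-classification pipeline results.
fake_results = [
    {"label": "LABEL_0", "score": 0.75},
    {"label": "LABEL_1", "score": 0.75},
    {"label": "LABEL_0", "score": 0.5},
]
fake_sentences = ["sentence A", "sentence B", "sentence C"]

ranking = rank_sentences(fake_results, fake_sentences, top_k=2)
print(ranking)
```

Converting every result to the probability of the same label is what makes the scores comparable across sentences before sorting.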

On a CPU this will run for hours! ALBERT-XXL is a fairly large model.

In [None]:
with open(f"{path}sentences.jsonl", "w") as output:
    for item in tqdm.tqdm(items):
        results = top_results(item["question"], sentences, classifier, top_k=10)
        results = [{"score":v, "sentence":s} for v,s in results]
        item["sentences"] = results
        output.write(json.dumps(item) + "\n")

## Stage 3 - answer the questions using Flan-T5-large

In [29]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-large"

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
generator = pipeline("text2text-generation", model=model, tokenizer=tokenizer, batch_size=16) #, device=0)

Downloading (…)lve/main/config.json:   0%|          | 0.00/662 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/3.13G [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

Downloading spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

In [None]:
data = []
with open("data/sentences.jsonl") as input:
    for line in input:
        data.append(json.loads(line.strip()))

In [None]:
context_length = 1

# We have selected some sentences below the threshold to see how the model works for them
additional_sentences = set([(16,0), (25,0), (39,0), (41,0)])

# The threshold was selected to get equal error rate
threshold = 0.65

with open("data/answers.jsonl", "w") as json_output:
    for i_idx, item in tqdm.tqdm(enumerate(data)):
        for s_idx, sentence in enumerate(item['sentences']):
            if sentence['score'] > threshold or (i_idx + 1, s_idx) in additional_sentences:
                print(i_idx, s_idx, "%.3f" % sentence['score'])
                sentence_object, fname, sentence_index = sentence_objects[sentence['sentence']]
                context = []
                print(fname, sentence_index)
                if sentence_index - context_length >= 0:
                    for i in range(context_length):
                        context.append(files[fname][sentence_index - context_length + i])
                context.append(sentence_object)
                if sentence_index + context_length < len(files[fname]):
                    for i in range(context_length):
                        context.append(files[fname][sentence_index + i + 1])

                context_text = ""
                for idx,sentence in enumerate(context):
                    context_text += sentence.text + " "

                # One output record per (question, context) pair;
                # named `record` to avoid shadowing the built-in `tuple`
                record = {}
                record["concept"] = item['concept']
                record["question"] = item['question']
                record["context"] = context_text
                prompt = f"Given the information: \"{context_text}\" answer the following question: {item['question']}"
                answer = generator(prompt)[0]['generated_text']
                record["answer"] = answer
                json_output.write(json.dumps(record) + "\n")
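
The comment in the cell above says the 0.65 threshold was selected to get an equal error rate, that is, the point where the false-positive rate and the false-negative rate coincide. The notebook does not show how this was computed. The following is one possible way to find such a threshold from manually labelled scores; the function name and the toy scores are hypothetical.

```python
def equal_error_threshold(scores, labels):
    """Return (threshold, fpr, fnr) where |FPR - FNR| is smallest.

    scores: classifier scores; labels: True when the sentence is relevant.
    A sentence is accepted when score > threshold, as in the cell above.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    best = None
    for threshold in sorted(set(scores)):
        # Share of irrelevant sentences that would still be accepted
        fpr = sum(s > threshold for s in negatives) / len(negatives)
        # Share of relevant sentences that would be rejected
        fnr = sum(s <= threshold for s in positives) / len(positives)
        gap = abs(fpr - fnr)
        if best is None or gap < best[0]:
            best = (gap, threshold, fpr, fnr)
    return best[1:]


# Toy scores, invented for illustration only.
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_labels = [False, False, True, False, True, False, True, True]

threshold, fpr, fnr = equal_error_threshold(toy_scores, toy_labels)
print(threshold, fpr, fnr)
```

Only observed scores are tried as candidate thresholds, which is enough because the error rates change only at those values.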

## Stage 4 - summary

In [7]:

concepts = []
with open("data/answers.jsonl") as input:
    for line in input:
        data = json.loads(line)
        concept = data["concept"]
        if(len(concepts) > 0 and concepts[-1]["concept"] == concept):
            concepts[-1]["examples"].append({"example": data["context"], "answer": data["answer"]})
        else:
            concepts.append({"concept": concept, "examples": [{"example": data["context"], "answer": data["answer"]}]})
        

In [13]:
import textwrap

for item in concepts:
    print("=" * 30)
    print("** " + item["concept"] + " **")
    for example in item["examples"]:
        print("")
        print(textwrap.fill(example['example'], 80))
    

** Enforceable data subject rights **

It also appeared that the company's files contained several excessive comments
related to customers or their health conditions. In addition, people were not
properly informed about the processing of their personal data, or about the
recording of the conversations they had with the company. In total, following
its investigations the CNIL found five breaches of the GDPR: -         Violation
of the right to object, Article 21(2) GDPR: no procedure was implemented to
ensure effectively that persons who opposed telephone solicitation were no
longer called); -         Violation of the principle of data minimization,
Article 5(1)(c) GDPR: inadequate and offensive comments or irrelevant comments
related to people's health were found in the company's customer file; -        
Violation of Articles 12 and 13 GDPR: insufficient information on the processing
of data subject’s personal data and their rights; -         Violation of
Articles 46 and 49 GDPR:  the 