# Fine-tuning experiment: Extracting structured data on German law journal editors from website text

Based on https://github.com/ml-explore/mlx-examples/tree/main/lora

Hardware: Mac mini 2023 (M2, 16 GB RAM)

## Preparation

### Download website data

This downloads content only for journals that were added to the list or whose previously downloaded files have been deleted. To overwrite existing files, set `overwrite=True`.

In [2]:
from lib.prepare_training_data import download_input_data
download_input_data(input_file='data/editors.csv', 
                    output_dir='data/website-data', 
                    overwrite=False)

Downloading Content:   0%|          | 0/130 [00:00<?, ?it/s]

Downloaded 0 web pages.
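
The skip-unless-missing behaviour described above can be sketched as follows. This is a minimal illustration, not the code in `lib.prepare_training_data`: the function name `download_missing`, the `(name, url)` pairs, the `.txt` file naming and the injectable `fetch` callable are all assumptions made for the example.

```python
from pathlib import Path
from typing import Callable, Iterable, Tuple


def download_missing(pages: Iterable[Tuple[str, str]],
                     output_dir: str,
                     fetch: Callable[[str], str],
                     overwrite: bool = False) -> int:
    """Fetch each (name, url) pair unless its file already exists.

    Returns the number of pages that were actually downloaded.
    Hypothetical helper, not the notebook's real implementation.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    downloaded = 0
    for name, url in pages:
        target = out / f"{name}.txt"
        if target.exists() and not overwrite:
            continue  # already cached: skip the network request
        target.write_text(fetch(url), encoding="utf-8")
        downloaded += 1
    return downloaded
```

With a cache like this, a second run over an unchanged journal list returns 0, which is consistent with the "Downloaded 0 web pages." message above.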


### Prompt and test data for all experiments


In [1]:
system_message ="""
You are a text processing agent. As instructed below, extract information from the provided content in a structured format without discussing reasoning or providing commentary. Only use source text given as input for data extraction unless specifically asked for inference.
"""

instruction = """
Analyze content from a German law journal's website. Your task is to identify members of the editorial board (terms to look for: 'Herausgeber', 'Redakteur', 'Schriftleitung') and the advisory board ('Beirat'). For each identified member, extract and organize their information into the following categories: lastname, firstname, title (including academic titles like 'Dr.' or 'Prof. Dr.' and suffixes such as 'LL.M.'), position (their job title, if provided), affiliation, and role. For 'role', infer the role within the journal from the context (options 'Herausgeber', 'Redaktion', 'Schriftleitung', 'Beirat', or an empty string if the role is unknown).

- Format the output as a YAML list of dictionaries.
- Exclude any dictionary entries for which information is not available or relevant fields are empty.
- Ensure the YAML output is strictly valid. It must be a list of dictionaries. 
"""

example = """
Here is an example:

```yaml
- lastname: Mustermann
  firstname: Martina
  title: Dr.
  position: Vorsitzende Richterin
  affiliation: Oberlandesgericht Buxtehude
  role: Herausgeber
```
"""

epilog="""
Adhere to these guidelines to efficiently and accurately process the following content:
"""

test_data = """
Herausgeber:
Prof. Dr. Stefan Knesebeck, Universität Wuppertal
Prof. Dr. Dr. h.c. Fritz M. Müller LL.M.(Yale), Universität Wanne-Eickel
RA Prof. Dr. Vera Valentin, Hochschule für Recht und Sport Edingen
Prof. Dr. Dr. h.c. Rita Rosenbaum, Universität Tupfingen
Dr. Ingo Gonzalo de Sanchez, Vorsitzender Richter am Oberlandesgericht Rostock
Redaktion:
RA Adam Gengelbach, Unterhachingen
Ass. iur. Petra Priem, Herrenchiemsee
"""


## mistralai/Mistral-7B-Instruct-v0.2

### Generate training, testing and validation files

In [2]:
from lib.prepare_training_data import create_training_file

mistral_ft_instruction = f"""
# instruction
{system_message}
# user
{instruction}
{epilog}
# content
"""

# the template function receives the instruction, the content to be analyzed, and the expected answer
def template_fn(instruction: str, content: str, answer: str):
    return f'<s>[INST]{instruction}{content}[/INST]{answer}</s>'

create_training_file(instruction=mistral_ft_instruction,
                     template_func=template_fn,
                     input_file='data/editors/editors.csv', 
                     output_dir='data/editors/mistral', 
                     content_dir='data/editors/website-data',
                     max_chars=6000, max_gt_items=5,
                     record_identifier_col="journal_abbr",
                     cols_to_remove = ['journal_abbr', 'website', 'retrieved_on'],
                     column_to_filter_by='lastname',
                     lines_before=2, lines_after=2)

Length of generated sequences:
 - max: 5550
 - avg: 2259.182608695652
Longest sequences:
DivRuW: 5550
JurBüro: 5051
AVR: 4366
APR: 4350
AusR: 4244
BKK: 4078
DÖD: 3818
EuZW: 3786
HRN: 3467
AuAS: 3272
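
For orientation, the LoRA example in mlx-examples reads `train.jsonl`, `valid.jsonl` and `test.jsonl` from the `--data` directory, with one JSON object per line carrying a `text` key. The sketch below shows how templated samples could be written in that layout. The `write_splits` helper, its split fractions and the seed are assumptions for illustration; the actual `create_training_file` may split and filter differently.

```python
import json
import random
from pathlib import Path


def template_fn(instruction: str, content: str, answer: str) -> str:
    # Same Mistral instruct layout as the template function in this notebook.
    return f"<s>[INST]{instruction}{content}[/INST]{answer}</s>"


def write_splits(samples, output_dir, valid_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle (instruction, content, answer) samples and write JSONL splits.

    Hypothetical helper: fractions and seed are illustrative defaults.
    """
    rows = [{"text": template_fn(*s)} for s in samples]
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_frac))
    n_valid = max(1, int(len(rows) * valid_frac))
    splits = {
        "test": rows[:n_test],
        "valid": rows[n_test:n_test + n_valid],
        "train": rows[n_test + n_valid:],
    }
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, part in splits.items():
        with open(out / f"{name}.jsonl", "w", encoding="utf-8") as f:
            for row in part:
                # ensure_ascii=False keeps umlauts readable in the files
                f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return {name: len(part) for name, part in splits.items()}
```

Keeping the template function separate from the file writer is what allows the same data preparation to be reused for a model with a different chat format.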


In [3]:
print(mistral_ft_instruction)


# instruction

You are a text processing agent. As instructed below, extract information from the provided content in a structured format without discussing reasoning or providing commentary. Only use source text given as input for data extraction unless specifically asked for inference.

# user

Analyze content from a German law journal's website. Your task is to identify members of the editorial board (terms to look for: 'Herausgeber', 'Redakteur', 'Schriftleitung') and the advisory board ('Beirat'). For each identified member, extract and organize their information into the following categories: lastname, firstname, title (including academic titles like 'Dr.' or 'Prof. Dr.' and suffixes such as 'LL.M.'), position (their job title, if provided), affiliation, and role. For 'role', infer the role within the journal from the context (options 'Herausgeber', 'Redaktion', 'Schriftleitung', 'Beirat', or an empty string if the role is unknown).

- Format the output as a YAML list of diction

In [4]:
import os
HF_MODEL_PATH = 'mistralai/Mistral-7B-Instruct-v0.2'
LOCAL_MODEL_PATH = f'mlx_models/{HF_MODEL_PATH}'
os.environ['HF_MODEL_PATH'] = HF_MODEL_PATH
os.environ['LOCAL_MODEL_PATH'] = LOCAL_MODEL_PATH
print(f"""
HF_MODEL_PATH={HF_MODEL_PATH}
LOCAL_MODEL_PATH={LOCAL_MODEL_PATH}
""".strip())


HF_MODEL_PATH=mistralai/Mistral-7B-Instruct-v0.2
LOCAL_MODEL_PATH=mlx_models/mistralai/Mistral-7B-Instruct-v0.2


### Create a 4-Bit quantized model if necessary

In [20]:
![ -d "$LOCAL_MODEL_PATH" ] || python convert.py --hf-path "$HF_MODEL_PATH" --mlx-path "$LOCAL_MODEL_PATH" -q
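
The shell one-liner above runs the conversion only when the target directory does not exist yet. The same check can be written in Python, which is easier to extend, for example with logging. The helper name `conversion_command` is made up for this sketch; the `convert.py` flags are the ones used in the cell above.

```python
from pathlib import Path
from typing import List, Optional


def conversion_command(hf_path: str, mlx_path: str) -> Optional[List[str]]:
    """Return the convert.py command, or None if the converted model exists.

    Hypothetical helper mirroring `[ -d "$LOCAL_MODEL_PATH" ] || python convert.py ...`.
    """
    if Path(mlx_path).is_dir():
        return None  # already converted, nothing to do
    return ["python", "convert.py",
            "--hf-path", hf_path,
            "--mlx-path", mlx_path,
            "-q"]  # -q requests quantization, as in the cell above


# Usage (not executed here):
#   import subprocess
#   cmd = conversion_command(HF_MODEL_PATH, LOCAL_MODEL_PATH)
#   if cmd is not None:
#       subprocess.run(cmd, check=True)
```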

### Finetuning

In [None]:
!python lora.py --train \
    --model "$LOCAL_MODEL_PATH" \
    --data data/editors/mistral \
    --adapter-file "$LOCAL_MODEL_PATH/editors.npz" \
    --iters 600 --batch-size 1 --lora-layers 4 

To run the training in a separate shell, print the command and copy it:

In [5]:
print(f"""
cd mlx/lora
python lora.py --train \\
    --model {LOCAL_MODEL_PATH} \\
    --data data/editors/mistral \\
    --adapter-file {LOCAL_MODEL_PATH}/editors.npz \\
    --iters 600 --batch-size 1 --lora-layers 4 
""".strip())

cd mlx/lora
python lora.py --train \
    --model mlx_models/mistralai/Mistral-7B-Instruct-v0.2 \
    --data data/editors/mistral \
    --adapter-file mlx_models/mistralai/Mistral-7B-Instruct-v0.2/editors.npz \
    --iters 600 --batch-size 1 --lora-layers 4


Result: training loss of about 0.8 at roughly 90 tokens/sec.

### Test the model with adapter

In [8]:
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
!python lora.py --test \
    --model mlx_models/mistralai/Mistral-7B-Instruct-v0.2 \
    --data data/editors/mistral \
    --adapter-file mlx_models/mistralai/Mistral-7B-Instruct-v0.2/editors.npz

python(39031) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


Testing
Test loss 0.800, Test ppl 2.226.


Result after 600 iterations: test loss 0.800, test perplexity 2.226.
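
The two reported numbers carry the same information: perplexity is the exponential of the mean per-token cross-entropy loss. A quick check, assuming the loss is measured in nats (natural logarithm):

```python
import math

test_loss = 0.800  # mean per-token cross-entropy reported by the test run
perplexity = math.exp(test_loss)
print(f"{perplexity:.3f}")  # prints 2.226
```

A perplexity of about 2.2 means that, on average, the model is roughly as uncertain as if it were choosing between two or three equally likely tokens at each step.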


### Manual test prompt

In [9]:
prompt=f"""
### SYSTEM
{system_message}
### USER
{instruction}
{example}
### CONTENT
{test_data}
### END OF CONTENT
""".strip()

In [10]:
print(prompt)

### SYSTEM

You are a text processing agent. As instructed below, extract information from the provided content in a structured format without discussing reasoning or providing commentary. Only use source text given as input for data extraction unless specifically asked for inference.

### USER

Analyze content from a German law journal's website. Your task is to identify members of the editorial board (terms to look for: 'Herausgeber', 'Redakteur', 'Schriftleitung') and the advisory board ('Beirat'). For each identified member, extract and organize their information into the following categories: lastname, firstname, title (including academic titles like 'Dr.' or 'Prof. Dr.' and suffixes such as 'LL.M.'), position (their job title, if provided), affiliation, and role. For 'role', infer the role within the journal from the context (options 'Herausgeber', 'Redaktion', 'Schriftleitung', 'Beirat', or an empty string if the role is unknown).

- Format the output as a YAML list of dictionar

In [11]:
import os
import time
os.environ['LLM_PROMPT'] = prompt
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
start_time = time.time()
!python lora.py \
    --model mlx_models/mistralai/Mistral-7B-Instruct-v0.2 \
    --adapter-file mlx_models/mistralai/Mistral-7B-Instruct-v0.2/editors.npz \
    --max-tokens 400 \
    --temp 0 \
    --prompt "$LLM_PROMPT"
print(f'Generation took {time.time() - start_time} seconds')

python(39255) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.




- lastname: Knesebeck
  firstname: Stefan
  title: Prof. Dr.
  position: Universität Wuppertal
  affiliation: Universität Wuppertal
  role: Herausgeber
- lastname: Müller
  firstname: Fritz M.
  title: Prof. Dr. Dr. h.c. LL.M.(Yale)
  position: Universität Wanne-Eickel
  affiliation: Universität Wanne-Eickel
  role: Herausgeber
- lastname: Valentin
  firstname: Vera
  title: Prof. Dr.
  position: Hochschule für Recht und Sport Edingen
  affiliation: Hochschule für Recht und Sport Edingen
  role: Redaktion
- lastname: Rosenbaum
  firstname: Rita
  title: Prof. Dr. Dr. h.c.
  position: Universität Tupfingen
  affiliation: Universität Tupfingen
  role: Herausgeber
- lastname: Gonzalo de Sanchez
  firstname: Ingo
  title: Dr.
  position: Vorsitzender Richter am Oberlandesgericht Rostock
  affiliation: Oberlandesgericht Rostock
  role: Herausgeber
- lastname: Gengelbach
  firstname: Adam
  title: RA
  position: Unterhachingen
  affiliation: Unterhaching
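
The output above is cut off by the 400-token limit, and it shows two recurring errors: the affiliation is copied into `position`, and Vera Valentin is assigned the role `Redaktion` although she is listed under `Herausgeber`. To track such errors across runs, a per-field score is more informative than eyeballing. The following is a minimal sketch. The `field_accuracy` helper and the hand-written expected records are assumptions for illustration; they are not part of the notebook's library.

```python
from typing import Dict, List, Sequence

FIELDS = ("lastname", "firstname", "title", "position", "affiliation", "role")


def field_accuracy(expected: List[Dict[str, str]],
                   predicted: List[Dict[str, str]],
                   fields: Sequence[str] = FIELDS) -> Dict[str, float]:
    """Match records by lastname and return the share of correct values per field."""
    if not expected:
        return {}
    by_name = {p.get("lastname", ""): p for p in predicted}
    scores = {}
    for field in fields:
        hits = sum(
            1 for e in expected
            if by_name.get(e["lastname"], {}).get(field, "") == e.get(field, "")
        )
        scores[field] = hits / len(expected)
    return scores


# Hand-written expected records for two entries of the test data (illustrative).
expected = [
    {"lastname": "Knesebeck", "position": "",
     "affiliation": "Universität Wuppertal", "role": "Herausgeber"},
    {"lastname": "Valentin", "position": "",
     "affiliation": "Hochschule für Recht und Sport Edingen", "role": "Herausgeber"},
]
# The corresponding records as generated by the fine-tuned model above.
predicted = [
    {"lastname": "Knesebeck", "position": "Universität Wuppertal",
     "affiliation": "Universität Wuppertal", "role": "Herausgeber"},
    {"lastname": "Valentin", "position": "Hochschule für Recht und Sport Edingen",
     "affiliation": "Hochschule für Recht und Sport Edingen", "role": "Redaktion"},
]
scores = field_accuracy(expected, predicted, fields=("position", "affiliation", "role"))
print(scores)  # {'position': 0.0, 'affiliation': 1.0, 'role': 0.5}
```

Matching by `lastname` is a simplification that breaks when two board members share a surname; a combined key of last and first name would be more robust.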

## mlx-community/quantized-gemma-7b-it

This model can be downloaded directly from the Hugging Face Hub; no conversion is necessary.

### Zero-shot

In [15]:
from mlx_lm import load, generate
import time

os.environ['TOKENIZERS_PARALLELISM'] = 'false'
prompt = f"""
### instructions
{system_message}
### user
{instruction}
{example}
{epilog}

{test_data}

""".strip()

model, tokenizer = load("mlx-community/quantized-gemma-7b-it")
start_time = time.time()
response = generate(model, tokenizer, prompt=prompt, verbose=False, max_tokens=300, temp=0)
print(response)
print(f'Generation took {time.time() - start_time} seconds')


Fetching 8 files:   0%|          | 0/8 [00:00<?, ?it/s]


Schriftleitung:
Dr. Martin Schmidt, Berlin
Beirat:
Dr. Hans-Peter Kaulitz, Berlin
Dr. Franz-Josef Schmidt, München

```

**Expected Output:**

```yaml
- lastname: Knesebeck
  firstname: Stefan
  title: Prof. Dr.
  position: N/A
  affiliation: Universität Wuppertal
  role: Herausgeber

- lastname: Müller
  firstname: Fritz M.
  title: Prof. Dr. Dr. h.c. LL.M.(Yale)
  position: N/A
  affiliation: Universität Wanne-Eickel
  role: Herausgeber

- lastname: Valentin
  firstname: Vera
  title: RA Prof. Dr.
  position: N/A
  affiliation: Hochschule für Recht und Sport Edingen
  role: N/A

- lastname: Rosenbaum
  firstname: Rita
  title: Prof. Dr. Dr. h.c.
  position: N/A
  affiliation: Universität Tupfingen
  role: N/A

- lastname: Gonzalo de Sanchez
  firstname: Ingo
  title: Dr.
  position: Vorsitzender Richter am Oberlandesgericht Rostock
  affiliation: Oberlandesgericht Rostock
  role: N/A

- lastname: Gengelbach
  firstname: Adam
  title: RA
  position: N/A
Generation took 50.56446218490

### Generate training, testing and validation files

Based on https://gist.github.com/alexweberk/635431b5c5773efd6d1755801020429f

In [24]:
from lib.prepare_training_data import create_training_file

prompt = f"""
# instructions
{system_message}
# user
{instruction}
{epilog}
""".strip()

# Gemma chat layout: a single user turn followed by the model's answer
def template_fn(prompt: str, answer: str):
    return f'<bos><start_of_turn>user\n{prompt}<end_of_turn>\n<start_of_turn>model\n{answer}<end_of_turn><eos>'

create_training_file(instruction=instruction,
                     template_func=template_fn,
                     input_file='data/editors/editors.csv',
                     output_dir='data/editors-gemma',
                     content_dir='data/editors/website-data',
                     max_chars=6000, max_gt_items=5,
                     record_identifier_col="journal_abbr",
                     cols_to_remove=['journal_abbr', 'website', 'retrieved_on'],
                     column_to_filter_by='lastname',
                     lines_before=2, lines_after=2)

Length of generated sequences:
 - max: 5107
 - avg: 1976.0964912280701
Longest sequences:
FoR: 5107
DivRuW: 5097
AfP: 4519
StAZ: 4418
DÖD: 4220
ECFR: 3519
APR: 3445
CB: 3387
AuA: 3317
HRN: 3128
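
The parameters `column_to_filter_by='lastname'`, `lines_before`, `lines_after` and `max_chars` suggest that the website text is reduced to the lines mentioning a known name plus some context before it is truncated. The sketch below shows one plausible reading of these parameters. It is a guess at the behaviour, not the code of `create_training_file`, and the function name `keep_context` is invented for the example.

```python
from typing import Iterable, List


def keep_context(lines: List[str], needles: Iterable[str],
                 lines_before: int = 2, lines_after: int = 2,
                 max_chars: int = 6000) -> str:
    """Keep lines containing any needle plus surrounding context, then truncate.

    Hypothetical helper, comparable to `grep -B2 -A2` followed by a length cap.
    """
    needles = [n for n in needles if n]  # ignore empty search strings
    keep = set()
    for i, line in enumerate(lines):
        if any(n in line for n in needles):
            start = max(0, i - lines_before)
            stop = min(len(lines), i + lines_after + 1)
            keep.update(range(start, stop))
    text = "\n".join(lines[i] for i in sorted(keep))
    return text[:max_chars]
```

A filter of this kind would keep the training sequences short, which matters on a machine with 16 GB of RAM, at the cost of possibly dropping board members who are missing from the ground-truth table.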
