# Finetuning experiment: Extract structured data on German law journal editors from website text

Based on the LoRA example at https://github.com/ml-explore/mlx-examples/tree/main/lora

Hardware: Mac mini 2023 (M2, 16 GB RAM)

## Preparation

### Download website data

This downloads content only for journals that are new in the list or whose previously downloaded files have been deleted. To overwrite existing files, pass `overwrite=True`.

In [2]:
from lib.prepare_training_data import download_input_data
download_input_data(input_file='data/editors.csv', 
                    output_dir='data/website-data', 
                    overwrite=False)

Downloading Content:   0%|          | 0/130 [00:00<?, ?it/s]

Downloaded 0 web pages.
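
The skip-unless-overwrite behaviour described above can be sketched as follows. This is a minimal illustration, not the code in `lib.prepare_training_data`; the helper name `needs_download` and the one-file-per-journal naming are assumptions.

```python
from pathlib import Path
import tempfile


def needs_download(target: Path, overwrite: bool = False) -> bool:
    """Return True if the page should be fetched (hypothetical helper)."""
    # Fetch when the caller forces it or when no local copy exists yet.
    return overwrite or not target.exists()


# Demonstration with a temporary directory standing in for the output dir.
with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "AcP.txt"  # assumed naming scheme: one file per journal
    first_run = needs_download(page)               # no file yet
    page.write_text("cached content", encoding="utf-8")
    second_run = needs_download(page)              # file already present
    forced = needs_download(page, overwrite=True)  # caller forces a refresh

print(first_run, second_run, forced)
```

With a check like this, a run that finds every file already present fetches nothing, which matches the "Downloaded 0 web pages." message above.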


### Prompt and test data for all experiments


In [1]:
system_message = """
You are a text processing agent. As instructed below, extract information from the provided content in a structured format without discussing reasoning or providing commentary. Only use source text given as input for data extraction unless specifically asked for inference.
"""

instruction = """
Analyze content from a German law journal's website. Your task is to identify members of the editorial board (terms to look for: 'Herausgeber', 'Redakteur', 'Schriftleitung') and the advisory board ('Beirat'). For each identified member, extract and organize their information into the following categories: lastname, firstname, title (including academic titles like 'Dr.' or 'Prof. Dr.' and suffixes such as 'LL.M.'), position (their job title, if provided), affiliation, and role. For 'role', infer the role within the journal from the context (options 'Herausgeber', 'Redaktion', 'Schriftleitung', 'Beirat', or an empty string if the role is unknown).

- Format the output as a YAML list of dictionaries.
- Exclude any dictionary entries for which information is not available or relevant fields are empty.
- Ensure the YAML output is strictly valid. It must be a list of dictionaries. 
"""

example = """
Here is an example:

```yaml
- lastname: Mustermann
  firstname: Martina
  title: Dr.
  position: Vorsitzende Richterin
  affiliation: Oberlandesgericht Buxtehude
  role: Herausgeber
```
"""

epilog = """
Adhere to these guidelines to efficiently and accurately process the following content:
"""

test_data = """
Herausgeber:
Prof. Dr. Stefan Knesebeck, Universität Wuppertal
Prof. Dr. Dr. h.c. Fritz M. Müller LL.M.(Yale), Universität Wanne-Eickel
RA Prof. Dr. Vera Valentin, Hochschule für Recht und Sport Edingen
Prof. Dr. Dr. h.c. Rita Rosenbaum, Universität Tupfingen
Dr. Ingo Gonzalo de Sanchez, Vorsitzender Richter am Oberlandesgericht Rostock
Redaktion:
RA Adam Gengelbach, Unterhachingen
Ass. iur. Petra Priem, Herrenchiemsee
"""


## mistralai/Mistral-7B-Instruct-v0.2

### Set paths for model

In [4]:
import os
HF_MODEL_PATH = 'mistralai/Mistral-7B-Instruct-v0.2'
LOCAL_MODEL_PATH = f'mlx_models/{HF_MODEL_PATH}'
os.environ['HF_MODEL_PATH'] = HF_MODEL_PATH
os.environ['LOCAL_MODEL_PATH'] = LOCAL_MODEL_PATH
print(f"""
HF_MODEL_PATH={HF_MODEL_PATH}
LOCAL_MODEL_PATH={LOCAL_MODEL_PATH}
""".strip())


HF_MODEL_PATH=mistralai/Mistral-7B-Instruct-v0.2
LOCAL_MODEL_PATH=mlx_models/mistralai/Mistral-7B-Instruct-v0.2


### Create a 4-Bit quantized model if necessary

In [20]:
![ -d "$LOCAL_MODEL_PATH" ] || python convert.py --hf-path "$HF_MODEL_PATH" --mlx-path "$LOCAL_MODEL_PATH" -q

### Generate training, testing and validation files

In [20]:
from lib.prepare_training_data import create_training_file

mistral_ft_instruction = f"""
# instruction
{system_message}
# user
{instruction}
{epilog}
# content
"""

# the template function receives the instruction, the content to be analyzed, and the expected answer
def template_fn(instruction: str, content: str, answer: str):
    return f'<s>[INST]{instruction}{content}[/INST]{answer}</s>'

create_training_file(instruction=mistral_ft_instruction,
                     template_func=template_fn,
                     input_file='data/editors/editors.csv', 
                     output_dir='data/editors/mistral', 
                     content_dir='data/editors/website-data',
                     max_chars=6000, max_gt_items=5,
                     record_identifier_col="journal_abbr",
                     cols_to_remove = ['journal_abbr', 'website', 'retrieved_on'],
                     column_to_filter_by='lastname',
                     lines_before=2, lines_after=2)

Length of generated sequences:
 - max: 5446
 - avg: 2202.035087719298
Longest sequences:
FoR: 5446
DÖD: 4559
GLJ: 4153
BKK: 4078
AcP: 3960
AuA: 3656
HRN: 3467
DSB: 3433
DivRuW: 3360
AuAS: 3272
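
As a rough sketch of what the generated files look like: the LoRA example from mlx-examples reads `train.jsonl`, `valid.jsonl` and `test.jsonl` from the `--data` directory, with one JSON object holding a `text` key per line. The snippet below renders two made-up records with the same Mistral template and writes them in that shape. The helper `write_jsonl`, the sample records, and the split are illustrative assumptions, not the behaviour of `create_training_file`.

```python
import json
import tempfile
from pathlib import Path


def template_fn(instruction: str, content: str, answer: str) -> str:
    # Same Mistral instruction template as in the cell above.
    return f'<s>[INST]{instruction}{content}[/INST]{answer}</s>'


def write_jsonl(path: Path, texts: list[str]) -> None:
    """Write one {"text": ...} object per line (hypothetical helper)."""
    with path.open("w", encoding="utf-8") as f:
        for text in texts:
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")


# Made-up records: (website content, expected YAML answer).
records = [
    ("Herausgeber: Dr. Martina Mustermann", "- lastname: Mustermann\n  firstname: Martina"),
    ("Redaktion: RA Adam Gengelbach", "- lastname: Gengelbach\n  firstname: Adam"),
]
texts = [template_fn("Extract the editors.\n", c, a) for c, a in records]

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    write_jsonl(out_dir / "train.jsonl", texts[:1])
    write_jsonl(out_dir / "valid.jsonl", texts[1:])
    first_line = (out_dir / "train.jsonl").read_text(encoding="utf-8").splitlines()[0]

loaded = json.loads(first_line)
print(loaded["text"])
```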


In [3]:
print(mistral_ft_instruction)


# instruction

You are a text processing agent. As instructed below, extract information from the provided content in a structured format without discussing reasoning or providing commentary. Only use source text given as input for data extraction unless specifically asked for inference.

# user

Analyze content from a German law journal's website. Your task is to identify members of the editorial board (terms to look for: 'Herausgeber', 'Redakteur', 'Schriftleitung') and the advisory board ('Beirat'). For each identified member, extract and organize their information into the following categories: lastname, firstname, title (including academic titles like 'Dr.' or 'Prof. Dr.' and suffixes such as 'LL.M.'), position (their job title, if provided), affiliation, and role. For 'role', infer the role within the journal from the context (options 'Herausgeber', 'Redaktion', 'Schriftleitung', 'Beirat', or an empty string if the role is unknown).

- Format the output as a YAML list of diction

### Finetuning

In [None]:
!python lora.py --train \
    --model "$LOCAL_MODEL_PATH" \
    --data data/editors/mistral \
    --adapter-file "$LOCAL_MODEL_PATH/editors.npz" \
    --iters 600 --batch-size 1 --lora-layers 4 

To run in a separate shell:

In [5]:
print(f"""
cd mlx/lora
python lora.py --train \\
    --model {LOCAL_MODEL_PATH} \\
    --data data/editors/mistral \\
    --adapter-file {LOCAL_MODEL_PATH}/editors.npz \\
    --iters 600 --batch-size 1 --lora-layers 4 
""".strip())

cd mlx/lora
python lora.py --train \
    --model mlx_models/mistralai/Mistral-7B-Instruct-v0.2 \
    --data data/editors/mistral \
    --adapter-file mlx_models/mistralai/Mistral-7B-Instruct-v0.2/editors.npz \
    --iters 600 --batch-size 1 --lora-layers 4


Training loss: ~0.8, throughput: ~90 tokens/sec

### Testing

In [8]:
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
!python lora.py --test \
    --model mlx_models/mistralai/Mistral-7B-Instruct-v0.2 \
    --data data/editors/mistral \
    --adapter-file mlx_models/mistralai/Mistral-7B-Instruct-v0.2/editors.npz

python(39031) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


Testing
Test loss 0.800, Test ppl 2.226.


Result:
 600 iters: Test loss 0.800, Test ppl 2.226
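
The reported perplexity is consistent with the exponential of the mean per-token cross-entropy loss, which is the usual definition. A quick check, assuming the loss is measured in nats (natural logarithm):

```python
import math

test_loss = 0.800                 # value reported by the test run above
perplexity = math.exp(test_loss)  # perplexity = e^loss
print(f"{perplexity:.3f}")        # prints 2.226
```

Because the logged loss is itself rounded to three decimals, the last digit of a recomputed perplexity may occasionally differ from the logged one.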


### Manual test prompt

In [9]:
prompt=f"""
### SYSTEM
{system_message}
### USER
{instruction}
{example}
### CONTENT
{test_data}
### END OF CONTENT
""".strip()

In [10]:
print(prompt)

### SYSTEM

You are a text processing agent. As instructed below, extract information from the provided content in a structured format without discussing reasoning or providing commentary. Only use source text given as input for data extraction unless specifically asked for inference.

### USER

Analyze content from a German law journal's website. Your task is to identify members of the editorial board (terms to look for: 'Herausgeber', 'Redakteur', 'Schriftleitung') and the advisory board ('Beirat'). For each identified member, extract and organize their information into the following categories: lastname, firstname, title (including academic titles like 'Dr.' or 'Prof. Dr.' and suffixes such as 'LL.M.'), position (their job title, if provided), affiliation, and role. For 'role', infer the role within the journal from the context (options 'Herausgeber', 'Redaktion', 'Schriftleitung', 'Beirat', or an empty string if the role is unknown).

- Format the output as a YAML list of dictionar

In [11]:
import os
import time
os.environ['LLM_PROMPT'] = prompt
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
start_time = time.time()
!python lora.py \
    --model mlx_models/mistralai/Mistral-7B-Instruct-v0.2 \
    --adapter-file mlx_models/mistralai/Mistral-7B-Instruct-v0.2/editors.npz \
    --max-tokens 400 \
    --temp 0 \
    --prompt "$LLM_PROMPT"
print(f'Generation took {time.time() - start_time} seconds')

python(39255) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.




- lastname: Knesebeck
  firstname: Stefan
  title: Prof. Dr.
  position: Universität Wuppertal
  affiliation: Universität Wuppertal
  role: Herausgeber
- lastname: Müller
  firstname: Fritz M.
  title: Prof. Dr. Dr. h.c. LL.M.(Yale)
  position: Universität Wanne-Eickel
  affiliation: Universität Wanne-Eickel
  role: Herausgeber
- lastname: Valentin
  firstname: Vera
  title: Prof. Dr.
  position: Hochschule für Recht und Sport Edingen
  affiliation: Hochschule für Recht und Sport Edingen
  role: Redaktion
- lastname: Rosenbaum
  firstname: Rita
  title: Prof. Dr. Dr. h.c.
  position: Universität Tupfingen
  affiliation: Universität Tupfingen
  role: Herausgeber
- lastname: Gonzalo de Sanchez
  firstname: Ingo
  title: Dr.
  position: Vorsitzender Richter am Oberlandesgericht Rostock
  affiliation: Oberlandesgericht Rostock
  role: Herausgeber
- lastname: Gengelbach
  firstname: Adam
  title: RA
  position: Unterhachingen
  affiliation: Unterhaching
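
In the output above, several records repeat the affiliation in the `position` field even though the source text names no job title for those people. A small post-processing check can flag such records. This is a sketch only: the helper `suspicious_positions` is hypothetical, and the records are copied by hand from the output rather than parsed from YAML.

```python
def suspicious_positions(records: list[dict]) -> list[str]:
    """Return last names of records whose position merely repeats the affiliation."""
    flagged = []
    for rec in records:
        position = rec.get("position", "").strip()
        affiliation = rec.get("affiliation", "").strip()
        if position and position == affiliation:
            flagged.append(rec["lastname"])
    return flagged


# Three records copied from the model output above.
records = [
    {"lastname": "Knesebeck", "position": "Universität Wuppertal",
     "affiliation": "Universität Wuppertal"},
    {"lastname": "Müller", "position": "Universität Wanne-Eickel",
     "affiliation": "Universität Wanne-Eickel"},
    {"lastname": "Gonzalo de Sanchez",
     "position": "Vorsitzender Richter am Oberlandesgericht Rostock",
     "affiliation": "Oberlandesgericht Rostock"},
]

flagged = suspicious_positions(records)
print(flagged)
```

The third record is not flagged because its position is a genuine job title that only contains the affiliation.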

## mlx-community/quantized-gemma-7b-it

This model can be downloaded directly from Hugging Face; no conversion is necessary.

based on https://gist.github.com/alexweberk/635431b5c5773efd6d1755801020429f

### Zero-shot

In [15]:
from mlx_lm import load, generate
import time

os.environ['TOKENIZERS_PARALLELISM'] = 'false'
prompt = f"""
#### Instructions
{system_message}
### User
{instruction}
{example}
{epilog}

{test_data}

""".strip()

model, tokenizer = load("mlx-community/quantized-gemma-7b-it")
start_time = time.time()
response = generate(model, tokenizer, prompt=prompt, verbose=False, max_tokens=300, temp=0)
print(response)
print(f'Generation took {time.time() - start_time} seconds')


Fetching 8 files:   0%|          | 0/8 [00:00<?, ?it/s]


Schriftleitung:
Dr. Martin Schmidt, Berlin
Beirat:
Dr. Hans-Peter Kaulitz, Berlin
Dr. Franz-Josef Schmidt, München

```

**Expected Output:**

```yaml
- lastname: Knesebeck
  firstname: Stefan
  title: Prof. Dr.
  position: N/A
  affiliation: Universität Wuppertal
  role: Herausgeber

- lastname: Müller
  firstname: Fritz M.
  title: Prof. Dr. Dr. h.c. LL.M.(Yale)
  position: N/A
  affiliation: Universität Wanne-Eickel
  role: Herausgeber

- lastname: Valentin
  firstname: Vera
  title: RA Prof. Dr.
  position: N/A
  affiliation: Hochschule für Recht und Sport Edingen
  role: N/A

- lastname: Rosenbaum
  firstname: Rita
  title: Prof. Dr. Dr. h.c.
  position: N/A
  affiliation: Universität Tupfingen
  role: N/A

- lastname: Gonzalo de Sanchez
  firstname: Ingo
  title: Dr.
  position: Vorsitzender Richter am Oberlandesgericht Rostock
  affiliation: Oberlandesgericht Rostock
  role: N/A

- lastname: Gengelbach
  firstname: Adam
  title: RA
  position: N/A
Generation took 50.56446218490

### Generate dataset



In [21]:
from lib.prepare_training_data import create_training_file

gemma_instruction = f"""
# Instructions
{system_message}
# User
{instruction}
{epilog}
""".strip()

def template_fn(instruction: str, content: str, answer: str):
    return f'<bos><start_of_turn>user\n{instruction}\n\n{content}<end_of_turn>\n<start_of_turn>model\n{answer}<end_of_turn><eos>'

create_training_file(instruction=gemma_instruction,
                     template_func=template_fn,
                     input_file='data/editors/editors.csv',
                     output_dir='data/editors/gemma',
                     content_dir='data/editors/website-data',
                     max_chars=6000, max_gt_items=5,
                     record_identifier_col="journal_abbr",
                     cols_to_remove=['journal_abbr', 'website', 'retrieved_on'],
                     column_to_filter_by='lastname',
                     lines_before=2, lines_after=2)

Length of generated sequences:
 - max: 4816
 - avg: 2273.424778761062
Longest sequences:
JurBüro: 4816
AusR: 4697
AcP: 4010
AW-Prax: 3880
DÖD: 3870
DivRuW: 3867
AuA: 3706
StAZ: 3601
ANA-ZAR: 3361
AuAS: 3322
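
At inference time the prompt should presumably follow the same turn structure as the training examples, ending with the opening of the model turn so that the model continues with the answer. A minimal sketch, where `inference_prompt` is a hypothetical helper and the explicit `<bos>` is an assumption (if the tokenizer adds it automatically, it should be omitted here):

```python
def template_fn(instruction: str, content: str, answer: str) -> str:
    # Training template from the cell above.
    return (f'<bos><start_of_turn>user\n{instruction}\n\n{content}<end_of_turn>\n'
            f'<start_of_turn>model\n{answer}<end_of_turn><eos>')


def inference_prompt(instruction: str, content: str) -> str:
    """Same layout as template_fn, cut off where the answer would start."""
    return (f'<bos><start_of_turn>user\n{instruction}\n\n{content}<end_of_turn>\n'
            f'<start_of_turn>model\n')


# Placeholder strings for illustration only.
full = template_fn("Extract the editors.", "Herausgeber: ...", "- lastname: ...")
prompt_only = inference_prompt("Extract the editors.", "Herausgeber: ...")
print(prompt_only)
```

Keeping both functions side by side makes it easy to verify that every training example starts with exactly the text the model sees at inference time.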


### Finetuning

In [4]:
from mlx_lm.utils import get_model_path
import os
os.environ['MODEL_NAME'] = model_name = 'mlx-community/quantized-gemma-7b-it'

 
!python -m mlx_lm.lora \
    --model "$MODEL_NAME" \
    --adapter-file "editors.npz" \
    --train \
    --iters 600 --batch-size 1 --lora-layers 4 \
    --data data/editors/gemma

Loading pretrained model
Fetching 8 files: 100%|██████████████████████████| 8/8 [00:00<00:00, 389.29it/s]
Total parameters 1998.171M
Trainable parameters 0.459M
Loading datasets
Training
Starting training..., iters: 600
Iter 1: Val loss 6.064, Val took 144.925s
Iter 10: Train loss 5.106, Learning Rate 1.000e-05, It/sec 0.085, Tokens/sec 60.820, Trained Tokens 7127
Iter 20: Train loss 4.203, Learning Rate 1.000e-05, It/sec 0.063, Tokens/sec 49.259, Trained Tokens 14895
Iter 30: Train loss 3.695, Learning Rate 1.000e-05, It/sec 0.086, Tokens/sec 51.924, Trained Tokens 20904
Iter 40: Train loss 3.638, Learning Rate 1.000e-05, It/sec 0.125, Tokens/sec 60.230, Trained Tokens 25739
Iter 50: Train loss 3.091, Learning Rate 1.000e-05, It/sec 0.077, Tokens/sec 51.838, Trained Tokens 32501
Iter 60: Train loss 2.691, Learning Rate 1.000e-05, It/sec 0.086, Tokens/sec 56.203, Trained Tokens 39053
Iter 70: Train loss 2.525, Learning Rate 1.000e-05, It/sec 0.061, Tokens/sec 45.969, Trai

Iter 600: Val loss 1.367, Val took 137.833s
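
To compare runs, the loss values can be pulled out of the training log with a regular expression. The sketch below assumes the log format shown above; `parse_train_loss` is a hypothetical helper, and the sample lines are copied from the output.

```python
import re

# Matches lines such as "Iter 10: Train loss 5.106, ..."
TRAIN_RE = re.compile(r"Iter (\d+): Train loss ([\d.]+)")


def parse_train_loss(log: str) -> list[tuple[int, float]]:
    """Return (iteration, train loss) pairs found in the log text."""
    return [(int(it), float(loss)) for it, loss in TRAIN_RE.findall(log)]


log = """
Iter 1: Val loss 6.064, Val took 144.925s
Iter 10: Train loss 5.106, Learning Rate 1.000e-05, It/sec 0.085, Tokens/sec 60.820, Trained Tokens 7127
Iter 20: Train loss 4.203, Learning Rate 1.000e-05, It/sec 0.063, Tokens/sec 49.259, Trained Tokens 14895
"""

losses = parse_train_loss(log)
print(losses)
```

Validation lines are skipped on purpose, since the pattern only matches `Train loss` entries.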

### Testing

In [7]:
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
os.environ['MODEL_PATH'] = str(get_model_path(model_name))
!python -m mlx_lm.lora \
    --model "$MODEL_NAME" \
    --adapter-file "editors.npz" \
    --data data/editors/gemma \
    --test

Fetching 8 files:   0%|          | 0/8 [00:00<?, ?it/s]

Loading pretrained model
Fetching 8 files: 100%|████████████████████████| 8/8 [00:00<00:00, 42799.02it/s]
Total parameters 1999.547M
Trainable parameters 1.835M
Loading datasets
Testing
Test loss 1.395, Test ppl 4.035.


Test loss 1.395, Test ppl 4.035.
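
The trainable-parameter counts in the two logs differ (0.459M during training, 1.835M during testing). One plausible explanation, offered as an assumption rather than something verified against the script: the adapters have rank 8 and are attached to the query and value projections, training used `--lora-layers 4`, and the test command, which does not pass `--lora-layers`, fell back to a default of 16 layers. With Gemma 7B dimensions (hidden size 3072, 16 attention heads of dimension 256, also assumed here), the arithmetic works out as follows:

```python
def lora_params(in_dim: int, out_dim: int, rank: int) -> int:
    """Parameters of one LoRA adapter: A is (in_dim x rank), B is (rank x out_dim)."""
    return rank * (in_dim + out_dim)


# Assumed values, see the paragraph above.
hidden = 3072
proj_out = 16 * 256      # heads * head_dim, for both q_proj and v_proj
rank = 8
per_layer = 2 * lora_params(hidden, proj_out, rank)  # q_proj + v_proj

train_count = 4 * per_layer    # --lora-layers 4
test_count = 16 * per_layer    # assumed default of 16 layers

print(train_count / 1e6, test_count / 1e6)
```

Under these assumptions the counts come to 458,752 and 1,835,008, which round to the two logged figures. If this reading is right, passing `--lora-layers 4` to the test command as well would keep the two runs consistent.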

In [None]:
# Load the fine-tuned model with LoRA weights
model_lora, _ = load(
    "mlx-community/quantized-gemma-7b-it",
    adapter_file="./editors.npz",  # LoRA weights written by the training run above
)