# Fine-tuning experiment: Extracting structured data on German law journal editors from website text

Based on https://github.com/ml-explore/mlx-examples/tree/main/lora

Hardware: Mac mini 2023 (M2, 16 GB RAM)

## Create training data 

### Download website data

This step downloads content only for journals that are new in the list or whose previously downloaded files have been deleted. To overwrite existing files, pass `overwrite=True`.

In [2]:
from lib.prepare_training_data import download_input_data
download_input_data(input_file='data/editors.csv', 
                    output_dir='data/website-data', 
                    overwrite=False)

Downloading Content:   0%|          | 0/130 [00:00<?, ?it/s]

Downloaded 0 web pages.
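
The skip-unless-overwrite behaviour described above can be sketched as follows. This is not the code of `download_input_data`, whose implementation is not shown in this notebook. The function name `download_pages`, the `journal_abbr` and `website` column names (borrowed from the training-file cell further down), the `.html` suffix and the injectable `fetch` callable are all assumptions made for illustration.

```python
import csv
import urllib.request
from pathlib import Path
from typing import Callable


def fetch_url(url: str) -> str:
    """Fetch a page over HTTP and decode it as UTF-8 (illustrative default)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def download_pages(input_file: str, output_dir: str, overwrite: bool = False,
                   fetch: Callable[[str], str] = fetch_url) -> int:
    """Download one page per journal, skipping files that already exist.

    Hypothetical sketch: assumes 'journal_abbr' and 'website' columns in the CSV.
    Returns the number of pages actually downloaded.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    downloaded = 0
    seen = set()
    with open(input_file, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            abbr, url = row["journal_abbr"], row["website"]
            if abbr in seen:
                continue  # several rows may belong to the same journal
            seen.add(abbr)
            target = out / f"{abbr}.html"
            if target.exists() and not overwrite:
                continue  # already downloaded, leave the file untouched
            target.write_text(fetch(url), encoding="utf-8")
            downloaded += 1
    return downloaded
```

With this logic, a second call on an unchanged list returns 0, which is consistent with the "Downloaded 0 web pages." message above.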


### Instruction prompt


In [21]:
system_message = """
You are a text processing agent. As instructed below, extract information from the provided content in a structured format without discussing reasoning or providing commentary. Only use source text given as input for data extraction unless specifically asked for inference.
"""

instruction = """
Analyze the content from a German law journal's website which follows this instruction. Your task is to identify members of the editorial board (terms to look for: 'Herausgeber', 'Redakteur', 'Schriftleitung') and the advisory board ('Beirat'). For each identified member, extract and organize their information into the following categories: lastname, firstname, title (including academic titles like 'Dr.' or 'Prof. Dr.' and suffixes such as 'LL.M.'), position (their job title, if provided), affiliation, and role. For 'role', infer the role within the journal from the context (options 'Herausgeber', 'Redaktion', 'Schriftleitung', 'Beirat', or an empty string if the role is unknown).

- Format the output as a YAML list of dictionaries.
- Exclude any dictionary entries for which information is not available or relevant fields are empty.
- Ensure the YAML output is strictly valid.

Adhere to these guidelines to efficiently and accurately process the following content:
""".strip()

### Generate training, testing and validation files

In [2]:
from lib.prepare_training_data import create_training_file

def template_fn(prompt: str, answer: str):
    return f'<s>[INST]{prompt}[/INST]{answer}</s>'

create_training_file(instruction=instruction,
                     template_func=template_fn,
                     input_file='data/editors/editors.csv', 
                     output_dir='data/editors', 
                     content_dir='data/editors/website-data',
                     max_chars=6000, max_gt_items=5,
                     record_identifier_col="journal_abbr",
                     cols_to_remove = ['journal_abbr', 'website', 'retrieved_on'],
                     column_to_filter_by='lastname',
                     lines_before=2, lines_after=2)

Sequence length:
 - max: 4730
 - avg: 1631.2719298245613
Longest sequences:
AfkKR: 4730
StAZ: 3493
BB: 3419
DÖD: 3236
ECFR: 3164
AfP: 3157
AuA: 3074
HRN: 2885
DivRuW: 2778
DJZ: 2750
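
The `lora.py` script from mlx-examples reads `train.jsonl`, `valid.jsonl` and `test.jsonl` from the `--data` directory, with one JSON object per line carrying a `text` key. The sketch below shows how templated prompt/answer pairs could be split and written in that format. `create_training_file` itself is not reproduced here; the helper `write_splits`, the 80/10/10 split and the fixed seed are assumptions for illustration.

```python
import json
import random
from pathlib import Path


def template_fn(prompt: str, answer: str) -> str:
    # Same Mistral-style instruction template as in the cell above
    return f'<s>[INST]{prompt}[/INST]{answer}</s>'


def write_splits(pairs, output_dir, seed=42):
    """Shuffle (prompt, answer) pairs and write train/valid/test JSONL files.

    Hypothetical helper: the 80/10/10 split is an assumption, not taken from the notebook.
    Returns the number of records written per split.
    """
    records = [{"text": template_fn(p, a)} for p, a in pairs]
    random.Random(seed).shuffle(records)
    n_train = int(len(records) * 0.8)
    n_valid = int(len(records) * 0.1)
    splits = {
        "train": records[:n_train],
        "valid": records[n_train:n_train + n_valid],
        "test": records[n_train + n_valid:],
    }
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, rows in splits.items():
        with open(out / f"{name}.jsonl", "w", encoding="utf-8") as f:
            for row in rows:
                # ensure_ascii=False keeps umlauts readable in the file
                f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return {name: len(rows) for name, rows in splits.items()}
```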


## mistralai/Mistral-7B-Instruct-v0.2

In [7]:
import os
HF_MODEL_PATH = 'mistralai/Mistral-7B-Instruct-v0.2'
LOCAL_MODEL_PATH = f'mlx_models/{HF_MODEL_PATH}'
os.environ['HF_MODEL_PATH'] = HF_MODEL_PATH
os.environ['LOCAL_MODEL_PATH'] = LOCAL_MODEL_PATH
print(f"""
HF_MODEL_PATH={HF_MODEL_PATH}
LOCAL_MODEL_PATH={LOCAL_MODEL_PATH}
""".strip())


HF_MODEL_PATH=mistralai/Mistral-7B-Instruct-v0.2
LOCAL_MODEL_PATH=mlx_models/mistralai/Mistral-7B-Instruct-v0.2


### Create a 4-bit quantized model

In [None]:
!python convert.py --hf-path "$HF_MODEL_PATH" --mlx-path "$LOCAL_MODEL_PATH" -q
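
A rough estimate shows why quantization matters on a 16 GB machine. The sketch below compares weight storage at 16 bits and at 4 bits per parameter. The parameter count of about 7.24 billion and the overhead of one 16-bit scale plus one 16-bit bias per group of 64 weights are assumptions (the latter is believed to match the MLX default group size). Activations, the KV cache and optimizer state are ignored.

```python
def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 7.24e9                    # assumed parameter count for Mistral 7B
GROUP_SIZE = 64                      # assumed quantization group size
OVERHEAD_BITS = 2 * 16 / GROUP_SIZE  # fp16 scale + bias shared by each group

fp16 = weight_gib(N_PARAMS, 16)
q4 = weight_gib(N_PARAMS, 4 + OVERHEAD_BITS)
print(f"fp16: {fp16:.1f} GiB, 4-bit: {q4:.1f} GiB")
```

Under these assumptions the weights shrink from roughly 13.5 GiB to roughly 3.8 GiB, which leaves headroom for training on the hardware listed at the top.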

### Create adapter with fine-tuned weights

In [None]:
!python lora.py --train \
    --model "$LOCAL_MODEL_PATH" \
    --data data/editors \
    --adapter-file "$LOCAL_MODEL_PATH/editors.npz" \
    --iters 600 --batch-size 1 --lora-layers 4 

To run the training in a separate shell instead, print the command with the model paths filled in:

In [11]:
print(f"""
cd mlx/lora
python lora.py --train \\
    --model {LOCAL_MODEL_PATH} \\
    --data data/editors \\
    --adapter-file {LOCAL_MODEL_PATH}/editors.npz \\
    --iters 600 --batch-size 1 --lora-layers 4 
""".strip())

cd mlx/lora
python lora.py --train \
    --model mlx_models/mistralai/Mistral-7B-Instruct-v0.2 \
    --data data/editors \
    --adapter-file mlx_models/mistralai/Mistral-7B-Instruct-v0.2/editors.npz \
    --iters 600 --batch-size 1 --lora-layers 4
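
To get a feel for how small the adapter is, the sketch below counts the trainable parameters that LoRA adds for `--lora-layers 4`. It assumes rank-8 adapters on the query and value projections of each selected block, which is how the mlx-examples script appears to be set up by default, and Mistral 7B's dimensions (hidden size 4096, 8 key/value heads of 128 dimensions). Both should be verified against the version in use.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A LoRA adapter adds two matrices: (d_in x rank) and (rank x d_out)."""
    return rank * (d_in + d_out)


HIDDEN = 4096      # Mistral 7B hidden size
KV_DIM = 1024      # 8 key/value heads x 128 dims (grouped-query attention)
RANK = 8           # assumed LoRA rank
LORA_LAYERS = 4    # value passed as --lora-layers

# Query projection maps HIDDEN -> HIDDEN, value projection maps HIDDEN -> KV_DIM
per_layer = lora_params(HIDDEN, HIDDEN, RANK) + lora_params(HIDDEN, KV_DIM, RANK)
total = LORA_LAYERS * per_layer
print(f"{total:,} trainable parameters ({total / 1e6:.3f}M)")
```

Under these assumptions only about 0.43 million parameters are trained, a tiny fraction of the roughly 7 billion in the base model.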


### Test the model with adapter

In [13]:
!python lora.py --test \
    --model mlx_models/mistralai/Mistral-7B-Instruct-v0.2 \
    --data data/editors \
    --adapter-file mlx_models/mistralai/Mistral-7B-Instruct-v0.2/editors.npz

Testing
Test loss 0.928, Test ppl 2.529.


Last result: Test loss 0.928, Test ppl 2.529.
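
The reported perplexity is the exponential of the average per-token cross-entropy loss, so the two numbers can be cross-checked. The sketch assumes the loss is measured in nats (natural logarithm).

```python
import math

test_loss = 0.928                 # value reported by the test run above
perplexity = math.exp(test_loss)  # ppl = exp(mean negative log-likelihood)
print(f"{perplexity:.3f}")        # prints 2.529
```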

### Prompt it with an example

In [25]:
prompt=f"""
{system_message}

{instruction}

### CONTENT

Herausgeber:
Prof. Dr. Stefan Knesebeck, Universität Wuppertal
Prof. Dr. Dr. h.c. Fritz M. Müller LL.M.(Yale), Universität Wanne-Eickel
RA Prof. Dr. Vera Valentin, Hochschule für Recht und Sport Edingen
Prof. Dr. Dr. h.c. Rita Rosenbaum, Universität Tupfingen
Dr. Ingo Gonzalo de Sanchez, Vorsitzender Richter am Oberlandesgericht Rostock
Redaktion:
RA Adam Gengelbach, Unterhachingen
Ass. iur. Petra Priem, Herrenchiemsee

### END OF CONTENT
""".strip()

In [26]:
print(prompt)

You are a text processing agent. As instructed below, extract information from the provided content in a structured format without discussing reasoning or providing commentary. Only use source text given as input for data extraction unless specifically asked for inference.


Analyze the content from a German law journal's website which follows this instruction. Your task is to identify members of the editorial board (terms to look for: 'Herausgeber', 'Redakteur', 'Schriftleitung') and the advisory board ('Beirat'). For each identified member, extract and organize their information into the following categories: lastname, firstname, title (including academic titles like 'Dr.' or 'Prof. Dr.' and suffixes such as 'LL.M.'), position (their job title, if provided), affiliation, and role. For 'role', infer the role within the journal from the context (options 'Herausgeber', 'Redaktion', 'Schriftleitung', 'Beirat', or an empty string if the role is unknown).

- Format the output as a YAML lis

In [27]:
import os
import time
os.environ['LLM_PROMPT'] = prompt
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
start_time = time.time()
!python lora.py \
    --model mlx_models/mistralai/Mistral-7B-Instruct-v0.2 \
    --adapter-file mlx_models/mistralai/Mistral-7B-Instruct-v0.2/editors.npz \
    --max-tokens 400 \
    --temp 0 \
    --prompt "$LLM_PROMPT"
print(f'Generation took {time.time() - start_time} seconds')



### OUTPUT

- lastname: Knesebeck
  firstname: Stefan
  title: Prof. Dr.
  position: Herausgeber
  affiliation: Universität Wuppertal
  role: Herausgeber
- lastname: Müller
  firstname: Fritz M.
  title: Prof. Dr. Dr. h.c.
  position: Herausgeber
  affiliation: Universität Wanne-Eickel
  role: Herausgeber
- lastname: Valentin
  firstname: Vera
  title: RA Prof. Dr.
  position: Redaktion
  affiliation: Hochschule für Recht und Sport Edingen
  role: Redaktion
- lastname: Rosenbaum
  firstname: Rita
  title: Prof. Dr. Dr. h.c.
  position: Herausgeber
  affiliation: Universität Tupfingen
  role: Herausgeber
- lastname: Gonzalo de Sanchez
  firstname: Ingo
  title: Dr.
  position: Vorsitzender Richter am Oberlandesgericht Rostock
  affiliation: Empty
  role: Empty

Generation took 65.16657900810242 seconds
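
A quick way to see which people from the sample content made it into the output is to compare last names. The sketch below hard-codes the seven last names from the example prompt and the `lastname` values from the output above. The helper name `missing_names` is made up for illustration.

```python
expected = ["Knesebeck", "Müller", "Valentin", "Rosenbaum",
            "Gonzalo de Sanchez", "Gengelbach", "Priem"]
extracted = ["Knesebeck", "Müller", "Valentin", "Rosenbaum", "Gonzalo de Sanchez"]


def missing_names(expected, extracted):
    """Return the expected last names that are absent from the extracted list, in order."""
    found = set(extracted)
    return [name for name in expected if name not in found]


missing = missing_names(expected, extracted)
recall = 1 - len(missing) / len(expected)
print(missing, f"recall={recall:.2f}")
```

Here the two people listed under "Redaktion" in the sample content, Gengelbach and Priem, do not appear in the output.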


## mlx-community/quantized-gemma-7b-it

The same prompt, run through `mlx_lm` against a quantized Gemma model without an adapter:

In [15]:
import os
import time

# Disable tokenizers parallelism before the tokenizer is loaded
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/quantized-gemma-7b-it")
start_time = time.time()
response = generate(model, tokenizer, prompt=prompt, verbose=False, max_tokens=300)
print(response)
print(f'Generation took {time.time() - start_time} seconds')


Fetching 8 files:   0%|          | 0/8 [00:00<?, ?it/s]

```yaml
- lastname: Knesebeck
  firstname: Stefan
  title: Prof. Dr.
  position:
  affiliation: Universität Wuppertal
  role:Herausgeber

- lastname: Müller
  firstname: Dr. Dr. h.c. Fritz M.
  title: Prof. Dr. LL.M.(Yale)
  position:
  affiliation: Universität Wanne-Eickel
  role:Herausgeber

- lastname: Valentin
  firstname: RA Prof. Dr. Vera
  title: Prof. Dr.
  position:
  affiliation: Hochschule für Recht und Sport Edingen
  role:Herausgeber

- lastname: Rosenbaum
  firstname: Prof. Dr. Dr. h.c. Rita
  title: Prof. Dr.
  position:
  affiliation: Universität Tupfingen
  role:Herausgeber

- lastname: Gonzalo de Sanchez
  firstname: Dr. Ingo
  title:
  position: Vorsitzender Richter am Oberlandesgericht Rostock
  role:

- lastname: Gengelbach
  firstname: RA Adam
  title:
  position:
  affiliation: Unterhachingen
  role:

- lastname: Priem
  firstname: Ass. iur. Petra
  title:
  position:
  affiliation: Herrenchiemsee
  role:
```

This is the requested output. Please extract the requ
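
Unlike the fine-tuned model, this response wraps the YAML in a Markdown code fence and appends commentary, so it needs post-processing before parsing. A minimal sketch using only the standard library follows. `extract_fenced_block` is a hypothetical helper, and the fence marker is built with string multiplication only to keep literal backtick runs out of the example.

```python
import re

FENCE = "`" * 3  # three backticks


def extract_fenced_block(response: str) -> str:
    """Return the body of the first fenced block, or the stripped response if none is found."""
    pattern = re.escape(FENCE) + r"[\w-]*\n(.*?)" + re.escape(FENCE)
    match = re.search(pattern, response, flags=re.DOTALL)
    return match.group(1).strip() if match else response.strip()


sample = (
    f"{FENCE}yaml\n"
    "- lastname: Priem\n"
    "  firstname: Ass. iur. Petra\n"
    f"{FENCE}\n\n"
    "This is the requested output."
)
print(extract_fenced_block(sample))
```

The extracted text still has to pass a YAML parser. Lines such as `role:Herausgeber` in the response above lack the space after the colon that YAML requires for a key-value pair, so a strict parser is likely to reject them.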