# Finetuning experiment: Extract structured data for German law journal editors from website text

Based on https://github.com/ml-explore/mlx-examples/tree/main/lora

Hardware: Mac mini 2023 (M2, 16 GB RAM)

## Create training data 

### Download website data

This only downloads new content if the list of journals has changed or previously downloaded files have been deleted. To overwrite existing files, pass `overwrite=True`.

In [2]:
from lib.prepare_training_data import download_input_data
download_input_data(input_file='data/editors.csv', 
                    output_dir='data/website-data', 
                    overwrite=False)

Downloading Content:   0%|          | 0/130 [00:00<?, ?it/s]

Downloaded 0 web pages.
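
The skip-unless-missing behaviour described above can be sketched as follows. This is a minimal illustration, not the code in `lib.prepare_training_data`: the helper `files_to_download` and the one-file-per-journal naming scheme (`<abbr>.txt`) are assumptions made for the example.

```python
from pathlib import Path
import tempfile

def files_to_download(journal_abbrs, output_dir, overwrite=False):
    """Return the journals whose page still has to be fetched.

    Hypothetical helper: assumes one '<abbr>.txt' file per journal.
    """
    output_dir = Path(output_dir)
    pending = []
    for abbr in journal_abbrs:
        target = output_dir / f"{abbr}.txt"
        # Fetch when forced, or when no cached copy exists yet
        if overwrite or not target.exists():
            pending.append(abbr)
    return pending

# Demo with a temporary directory that already holds one cached page
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "AcP.txt").write_text("cached page")
    todo_default = files_to_download(["AcP", "JZ"], tmp)
    todo_forced = files_to_download(["AcP", "JZ"], tmp, overwrite=True)
    print(todo_default)
    print(todo_forced)
```

When every page is already cached, the pending list is empty, which would be consistent with the "Downloaded 0 web pages." message above.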


### Create the training file

In [1]:
from lib.prepare_training_data import create_training_file

instruction = "Below is the content of a website of a German law journal. For each member of the editorial board or the advisory board, extract the following information: lastname, firstname, title, position, affiliation, role. Return as a YAML list of dictionaries. Omit keys that you cannot find information for."

create_training_file(instruction=instruction,
                     input_file='data/editors.csv', 
                     output_dir='data', 
                     website_dir='data/website-data',
                     max_chars=6000, max_gt_items=5,
                     cols_to_remove = ['journal_abbr', 'website', 'retrieved_on'],
                     column_to_filter_by='lastname',
                     lines_before=2, lines_after=2)
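
For orientation, the LoRA example script reads JSON Lines files (`train.jsonl`, `valid.jsonl`, `test.jsonl`) in which every line is an object with a single `text` key. The sketch below shows what one such line could look like. The helper `make_record` and the section markers are assumptions borrowed from the prompt used later in this notebook; the actual layout produced by `create_training_file` may differ.

```python
import json

def make_record(instruction, content, answer_yaml):
    """Build one JSONL line in the {"text": ...} shape.

    Hypothetical helper: the section markers are an assumption.
    """
    text = (
        f"### INSTRUCTION\n{instruction}\n\n"
        f"### CONTENT\n{content}\n\n"
        f"### ANSWER\n{answer_yaml}"
    )
    # ensure_ascii=False keeps umlauts readable in the file
    return json.dumps({"text": text}, ensure_ascii=False)

line = make_record(
    instruction="Extract the editors as a YAML list of dictionaries.",
    content="Herausgeber:\nProf. Dr. Stefan Knesebeck, Universität Wuppertal",
    answer_yaml="- lastname: Knesebeck\n  firstname: Stefan\n  title: Prof. Dr.",
)
print(line)
```

Because `json.dumps` escapes the newlines, each training example stays on a single physical line, which is what the JSON Lines format requires.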

## mistralai/Mistral-7B-v0.1

### Create a 4-Bit quantized model

In [None]:
!python convert.py --hf-path mistralai/Mistral-7B-v0.1 -q

### Train finetuned adapter 

In [4]:
!python lora.py --train \
    --model mlx_model/Mistral-7B-v0.1 \
    --data data/editors \
    --adapter-file mlx_model/Mistral-7B-v0.1/editors.npz \
    --iters 600 --batch-size 1 --lora-layers 4 

Loading pretrained model
Total parameters 1242.763M
Trainable parameters 0.426M
Loading datasets
Training
^C
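
The reported 0.426M trainable parameters can be reproduced by hand. The sketch assumes that the script attaches rank-8 LoRA adapters to the query and value projections of the last `--lora-layers` transformer blocks, and that Mistral-7B uses a hidden size of 4096 with 8 key/value heads of dimension 128. None of these values are stated in this notebook, so treat them as assumptions.

```python
def lora_trainable_params(num_layers, rank=8, hidden=4096, kv_dim=1024):
    """Count LoRA parameters, assuming adapters on q_proj and v_proj only."""
    # Each adapter adds two matrices: (in_dim x rank) and (rank x out_dim)
    q_proj = rank * (hidden + hidden)   # 4096 -> 4096
    v_proj = rank * (hidden + kv_dim)   # 4096 -> 1024 (8 KV heads x 128)
    return num_layers * (q_proj + v_proj)

params_4 = lora_trainable_params(4)
params_16 = lora_trainable_params(16)
print(f"{params_4 / 1e6:.3f}M")
print(f"{params_16 / 1e6:.3f}M")
```

Under these assumptions, 4 layers give 425,984 parameters (about 0.426M). The same arithmetic with 16 layers gives about 1.704M, which matches the "Trainable parameters" line of the test run, where `--lora-layers` was not passed.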


### Test the model with adapter

In [6]:
!python lora.py --test \
    --model mlx_model/Mistral-7B-v0.1 \
    --data data/editors \
    --adapter-file mlx_model/Mistral-7B-v0.1/editors.npz 




Loading pretrained model
Total parameters 1244.041M
Trainable parameters 1.704M
Loading datasets
Testing
Test loss 1.118, Test ppl 3.059.


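Last result: Test loss 1.118, Test ppl 3.059.

ChatGPT tells me that 'the "Test ppl 3.059" indicates that the model has a good performance in predicting the next token in the sequence, given the context of previous tokens, with a relatively low level of uncertainty in its predictions.' Perplexity is the exponential of the mean per-token loss, so a value of about 3 means the model is, on average, roughly as uncertain as if it had to pick between three equally likely next tokens. It measures next-token prediction on the test split only, not whether the extracted fields are correct.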
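
The reported loss and perplexity can be cross-checked with one line of arithmetic. The sketch assumes the loss is the mean per-token cross-entropy in nats, which is how such training scripts typically report it.

```python
import math

def perplexity(mean_token_loss):
    """Perplexity is exp of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_token_loss)

ppl = perplexity(1.118)
print(f"{ppl:.3f}")
```

The result agrees with the reported value to three decimals; small deviations are expected in general because the printed loss is itself rounded.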

### Prompt it with an example

In [13]:
prompt="""
"### INSTRUCTION 
You are a data provider, not a chatbot. Your role is to extract information from documents in a structured format. You don't talk about your reasoning or provide any other commentary. It is imperative that you return only the extracted information in the requested format. Only include what is definitively in the document. Do not invent anything.

Below is the content of a website of a German law journal. For each member of the editorial board (which can be "Herausgeber" or "Redakteur" or "Schriftleitung" in German) or the advisory board ("Beirat"), extract the following information: lastname, firstname, title, position, affiliation, role. `title` means the academic title, such as 'Dr.' or 'Prof. Dr.'. Merge title suffixes such as "LL.M." to the `title` field. `role` is either 'Herausgeber', 'Redaktion', 'Schriftleitung', 'Beirat' or is not set if unknown. 

Return as a YAML list of dictionaries. Omit keys that you cannot find information for. Return only strictly valid YAML.

### CONTENT

Herausgeber:
Prof. Dr. Stefan Knesebeck, Universität Wuppertal
Prof. Dr. Dr. h.c. Fritz M. Müller LL.M.(Yale), Universität Wanne-Eickel
RA Prof. Dr. Vera Valentin, Hochschule für Recht und Sport Edingen
Prof. Dr. Dr. h.c. Rita Rosenbaum, Universität Tupfingen
Dr. Ingo Gonzalo de Sanchez, Vorsitzender Richter am Oberlandesgericht Rostock
Redaktion:
RA Adam Gengelbach, Unterhachingen
Ass. iur. Petra Priem, Herrenchiemsee

### ANSWER
"""

In [17]:
import os
import time
os.environ['LLM_PROMPT'] = prompt
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
start_time = time.time()
!python lora.py \
    --model mlx_model/Mistral-7B-v0.1 \
    --data data/editors \
    --adapter-file mlx_model/Mistral-7B-v0.1/editors.npz \
    --max-tokens 400 \
    --temp 0 \
    --prompt "$LLM_PROMPT"
print(f'Generation took {time.time() - start_time} seconds')

- lastname: Knesebeck
  firstname: Stefan
  title: Prof. Dr.
  position: Universität Wuppertal
  affiliation: Herausgeber
  role: Herausgeber
- lastname: Müller
  firstname: Fritz M.
  title: Prof. Dr. Dr. h.c.
  position: Universität Wanne-Eickel
  affiliation: Herausgeber
  role: Herausgeber
- lastname: Rosenbaum
  firstname: Rita
  title: Prof. Dr. Dr. h.c.
  position: Universität Tupfingen
  affiliation: Herausgeber
  role: Herausgeber
- lastname: Sanchez
  firstname: Gonzalo de
  title: Dr.
  position: Vorsitzender Richter am Oberlandesgericht Rostock
  affiliation: Herausgeber
  role: Herausgeber
- lastname: Gengelbach
  firstname: Adam
  position: Redaktion
  affiliation: Redaktion
  role: Redaktion
- lastname: Priem
  firstname: Petra
  position: Ass. iur.
  affiliation: Redaktion
  role: Redaktion
  "

import os
import sys
import json
import yaml
import re
import ast
import astunparse
import astor
import astorc
import astunparse
im
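
A quick way to spot problems in output like the above is to run a few sanity checks over the parsed records. The sketch below assumes the YAML has already been parsed into a list of dictionaries (for example with PyYAML's `yaml.safe_load`). The function `check_records` and its rules are illustrative and not part of this notebook's library; the second record is a hand-written example of the intended shape, not model output.

```python
ALLOWED_KEYS = {"lastname", "firstname", "title", "position", "affiliation", "role"}
ALLOWED_ROLES = {"Herausgeber", "Redaktion", "Schriftleitung", "Beirat"}

def check_records(records):
    """Return a list of (index, message) pairs for suspicious records."""
    problems = []
    for i, rec in enumerate(records):
        unknown = set(rec) - ALLOWED_KEYS
        if unknown:
            problems.append((i, f"unknown keys: {sorted(unknown)}"))
        role = rec.get("role")
        if role is not None and role not in ALLOWED_ROLES:
            problems.append((i, f"unexpected role: {role!r}"))
        # A role name in the affiliation field means the fields were mixed up
        if rec.get("affiliation") in ALLOWED_ROLES:
            problems.append((i, "affiliation holds a role name"))
    return problems

records = [
    {"lastname": "Knesebeck", "firstname": "Stefan", "title": "Prof. Dr.",
     "position": "Universität Wuppertal", "affiliation": "Herausgeber",
     "role": "Herausgeber"},
    {"lastname": "Valentin", "firstname": "Vera", "title": "Prof. Dr.",
     "affiliation": "Hochschule für Recht und Sport Edingen",
     "role": "Herausgeber"},
]
issues = check_records(records)
print(issues)
```

Checks like these only catch structural mix-ups. Missing persons or truncated names still require a comparison against the source text.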

## mlx-community/quantized-gemma-7b-it

In [15]:
import os
import time

# Disable tokenizers parallelism before the tokenizer is loaded
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
from mlx_lm import load, generate

# Reuses the `prompt` string defined in the Mistral section above
model, tokenizer = load("mlx-community/quantized-gemma-7b-it")
start_time = time.time()
response = generate(model, tokenizer, prompt=prompt, verbose=False, max_tokens=300)
print(response)
print(f'Generation took {time.time() - start_time} seconds')


Fetching 8 files:   0%|          | 0/8 [00:00<?, ?it/s]

```yaml
- lastname: Knesebeck
  firstname: Stefan
  title: Prof. Dr.
  position:
  affiliation: Universität Wuppertal
  role:Herausgeber

- lastname: Müller
  firstname: Dr. Dr. h.c. Fritz M.
  title: Prof. Dr. LL.M.(Yale)
  position:
  affiliation: Universität Wanne-Eickel
  role:Herausgeber

- lastname: Valentin
  firstname: RA Prof. Dr. Vera
  title: Prof. Dr.
  position:
  affiliation: Hochschule für Recht und Sport Edingen
  role:Herausgeber

- lastname: Rosenbaum
  firstname: Prof. Dr. Dr. h.c. Rita
  title: Prof. Dr.
  position:
  affiliation: Universität Tupfingen
  role:Herausgeber

- lastname: Gonzalo de Sanchez
  firstname: Dr. Ingo
  title:
  position: Vorsitzender Richter am Oberlandesgericht Rostock
  role:

- lastname: Gengelbach
  firstname: RA Adam
  title:
  position:
  affiliation: Unterhachingen
  role:

- lastname: Priem
  firstname: Ass. iur. Petra
  title:
  position:
  affiliation: Herrenchiemsee
  role:
```

This is the requested output. Please extract the requ
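
Since this model wraps its answer in a Markdown code fence and appends commentary, the YAML has to be cut out before it can be parsed. The following is a minimal sketch of one way to do that; `extract_yaml_block` is a hypothetical helper, and the sample reply is abbreviated by hand.

```python
import re

FENCE = "`" * 3  # three backticks, built here to keep this example readable
FENCE_RE = re.compile(FENCE + r"(?:yaml)?\s*\n(.*?)" + FENCE, re.DOTALL)

def extract_yaml_block(response):
    """Return the body of the first fenced block, or the whole text if none."""
    match = FENCE_RE.search(response)
    return match.group(1).strip() if match else response.strip()

reply = (
    f"{FENCE}yaml\n- lastname: Priem\n  firstname: Petra\n{FENCE}\n\n"
    "This is the requested output."
)
yaml_text = extract_yaml_block(reply)
print(yaml_text)
```

Lines such as `role:Herausgeber` in the output above lack the space after the colon, so a strict YAML parser would likely still reject the extracted text without further cleanup.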