# Study on generating structured output from local Hugging Face models
In this notebook I demonstrate that structured output can be generated from locally downloaded LLMs, and I show the weaknesses of the current method for doing so.
#### outlines module
A Python module that enforces constraints on the output of generative models. We will use it to guarantee that the output is valid JSON.

### Dependencies import

In [10]:
from pydantic import BaseModel, Field, PositiveFloat
import outlines
import json
from huggingface_hub import notebook_login
from transformers import pipeline
import pandas as pd

### Hugging Face login

In [2]:
notebook_login()

### Simple example of outlines

In [4]:
# define output structure
class CarBrands(BaseModel):
    brands: list[str]


# download model
model = outlines.models.transformers('gpt2')
# create generator
generator = outlines.generate.json(model, json.dumps(CarBrands.model_json_schema()))
# simple use
generator("Here is list of car brands: \n Ferrari, Polonez, Fiat\nJSON format: {'brands': []}\nHere is list in JSON:")

{'brands': ['cclintte',
  'cllintte',
  'phamotimete',
  'matemetro',
  'cccntyte',
  'esquimbe',
  'estamte',
  'travletecredit72',
  'esquimbe',
  'poissonbeau',
  'trumpettes',
  'doublestrotchettien',
  'tambiquetox',
  'neverebeau',
  'métal ,brichtte',
  'métalcox',
  'neverepeufte',
  "emmet's",
  'lourbs',
  'miles',
  'nixconveurs',
  'perenniest',
  'mountai',
  'zuehusband',
  'hevinations',
  'screddewaten',
  'i hanl8',
  'jerelsten',
  '300kltmen',
  'swike',
  'infernim,',
  'pals',
  'bachelors-menlete',
  'firsf/d/']}

The results are far from perfect, but the requested JSON structure is preserved. For comparison, here are the results from raw gpt2:

In [4]:
# plain text-generation pipeline, without outlines constraints
raw_generator = pipeline("text-generation", model="gpt2")
raw_generator("Here is list of car brands: \n Ferrari, Polonez, Fiat\nJSON format: {'brands': []}\nHere is list in JSON:")

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


[{'generated_text': 'Here is list of car brands: \n Ferrari, Polonez, Fiat\nJSON format: {\'brands\': []}\nHere is list in JSON:\n\n{"brands": [{"type":"fiat"}]\n\n'}]
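
One way to make this comparison concrete is to check whether each output parses as JSON. The sketch below uses only the standard library. `is_valid_json` is a hypothetical helper, `raw_completion` is the completion part of the raw gpt2 output shown above, and `constrained` is a shortened, re-serialised stand-in for the constrained output.

```python
import json


def is_valid_json(text: str) -> bool:
    """Return True if `text` parses as a JSON document."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# completion produced by raw gpt2 above (prompt part removed)
raw_completion = '\n\n{"brands": [{"type":"fiat"}]\n\n'
# shortened stand-in for the constrained output, re-serialised as JSON text
constrained = json.dumps({"brands": ["miles", "pals"]})

print(is_valid_json(raw_completion))
print(is_valid_json(constrained))
```

The raw completion opens an object but never closes it, so it fails to parse, while the constrained output parses regardless of how meaningless its content is.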

### Attempt on coupon data extraction
In the following part I assess how well the `outlines`-driven approach performs on a simplified coupon data extraction task.
##### Simplified problem statement
* We want to extract only 3 fields for each coupon: `old_price`, `new_price` and `product_name`
* We assume that the data given to the model always contains info about exactly one coupon
* We test different forms of input data: raw CSV as in the original solution, CSV with some columns excluded, CSV encoded as JSON, and only the 'Text' fields extracted

In [4]:
class Coupon(BaseModel):
    product_name: str
    old_price: PositiveFloat
    new_price: PositiveFloat
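
Constrained generation fixes the shape of the output, not its meaning, and some outputs further down contain a zero `old_price` even though the field is declared as `PositiveFloat`. A minimal post-generation sanity check might look like the sketch below. `is_plausible` is a hypothetical helper, and the rule that a discount must lower the price is an assumption of this sketch, not part of the `Coupon` schema.

```python
def is_plausible(coupon: dict) -> bool:
    """Cheap sanity check for a generated coupon dict."""
    name = coupon.get("product_name")
    old = coupon.get("old_price")
    new = coupon.get("new_price")
    # the name must be a non-empty string
    if not isinstance(name, str) or not name.strip():
        return False
    # both prices must be numbers
    if not all(isinstance(p, (int, float)) for p in (old, new)):
        return False
    # assumption: both prices are positive and the discount lowers the price
    return 0 < new < old


print(is_plausible({"product_name": "HOHES C Water", "old_price": 1.49, "new_price": 1.19}))
print(is_plausible({"product_name": "Water (75 ml)", "old_price": 0, "new_price": 0.75}))
```

A check like this cannot tell whether the extracted values are correct, but it can filter out answers that are obviously wrong before any further processing.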

#### Define single coupon row ranges

In [5]:
frame = pd.read_csv('ds/18789327023.csv')
coupons = {
    "1": (2, 7),
    "2": (7, 12),
    "3": (12, 17)
}

#### Model download
Here I use a Llama model, which is available only after submitting an access request. My request was approved within 24 hours.

In [6]:
model = outlines.models.transformers('meta-llama/Llama-3.2-1B')
generator = outlines.generate.json(model, json.dumps(Coupon.model_json_schema()))

In [7]:
inputs = {}

In [8]:
def frame_to_json(df: pd.DataFrame, list_name: str = 'rows') -> dict:
    res = {
        list_name: []
    }
    for ix, row in df.iterrows():
        obj  = {}
        for col in df.columns:
            obj[col] = row[col]
        res[list_name].append(obj)
    return res

In [9]:
REDUCED_COLUMNS = ["Text", "X 1", "Y 1", "X 2", "Y 2"]

inputs["raw_csv"] = {k: frame[v[0]:v[1]].to_csv() for k, v in coupons.items()}
inputs["reduced_csv"] = {k: frame[REDUCED_COLUMNS][v[0]:v[1]].to_csv() for k, v in coupons.items()}
inputs["json_encoded"] = {k: json.dumps(frame_to_json(frame[REDUCED_COLUMNS][v[0]:v[1]])) for k, v in coupons.items()}
inputs["only_text"] = {k: '\n'.join(frame['Text'][v[0]:v[1]].to_list()) for k, v in coupons.items()}

In [10]:
inputs["only_text"]

{'1': 'UVP 1.79\n1.29\nUVP\nRAUCH Eistee\nje 1,5 I',
 '2': 'UVP 0.99\n0.69\nUVP\nLIPTON Ice Tea\nje 0,33 I',
 '3': 'UVP 1.49\n1.19\nUVP\nHOHES C Water\nje 0,75 I'}

#### Prompt proposals
Here I test the performance of different prompts.

In [11]:
# prompt in form of direct command
prompt_1 = """
Your task is to extract info about discount coupon. You are given texts that user sees on smartphone screen in format: 'Text', 'X 1', 'X 2', 'Y 1', 'Y 2'. 'Text' field usually contains most useful infos. Provide me following data about given coupon: product_name: name of discounted product, old_price: price before discount, new_price: price after discount. Note that in input Text fields might be in wrong order. The results should be put inside JSON. Here is input:\n\n {}
"""
# prompt that aims to trick the model into completing a sequence
prompt_2 = """
Here is text representation of what user sees on smartphone screen: \n{}\nThis data contains info about single discount coupon visible on screen. Most of this data lies in 'Text' fields.\n Here is JSON with extracted info about this coupon. JSON contains discounted product name, old price and new price. JSON:\n 
"""
# direct command, shorter
prompt_3 = """
Here is data about coupon from smartphone screen: \n{}\n extract info about it and and put inside json.\n  
"""
# 'tricky' prompt, shorter
prompt_4 = """
Here is data from smartphone screen with info about discount coupon:\n{}\n. Here is the same info, but extracted in JSON:\n
"""
prompts = [prompt_1, prompt_2, prompt_3, prompt_4]

In [12]:
for input_fmt in inputs:
    for i, prompt in enumerate(prompts, start=1):
        print(f"\n{input_fmt=}, prompt_{i}")
        for k in coupons.keys():
            print(f"\tcoupon {k}")
            print('\t', generator(prompt.format(inputs[input_fmt][k])))


input_fmt='raw_csv', prompt_1
	coupon 1
	 {'product_name': 'je 1,5 I', 'old_price': 11.47, 'new_price': 7.75}
	coupon 2
	 {'product_name': 'je 0,33 I', 'old_price': 219.99, 'new_price': 161.99}
	coupon 3
	 {'product_name': 'je 0,75 I', 'old_price': 8.99, 'new_price': 7.99}

input_fmt='raw_csv', prompt_2
	coupon 1
	 {'product_name': 'je 1,5 I', 'old_price': 100, 'new_price': 30.05}
	coupon 2
	 {'product_name': 'ANIKO ULTRA-CINKURENT, AMOLUSHI', 'old_price': 59.88, 'new_price': 39.99}
	coupon 3
	 {'product_name': '1.19', 'old_price': 8720.0, 'new_price': 1920.0}

input_fmt='raw_csv', prompt_3
	coupon 1
	 {'product_name': 'UVP 1.79', 'old_price': 1.79, 'new_price': 1.79}
	coupon 2
	 {'product_name': 'URUUsP15r9tqk', 'old_price': 60, 'new_price': 30.99}
	coupon 3
	 {'product_name': 'Water (75 ml)', 'old_price': 0, 'new_price': 0.75}

input_fmt='raw_csv', prompt_4
	coupon 1
	 {'product_name': 'Benfotiamine', 'old_price': 79.5, 'new_price': 49.7}
	coupon 2
	 {'product_name': 'de.penny.app.p

In [15]:
# try again with an evolution of prompt_4 plus some experiments, using only the last 2 input formats
# 'tricky' prompt, shorter
prompt_5 = """
This is data from screen with info (like price and product name) about discount coupon:\n{}\n. Here extracted info, in form of JSON:\n
"""
prompt_6 = """
Consider the following data about discount coupon: \n\n{}\n\n This is JSON with extracted info about this coupon.:\n
"""

prompt_7 = """
Here is data from mobile_app containing discount coupon that needs to be converted to JSON:\n {} \n Here is output JSON:\n
"""
prompts += [prompt_5, prompt_6, prompt_7]

In [16]:
for input_fmt in ["only_text", "json_encoded"]:
    for i, prompt in enumerate(prompts, start=1):
        if i < 5:
            continue
        print(f"\n{input_fmt=}, prompt_{i}")
        for k in coupons.keys():
            print(f"\tcoupon {k}")
            print('\t', generator(prompt.format(inputs[input_fmt][k])))


input_fmt='only_text', prompt_5
	coupon 1
	 {'product_name': 'JegetELLI 1,5 I', 'old_price': 1.5, 'new_price': 1.29}
	coupon 2
	 {'product_name': 'LIPTON Ice Tea', 'old_price': 15.99, 'new_price': 7.99}
	coupon 3
	 {'product_name': 'UVP 1.49', 'old_price': 1.69, 'new_price': 1.19}

input_fmt='only_text', prompt_6
	coupon 1
	 {'product_name': 'RAUCH Eistee', 'old_price': 1.5, 'new_price': 1.29}
	coupon 2
	 {'product_name': 'LIPTON Ice Tea', 'old_price': 0.69, 'new_price': 0.33}
	coupon 3
	 {'product_name': 'Water', 'old_price': 1.19, 'new_price': 0.75}

input_fmt='only_text', prompt_7
	coupon 1
	 {'product_name': 'RAUCH Eistee', 'old_price': 1.5, 'new_price': 1.29}
	coupon 2
	 {'product_name': 'keex', 'old_price': 0.69, 'new_price': 0.99}
	coupon 3
	 {'product_name': 'je 0,75 I', 'old_price': 0, 'new_price': 0.8}

input_fmt='json_encoded', prompt_5
	coupon 1
	 {'product_name': 'UVP 1.79', 'old_price': 0, 'new_price': 0.0}
	coupon 2
	 {'product_name': 'Icingwa', 'old_price': 855.6, 'new

#### Final prompt proposal
Testing only on the 'only_text' input format, as it was the most effective one in the previous experiments.

In [25]:
prompt = """
This is input from screen with info (old price, new price and product name) about discount coupon/ticket:\n{}\n. Here extracted info, in form of JSON:\n
"""
for k in coupons.keys():
    print(f"\tcoupon {k}")
    print('\t', generator(prompt.format(inputs['only_text'][k])))

	coupon 1
	 {'product_name': 'UVP 1.79', 'old_price': 1.79, 'new_price': 1.29}
	coupon 2
	 {'product_name': 'LIPTON Ice Tea', 'old_price': 0.69, 'new_price': 0.99}
	coupon 3
	 {'product_name': 'HOHES C Water', 'old_price': 1.49, 'new_price': 1.19}
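
To compare prompts more systematically, the outputs can be scored field by field against hand-labelled values. In the sketch below, the `expected` values are read off the `only_text` inputs by hand and are an assumption of this sketch. `predicted` copies the outputs of the final prompt printed above, and `field_matches` and `field_accuracy` are hypothetical helpers.

```python
import math

# assumption: ground truth read off the 'only_text' inputs by hand
expected = {
    "1": {"product_name": "RAUCH Eistee", "old_price": 1.79, "new_price": 1.29},
    "2": {"product_name": "LIPTON Ice Tea", "old_price": 0.99, "new_price": 0.69},
    "3": {"product_name": "HOHES C Water", "old_price": 1.49, "new_price": 1.19},
}
# outputs of the final prompt, copied from the cell above
predicted = {
    "1": {"product_name": "UVP 1.79", "old_price": 1.79, "new_price": 1.29},
    "2": {"product_name": "LIPTON Ice Tea", "old_price": 0.69, "new_price": 0.99},
    "3": {"product_name": "HOHES C Water", "old_price": 1.49, "new_price": 1.19},
}


def field_matches(want, got) -> bool:
    """Compare prices with a small tolerance and everything else exactly."""
    if isinstance(want, float):
        return isinstance(got, (int, float)) and math.isclose(want, got, abs_tol=1e-9)
    return want == got


def field_accuracy(expected: dict, predicted: dict) -> float:
    """Fraction of expected fields that the prediction got right."""
    total = 0
    correct = 0
    for key, want in expected.items():
        got = predicted.get(key, {})
        for field, value in want.items():
            total += 1
            correct += field_matches(value, got.get(field))
    return correct / total


print(f"{field_accuracy(expected, predicted):.2f}")
```

Under these labels 6 of the 9 fields match: coupon 3 is fully correct, coupon 1 has the wrong product name, and coupon 2 has the two prices swapped.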


## Summary
* after some prompt engineering, it appears possible to get partially correct results from a model on the order of 1B parameters, given extensively preprocessed input
* a reliable solution would require fine-tuning the model or rethinking the overall approach
* tricks such as taking the most frequent answer from several predictions might help
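
The voting idea from the last bullet could be sketched as follows. `majority_vote` is a hypothetical helper, and `generate` stands for any callable that maps a prompt to a dict, such as the `outlines` generator used above. The sketch assumes that sampling is stochastic, so that repeated calls can differ. The vote is taken per field rather than per whole answer, which is a design choice of this sketch.

```python
from collections import Counter
from typing import Callable


def majority_vote(generate: Callable[[str], dict], prompt: str, n: int = 5) -> dict:
    """Call `generate` n times and keep the most frequent value of each field."""
    samples = [generate(prompt) for _ in range(n)]
    result = {}
    for field in samples[0]:
        counts = Counter(sample.get(field) for sample in samples)
        # most_common(1) returns [(value, count)]; ties go to the value seen first
        result[field] = counts.most_common(1)[0][0]
    return result


# stand-in for the model: replays canned answers instead of sampling
canned = iter([
    {"product_name": "RAUCH Eistee", "old_price": 1.5, "new_price": 1.29},
    {"product_name": "RAUCH Eistee", "old_price": 1.79, "new_price": 1.29},
    {"product_name": "UVP 1.79", "old_price": 1.79, "new_price": 1.29},
])
voted = majority_vote(lambda _prompt: next(canned), "prompt text", n=3)
print(voted)
```

Voting only helps when the correct value is also the most frequent one, so it would not fix a systematic error such as consistently swapped prices.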

## Abandoned Ideas
* The LangChain library provides good support for structured output from LLMs, but this feature is not implemented for its Hugging Face integration
