# Study on generating structured output from local Hugging Face models
In this notebook I demonstrate how to generate structured output from locally downloaded LLMs and show the weaknesses of the current method.
#### The `outlines` module
A Python module that enforces constraints on the output of generative models. We will use it to guarantee that the output is valid JSON matching a given schema.
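
To give an intuition for what "enforcing constraints" means, here is a toy sketch of constrained decoding: at every step, tokens that would break the target format are masked out before the best-scoring token is picked. The vocabulary, the scores in `toy_scores`, and the hand-written state machine are all made up for illustration. This is not the `outlines` implementation, only the general idea under heavily simplified assumptions.

```python
import json
import math

# Hypothetical toy vocabulary; a real model has tens of thousands of tokens.
VOCAB = ["hello", "{", "}", '"brands"', ":", "[", "]"]

# Hand-written state machine that accepts exactly: {"brands":[]}
# Maps state -> {allowed token: next state}.
TRANSITIONS = {
    0: {"{": 1},
    1: {'"brands"': 2},
    2: {":": 3},
    3: {"[": 4},
    4: {"]": 5},
    5: {"}": 6},
}
FINAL_STATE = 6


def toy_scores() -> dict[str, float]:
    """Stand-in for model logits: this fake model always prefers 'hello'."""
    return {token: (5.0 if token == "hello" else 1.0) for token in VOCAB}


def unconstrained_greedy(steps: int) -> str:
    """Pick the best-scoring token at each step, with no format constraint."""
    out = []
    for _ in range(steps):
        scores = toy_scores()
        out.append(max(scores, key=scores.get))
    return "".join(out)


def constrained_greedy() -> str:
    """Pick the best-scoring token among those the state machine allows."""
    state, out = 0, []
    while state != FINAL_STATE:
        allowed = TRANSITIONS[state]
        # Mask every token the state machine does not allow in this state.
        masked = {
            token: (score if token in allowed else -math.inf)
            for token, score in toy_scores().items()
        }
        token = max(masked, key=masked.get)
        out.append(token)
        state = allowed[token]
    return "".join(out)


constrained = constrained_greedy()
free = unconstrained_greedy(3)
print(free)
print(constrained, json.loads(constrained))
```

Even though the fake model always prefers `hello`, the constrained loop can only emit a string that parses as JSON. The constraint fixes the shape of the output, not its content.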

### Dependencies import

In [3]:
from pydantic import BaseModel, Field, PositiveFloat
import outlines
import json
from huggingface_hub import notebook_login
from transformers import pipeline
import pandas as pd

### Hugging Face login

In [2]:
notebook_login()


### Simple example of outlines

In [4]:
# define output structure
class CarBrands(BaseModel):
    brands: list[str]


# download model
model = outlines.models.transformers('gpt2')
# create generator
generator = outlines.generate.json(model, json.dumps(CarBrands.model_json_schema()))
# simple use
generator("Here is list of car brands: \n Ferrari, Polonez, Fiat\nJSON format: {'brands': []}\nHere is list in JSON:")

{'brands': ['cclintte',
  'cllintte',
  'phamotimete',
  'matemetro',
  'cccntyte',
  'esquimbe',
  'estamte',
  'travletecredit72',
  'esquimbe',
  'poissonbeau',
  'trumpettes',
  'doublestrotchettien',
  'tambiquetox',
  'neverebeau',
  'métal ,brichtte',
  'métalcox',
  'neverepeufte',
  "emmet's",
  'lourbs',
  'miles',
  'nixconveurs',
  'perenniest',
  'mountai',
  'zuehusband',
  'hevinations',
  'screddewaten',
  'i hanl8',
  'jerelsten',
  '300kltmen',
  'swike',
  'infernim,',
  'pals',
  'bachelors-menlete',
  'firsf/d/']}

The results are far from perfect, but the requested JSON structure is preserved. For comparison, here are the results from raw gpt2:

In [4]:
generator = pipeline("text-generation", model="gpt2")
generator("Here is list of car brands: \n Ferrari, Polonez, Fiat\nJSON format: {'brands': []}\nHere is list in JSON:")

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


[{'generated_text': 'Here is list of car brands: \n Ferrari, Polonez, Fiat\nJSON format: {\'brands\': []}\nHere is list in JSON:\n\n{"brands": [{"type":"fiat"}]\n\n'}]
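
The raw output above looks like JSON but is not guaranteed to parse. A quick way to check this, sketched here with only the standard library, is to try `json.loads` on the generated continuation. The `raw_continuation` string is copied from the output above with the prompt and surrounding newlines removed; `is_valid_json` and `well_formed` are introduced purely for illustration.

```python
import json


def is_valid_json(text: str) -> bool:
    """Return True if `text` parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Continuation produced by raw gpt2 in the cell above (prompt removed).
raw_continuation = '{"brands": [{"type":"fiat"}]'
# A hand-written, well-formed answer to the same prompt, for comparison.
well_formed = '{"brands": ["Ferrari", "Polonez", "Fiat"]}'

raw_ok = is_valid_json(raw_continuation)
well_formed_ok = is_valid_json(well_formed)
print(raw_ok, well_formed_ok)
```

The raw continuation fails to parse because the outer object is never closed. Its list also holds objects rather than the plain strings the `CarBrands` schema asks for.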

### Attempt on coupon data extraction
In the following part I aim to determine how well the `outlines`-driven approach performs on a simplified coupon data extraction task.
##### Simplified problem statement
* We want to extract only 3 fields for each coupon: `old_price`, `new_price` and `product_name`
* We assume that the data given to the model always contains information about exactly one coupon
* We test different forms of input data: the raw CSV as in the original solution, the CSV with some columns excluded, the same reduced CSV encoded as JSON, and only the extracted 'Text' fields

In [5]:
class Coupon(BaseModel):
    product_name: str
    old_price: PositiveFloat
    new_price: PositiveFloat

#### Define single coupon row ranges

In [6]:
frame = pd.read_csv('ds/18789327023.csv')
coupons = {
    "1": (2, 7),
    "2": (7, 12),
    "3": (12, 17)
}

#### Model download
Here I used a Llama model, which is available only after submitting an access request. My request was answered within 24 hours.

In [7]:
model = outlines.models.transformers('meta-llama/Llama-3.2-1B')
generator = outlines.generate.json(model, json.dumps(Coupon.model_json_schema()))

In [8]:
inputs = {}

In [9]:
def frame_to_json(df: pd.DataFrame, list_name: str = 'rows') -> dict:
    res = {
        list_name: []
    }
    for ix, row in df.iterrows():
        obj  = {}
        for col in df.columns:
            obj[col] = row[col]
        res[list_name].append(obj)
    return res

In [10]:
REDUCED_COLUMNS = ["Text", "X 1", "Y 1", "X 2", "Y 2"]

inputs["raw_csv"] = {k: frame[v[0]:v[1]].to_csv() for k, v in coupons.items()}
inputs["reduced_csv"] = {k: frame[REDUCED_COLUMNS][v[0]:v[1]].to_csv() for k, v in coupons.items()}
inputs["json_encoded"] = {k: json.dumps(frame_to_json(frame[REDUCED_COLUMNS][v[0]:v[1]])) for k, v in coupons.items()}
inputs["only_text"] = {k: '\n'.join(frame['Text'][v[0]:v[1]].to_list()) for k, v in coupons.items()}

In [11]:
inputs["only_text"]

{'1': 'UVP 1.79\n1.29\nUVP\nRAUCH Eistee\nje 1,5 I',
 '2': 'UVP 0.99\n0.69\nUVP\nLIPTON Ice Tea\nje 0,33 I',
 '3': 'UVP 1.49\n1.19\nUVP\nHOHES C Water\nje 0,75 I'}

#### Prompt proposals
Here I test the performance of different prompts.

In [12]:
# prompt in the form of a direct command
prompt_1 = """
Your task is to extract info about discount coupon. You are given texts that user sees on smartphone screen in format: 'Text', 'X 1', 'X 2', 'Y 1', 'Y 2'. 'Text' field usually contains most useful infos. Provide me following data about given coupon: product_name: name of discounted product, old_price: price before discount, new_price: price after discount. Note that in input Text fields might be in wrong order. The results should be put inside JSON. Here is input:\n\n {}
"""
# prompt that aims to trick the model into completing a sequence
prompt_2 = """
Here is text representation of what user sees on smartphone screen: \n{}\nThis data contains info about single discount coupon visible on screen. Most of this data lies in 'Text' fields.\n Here is JSON with extracted info about this coupon. JSON contains discounted product name, old price and new price. JSON:\n 
"""
prompts = [prompt_1, prompt_2]

In [13]:
for input_fmt in inputs:
    for i, prompt in enumerate(prompts):
        # the printed index is zero-based: prompt_0 is prompt_1 above, prompt_1 is prompt_2
        print(f"{input_fmt=}, prompt_{i}")
        for k in coupons.keys():
            print(f"coupon {k}")
            print(generator(prompt.format(inputs[input_fmt][k])))

input_fmt='raw_csv', prompt_0
coupon 1
{'product_name': '1.29', 'old_price': 2.99, 'new_price': 1.29}
coupon 2
{'product_name': 'Confité Time: COOL EUROS > 10 ', 'old_price': 9, 'new_price': 4.78}
coupon 3
{'product_name': 'PENNY / Savings Up to 80% - Up to 50% Cashback', 'old_price': 1.59, 'new_price': 1.19}
input_fmt='raw_csv', prompt_1
coupon 1
{'product_name': 'Je 1,5', 'old_price': 1000, 'new_price': 600.0}
coupon 2
{'product_name': 'LINTPTYPE Worth of: US$16; Sale: US$10 ($5 off) | Typical Sale Price: US$10; Real Sale Price: US$5', 'old_price': 16, 'new_price': 5}
coupon 3
{'product_name': 'DONATIOS', 'old_price': 32, 'new_price': 19.95}
input_fmt='reduced_csv', prompt_0
coupon 1
{'product_name': 'Up to 1.79', 'old_price': 355, 'new_price': 494999}
coupon 2
{'product_name': 'UVP 0.99', 'old_price': 855, 'new_price': 637.3849032390941}
coupon 3
{'product_name': 'UVP 1.49', 'old_price': 355, 'new_price': 1487}
input_fmt='reduced_csv', prompt_1
coupon 1
{'product_name': ' Je 1,5 I',
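
The outputs above all satisfy the schema, yet many are clearly wrong. One cheap sanity check, sketched below, is to verify that the discount actually lowers the price and that the product name is not just a number. `looks_plausible` is a hypothetical helper, and its rules are assumptions about what a sensible coupon looks like, not part of the schema. The third entry is a hand-written answer built from the `only_text` input for coupon 1 and is assumed to be correct.

```python
def looks_plausible(coupon: dict) -> bool:
    """Heuristic sanity check on one extracted coupon (assumed rules)."""
    name = coupon["product_name"].strip()
    # A name that parses as a number is almost certainly a price, not a product.
    try:
        float(name.replace(",", "."))
        return False
    except ValueError:
        pass
    # A discount should lower the price, and the name must not be empty.
    return bool(name) and 0 < coupon["new_price"] < coupon["old_price"]


results = [
    # Two results copied from the output above.
    {"product_name": "1.29", "old_price": 2.99, "new_price": 1.29},
    {"product_name": "Up to 1.79", "old_price": 355, "new_price": 494999},
    # Hand-written answer for coupon 1 (assumption, not model output).
    {"product_name": "RAUCH Eistee", "old_price": 1.79, "new_price": 1.29},
]
flags = [looks_plausible(r) for r in results]
print(flags)
```

A check like this does not prove an extraction is right, but it flags results that cannot be right without needing labelled data.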

## To do
* Try different prompts and models; try in-context learning
* Look for models designed for structured output generation (there may be some on Hugging Face)
* Explore larger models: how big does a model need to be to generate satisfactory results?
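
As a starting point for the in-context learning item above, a few-shot prompt could prepend solved examples to the query. The sketch below is only one possible layout. `build_few_shot_prompt` is a hypothetical helper, and the expected JSON in the example pair was written by hand from the `only_text` input of coupon 2, so it is an assumed label rather than a verified one.

```python
import json


def build_few_shot_prompt(examples: list[tuple[str, dict]], query: str) -> str:
    """Concatenate solved (input, JSON) pairs followed by the unsolved query."""
    parts = []
    for text, answer in examples:
        parts.append(f"Input:\n{text}\nJSON:\n{json.dumps(answer)}\n")
    # End with an open "JSON:" so the model continues with the answer.
    parts.append(f"Input:\n{query}\nJSON:\n")
    return "\n".join(parts)


# Hand-labelled example (assumed correct) taken from coupon 2 of `only_text`.
examples = [
    (
        "UVP 0.99\n0.69\nUVP\nLIPTON Ice Tea\nje 0,33 I",
        {"product_name": "LIPTON Ice Tea", "old_price": 0.99, "new_price": 0.69},
    )
]
query = "UVP 1.79\n1.29\nUVP\nRAUCH Eistee\nje 1,5 I"
prompt = build_few_shot_prompt(examples, query)
print(prompt)
```

The resulting string could be passed to `generator(prompt)` in place of the formatted prompts used earlier. Whether it helps a 1B model is an open question that would need to be measured.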

## Abandoned Ideas
* The LangChain library provides good support for structured output from LLMs, but this feature is not implemented for its Hugging Face integration
