# Receipt Extraction

Document extraction is a use case that is [near and dear to my heart](https://www.youtube.com/watch?v=hoBtFhZRxzw). The last time I dug deeply into it, far fewer models
could handle it than can today. In honor of Pixtral and Llama 3.2, I thought it would be fun to revisit it with the classic SROIE dataset.

Let's jump right in!

## Install dependencies


In [9]:
%pip install autoevals braintrust requests openai

Note: you may need to restart the kernel to use updated packages.


## Set up LLM clients

We'll use OpenAI's GPT-4o and GPT-4o mini, Llama 3.2 11B and 90B, and Pixtral 12B with a set of test cases from SROIE and see how they perform. You can access each of these models
through the standard OpenAI client using Braintrust's proxy.


In [29]:
import os

import braintrust
import openai

client = braintrust.wrap_openai(
    openai.AsyncOpenAI(
        api_key=os.environ["BRAINTRUST_API_KEY"],
        base_url="http://localhost:8000/v1/proxy",
        # base_url="https://api.braintrust.dev/v1/proxy",
    )
)


## Downloading the data and sanity testing it

The [zzzDavid/ICDAR-2019-SROIE](https://github.com/zzzDavid/ICDAR-2019-SROIE/tree/master) repo has neatly organized the data for us. The files are named with a three-digit convention, and for each image (e.g. 002.jpg) there is a corresponding
file (e.g. 002.json) with the key-value pairs. There are a few different ways we could test the models:

- Ask each model to extract values for specific keys
- Ask each model to generate a value for each of a set of keys
- Ask the model to extract all keys and values from the receipt

To keep things simple, we'll go with the first option, but it would be interesting to try all three and see how that affects the results.
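
For the third option, the prompt and the parsing would both look a bit different. Here is a minimal sketch, not something this notebook runs: `ALL_FIELDS_PROMPT` and `parse_fields` are hypothetical names, and the sketch assumes the model replies with a single JSON object, possibly wrapped in a markdown code fence.

```python
import json

# Hypothetical prompt for the "extract everything" variant (not used in this notebook).
ALL_FIELDS_PROMPT = """Extract every field you can find on the provided receipt.
Return a single JSON object mapping field names to string values, and nothing else."""

FENCE = "`" * 3  # markdown code fence that models sometimes wrap JSON in


def parse_fields(raw: str) -> dict:
    """Parse a model reply into a {field: value} dict; return {} if it is not a JSON object."""
    text = raw.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence line
        text = "\n".join(lines)
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return {}
    if not isinstance(parsed, dict):
        return {}
    # Normalize keys and values to strings so they compare cleanly with the SROIE labels.
    return {str(key): str(value) for key, value in parsed.items()}
```

Each parsed key could then be scored against the matching key in the receipt's `.json` file, which also lets you measure how many fields the model misses or invents.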


In [26]:
import requests

indices = [str(i).zfill(3) for i in range(100)]


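def load_receipt(index):
    """Return (ground-truth fields, image URL) for a zero-padded index such as "001"."""
    img_path = f"https://raw.githubusercontent.com/zzzDavid/ICDAR-2019-SROIE/refs/heads/master/data/img/{index}.jpg"
    json_path = f"https://raw.githubusercontent.com/zzzDavid/ICDAR-2019-SROIE/refs/heads/master/data/key/{index}.json"

    # Fail loudly on a missing file instead of trying to parse an error page as JSON.
    response = requests.get(json_path, timeout=30)
    response.raise_for_status()
    return response.json(), img_path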


fields, img_path = load_receipt("001")
fields

{'company': 'INDAH GIFT & HOME DECO',
 'date': '19/10/2018',
 'address': '27, JALAN DEDAP 13, TAMAN JOHOR JAYA, 81100 JOHOR BAHRU, JOHOR.',
 'total': '60.30'}

![receipt](https://raw.githubusercontent.com/zzzDavid/ICDAR-2019-SROIE/refs/heads/master/data/img/001.jpg)


In [30]:
MODELS = [
    "gpt-4o",
    "gpt-4o-mini",
    "meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo",
    "meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo",
    "pixtral-12b-2409",
]

SYSTEM_PROMPT = """Extract the field '{key}' from the provided receipt. Return the extracted
value, and nothing else. For example, if the field is 'Total' and the value is '100',
you should just return '100'. If the field is not present, return null.

Do not decorate the output with any explanation, or markdown. Just return the extracted value.
"""


@braintrust.traced
async def extract_value(model, key, img_path):
    # Ask one model for one field: the prompt names the key, the user message carries the image URL.
    response = await client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT.format(key=key)},
            {"role": "user", "content": [{"type": "image_url", "image_url": {"url": img_path}}]},
        ],
        temperature=0,
    )
    # content can be None, so fall back to an empty string before stripping.
    return (response.choices[0].message.content or "").strip()


for model in MODELS:
    print("Running model: ", model)
    print(await extract_value(model, "company", img_path))
    print("\n")

Running model:  gpt-4o
INDAH GIFT & HOME DECO


Running model:  gpt-4o-mini
INDAH GIFT & HOME DECO


Running model:  meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo
60.30


Running model:  meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo
INDAH GIFT & HOME DECO


Running model:  pixtral-12b-2409
tan woon yann




## Running the evaluation

Now that we've run a basic sanity test, let's run an evaluation! We'll use the `Levenshtein` and `Factuality` scorers to assess performance.
`Levenshtein` is a heuristic that tells us how closely the actual and expected strings match. Since some of the models will occasionally add superfluous
explanation text, `Factuality`, which is LLM-based, should still give us an accuracy measurement in those cases.
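
To make the heuristic concrete, here is a rough pure-Python sketch of a normalized Levenshtein score. It assumes the score is `1 - distance / length of the longer string`, which is my understanding of how the `autoevals` scorer normalizes, but treat that as an assumption. The function names are illustrative and are not part of `autoevals`.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def levenshtein_similarity(output: str, expected: str) -> float:
    """Score in [0, 1]: 1.0 for identical strings, lower as more edits are needed."""
    longest = max(len(output), len(expected))
    if longest == 0:
        return 1.0  # two empty strings match exactly
    return 1 - levenshtein_distance(output, expected) / longest
```

For example, an output of `60.80` against an expected `60.30` is one substitution across five characters, so it would score 0.8 under this normalization. A correct value buried in a sentence of explanation would score much lower, which is the gap `Factuality` is meant to cover.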


In [34]:
from braintrust import Eval

from autoevals import Factuality, Levenshtein

NUM_RECEIPTS = 20

data = [
    {
        "input": {
            "key": key,
            "img_path": img_path,
        },
        "expected": value,
        "metadata": {
            "idx": idx,
        },
    }
    for idx, (fields, img_path) in [(idx, load_receipt(idx)) for idx in indices[:NUM_RECEIPTS]]
    for key, value in fields.items()
]

for model in MODELS:

    # `task` closes over the current `model`; each Eval is awaited before the loop moves on.
    async def task(input):
        return await extract_value(model, input["key"], input["img_path"])

    await Eval(
        "Receipt Extraction",
        data=data,
        task=task,
        scores=[Levenshtein, Factuality],
        experiment_name=f"Receipt Extraction - {model}",
        metadata={"model": model},
    )

Experiment Receipt Extraction - gpt-4o-5084b768 is running at http://localhost:3000/app/braintrustdata.com/p/Receipt%20Extraction/experiments/Receipt%20Extraction%20-%20gpt-4o-5084b768
Receipt Extraction [experiment_name=Receipt Extraction - gpt-4o] (data): 80it [00:00, 160011.60it/s]
Receipt Extraction [experiment_name=Receipt Extraction - gpt-4o] (tasks): 100%|██████████| 80/80 [00:36<00:00,  2.20it/s]
Experiment Receipt Extraction - gpt-4o-mini is running at http://localhost:3000/app/braintrustdata.com/p/Receipt%20Extraction/experiments/Receipt%20Extraction%20-%20gpt-4o-mini



81.25% 'Factuality'  score
87.85% 'Levenshtein' score

4.28s duration
4.27s llm_duration
1002tok prompt_tokens
11.71tok completion_tokens
1013.71tok total_tokens
0.01$ estimated_cost

See results for Receipt Extraction - gpt-4o-5084b768 at http://localhost:3000/app/braintrustdata.com/p/Receipt%20Extraction/experiments/Receipt%20Extraction%20-%20gpt-4o-5084b768


Receipt Extraction [experiment_name=Receipt Extraction - gpt-4o-mini] (data): 80it [00:00, 284842.38it/s]
Receipt Extraction [experiment_name=Receipt Extraction - gpt-4o-mini] (tasks): 100%|██████████| 80/80 [00:11<00:00,  6.73it/s]
Experiment Receipt Extraction - meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo is running at http://localhost:3000/app/braintrustdata.com/p/Receipt%20Extraction/experiments/Receipt%20Extraction%20-%20meta-llama%2FLlama-3.2-11B-Vision-Instruct-Turbo



Receipt Extraction - gpt-4o-mini compared to Receipt Extraction - gpt-4o-5084b768:
82.75% (+01.50%) 'Factuality'  score	(6 improvements, 6 regressions)
91.42% (+03.57%) 'Levenshtein' score	(17 improvements, 5 regressions)

3.65s (-62.78%) 'duration'         	(44 improvements, 31 regressions)
3.64s (-62.81%) 'llm_duration'     	(44 improvements, 31 regressions)
30685.30tok (+2968330.00%) 'prompt_tokens'    	(0 improvements, 80 regressions)
13.14tok (+142.50%) 'completion_tokens'	(4 improvements, 13 regressions)
30698.44tok (+2968472.50%) 'total_tokens'     	(0 improvements, 80 regressions)
0.00$ (-00.06%) 'estimated_cost'   	(75 improvements, 0 regressions)

See results for Receipt Extraction - gpt-4o-mini at http://localhost:3000/app/braintrustdata.com/p/Receipt%20Extraction/experiments/Receipt%20Extraction%20-%20gpt-4o-mini


Receipt Extraction [experiment_name=Receipt Extraction - meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo] (data): 80it [00:00, 220462.76it/s]
Receipt Extraction [experiment_name=Receipt Extraction - meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo] (tasks): 100%|██████████| 80/80 [00:13<00:00,  5.96it/s]
Experiment Receipt Extraction - meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo is running at http://localhost:3000/app/braintrustdata.com/p/Receipt%20Extraction/experiments/Receipt%20Extraction%20-%20meta-llama%2FLlama-3.2-90B-Vision-Instruct-Turbo



Receipt Extraction - meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo compared to Receipt Extraction - gpt-4o-mini:
50.04% (-41.37%) 'Levenshtein' score	(3 improvements, 47 regressions)
51.50% (-31.25%) 'Factuality'  score	(6 improvements, 31 regressions)

5.82s (+217.45%) 'duration'         	(11 improvements, 64 regressions)
5.81s (+217.49%) 'llm_duration'     	(11 improvements, 64 regressions)
89tok (-3059630.00%) 'prompt_tokens'    	(80 improvements, 0 regressions)
10.15tok (-298.75%) 'completion_tokens'	(29 improvements, 49 regressions)
99.15tok (-3059928.75%) 'total_tokens'     	(80 improvements, 0 regressions)
0.00$ (-00.48%) 'estimated_cost'   	(75 improvements, 0 regressions)

See results for Receipt Extraction - meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo at http://localhost:3000/app/braintrustdata.com/p/Receipt%20Extraction/experiments/Receipt%20Extraction%20-%20meta-llama%2FLlama-3.2-11B-Vision-Instruct-Turbo


Receipt Extraction [experiment_name=Receipt Extraction - meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo] (data): 80it [00:00, 163600.35it/s]
Receipt Extraction [experiment_name=Receipt Extraction - meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo] (tasks): 100%|██████████| 80/80 [00:16<00:00,  4.71it/s]
Experiment Receipt Extraction - pixtral-12b-2409 is running at http://localhost:3000/app/braintrustdata.com/p/Receipt%20Extraction/experiments/Receipt%20Extraction%20-%20pixtral-12b-2409



Receipt Extraction - meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo compared to Receipt Extraction - meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo:
79.75% (+28.25%) 'Factuality'  score	(31 improvements, 8 regressions)
80.10% (+30.05%) 'Levenshtein' score	(40 improvements, 9 regressions)

6.99s (+116.80%) 'duration'         	(27 improvements, 48 regressions)
6.98s (+116.78%) 'llm_duration'     	(27 improvements, 48 regressions)
89tok (-) 'prompt_tokens'    	(0 improvements, 0 regressions)
15.20tok (+505.00%) 'completion_tokens'	(13 improvements, 35 regressions)
104.20tok (+505.00%) 'total_tokens'     	(13 improvements, 35 regressions)
0.00$ (+00.01%) 'estimated_cost'   	(0 improvements, 75 regressions)

See results for Receipt Extraction - meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo at http://localhost:3000/app/braintrustdata.com/p/Receipt%20Extraction/experiments/Receipt%20Extraction%20-%20meta-llama%2FLlama-3.2-90B-Vision-Instruct-Turbo


Receipt Extraction [experiment_name=Receipt Extraction - pixtral-12b-2409] (data): 80it [00:00, 280790.23it/s]
Receipt Extraction [experiment_name=Receipt Extraction - pixtral-12b-2409] (tasks): 100%|██████████| 80/80 [01:09<00:00,  1.15it/s]



Receipt Extraction - pixtral-12b-2409 compared to Receipt Extraction - meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo:
75.23% (-04.87%) 'Levenshtein' score	(15 improvements, 27 regressions)
68.00% (-11.75%) 'Factuality'  score	(12 improvements, 16 regressions)

35.77s (+2878.14%) 'duration'         	(24 improvements, 51 regressions)
35.76s (+2878.14%) 'llm_duration'     	(25 improvements, 50 regressions)
1949.35tok (+186035.00%) 'prompt_tokens'    	(0 improvements, 80 regressions)
20.51tok (+531.25%) 'completion_tokens'	(25 improvements, 52 regressions)
1969.86tok (+186566.25%) 'total_tokens'     	(0 improvements, 80 regressions)
0.00$ (+00.02%) 'estimated_cost'   	(0 improvements, 73 regressions)

See results for Receipt Extraction - pixtral-12b-2409 at http://localhost:3000/app/braintrustdata.com/p/Receipt%20Extraction/experiments/Receipt%20Extraction%20-%20pixtral-12b-2409
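
Because each summary above is reported relative to the previous experiment, it can be hard to compare all five models at a glance. The snippet below is a small convenience sketch that copies the absolute `Factuality` and `Levenshtein` percentages from the output above into a dict (`RESULTS` is just an illustrative name) and prints them ranked by `Factuality`.

```python
# Scores (in percent) copied from the experiment summaries printed above.
RESULTS = {
    "gpt-4o": {"Factuality": 81.25, "Levenshtein": 87.85},
    "gpt-4o-mini": {"Factuality": 82.75, "Levenshtein": 91.42},
    "meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo": {"Factuality": 51.50, "Levenshtein": 50.04},
    "meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo": {"Factuality": 79.75, "Levenshtein": 80.10},
    "pixtral-12b-2409": {"Factuality": 68.00, "Levenshtein": 75.23},
}

# Rank models from highest to lowest Factuality score.
ranking = sorted(RESULTS.items(), key=lambda item: item[1]["Factuality"], reverse=True)

print(f"{'model':<50}{'Factuality':>12}{'Levenshtein':>13}")
for model, scores in ranking:
    print(f"{model:<50}{scores['Factuality']:>12.2f}{scores['Levenshtein']:>13.2f}")
```

With only 20 receipts per model, small gaps such as the one between `gpt-4o` and `gpt-4o-mini` are probably within noise, so read the ordering as a rough guide.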
