<a href="https://colab.research.google.com/github/duper203/upstage_cookbook/blob/main/Structured_Text_Extraction_from_Images.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Extracting Structured Data from Images

## Introduction

In this notebook we demonstrate how to use a vision language model together with an LLM that has JSON mode enabled to extract structured text from images. The code below calls the `solar-docvision` and `solar-pro` models through the `openai` client pointed at the Upstage API.

In our case we will extract the line items from an invoice as JSON.

<img src="https://github.com/togethercomputer/together-cookbook/blob/main/images/structured_text_image.png?raw=1" width="750">


### Install relevant libraries

In [1]:
!pip install -qU openai

In [2]:
import os, json
from google.colab import userdata

os.environ["UPSTAGE_API_KEY"] = userdata.get("UPSTAGE_API_KEY")

## Create Invoice Structure using Pydantic

We need a way to tell the LLM which structure to organize the information into, including which fields to expect in the receipt. We do this with `pydantic` models.

Below we define the required classes.

- Each line item on the receipt will have a `name`, `price` and `quantity`. The `Item` class specifies this.
- Each receipt/invoice is a combination of multiple line `Item` elements along with a `total` price. The `Receipt` class specifies this.

In [3]:
import json
from pydantic import BaseModel, Field

class Item(BaseModel):
    name: str
    price: float
    quantity: int = Field(default=1)

class Receipt(BaseModel):
    items: list[Item]
    total: float
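To see what these classes actually hand to the model, it can help to print the JSON schema they generate. The following is a minimal sketch, assuming pydantic v2 (which provides `model_json_schema` and `model_validate`); the `sample` dictionary is made-up data for illustration, not taken from the receipt.

```python
from pydantic import BaseModel, Field

class Item(BaseModel):
    name: str
    price: float
    quantity: int = Field(default=1)

class Receipt(BaseModel):
    items: list[Item]
    total: float

# The schema is a plain dict; this is what gets passed in `response_format`.
schema = Receipt.model_json_schema()
print(sorted(schema["properties"]))          # top-level fields of a receipt
print(schema["required"])                    # fields without a default
print(schema["$defs"]["Item"]["required"])   # `quantity` is optional here

# Hypothetical sample data: `quantity` is omitted, so the default applies.
sample = {"items": [{"name": "Dog Treat", "price": 2.5}], "total": 2.5}
receipt = Receipt.model_validate(sample)
print(receipt.items[0].quantity)
```

Because `quantity` has a default of 1, a line item that only lists a name and a price still validates.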

## Let's bring in the receipt that we want to extract information from

Notice that this is a real receipt with multiple portions that are not relevant to the line item extraction structure we've outlined above.

<img src="https://ocr.space/Content/Images/receipt-ocr-original.webp" height="500">

## 1. Extract Information from the Receipt

We will use the vision model (`solar-docvision` in the code below) to extract the item names, prices, quantities and the total from the receipt image.

In [36]:
from openai import OpenAI

getDescriptionPrompt = "Extract out the details from the receipt image. Identify the name, price and quantity of each item. Also specify the total. In json format"

imageUrl = "https://ocr.space/Content/Images/receipt-ocr-original.jpg"

client = OpenAI(
    api_key=os.environ["UPSTAGE_API_KEY"],
    base_url="https://api.upstage.ai/v1/solar"
)

response = client.chat.completions.create(
    model="solar-docvision",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": getDescriptionPrompt},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": imageUrl,
                    },
                },
            ],

        }
    ],
    response_format={
        "type": "json_object",
        "schema": Receipt.model_json_schema(),
    },
)

info = response.choices[0].message.content

In [37]:
print(info)

 {
  "receipt_image": None,
  "items": [
    {
      "name": "Manager Diana Earnest",
      "price": "3330",
      "quantity": "3339",
      "total": "33991"
    },
    {
      "name": "Bluebell Dr Sw",
      "price": "2331",
      "quantity": "444663",
      "total": "66300"
    },
    {
      "name": "New Phyllis Orh",
      "price": "009044",
      "quantity": "44",
      "total": "004444"
    },
    {
      "name": "Sth#02115",
      "price": "004747",
      "quantity": "003215",
      "total": "001547"
    },
    {
      "name": "Pete Toy",
      "price": "004747",
      "quantity": "7571658",
      "total": "7571658"
    },
    {
      "name": "Floppy Puppy",
      "price": "070060",
      "quantity": "3321153",
      "total": "3321153"
    },
    {
      "name": "Sssupreme Ss",
      "price": "084699",
      "quantity": "083238",
      "total": "083238"
    },
    {
      "name": "Munchy Dmeak",
      "price": "068113",
      "quantity": "087996",
      "total": "087996"
    },


Notice that the model is not perfect: it picked up header text such as the manager's name as line items and missed some of the real ones. Zero-shot extraction of data from images is hard for most models. One way to improve this is to finetune the model using [Visual Instruction Tuning](https://arxiv.org/abs/2304.08485).
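Because a reply can be malformed or incomplete, as in the output above, it is worth checking it against the schema before passing it on. The sketch below is one possible way to do that, assuming pydantic v2; `parse_receipt` is a hypothetical helper name, and both example strings are made up for illustration.

```python
from typing import Optional
from pydantic import BaseModel, Field, ValidationError

class Item(BaseModel):
    name: str
    price: float
    quantity: int = Field(default=1)

class Receipt(BaseModel):
    items: list[Item]
    total: float

def parse_receipt(raw: str) -> Optional[Receipt]:
    """Return a Receipt if `raw` is valid JSON matching the schema, else None."""
    try:
        # Parses the JSON and validates field names and types in one step.
        return Receipt.model_validate_json(raw)
    except ValidationError as err:
        print(f"Rejected model output: {err.error_count()} validation error(s)")
        return None

# Hypothetical replies: one cut off mid-list, one complete.
truncated = '{"items": [{"name": "Pete Toy", "price": 1.97},'
complete = '{"items": [{"name": "Pete Toy", "price": 1.97, "quantity": 1}], "total": 1.97}'

print(parse_receipt(truncated))  # invalid JSON, so the helper returns None
print(parse_receipt(complete))
```

A `None` result is a cheap signal that the request should be retried or the image flagged for manual review.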

## 2. Organize Information as JSON

We will use an LLM (`solar-pro` in the code below) with structured generation in JSON mode to organize the information extracted by the vision model into JSON that can be parsed.

The `Receipt` JSON schema is passed in `response_format` so that the reply follows that structure.

In [32]:
extract = client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": "The following is a detailed description of all the items, prices and quantities on a receipt. Extract out information. Only answer in JSON.",
            },
            {
                "role": "user",
                "content": info,
            },
        ],
        model="solar-pro",
        response_format={
            "type": "json_object",
            "schema": Receipt.model_json_schema(),
        },
    )

In [33]:
output = json.loads(extract.choices[0].message.content)
print(json.dumps(output, indent=2))

{
  "items": [
    {
      "name": "Walmart",
      "price": 1.97,
      "quantity": 20
    },
    {
      "name": "Manager Diana Earnest",
      "price": 1.97,
      "quantity": 20
    },
    {
      "name": "New Phyllis Dr. Sw",
      "price": 1.97,
      "quantity": 20
    },
    {
      "name": "S.T.H.# 022115",
      "price": 1.97,
      "quantity": 20
    },
    {
      "name": "Peet Toy",
      "price": 1.97,
      "quantity": 20
    },
    {
      "name": "Floppy Puppy",
      "price": 1.97,
      "quantity": 20
    },
    {
      "name": "S.S.S.Spreme S",
      "price": 1.97,
      "quantity": 20
    },
    {
      "name": "Muncho Dumebel",
      "price": 1.97,
      "quantity": 20
    },
    {
      "name": "Dog Treat",
      "price": 1.97,
      "quantity": 20
    },
    {
      "name": "Peed Poch",
      "price": 1.97,
      "quantity": 20
    },
    {
      "name": "Coupon 2 3100",
      "price": 1.97,
      "quantity": 20
    },
    {
      "name": "Hnytmd Smore",
      "

Although some line items were missed, we were able to extract structured JSON from an image in a zero-shot manner! To improve the results for your pipeline and make them production-ready, I recommend you [finetune](https://docs.together.ai/docs/fine-tuning-overview) the vision model on your own dataset!
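Before or after finetuning, a simple consistency check can flag extractions that are obviously off. The sketch below compares the sum of the line items with the reported total. The function names and sample data are hypothetical, and the check is only a rough signal, since real receipts also contain tax, coupons and discounts.

```python
import math

def line_items_sum(receipt: dict) -> float:
    """Sum price * quantity over all line items (quantity defaults to 1)."""
    return sum(item["price"] * item.get("quantity", 1) for item in receipt["items"])

def totals_match(receipt: dict, tolerance: float = 0.01) -> bool:
    """True if the line items add up to the reported total within `tolerance`."""
    return math.isclose(line_items_sum(receipt), receipt["total"], abs_tol=tolerance)

# Hypothetical extractions in the same shape as the `Receipt` schema.
consistent = {
    "items": [
        {"name": "Dog Treat", "price": 1.5, "quantity": 2},
        {"name": "Pet Toy", "price": 4.0},
    ],
    "total": 7.0,
}
suspicious = {"items": [{"name": "Dog Treat", "price": 1.5, "quantity": 2}], "total": 10.0}

print(totals_match(consistent))
print(totals_match(suspicious))
```

Receipts that fail the check are good candidates for manual review or for inclusion in a finetuning dataset.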

Learn more about how to use JSON mode in the [docs](https://docs.together.ai/docs/json-mode).