## R&D layoutlmv3-finetuned-invoice

In this notebook, we evaluate `Theivaprakasham/layoutlmv3-finetuned-invoice`, a LayoutLMv3 checkpoint from Hugging Face that was fine-tuned on invoice data. If its accuracy on our invoices is satisfactory, we will integrate it into our existing solution.

In [11]:
import json
from PIL import Image
import torch
from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification

In [13]:
json_path = "../notebook/invoice_output/image_sample_invoice_res.json"
with open(json_path, "r") as f:
    data = json.load(f)

In [14]:
ocr_res = data["overall_ocr_res"]
words = ocr_res["rec_texts"]
boxes = ocr_res["rec_boxes"]

# Image dimensions from JSON
width = data["width"]  # 680
height = data["height"]  # 800
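
Before normalizing, it can help to confirm that the OCR payload has the shape the later cells assume. The sketch below is an illustration only: `check_ocr_payload` is a hypothetical helper, and it assumes each entry of `rec_boxes` is an axis-aligned `[x0, y0, x1, y1]` box in pixels, as the unpacking in the next cell implies. The inputs are toy values, not data from the invoice JSON.

```python
def check_ocr_payload(words, boxes, width, height):
    """Return a list of human-readable problems found in the OCR payload."""
    problems = []
    if len(words) != len(boxes):
        problems.append(f"{len(words)} words but {len(boxes)} boxes")
    for i, box in enumerate(boxes):
        if len(box) != 4:
            problems.append(f"box {i} has {len(box)} values, expected 4")
            continue
        x0, y0, x1, y1 = box
        if x0 > x1 or y0 > y1:
            problems.append(f"box {i} is not ordered as [x0, y0, x1, y1]: {list(box)}")
        if min(x0, y0) < 0 or x1 > width or y1 > height:
            problems.append(f"box {i} lies outside the {width}x{height} page: {list(box)}")
    return problems


# Toy data: the second box overshoots the right edge of a 680x800 page
sample_problems = check_ocr_payload(
    words=["INVOICE", "Total"],
    boxes=[[10, 10, 120, 40], [600, 700, 690, 730]],
    width=680,
    height=800,
)
print(sample_problems)
```

An empty list means the payload passed every check; anything else is worth inspecting before the boxes reach the model.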

In [15]:
# LayoutLMv3 expects boxes as [x0, y0, x1, y1] scaled to 0-1000 relative to the page size
normalized_boxes = []
for box in boxes:
    x0, y0, x1, y1 = box
    norm_box = [
        int(round(x0 / width * 1000)),
        int(round(y0 / height * 1000)),
        int(round(x1 / width * 1000)),
        int(round(y1 / height * 1000))
    ]
    normalized_boxes.append(norm_box)
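
LayoutLMv3 expects every coordinate to fall within 0-1000, and values outside that range can cause an indexing error in the position embeddings. OCR boxes occasionally overshoot the page by a pixel or two, so a clamped variant of the normalization may be safer. The sketch below is one way to write it; `normalize_box` is a hypothetical helper and the boxes are toy values.

```python
def normalize_box(box, width, height, scale=1000):
    """Scale a pixel-space [x0, y0, x1, y1] box to the 0..scale range, clamped."""
    x0, y0, x1, y1 = box

    def to_scale(value, size):
        # Clamp so that boxes slightly outside the page stay in range
        return max(0, min(scale, int(round(value / size * scale))))

    return [to_scale(x0, width), to_scale(y0, height),
            to_scale(x1, width), to_scale(y1, height)]


# Toy boxes on a 680x800 page; the second one overshoots the right edge
print(normalize_box([68, 80, 340, 400], 680, 800))
print(normalize_box([600, 700, 690, 720], 680, 800))
```

In the notebook this could replace the loop body, for example `normalized_boxes = [normalize_box(b, width, height) for b in boxes]`.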

In [17]:
image_path = "../data/image_sample_invoice.pdf.png"
image = Image.open(image_path).convert("RGB")

In [18]:
model_name = "Theivaprakasham/layoutlmv3-finetuned-invoice"
processor = LayoutLMv3Processor.from_pretrained(model_name)
model = LayoutLMv3ForTokenClassification.from_pretrained(model_name)

LayoutLMv3ForTokenClassification LOAD REPORT from: Theivaprakasham/layoutlmv3-finetuned-invoice
Key                                | Status     | Details
-----------------------------------+------------+--------
layoutlmv3.embeddings.position_ids | UNEXPECTED |        

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


In [19]:
encoding = processor(
    image, words, boxes=normalized_boxes,
    max_length=512, padding="max_length", truncation=True, return_tensors="pt",
)
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**encoding)

In [20]:
# One prediction per token position (512 after padding), not per OCR word
logits = outputs.logits
predictions = logits.argmax(-1).squeeze().tolist()

In [21]:
id2label = model.config.id2label
labels = [id2label[pred] for pred in predictions]
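
One thing worth checking before grouping: `logits` holds one prediction per token, not per OCR word. With `padding="max_length"` the sequence has 512 positions that include special, sub-word, and padding tokens, so position `i` in `labels` does not necessarily describe `words[i]`. A minimal sketch of one common alignment, keeping the prediction of each word's first token, is shown below. It assumes the processor wraps a fast tokenizer, so that `encoding.word_ids(batch_index=0)` returns one word index (or `None`) per token. `first_token_labels` is a hypothetical helper, and the toy inputs stand in for real model output.

```python
def first_token_labels(word_ids, token_labels, num_words, default="O"):
    """Pick, for each word, the label predicted for its first token.

    word_ids     -- one entry per token: the index of the source word,
                    or None for special and padding tokens
    token_labels -- one predicted label per token, same length as word_ids
    num_words    -- number of OCR words; words lost to truncation get `default`
    """
    word_labels = [default] * num_words
    seen = set()
    for word_id, label in zip(word_ids, token_labels):
        if word_id is None or word_id in seen:
            continue  # skip special/padding tokens and non-first sub-word tokens
        seen.add(word_id)
        word_labels[word_id] = label
    return word_labels


# Toy example: <s>, two tokens of word 0, one token of word 1, </s>, <pad>
toy_word_ids = [None, 0, 0, 1, None, None]
toy_token_labels = ["O", "B-INVOICE_NUMBER", "O", "B-TOTAL", "O", "O"]
print(first_token_labels(toy_word_ids, toy_token_labels, num_words=2))
```

In the notebook this would be called as `first_token_labels(encoding.word_ids(batch_index=0), labels, len(words))`, and the result zipped with `words` in the grouping step in place of `labels`.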

In [22]:
# Post-process: group BIO-tagged words into entities
entities = {}
current_entity = None
current_text = []

def flush_current():
    """Store the entity being built (if any) under its field name, without the B-/I- prefix."""
    if current_entity:
        entities.setdefault(current_entity[2:], []).append(" ".join(current_text).strip())

for word, label in zip(words, labels):
    if label.startswith("I-") and current_entity and label[2:] == current_entity[2:]:
        # Continuation of the entity being built
        current_text.append(word)
        continue
    # "O", a new "B-" tag, or an unexpected "I-" tag all close the open entity
    flush_current()
    if label.startswith("B-"):
        current_entity, current_text = label, [word]
    else:
        current_entity, current_text = None, []

# Don't forget the last one
flush_current()

# Map our required fields to the model's actual label names
field_to_label = {
    "Vendor Name": "BILLER",
    "Invoice Number": "INVOICE_NUMBER",
    "Invoice Date": "INVOICE_DATE",
    "Tax Amount": "GST",
    "Total Amount": "TOTAL",
    "Sub Total": "SUBTOTAL",
}
# Joining an empty list yields "", so undetected fields come out blank
extracted = {name: " ".join(entities.get(tag, [])) for name, tag in field_to_label.items()}

# Print results
print("Extracted Fields:")
for key, value in extracted.items():
    print(f"- {key}: {value.strip()}")

print("\nAll Detected Entities (debug):")
for field, texts in entities.items():
    print(f"{field}: {texts}")

Extracted Fields:
- Vendor Name: 123 Business St, Cityville,CA90210 info@abctraders.com INVOICE Invoice No:INV-223 Quantity
- Invoice Number: 
- Invoice Date: 
- Tax Amount: 
- Total Amount: 
- Sub Total: 

All Detected Entities (debug):
BILLER: ['123 Business St,', 'Cityville,CA90210', 'info@abctraders.com', 'INVOICE', 'Invoice No:INV-223', 'Quantity']
BILLER_POST_CODE: ['2', '425.00']


Using our invoice images and the words and boxes from the PaddleOCR JSON output, we tested a LayoutLMv3 checkpoint that had already been fine-tuned for invoice extraction. It did not perform well on our invoice format: on the sample above, five of the six required fields came back empty, and the Vendor Name field was filled with address, email, and header text.

To extract the required key information accurately, we need to fine-tune the LayoutLMv3 base model on our own invoice data.
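
As a starting point for that fine-tuning, the label set can be derived from the six required fields. The sketch below is only an assumption about how the labels might be laid out (a BIO scheme, which is what the post-processing above expects). The field names are hypothetical and would need to match whatever annotation format we adopt.

```python
# Hypothetical field names, derived from the six fields this notebook extracts
FIELDS = [
    "VENDOR_NAME", "INVOICE_NUMBER", "INVOICE_DATE",
    "TAX_AMOUNT", "TOTAL_AMOUNT", "SUB_TOTAL",
]

# BIO scheme: "O" for non-entity words, then a B- and an I- tag per field
LABELS = ["O"] + [f"{prefix}-{field}" for field in FIELDS for prefix in ("B", "I")]
LABEL2ID = {label: idx for idx, label in enumerate(LABELS)}
ID2LABEL = {idx: label for label, idx in LABEL2ID.items()}

print(len(LABELS))
print(LABELS[:3])
```

These mappings could then be passed as `id2label` and `label2id` when loading the base checkpoint with `LayoutLMv3ForTokenClassification.from_pretrained`.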