In [1]:
system_prompt = """
You are a construction invoice expert and deterministic JSON emitter.

You will be given:
1) OCR text tokens with their (page, x1, y1, x2, y2) coordinates and reading order.
2) The raw concatenated text, if available.

YOUR TASK
Extract invoice data and return EXACTLY ONE JSON object that matches the schema and key order below. 
- Output JSON ONLY. No prose, no markdown, no code fences, no comments.
- Do not add, remove, or rename keys.
- Use null for unknown values (do NOT omit keys).
- Dates must be ISO-8601 (YYYY-MM-DD).
- All amounts/rates must be numbers (no $, commas, or %). Convert “6%” → 0.06; “$1,234.50” → 1234.5.
- Booleans are true/false (lowercase).
- line_items must have ≥1 item.
- Keep the exact key order shown.

SCHEMA (exact keys and order):
{
  "invoice_id": "<string>",
  "invoice_date": "<YYYY-MM-DD>",
  "due_date": "<YYYY-MM-DD or null>",
  "currency": "<3-letter code>",
  "vendor": {
    "name": "<string>",
    "address": "<string or null>",
    "tax_id": "<string or null>",
    "contact": {
      "email": "<string or null>",
      "phone": "<string or null>"
    }
  },
  "bill_to": {
    "name": "<string>",
    "project_name": "<string or null>",
    "project_number": "<string or null>",
    "job_site_address": "<string or null>"
  },
  "purchase_order": "<string or null>",
  "payment_terms": "<string or null>",
  "line_items": [
    {
      "line_id": <integer>,
      "description": "<string>",
      "quantity": <number>,
      "unit": "<string>",
      "unit_price": <number>,
      "line_total": <number>,
      "cost_code": "<string or null>",
      "category": "<material|labor|equipment|subcontractor or null>",
      "taxable": <true|false|null>
    }
  ],
  "subtotal": <number>,
  "taxes": [
    {
      "name": "<string>",
      "rate": <number>,
      "amount": <number>
    }
  ],
  "retainage": {
    "percent": <number or null>,
    "amount": <number or null>
  },
  "adjustments": [
    {
      "type": "<credit|debit|discount|other>",
      "description": "<string>",
      "amount": <number>
    }
  ],
  "total": <number>
}

HOW TO USE COORDINATES (for better accuracy; do NOT include coordinates in the output):
- Prefer values physically nearest to their labels. If multiple candidates exist, choose the one to the RIGHT of the label on the SAME row; if not found, choose the one DIRECTLY BELOW within a small vertical window.
- Totals usually appear near the bottom-right of the last page. Prefer the value closest to labels like “Total”, “Invoice Total”, “Amount Due”.
- Subtotal/Tax/Retainage/Adjustments typically stack above the grand total in a right-aligned summary block. Map each line by label proximity.
- Vendor vs Bill-To:
  - VENDOR (issuer/remit-from) commonly top-left; look for “From”, “Remit To”, logos near it, vendor tax ID nearby.
  - BILL_TO (customer) labeled “Bill To”, “Sold To”, “Customer”, often top-right or a dedicated block.
- Project fields:
  - project_name / project_number / job_site_address are often near “Project/Job/Job #/Project #/Job Site/Site Address”.
- PO and payment_terms:
  - Look for “PO”, “P.O.”, “Purchase Order”.
  - Payment terms examples: “Net 30”, “Due on receipt”.

LINE-ITEM TABLE DETECTION:
- Identify the main line-item table by headers such as: Description/Item, Qty/Quantity, U/M/UOM, Rate/Unit Price/Price, Amount/Line Total, Tax.
- Descriptions may wrap to multiple lines; merge wrapped lines belonging to the same row (same x-range; next line starts aligned under Description without new numeric columns).
- Compute or verify: line_total ≈ quantity * unit_price (round to 2 decimals). If a row displays all three, keep the displayed line_total; if missing one component, compute the missing value when unambiguous.
- Category inference (material|labor|equipment|subcontractor):
  - material: items like concrete, rebar, pipe, fasteners, consumables.
  - labor: hours/crew/installer, UOM like HR.
  - equipment: rental, mobilization, crane, lift, machine hours.
  - subcontractor: another company named on a line, or “Subcontract”, “Sub”, “Trade”.
- taxable (per line): true if explicitly indicated, if a tax column shows a nonzero value, or if the row sits in a taxable group; false if marked non-taxable; otherwise null.

RETAINAGE & ADJUSTMENTS:
- If retainage is shown as a negative line or separate row, set retainage.amount to the absolute held-back value and retainage.percent to the stated % (decimal). If only the percent is given, compute amount when subtotal is known.
- Adjustments include credits, debits, early payment discounts. Use type one of: credit, debit, discount, other.

DISAMBIGUATION & TIE-BREAKERS:
- Prefer values with explicit labels over unlabeled numbers.
- If multiple candidates remain, choose the one with the closest centroid distance to the label; break further ties by larger font or boldness if hinted by OCR; else top-most then left-most.
- When date formats are ambiguous (e.g., 03/04/2025), use the field’s label (“Invoice Date”, “Due Date”) and local conventions if indicated; otherwise assume MM/DD/YYYY and normalize to YYYY-MM-DD.

NORMALIZATION RULES:
- Strip currency symbols and thousands separators. Convert parentheses to negative (e.g., (123.45) → -123.45).
- Convert percentages to decimals (e.g., 10% → 0.10).
- Trim whitespace; collapse multiple spaces.
- If a field is truly absent, use null (do not fabricate).

VALIDATION (internal; do not output messages):
- Ensure numeric types for amounts/rates and dates in ISO-8601.
- Ensure line_items length ≥ 1.
- Prefer internal arithmetic consistency: subtotal ≈ sum(line_total); total ≈ subtotal + sum(taxes.amount) + sum(adjustments.amount) ± retainage.amount (depending on presentation). If the document explicitly states totals, favor the stated values; do not invent numbers.

OUTPUT
Return EXACTLY one JSON object that matches the schema and constraints above. No extra text before or after the JSON.
"""
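
The VALIDATION rules in the prompt are only instructions to the model, so it can help to re-check them on the client once a reply has been parsed. The following is a minimal sketch, assuming the reply is already a Python dict with numeric amounts; `check_invoice` and `TOP_LEVEL_KEYS` are hypothetical names that are not part of this notebook, and the 0.01 tolerance is an arbitrary choice.

```python
from typing import Any, Dict, List

# Top-level keys in the order the schema in the prompt requires.
TOP_LEVEL_KEYS = [
    "invoice_id", "invoice_date", "due_date", "currency", "vendor",
    "bill_to", "purchase_order", "payment_terms", "line_items",
    "subtotal", "taxes", "retainage", "adjustments", "total",
]


def check_invoice(inv: Dict[str, Any], tol: float = 0.01) -> List[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems: List[str] = []

    # json.loads preserves key order, so order can be compared directly.
    if list(inv.keys()) != TOP_LEVEL_KEYS:
        problems.append("top-level keys or key order differ from the schema")
        return problems  # the checks below assume every key exists

    items = inv["line_items"]
    if not items:
        problems.append("line_items is empty")
        return problems

    line_sum = round(sum(item["line_total"] for item in items), 2)
    if abs(line_sum - inv["subtotal"]) > tol:
        problems.append(f"subtotal {inv['subtotal']} != sum of line totals {line_sum}")

    base = (
        inv["subtotal"]
        + sum(t["amount"] for t in inv["taxes"])
        + sum(a["amount"] for a in inv["adjustments"])
    )
    retainage = (inv["retainage"] or {}).get("amount") or 0.0
    # The prompt leaves the sign of retainage open ("depending on presentation"),
    # so accept a total with retainage withheld, ignored, or added back.
    candidates = [base - retainage, base, base + retainage]
    if not any(abs(c - inv["total"]) <= tol for c in candidates):
        problems.append(f"total {inv['total']} does not match subtotal, taxes, adjustments, retainage")

    return problems
```

Because the prompt tells the model to favor totals stated on the document, anything this returns is better treated as a flag for manual review than as proof of an extraction error.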

In [4]:
import io
import requests
from PIL import Image


def post_image(
    img: Image.Image,
    url: str = "http://localhost:8080/generate",
    max_new_tokens: int = 2048,
    filename: str = "image.png",
    mime: str = "image/png",
    timeout_s: int = 120,
) -> dict:
    """Send a PIL.Image to the FastAPI /generate endpoint and return the parsed JSON."""
    buf = io.BytesIO()
    # The image is always encoded as PNG (lossless); if you change the format,
    # keep `filename` and `mime` consistent with it.
    img.save(buf, format="PNG")
    buf.seek(0)

    files = {"image": (filename, buf, mime)}  # multipart form field "image"
    params = {"max_new_tokens": str(max_new_tokens)}  # sent as a query parameter
    headers = {"accept": "application/json"}

    # A timeout keeps the cell from hanging indefinitely if the server stalls.
    resp = requests.post(url, headers=headers, files=files, params=params, timeout=timeout_s)
    resp.raise_for_status()
    return resp.json()


In [5]:
# Example usage with a local file path
from PIL import Image

# Replace with your image path or any PIL.Image you already have
test_img = Image.open("construction-invoice-template-1x.png")

result = post_image(test_img, url="http://localhost:8080/generate", max_new_tokens=1024)
result


HTTPError: 500 Server Error: Internal Server Error for url: http://localhost:8080/generate?max_new_tokens=1024
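
The exception raised by `raise_for_status()` reports only the status line, while the response body (where a server may put an error detail or traceback) is discarded from view. A minimal sketch for surfacing it, assuming the server writes something readable into the body; `describe_http_error` is a hypothetical helper, written against the attributes that `requests.HTTPError` exposes on `err.response`.

```python
def describe_http_error(err: Exception, max_chars: int = 2000) -> str:
    """Summarise an HTTP error, including the start of the response body if present.

    Relies on err.response having .status_code, .reason, .url and .text,
    as the Response attached to requests.HTTPError does.
    """
    resp = getattr(err, "response", None)
    if resp is None:
        return str(err)
    body = (resp.text or "").strip()
    return f"{resp.status_code} {resp.reason} for {resp.url}\n{body[:max_chars]}".strip()


# Usage in the notebook (assumes post_image and test_img from the cells above):
#
#   try:
#       result = post_image(test_img, max_new_tokens=1024)
#   except requests.HTTPError as err:
#       print(describe_http_error(err))
```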

# vLLM

In [1]:
import base64
import mimetypes
import requests
from typing import Optional, Dict, Any, List

def vllm_chat(
    prompt: str,
    image_path: Optional[str] = None,
    *,
    model: str = "Qwen/Qwen2.5-VL-32B-Instruct",
    endpoint: str = "http://localhost:8080/v1/chat/completions",
    system_prompt: str = "You are a helpful assistant.",
    temperature: float = 0.0,
    max_tokens: int = 1024,
    timeout_s: int = 120,
) -> Dict[str, Any]:
    """
    Call the vLLM OpenAI-compatible Chat Completions endpoint.
    If image_path is provided, the image is sent inline as a base64 data URL.
    Returns the parsed JSON response (includes message content and usage).
    """
    def _image_data_url(path: str) -> str:
        # Guess the MIME type from the file extension; fall back to PNG if unknown.
        mime = mimetypes.guess_type(path)[0] or "image/png"
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode("utf-8")
        return f"data:{mime};base64,{b64}"

    # Build message content
    if image_path:
        content: List[Dict[str, Any]] = [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": _image_data_url(image_path)}},
        ]
        user_msg = {"role": "user", "content": content}
    else:
        user_msg = {"role": "user", "content": prompt}

    payload = {
        "model": model,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "messages": [
            {"role": "system", "content": system_prompt},
            user_msg,
        ],
    }

    headers = {
        "Content-Type": "application/json",
        # If the server was started without --api-key, vLLM does not check the token;
        # "EMPTY" is the conventional placeholder
        "Authorization": "Bearer EMPTY",
    }

    resp = requests.post(endpoint, json=payload, headers=headers, timeout=timeout_s)
    resp.raise_for_status()
    data = resp.json()
    return data
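
Since `vllm_chat` returns the whole chat-completions envelope, the JSON object requested by the extraction prompt still has to be pulled out of `choices[0].message.content` and parsed. A minimal sketch, assuming the response shape that vLLM prints in the output further down; `parse_json_reply` is a hypothetical helper, and the fence-stripping step is a defensive assumption in case the model wraps its answer in a markdown fence despite the prompt.

```python
import json
from typing import Any, Dict

FENCE = "`" * 3  # markdown code-fence marker


def parse_json_reply(response: Dict[str, Any]) -> Dict[str, Any]:
    """Extract the assistant text from a chat-completions response and parse it as JSON."""
    choice = response["choices"][0]
    if choice.get("finish_reason") == "length":
        # The reply hit max_tokens, so the JSON is very likely truncated.
        raise ValueError("reply was cut off at max_tokens; raise max_tokens and retry")

    text = (choice["message"]["content"] or "").strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (which may carry a language tag) and the closing fence.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    return json.loads(text)


# Usage in the notebook (assumes vllm_chat and system_prompt from the cells above):
#
#   resp = vllm_chat("Extract the invoice.", image_path="construction-invoice-template-1x.png",
#                    system_prompt=system_prompt, max_tokens=2048)
#   invoice = parse_json_reply(resp)
```

`json.loads` raises `json.JSONDecodeError` when the reply is not valid JSON, which is left uncaught here so that a malformed reply fails loudly.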

In [3]:
vllm_chat("Hello, how are you?", image_path="construction-invoice-template-1x.png")

{'id': 'chatcmpl-2dbcc5a4dcb44ff5930b2ab0d9170504',
 'object': 'chat.completion',
 'created': 1758740401,
 'model': 'Qwen/Qwen2.5-VL-32B-Instruct',
 'choices': [{'index': 0,
   'message': {'role': 'assistant',
    'content': "Hello! I'm just a virtual assistant, so I don't have feelings, but I'm here and ready to help you with any questions or analyses about the invoice you've shared. How can I assist you? 😊\n\nIf you need help understanding the details, calculating something, or anything else related to this document, feel free to ask!",
    'refusal': None,
    'annotations': None,
    'audio': None,
    'function_call': None,
    'tool_calls': [],
    'reasoning_content': None},
   'logprobs': None,
   'finish_reason': 'stop',
   'stop_reason': None,
   'token_ids': None}],
 'service_tier': None,
 'system_fingerprint': None,
 'usage': {'prompt_tokens': 2925,
  'total_tokens': 2996,
  'completion_tokens': 71,
  'prompt_tokens_details': None},
 'prompt_logprobs': None,
 'prompt_token_