# Implementation Roadmap
## LLM as a judge
### Must-have (ship first)
- **Strict JSON schema + retry-on-fail**  
  Ensures well-formed, consistent outputs.  

- **Multi-trace CoT (k=3–5) + self-confidence**  
  Generate multiple reasoning traces; each trace provides its own confidence score.  

- **Weighted voting by self-confidence**  
  Aggregate values using confidence weights instead of plain majority.  

- **Evidence grounding (span linking)**  
  Judge must cite exact supporting text spans; missing evidence → lower confidence.  

- **Domain validators (hard/soft rules)**  
  - Dates follow expected formats  
  - VAT ID / IBAN checksum validation  
  - `Gross ≈ Net + VAT` (within tolerance)  
  - Currency whitelist checks  

- **Per-field calibration**  
  Map raw confidence to true correctness probability (e.g., isotonic or temperature scaling).  

- **Acceptance policy**  
  Apply per-field thresholds; for documents, require all critical fields to pass.  

- **Auditing & provenance**  
  Store chosen value, calibrated confidence, cited spans, validator outcomes, and reasoning traces.  
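
The validator rules above can be sketched as small pure functions. This is a minimal sketch under stated assumptions, not the project's implementation: the function names, the currency whitelist, the `dd.mm.YYYY` date format, the 0.02 tolerance, and the 0.5 soft penalty are all illustrative. The VAT ID check only tests the Austrian layout (`ATU` plus 8 digits) and does not verify a checksum; the IBAN check uses the standard mod-97 rule.

```python
import re
from datetime import datetime

ALLOWED_CURRENCIES = {"EUR", "USD", "CHF", "GBP"}  # assumption: example whitelist


def is_valid_date(value: str, fmt: str = "%d.%m.%Y") -> bool:
    """Hard rule: the date must parse in the expected format."""
    try:
        datetime.strptime(value, fmt)
    except ValueError:
        return False
    return True


def is_valid_at_vat_layout(vat_id: str) -> bool:
    """Hard rule (layout only, no checksum): 'ATU' followed by 8 digits."""
    return re.fullmatch(r"ATU\d{8}", vat_id) is not None


def is_valid_iban(iban: str) -> bool:
    """Hard rule: move the first 4 chars to the end, map A..Z to 10..35, check mod 97 == 1."""
    s = iban.replace(" ", "").upper()
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", s):
        return False
    rearranged = s[4:] + s[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def is_allowed_currency(code: str) -> bool:
    """Hard rule: the currency code must be on the whitelist."""
    return code.upper() in ALLOWED_CURRENCIES


def totals_consistent(net: float, vat: float, gross: float, tol: float = 0.02) -> bool:
    """Soft rule: Gross ≈ Net + VAT within an absolute tolerance."""
    return abs(net + vat - gross) <= tol


def apply_validators(fields: dict, confidence: float, soft_penalty: float = 0.5):
    """Return (adjusted confidence, validator outcomes) for one extracted invoice."""
    hard = {
        "date_format": is_valid_date(fields["Invoice.Date"]),
        "vat_id_layout": is_valid_at_vat_layout(fields["Sender.VatId"]),
    }
    soft = {
        "totals": totals_consistent(
            fields["Net.Amount"], fields["Vat.Amount"], fields["GrandTotal.Amount"]
        ),
    }
    if not all(hard.values()):
        confidence = 0.0            # a failed hard rule sets confidence to 0
    elif not all(soft.values()):
        confidence *= soft_penalty  # a failed soft rule only lowers confidence
    return confidence, {**hard, **soft}
```

Returning the individual outcomes alongside the adjusted confidence keeps them available for the audit record described under "Auditing & provenance".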

---

### High-impact add-ons (still no training)
- **Dual-phase reasoning**  
  1. Solution reasoning → generate answer  
  2. Confidence reasoning → judge and verbalize confidence bin (0–9 or 0–1)  

- **Difficulty-aware sampling**  
  Use vote dispersion or evidence quality to stop early on easy fields or sample more traces for hard ones.  

- **Cross-judge diversity (multiple rubrics)**  
  Run the same model with different judging rubrics and combine results:  
  - *Format-strict judge*: strict regex/format checks (e.g., dates, VAT IDs)  
  - *Arithmetic-strict judge*: check totals and sums (e.g., Gross vs. Net + VAT)  
  - *Evidence-strict judge*: require supporting spans near relevant anchors (e.g., “VAT”, “Total”)  

- **Few-shot bank (prompt-only)**  
  Maintain a curated set of vendor/layout examples; update as new reviewed cases are added.  
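
Difficulty-aware sampling could look roughly like the sketch below, which uses vote dispersion only (evidence quality is not covered). `draw` is a hypothetical callable that produces the value of one field from one fresh reasoning trace; `k_min`, `k_max`, and the agreement threshold are assumed parameters.

```python
from collections import Counter
from typing import Callable, Hashable, Tuple


def sample_until_stable(
    draw: Callable[[], Hashable],
    k_min: int = 3,
    k_max: int = 7,
    agree: float = 0.8,
) -> Tuple[Hashable, float, int]:
    """Draw traces until the leading value holds at least `agree` of the votes.

    Returns (winning value, its vote share, number of traces used).
    """
    votes = []
    while len(votes) < k_max:
        votes.append(draw())
        if len(votes) >= k_min:
            _, count = Counter(votes).most_common(1)[0]
            if count / len(votes) >= agree:
                break  # easy field: traces agree, stop sampling early
    value, count = Counter(votes).most_common(1)[0]
    return value, count / len(votes), len(votes)
```

Easy fields stop after `k_min` traces, while fields with dispersed votes use the full `k_max` budget; the returned vote share can feed into the acceptance decision.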

---

### Ops & monitoring
- **Coverage vs. accuracy dashboards**  
  Track Brier, ECE, NLL, AUROC; monitor auto-accept vs. review rates.  

- **Periodic recalibration**  
  Refresh calibration mappings as new reviewed cases arrive.  

- **Fail-safes**  
  If the JSON is invalid or the confidence is unstable, re-ask deterministically at temperature=0; if the re-ask also fails, route to review.  
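
For the recalibration step, one option is isotonic regression fitted on reviewed cases. The sketch below is a pure-Python pool-adjacent-violators fit, chosen only to keep the example dependency-free; a library implementation such as scikit-learn's `IsotonicRegression` would be a natural alternative. The function names and the step-function lookup are assumptions, and in line with per-field calibration one mapping would be fitted per field.

```python
from bisect import bisect_left
from collections import defaultdict


def fit_isotonic(confidences, outcomes):
    """Fit a non-decreasing step mapping from raw confidence to observed accuracy.

    Returns a list of (upper confidence bound, calibrated probability) pairs.
    """
    # Pool cases that share the same raw confidence value.
    agg = defaultdict(lambda: [0.0, 0])
    for x, y in zip(confidences, outcomes):
        agg[x][0] += y
        agg[x][1] += 1

    blocks = []  # each block: [sum of outcomes, count, upper confidence bound]
    for x in sorted(agg):
        s, n = agg[x]
        blocks.append([s, n, x])
        # Merge backwards while the previous block has a higher mean (a violation).
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, n, x_hi = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
            blocks[-1][2] = x_hi
    return [(b[2], b[0] / b[1]) for b in blocks]


def calibrate(mapping, confidence):
    """Look up the calibrated probability for a raw confidence value."""
    bounds = [bound for bound, _ in mapping]
    i = min(bisect_left(bounds, confidence), len(mapping) - 1)
    return mapping[i][1]
```

Refitting the mapping whenever a batch of reviewed cases arrives is cheap, since the fit is a single pass over the sorted confidence values.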


<style>
open { color: Red }
inprogress { color: Yellow }
done { color: Green; text-decoration: line-through }
</style>

# First steps
To start, we need to build a simple LLM-as-a-Judge (LaaJ) setup.<br>
For this, we need to prepare the following:
- <done>Access to the Azure-hosted GPT5-mini</done>
- <inprogress>Access to the data located on the NAS</inprogress>

Then we need to force the LLM to create multiple reasoning paths for one input.<br>
- <done>Actively encourage the LLM to perform "slow thinking" and prepare a JSON output schema that the data must fit.</done><br>
- <done>Implement a retry option (with the number of retries as a parameter) to catch LLM errors.</done><br>
- <open>Perform validity checks on the produced outputs (valid VAT ID layout, isNumber for amounts, ...). If a check fails, decrease the confidence or set it to 0.</open><br>
- <open>Query the LLM to rate its own output using numeric and written bins (10 bins, [0.0; 0.1], [0.1; 0.2], ...)</open><br>
- <open>Implement weighted majority voting based on confidence (Borda voting with exponent)</open><br>
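
The confidence-weighted vote from the last item could be sketched as follows. This is one possible reading of "with exponent": each trace votes for its value with weight `confidence ** exponent`, so a larger exponent favours confident traces more strongly. It does not implement a rank-based Borda count, and the function name and default exponent are assumptions.

```python
from collections import defaultdict


def weighted_vote(traces, exponent: float = 2.0):
    """Aggregate (value, confidence) pairs from several reasoning traces.

    Returns (winning value, its share of the total weight).
    Values must be hashable, e.g. strings or numbers.
    """
    scores = defaultdict(float)
    for value, confidence in traces:
        scores[value] += confidence ** exponent
    total = sum(scores.values())
    if total == 0:
        return None, 0.0  # no trace carried any weight
    winner = max(scores, key=scores.get)
    return winner, scores[winner] / total
```

With `exponent=0` every trace has weight 1 and the function reduces to a plain majority vote, which makes the two strategies easy to compare on the same traces.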

<open>Visualize perceived confidence against actual correctness in a bar chart (for perfect calibration, the bars lie on the x = y diagonal).</open><br>
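
The numbers behind such a chart can be computed before any plotting. The sketch below groups predictions into the 10 confidence bins and reports the mean confidence and the observed accuracy per bin, plus the expected calibration error (ECE) as a single summary value. The function names are assumptions, and `correct` is assumed to hold 0/1 labels from reviewed cases.

```python
def reliability_table(confidences, correct, n_bins: int = 10):
    """Return one row per non-empty bin: (low, high, mean confidence, accuracy, count)."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        i = min(int(c * n_bins), n_bins - 1)  # a confidence of 1.0 goes into the last bin
        bins[i].append((c, ok))

    rows = []
    for i, items in enumerate(bins):
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(ok for _, ok in items) / len(items)
        rows.append((i / n_bins, (i + 1) / n_bins, mean_conf, accuracy, len(items)))
    return rows


def expected_calibration_error(rows) -> float:
    """Count-weighted mean gap between confidence and accuracy across bins."""
    total = sum(row[4] for row in rows)
    return sum(n / total * abs(conf - acc) for _, _, conf, acc, n in rows)
```

Plotting the accuracy column against the bin midpoints gives the bar chart; any bar below the diagonal marks a bin where the model is overconfident.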


# LLM Setup

In [7]:
from datetime import datetime
import os
from pathlib import Path
from dotenv import load_dotenv
import httpx
from openai import AsyncOpenAI, RateLimitError
import logging
import json


load_dotenv(override=True)
# ---------------------------------HELPER-VARS--------------------------------
DOCUMENT_STORAGE = Path(os.getenv("DOCUMENT_STORAGE"))
TIMESTAMP = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

# ---------------------------------LLM-CONFIG---------------------------------
API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
DEPLOYMENT_NAME = os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME")

# ---------------------------------LLM-SETUP---------------------------------
client = AsyncOpenAI(
    api_key=API_KEY,
    base_url=ENDPOINT
)

CONCURRENT_TASKS = 15
REQUEST_DELAY = 1 # seconds

# Retry configuration
MAX_RETRIES = 5
RETRY_BACKOFF = 5  # seconds

# Whitelist of retryable exceptions
RETRYABLE_EXCEPTIONS = (
    RateLimitError,
    httpx.ConnectTimeout,
    httpx.ReadTimeout,
    httpx.HTTPStatusError,
    httpx.RemoteProtocolError,
    httpx.NetworkError,
)


# ---------------------------------LOGGING---------------------------------
logging_path = Path("../Logs")
# exist_ok=True already covers the case where the directory exists
logging_path.mkdir(parents=True, exist_ok=True)

logging.basicConfig(
    filename=f"{logging_path}/{DEPLOYMENT_NAME}_{TIMESTAMP}.log",
    filemode="a",                     
    format="%(asctime)s - %(levelname)s - %(message)s",
    level=logging.INFO,
    force=True
)

for noisy in ["httpx", "openai", "azure", "urllib3"]:
    logging.getLogger(noisy).setLevel(logging.WARNING)


# Load system prompt
with open("../Prompts/system_prompt.txt", "r", encoding="utf-8") as f:
    system_prompt = f.read()

# Load the schema from file
with open("../Prompts/schema.json", "r", encoding="utf-8") as f:
    schema = json.load(f)

# Turn it into pretty JSON for prompt injection
formatted_schema = json.dumps(schema, indent=2)

system_prompt = system_prompt.replace("formatted_schema_representation",formatted_schema)
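
The setup cell defines `MAX_RETRIES`, `RETRY_BACKOFF`, and `RETRYABLE_EXCEPTIONS`, but the code that consumes them is not shown. A minimal sketch of how they could be used follows; the helper name `call_with_retries`, the exponential backoff schedule, and the jitter are assumptions rather than the notebook's actual retry logic.

```python
import asyncio
import random


async def call_with_retries(
    make_call,
    retryable,
    max_retries: int = 5,
    backoff: float = 5.0,
    sleep=asyncio.sleep,
):
    """Await make_call(), retrying only on the whitelisted exception types.

    make_call: zero-argument callable returning a fresh awaitable per attempt.
    retryable: tuple of exception classes that justify another attempt.
    """
    for attempt in range(max_retries + 1):
        try:
            return await make_call()
        except retryable:
            if attempt == max_retries:
                raise  # retry budget exhausted: surface the last error
            # Exponential backoff plus up to 1 s of jitter (assumed schedule).
            await sleep(backoff * 2 ** attempt + random.uniform(0, 1))
```

Under these assumptions, a request would be wrapped as `await call_with_retries(lambda: client.responses.create(...), RETRYABLE_EXCEPTIONS, MAX_RETRIES, RETRY_BACKOFF)`. Exceptions outside the whitelist propagate immediately, so schema or programming errors are not retried silently.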

# Test Reasoning path setup

In [16]:
import asyncio
from pathlib import Path


CONFIDENCE_CLASSES = [
    "Almost no chance",     # 0.0–0.1
    "Highly unlikely",      # 0.1–0.2
    "Chances are slight",   # 0.2–0.3
    "Unlikely",             # 0.3–0.4
    "Less than even",       # 0.4–0.5
    "Better than even",     # 0.5–0.6
    "Likely",               # 0.6–0.7
    "Very good chance",     # 0.7–0.8
    "Highly likely",        # 0.8–0.9
    "Almost certain"        # 0.9–1.0
]

async def run_three_step_invoice(invoice_txt_path: str):
    text = Path(invoice_txt_path).read_text(encoding="utf-8")

    # Step 1: SOLUTION REASONING (do not output final answer)
    resp1 = await client.responses.create(
        model=DEPLOYMENT_NAME,
        instructions=system_prompt,
        input=(
            "Step 1 — SOLUTION REASONING:\n"
            "Reason step by step to extract the structured invoice data from the text below. "
            "Use slow thinking (explore alternatives, verify fields). "
            "Do NOT output the final answer yet.\n\n"
            f"--- INVOICE TEXT START ---\n{text}\n--- INVOICE TEXT END ---"
        ),
        reasoning={"effort": "low"},
        max_output_tokens=4000,
        text={"verbosity": "low"},
    )

    # Step 2: CONFIDENCE REASONING (do not output final answer)
    resp2 = await client.responses.create(
        model=DEPLOYMENT_NAME,
        previous_response_id=resp1.id,
        input=(
            "Step 2 — CONFIDENCE REASONING:\n"
            "Evaluate, step by step, how likely the extracted fields are correct, considering internal consistency "
            "and evidence in the text. Do NOT output the final answer yet."
        ),
        reasoning={"effort": "low"},
        max_output_tokens=2000,
        text={"verbosity": "low"},
    )

    # Step 3: CONFIDENCE VERBALIZATION (output final answer + one confidence class)
    classes_str = "\n".join(f"- \"{c}\"" for c in CONFIDENCE_CLASSES)
    resp3 = await client.responses.create(
        model=DEPLOYMENT_NAME,
        previous_response_id=resp2.id,
        input=(
            "Step 3 — CONFIDENCE VERBALIZATION:\n"
            "Now output ONLY the final result in this exact format:\n"
            f"**Answer**: <your final extracted invoice data using this schema: {formatted_schema}>\n"
            "**Confidence**: <one of the following classes, exactly as written>\n"
            f"{classes_str}\n"
        ),
        reasoning={"effort": "low"},
        max_output_tokens=2000,
        text={"verbosity": "low"},
    )

    # Return the text so the caller decides how to display or parse it
    return resp3.output_text

# Example:
print(await run_three_step_invoice("../Documents/273366/273277_page1_grid.txt"))


**Answer**: {'GrandTotal.Amount': 496.63, 'Invoice.Date': '08.11.2023', 'Sender.VatId': 'ATU78657801', 'Vat.Rate': 20, 'Net.Amount': 413.86, 'Vat.Amount': 82.77}
**Confidence**: "Almost certain"
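
To feed this output into voting and calibration, the answer and the verbal confidence class need to be turned into a dictionary and a number. The sketch below is one way to do that, assuming the output follows the `**Answer**` / `**Confidence**` format shown above and mapping each class to the midpoint of its bin. The function name is hypothetical, and the fallback to `ast.literal_eval` is there because the answer above uses Python-style single quotes rather than strict JSON.

```python
import ast
import json
import re

CONFIDENCE_CLASSES = [
    "Almost no chance", "Highly unlikely", "Chances are slight", "Unlikely",
    "Less than even", "Better than even", "Likely", "Very good chance",
    "Highly likely", "Almost certain",
]

OUTPUT_PATTERN = re.compile(
    r'\*\*Answer\*\*:\s*(.*?)\s*\*\*Confidence\*\*:\s*"?([^"\n]+)"?', re.S
)


def parse_judge_output(text: str):
    """Return (answer dict, confidence label, numeric confidence at the bin midpoint)."""
    match = OUTPUT_PATTERN.search(text)
    if match is None:
        raise ValueError("output does not follow the Answer/Confidence format")
    raw_answer, label = match.group(1), match.group(2).strip()
    try:
        answer = json.loads(raw_answer)        # strict JSON first
    except json.JSONDecodeError:
        answer = ast.literal_eval(raw_answer)  # fall back to a Python dict literal
    index = CONFIDENCE_CLASSES.index(label)    # raises ValueError for an unknown class
    return answer, label, (index + 0.5) / len(CONFIDENCE_CLASSES)
```

A parsing failure here is a natural trigger for the retry-on-fail path, since it signals that the model did not follow the requested format.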
