# Implementation Roadmap
## LLM-as-a-Judge
### Must-have (ship first)
- **Strict JSON schema + retry-on-fail**  
  Ensures well-formed, consistent outputs.  

- **Multi-trace CoT (k=3–5) + self-confidence**  
  Generate multiple reasoning traces; each trace provides its own confidence score.  

- **Weighted voting by self-confidence**  
  Aggregate values using confidence weights instead of plain majority.  

- **Evidence grounding (span linking)**  
  Judge must cite exact supporting text spans; missing evidence → lower confidence.  

- **Domain validators (hard/soft rules)**  
  - Dates follow expected formats  
  - VAT ID / IBAN checksum validation  
  - `Gross ≈ Net + VAT` (within tolerance)  
  - Currency whitelist checks  

- **Per-field calibration**  
  Map raw confidence to true correctness probability (e.g., isotonic or temperature scaling).  

- **Acceptance policy**  
  Apply per-field thresholds; for documents, require all critical fields to pass.  

- **Auditing & provenance**  
  Store chosen value, calibrated confidence, cited spans, validator outcomes, and reasoning traces.  
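
The domain validators listed above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function names, the `0.02` tolerance, the `dd.mm.yyyy` date format, and the Austrian-only VAT ID pattern are all assumptions, and the field names are taken from the sample output further down in this notebook.

```python
import re
from datetime import datetime
from typing import Any, Dict

# Assumption: only the Austrian layout ("ATU" + 8 digits) is checked here.
AT_VAT_ID = re.compile(r"ATU\d{8}")


def iban_is_valid(iban: str) -> bool:
    """Mod-97 check: move the first four characters to the end, map letters
    to numbers (A=10 ... Z=35), and require the remainder mod 97 to be 1."""
    s = re.sub(r"\s+", "", iban).upper()
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", s):
        return False
    rearranged = s[4:] + s[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def date_is_valid(value: str, fmt: str = "%d.%m.%Y") -> bool:
    """True if the value parses with the expected date format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def totals_are_consistent(net: float, vat: float, gross: float, tol: float = 0.02) -> bool:
    """Gross ≈ Net + VAT, within an absolute tolerance."""
    return abs(net + vat - gross) <= tol


def run_validators(fields: Dict[str, Any]) -> Dict[str, bool]:
    """Apply the checks to one extracted document (field names are assumptions)."""
    return {
        "date_format": date_is_valid(str(fields.get("Invoice.Date", ""))),
        "vat_id_layout": bool(AT_VAT_ID.fullmatch(str(fields.get("Sender.VatId", "")))),
        "totals": totals_are_consistent(
            fields["Net.Amount"], fields["Vat.Amount"], fields["GrandTotal.Amount"]
        ),
    }
```

One possible split into hard and soft rules: a failed checksum or layout check sets the field's confidence to 0, while a failed tolerance check only lowers it.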

---

### High-impact add-ons (still no training)
- **Dual-phase reasoning**  
  1. Solution reasoning → generate answer  
  2. Confidence reasoning → judge the answer and verbalize a confidence bin (bin index 0–9 or a value in 0–1)  

- **Difficulty-aware sampling**  
  Use vote dispersion or evidence quality to stop early on easy fields or sample more traces for hard ones.  

- **Cross-judge diversity (multiple rubrics)**  
  Run the same model with different judging rubrics and combine results:  
  - *Format-strict judge*: strict regex/format checks (e.g., dates, VAT IDs)  
  - *Arithmetic-strict judge*: check totals and sums (e.g., Gross vs. Net + VAT)  
  - *Evidence-strict judge*: require supporting spans near relevant anchors (e.g., “VAT”, “Total”)  

- **Few-shot bank (prompt-only)**  
  Maintain a curated set of vendor/layout examples; update as new reviewed cases are added.  
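
Difficulty-aware sampling based on vote dispersion might look like the sketch below. It assumes a hypothetical `draw` callable that returns one extracted value per call (in practice, one LLM trace for a single field); the thresholds are placeholders, not tuned values.

```python
from collections import Counter
from typing import Callable, Hashable, List, Tuple


def sample_until_stable(
    draw: Callable[[], Hashable],
    min_samples: int = 3,
    max_samples: int = 7,
    agreement: float = 0.8,
) -> Tuple[Hashable, List[Hashable]]:
    """Draw values until the most common one holds at least `agreement`
    of the votes, or until `max_samples` is reached."""
    votes: List[Hashable] = []
    while len(votes) < max_samples:
        votes.append(draw())
        if len(votes) >= min_samples:
            _, top_count = Counter(votes).most_common(1)[0]
            if top_count / len(votes) >= agreement:
                break  # easy field: the traces already agree
    winner = Counter(votes).most_common(1)[0][0]
    return winner, votes
```

An easy field stops after `min_samples` identical votes, while a contested field keeps sampling up to `max_samples`.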

---

### Ops & monitoring
- **Coverage vs. accuracy dashboards**  
  Track Brier, ECE, NLL, AUROC; monitor auto-accept vs. review rates.  

- **Periodic recalibration**  
  Refresh calibration mappings as new reviewed cases arrive.  

- **Fail-safes**  
  If the JSON is invalid or the confidence is unstable → re-ask deterministically at temperature=0; if that also fails, route to review.  
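
Two of the dashboard metrics, the Brier score and ECE, can be computed from a list of confidences and correctness labels. The sketch below uses equal-width bins; the function names and the default of 10 bins are assumptions chosen to match the 10 confidence bins used elsewhere in this notebook.

```python
from typing import List, Sequence, Tuple


def brier_score(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared difference between confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(conf, correct)) / len(conf)


def expected_calibration_error(
    conf: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for c, y in zip(conf, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes to the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / len(conf) * abs(avg_conf - accuracy)
    return ece
```

The per-bin `avg_conf` and `accuracy` pairs are also the data points for a reliability chart, where a perfectly calibrated judge lies on the diagonal.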


<style>
open { color: Red }
inprogress { color: Yellow }
done { color: Green; text-decoration: line-through }
</style>

# First steps
To start things off, we need to build a simple LLM-as-a-Judge (LaaJ) setup.<br>
For this, we need to prepare the following:
- <open>Access to the Azure-hosted GPT5-mini</open>
- <open>Access to the data located on the NAS</open>

Then we need to force the LLM to create multiple reasoning paths for one input.<br>
- <open>Actively encourage the LLM to perform "slow thinking" and prepare a JSON output schema that the output must conform to.</open><br>
- <open>Implement a retry option (with the number of retries as a parameter) to catch LLM errors.</open><br>
- <open>Perform validity checks on the produced outputs (valid VAT ID layout, isNumber for amounts, ...). If these checks fail, decrease the confidence or set it to 0.</open><br>
- <open>Query the LLM to rate its own output using numeric and written bins (10 bins, [0.0; 0.1], [0.1; 0.2], ...)</open><br>
- <open>Implement weighted majority voting based on confidence (Borda voting with exponent)</open><br>

<open>Visualize the results of perceived confidence and actual correctness in a bar chart (should be x=y for perfect prediction)</open><br>
<open><b>Bonus:</b> Investigate run stability with a given seed and compare results</open>
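
A minimal sketch of confidence-weighted voting with an exponent is given below. Note that this is plain power weighting of the self-reported confidence, not a full Borda count over ranked alternatives; the function names, the bin-midpoint mapping, and the default exponent of 2 are assumptions for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def bin_to_confidence(bin_index: int, n_bins: int = 10) -> float:
    """Map a verbalized bin (0..n_bins-1) to the midpoint of its interval."""
    return (bin_index + 0.5) / n_bins


def weighted_vote(
    candidates: List[Tuple[Any, float]], exponent: float = 2.0
) -> Tuple[Any, float]:
    """Sum confidence**exponent per distinct value and return the winner
    together with its share of the total weight."""
    scores: Dict[Any, float] = defaultdict(float)
    for value, confidence in candidates:
        scores[value] += confidence ** exponent
    total = sum(scores.values())
    if total == 0:
        raise ValueError("no candidates, or all confidences are zero")
    winner = max(scores, key=scores.get)
    return winner, scores[winner] / total
```

With `exponent=0` every trace counts equally, which reduces to plain majority voting; larger exponents let a single confident trace outweigh several unsure ones.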


# LLM Setup

In [6]:
from datetime import datetime
import os
from pathlib import Path
from dotenv import load_dotenv
import httpx
from openai import AsyncAzureOpenAI, RateLimitError
import logging
import json


load_dotenv(override = True)
# ---------------------------------HELPER-VARS--------------------------------
document_storage = Path(os.getenv("DOCUMENT_STORAGE"))
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

# ---------------------------------LLM-CONFIG---------------------------------
api_key = os.getenv("AZURE_OPENAI_API_KEY")
endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
deployment_name = os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME")

# ---------------------------------LLM-SETUP---------------------------------
client = AsyncAzureOpenAI(
    api_key=api_key,
    api_version="2024-10-01-preview",
    azure_endpoint=endpoint
)

concurrent_tasks = 15
delay_between_requests = 1 # seconds

# Retry configuration
MAX_RETRIES = 5
RETRY_BACKOFF = 5  # seconds

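# Whitelist of retryable exceptions.
# The openai client wraps httpx transport errors in its own exception types,
# so those are listed alongside the raw httpx ones.
from openai import APIConnectionError, APITimeoutError, InternalServerError

RETRYABLE_EXCEPTIONS = (
    RateLimitError,
    APIConnectionError,
    APITimeoutError,
    InternalServerError,
    httpx.ConnectTimeout,
    httpx.ReadTimeout,
    httpx.HTTPStatusError,
    httpx.RemoteProtocolError,
    httpx.NetworkError,
)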


# ---------------------------------LOGGING---------------------------------
logging_path = Path("../Logs")
logging_path.mkdir(parents=True, exist_ok=True)

logging.basicConfig(
    filename=f"{logging_path}/{deployment_name}_{timestamp}.log",
    filemode="a",                     
    format="%(asctime)s - %(levelname)s - %(message)s",
    level=logging.INFO,
    force=True
)

for noisy in ["httpx", "openai", "azure", "urllib3"]:
    logging.getLogger(noisy).setLevel(logging.WARNING)


# Load system prompt
with open("../Prompts/system_prompt.txt", "r", encoding="utf-8") as f:
    system_prompt = f.read()

# Load the schema from file
with open("../Prompts/schema.json", "r", encoding="utf-8") as f:
    schema = json.load(f)

# Turn it into pretty JSON for prompt injection
formatted_schema = json.dumps(schema, indent=2)

system_prompt = system_prompt.replace("formatted_schema_representation",formatted_schema)
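
Once the schema is part of the prompt, each parsed reply still needs to be checked against it before it is trusted. The sketch below is a stdlib-only stand-in for a real JSON Schema validator: the `EXPECTED_TYPES` table is hypothetical (the actual definitions live in `../Prompts/schema.json`, which is not shown here) and uses the field names that appear in the sample output of this notebook.

```python
from typing import Any, Dict, List

# Hypothetical field types; the real ones are defined in schema.json.
EXPECTED_TYPES: Dict[str, tuple] = {
    "GrandTotal.Amount": (int, float),
    "Invoice.Date": (str,),
    "Sender.VatId": (str,),
    "Vat.Rate": (int, float),
    "Net.Amount": (int, float),
    "Vat.Amount": (int, float),
}


def schema_violations(reply: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the
    reply has exactly the expected fields with the expected types."""
    problems: List[str] = []
    for field, types in EXPECTED_TYPES.items():
        if field not in reply:
            problems.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly
        elif isinstance(reply[field], bool) or not isinstance(reply[field], types):
            problems.append(f"wrong type for {field}: {type(reply[field]).__name__}")
    for field in sorted(reply.keys() - EXPECTED_TYPES.keys()):
        problems.append(f"unexpected field: {field}")
    return problems
```

A non-empty result could trigger the retry path, and after the last retry route the document to review.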

# Test Reasoning path setup

In [7]:
# --- LLM-as-a-Judge: Load one file and produce 3 reasoning paths (seed-based) ---

import asyncio
from pathlib import Path
from typing import Dict, Any, List

# ====== INPUT: set your file path here ======
INVOICE_TEXT_PATH = Path("../Documents/273366/273277_page1_grid.txt")  # <-- replace with your file
# ============================================

invoice_text = INVOICE_TEXT_PATH.read_text(encoding="utf-8")

# Seeds to drive variation in reasoning paths
SEEDS = [42, 1337, 2025, 9001, 123456]


async def _chat_once(seed: int) -> Dict[str, Any]:
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": invoice_text},
    ]

    for attempt in range(1, MAX_RETRIES + 1):
        try:
            resp = await client.chat.completions.create(
                model=deployment_name,
                messages=messages,
                # temperature stays at its default of 1; it may not be changed for GPT5-mini.
                response_format={"type": "json_object"},
                seed=seed,
            )
            content = resp.choices[0].message.content
            return json.loads(content)
        except (*RETRYABLE_EXCEPTIONS, json.JSONDecodeError) as e:
            # Malformed JSON is retried as well (retry-on-fail).
            logging.warning(f"[ReasoningPath] Attempt {attempt}/{MAX_RETRIES} failed: {e}")
            if attempt >= MAX_RETRIES:
                raise
            await asyncio.sleep(RETRY_BACKOFF * attempt)

async def generate_reasoning_paths(n_paths: int = 3) -> List[Dict[str, Any]]:
    tasks = []
    for i in range(n_paths):
        seed = SEEDS[i % len(SEEDS)]
        tasks.append(_chat_once(seed))
    return await asyncio.gather(*tasks)

# Run and print paths
paths = await generate_reasoning_paths(3)

for i, p in enumerate(paths, 1):
    print(f"\n=== Reasoning Path {i} ===")
    print(json.dumps(p, indent=2, ensure_ascii=False))



=== Reasoning Path 1 ===
{
  "GrandTotal.Amount": 496.63,
  "Invoice.Date": "08.11.2023",
  "Sender.VatId": "ATU78657801",
  "Vat.Rate": 20,
  "Net.Amount": 413.86,
  "Vat.Amount": 82.77
}

=== Reasoning Path 2 ===
{
  "GrandTotal.Amount": 496.63,
  "Invoice.Date": "08.11.2023",
  "Sender.VatId": "ATU78657801",
  "Vat.Rate": 20,
  "Net.Amount": 413.86,
  "Vat.Amount": 82.77
}

=== Reasoning Path 3 ===
{
  "GrandTotal.Amount": 496.63,
  "Invoice.Date": "08.11.2023",
  "Sender.VatId": "ATU78657801",
  "Vat.Rate": 20,
  "Net.Amount": 413.86,
  "Vat.Amount": 82.77
}
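
To compare runs across seeds, the per-field agreement between paths can be measured directly. The helper below is a sketch with an assumed name (`field_agreement`); it reports, for each field, the share of paths that returned the most common value.

```python
import json
from collections import Counter
from typing import Any, Dict, List


def field_agreement(paths: List[Dict[str, Any]]) -> Dict[str, float]:
    """For each field, the fraction of paths that returned the most common
    value. Values are serialized so that lists or dicts can be counted too;
    a field missing from a path counts as null."""
    fields = sorted({key for path in paths for key in path})
    agreement: Dict[str, float] = {}
    for field in fields:
        values = [json.dumps(path.get(field), sort_keys=True) for path in paths]
        top_count = Counter(values).most_common(1)[0][1]
        agreement[field] = top_count / len(paths)
    return agreement
```

Calling `field_agreement(paths)` on the three paths printed above would give 1.0 for every field, since all three outputs are identical. For this document the different seeds produced no variation, so vote dispersion carries no confidence signal here.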
