# Implementation Roadmap
## LLM as a judge
### Must-have (ship first)
- **Strict JSON schema + retry-on-fail**  
  Ensures well-formed, consistent outputs.  

- **Multi-trace CoT (k=3–5) + self-confidence**  
  Generate multiple reasoning traces; each trace provides its own confidence score.  

- **Weighted voting by self-confidence**  
  Aggregate values using confidence weights instead of plain majority.  

- **Evidence grounding (span linking)**  
  Judge must cite exact supporting text spans; missing evidence → lower confidence.  

- **Domain validators (hard/soft rules)**  
  - Dates follow expected formats  
  - VAT ID / IBAN checksum validation  
  - `Gross ≈ Net + VAT` (within tolerance)  
  - Currency whitelist checks  

- **Per-field calibration**  
  Map raw confidence to true correctness probability (e.g., isotonic or temperature scaling).  

- **Acceptance policy**  
  Apply per-field thresholds; for documents, require all critical fields to pass.  

- **Auditing & provenance**  
  Store chosen value, calibrated confidence, cited spans, validator outcomes, and reasoning traces.  
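
The domain validators listed above can be sketched in plain Python. This is a minimal illustration, not the project's implementation: the function names, the one-cent tolerance, and the `soft_penalty` factor are assumptions. The IBAN check uses the standard mod-97 rule and does not verify country-specific lengths.

```python
import re


def amounts_consistent(gross: float, net: float, vat: float, tol: float = 0.01) -> bool:
    """Soft rule: Gross ≈ Net + VAT within an absolute tolerance (assumed: one cent)."""
    return abs(gross - (net + vat)) <= tol


def iban_checksum_ok(iban: str) -> bool:
    """Hard rule: mod-97 checksum; country-specific length is not checked here."""
    s = re.sub(r"\s+", "", iban).upper()
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{1,30}", s):
        return False
    rearranged = s[4:] + s[:4]  # move country code and check digits to the end
    digits = "".join(str(int(ch, 36)) for ch in rearranged)  # A=10 ... Z=35
    return int(digits) % 97 == 1


def adjust_confidence(confidence: float, hard_ok: bool, soft_ok: bool,
                      soft_penalty: float = 0.5) -> float:
    """Hard failure sets confidence to 0; soft failure scales it down (assumed factor)."""
    if not hard_ok:
        return 0.0
    return confidence if soft_ok else confidence * soft_penalty


# Values taken from the sample extraction shown later in this notebook.
print(amounts_consistent(gross=496.63, net=413.86, vat=82.77))
print(iban_checksum_ok("DE89 3704 0044 0532 0130 00"))
```

Separating hard rules (checksum) from soft rules (arithmetic tolerance) keeps the "decrease vs. set to 0" decision in one place.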

---

### High-impact add-ons (still no training)
- **Dual-phase reasoning**  
  1. Solution reasoning → generate answer  
  2. Confidence reasoning → judge and verbalize confidence bin (0–9 or 0–1)  

- **Difficulty-aware sampling**  
  Use vote dispersion or evidence quality to stop early on easy fields or sample more traces for hard ones.  

- **Cross-judge diversity (multiple rubrics)**  
  Run the same model with different judging rubrics and combine results:  
  - *Format-strict judge*: strict regex/format checks (e.g., dates, VAT IDs)  
  - *Arithmetic-strict judge*: check totals and sums (e.g., Gross vs. Net + VAT)  
  - *Evidence-strict judge*: require supporting spans near relevant anchors (e.g., “VAT”, “Total”)  

- **Few-shot bank (prompt-only)**  
  Maintain a curated set of vendor/layout examples; update as new reviewed cases are added.  
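
Difficulty-aware sampling, as described above, could look roughly like the following sketch. It uses vote dispersion only (not evidence quality). The names `draw_trace`, `k_min`, `k_max`, and `min_share` are hypothetical; `draw_trace` stands in for one LLM call that returns a candidate value for a field.

```python
from collections import Counter
from typing import Callable, Hashable, List


def top_vote_share(values: List[Hashable]) -> float:
    """Fraction of traces that agree with the most common value."""
    counts = Counter(values)
    return counts.most_common(1)[0][1] / len(values)


def sample_adaptively(draw_trace: Callable[[], Hashable],
                      k_min: int = 3, k_max: int = 5,
                      min_share: float = 0.8) -> List[Hashable]:
    """Draw at least k_min traces; stop early once agreement is high enough."""
    values: List[Hashable] = []
    while len(values) < k_max:
        values.append(draw_trace())
        if len(values) >= k_min and top_vote_share(values) >= min_share:
            break  # easy field: the traces already agree
    return values


# Stand-in samplers for illustration; a real one would call the model.
easy = iter(["496.63"] * 5)
hard = iter(["496.63", "49.63", "496.63", "496.63", "496.63"])
print(len(sample_adaptively(lambda: next(easy))))
print(len(sample_adaptively(lambda: next(hard))))
```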

---

### Ops & monitoring
- **Coverage vs. accuracy dashboards**  
  Track Brier, ECE, NLL, AUROC; monitor auto-accept vs. review rates.  

- **Periodic recalibration**  
  Refresh calibration mappings as new reviewed cases arrive.  

- **Fail-safes**  
  If the JSON is invalid or the confidence is unstable → deterministic re-ask at temperature=0; if that also fails, route to review.  
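
Two of the dashboard quantities above are simple enough to sketch directly: the Brier score and the coverage/accuracy trade-off at a given auto-accept threshold. The function names and the sample numbers are illustrative assumptions, not measured results.

```python
from typing import List, Optional, Sequence, Tuple


def brier_score(confs: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean squared gap between confidence and the 0/1 correctness label."""
    return sum((p - float(ok)) ** 2 for p, ok in zip(confs, correct)) / len(confs)


def coverage_and_accuracy(confs: Sequence[float], correct: Sequence[bool],
                          threshold: float) -> Tuple[float, Optional[float]]:
    """Coverage = share auto-accepted; accuracy is measured on accepted items only."""
    accepted: List[bool] = [ok for p, ok in zip(confs, correct) if p >= threshold]
    coverage = len(accepted) / len(confs)
    accuracy = sum(accepted) / len(accepted) if accepted else None
    return coverage, accuracy


# Made-up example data: four fields with calibrated confidence and review outcome.
confs = [0.95, 0.90, 0.60, 0.30]
correct = [True, True, False, True]
print(brier_score(confs, correct))
print(coverage_and_accuracy(confs, correct, threshold=0.8))
```

Everything below the threshold goes to review, so the review rate is `1 - coverage`.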


<style>
open { color: Red }
inprogress { color: Yellow }
done { color: Green; text-decoration: line-through }
</style>

# First steps
To start things off, we need to build a simple LLM-as-a-Judge (LaaJ) setup.<br>
For this, we need to prepare the following:
- <done>Access to the Azure-hosted GPT-5 mini</done>
- <inprogress>Access to the data located on the NAS</inprogress>

Then we need to force the LLM to create multiple reasoning paths for one input.<br>
- <done>Actively encourage the LLM to perform "slow thinking" and prepare a JSON output schema that the data must fit.</done><br>
- <done>Implement a retry option (with the number of retries as a parameter) to catch LLM errors.</done><br>
- <open>Perform validity checks on the produced outputs (valid VAT ID layout, isNumber for amounts, ...). If these checks fail → decrease the confidence or set it to 0.</open><br>
- <open>Query the LLM to rate its own output using numeric and written bins (10 bins, [0.0; 0.1], [0.1; 0.2], ...)</open><br>
- <open>Implement weighted majority voting based on confidence (Borda voting with exponent)</open><br>
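
One possible reading of the confidence-weighted vote with an exponent is sketched below: each trace contributes `confidence ** exponent` to its value's score. This is an assumption about the intended scheme, not a literal Borda count; `weighted_vote` and its default exponent are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple


def weighted_vote(traces: List[Tuple[Hashable, float]],
                  exponent: float = 2.0) -> Tuple[Hashable, float]:
    """Return the winning value and its share of the total weighted score.

    Each trace is (value, self_confidence in [0, 1]). An exponent > 1 makes
    confident traces count disproportionately more than hesitant ones.
    """
    if not traces:
        raise ValueError("at least one trace is required")
    scores: Dict[Hashable, float] = defaultdict(float)
    for value, conf in traces:
        scores[value] += conf ** exponent
    total = sum(scores.values())
    winner = max(scores, key=lambda v: scores[v])
    share = scores[winner] / total if total > 0 else 0.0
    return winner, share


# Two hesitant traces for "A" lose to one confident trace for "B".
print(weighted_vote([("A", 0.3), ("A", 0.3), ("B", 0.9)]))
```

With `exponent=0` every trace weighs the same, which reduces to plain majority voting.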

<open>Visualize the results of perceived confidence and actual correctness in a bar chart (should be x=y for perfect prediction)</open><br>
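
The data behind such a chart can be computed without a plotting library. The sketch below groups predictions into the ten confidence bins mentioned above and reports mean confidence against observed accuracy per bin, plus the resulting ECE. Function names are hypothetical, and the sample data is made up for illustration.

```python
from typing import List, Sequence, Tuple

# One row per non-empty bin: (bin lower edge, mean confidence, accuracy, count)
Row = Tuple[float, float, float, int]


def reliability_bins(confs: Sequence[float], correct: Sequence[bool],
                     n_bins: int = 10) -> List[Row]:
    sums = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins
    for c, ok in zip(confs, correct):
        i = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 falls into the last bin
        sums[i] += c
        hits[i] += int(ok)
        counts[i] += 1
    return [(i / n_bins, sums[i] / counts[i], hits[i] / counts[i], counts[i])
            for i in range(n_bins) if counts[i]]


def expected_calibration_error(rows: List[Row]) -> float:
    """Count-weighted mean gap between confidence and accuracy over the bins."""
    n = sum(r[3] for r in rows)
    return sum(r[3] / n * abs(r[1] - r[2]) for r in rows)


rows = reliability_bins([0.9, 0.9, 0.1, 0.1], [True, True, False, False])
for lower, mean_conf, acc, count in rows:
    print(f"bin {lower:.1f}: confidence={mean_conf:.2f} accuracy={acc:.2f} n={count}")
print(expected_calibration_error(rows))
```

Plotting `accuracy` against `mean confidence` per row gives the bar chart; perfectly calibrated output lies on the diagonal.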


# LLM Setup

In [21]:
from datetime import datetime
import os
from pathlib import Path
from dotenv import load_dotenv
import httpx
from openai import AsyncOpenAI, RateLimitError
import logging
import json


load_dotenv(override = True)
# ---------------------------------HELPER-VARS--------------------------------
DOCUMENT_STORAGE = Path(os.environ["DOCUMENT_STORAGE"])  # KeyError names the missing variable; Path(None) would raise an opaque TypeError
TIMESTAMP = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

# ---------------------------------LLM-CONFIG---------------------------------
API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
DEPLOYMENT_NAME = os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME")

# ---------------------------------LLM-SETUP---------------------------------
client = AsyncOpenAI(
    api_key=API_KEY,
    base_url=ENDPOINT
)

CONCCURRENT_TASKS = 15
REQUEST_DELAY = 1 # seconds

# Retry configuration
MAX_RETRIES = 5
RETRY_BACKOFF = 5  # seconds

# Whitelist of retryable exceptions
RETRYABLE_EXCEPTIONS = (
    RateLimitError,
    httpx.ConnectTimeout,
    httpx.ReadTimeout,
    httpx.HTTPStatusError,
    httpx.RemoteProtocolError,
    httpx.NetworkError,
)


# ---------------------------------LOGGING---------------------------------
logging_path = Path("../Logs")
logging_path.mkdir(parents=True, exist_ok=True)  # no-op if the directory already exists

logging.basicConfig(
    filename=f"{logging_path}/{DEPLOYMENT_NAME}_{TIMESTAMP}.log",
    filemode="a",                     
    format="%(asctime)s - %(levelname)s - %(message)s",
    level=logging.INFO,
    force=True
)

for noisy in ["httpx", "openai", "azure", "urllib3"]:
    logging.getLogger(noisy).setLevel(logging.WARNING)


# Load system prompt
with open("../Prompts/system_prompt.txt", "r", encoding="utf-8") as f:
    system_prompt = f.read()

# Load the schema from file
with open("../Prompts/schema.json", "r", encoding="utf-8") as f:
    schema = json.load(f)

# Turn it into pretty JSON for prompt injection
formatted_schema = json.dumps(schema, indent=2)

system_prompt = system_prompt.replace("formatted_schema_representation",formatted_schema)
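
Injecting the schema into the prompt does not guarantee that the reply conforms to it, so the reply still has to be checked before use. The following is a minimal stdlib-only sketch of such a check with a re-ask loop. `EXPECTED_TYPES` is an assumed subset of fields chosen for illustration (the real contract lives in `schema.json`), and `ask` is a hypothetical callable standing in for one model call.

```python
import json
from typing import Any, Callable, Dict, Optional

# Assumed field -> type mapping, for illustration only.
EXPECTED_TYPES = {
    "GrandTotal.Amount": (int, float),
    "Invoice.Date": str,
    "Sender.VatId": str,
}


def parse_strict(raw: str) -> Optional[Dict[str, Any]]:
    """Return the parsed object only if keys and value types match exactly."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != set(EXPECTED_TYPES):
        return None
    for key, expected in EXPECTED_TYPES.items():
        value = data[key]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, expected):
            return None
    return data


def ask_until_valid(ask: Callable[[], str], max_attempts: int = 3) -> Optional[Dict[str, Any]]:
    """Re-ask up to max_attempts times; None means the caller should route to review."""
    for _ in range(max_attempts):
        parsed = parse_strict(ask())
        if parsed is not None:
            return parsed
    return None


# Stand-in replies: the first is malformed, the second is well-formed.
replies = iter([
    "Sure! Here is the JSON you asked for",
    '{"GrandTotal.Amount": 496.63, "Invoice.Date": "08.11.2023", "Sender.VatId": "ATU78657801"}',
])
print(ask_until_valid(lambda: next(replies)))
```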

# Test Reasoning path setup

In [None]:
# --- GPT-5 via Responses API: system prompt + hardcoded candidate file ---

import asyncio
from pathlib import Path
from typing import Optional, Tuple

# 1) Hardcode the candidate file path
CANDIDATE_TEXT_PATH = Path(
    "../Documents/273366/273277_page1_grid.txt"
)  # <-- adjust as needed


def _read_candidate_text(p: Path) -> str:
    if not p.exists():
        raise FileNotFoundError(f"Candidate file not found: {p}")
    return p.read_text(encoding="utf-8").strip()


async def _responses_create_with_retry(
    instructions: str,
    user_text: str,
    *,
    reasoning_effort: str = "minimal",  # GPT-5 supports 'minimal' | 'low' | 'medium' | 'high'
    verbosity: str = "low",  # GPT-5 supports 'low' | 'medium' | 'high'
    max_output_tokens: int = 4000,
) -> Tuple[str, Optional[str]]:
    """
    Calls the Azure Responses API (GPT-5) with retries.
    Returns (assistant_text, reasoning_summary_text_or_none).
    """
    last_err: Optional[Exception] = None

    for attempt in range(1, MAX_RETRIES + 1):
        try:
            print(DEPLOYMENT_NAME)
            resp = await client.responses.create(
                model=DEPLOYMENT_NAME,  # <- your GPT-5 (or gpt-5-mini) *deployment name*
                instructions=instructions,  # <- system prompt goes here in Responses API
                input=user_text,  # <- user content (the candidate text)
                max_output_tokens=max_output_tokens,
                # Reasoning-specific controls for GPT-5:
                reasoning={
                    "effort": reasoning_effort,
                    "summary": "auto",  # request a reasoning summary when available
                },
                # GPT-5 output verbosity control:
                text={"verbosity": verbosity},
            )

            # Primary text
            out_text = getattr(resp, "output_text", None)
            if not out_text:
                # Fallback: concatenate all output_text parts if needed
                parts = []
                for item in resp.output or []:
                    if getattr(item, "type", "") == "message":
                        for c in getattr(item, "content", []) or []:
                            if getattr(c, "type", "") in (
                                "output_text",
                                "text",
                            ) and getattr(c, "text", None):
                                parts.append(c.text)
                out_text = "\n".join(parts).strip() if parts else ""

            # Optional: extract a short reasoning summary if present
            reasoning_summary = None
            for item in resp.output or []:
                if getattr(item, "type", "") == "reasoning":
                    # item.summary may be a list of structured segments; join any text fields
                    segs = []
                    for seg in getattr(item, "summary", []) or []:
                        t = getattr(seg, "text", None)
                        if t:
                            segs.append(t)
                    if segs:
                        reasoning_summary = "\n".join(segs).strip()
                        break

            logging.info(
                "Responses API success on attempt %d | tokens: %s",
                attempt,
                getattr(resp, "usage", None),
            )
            return out_text, reasoning_summary

        except RETRYABLE_EXCEPTIONS as e:
            last_err = e
            logging.warning(
                "Responses API retry %d/%d due to %s", attempt, MAX_RETRIES, repr(e)
            )
            await asyncio.sleep(RETRY_BACKOFF * attempt)

    # If we exhausted retries, raise the last error
    raise RuntimeError(
        f"Responses API failed after {MAX_RETRIES} attempts"
    ) from last_err


# 2) One-shot helper you can call from a notebook cell
async def run_gpt5_on_candidate(
    reasoning_effort: str = "minimal",  # GPT-5 supports 'minimal' | 'low' | 'medium' | 'high'
    verbosity: str = "low",
) -> None:
    candidate_text = _read_candidate_text(CANDIDATE_TEXT_PATH)

    # You can tweak reasoning_effort/verbosity per run:
    assistant_text, reasoning_summary = await _responses_create_with_retry(
        instructions=system_prompt,
        user_text=candidate_text,
        reasoning_effort=reasoning_effort,  # pass the caller's setting; 'minimal' is fastest, raise to 'medium'/'high' for tougher tasks
        verbosity=verbosity,  # pass the caller's setting instead of a literal string
        max_output_tokens=8000,
    )

    print("=== ASSISTANT OUTPUT ===")
    print(assistant_text.strip())

    if reasoning_summary:
        print("\n=== REASONING SUMMARY (if provided) ===")
        print(reasoning_summary.strip())


# 3) Kick it off (uncomment in notebooks or call from your event loop)

In [None]:
await run_gpt5_on_candidate(reasoning_effort="low", verbosity="low")


gpt-5-mini
=== ASSISTANT OUTPUT ===
{
  "GrandTotal.Amount": 496.63,
  "Invoice.Date": "08.11.2023",
  "Sender.VatId": "ATU78657801",
  "Vat.Rate": 20,
  "Net.Amount": 413.86,
  "Vat.Amount": 82.77
}

=== REASONING SUMMARY (if provided) ===
**Extracting invoice details**

I need to carefully follow the developer's instructions to extract specific invoice details from the text document. I must adhere to the specified JSON schema and not alter any fields. Since the document contains the term “Rechnung” and invoice number “RE073891” with the date “vom 08.11.2023," I’ll fill the fields based on the text. The grand total is listed as "ENDBETRAG: 496,63 EUR," so GrandTotal.Amount should be numeric 496.63, considering formatting.
**Processing invoice amounts**

I need to keep the number of decimal places consistent with the input and ensure not to truncate any numbers. For "Net.Amount" and "Vat.Amount," I should return numbers without currency symbols. However, JSON numeric format requires a 