# Implementation Roadmap
## LLM as a judge
### Must-have (ship first)
- **Strict JSON schema + retry-on-fail**  
  Ensures well-formed, consistent outputs.  

- **Multi-trace CoT (k=3–5) + self-confidence**  
  Generate multiple reasoning traces; each trace provides its own confidence score.  

- **Weighted voting by self-confidence**  
  Aggregate values using confidence weights instead of plain majority.  

- **Evidence grounding (span linking)**  
  Judge must cite exact supporting text spans; missing evidence → lower confidence.  

- **Domain validators (hard/soft rules)**  
  - Dates follow expected formats  
  - VAT ID / IBAN checksum validation  
  - `Gross ≈ Net + VAT` (within tolerance)  
  - Currency whitelist checks  

- **Per-field calibration**  
  Map raw confidence to the empirical probability of being correct (e.g., isotonic regression or temperature scaling).  

- **Acceptance policy**  
  Apply per-field thresholds; for documents, require all critical fields to pass.  

- **Auditing & provenance**  
  Store chosen value, calibrated confidence, cited spans, validator outcomes, and reasoning traces.  

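The domain validators listed above could be sketched as follows. This is a minimal illustration, not the project's implementation: the function names, the absolute tolerance of 0.02, and the soft-failure penalty factor of 0.5 are all assumptions chosen for the example. The IBAN check follows the standard mod-97 procedure; a hard-rule failure sets the confidence to 0 and a soft-rule failure only scales it down.

```python
from decimal import Decimal


def iban_checksum_ok(iban: str) -> bool:
    """Hard rule: mod-97 checksum test on an IBAN string."""
    s = iban.replace(" ", "").upper()
    if not (15 <= len(s) <= 34) or not (s.isascii() and s.isalnum()):
        return False
    rearranged = s[4:] + s[:4]  # move country code + check digits to the end
    digits = "".join(str(int(ch, 36)) for ch in rearranged)  # A=10 ... Z=35
    return int(digits) % 97 == 1


def gross_matches(net, vat, gross, tolerance=Decimal("0.02")) -> bool:
    """Soft rule: Gross ≈ Net + VAT within an absolute tolerance (assumed value)."""
    net, vat, gross = (Decimal(str(x)) for x in (net, vat, gross))
    return abs(gross - (net + vat)) <= tolerance


def adjust_confidence(confidence: float, hard_ok: bool, soft_ok: bool,
                      soft_penalty: float = 0.5) -> float:
    """Hard failure sets confidence to 0; soft failure scales it down (assumed factor)."""
    if not hard_ok:
        return 0.0
    return confidence if soft_ok else confidence * soft_penalty
```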
---

### High-impact add-ons (still no training)
- **Dual-phase reasoning**  
  1. Solution reasoning → generate answer  
  2. Confidence reasoning → judge and verbalize confidence bin (0–9 or 0–1)  

- **Difficulty-aware sampling**  
  Use vote dispersion or evidence quality to stop early on easy fields or sample more traces for hard ones.  

- **Cross-judge diversity (multiple rubrics)**  
  Run the same model with different judging rubrics and combine results:  
  - *Format-strict judge*: strict regex/format checks (e.g., dates, VAT IDs)  
  - *Arithmetic-strict judge*: check totals and sums (e.g., Gross vs. Net + VAT)  
  - *Evidence-strict judge*: require supporting spans near relevant anchors (e.g., “VAT”, “Total”)  

- **Few-shot bank (prompt-only)**  
  Maintain a curated set of vendor/layout examples; update as new reviewed cases are added.  

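Difficulty-aware sampling, as described above, might look like the sketch below. It uses vote dispersion only: after a minimum number of traces it stops as soon as the leading value reaches an agreement share, and otherwise keeps sampling up to a maximum. The helper name `sample_until_stable`, the `sample_fn` callable (one call stands in for one reasoning trace returning a field value), and the 0.8 agreement threshold are assumptions for illustration.

```python
from collections import Counter
from typing import Callable, Hashable, Tuple


def sample_until_stable(sample_fn: Callable[[], Hashable],
                        min_k: int = 3, max_k: int = 5,
                        agree: float = 0.8) -> Tuple[Hashable, float, int]:
    """Draw traces until the top value holds at least `agree` of the votes.

    Returns (winning value, its vote share, number of traces drawn).
    """
    votes = []
    while len(votes) < max_k:
        votes.append(sample_fn())  # one additional reasoning trace
        if len(votes) >= min_k:
            _, count = Counter(votes).most_common(1)[0]
            if count / len(votes) >= agree:
                break  # easy field: traces agree, stop early
    value, count = Counter(votes).most_common(1)[0]
    return value, count / len(votes), len(votes)
```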
---

### Ops & monitoring
- **Coverage vs. accuracy dashboards**  
  Track Brier score, expected calibration error (ECE), negative log-likelihood (NLL), and AUROC; monitor auto-accept vs. review rates.  

- **Periodic recalibration**  
  Refresh calibration mappings as new reviewed cases arrive.  

- **Fail-safes**  
  If the JSON is invalid or the confidence is unstable → re-ask deterministically at temperature=0; if that also fails → route to review.  

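Two of the monitoring metrics named above, the Brier score and ECE, can be computed from reviewed cases with a few lines of plain Python. The sketch below assumes equal-width confidence bins (ten by default) and binary correctness labels (1 = correct, 0 = wrong); the function names are illustrative.

```python
from typing import Sequence


def brier_score(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(conf, correct)) / len(conf)


def expected_calibration_error(conf: Sequence[float], correct: Sequence[int],
                               n_bins: int = 10) -> float:
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(conf, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(y for _, y in members) / len(members)
            ece += len(members) / len(conf) * abs(avg_conf - accuracy)
    return ece
```

Both values are 0 for a perfectly calibrated and perfectly confident judge; lower is better in either case.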

<style>
open { color: Red }
inprogress { color: Yellow }
done { color: Green; text-decoration: line-through }
</style>

# First steps
To start things off, we need to build a simple LLM-as-a-Judge (LaaJ) setup.<br>
For this, we need to prepare the following:
- <done>Access to the Azure hosted GPT5-mini</done>
- <inprogress>Access to the data located on the NAS</inprogress>

Then we need to force the LLM to create multiple reasoning paths for one input.<br>
- <done>Actively encourage the LLM to perform "slow thinking" and prepare a JSON output schema that the extracted data must conform to.</done><br>
- <done>Implement a retry option (with the number of retries as a parameter) to catch LLM errors.</done><br>
- <open>Perform validity checks on the produced outputs (valid VAT ID layout, isNumber for amounts, ...). If a check fails, decrease the confidence or set it to 0.</open><br>
- <open>Query the LLM to rate its own output using numeric and written bins (10 bins: [0.0, 0.1), [0.1, 0.2), ..., [0.9, 1.0])</open><br>
- <open>Implement weighted majority voting based on confidence (Borda voting with exponent)</open><br>

<open>Visualize perceived confidence against actual correctness in a bar chart (perfect calibration lies on the diagonal x = y)</open><br>

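Confidence-weighted voting over several traces could be sketched as below. Note that this is a simplification and not a full Borda count: each trace simply contributes its self-reported confidence raised to an exponent, and the value with the largest total wins. The function name `weighted_vote` and the default exponent of 2.0 are assumptions; an exponent of 0 reduces the scheme to plain majority voting.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Optional, Tuple


def weighted_vote(candidates: Iterable[Tuple[Hashable, float]],
                  exponent: float = 2.0) -> Tuple[Optional[Hashable], float]:
    """Pick the value with the highest sum of confidence**exponent.

    candidates: (value, confidence in [0, 1]) pairs, one per reasoning trace.
    Returns (winning value, its share of the total weight).
    """
    scores = defaultdict(float)
    for value, conf in candidates:
        clipped = max(0.0, min(1.0, conf))  # guard against out-of-range scores
        scores[value] += clipped ** exponent
    total = sum(scores.values())
    if total == 0:
        return None, 0.0  # no traces, or every trace reported zero confidence
    winner = max(scores, key=scores.get)
    return winner, scores[winner] / total
```

A larger exponent lets a single high-confidence trace outweigh several low-confidence ones, which is the intended difference from plain majority voting.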

# LLM Setup

In [34]:
from datetime import datetime
import os
from pathlib import Path
from dotenv import load_dotenv
import httpx
from openai import AsyncOpenAI, RateLimitError
import logging
import json


load_dotenv(override=True)
# ---------------------------------HELPER-VARS--------------------------------
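# os.environ[...] raises a KeyError naming the variable if it is missing from the environment/.env
DOCUMENT_STORAGE = Path(os.environ["DOCUMENT_STORAGE"])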
TIMESTAMP = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

# ---------------------------------LLM-CONFIG---------------------------------
API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
DEPLOYMENT_NAME = os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME")

# ---------------------------------LLM-SETUP---------------------------------
client = AsyncOpenAI(
    api_key=API_KEY,
    base_url=ENDPOINT
)

CONCCURRENT_TASKS = 15
REQUEST_DELAY = 1 # seconds

# Retry configuration
MAX_RETRIES = 5
RETRY_BACKOFF = 5  # seconds

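# Whitelist of retryable exceptions.
# The OpenAI SDK wraps transport failures in its own exception types, so those
# are listed alongside the raw httpx errors.
from openai import APIConnectionError, APITimeoutError, InternalServerError

RETRYABLE_EXCEPTIONS = (
    RateLimitError,
    APIConnectionError,
    APITimeoutError,
    InternalServerError,
    httpx.ConnectTimeout,
    httpx.ReadTimeout,
    httpx.HTTPStatusError,
    httpx.RemoteProtocolError,
    httpx.NetworkError,
)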


# ---------------------------------LOGGING---------------------------------
logging_path = Path("../Logs")
logging_path.mkdir(parents=True, exist_ok=True)  # no-op if the directory already exists

logging.basicConfig(
    filename=f"{logging_path}/{DEPLOYMENT_NAME}_{TIMESTAMP}.log",
    filemode="a",                     
    format="%(asctime)s - %(levelname)s - %(message)s",
    level=logging.INFO,
    force=True
)

for noisy in ["httpx", "openai", "azure", "urllib3"]:
    logging.getLogger(noisy).setLevel(logging.WARNING)


# Load system prompt
with open("../Prompts/system_prompt.txt", "r", encoding="utf-8") as f:
    system_prompt = f.read()

# Load the schema from file
with open("../Prompts/schema.json", "r", encoding="utf-8") as f:
    schema = json.load(f)

# Turn it into pretty JSON for prompt injection
formatted_schema = json.dumps(schema, indent=2)

# Substitute the placeholder token in the prompt file with the formatted schema
system_prompt = system_prompt.replace("formatted_schema_representation", formatted_schema)

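The retry settings defined above (`MAX_RETRIES`, `RETRY_BACKOFF`, `RETRYABLE_EXCEPTIONS`) are not yet used by any call shown in this notebook. A minimal sketch of a wrapper that would apply them is given below; the helper name `call_with_retry`, the `make_call` factory argument, and the linear backoff schedule are assumptions, not existing project code.

```python
import asyncio
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


async def call_with_retry(make_call: Callable[[], Awaitable[T]],
                          max_retries: int = 5,
                          backoff: float = 5.0,
                          retryable: Tuple[Type[BaseException], ...] = (TimeoutError,)) -> T:
    """Await make_call(), retrying whitelisted exceptions with linear backoff.

    make_call must build a fresh awaitable on every attempt, because a
    coroutine object cannot be awaited twice.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return await make_call()
        except retryable:
            if attempt == max_retries:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(backoff * attempt)  # wait longer after each failure
    raise ValueError("max_retries must be at least 1")
```

In the notebook this could be used as `await call_with_retry(lambda: client.responses.create(...), MAX_RETRIES, RETRY_BACKOFF, RETRYABLE_EXCEPTIONS)`. Exceptions outside the whitelist propagate immediately without a retry.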
# Test Reasoning path setup

In [35]:
import asyncio
from pathlib import Path
import json
from datetime import datetime

# --- Verbal classes (kept) ---
CONFIDENCE_CLASSES = [
    "Almost no chance",
    "Highly unlikely",
    "Chances are slight",
    "Unlikely",
    "Less than even",
    "Better than even",
    "Likely",
    "Very good chance",
    "Highly likely",
    "Almost certain"
]

# --- Numeric→verbal mapping via bins (inclusive lower bounds) ---
# [0.00,0.10) -> Almost no chance, [0.10,0.20) -> Highly unlikely, ..., [0.90,1.00] -> Almost certain
_SCORE_BOUNDS = [0.00, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90]
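from typing import Optional

def score_to_verbal(score: float) -> Optional[str]:
    # Reject missing, non-numeric, boolean, and NaN scores (NaN != NaN)
    if isinstance(score, bool) or not isinstance(score, (int, float)) or score != score:
        return None
    s = max(0.0, min(1.0, float(score)))  # clamp into [0, 1]
    idx = max(i for i, b in enumerate(_SCORE_BOUNDS) if s >= b)  # choose the last bound ≤ s
    return CONFIDENCE_CLASSES[idx]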

def _extract_reasoning_text(resp) -> str:
    """Collect reasoning summaries/content from a response; fall back to its plain output text."""
    chunks = []
    for item in getattr(resp, "output", []) or []:
        if getattr(item, "type", "") == "reasoning":
            summary = getattr(item, "summary", None)
            if isinstance(summary, list):
                for seg in summary:
                    t = getattr(seg, "text", None)
                    if t:
                        chunks.append(t)
            for seg in getattr(item, "content", []) or []:
                t = getattr(seg, "text", None)
                if t:
                    chunks.append(t)
    if chunks:
        return "\n".join(chunks).strip()
    return getattr(resp, "output_text", "") or ""

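def _coerce_answer_dict(ans: dict) -> dict:
    """Unwrap answers returned as a schema-like wrapper ({"properties": ..., "values": {...}}).

    Any other dict is returned unchanged; non-dict input yields {}.
    """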
    if isinstance(ans, dict) and "properties" in ans and "values" in ans:
        v = ans.get("values")
        return v if isinstance(v, dict) else ans
    return ans if isinstance(ans, dict) else {}

def _postprocess_final_payload(payload: dict) -> dict:
    """
    Expecting payload shape:
    {
      "answer": {...},                       # dict conforming to your schema
      "confidence": {"field.path": float}    # 0..1 per field that appears in 'answer'
    }
    Returns lean record with numeric + verbal per field.
    """
    answer = _coerce_answer_dict(payload.get("answer", {}))
    conf_num = payload.get("confidence", {}) or {}

    # Build verbal per field for fields present in numeric dict
    conf_verbal = {k: score_to_verbal(v) for k, v in conf_num.items()}

    return {
        "answer": answer,
        "confidence_numeric": conf_num,
        "confidence_verbal": conf_verbal,
    }

async def run_three_step_invoice_with_reasoning(invoice_txt_path: str, output_dir: Path):
    text = Path(invoice_txt_path).read_text(encoding="utf-8")

    # Step 1 — SOLUTION REASONING
    resp1 = await client.responses.create(
        model=DEPLOYMENT_NAME,
        instructions=system_prompt,
        input=(
            "Step 1 — SOLUTION REASONING:\n"
            "Reason step by step to extract the structured invoice data from the text below. "
            "Use slow thinking (explore alternatives, verify fields). "
            "Do NOT output the final answer yet.\n\n"
            f"--- INVOICE TEXT START ---\n{text}\n--- INVOICE TEXT END ---"
        ),
        reasoning={"effort": "low"},
        max_output_tokens=4000,
        text={"verbosity": "low"},
    )

    # Step 2 — CONFIDENCE REASONING
    resp2 = await client.responses.create(
        model=DEPLOYMENT_NAME,
        previous_response_id=resp1.id,
        input=(
            "Step 2 — CONFIDENCE REASONING:\n"
            "Evaluate, step by step, how likely each extracted field is correct, based on internal consistency "
            "and evidence in the text. Do NOT output the final answer yet."
        ),
        reasoning={"effort": "low"},
        max_output_tokens=2000,
        text={"verbosity": "low"},
    )

    # Step 3 — FINAL JSON WITH PER-FIELD NUMERIC CONFIDENCE
    resp3 = await client.responses.create(
        model=DEPLOYMENT_NAME,
        previous_response_id=resp2.id,
        input=(
            "Step 3 — FINAL OUTPUT:\n"
            "Return ONLY a single valid JSON object with this structure and nothing else:\n"
            "{\n"
            '  "answer": <your final extracted invoice data strictly following this schema: '
            f"{json.dumps(schema)}"
            ">,\n"
            '  "confidence": {\n'
            "    // For every field you output in 'answer', provide a numeric confidence in [0,1].\n"
            '    // Use the exact same field keys as in "answer". For nested fields, use dot-notation keys.\n'
            '    // Example: "Net.Amount": 0.82\n'
            "  }\n"
            "}\n"
            "Do not include comments in the actual JSON. Do not include any extra keys. No markdown."
        ),
        reasoning={"effort": "low"},
        max_output_tokens=2000,
        text={"verbosity": "low"},
    )

    # Parse strict JSON
    try:
        payload = json.loads(resp3.output_text)
    except json.JSONDecodeError:
        # Fallback: keep only the outermost {...} span, which drops code fences or stray text
        raw = resp3.output_text
        start, end = raw.find("{"), raw.rfind("}")
        if start == -1 or end <= start:
            raise  # nothing that looks like a JSON object; surface the original error
        payload = json.loads(raw[start : end + 1])

    processed = _postprocess_final_payload(payload)

    lean_record = {
        "input_path": str(invoice_txt_path),
        "answer": processed["answer"],
        "confidence_numeric": processed["confidence_numeric"],
        "confidence_verbal": processed["confidence_verbal"],
        "reasoning": {
            "step1": _extract_reasoning_text(resp1),
            "step2": _extract_reasoning_text(resp2),
        },
    }

    out_path = output_dir / (Path(invoice_txt_path).stem + ".json")
    out_path.write_text(json.dumps(lean_record, ensure_ascii=False, indent=2), encoding="utf-8")
    return lean_record

async def run_many_invoices(invoice_txt_paths):
    run_ts = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    output_dir = Path(f"../Output/{run_ts}")
    output_dir.mkdir(parents=True, exist_ok=True)

    sem = asyncio.Semaphore(CONCCURRENT_TASKS)

    async def _bounded(path):
        async with sem:
            return await run_three_step_invoice_with_reasoning(path, output_dir)

    tasks = [asyncio.create_task(_bounded(p)) for p in invoice_txt_paths]
    results = await asyncio.gather(*tasks)

    (output_dir / "_index.json").write_text(json.dumps(results, ensure_ascii=False, indent=2), encoding="utf-8")
    return {"output_dir": str(output_dir), "results": results}

# Example:
paths = ["../Documents/273366/273277_page1_grid.txt"]
out = await run_many_invoices(paths)

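The step 3 prompt asks the model to key its confidences with dot-notation paths that mirror the fields in `answer`. A small check like the one below could verify that the two sides actually line up before the record is stored. This is a sketch under assumptions: the helper names are illustrative, lists are treated as leaf values, and every field name in the usage note other than `Net.Amount` is made up for the example.

```python
from typing import Dict, List, Tuple


def flatten_keys(obj: dict, prefix: str = "") -> Dict[str, object]:
    """Flatten nested dicts into dot-notation paths, e.g. {"Net": {"Amount": 1}} -> {"Net.Amount": 1}."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_keys(value, path))
        else:
            flat[path] = value  # lists and scalars are kept as leaves
    return flat


def confidence_gaps(answer: dict, confidence: dict) -> Tuple[List[str], List[str]]:
    """Return (fields without a confidence, confidence keys without a field)."""
    fields = set(flatten_keys(answer))
    missing = sorted(fields - set(confidence))
    extra = sorted(set(confidence) - fields)
    return missing, extra
```

For instance, an answer of `{"Net": {"Amount": 100.0, "Currency": "EUR"}}` with confidences `{"Net.Amount": 0.82}` would report `Net.Currency` as missing, which could then trigger a re-ask or a lowered confidence for that field.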