# Implementation Roadmap
## LLM as a judge
### Must-have (ship first)
- **Strict JSON schema + retry-on-fail**  
  Ensures well-formed, consistent outputs.  

- **Multi-trace CoT (k=3–5) + self-confidence**  
  Generate multiple reasoning traces; each trace provides its own confidence score.  

- **Weighted voting by self-confidence**  
  Aggregate values using confidence weights instead of plain majority.  

- **Evidence grounding (span linking)**  
  Judge must cite exact supporting text spans; missing evidence → lower confidence.  

- **Domain validators (hard/soft rules)**  
  - Dates follow expected formats  
  - VAT ID / IBAN checksum validation  
  - `Gross ≈ Net + VAT` (within tolerance)  
  - Currency whitelist checks  

- **Per-field calibration**  
  Map raw confidence to true correctness probability (e.g., isotonic or temperature scaling).  

- **Acceptance policy**  
  Apply per-field thresholds; for documents, require all critical fields to pass.  

- **Auditing & provenance**  
  Store chosen value, calibrated confidence, cited spans, validator outcomes, and reasoning traces.  

---

### High-impact add-ons (still no training)
- **Dual-phase reasoning**  
  1. Solution reasoning → generate answer  
  2. Confidence reasoning → judge and verbalize confidence bin (0–9 or 0–1)  

- **Difficulty-aware sampling**  
  Use vote dispersion or evidence quality to stop early on easy fields or sample more traces for hard ones.  

- **Cross-judge diversity (multiple rubrics)**  
  Run the same model with different judging rubrics and combine results:  
  - *Format-strict judge*: strict regex/format checks (e.g., dates, VAT IDs)  
  - *Arithmetic-strict judge*: check totals and sums (e.g., Gross vs. Net + VAT)  
  - *Evidence-strict judge*: require supporting spans near relevant anchors (e.g., “VAT”, “Total”)  

- **Few-shot bank (prompt-only)**  
  Maintain a curated set of vendor/layout examples; update as new reviewed cases are added.  

---

### Ops & monitoring
- **Coverage vs. accuracy dashboards**  
  Track Brier, ECE, NLL, AUROC; monitor auto-accept vs. review rates.  

- **Periodic recalibration**  
  Refresh calibration mappings as new reviewed cases arrive.  

- **Fail-safes**  
  If JSON invalid or confidence unstable → deterministic re-ask at temperature=0; else route to review.  


<style>
open { color: Red }
inprogress { color: Yellow }
done { color: Green; text-decoration: line-through }
</style>

# First steps
To start things off, we need to build a simple LLM-as-a-Judge (LaaJ) setup<br>
For this we need to do prepare the following:
- <done>Access to the Azure hosted GPT5-mini</done>
- <inprogress>Access to the data located on the NAS</inprogress>

Then we need to force the LLM to create multiple reasoning paths for one input.<br>
- <done>Actively encourage the LLM to perform "slow thinking" and prepare a json output schema the data needs to fit in.</done><br>
- <done>Implement a retry option (with amount of retries as a parameter) to catch LLM errors.</done><br>
- <open>Peform validity checks for produced outputs (Valid VAT layout, isNumber for amounts, ...). If these checks fail --> confidence decrease / set to 0.</open><br>
- <open>Query the LLM to rate its own output using numeric and written bins (10 bins, [0.0; 0.1], [0.1; 0.2], ...)</open><br>
- <open>Implement weighted majority voting based on confidence (Borda voting with exponent)</open><br>

<open>Visualize the results of perceived confidence and actual correctness in a bar chart (should be x=y for perfect prediction)</open><br>


# LLM Setup

In [32]:
from datetime import datetime
import os
from pathlib import Path
from dotenv import load_dotenv
import httpx
from openai import AsyncOpenAI, RateLimitError
import logging
import json


load_dotenv(override = True)
# ---------------------------------HELPER-VARS--------------------------------
DOCUMENT_STORAGE = Path(os.getenv("DOCUMENT_STORAGE"))
TIMESTAMP = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

# ---------------------------------LLM-CONFIG---------------------------------
API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
DEPLOYMENT_NAME = os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME")

# ---------------------------------LLM-SETUP---------------------------------
client = AsyncOpenAI(
    api_key=API_KEY,
    base_url=ENDPOINT
)

CONCCURRENT_TASKS = 15
REQUEST_DELAY = 1 # seconds

# Retry configuration
MAX_RETRIES = 5
RETRY_BACKOFF = 5  # seconds

# Whitelist of retryable exceptions
RETRYABLE_EXCEPTIONS = (
    RateLimitError,
    httpx.ConnectTimeout,
    httpx.ReadTimeout,
    httpx.HTTPStatusError,
    httpx.RemoteProtocolError,
    httpx.NetworkError,
)


# ---------------------------------LOGGING---------------------------------
logging_path = Path("../Logs")
if not logging_path.exists():
    logging_path.mkdir(parents=True, exist_ok=True)

logging.basicConfig(
    filename=f"{logging_path}/{DEPLOYMENT_NAME}_{TIMESTAMP}.log",
    filemode="a",                     
    format="%(asctime)s - %(levelname)s - %(message)s",
    level=logging.INFO,
    force=True
)

for noisy in ["httpx", "openai", "azure", "urllib3"]:
    logging.getLogger(noisy).setLevel(logging.WARNING)


# Load system prompt
with open("../Prompts/system_prompt.txt", "r", encoding="utf-8") as f:
    system_prompt = f.read()

# Load the schema from file
with open("../Prompts/schema.json", "r", encoding="utf-8") as f:
    schema = json.load(f)

# Turn it into pretty JSON for prompt injection
formatted_schema = json.dumps(schema, indent=2)

system_prompt = system_prompt.replace("formatted_schema_representation",formatted_schema)

# Test Reasoning path setup

In [33]:
# =========================
# Block 2.1 — Confidence + Self-Consistency (unified)
# Requires Block 1 variables: client, DEPLOYMENT_NAME, system_prompt, schema, CONCCURRENT_TASKS
# =========================

import asyncio
from pathlib import Path
from datetime import datetime
from collections import defaultdict, Counter
from typing import Any, Dict, List, Optional, Tuple
import json
import re
from uuid import uuid4

# ----- Verbal classes -----
CONFIDENCE_CLASSES = [
    "Almost no chance",
    "Highly unlikely",
    "Chances are slight",
    "Unlikely",
    "Less than even",
    "Better than even",
    "Likely",
    "Very good chance",
    "Highly likely",
    "Almost certain",
]

# Numeric → verbal bins (inclusive lower bounds)
_SCORE_BOUNDS = [0.00, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90]

def score_to_verbal(score: float) -> Optional[str]:
    """Map [0,1] numeric score to one of 10 verbal classes."""
    if score is None or not isinstance(score, (int, float)):
        return None
    s = max(0.0, min(1.0, float(score)))
    idx = max(i for i, b in enumerate(_SCORE_BOUNDS) if s >= b)
    return CONFIDENCE_CLASSES[idx]

# ----- Helpers for reasoning text capture -----
def _extract_reasoning_text(resp) -> str:
    chunks = []
    for item in getattr(resp, "output", []) or []:
        if getattr(item, "type", "") == "reasoning":
            summary = getattr(item, "summary", None)
            if isinstance(summary, list):
                for seg in summary:
                    t = getattr(seg, "text", None)
                    if t:
                        chunks.append(t)
            for seg in getattr(item, "content", []) or []:
                t = getattr(seg, "text", None)
                if t:
                    chunks.append(t)
    if chunks:
        return "\n".join(chunks).strip()
    return getattr(resp, "output_text", "") or ""

def _coerce_answer_dict(ans: dict) -> dict:
    """
    If the model returned a schema-like wrapper with 'values', unwrap it.
    Otherwise return dict as-is.
    """
    if isinstance(ans, dict) and "properties" in ans and "values" in ans:
        v = ans.get("values")
        return v if isinstance(v, dict) else ans
    return ans if isinstance(ans, dict) else {}

def _postprocess_final_payload(payload: dict) -> dict:
    """
    Expected payload:
      {
        "answer": { ... },                   # keys per your schema (dot-notation allowed)
        "confidence": { "Field.Key": 0..1 }  # per-field numeric confidences
      }
    Returns:
      {
        "answer": {...},
        "confidence_numeric": {...},
        "confidence_verbal": {...},
      }
    """
    answer = _coerce_answer_dict(payload.get("answer", {}))
    conf_num = payload.get("confidence", {}) or {}
    conf_verbal = {k: score_to_verbal(v) for k, v in conf_num.items()}
    return {
        "answer": answer,
        "confidence_numeric": conf_num,
        "confidence_verbal": conf_verbal,
    }

# =========================
# Single-pass, per-field numeric confidence
# =========================
async def run_three_step_invoice_with_reasoning(invoice_txt_path: str, output_dir: Path):
    text = Path(invoice_txt_path).read_text(encoding="utf-8")
    conv_id = f"inv-{uuid4()}"

    resp1 = await client.responses.create(
        model=DEPLOYMENT_NAME,
        instructions=system_prompt,
        input=(
            "Step 1 — SOLUTION REASONING:\n"
            "Reason step by step to extract the structured invoice data from the text below. "
            "Use slow thinking. Do NOT output the final answer yet.\n\n"
            f"--- INVOICE TEXT START ---\n{text}\n--- INVOICE TEXT END ---"
        ),
        reasoning={"effort": "low"},
        text={"verbosity": "low"},
        max_output_tokens=4000,
        conversation=conv_id,
        store=True,  # <-- required so the ID can be referenced
    )

    resp2 = await client.responses.create(
        model=DEPLOYMENT_NAME,
        previous_response_id=resp1.id,
        input=(
            "Step 2 — CONFIDENCE REASONING:\n"
            "Evaluate how likely each extracted field is correct. Do NOT output the final answer yet."
        ),
        reasoning={"effort": "low"},
        text={"verbosity": "low"},
        max_output_tokens=2000,
        conversation=conv_id,
        store=True,  # <-- keep stored for step 3 to find it
    )

    resp3 = await client.responses.create(
        model=DEPLOYMENT_NAME,
        previous_response_id=resp2.id,
        input=(
            "Step 3 — FINAL OUTPUT:\n"
            "Return ONLY a single valid JSON object:\n"
            "{\n"
            f'  "answer": <your final extracted invoice data strictly following this schema: {json.dumps(schema)}>,\n'
            '  "confidence": { /* per-field numeric [0,1], keys mirror "answer"; dot-notation for nested */ }\n'
            "}\n"
            "No markdown. No extra keys. No comments in the JSON."
        ),
        reasoning={"effort": "low"},
        text={"verbosity": "low"},
        max_output_tokens=2000,
        conversation=conv_id,
        store=False,
    )

    # Strict JSON parse with light fallback
    try:
        payload = json.loads(resp3.output_text)
    except json.JSONDecodeError:
        cleaned = resp3.output_text.strip()
        if cleaned.startswith("```"):
            cleaned = cleaned.strip("`")
            cleaned = cleaned[cleaned.find("{") : cleaned.rfind("}") + 1]
        payload = json.loads(cleaned)

    processed = _postprocess_final_payload(payload)

    lean_record = {
        "input_path": str(invoice_txt_path),
        "answer": processed["answer"],
        "confidence_numeric": processed["confidence_numeric"],
        "confidence_verbal": processed["confidence_verbal"],
        "reasoning": {
            "step1": _extract_reasoning_text(resp1),
            "step2": _extract_reasoning_text(resp2),
        },
    }

    out_path = output_dir / (Path(invoice_txt_path).stem + ".json")
    out_path.write_text(json.dumps(lean_record, ensure_ascii=False, indent=2), encoding="utf-8")
    return lean_record

async def run_many_invoices(invoice_txt_paths):
    run_ts = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    output_dir = Path(f"../Output/{run_ts}")
    output_dir.mkdir(parents=True, exist_ok=True)

    sem = asyncio.Semaphore(CONCCURRENT_TASKS)

    async def _bounded(p):
        async with sem:
            return await run_three_step_invoice_with_reasoning(p, output_dir)

    tasks = [asyncio.create_task(_bounded(p)) for p in invoice_txt_paths]
    results = await asyncio.gather(*tasks)

    (output_dir / "_index.json").write_text(json.dumps(results, ensure_ascii=False, indent=2), encoding="utf-8")
    return {"output_dir": str(output_dir), "results": results}

# =========================
# Self-consistency: multi-trace + weighted voting
# =========================

def _norm_str(s: str) -> str:
    return re.sub(r"\s+", " ", s.strip()).casefold()

def _round_num(x: float, ndigits: int = 2) -> float:
    try:
        return round(float(x), ndigits)
    except Exception:
        return x

def _value_key(v: Any) -> Any:
    """
    Canonicalize values to stable keys for voting.
    - numbers rounded
    - strings normalized
    - lists/tuples/dicts converted to sortable tuples
    """
    if isinstance(v, (int, float)):
        return _round_num(v, 2)
    if isinstance(v, str):
        return _norm_str(v)
    if isinstance(v, list):
        return tuple(_value_key(x) for x in v)
    if isinstance(v, tuple):
        return tuple(_value_key(x) for x in v)
    if isinstance(v, dict):
        return tuple(sorted((k, _value_key(val)) for k, val in v.items()))
    return v

def _pick_representative(raw_values: List[Any], winner_key: Any) -> Any:
    """Pick a representative original value for the winning bucket."""
    candidates = [v for v in raw_values if _value_key(v) == winner_key]
    if not candidates:
        return None
    cnt = Counter(json.dumps(v, ensure_ascii=False, sort_keys=True) for v in candidates)
    rep_json = cnt.most_common(1)[0][0]
    return json.loads(rep_json)

def _weighted_vote_per_field(results: List[Dict[str, Any]]) -> Tuple[Dict[str, Any], Dict[str, float], Dict[str, str], Dict[str, float], Dict[str, int]]:
    """
    results: list of trace dicts with keys:
      - answer: dict of field -> value
      - confidence_numeric: dict of field -> weight in [0,1]
    Returns:
      consensus_answer, consensus_conf_num, consensus_conf_verbal, agreement_ratio, support_count
    """
    all_fields = set()
    for r in results:
        all_fields.update(r.get("answer", {}).keys())

    consensus_answer = {}
    consensus_conf_num = {}
    consensus_conf_verbal = {}
    agreement_ratio = {}
    support_count = {}

    for field in sorted(all_fields):
        buckets = defaultdict(float)
        raw_vals = []
        total_weight = 0.0
        votes = 0

        for r in results:
            ans = r.get("answer", {})
            confs = r.get("confidence_numeric", {})
            if field in ans:
                v = ans[field]
                w = float(confs.get(field, 0.0) or 0.0)
                k = _value_key(v)
                w_clipped = max(0.0, min(1.0, w))
                buckets[k] += w_clipped
                raw_vals.append(v)
                total_weight += w_clipped
                votes += 1

        if not buckets:
            continue

        # Winner by max total weight
        best_key, best_weight = max(buckets.items(), key=lambda kv: (kv[1],))
        if total_weight > 0:
            agree = best_weight / total_weight
        else:
            # Fallback: unweighted majority
            count_buckets = defaultdict(int)
            for v in raw_vals:
                count_buckets[_value_key(v)] += 1
            best_key = max(count_buckets.items(), key=lambda kv: (kv[1],))[0]
            agree = count_buckets[best_key] / max(1, votes)

        rep_value = _pick_representative(raw_vals, best_key)
        consensus_answer[field] = rep_value
        conf_num = max(0.0, min(1.0, agree))
        consensus_conf_num[field] = conf_num
        consensus_conf_verbal[field] = score_to_verbal(conf_num)
        agreement_ratio[field] = agree
        support_count[field] = votes

    return consensus_answer, consensus_conf_num, consensus_conf_verbal, agreement_ratio, support_count

async def _run_single_trace(invoice_txt_path: str) -> Dict[str, Any]:
    text = Path(invoice_txt_path).read_text(encoding="utf-8")
    conv_id = f"inv-{uuid4()}"

    resp1 = await client.responses.create(
        model=DEPLOYMENT_NAME,
        instructions=system_prompt,
        input=(
            "Step 1 — SOLUTION REASONING:\n"
            "Reason step by step to extract the structured invoice data. "
            "Do NOT output the final answer yet.\n\n"
            f"--- INVOICE TEXT START ---\n{text}\n--- INVOICE TEXT END ---"
        ),
        reasoning={"effort": "low"},
        text={"verbosity": "low"},
        max_output_tokens=4000,
        conversation=conv_id,
        store=True,
    )

    resp2 = await client.responses.create(
        model=DEPLOYMENT_NAME,
        previous_response_id=resp1.id,
        input="Step 2 — CONFIDENCE REASONING:\nAssess per-field likelihoods. Do NOT output the final answer.",
        reasoning={"effort": "low"},
        text={"verbosity": "low"},
        max_output_tokens=2000,
        conversation=conv_id,
        store=True,
    )

    resp3 = await client.responses.create(
        model=DEPLOYMENT_NAME,
        previous_response_id=resp2.id,
        input=(
            "Step 3 — FINAL OUTPUT:\n"
            "{\n"
            f'  "answer": <schema: {json.dumps(schema)}>,\n'
            '  "confidence": { /* per-field numeric [0,1], keys mirror "answer"; dot-notation for nested */ }\n'
            "}\n"
            "No markdown. No extra text."
        ),
        reasoning={"effort": "low"},
        text={"verbosity": "low"},
        max_output_tokens=2000,
        conversation=conv_id,
        store=False,
    )

    payload = json.loads(resp3.output_text if resp3.output_text.strip().startswith("{")
                         else resp3.output_text.strip()[resp3.output_text.find("{"):resp3.output_text.rfind("}")+1])
    processed = _postprocess_final_payload(payload)
    return {
        "answer": processed["answer"],
        "confidence_numeric": processed["confidence_numeric"],
        "reasoning": {
            "step1": _extract_reasoning_text(resp1),
            "step2": _extract_reasoning_text(resp2),
        },
        "raw_output_text": resp3.output_text,
    }

async def run_self_consistent_invoice(invoice_txt_path: str, n_paths: int = 5):
    """
    Multi-trace self-consistency for one document.
    Weighted voting per field using internal confidences.
    """
    run_ts = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    output_dir = Path(f"../Output/{run_ts}")
    output_dir.mkdir(parents=True, exist_ok=True)


    sem = asyncio.Semaphore(CONCCURRENT_TASKS)

    async def _bounded():
        async with sem:
            return await _run_single_trace(invoice_txt_path)

    traces = await asyncio.gather(*[asyncio.create_task(_bounded()) for _ in range(n_paths)])


    # Persist individual traces
    traces_dir = output_dir / "traces"
    traces_dir.mkdir(parents=True, exist_ok=True)
    for i, tr in enumerate(traces, 1):
        (traces_dir / f"trace_{i:02d}.json").write_text(json.dumps(tr, ensure_ascii=False, indent=2), encoding="utf-8")

    # Consensus
    ans, conf_num, conf_verbal, agree_ratio, support = _weighted_vote_per_field(traces)

    final_record = {
        "input_path": str(invoice_txt_path),
        "consensus": {
            "answer": ans,
            "confidence_numeric": conf_num,
            "confidence_verbal": conf_verbal,
            "agreement_ratio": agree_ratio,   # weighted agreement 0..1
            "support_count": support          # number of traces that produced a value per field
        },
        "meta": {
            "n_paths": n_paths,
            "output_dir": str(output_dir),
        },
        "traces": traces,  # per-trace summaries
    }

    (output_dir / (Path(invoice_txt_path).stem + "_consensus.json")).write_text(
        json.dumps(final_record, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return final_record

# ===== Examples =====
# Single-pass for many files:
paths = ["../Documents/273366/273277_page1_grid.txt"]
out = await run_many_invoices(paths)

print ("intermediate done")
# Self-consistency for one file:
consensus = await run_self_consistent_invoice("../Documents/273366/273277_page1_grid.txt", n_paths=5)


BadRequestError: Error code: 400 - {'error': {'message': "Unknown parameter: ''.", 'type': 'invalid_request_error', 'param': None, 'code': 'unknown_parameter'}}