# 2 Ways of predicting confidence
## 1. LLM as a judge
One or more LLMs judge the generated output and reach a final verdict on a [0;1] scale (0 = unusable, 1 = perfect). With multiple judges, the verdicts are combined by majority voting.

## 2. Logprobs
The probability of an output token is inspected.<br>
If it is high enough (threshold per sample), the judgment is most likely good.
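
A minimal sketch of the threshold check described above, assuming the API returns natural-log probabilities per token. The function names and the example logprobs are hypothetical, and the threshold would have to be chosen per sample.

```python
import math


def token_confidence(logprob: float) -> float:
    """Convert a natural-log probability into a linear probability in [0, 1]."""
    return math.exp(logprob)


def passes_threshold(logprob: float, threshold: float) -> bool:
    """Return True if the token probability reaches the given threshold."""
    return token_confidence(logprob) >= threshold


# Hypothetical logprobs, for illustration only.
print(passes_threshold(-0.05, threshold=0.85))  # exp(-0.05) is about 0.951
print(passes_threshold(-0.70, threshold=0.85))  # exp(-0.70) is about 0.497
```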


# Notes
Currently always 0.85, the default value from Typhoon<br>
Implementation via Jupyter Notebook


# Open Questions
- Which fields are inspected
    - If multiple fields, global score or score per field? --> per field
    - If aggregated field (VAT Group), global or per sub-item? --> per field


- Which LLMs to use
    - OpenAI --> Which versions (4.1-mini, 5.x?) --> 5 mini 
    - Azure Document AI? --> own Mistral account (currently not yet available)
    - Azure AI Document intelligence --> Research what this can do

- Do tests with RAG or with base setup? --> Without RAG

# Related Works
## LLM as a judge
### Self consistency
<b>Link:</b> https://arxiv.org/pdf/2203.11171<br>
<b>Description:</b> The original paper on self-consistency.<br>
Using CoT-reasoning and multiple thinking paths, a single LLM can try diverse answer paths.<br>
Most of the time, the correct answer is the one that most paths agree upon --> Majority voting with self-consistency
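
The voting step can be sketched as follows. This assumes the final answers of all reasoning paths have already been extracted as strings; `majority_vote` and the sample values are hypothetical.

```python
from collections import Counter


def majority_vote(answers: list[str]) -> tuple[str, float]:
    """Return the most frequent answer and its share of all votes."""
    if not answers:
        raise ValueError("at least one answer is required")
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)


# Hypothetical final answers from five reasoning paths.
samples = ["118.40", "118.40", "113.40", "118.40", "118.40"]
print(majority_vote(samples))  # ('118.40', 0.8)
```

The vote share can double as a rough confidence signal: the more paths agree, the more trustworthy the answer tends to be.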

### Reasoning for confidence
<b>Link:</b> https://arxiv.org/pdf/2505.14489v1<br>
<b>Description:</b> This paper proposes that <i>slow thinking</i> helps the LLM to better judge its own confidence in an answer.<br>
Its backtracking abilities and expressed uncertainty help develop multiple paths --> good for self-consistency as well.<br>
For CoT prompting they use:
1. Solution reasoning: Reasoning to generate the answer
2. Confidence reasoning: Reasoning to evaluate own confidence
3. Confidence verbalization: Put confidence into 1 of 10 bins [0;1]

For evaluation, the Brier score is used, which combines aspects of both the Expected Calibration Error (ECE) and the area under the ROC curve (AUROC).<br>
Slow thinking apparently works better in larger models!
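
For reference, a plain-Python sketch of two of these metrics, assuming each prediction comes with a confidence in [0, 1] and a binary outcome (1 = correct, 0 = wrong). The equal-width binning with 10 bins is an assumption for this sketch, not necessarily the paper's exact setup.

```python
def brier_score(confidences: list[float], outcomes: list[int]) -> float:
    """Mean squared difference between confidence and the 0/1 outcome."""
    pairs = list(zip(confidences, outcomes))
    return sum((c - o) ** 2 for c, o in pairs) / len(pairs)


def expected_calibration_error(
    confidences: list[float], outcomes: list[int], n_bins: int = 10
) -> float:
    """Weighted mean gap between average confidence and accuracy per bin."""
    bins: list[list[tuple[float, int]]] = [[] for _ in range(n_bins)]
    for c, o in zip(confidences, outcomes):
        index = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes into the last bin
        bins[index].append((c, o))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += len(members) / total * abs(avg_confidence - accuracy)
    return ece
```

Lower is better for both: a model that says 0.9 but is right only half the time gets a Brier score of about 0.41 and an ECE of about 0.4.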

## Logprobs
### DeepConf
<b>Link:</b> https://arxiv.org/pdf/2508.15260 <br>
<b>Description:</b> DeepConf tries to optimize self-consistency (majority voting on multiple reasoning paths), as this <i>parallel thinking</i> is very token intensive.<br>
In addition, performance degrades because all reasoning paths are weighted equally, including low-quality ones.<br>
DeepConf can:
 - Evaluate reasoning traces
 - Abort low-quality traces
 - Set a task-adaptive threshold for "low quality" (Offline warmup)
 - Grade the difficulty of a task by comparing vote shares in majority voting (Adaptive Sampling)
 <br>
![Overview of DeepConf with Sample Code](../Images/Research/DeepConfOverview.png)
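
A simplified sketch of the "abort low-quality traces" idea, not the paper's exact formulation. It assumes a per-token confidence score is already available for each trace and flags a trace when the mean confidence of any sliding window drops below a threshold. The function names, window size, and threshold are hypothetical.

```python
def lowest_window_confidence(token_confidences: list[float], window: int) -> float:
    """Lowest mean confidence over any sliding window of the trace."""
    if not token_confidences:
        raise ValueError("trace must contain at least one token")
    window = min(window, len(token_confidences))
    running = sum(token_confidences[:window])
    lowest = running / window
    for i in range(window, len(token_confidences)):
        # Slide the window by one token: add the new score, drop the oldest.
        running += token_confidences[i] - token_confidences[i - window]
        lowest = min(lowest, running / window)
    return lowest


def keep_trace(token_confidences: list[float], window: int, threshold: float) -> bool:
    """Keep a trace only if no window falls below the threshold."""
    return lowest_window_confidence(token_confidences, window) >= threshold
```

In an online setting the same check could run while the trace is still being generated, so that a weak trace is stopped before it consumes more tokens.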

### llm_confidence Package
<b>Link:</b> https://medium.com/%40vatvenger/confidence-unlocked-a-method-to-measure-certainty-in-llm-outputs-1d921a4ca43c<br>
<b>Description:</b> This Medium post explains how log probs work and introduces a Python package to work with them.<br>
Log probs:
- are logarithmic
- work best in key-value pairs
- can be summed (over all tokens) per pair to get a confidence score for the pair<br>
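
Summing per pair can be sketched like this. It does not use the package's API; `pair_confidence` and the token logprobs are hypothetical, and natural-log probabilities are assumed.

```python
import math


def pair_confidence(token_logprobs: list[float]) -> float:
    """Joint probability of all tokens in one key-value pair: exp(sum of logprobs)."""
    return math.exp(sum(token_logprobs))


# Hypothetical token logprobs for two extracted fields.
fields = {
    "invoice_number": [-0.01, -0.02, -0.01],
    "total_gross": [-0.30, -0.45, -0.10],
}
scores = {name: pair_confidence(logprobs) for name, logprobs in fields.items()}
print(scores)
```

Because the logprobs are summed, longer values tend to score lower than short ones, which is worth keeping in mind when comparing fields of different length.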

In this test, differences were found between extracting fields individually and extracting them grouped together in a JSON dict.<br>


### Temperature scaling
<b>Link:</b> https://arxiv.org/pdf/2409.19817<br>
<b>Description:</b> After RLHF (Reinforcement Learning from Human Feedback), the calibration of LLMs (how well confidence matches accuracy) is off.<br>
With ATS (Adaptive Temperature Scaling), this paper tries to restore the balance in a task-adaptive manner.<br>
The goal is to restore calibrated token-level probabilities.
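
For intuition, a sketch of plain temperature scaling with a single fixed temperature. This is the basic technique only; ATS itself adapts the temperature instead of using one constant. `softmax_with_temperature` and the logits are hypothetical.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Softmax over logits divided by the temperature (T > 1 flattens, T < 1 sharpens)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.0]
print(softmax_with_temperature(logits, temperature=1.0))
print(softmax_with_temperature(logits, temperature=2.0))  # less peaked distribution
```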

### Self-certainty
<b>Link:</b> https://arxiv.org/pdf/2502.18581<br>
<b>Description:</b> A successor to self-consistency, self-certainty measures the deviation of the probability distribution from a uniform distribution.<br>
By assigning more votes to voters with a higher confidence score, a weighted majority voting system is implemented
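
A sketch of both ideas, assuming the deviation from uniform is measured as the KL divergence KL(U || p) averaged over all tokens of a response, and that every token probability is strictly positive. The score-weighted vote follows the description above; the aggregation rule used in the paper may differ. All function names are hypothetical.

```python
import math
from collections import defaultdict


def kl_from_uniform(probs: list[float]) -> float:
    """KL(U || p) for one token distribution; 0 when p is uniform. Requires p > 0."""
    vocab_size = len(probs)
    return -sum(math.log(vocab_size * p) for p in probs) / vocab_size


def self_certainty(token_distributions: list[list[float]]) -> float:
    """Average deviation from uniform over all tokens of one response."""
    return sum(kl_from_uniform(p) for p in token_distributions) / len(token_distributions)


def weighted_vote(answers_with_scores: list[tuple[str, float]]) -> str:
    """Pick the answer with the highest summed score instead of the highest count."""
    totals: dict[str, float] = defaultdict(float)
    for answer, score in answers_with_scores:
        totals[answer] += score
    return max(totals, key=totals.get)
```

With votes `[("A", 0.2), ("B", 0.9), ("A", 0.3)]`, plain majority voting picks "A" (two votes), while the weighted vote picks "B" (0.9 against 0.5).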


## Additional papers
HYCEDIS (2022) — Combines multimodal confidence predictors + anomaly detection for invoices and receipts.

PatchFinder (2024) — Uses patch-wise max-softmax uncertainty for scanned receipts, robust against noisy OCR.

KIEval (2025) — Defines evaluation metrics linking confidence thresholds to automation rate in KIE pipelines.

# Additional information
For this experiment we will use the "grid" layout .txt files

# Implementation Roadmap
## LLM as a judge
### Must-have (ship first)
- **Strict JSON schema + retry-on-fail**  
  Ensures well-formed, consistent outputs.  

- **Multi-trace CoT (k=3–5) + self-confidence**  
  Generate multiple reasoning traces; each trace provides its own confidence score.  

- **Weighted voting by self-confidence**  
  Aggregate values using confidence weights instead of plain majority.  

- **Evidence grounding (span linking)**  
  Judge must cite exact supporting text spans; missing evidence → lower confidence.  

- **Domain validators (hard/soft rules)**  
  - Dates follow expected formats  
  - VAT ID / IBAN checksum validation  
  - `Gross ≈ Net + VAT` (within tolerance)  
  - Currency whitelist checks  

- **Per-field calibration**  
  Map raw confidence to true correctness probability (e.g., isotonic or temperature scaling).  

- **Acceptance policy**  
  Apply per-field thresholds; for documents, require all critical fields to pass.  

- **Auditing & provenance**  
  Store chosen value, calibrated confidence, cited spans, validator outcomes, and reasoning traces.  
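
Two of the domain validators listed above can be sketched as follows. The IBAN check uses the standard mod-97 rule (move the first four characters to the end, map letters to 10 to 35, remainder must be 1). The function names and the default tolerance of 0.02 are hypothetical.

```python
from decimal import Decimal


def iban_checksum_ok(iban: str) -> bool:
    """Mod-97 check: a valid IBAN leaves remainder 1 after rearranging."""
    compact = iban.replace(" ", "").upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    rearranged = compact[4:] + compact[:4]
    # int(ch, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def amounts_consistent(gross: str, net: str, vat: str, tolerance: str = "0.02") -> bool:
    """Check Gross ≈ Net + VAT within a tolerance, using exact decimal arithmetic."""
    difference = abs(Decimal(gross) - (Decimal(net) + Decimal(vat)))
    return difference <= Decimal(tolerance)


print(iban_checksum_ok("GB82 WEST 1234 5698 7654 32"))  # True
print(amounts_consistent("119.00", "100.00", "19.00"))  # True
```

The checksum only proves the IBAN is well-formed, not that the account exists, so it fits best as a hard rule that lowers confidence on failure rather than raising it on success.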

---

### High-impact add-ons (still no training)
- **Dual-phase reasoning**  
  1. Solution reasoning → generate answer  
  2. Confidence reasoning → judge and verbalize confidence bin (0–9 or 0–1)  

- **Difficulty-aware sampling**  
  Use vote dispersion or evidence quality to stop early on easy fields or sample more traces for hard ones.  

- **Cross-judge diversity (multiple rubrics)**  
  Run the same model with different judging rubrics and combine results:  
  - *Format-strict judge*: strict regex/format checks (e.g., dates, VAT IDs)  
  - *Arithmetic-strict judge*: check totals and sums (e.g., Gross vs. Net + VAT)  
  - *Evidence-strict judge*: require supporting spans near relevant anchors (e.g., “VAT”, “Total”)  

- **Few-shot bank (prompt-only)**  
  Maintain a curated set of vendor/layout examples; update as new reviewed cases are added.  
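
Difficulty-aware sampling based on vote dispersion could look like this. `sample_answer` stands in for one LLM call that returns a final answer; it and the default limits (`min_traces`, `max_traces`, `consensus`) are hypothetical.

```python
from collections import Counter
from typing import Callable


def adaptive_sample(
    sample_answer: Callable[[], str],
    min_traces: int = 3,
    max_traces: int = 9,
    consensus: float = 0.8,
) -> tuple[str, float, int]:
    """Sample until the leading answer reaches the consensus share or the budget is spent."""
    answers: list[str] = []
    while len(answers) < max_traces:
        answers.append(sample_answer())
        if len(answers) >= min_traces:
            _, votes = Counter(answers).most_common(1)[0]
            if votes / len(answers) >= consensus:
                break  # easy field: the traces agree, stop early
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers), len(answers)
```

Easy fields stop after `min_traces` calls, while contested fields use the full budget and return a lower vote share that can feed into the confidence score.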

---

### Ops & monitoring
- **Coverage vs. accuracy dashboards**  
  Track Brier, ECE, NLL, AUROC; monitor auto-accept vs. review rates.  

- **Periodic recalibration**  
  Refresh calibration mappings as new reviewed cases arrive.  

- **Fail-safes**  
  If the JSON is invalid or the confidence is unstable → deterministic re-ask at temperature=0; if that also fails, route to review.  
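
A sketch of the fail-safe flow for the invalid-JSON case only (detecting unstable confidence is left out). `ask` stands in for the LLM call and takes the temperature as its argument; it, the required keys, and the status labels are hypothetical.

```python
import json
from typing import Callable, Optional

REQUIRED_KEYS = {"value", "confidence"}  # hypothetical minimal schema


def parse_judgment(raw: str) -> Optional[dict]:
    """Return the parsed judgment, or None if it is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None
    confidence = data["confidence"]
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0 <= confidence <= 1:
        return None
    return data


def judge_with_failsafe(ask: Callable[[float], str], temperature: float = 0.7) -> dict:
    """Try once, re-ask deterministically on failure, then route to review."""
    first = parse_judgment(ask(temperature))
    if first is not None:
        return {"status": "ok", **first}
    retry = parse_judgment(ask(0.0))  # deterministic re-ask
    if retry is not None:
        return {"status": "ok_after_retry", **retry}
    return {"status": "needs_review"}
```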


## Logprobs