# Motivation
- **Hallucination & over‑confidence**: Modern LLMs (e.g. GPT‑3.5, GPT‑4) often emit plausible‑looking but incorrect answers. They provide no built‑in calibration, and API users typically have no access to the training data or to token probabilities.

- **High‑stakes need**: In critical applications (legal drafting, medical advice, automated evaluation), users must know when to trust an LLM’s response.

- **Black‑box constraint**: We cannot modify LLM training or access internal probabilities, so the method must wrap around API calls.

# BSDetector Method Overview
- BSDetector produces, alongside any LLM answer, a numeric confidence score C ∈ [0,1] by combining two complementary signals:

- **Observed Consistency (O)** – an extrinsic measure of how much the model’s various plausible outputs contradict the original answer.

- **Self‑reflection Certainty (S)** – an intrinsic measure obtained by prompting the LLM to introspectively judge its own answer’s correctness.

Overall score:
    `C = β·O + (1–β)·S`<br>
with β a tunable trade‑off parameter.
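
As a quick worked example of the combination rule, the sketch below evaluates C for a few values of β. The function name `combine_confidence` and the sample values of O and S are illustrative assumptions, not values from the paper.

```python
def combine_confidence(o: float, s: float, beta: float = 0.5) -> float:
    """Return C = beta*O + (1 - beta)*S, with O, S and beta all in [0, 1]."""
    for name, value in (("o", o), ("s", s), ("beta", beta)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return beta * o + (1.0 - beta) * s


# Hypothetical signals: sampled answers mostly agree (O = 0.8),
# but the model is unsure when asked to reflect (S = 0.5).
for beta in (0.0, 0.5, 1.0):
    print(f"beta={beta:.1f} -> C={combine_confidence(0.8, 0.5, beta):.2f}")
```

With β = 0 the score reduces to S, with β = 1 it reduces to O, and intermediate values interpolate linearly between the two.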

![alt text](<papers/Quantifying Uncertainty in Answers from any Language Model and Enhancing their Trustworthiness/images/BSDetectorPipeline.png>)

# Key Drawbacks of Raw Embedding Similarity Metrics

A Simple Intuitive Example<br>
- **Sentence A**: "The quick brown fox jumps over the lazy dog."<br>
- **Sentence B**: "A swift auburn fox leaped above the lethargic canine."

- True semantic relationship:
Both sentences describe exactly the same scene—only the word choices differ.

- Embedding‑space outcome:
  - A raw cosine‑similarity model might score them around 0.65–0.75 (out of 1), which looks only moderately similar—even though to a human reader they’re identical in meaning.
  - This drop occurs because the model’s vectors are sensitive to different token distributions ("quick" vs. "swift", "brown" vs. "auburn", etc.) and to how often those specific words appeared during pretraining.
  - High‑frequency words tend to cluster differently than low‑frequency ones, causing point estimates of similarity to systematically over‑ or under‑estimate human judgments of meaning closeness. This means that even two semantically identical phrases can appear distant simply because one uses more common words than the other.
  - Small stylistic tweaks—like adding an adverb or swapping a single adjective—can shift an embedding enough to drop cosine similarity below a threshold, despite no real change in meaning. 
- Why this matters for uncertainty estimation:
If you treated that moderate cosine score as evidence of “uncertainty” in meaning, you’d incorrectly conclude the model is unsure about the answer, when in fact it’s just reworded it.
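
To make the sensitivity to word choice concrete, here is a minimal sketch that computes cosine similarity on the two sentences above. It uses plain bag‑of‑words counts as a stand‑in for learned embeddings (an assumption made only to keep the example dependency‑free); real embedding models score such paraphrases higher, but a milder version of the same effect applies. The helper name `bow_cosine` is hypothetical.

```python
import math
import re
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two sentences."""
    counts_a = Counter(re.findall(r"[a-z']+", a.lower()))
    counts_b = Counter(re.findall(r"[a-z']+", b.lower()))
    # Only words present in both sentences contribute to the dot product
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


sentence_a = "The quick brown fox jumps over the lazy dog."
sentence_b = "A swift auburn fox leaped above the lethargic canine."
print(f"{bow_cosine(sentence_a, sentence_b):.3f}")  # prints 0.302
```

Only "fox" and "the" are shared, so the score is low even though the sentences describe the same scene. A similarity threshold applied to such a score would misread a paraphrase as disagreement.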

# A better strategy? NLI
## Summary
- Natural Language Inference (NLI), also called Recognizing Textual Entailment (RTE), is the task of determining whether a **hypothesis** sentence is **entailed by**, **contradicts**, or is **neutral** with respect to a **premise** sentence.
- An NLI model outputs a probability distribution over these three labels, enabling us to judge whether two pieces of text agree in meaning, oppose each other, or have no definite relationship.
- The NLI framework is particularly powerful for semantic comparison because it focuses on logical relationships rather than surface‑level word overlap or embedding proximity.

## What Is NLI?
NLI frames meaning comparison as a three‑way classification task:
- **Entailment**: The premise logically implies the hypothesis.
- **Contradiction**: The premise and hypothesis cannot both be true simultaneously.
- **Neutral**: Neither entailment nor contradiction—there’s no clear inference in either direction.
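
The three‑way classification can be sketched as a softmax over three raw scores. The label order, the logits, and the function names below are assumptions for illustration; each real NLI checkpoint defines its own label ordering.

```python
import math

# Assumed label order for this sketch only
LABELS = ("entailment", "neutral", "contradiction")


def nli_distribution(logits: list[float]) -> dict[str, float]:
    """Turn three raw scores into a probability distribution over NLI labels."""
    if len(logits) != len(LABELS):
        raise ValueError("expected exactly one logit per label")
    # Subtract the maximum for numerical stability before exponentiating
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(LABELS, exps)}


def predicted_label(dist: dict[str, float]) -> str:
    """Return the label with the highest probability."""
    return max(dist, key=dist.get)


# Hypothetical logits for premise "A man is sleeping." and
# hypothesis "A man is awake."
dist = nli_distribution([-1.0, 0.5, 3.0])
print(predicted_label(dist), round(dist["contradiction"], 3))
```

For uncertainty estimation, the full distribution matters more than the predicted label: the contradiction probability can be used directly as a graded signal.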

## Why Use NLI?
- **Logic‑focused**: Goes beyond surface‑form similarity to capture true semantic agreement or opposition.
- **Graded confidence**: By producing probabilities for each label, NLI gives a fine‑grained measure of how strongly two sentences align or conflict.
- **Robust to rephrasing**: Ignores trivial wording differences that might fool embedding‑based metrics.

## How Does an NLI Model Work?
![alt text](<papers/Quantifying Uncertainty in Answers from any Language Model and Enhancing their Trustworthiness/images/How NLI works.png>)

## Simple Examples
![alt text](<papers/Quantifying Uncertainty in Answers from any Language Model and Enhancing their Trustworthiness/images/Examples of NLI.png>)

## Intuitive Key-Takeaways
By using an NLI classifier, we move from brittle string‑matching or embedding‑distance heuristics to a more robust, logic‑driven measure of semantic consistency—key for assessing uncertainty in model outputs or flagging truly conflicting responses.

# Effect of different sentence similarity metrics
![alt text](<papers/Quantifying Uncertainty in Answers from any Language Model and Enhancing their Trustworthiness/images/Effect of different sentence similarity metrics.png>)

# Observed Consistency
- The authors propose to quantify the uncertainty of a language model’s answer by sampling multiple outputs and then measuring how semantically “aligned” each sampled answer is with the original (reference) answer.
 
- **Diverse sampling**: Re‑invoke the LLM k times (e.g. k = 5) with a Chain‑of‑Thought prompt and high temperature to yield answers {y₁…yₖ}.

- **Semantic contradiction scoring**: For each sample yᵢ, measure contradiction vs. original y via a Natural Language Inference (NLI) model:
  - Compute pᵢ = probability(“contradiction”) for the pair (yᵢ,y) (averaging both input orders).
  - Also include an indicator rᵢ = 1[yᵢ = y] to stabilize on closed‑form tasks.
  - Similarity score oᵢ = α·(1–pᵢ) + (1–α)·rᵢ.
- Aggregate: O = meanᵢ oᵢ.
  - High O means sampled answers largely agree with y; low O signals many contradictions.
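
The scoring steps above reduce to simple arithmetic once the contradiction probabilities are known. The sketch below works through them with made‑up values of pᵢ and rᵢ; the function name and the numbers are assumptions, and no LLM or NLI model is called.

```python
def observed_consistency_score(
    contradiction_probs: list[float],
    exact_matches: list[bool],
    alpha: float = 0.8,
) -> tuple[list[float], float]:
    """Return per-sample scores o_i = alpha*(1 - p_i) + (1 - alpha)*r_i and their mean O."""
    if not contradiction_probs or len(contradiction_probs) != len(exact_matches):
        raise ValueError("need one exact-match flag per contradiction probability")
    o_scores = [
        alpha * (1.0 - p) + (1.0 - alpha) * (1.0 if r else 0.0)
        for p, r in zip(contradiction_probs, exact_matches)
    ]
    return o_scores, sum(o_scores) / len(o_scores)


# Hypothetical values for k = 3 samples
p = [0.1, 0.5, 0.9]        # NLI contradiction probability vs. the original answer
r = [True, False, False]   # whether the sample matches the original answer exactly
o_scores, O = observed_consistency_score(p, r, alpha=0.8)
print([round(o, 2) for o in o_scores], round(O, 3))
```

The first sample agrees and matches exactly, so it scores 0.92; the third mostly contradicts the original and scores 0.08, pulling the aggregate O down to roughly 0.467.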

### Prompt template for observed consistency
![alt text](<papers/Quantifying Uncertainty in Answers from any Language Model and Enhancing their Trustworthiness/images/Prompt Template for Observed Consistency.png>)


In [1]:
# To run this demo, first install dependencies:
# pip install openai transformers torch python-dotenv huggingface_hub

import os
import openai
from dotenv import load_dotenv
from transformers import pipeline
from huggingface_hub import login

In [2]:
load_dotenv(override=True)

openai.api_key = os.getenv("OPENAI_API_KEY")
login(token=os.getenv("HUGGINGFACE_TOKEN"))

In [6]:
def get_original_answer(question: str) -> str:
    resp = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
        temperature=0.0,
    )
    return resp.choices[0].message.content.strip()

# Function to sample k diverse responses with CoT prompt at temperature=1.0
def sample_responses(question: str, k: int = 5, cot_prompt: str = "Think step-by-step:") -> list[str]:
    samples = []
    for _ in range(k):
        prompt = f"{question}\n{cot_prompt}"
        resp = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,
        )
        samples.append(resp.choices[0].message.content.strip())
    return samples

# Load NLI model for contradiction detection
nli = pipeline("text-classification", model="roberta-large-mnli")

def contradiction_prob(hypothesis: str, premise: str) -> float:
    """
    Returns the probability that 'hypothesis' contradicts 'premise',
    averaging both input orders to mitigate positional bias.
    """
    def p_contra(first: str, second: str) -> float:
        # Pass the texts as a sentence pair so the tokenizer inserts the
        # separator tokens itself, and request scores for every class
        # (top_k=None); the default call returns only the top label.
        result = nli({"text": first, "text_pair": second}, top_k=None)
        # Some transformers versions nest the per-class scores one level deeper
        if result and isinstance(result[0], list):
            result = result[0]
        probs = {item["label"].lower(): item["score"] for item in result}
        return probs.get("contradiction", 0.0)

    p1 = p_contra(premise, hypothesis)
    p2 = p_contra(hypothesis, premise)
    return (p1 + p2) / 2.0

# 5. Compute Observed Consistency
def observed_consistency(question: str, k: int = 5, alpha: float = 0.9):
    # original answer
    y = get_original_answer(question)
    # sampled alternatives
    samples = sample_responses(question, k=k)
    # compute o_i for each sample
    o_scores = []
    for yi in samples:
        p_c = contradiction_prob(yi, y)
        sim = 1.0 - p_c
        ri = 1.0 if yi.strip() == y.strip() else 0.0
        o_i = alpha * sim + (1.0 - alpha) * ri
        o_scores.append(o_i)
    O = sum(o_scores) / len(o_scores)
    return y, samples, o_scores, O

# 6. Demo run
question = "Who was the only survivor of Titanic?"
original, variants, scores, O_score = observed_consistency(question)

print(f"Question: {question}")
print(f"Original Answer: {original}\n")
for idx, (v, sc) in enumerate(zip(variants, scores), 1):
    print(f"{idx}. {v[:60]}...   o_{idx} = {sc:.3f}")
print(f"\nObserved Consistency score O = {O_score:.3f}")



Question: Who was the only survivor of Titanic?
Original Answer: The only survivor of the Titanic disaster was not a single individual, as there were actually many survivors. However, if you are referring to a specific person often highlighted in stories about the Titanic, it could be referring to individuals like Molly Brown, who became famous for her efforts to help others during the disaster. 

If you meant the last living survivor, that was Millvina Dean, who was just a few months old at the time of the sinking in 1912. She passed away in 2009. If you have a specific context in mind, please clarify!

1. The Titanic sank on April 15, 1912, during its maiden voyage...   o_1 = 0.900
2. To determine the only survivor of the Titanic, we need to cl...   o_2 = 0.900
3. To clarify the question about the only survivor of the Titan...   o_3 = 0.900
4. The only survivor of the Titanic is a common misconception. ...   o_4 = 0.627
5. The Titanic, which sank on April 15, 1912, had many passenge.

# Self‑reflection Certainty
- Like CoT prompting, self-reflection allows the LLM to employ additional computation to reason more deeply about the correctness of its answer.
- To specifically calculate self-reflection certainty, we prompt the LLM to state how confident it is that its original answer was correct.
- Asking LLMs to rate their confidence numerically on a continuous scale (0-100) tended to yield overly high scores (> 90). Instead, we ask the LLM to rate its confidence in its original answer via multiple follow-up questions, each on a multiple-choice (e.g. 3-way) scale.
- Follow‑up multiple‑choice: Prompt the LLM with its original question & answer, asking e.g.:
  - "Is your answer correct? (A) Correct (B) Incorrect (C) I am not sure."
  - Repeat with varied wording.
- Scoring: Map A→1.0, B→0.0, C→0.5, average across rounds → S.
  - Captures the model’s own calibrated judgment.
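
The scoring rule is a lookup followed by an average. A minimal sketch, assuming the letter choices have already been extracted from the model's replies (the function name and the example choices are hypothetical):

```python
SCORE_MAP = {"A": 1.0, "B": 0.0, "C": 0.5}  # Correct / Incorrect / Not sure


def self_reflection_score(choices: list[str]) -> float:
    """Average the mapped scores of the multiple-choice judgments to obtain S."""
    if not choices:
        raise ValueError("need at least one judgment")
    unknown = [c for c in choices if c not in SCORE_MAP]
    if unknown:
        raise ValueError(f"unrecognised choices: {unknown}")
    return sum(SCORE_MAP[c] for c in choices) / len(choices)


# Hypothetical judgments from three differently worded follow-up prompts
S = self_reflection_score(["A", "C", "A"])  # (1.0 + 0.5 + 1.0) / 3
print(round(S, 3))
```

Rejecting unrecognised choices is a design choice of this sketch; an alternative is to map unparseable replies to a fixed score.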

### Prompt template for self-reflection certainty
![alt text](<papers/Quantifying Uncertainty in Answers from any Language Model and Enhancing their Trustworthiness/images/Prompt Template for Self-Reflection Consistency.png>)

In [11]:
def get_original_answer(question: str) -> str:
    """Fetch the model's original answer at temperature=0."""
    resp = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
        temperature=0.0
    )
    return resp.choices[0].message.content.strip()

def self_reflection_certainty(question: str, answer: str, rounds: int = 5):
    """
    Compute Self-reflection Certainty S by prompting the LLM to judge its own answer.
    Returns:
      - judgments: list of choices (A/B/C)
      - scores: list of numeric scores (A→1.0, B→0.0, C→0.5)
      - S: average score
    """
    score_map = {"A": 1.0, "B": 0.0, "C": 0.5}
    prompts = [
        f"Question: {question}\nAnswer: {answer}\nIs this answer correct? Choose (A) Correct (B) Incorrect (C) I am not sure.",
        f"Question: {question}\nAnswer: {answer}\nHow confident are you in your answer? (A) Correct (B) Incorrect (C) I am not sure."
    ]
    judgments, scores = [], []

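    for i in range(rounds):
        # Cycle through the prompt variants
        prompt = prompts[i % len(prompts)]
        resp = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0
        )
        text = resp.choices[0].message.content.strip()
        # Extract the letter choice: accept replies such as "A", "(A)" or "A) Correct",
        # but not words that merely start with a valid letter (e.g. "Correct", "Based on").
        head = text.lstrip("( ")
        is_option = head[:1] in score_map and not head[1:2].isalpha()
        choice = head[0] if is_option else None
        if choice is None:
            # Fallback: look for a parenthesised option such as "(B)" anywhere in the reply
            for letter in score_map:
                if f"({letter})" in text:
                    choice = letter
                    break
        judgments.append(choice)
        # An unparseable reply (choice is None) is scored as 0.0
        scores.append(score_map.get(choice, 0.0))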

    S = sum(scores) / len(scores) if scores else 0.0
    return judgments, scores, S

# Demo usage
question = "Who was the only survivor of Titanic?"
original_answer = get_original_answer(question)
judgments, numeric_scores, S_score = self_reflection_certainty(question, original_answer)

print(f"Question: {question}")
print(f"Original Answer: {original_answer}\n")
for idx, (j, s) in enumerate(zip(judgments, numeric_scores), 1):
    print(f"Round {idx}: choice = {j}, score = {s}")
print(f"\nSelf-reflection Certainty S = {S_score:.3f}")


Question: Who was the only survivor of Titanic?
Original Answer: The only survivor of the Titanic disaster was not a single individual, but rather there were several survivors. However, if you are referring to a notable survivor, one of the most famous is Molly Brown, often referred to as "The Unsinkable Molly Brown." She was known for her efforts to help others during the disaster and for her later activism. 

If you meant to ask about a specific individual who was the last living survivor, that would be Millvina Dean, who was just a few months old at the time of the sinking. She passed away in 2009. 

If you have a different context in mind, please clarify!

Round 1: choice = A, score = 1.0
Round 2: choice = A, score = 1.0
Round 3: choice = A, score = 1.0
Round 4: choice = A, score = 1.0
Round 5: choice = A, score = 1.0

Self-reflection Certainty S = 1.000


# Total Trustworthiness score

#### Overall Confidence Estimate
![alt text](<papers/Quantifying Uncertainty in Answers from any Language Model and Enhancing their Trustworthiness/images/Overall Confidence Estimate.png>)

In [14]:
# === Final Confidence ===
def bsdetector_confidence(question: str,
                          k: int = 5,
                          alpha: float = 0.9,
                          beta: float = 0.5) -> float:
    """
    Compute final BSDetector confidence score C = beta*O + (1-beta)*S.
    """
    # Reuse the reference answer returned by observed_consistency so that
    # O and S are both computed for the same answer y.
    y, _, _, score_O = observed_consistency(question, k=k, alpha=alpha)
    _, _, score_S = self_reflection_certainty(question, y, rounds=2)
    C = beta * score_O + (1 - beta) * score_S
    return C

q = "Who was the only survivor of Titanic?"
# Compute O and S once for display
y, _, _, score_O = observed_consistency(q)
_, _, score_S = self_reflection_certainty(q, y)
# bsdetector_confidence draws fresh samples and uses rounds=2, so the printed C
# need not equal beta*score_O + (1-beta)*score_S for the values shown here.
C = bsdetector_confidence(q)
print(f"Question: {q}")
print(f"Original Answer: {y}")
print(f"Observed Consistency (O): {score_O:.3f}")
print(f"Self-reflection Certainty (S): {score_S:.3f}")
print(f"Final BSDetector Confidence (C): {C:.3f}")

Question: Who was the only survivor of Titanic?
Original Answer: The only survivor of the Titanic disaster was not a single individual, as there were many survivors. However, if you are referring to a notable survivor, one of the most famous is Molly Brown, often referred to as "The Unsinkable Molly Brown." She was known for her efforts to help others during the disaster and her subsequent activism. If you meant a specific individual or a different context, please clarify!
Observed Consistency (O): 0.900
Self-reflection Certainty (S): 1.000
Final BSDetector Confidence (C): 0.933


# Experimental Validation
- **Benchmarks**: Math word problems (GSM8K, SVAMP), commonsense QA (CSQA), open trivia (TriviaQA).
- **Baselines**:
  - Likelihood‑based uncertainty (token prob).
  - Temperature‑sampling variance.
  - Prior self‑reflection methods.
- **Calibration (AUROC)**: BSDetector achieves AUROC ≈ 0.77–0.95, outperforming all baselines by large margins (e.g. on GSM8K: 0.867→0.951; SVAMP: 0.936).
- **Improved answer selection**: Generating 5 candidates & choosing the one with highest C raises accuracy (e.g. GSM8K from 47 %→70 %, SVAMP 75 %→82 %).
- **Reliable LLM‑based evaluation**:
  - Human‑in‑loop: route low‑C evaluations for human review, boosting overall evaluation accuracy vs. random routing.
  - Fully automated: dropping bottom‑20 % by C yields evaluation scores with lower variance and bias.
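
For readers unfamiliar with the metric, the sketch below computes AUROC by direct pairwise comparison. It assumes the metric is applied to confidence scores paired with correct/incorrect flags; the data are invented for illustration and are not results from the paper.

```python
def auroc(confidences: list[float], correct: list[bool]) -> float:
    """
    Probability that a randomly chosen correct answer receives a higher
    confidence than a randomly chosen incorrect one (ties count as 0.5).
    """
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical confidence scores and correctness labels for five answers
conf = [0.95, 0.80, 0.60, 0.40, 0.30]
is_correct = [True, True, False, True, False]
print(round(auroc(conf, is_correct), 3))
```

An AUROC of 0.5 means the score is no better than chance at separating correct from incorrect answers, while 1.0 means every correct answer outranks every incorrect one. The pairwise loop is quadratic, which is fine for a small illustration but not for large evaluation sets.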

# Applications
- **Adaptive routing**: "I don’t know" or human‑fallback when C is low.
- **Answer boosting**: Select among multiple LLM replies via highest C.
- **Trustworthy automated evaluation**: Filter or flag low‑confidence judgments in LLM evaluators.
- **Real‑world demo**: Legal drafting assistant that flags low‑C clauses for attorney review.
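
The first three applications can be sketched as small helpers over answers that already carry a confidence score. The names (`ScoredAnswer`, `select_best`, `route`, `drop_lowest`), the 0.7 threshold, and the candidate data are all assumptions for illustration.

```python
from typing import NamedTuple

FALLBACK = "I don't know (flagged for human review)"


class ScoredAnswer(NamedTuple):
    text: str
    confidence: float  # the confidence score C in [0, 1]


def select_best(candidates: list[ScoredAnswer]) -> ScoredAnswer:
    """Answer boosting: keep the candidate with the highest confidence."""
    return max(candidates, key=lambda c: c.confidence)


def route(answer: ScoredAnswer, threshold: float = 0.7) -> str:
    """Adaptive routing: return the answer only when confidence clears the threshold."""
    return answer.text if answer.confidence >= threshold else FALLBACK


def drop_lowest(items: list[ScoredAnswer], fraction: float = 0.2) -> list[ScoredAnswer]:
    """Automated filtering: discard the lowest-confidence fraction of items."""
    keep = len(items) - int(len(items) * fraction)
    return sorted(items, key=lambda c: c.confidence, reverse=True)[:keep]


# Hypothetical candidates with made-up confidence scores
candidates = [
    ScoredAnswer("Paris", 0.92),
    ScoredAnswer("Lyon", 0.35),
    ScoredAnswer("Marseille", 0.10),
    ScoredAnswer("Nice", 0.55),
    ScoredAnswer("Lille", 0.20),
]
best = select_best(candidates)
print(route(best), "|", route(candidates[1]))
print([c.text for c in drop_lowest(candidates)])
```

The threshold and the dropped fraction are tuning knobs; suitable values depend on how costly a wrong answer is relative to a human review.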

# Limitations & Future Work
- Compute overhead: Requires multiple extra LLM calls and NLI passes.
- Domain generalization: NLI and prompts tuned on QA—other tasks may need adaptation.
- Black‑box constraints: Cannot leverage token‑level signals for GPT‑3.5; reliant on prompt engineering.
- Future directions:
  - Reduce sampling costs (e.g. via smarter subset selection).
  - Extend to non‑QA tasks (summarization, dialogue).
  - Explore alternative contradiction detectors beyond NLI.
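
To put the compute overhead noted above in rough numbers, the sketch below counts the extra work per scored answer. It assumes one LLM call per sampled answer and per self‑reflection round, and two NLI passes per sample (one for each input order), as described in this document; the function name is hypothetical.

```python
def extra_cost(k: int = 5, rounds: int = 2) -> dict[str, int]:
    """Count additional LLM calls and NLI forward passes needed to score one answer."""
    if k < 1 or rounds < 1:
        raise ValueError("k and rounds must be positive")
    return {
        "extra_llm_calls": k + rounds,   # k samples + self-reflection prompts
        "nli_forward_passes": 2 * k,     # both input orders for each sample
    }


print(extra_cost(k=5, rounds=2))
```

With k = 5 and two reflection rounds, each answer costs seven additional LLM calls and ten NLI passes on top of the original query.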

# Key takeaway
BSDetector offers a simple, general wrapper to endow any black‑box LLM with well‑calibrated confidence estimates—enabling safer, more trustworthy AI deployments.