
# LLM Uncertainty Quantification Notebook

This notebook collects four complementary techniques that can be applied to large language models (LLMs) to reason about prediction uncertainty without depending on any specific vendor API. We mock some LLM outputs so that each method has deterministic, reproducible inputs.

**Methods covered**
1. Token log-probability aggregation → perplexity bands
2. Predictive entropy over the next-token distribution
3. Self-consistency sampling and agreement rate
4. Semantic dispersion across samples via embedding/similarity space


In [None]:

import numpy as np
import pandas as pd
import difflib
from itertools import combinations
from collections import Counter

rng = np.random.default_rng(42)

# Mock log-probabilities for 5 responses, each with 8 tokens (natural log space)
token_logprobs = -rng.gamma(shape=2.5, scale=0.5, size=(5, 8))

# Mock probability distribution for the next token after a shared prefix
next_token_probs = rng.random(12)
next_token_probs /= next_token_probs.sum()
token_vocab = [f"token_{i}" for i in range(len(next_token_probs))]

# Mock completions for a reasoning-style prompt
sample_completions = [
    "The minimum cost occurs when x0 = 2.1 and x1 = 4.0.",
    "Optimality is achieved at x0=2.0, x1=4.0 with minimal fuel usage.",
    "We find a feasible plan near x0=6.2, x1=1.5 but it violates energy.",
    "The minimum cost occurs when x0 = 2.1 and x1 = 4.0.",
    "Solution favors x0≈1.9, x1≈4.2 to meet demand.",
]



## 1. Token Log-Probabilities → Perplexity Bands

Many language models expose token-level log-probabilities. Averaging their negatives over a response gives the mean negative log-likelihood per token (the cross-entropy under the model), and exponentiating that average gives the perplexity. Higher cross-entropy (and thus higher perplexity) signals lower model confidence in the text it produced. This metric can be tracked per sample or bucketed into qualitative bands.
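
As a minimal sketch of why the average, not the sum, is the quantity to compare across responses, the snippet below computes all three statistics for two hand-made responses. The helper name `sequence_stats` and the constant per-token probability of 0.5 are illustrative assumptions, not part of the notebook's mock data.

```python
import math

def sequence_stats(token_logprobs):
    """Return (total log-likelihood, mean NLL per token, perplexity)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    total_loglik = sum(token_logprobs)
    mean_nll = -total_loglik / len(token_logprobs)
    return total_loglik, mean_nll, math.exp(mean_nll)

# Hypothetical responses in which every token has probability 0.5.
short_response = [math.log(0.5)] * 4
long_response = [math.log(0.5)] * 16

for name, logprobs in [("short", short_response), ("long", long_response)]:
    total, nll, ppl = sequence_stats(logprobs)
    print(f"{name}: total={total:.3f} mean_nll={nll:.3f} perplexity={ppl:.3f}")
```

Both responses have the same per-token perplexity, while the summed log-likelihood of the longer one is four times more negative. Banding on the sum would therefore penalize length rather than uncertainty.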


In [None]:

avg_cross_entropy = -token_logprobs.mean(axis=1)  # mean NLL per token (nats); lower = more confident
perplexity = np.exp(avg_cross_entropy)
summary_df = pd.DataFrame({
    "sample_id": [f"S{i}" for i in range(len(perplexity))],
    "avg_cross_entropy": avg_cross_entropy,
    "perplexity": perplexity,
})
summary_df["band"] = pd.cut(
    summary_df.perplexity,
    bins=[0, 1.5, 2.5, np.inf],
    labels=["confident", "medium", "high-uncertainty"],
)
summary_df



## 2. Predictive Entropy of the Next-Token Distribution

During generative decoding, the entropy of the next-token distribution captures how peaked or flat the model's prediction is. Lower entropy implies the model strongly prefers one continuation, whereas higher entropy indicates uncertainty or several plausible branches.
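
Raw entropy depends on how many candidate tokens are considered, so it can help to rescale it by its maximum, $\log K$ for $K$ candidates. The sketch below is one way to do that; `normalized_entropy` is a hypothetical helper, and it assumes the input is a complete probability vector that sums to 1.

```python
import numpy as np

def normalized_entropy(probs):
    """Entropy divided by log(K): 0 for a one-hot vector, 1 for a uniform one."""
    p = np.asarray(probs, dtype=float)
    if p.ndim != 1 or p.size < 2:
        raise ValueError("expected a 1-D vector with at least two entries")
    if np.any(p < 0) or not np.isclose(p.sum(), 1.0):
        raise ValueError("probabilities must be non-negative and sum to 1")
    nonzero = p[p > 0]  # treat 0 * log(0) as 0
    entropy_nats = -np.sum(nonzero * np.log(nonzero))
    return max(0.0, float(entropy_nats / np.log(p.size)))

print(normalized_entropy([0.25, 0.25, 0.25, 0.25]))  # flat distribution
print(normalized_entropy([1.0, 0.0, 0.0, 0.0]))      # fully peaked distribution
print(normalized_entropy([0.7, 0.1, 0.1, 0.1]))      # somewhere in between
```

The normalized value is comparable across distributions with different numbers of candidates, which the raw value in bits or nats is not.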


In [None]:

log_probs = np.log(next_token_probs + 1e-12)
entropy_nats = -np.sum(next_token_probs * log_probs)
entropy_bits = entropy_nats / np.log(2)

df_entropy = pd.DataFrame({
    "token": token_vocab,
    "probability": next_token_probs,
    "contribution": -next_token_probs * log_probs,
}).sort_values("probability", ascending=False)

print(f"Predictive entropy: {entropy_bits:.3f} bits")
df_entropy.head(10)



## 3. Self-Consistency Sampling and Agreement Rate

Querying the LLM multiple times with sampling enabled and measuring how often the answers agree yields an empirical uncertainty estimate. Agreement can be exact-match, fuzzy, or task-specific (e.g., numeric tolerance). Low agreement is commonly read as a sign of epistemic uncertainty caused by unstable reasoning, although the sampling temperature also affects it.
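
For the numeric-tolerance variant mentioned above, one possible sketch is to parse the reported values and count how many samples fall within a tolerance of a reference sample. The regular expression, the helper names, and the `tol` values are assumptions chosen to fit answers of the form `x0 = 2.1`; other answer formats would need their own parser.

```python
import re

# Matches "x0 = 2.1", "x0=2.0", "x0≈1.9" and captures (index, value).
VALUE_PATTERN = re.compile(r"x(\d+)\s*[=≈~]\s*(-?\d+(?:\.\d+)?)")

def extract_values(text):
    """Map variable index to the numeric value reported in the text."""
    return {int(idx): float(val) for idx, val in VALUE_PATTERN.findall(text)}

def within_tolerance(a, b, tol):
    """True if both answers report the same variables with values within tol."""
    return bool(a) and a.keys() == b.keys() and all(abs(a[k] - b[k]) <= tol for k in a)

def tolerant_agreement(texts, tol):
    """Largest fraction of samples that lie within tol of a single reference sample."""
    if not texts:
        raise ValueError("need at least one completion")
    parsed = [extract_values(t) for t in texts]
    best = max(sum(within_tolerance(ref, other, tol) for other in parsed) for ref in parsed)
    return best / len(texts)

completions = [
    "The minimum cost occurs when x0 = 2.1 and x1 = 4.0.",
    "Optimality is achieved at x0=2.0, x1=4.0 with minimal fuel usage.",
    "We find a feasible plan near x0=6.2, x1=1.5 but it violates energy.",
    "The minimum cost occurs when x0 = 2.1 and x1 = 4.0.",
    "Solution favors x0≈1.9, x1≈4.2 to meet demand.",
]

print(tolerant_agreement(completions, tol=0.05))
print(tolerant_agreement(completions, tol=0.25))
```

With a tight tolerance only the two identical answers agree. With a looser one, four of the five samples land in the same neighborhood, which exact string matching cannot see.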


In [None]:

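def normalize_answer(text: str) -> str:
    """Lowercase, trim, and drop approximation markers before exact matching."""
    text = text.lower().strip()
    # Remove both '≈' and '~' so "x0≈1.9" and "x0~1.9" normalize identically.
    text = text.replace('≈', '').replace('~', '')
    return text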

normalized = [normalize_answer(o) for o in sample_completions]
counter = Counter(normalized)
most_common_answer, freq = counter.most_common(1)[0]
agreement_ratio = freq / len(sample_completions)

print(f"Most common answer (normalized): {most_common_answer}")
print(f"Agreement ratio: {agreement_ratio:.2f}")
print(f"Epistemic uncertainty proxy (1 - agreement): {1 - agreement_ratio:.2f}")

Counter(sample_completions)



## 4. Semantic Dispersion via Similarity Matrix

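Instead of simple agreement, compute pairwise similarity (lexical or embedding-based). The mean pairwise similarity reflects how tightly clustered the responses are, so $1 - \text{mean similarity}$ behaves like an uncertainty score. With an embedding-based similarity, the score highlights completions that diverge semantically even when their wording differs only slightly; the lexical `difflib` ratio used in the cell below measures character overlap and is only a rough proxy for meaning.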
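
For the embedding-based variant, the same dispersion score can be computed from cosine similarities between vectors. The sketch below uses `toy_embed`, a hypothetical bag-of-words stand-in, so that it runs without any model. It is not a real embedding; in practice a sentence-embedding model of your choice would supply the vectors instead.

```python
import re
import numpy as np

def toy_embed(texts):
    """Hypothetical stand-in for an embedding model: word-count vectors."""
    tokenized = [re.findall(r"\w+", t.lower()) for t in texts]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = np.zeros((len(texts), len(vocab)))
    for row, toks in enumerate(tokenized):
        for tok in toks:
            vectors[row, index[tok]] += 1.0
    return vectors

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # avoid division by zero
    return unit @ unit.T

def dispersion(vectors):
    """1 - mean off-diagonal cosine similarity."""
    n = len(vectors)
    if n < 2:
        raise ValueError("need at least two samples")
    sims = cosine_similarity_matrix(vectors)
    return 1.0 - float(sims[np.triu_indices(n, k=1)].mean())

texts = ["a b c", "a b c", "d e f"]  # two identical samples, one disjoint
print(f"Dispersion: {dispersion(toy_embed(texts)):.3f}")
```

Only the function that produces the vectors needs to change when moving to real embeddings; the similarity matrix and the dispersion score stay the same.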


In [None]:

def lexical_similarity(a: str, b: str) -> float:
    return difflib.SequenceMatcher(None, a, b).ratio()

N = len(sample_completions)
S = np.eye(N)
for i, j in combinations(range(N), 2):
    sim = lexical_similarity(sample_completions[i], sample_completions[j])
    S[i, j] = S[j, i] = sim

upper = S[np.triu_indices(N, k=1)]
mean_similarity = upper.mean()
uncertainty_score = 1 - mean_similarity

print(f"Mean similarity: {mean_similarity:.3f}")
print(f"Semantic dispersion (uncertainty): {uncertainty_score:.3f}")

pd.DataFrame(S, columns=[f"S{i}" for i in range(N)], index=[f"S{i}" for i in range(N)])
