# LLM Consistency and Stability Evaluation

## Overview
This notebook implements a systematic framework for evaluating the **output consistency** and **stability** of large language models (LLMs) under repeated inference with identical inputs.  
The analysis quantifies stochastic variability in both *numerical outputs* (probabilities) and *semantic outputs* (text explanations) generated by a fine-tuned causal language model.

---

## 1. Experimental Objective
Given a model $ f_\theta(x) $ that maps prompts $ x $ to probabilistic text outputs, we assess the degree to which  
$$
f_\theta^{(r)}(x) \approx f_\theta^{(s)}(x)
$$
holds for independent runs $ r, s \in \{1, \dots, N\} $ that differ only in their random seed, so that any disagreement reflects stochasticity in inference rather than a change of input.

Two aspects of stability are investigated:

1. **Probability Stability** – Consistency of numeric outputs (the “buy” probability).  
2. **Semantic Stability** – Consistency of the natural-language *explanations* across runs.

---

## 2. Prompt Construction
Each prompt is generated using random combinations of *traits* and *contexts* describing an individual's behavioral tendencies and situational factors.  
Formally, let:
$$
x_i = (T_i, C_i)
$$
where $T_i \subseteq \mathcal{T}$ and $C_i \subseteq \mathcal{C}$ are random subsets of predefined sets of traits and contexts.

The textual prompt $p_i$ is formed using a fixed template:
$$
p_i = \text{Template}(T_i, C_i)
$$
and the model output is expected in structured JSON form:
$$
y_i = \{ \text{buy}: p_i^{(buy)}, \; \text{explanation}: e_i \}
$$
where $p_i^{(buy)} \in [0,1]$ and $e_i$ is free-form text.
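
As a minimal sketch of the construction $x_i = (T_i, C_i) \rightarrow p_i$, the snippet below draws random subsets and renders a template. The shortened pools, the `TEMPLATE` string, and the helper `sample_prompt` are illustrative assumptions. Unlike the notebook code, this sketch joins the items with commas and uses a dedicated seeded generator so that the prompt set can be reproduced.

```python
import random

# Hypothetical, shortened trait/context pools used only for illustration.
TRAITS = ["likes sweets", "health-conscious", "impulsive buyer"]
CONTEXTS = ["hungry", "on a diet", "hot summer day", "after work"]

# Hypothetical, shortened template; doubled braces render as literal braces.
TEMPLATE = (
    "Traits: {traits}\n"
    "Context: {context}\n"
    'Return JSON with: {{"buy": <float between 0 and 1>, "explanation": "<text>"}}'
)

def sample_prompt(rng: random.Random, max_items: int = 3) -> tuple[list[str], list[str], str]:
    """Draw x_i = (T_i, C_i) and render p_i = Template(T_i, C_i)."""
    traits = rng.sample(TRAITS, k=rng.randint(1, min(max_items, len(TRAITS))))
    contexts = rng.sample(CONTEXTS, k=rng.randint(1, min(max_items, len(CONTEXTS))))
    prompt = TEMPLATE.format(traits=", ".join(traits), context=", ".join(contexts))
    return traits, contexts, prompt

rng = random.Random(0)  # seeded generator: the same seed yields the same prompt
traits, contexts, prompt = sample_prompt(rng)
print(prompt)
```

Passing an explicit `random.Random` instance keeps prompt sampling independent of any seeding done later for the generation runs.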

---

## 3. Experimental Design

- **Model**: A locally stored LLM (e.g., `gemma-3-4b-it`).
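- **Inference settings**: Deterministic (greedy) decoding (`do_sample=False`) with `repetition_penalty=1.1`, maximum 200 new tokens per generation.
- **Repetitions**: $ N = 5 $ independent runs with distinct random seeds. Because greedy decoding draws no random numbers, the seed does not affect token selection; any run-to-run difference would have to come from numerical nondeterminism in the computation.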
- **Prompts**: $ M = 30 $ distinct input contexts.

This yields a data tensor of outputs:
$$
Y = \{ y_{r,i} = (p_{r,i}, e_{r,i}) \mid r = 1, \dots, N; \, i = 1, \dots, M \}
$$
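
The interaction between seeds and deterministic decoding can be illustrated with a toy next-token distribution. This is a sketch with a hypothetical four-token vocabulary, not the model's actual decoding loop: the greedy choice is the same for every seed, whereas a sampled choice depends on the seed.

```python
import numpy as np

# Hypothetical next-token distribution over a 4-token vocabulary.
probs = np.array([0.5, 0.3, 0.15, 0.05])

def greedy_pick(seed: int) -> int:
    """Greedy decoding: argmax ignores the seed entirely."""
    return int(np.argmax(probs))

def sampled_pick(seed: int) -> int:
    """Sampling: the drawn token depends on the seeded generator."""
    rng = np.random.default_rng(seed)
    return int(rng.choice(len(probs), p=probs))

greedy_picks = {greedy_pick(s) for s in range(20)}
sampled_picks = {sampled_pick(s) for s in range(20)}
print("greedy:", greedy_picks)
print("sampled:", sampled_picks)
```

Under greedy decoding the experiment therefore measures numerical reproducibility of the inference stack; measuring sampling variability would require enabling sampling.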

---

## 4. Probability Stability Metrics

Let $ P \in \mathbb{R}^{N \times M} $ denote the matrix of predicted probabilities.

1. **Per-prompt standard deviation**  
   $$
   \sigma_i = \sqrt{\frac{1}{N} \sum_{r=1}^N (p_{r,i} - \bar{p}_i)^2}
   $$
   with $\bar{p}_i = \frac{1}{N}\sum_{r=1}^N p_{r,i}$.

   The mean standard deviation over prompts measures overall dispersion:
   $$
   \bar{\sigma} = \frac{1}{M}\sum_{i=1}^M \sigma_i
   $$

2. **Mean Absolute Relative Difference (MARD)**  
   For all run pairs $ (r,s) $,
   $$
   \text{MARD}_{r,s} = \frac{1}{M} \sum_{i=1}^M 
   \frac{|p_{r,i} - p_{s,i}|}{\max\left( \frac{p_{r,i}+p_{s,i}}{2}, \epsilon \right)}
   $$
   and the global MARD is the average across all pairs.

3. **Intraclass Correlation Coefficient (ICC)**  
   A simplified reliability index computed as:
   $$
   \text{ICC} = \frac{\mathrm{Var}(\bar{p}_i)}{\mathrm{Var}(\bar{p}_i) + \mathrm{Var}(p_{r,i} - \bar{p}_i)}
   $$
   representing the ratio of between-prompt variance to total variance.
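
The three metrics above can be checked by hand on a small example. The sketch below uses a made-up matrix with $N = 3$ runs and $M = 2$ prompts; the values are illustrative and not taken from the experiment.

```python
import numpy as np
from itertools import combinations

# Toy probability matrix P with N = 3 runs (rows) and M = 2 prompts (columns).
P = np.array([
    [0.2, 0.8],
    [0.4, 0.8],
    [0.3, 0.8],
])

# Per-prompt population standard deviation (1/N normalisation), then its mean.
sigma = P.std(axis=0)
mean_sigma = float(sigma.mean())

# MARD: relative difference per prompt for each run pair,
# averaged over prompts and then over all pairs.
eps = 1e-8
pair_mard = [
    np.mean(np.abs(P[r] - P[s]) / np.maximum((P[r] + P[s]) / 2, eps))
    for r, s in combinations(range(P.shape[0]), 2)
]
mard = float(np.mean(pair_mard))

# Simplified ICC: variance of prompt means over that plus the residual variance.
prompt_means = P.mean(axis=0)
between = prompt_means.var()
within = (P - prompt_means).var()
icc = float(between / (between + within))

print(f"mean sigma = {mean_sigma:.4f}, MARD = {mard:.4f}, ICC = {icc:.4f}")
```

For this toy matrix the values come out to roughly 0.041 (mean SD), 0.225 (MARD), and 0.949 (ICC). The second prompt is perfectly stable and contributes zero to both dispersion measures. Note that the ICC is undefined (0/0) if all prompts have identical means and no run-to-run variation.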

---

## 5. Semantic Stability Metrics

Let $E_{r,i}$ denote the embedding of explanation $e_{r,i}$ computed via a SentenceTransformer model.  
For each prompt $i$:

$$
S_i = \frac{2}{N(N-1)} \sum_{r < s} \cos(E_{r,i}, E_{s,i})
$$
where $\cos(\cdot,\cdot)$ denotes cosine similarity.

The average semantic similarity across all prompts is:
$$
\bar{S} = \frac{1}{M}\sum_{i=1}^M S_i
$$
High $\bar{S}$ indicates consistent semantic meaning across runs.
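
To make $S_i$ concrete, the sketch below computes the mean pairwise cosine similarity with plain NumPy on hypothetical 2-D vectors standing in for real sentence embeddings; the helper name `mean_pairwise_cosine` is an assumption for illustration.

```python
import numpy as np
from itertools import combinations

def mean_pairwise_cosine(embeddings: np.ndarray) -> float:
    """Return S_i: mean cosine similarity over all unordered pairs of rows."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    sims = [float(unit[r] @ unit[s]) for r, s in combinations(range(len(unit)), 2)]
    return float(np.mean(sims))

# Hypothetical 2-D "embeddings" for N = 3 runs of one prompt:
# two identical explanations and one orthogonal outlier.
E_i = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
S_i = mean_pairwise_cosine(E_i)
print(f"S_i = {S_i:.4f}")
```

Here one of the three pairs has similarity 1 and the other two have similarity 0, so $S_i = 1/3$. A single divergent run lowers two of the three pairwise terms, which is why $S_i$ reacts strongly to outliers when $N$ is small.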

---

## 6. Computation Pipeline

1. **Model Loading**  
   The tokenizer and model are loaded from a local directory with memory-optimized settings (`bfloat16` weights, `device_map` set to the GPU when available and the CPU otherwise).

2. **Prompt Generation**  
   Randomized sets of traits and contexts are combined to produce $M$ prompts. Uniqueness is not enforced, so duplicate prompts are possible.

3. **Batch Generation**  
   Prompts are encoded, generated in batches, and decoded into text outputs.

4. **Parsing**  
   JSON outputs are parsed to extract numerical probabilities and textual explanations.

5. **Stability Computation**  
   - Numerical metrics: standard deviation, MARD, ICC.  
   - Text metrics: mean cosine similarity between explanation embeddings.

6. **GPU Memory Management**  
   After each generation cycle, CUDA memory is explicitly cleared to prevent fragmentation and OOM errors.
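
The parsing step can be sketched as follows. This is one possible stricter variant, not the notebook's own function: the name `parse_output`, the range check on the probability, and the use of NaN as the failure marker are assumptions made for illustration.

```python
import json
import math

def parse_output(text: str) -> tuple[float, str]:
    """Extract (buy probability, explanation) from a raw generation.

    Returns (nan, raw text) when no JSON object with a numeric "buy"
    field in [0, 1] can be recovered.
    """
    # Take the span from the first "{" to the last "}" so that
    # surrounding text such as Markdown code fences is ignored.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return math.nan, text
    try:
        obj = json.loads(text[start:end + 1])
        prob = float(obj["buy"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return math.nan, text
    if not 0.0 <= prob <= 1.0:  # also rejects NaN
        return math.nan, text
    return prob, str(obj.get("explanation", "")).strip()

raw = '```json\n{"buy": 0.75, "explanation": " It is hot. "}\n```'
print(parse_output(raw))
```

Returning NaN instead of `None` keeps the probability matrix numeric, so NaN-aware reductions such as `np.nanmean` can skip failed parses.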

---

## 7. Interpretation

- **Low mean standard deviation and MARD** → high numeric stability.  
- **High ICC (≈ 1)** → strong reproducibility of probabilities across runs.  
- **High mean cosine similarity (≈ 1)** → semantically consistent explanations across runs (embedding similarity captures meaning, not the exact wording or reasoning steps).

Together, these metrics quantify whether run-to-run variation in inference causes meaningful shifts in the model's numeric or semantic outputs.

---

## 8. Relation to Prior Research

This notebook operationalizes the conceptual framework of LLM consistency and reproducibility discussed in  
[Wang & Wang (2025), “Assessing Consistency and Reproducibility in the Outputs of Large Language Models”](https://arxiv.org/pdf/2503.16974?).  
While their study uses 50 runs across multiple financial NLP tasks, this notebook implements a scaled-down, single-model version focusing on behavioral decision prompts and continuous probability outputs.

---

## 9. Output Summary

The notebook concludes by reporting:
- Mean per-prompt standard deviation  
- Mean Absolute Relative Difference (MARD)  
- Intraclass Correlation Coefficient (ICC)  
- Mean semantic similarity across explanations  

These collectively provide a quantitative characterization of **LLM output reproducibility** in probabilistic decision-making tasks.


In [1]:
# ============================================================
# LLM Consistency & Stability Evaluation Notebook
# ============================================================

import os

# ============================================================
# Configure cache and storage paths
# (set before importing the Hugging Face libraries, because
#  huggingface_hub and transformers read these variables at import time)
# ============================================================

CACHE_DIR = "../models/cache"
os.makedirs(CACHE_DIR, exist_ok=True)
os.environ["TRANSFORMERS_CACHE"] = CACHE_DIR
os.environ["SENTENCE_TRANSFORMERS_HOME"] = CACHE_DIR
os.environ["HF_HOME"] = CACHE_DIR

import json
import random
import gc
import numpy as np
import torch
from itertools import combinations
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer, util

# ============================================================
# SETTINGS
# ============================================================

MODEL_PATH = "../models/saved_models/gemma-3-4b-it"
N_RUNS = 5                            # number of repeated runs
N_PROMPTS = 30                        # number of prompts to test
BATCH_SIZE = 4                        # batch size per generation call
MAX_NEW_TOKENS = 200                  # maximum output length in new tokens

torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
device = "cuda" if torch.cuda.is_available() else "cpu"

def free_gpu_memory():
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.ipc_collect()

# ============================================================
# PROMPT GENERATION (same logic as in the training-data script)
# ============================================================

TRAITS = [
    "likes sweets",
    "dislikes sweets",
    "health-conscious",
    "lactose intolerant",
    "cheap",
    "spender",
    "impulsive buyer"
]

CONTEXTS = [
    "hungry",
    "on a diet",
    "ice cream truck nearby",
    "hot summer day",
    "cold winter day",
    "ice cream is cheap today (discount)",
    "after a long workout",
    "after lunch",
    "after work"
]

PROMPT_TEMPLATE = """Your decisions in everyday life are influenced by your personal traits and the context you find yourself in.

Traits: {traits}
Context: {context}

The action of interest is whether you buy ice cream.
Please assess how likely you are to take the action "buys ice cream" and provide your reasoning.
Return JSON with:
{{"buy": <float between 0 and 1>, "explanation": "<text>"}}
"""

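def build_prompts(n=30):
    """Build n prompts from random subsets of 1-3 traits and 1-3 contexts.

    Note: the lists are inserted with their Python repr (e.g. "['hungry']"),
    the global `random` state is unseeded here, and duplicates are possible.
    """
    prompts = []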
    for _ in range(n):
        traits = random.sample(TRAITS, k=random.randint(1,3))
        context = random.sample(CONTEXTS, k=random.randint(1,3))
        prompt = PROMPT_TEMPLATE.format(traits=traits, context=context)
        prompts.append(prompt)
    return prompts

prompts = build_prompts(N_PROMPTS)

# ============================================================
# LOAD MODEL
# ============================================================

print("🚀 Lade Modell ...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    dtype=torch.bfloat16,
    device_map=device,
    low_cpu_mem_usage=True
)
model.eval()

# ============================================================
# GENERATION
# ============================================================

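def generate_batch(prompts, batch_size=4, max_new_tokens=200):
    all_outputs = []
    # Compare against None explicitly: token id 0 is valid but falsy, so `or` would skip it.
    eos_token_id = tokenizer.eos_token_id if tokenizer.eos_token_id is not None else tokenizer.pad_token_id
    pad_token_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
    # Decoder-only models need left padding so that generated tokens directly follow each prompt.
    tokenizer.padding_side = "left"

    for i in tqdm(range(0, len(prompts), batch_size), desc="🔄 Generiere Batches"):
        batch_prompts = prompts[i:i + batch_size]
        chat_texts = [
            tokenizer.apply_chat_template(
                [{"role": "user", "content": p}],
                tokenize=False,
                add_generation_prompt=True
            )
            for p in batch_prompts
        ]
        # The chat template is expected to insert special tokens (e.g. BOS) itself,
        # so they are not added a second time here. truncation=True is dropped:
        # without a max_length it has no effect and only triggers a warning.
        inputs = tokenizer(chat_texts, return_tensors="pt", padding=True, add_special_tokens=False).to(model.device)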

        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                do_sample=False,
                max_new_tokens=max_new_tokens,
                pad_token_id=pad_token_id,
                eos_token_id=eos_token_id,
                repetition_penalty=1.1
            )

        for j in range(len(batch_prompts)):
            gen_tokens = outputs[j][inputs["input_ids"].shape[-1]:]
            answer = tokenizer.decode(gen_tokens, skip_special_tokens=True).strip()
            all_outputs.append(answer)
    return all_outputs

def extract_prob_and_expl(text):
    """Parse the JSON part of the output; returns (None, raw text) on failure."""
    try:
        start = text.find("{")
        end = text.rfind("}") + 1
        js = json.loads(text[start:end])
        return float(js.get("buy", None)), str(js.get("explanation", "")).strip()
    except Exception:
        return None, text

# ============================================================
# STABILITY EXPERIMENT
# ============================================================

def run_multiple_generations(prompts, n_runs=5, batch_size=4):
    all_probs, all_texts = [], []
    for r in range(n_runs):
        print(f"\n🧪 Run {r+1}/{n_runs}")
        random.seed(r)
        torch.manual_seed(r)
        outputs = generate_batch(prompts, batch_size=batch_size)
        probs, texts = [], []
        for o in outputs:
            p, t = extract_prob_and_expl(o)
            probs.append(p)
            texts.append(t)
        all_probs.append(probs)
        all_texts.append(texts)
        free_gpu_memory()
    return all_probs, all_texts

prob_runs, text_runs = run_multiple_generations(prompts, N_RUNS, BATCH_SIZE)

# ============================================================
# Metrics for the probability output (continuous)
# ============================================================

def compute_prob_stability(prob_matrix):
    # dtype=float turns failed parses (None) into NaN so the NaN-aware reductions can skip them
    arr = np.array(prob_matrix, dtype=float)  # shape: (n_runs, n_prompts)
    mean_std = np.nanmean(np.nanstd(arr, axis=0))
    mard_vals = []
    for a, b in combinations(range(arr.shape[0]), 2):
        rel_diff = np.abs(arr[a] - arr[b]) / (np.maximum((arr[a] + arr[b]) / 2, 1e-8))
        mard_vals.append(np.nanmean(rel_diff))
    mean_mard = np.nanmean(mard_vals)
    # Simplified ICC: between-prompt variance over between-prompt plus residual variance
    prompt_means = np.nanmean(arr, axis=0)
    between_var = np.nanvar(prompt_means)
    icc = between_var / (between_var + np.nanvar(arr - prompt_means))
    print(f"📊 Mittelwert der SDs über Prompts: {mean_std:.4f}")
    print(f"📊 Mean Absolute Relative Difference (MARD): {mean_mard*100:.2f}%")
    print(f"📊 Vereinfachte ICC-Schätzung: {icc:.3f}")
    return {"std": mean_std, "mard": mean_mard, "icc": icc}

_ = compute_prob_stability(prob_runs)

# ============================================================
# Metrics for the explanation (text)
# ============================================================

def compute_text_stability(text_runs):
    emb_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    sims = []
    for j in range(len(text_runs[0])):
        texts = [text_runs[r][j] for r in range(len(text_runs))]
        embs = emb_model.encode(texts, convert_to_tensor=True)
        pair_sims = []
        for a, b in combinations(range(len(texts)), 2):
            sim = util.cos_sim(embs[a], embs[b]).item()
            pair_sims.append(sim)
        sims.append(np.mean(pair_sims))
    avg_sim = np.mean(sims)
    print(f"📊 Durchschnittliche semantische Ähnlichkeit: {avg_sim:.4f}")
    return avg_sim

_ = compute_text_stability(text_runs)

print("\n✅ Stabilitätsprüfung abgeschlossen.")


🚀 Lade Modell ...


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]


🧪 Run 1/5


🔄 Generiere Batches:   0%|          | 0/8 [00:00<?, ?it/s]Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
The following generation flags are not valid and may be ignored: ['top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
🔄 Generiere Batches: 100%|██████████| 8/8 [04:45<00:00, 35.68s/it]



🧪 Run 2/5


🔄 Generiere Batches: 100%|██████████| 8/8 [04:44<00:00, 35.59s/it]



🧪 Run 3/5


🔄 Generiere Batches: 100%|██████████| 8/8 [04:44<00:00, 35.60s/it]



🧪 Run 4/5


🔄 Generiere Batches: 100%|██████████| 8/8 [04:44<00:00, 35.60s/it]



🧪 Run 5/5


🔄 Generiere Batches: 100%|██████████| 8/8 [04:44<00:00, 35.60s/it]


📊 Mittelwert der SDs über Prompts: 0.0000
📊 Mean Absolute Relative Difference (MARD): 0.00%
📊 Vereinfachte ICC-Schätzung: 1.000


📊 Durchschnittliche semantische Ähnlichkeit: 1.0000

✅ Stabilitätsprüfung abgeschlossen.
