This repository contains the code, data, and analysis for our study on how conversation history systematically distorts sequential binary judgments in LLM evaluation pipelines. We test 11 models from 4 providers across 82K+ API calls.
- AMEL is real and cross-provider (d = -0.14, p < 10^-35, N = 82,704 API calls)
- Ambiguous items are most affected (d = -0.26) while clear-cut cases remain robust
- Assimilation, not contrast: models shift toward the conversation's prevailing polarity regardless of item ground truth (d = 0.47)
- Negativity asymmetry: negative context induces 2.6x stronger bias than positive context
- No accumulation: 5 turns of biased history produce the same effect as 50
- Scaling reduces but doesn't eliminate: Haiku d=0.22 > Sonnet d=0.18 > Opus d=0.17
- Temperature doesn't help: lower temperature trends toward stronger bias, not weaker
- Balanced ordering mitigates drift: interleaving expected-yes/no items prevents positional drift in sequential evaluation
- Logprobs: the answer-token probability distribution shifts continuously rather than only flipping at the decision boundary (1,050 calls)
- Flipped framing: negativity asymmetry has both token-level and semantic sources — balance varies by model (1,260 calls)
- Positional placement: START ≈ END ≈ SPREAD — position of biased turns is irrelevant, any sparse signal suffices (KW H=0.19, p=0.91; 1,260 calls)
- Baseline correlation: items with higher baseline P(no) show stronger negativity asymmetry (r=0.14, p<0.001)
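The balanced-ordering mitigation in the findings above can be sketched in a few lines. This is illustrative only — the function name and signature are hypothetical, not the repo's actual mitigation code:

```python
def interleave_balanced(expected_yes, expected_no):
    """Alternate expected-yes and expected-no items so that no single
    polarity accumulates in the conversation history (a sketch of the
    balanced-ordering mitigation; hypothetical helper, not repo code)."""
    ordered = []
    for yes_item, no_item in zip(expected_yes, expected_no):
        ordered.extend([yes_item, no_item])
    # append any leftover items from the longer list
    n_paired = min(len(expected_yes), len(expected_no))
    longer = expected_yes if len(expected_yes) > len(expected_no) else expected_no
    ordered.extend(longer[n_paired:])
    return ordered

interleave_balanced(["y1", "y2", "y3"], ["n1", "n2"])
# -> ["y1", "n1", "y2", "n2", "y3"]
```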
```text
.
├── paper/ # Manuscript (LaTeX)
│ ├── main.tex # Full paper source
│ ├── main.pdf # Compiled manuscript
│ └── references.bib # Bibliography (35 references)
│
├── src/ # Experiment framework
│ ├── config.py # Experimental parameters
│ ├── conversation.py # Context construction (polarity + positional)
│ ├── parser.py # Response parsing (yes/no extraction)
│ ├── runner.py # Async experiment runner (Ollama)
│ └── domains/ # Evaluation domain definitions
│ ├── base.py # Abstract domain interface
│ ├── code_review.py # "Is this code production-ready?"
│ ├── code_review_flipped.py # "Should this code be rejected?" (flipped framing)
│ ├── content_mod.py # "Is this comment appropriate?"
│ └── meals.py # "Is this a healthy choice?"
│
├── run_experiment.py # Main CLI (local models via Ollama)
├── run_openai.py # OpenAI GPT-4.1 Nano runner
├── run_openai_5_2.py # OpenAI GPT-5.2 runner
├── run_claude.py # Anthropic Claude runner
├── run_gemini.py # Google Gemini runner
├── run_mitigation.py # Sequential batch mitigation experiment
├── run_temperature.py # Temperature sensitivity experiment
├── run_logprobs.py # Logprobs mechanistic experiment (Phase 1)
├── run_flipped.py # Flipped framing experiment (Phase 2)
├── run_positional.py # Positional placement experiment (Phase 3)
│
├── analysis/ # Statistical analysis
│ ├── utils.py # Shared utilities (load_results, compute_bias_scores)
│ ├── analyze.py # Core analysis functions
│ ├── paper_statistics.py # Comprehensive stats for paper
│ ├── contrast_assimilation.py # Congruent vs incongruent bias analysis
│ ├── continuous_confidence.py # Baseline entropy vs bias susceptibility
│ ├── mixed_effects.py # Mixed-effects model (BS ~ polarity * category | model)
│ ├── response_time.py # Response latency analysis
│ ├── qualitative_examples.py # Top biased items with raw responses
│ ├── mitigation_analysis.py # Sequential batch experiment analysis
│ ├── temperature_analysis.py # Temperature sensitivity analysis
│ ├── asymmetry_baseline_corr.py # Baseline P(no) vs asymmetry correlation
│ ├── logprobs_analysis.py # First-token probability analysis
│ ├── flipped_analysis.py # Original vs flipped framing comparison
│ └── positional_analysis.py # START/END/SPREAD placement analysis
│
├── generate_paper_figures.py # Publication figure generation (13 figures)
│
├── data/
│ ├── all_results.jsonl # Main experiment dataset (78,084 responses)
│ ├── mitigation/ # Sequential batch experiment (3,780 responses)
│ ├── temperature/ # Temperature spot-check (840 responses)
│ ├── logprobs/ # Logprobs experiment (1,050 responses)
│ ├── flipped/ # Flipped framing experiment (1,260 responses)
│ ├── positional/ # Positional placement experiment (1,260 responses)
│ ├── raw/ # Local models (Llama, Qwen) via Ollama
│ ├── openai/ # GPT-4.1 Nano results
│ ├── openai-gpt52/ # GPT-5.2 results
│ ├── claude-haiku-4-5/ # Claude Haiku 4.5 results
│ ├── claude-sonnet-4-6/ # Claude Sonnet 4.6 results
│ ├── claude-opus-4-6/ # Claude Opus 4.6 results
│ ├── gemini-flash/ # Gemini 2.5 Flash results
│ └── gemini-pro/ # Gemini 2.5 Pro results
│
└── results/
    ├── paper_figures/ # Figures 1-13 (PDF + PNG)
    ├── paper_statistics.json # Main experiment statistics
    ├── asymmetry_baseline_corr.json
    ├── logprobs_analysis.json
    ├── flipped_analysis.json
    ├── positional_analysis.json
    ├── contrast_assimilation.json
    ├── continuous_confidence.json
    ├── mixed_effects.json
    ├── response_time.json
    ├── qualitative_examples.json
    ├── mitigation_analysis.json
    └── temperature_analysis.json
```
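The main dataset in `data/all_results.jsonl` can be tallied per condition with a few lines. This is a sketch, not the repo's `analysis.utils` API: it assumes the record fields shown in the data-format section (`polarity`, `parsed_response`) and computes a raw yes-rate rather than the paper's bias score:

```python
import json
from collections import Counter

def yes_rate_by_polarity(path="data/all_results.jsonl"):
    """Return the fraction of 'yes' responses per polarity condition.

    Sketch only: assumes JSON Lines records with 'polarity' and
    'parsed_response' fields, as documented in this README.
    """
    yes_counts, totals = Counter(), Counter()
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            totals[rec["polarity"]] += 1
            yes_counts[rec["polarity"]] += rec["parsed_response"] == "yes"
    return {pol: yes_counts[pol] / totals[pol] for pol in totals}
```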
We use a between-subjects design with four conditions per test item:
| Condition | Context | Description |
|---|---|---|
| Baseline | None | Test item presented after system prompt only |
| No-saturated | 90% "no" | N turns of predominantly negative evaluations |
| Yes-saturated | 90% "yes" | N turns of predominantly positive evaluations |
| Neutral | 50/50 | N turns of balanced evaluations |
Each condition is repeated 10 times at temperature T=1.0 across context lengths N = {5, 10, 20, 50}.
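The context construction described above can be sketched as follows. Names and structure are illustrative (the real logic lives in src/conversation.py and may differ):

```python
import random

# Fraction of "no" answers in the fabricated history, per condition
# (illustrative constants; the repo's config.py is authoritative).
POLARITY_MIX = {"no_saturated": 0.9, "yes_saturated": 0.1, "neutral": 0.5}

def build_context(condition, n_turns, filler_items, rng):
    """Return n_turns of fabricated (question, answer) history with the
    given 'no' rate; baseline gets no history at all."""
    if condition == "baseline":
        return []
    no_rate = POLARITY_MIX[condition]
    history = []
    for item in rng.sample(filler_items, n_turns):
        answer = "no" if rng.random() < no_rate else "yes"
        history.append({"role": "user", "content": item})
        history.append({"role": "assistant", "content": answer})
    return history

rng = random.Random(0)
ctx = build_context("no_saturated", 5, [f"item {i}" for i in range(50)], rng)
len(ctx)  # 5 turns -> 10 messages (user + assistant per turn)
```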
| Provider | Model | Effect Size (\|d\|) |
|---|---|---|
| OpenAI | GPT-4.1 Nano | 0.34 |
| OpenAI | GPT-5.2 | 0.17 |
| Anthropic | Claude Haiku 4.5 | 0.22 |
| Anthropic | Claude Sonnet 4.6 | 0.18 |
| Anthropic | Claude Opus 4.6 | 0.17 |
| Google | Gemini 2.5 Flash | 0.18 |
| Google | Gemini 2.5 Pro | 0.27 |
| Local | Llama 3.2 3B | 0.32 |
| Local | Qwen3 4B | 0.21 (contrarian) |
| Local | Qwen3.5 4B | 0.08 (n.s.) |
| Local | Qwen3 30B | 0.08 (n.s.) |
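The effect sizes above are Cohen's d values. A minimal pooled-SD version can be computed from binary (yes = 1, no = 0) outcomes like this; the repo's exact estimator in analysis/ may differ:

```python
from statistics import mean, stdev

def cohens_d(biased, baseline):
    """Cohen's d for the shift in P(yes) between biased and baseline runs.

    biased, baseline: lists of 0/1 outcomes (1 = "yes"). Pooled-SD
    variant; a sketch, not necessarily the repo's estimator.
    """
    m1, m2 = mean(biased), mean(baseline)
    s1, s2 = stdev(biased), stdev(baseline)
    n1, n2 = len(biased), len(baseline)
    pooled = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return (m1 - m2) / pooled

# A no-saturated context that lowers P(yes) yields a negative d:
d = cohens_d([1, 0, 0, 0, 1, 0], [1, 1, 0, 1, 1, 0])
```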
```bash
pip install -r requirements.txt
```

For local models, install Ollama and pull the required models.
```bash
# Local models (Ollama)
python run_experiment.py run

# API models (set environment variables first)
export OPENAI_API_KEY="..."
python run_openai.py
export ANTHROPIC_API_KEY="..."
python run_claude.py
export GEMINI_API_KEY="..."
python run_gemini.py

# Mitigation experiment (sequential batch)
python run_mitigation.py

# Temperature sensitivity
python run_temperature.py

# Mechanistic experiments
python run_logprobs.py     # Logprobs (OpenAI only)
python run_flipped.py      # Flipped framing (OpenAI + Ollama)
python run_positional.py   # Positional placement (OpenAI + Ollama)
```

```bash
# Generate comprehensive statistics
python -m analysis.paper_statistics

# Run all supplementary analyses
python -m analysis.contrast_assimilation
python -m analysis.continuous_confidence
python -m analysis.mixed_effects
python -m analysis.response_time
python -m analysis.qualitative_examples
python -m analysis.mitigation_analysis
python -m analysis.temperature_analysis

# Mechanistic analyses
python -m analysis.asymmetry_baseline_corr
python -m analysis.logprobs_analysis
python -m analysis.flipped_analysis
python -m analysis.positional_analysis
```
```bash
# Generate paper figures (1-13)
python generate_paper_figures.py
```

```bash
cd paper
tectonic main.tex
# or: pdflatex main && bibtex main && pdflatex main && pdflatex main
```

Each line in `data/all_results.jsonl` is a JSON object:
```json
{
  "domain": "code_review",
  "model": "claude-sonnet-4-6",
  "polarity": "no_saturated",
  "context_length": 10,
  "test_item_id": "test_code_amb_03",
  "test_item_category": "ambiguous",
  "test_item_ground_truth": "yes",
  "repetition": 3,
  "parsed_response": "no",
  "response_time_ms": 2146.46,
  "seed": 3941144273,
  "timestamp": "2026-03-16T12:41:40.789808+00:00"
}
```

```bibtex
@article{temkit2026amel,
  title={AMEL: Accumulated Message Effects on LLM Judgments},
  author={Temkit, Sid-Ali},
  year={2026}
}
```

This research is released under the MIT License. The dataset is released under CC-BY 4.0.