This repository contains the code, data, and analysis for our study on how conversation history systematically distorts sequential binary judgments in LLM evaluation pipelines. We test 11 models from 4 providers across 82K+ API calls.
- AMEL is real and cross-provider (d = -0.14, p < 10^-35, N = 82,704 API calls)
- Ambiguous items are most affected (d = -0.26) while clear-cut cases remain robust
- Assimilation, not contrast: models shift toward the conversation's prevailing polarity regardless of item ground truth (d = 0.47)
- Negativity asymmetry: negative context induces 2.6x stronger bias than positive context
- No accumulation: 5 turns of biased history produce the same effect as 50
- Scaling reduces but doesn't eliminate: Haiku d=0.22 > Sonnet d=0.18 > Opus d=0.17
- Temperature doesn't help: lower temperature trends toward stronger bias, not weaker
- Balanced ordering mitigates drift: interleaving expected-yes/no items prevents positional drift in sequential evaluation
- Logprobs: the answer-token probability distribution shifts continuously rather than only flipping at the decision boundary (1,050 calls)
- Flipped framing: negativity asymmetry has both token-level and semantic sources — balance varies by model (1,260 calls)
- Positional placement: START ≈ END ≈ SPREAD — position of biased turns is irrelevant, any sparse signal suffices (KW H=0.19, p=0.91; 1,260 calls)
- Baseline correlation: items with higher baseline P(no) show stronger negativity asymmetry (r=0.14, p<0.001)
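The balanced-ordering mitigation in the findings above can be sketched in a few lines. This is illustrative only — the function name and signature are hypothetical, not the repo's actual mitigation code:

```python
def interleave_balanced(expected_yes, expected_no):
    """Alternate expected-yes and expected-no items so that no single
    polarity accumulates in the conversation history (a sketch of the
    balanced-ordering mitigation; hypothetical helper, not repo code)."""
    ordered = []
    for yes_item, no_item in zip(expected_yes, expected_no):
        ordered.extend([yes_item, no_item])
    # append any leftover items from the longer list
    n_paired = min(len(expected_yes), len(expected_no))
    longer = expected_yes if len(expected_yes) > len(expected_no) else expected_no
    ordered.extend(longer[n_paired:])
    return ordered

interleave_balanced(["y1", "y2", "y3"], ["n1", "n2"])
# -> ["y1", "n1", "y2", "n2", "y3"]
```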
```text
.
├── paper/ # Manuscript (LaTeX)
│ ├── main.tex # Full paper source
│ ├── main.pdf # Compiled manuscript
│ └── references.bib # Bibliography (35 references)
│
├── src/ # Experiment framework
│ ├── config.py # Experimental parameters
│ ├── conversation.py # Context construction (polarity + positional)
│ ├── parser.py # Response parsing (yes/no extraction)
│ ├── runner.py # Async experiment runner (Ollama)
│ └── domains/ # Evaluation domain definitions
│ ├── base.py # Abstract domain interface
│ ├── code_review.py # "Is this code production-ready?"
│ ├── code_review_flipped.py # "Should this code be rejected?" (flipped framing)
│ ├── content_mod.py # "Is this comment appropriate?"
│ └── meals.py # "Is this a healthy choice?"
│
├── run_experiment.py # Main CLI (local models via Ollama)
├── run_openai.py # OpenAI GPT-4.1 Nano runner
├── run_openai_5_2.py # OpenAI GPT-5.2 runner
├── run_claude.py # Anthropic Claude runner
├── run_gemini.py # Google Gemini runner
├── run_mitigation.py # Sequential batch mitigation experiment
├── run_temperature.py # Temperature sensitivity experiment
├── run_logprobs.py # Logprobs mechanistic experiment (Phase 1)
├── run_flipped.py # Flipped framing experiment (Phase 2)
├── run_positional.py # Positional placement experiment (Phase 3)
│
├── analysis/ # Statistical analysis
│ ├── utils.py # Shared utilities (load_results, compute_bias_scores)
│ ├── analyze.py # Core analysis functions
│ ├── paper_statistics.py # Comprehensive stats for paper
│ ├── contrast_assimilation.py # Congruent vs incongruent bias analysis
│ ├── continuous_confidence.py # Baseline entropy vs bias susceptibility
│ ├── mixed_effects.py # Mixed-effects model (BS ~ polarity * category | model)
│ ├── response_time.py # Response latency analysis
│ ├── qualitative_examples.py # Top biased items with raw responses
│ ├── mitigation_analysis.py # Sequential batch experiment analysis
│ ├── temperature_analysis.py # Temperature sensitivity analysis
│ ├── asymmetry_baseline_corr.py # Baseline P(no) vs asymmetry correlation
│ ├── logprobs_analysis.py # First-token probability analysis
│ ├── flipped_analysis.py # Original vs flipped framing comparison
│ └── positional_analysis.py # START/END/SPREAD placement analysis
│
├── generate_paper_figures.py # Publication figure generation (13 figures)
│
├── data/
│ ├── all_results.jsonl # Main experiment dataset (78,084 responses)
│ ├── mitigation/ # Sequential batch experiment (3,780 responses)
│ ├── temperature/ # Temperature spot-check (840 responses)
│ ├── logprobs/ # Logprobs experiment (1,050 responses)
│ ├── flipped/ # Flipped framing experiment (1,260 responses)
│ ├── positional/ # Positional placement experiment (1,260 responses)
│ ├── raw/ # Local models (Llama, Qwen) via Ollama
│ ├── openai/ # GPT-4.1 Nano results
│ ├── openai-gpt52/ # GPT-5.2 results
│ ├── claude-haiku-4-5/ # Claude Haiku 4.5 results
│ ├── claude-sonnet-4-6/ # Claude Sonnet 4.6 results
│ ├── claude-opus-4-6/ # Claude Opus 4.6 results
│ ├── gemini-flash/ # Gemini 2.5 Flash results
│ └── gemini-pro/ # Gemini 2.5 Pro results
│
└── results/
    ├── paper_figures/ # Figures 1-13 (PDF + PNG)
    ├── paper_statistics.json # Main experiment statistics
    ├── asymmetry_baseline_corr.json
    ├── logprobs_analysis.json
    ├── flipped_analysis.json
    ├── positional_analysis.json
    ├── contrast_assimilation.json
    ├── continuous_confidence.json
    ├── mixed_effects.json
    ├── response_time.json
    ├── qualitative_examples.json
    ├── mitigation_analysis.json
    └── temperature_analysis.json
```
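The main dataset in `data/all_results.jsonl` can be tallied per condition with a few lines. This is a sketch, not the repo's `analysis.utils` API: it assumes the record fields shown in the data-format section (`polarity`, `parsed_response`) and computes a raw yes-rate rather than the paper's bias score:

```python
import json
from collections import Counter

def yes_rate_by_polarity(path="data/all_results.jsonl"):
    """Return the fraction of 'yes' responses per polarity condition.

    Sketch only: assumes JSON Lines records with 'polarity' and
    'parsed_response' fields, as documented in this README.
    """
    yes_counts, totals = Counter(), Counter()
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            totals[rec["polarity"]] += 1
            yes_counts[rec["polarity"]] += rec["parsed_response"] == "yes"
    return {pol: yes_counts[pol] / totals[pol] for pol in totals}
```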
We use a between-subjects design with four conditions per test item:
| Condition | Context | Description |
|---|---|---|
| Baseline | None | Test item presented after system prompt only |
| No-saturated | 90% "no" | N turns of predominantly negative evaluations |
| Yes-saturated | 90% "yes" | N turns of predominantly positive evaluations |
| Neutral | 50/50 | N turns of balanced evaluations |
Each condition is repeated 10 times at temperature T=1.0 across context lengths N = {5, 10, 20, 50}.
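The context construction described above can be sketched as follows. Names and structure are illustrative (the real logic lives in src/conversation.py and may differ):

```python
import random

# Fraction of "no" answers in the fabricated history, per condition
# (illustrative constants; the repo's config.py is authoritative).
POLARITY_MIX = {"no_saturated": 0.9, "yes_saturated": 0.1, "neutral": 0.5}

def build_context(condition, n_turns, filler_items, rng):
    """Return n_turns of fabricated (question, answer) history with the
    given 'no' rate; baseline gets no history at all."""
    if condition == "baseline":
        return []
    no_rate = POLARITY_MIX[condition]
    history = []
    for item in rng.sample(filler_items, n_turns):
        answer = "no" if rng.random() < no_rate else "yes"
        history.append({"role": "user", "content": item})
        history.append({"role": "assistant", "content": answer})
    return history

rng = random.Random(0)
ctx = build_context("no_saturated", 5, [f"item {i}" for i in range(50)], rng)
len(ctx)  # 5 turns -> 10 messages (user + assistant per turn)
```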
| Provider | Model | Effect Size (\|d\|) |
|---|---|---|
| OpenAI | GPT-4.1 Nano | 0.34 |
| OpenAI | GPT-5.2 | 0.17 |
| Anthropic | Claude Haiku 4.5 | 0.22 |
| Anthropic | Claude Sonnet 4.6 | 0.18 |
| Anthropic | Claude Opus 4.6 | 0.17 |
| Google | Gemini 2.5 Flash | 0.18 |
| Google | Gemini 2.5 Pro | 0.27 |
| Local | Llama 3.2 3B | 0.32 |
| Local | Qwen3 4B | 0.21 (contrarian) |
| Local | Qwen3.5 4B | 0.08 (n.s.) |
| Local | Qwen3 30B | 0.08 (n.s.) |
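The effect sizes above are Cohen's d values. A minimal pooled-SD version can be computed from binary (yes = 1, no = 0) outcomes like this; the repo's exact estimator in analysis/ may differ:

```python
from statistics import mean, stdev

def cohens_d(biased, baseline):
    """Cohen's d for the shift in P(yes) between biased and baseline runs.

    biased, baseline: lists of 0/1 outcomes (1 = "yes"). Pooled-SD
    variant; a sketch, not necessarily the repo's estimator.
    """
    m1, m2 = mean(biased), mean(baseline)
    s1, s2 = stdev(biased), stdev(baseline)
    n1, n2 = len(biased), len(baseline)
    pooled = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return (m1 - m2) / pooled

# A no-saturated context that lowers P(yes) yields a negative d:
d = cohens_d([1, 0, 0, 0, 1, 0], [1, 1, 0, 1, 1, 0])
```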
```bash
pip install -r requirements.txt
```

For local models, install Ollama and pull the required models.
```bash
# Local models (Ollama)
python run_experiment.py run

# API models (set environment variables first)
export OPENAI_API_KEY="..."
python run_openai.py
export ANTHROPIC_API_KEY="..."
python run_claude.py
export GEMINI_API_KEY="..."
python run_gemini.py

# Mitigation experiment (sequential batch)
python run_mitigation.py

# Temperature sensitivity
python run_temperature.py

# Mechanistic experiments
python run_logprobs.py     # Logprobs (OpenAI only)
python run_flipped.py      # Flipped framing (OpenAI + Ollama)
python run_positional.py   # Positional placement (OpenAI + Ollama)
```

```bash
# Generate comprehensive statistics
python -m analysis.paper_statistics

# Run all supplementary analyses
python -m analysis.contrast_assimilation
python -m analysis.continuous_confidence
python -m analysis.mixed_effects
python -m analysis.response_time
python -m analysis.qualitative_examples
python -m analysis.mitigation_analysis
python -m analysis.temperature_analysis

# Mechanistic analyses
python -m analysis.asymmetry_baseline_corr
python -m analysis.logprobs_analysis
python -m analysis.flipped_analysis
python -m analysis.positional_analysis
```
```bash
# Generate paper figures (1-13)
python generate_paper_figures.py
```

```bash
cd paper
tectonic main.tex
# or: pdflatex main && bibtex main && pdflatex main && pdflatex main
```

Each line in `data/all_results.jsonl` is a JSON object:
```json
{
  "domain": "code_review",
  "model": "claude-sonnet-4-6",
  "polarity": "no_saturated",
  "context_length": 10,
  "test_item_id": "test_code_amb_03",
  "test_item_category": "ambiguous",
  "test_item_ground_truth": "yes",
  "repetition": 3,
  "parsed_response": "no",
  "response_time_ms": 2146.46,
  "seed": 3941144273,
  "timestamp": "2026-03-16T12:41:40.789808+00:00"
}
```

```bibtex
@article{temkit2026amel,
  title={AMEL: Accumulated Message Effects on LLM Judgments},
  author={Temkit, Sid-Ali},
  year={2026}
}
```

This research is released under the MIT License. The dataset is released under CC-BY 4.0.