AMEL: Accumulated Message Effects on LLM Judgments

This repository contains the code, data, and analysis for our study on how conversation history systematically distorts sequential binary judgments in LLM evaluation pipelines. We test 11 models from 4 providers across 82K+ API calls.

Key Findings

  • AMEL is real and cross-provider (d = -0.14, p < 10^-35, N = 82,704 API calls)
  • Ambiguous items are most affected (d = -0.26) while clear-cut cases remain robust
  • Assimilation, not contrast: models shift toward the conversation's prevailing polarity regardless of item ground truth (d = 0.47)
  • Negativity asymmetry: negative context induces 2.6x stronger bias than positive context
  • No accumulation: 5 turns of biased history produce the same effect as 50
  • Scaling reduces but doesn't eliminate: Haiku d=0.22 > Sonnet d=0.18 > Opus d=0.17
  • Temperature doesn't help: lower temperature trends toward stronger bias, not weaker
  • Balanced ordering mitigates drift: interleaving expected-yes/no items prevents positional drift in sequential evaluation
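
The balanced-ordering mitigation can be sketched as a simple interleaver (a hypothetical helper for illustration; the repo's run_mitigation.py is the reference implementation):

```python
def interleave_balanced(expected_yes, expected_no):
    """Alternate expected-yes and expected-no items so neither polarity
    accumulates a long run in the evaluation sequence."""
    out = []
    for y, n in zip(expected_yes, expected_no):
        out.extend([y, n])
    # Append leftovers if one list is longer than the other
    shorter = min(len(expected_yes), len(expected_no))
    out.extend(expected_yes[shorter:] or expected_no[shorter:])
    return out
```

With a balanced sequence like this, no saturated prefix ever forms, which is the condition the mitigation experiment tests.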

Mechanistic Findings (Section 5)

  • Logprobs: the first-token probability distribution shifts continuously under biased context; the bias is not just discrete yes/no flips (1,050 calls)
  • Flipped framing: negativity asymmetry has both token-level and semantic sources — balance varies by model (1,260 calls)
  • Positional placement: START ≈ END ≈ SPREAD — position of biased turns is irrelevant, any sparse signal suffices (KW H=0.19, p=0.91; 1,260 calls)
  • Baseline correlation: items with higher baseline P(no) show stronger negativity asymmetry (r=0.14, p<0.001)
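
The logprobs analysis treats bias as movement in a continuous P(no) rather than discrete answer flips. Recovering P(no) from first-token logprobs is a small calculation (the token values below are illustrative, not from the dataset):

```python
import math

# Hypothetical first-token top-logprobs from a single judgment call
top_logprobs = {"No": -0.30, "no": -2.10, "Yes": -1.60, "yes": -3.50}

# Sum probability mass over all surface forms of "no"
p_no = sum(math.exp(lp) for tok, lp in top_logprobs.items()
           if tok.strip().lower() == "no")
print(round(p_no, 3))  # combined P("no") across casings
```

Comparing this quantity between baseline and saturated conditions gives the continuous shift the finding describes.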

Repository Structure

.
├── paper/                      # Manuscript (LaTeX)
│   ├── main.tex                # Full paper source
│   ├── main.pdf                # Compiled manuscript
│   └── references.bib          # Bibliography (35 references)
│
├── src/                        # Experiment framework
│   ├── config.py               # Experimental parameters
│   ├── conversation.py         # Context construction (polarity + positional)
│   ├── parser.py               # Response parsing (yes/no extraction)
│   ├── runner.py               # Async experiment runner (Ollama)
│   └── domains/                # Evaluation domain definitions
│       ├── base.py             # Abstract domain interface
│       ├── code_review.py      # "Is this code production-ready?"
│       ├── code_review_flipped.py # "Should this code be rejected?" (flipped framing)
│       ├── content_mod.py      # "Is this comment appropriate?"
│       └── meals.py            # "Is this a healthy choice?"
│
├── run_experiment.py           # Main CLI (local models via Ollama)
├── run_openai.py               # OpenAI GPT-4.1 Nano runner
├── run_openai_5_2.py           # OpenAI GPT-5.2 runner
├── run_claude.py               # Anthropic Claude runner
├── run_gemini.py               # Google Gemini runner
├── run_mitigation.py           # Sequential batch mitigation experiment
├── run_temperature.py          # Temperature sensitivity experiment
├── run_logprobs.py             # Logprobs mechanistic experiment (Phase 1)
├── run_flipped.py              # Flipped framing experiment (Phase 2)
├── run_positional.py           # Positional placement experiment (Phase 3)
│
├── analysis/                   # Statistical analysis
│   ├── utils.py                # Shared utilities (load_results, compute_bias_scores)
│   ├── analyze.py              # Core analysis functions
│   ├── paper_statistics.py     # Comprehensive stats for paper
│   ├── contrast_assimilation.py # Congruent vs incongruent bias analysis
│   ├── continuous_confidence.py # Baseline entropy vs bias susceptibility
│   ├── mixed_effects.py        # Mixed-effects model (BS ~ polarity * category | model)
│   ├── response_time.py        # Response latency analysis
│   ├── qualitative_examples.py # Top biased items with raw responses
│   ├── mitigation_analysis.py  # Sequential batch experiment analysis
│   ├── temperature_analysis.py # Temperature sensitivity analysis
│   ├── asymmetry_baseline_corr.py # Baseline P(no) vs asymmetry correlation
│   ├── logprobs_analysis.py    # First-token probability analysis
│   ├── flipped_analysis.py     # Original vs flipped framing comparison
│   └── positional_analysis.py  # START/END/SPREAD placement analysis
│
├── generate_paper_figures.py   # Publication figure generation (13 figures)
│
├── data/
│   ├── all_results.jsonl       # Main experiment dataset (78,084 responses)
│   ├── mitigation/             # Sequential batch experiment (3,780 responses)
│   ├── temperature/            # Temperature spot-check (840 responses)
│   ├── logprobs/               # Logprobs experiment (1,050 responses)
│   ├── flipped/                # Flipped framing experiment (1,260 responses)
│   ├── positional/             # Positional placement experiment (1,260 responses)
│   ├── raw/                    # Local models (Llama, Qwen) via Ollama
│   ├── openai/                 # GPT-4.1 Nano results
│   ├── openai-gpt52/           # GPT-5.2 results
│   ├── claude-haiku-4-5/       # Claude Haiku 4.5 results
│   ├── claude-sonnet-4-6/      # Claude Sonnet 4.6 results
│   ├── claude-opus-4-6/        # Claude Opus 4.6 results
│   ├── gemini-flash/           # Gemini 2.5 Flash results
│   └── gemini-pro/             # Gemini 2.5 Pro results
│
└── results/
    ├── paper_figures/          # Figures 1-13 (PDF + PNG)
    ├── paper_statistics.json   # Main experiment statistics
    ├── asymmetry_baseline_corr.json
    ├── logprobs_analysis.json
    ├── flipped_analysis.json
    ├── positional_analysis.json
    ├── contrast_assimilation.json
    ├── continuous_confidence.json
    ├── mixed_effects.json
    ├── response_time.json
    ├── qualitative_examples.json
    ├── mitigation_analysis.json
    └── temperature_analysis.json

Experimental Design

We use a between-subjects design with four conditions per test item:

Condition       Context     Description
Baseline        None        Test item presented after system prompt only
No-saturated    90% "no"    N turns of predominantly negative evaluations
Yes-saturated   90% "yes"   N turns of predominantly positive evaluations
Neutral         50/50       N turns of balanced evaluations

Each condition is repeated 10 times at temperature T=1.0 across context lengths N = {5, 10, 20, 50}.
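
Under these conditions, a saturated context might be assembled as follows (a hypothetical sketch with made-up names; src/conversation.py is the canonical implementation):

```python
import random

def build_context(items, polarity="no_saturated", n_turns=10, seed=0):
    """Build N prior turns where ~90% of assistant answers share one polarity."""
    rng = random.Random(seed)
    majority = "no" if polarity == "no_saturated" else "yes"
    minority = "yes" if majority == "no" else "no"
    n_major = round(0.9 * n_turns)
    answers = [majority] * n_major + [minority] * (n_turns - n_major)
    rng.shuffle(answers)
    messages = []
    for item, ans in zip(rng.sample(items, n_turns), answers):
        messages.append({"role": "user", "content": item})
        messages.append({"role": "assistant", "content": ans})
    return messages
```

The test item is then appended as a final user turn after this history.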

Models Tested

Provider    Model               Effect Size (|d|)
OpenAI      GPT-4.1 Nano        0.34
OpenAI      GPT-5.2             0.17
Anthropic   Claude Haiku 4.5    0.22
Anthropic   Claude Sonnet 4.6   0.18
Anthropic   Claude Opus 4.6     0.17
Google      Gemini 2.5 Flash    0.18
Google      Gemini 2.5 Pro      0.27
Local       Llama 3.2 3B        0.32
Local       Qwen3 4B            0.21 (contrarian)
Local       Qwen3.5 4B          0.08 (n.s.)
Local       Qwen3 30B           0.08 (n.s.)
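
The effect sizes above are Cohen's d on bias scores; for reference, the standard pooled-SD formula looks like this (a generic implementation, not the repo's code):

```python
import statistics

def cohens_d(a, b):
    """Cohen's d: mean difference scaled by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5
```

By the usual conventions, the values in the table fall in the small-effect range, which is why the paper emphasizes statistical power (82K+ calls) rather than large per-item shifts.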

Reproducing Results

Prerequisites

pip install -r requirements.txt

For local models, install Ollama and pull the required models.

Running Experiments

# Local models (Ollama)
python run_experiment.py run

# API models (set environment variables first)
export OPENAI_API_KEY="..."
python run_openai.py

export ANTHROPIC_API_KEY="..."
python run_claude.py

export GEMINI_API_KEY="..."
python run_gemini.py

# Mitigation experiment (sequential batch)
python run_mitigation.py

# Temperature sensitivity
python run_temperature.py

# Mechanistic experiments
python run_logprobs.py      # Logprobs (OpenAI only)
python run_flipped.py       # Flipped framing (OpenAI + Ollama)
python run_positional.py    # Positional placement (OpenAI + Ollama)

Analysis

# Generate comprehensive statistics
python -m analysis.paper_statistics

# Run all supplementary analyses
python -m analysis.contrast_assimilation
python -m analysis.continuous_confidence
python -m analysis.mixed_effects
python -m analysis.response_time
python -m analysis.qualitative_examples
python -m analysis.mitigation_analysis
python -m analysis.temperature_analysis

# Mechanistic analyses
python -m analysis.asymmetry_baseline_corr
python -m analysis.logprobs_analysis
python -m analysis.flipped_analysis
python -m analysis.positional_analysis

# Generate paper figures (1-13)
python generate_paper_figures.py

Building the Paper

cd paper
tectonic main.tex
# or: pdflatex main && bibtex main && pdflatex main && pdflatex main

Data Format

Each line in data/all_results.jsonl is a JSON object:

{
  "domain": "code_review",
  "model": "claude-sonnet-4-6",
  "polarity": "no_saturated",
  "context_length": 10,
  "test_item_id": "test_code_amb_03",
  "test_item_category": "ambiguous",
  "test_item_ground_truth": "yes",
  "repetition": 3,
  "parsed_response": "no",
  "response_time_ms": 2146.46,
  "seed": 3941144273,
  "timestamp": "2026-03-16T12:41:40.789808+00:00"
}
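
A minimal reader for this format (field names taken from the record above; analysis/utils.py provides the project's actual load_results and bias-score utilities):

```python
import json

def load_results(path):
    """Load one JSON object per line from a .jsonl results file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def no_rate(records):
    """Fraction of parsed 'no' answers -- the raw ingredient of a bias score."""
    answers = [r["parsed_response"] for r in records]
    return answers.count("no") / len(answers)
```

Filtering records by `polarity` and comparing `no_rate` against the baseline condition reproduces the direction of the headline effect.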

Citation

@article{temkit2026amel,
  title={AMEL: Accumulated Message Effects on LLM Judgments},
  author={Temkit, Sid-Ali},
  year={2026}
}

License

The code in this repository is released under the MIT License. The dataset is released under CC-BY 4.0.
