Reverse CAPTCHA: Evaluating LLM Susceptibility to Invisible Unicode Instruction Injection

An evaluation framework that tests whether large language models follow invisible Unicode-encoded instructions embedded in otherwise normal-looking text. Where traditional CAPTCHAs exploit tasks humans can solve but machines cannot, Reverse CAPTCHA exploits a perception channel that machines can access but humans cannot.

Paper: paper/reverse_captcha.pdf

Key Findings

Across 8,308 graded outputs from 5 models (GPT-5.2, GPT-4o-mini, Claude Opus 4, Claude Sonnet 4, Claude Haiku 4.5):

Tool use is the primary amplifier

Code execution access transforms invisible Unicode from an ignorable artifact to a decodable instruction channel. Without tools, compliance stays below 17%. With tools and hints, it reaches 98-100%.

Encoding vulnerability is provider-specific

OpenAI models decode zero-width binary but not Unicode Tags. Anthropic models prefer Unicode Tags. An attacker must tailor encoding to the target provider.

Full compliance heatmap

The hint gradient is consistent across models: unhinted << codepoint hints < full hints. The combination of tool access and decoding instructions is the critical enabler.

How It Works

Two invisible encoding schemes are embedded in trivia questions:

Zero-Width Binary (ZW): Each ASCII character encoded as 8 binary digits using U+200B (0) and U+200C (1)
Unicode Tags: Each ASCII character mapped to U+E0000 + codepoint (one invisible char per ASCII char)

The model receives a trivia question with a known visible answer. Hidden in the invisible characters is a different answer. If the model outputs the hidden answer, it followed the invisible instruction.

Test Matrix

30 trivia Q/A pairs
2 encoding schemes x 4 hint levels + 1 control = 9 conditions
2 tool conditions (ON/OFF)
5 models
270 cases per eval run

Statistical Analysis

Wilson score 95% confidence intervals
Fisher's exact test with Bonferroni correction
Chi-squared tests for per-model scheme variation
Cohen's h effect sizes

Quick Start

# Install
pip install -e ".[dev]"

# Run reverse CAPTCHA eval
evalrun run --pack reverse_captcha --model openai:gpt-4o-mini --tools --n 3 --out results.sqlite

# Run without tools
evalrun run --pack reverse_captcha --model openai:gpt-4o-mini --n 3 --out results.sqlite

# Run with Anthropic models
evalrun run --pack reverse_captcha --model anthropic:claude-sonnet-4-20250514 --tools --n 3 --out results.sqlite

# Generate analysis
python3 scripts/analyze_journal.py

# Generate figures
python3 scripts/generate_figures.py

Project Structure

├── packs/reverse_captcha/       # Eval pack (cases, grader, generator)
│   ├── cases.yaml               # 270 test cases
│   ├── qa_pairs.yaml            # 30 trivia Q/A pairs
│   ├── grader.py                # Deterministic grading (5 categories)
│   └── generate_cases.py        # Case generator for both encodings
├── src/evalrun/                 # Core framework
│   ├── adapters/                # OpenAI + Anthropic API adapters
│   ├── cli.py                   # CLI (run, report, export)
│   ├── runner.py                # Eval runner with tool-use support
│   └── db.py                    # SQLite persistence
├── paper/                       # Workshop paper
│   ├── reverse_captcha.tex      # LaTeX source
│   └── reverse_captcha.pdf      # Compiled PDF
├── results/journal/             # Raw data
│   ├── analysis/                # Statistical analysis CSVs
│   └── figures/                 # Publication figures (PDF + PNG)
├── scripts/                     # Analysis and figure generation
└── tests/                       # Test suite

Other Eval Packs

The framework also includes two additional packs:

Watermark Robustness — Tests whether models preserve a provenance marker across rewriting tasks
Hidden Message Extraction — Tests whether models can extract hidden messages given explicit rules

Environment

Python 3.10+
OPENAI_API_KEY for OpenAI models
ANTHROPIC_API_KEY for Anthropic models
Dependencies: click, openai, anthropic, pyyaml, matplotlib, pandas, scipy, numpy

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
.github/workflows		.github/workflows
packs		packs
paper		paper
report		report
results		results
scripts		scripts
src/evalrun		src/evalrun
tests		tests
.gitignore		.gitignore
Dockerfile		Dockerfile
Dockerfile.test		Dockerfile.test
README.md		README.md
docker-compose.yml		docker-compose.yml
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Reverse CAPTCHA: Evaluating LLM Susceptibility to Invisible Unicode Instruction Injection

Key Findings

Tool use is the primary amplifier

Encoding vulnerability is provider-specific

Full compliance heatmap

How It Works

Test Matrix

Statistical Analysis

Quick Start

Project Structure

Other Eval Packs

Environment

License

About

Uh oh!

Releases

Packages

Contributors 2

Languages

canonicalmg/reverse-captcha-eval

Folders and files

Latest commit

History

Repository files navigation

Reverse CAPTCHA: Evaluating LLM Susceptibility to Invisible Unicode Instruction Injection

Key Findings

Tool use is the primary amplifier

Encoding vulnerability is provider-specific

Full compliance heatmap

How It Works

Test Matrix

Statistical Analysis

Quick Start

Project Structure

Other Eval Packs

Environment

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages