langprod — The Protocol Layer Hypothesis

An experimental test of whether transformer middle layers encode a format-agnostic semantic substrate — a "protocol layer" where the same product, described in different linguistic registers, converges in representational geometry.

Model: Qwen2.5-32B-Instruct-GPTQ-Int4 (64 layers, hidden dim 5120) · Stimuli: 800 product descriptions (80 products × 5 registers × 2 variants).
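
The stimulus count follows from a fully crossed design. As a minimal sketch of that arithmetic, using placeholder integer IDs rather than the dataset's real product and register names (which are not listed here):

```python
from collections import Counter
from itertools import product

N_PRODUCTS, N_REGISTERS, N_VARIANTS = 80, 5, 2

# Placeholder IDs: the real product and register names live in the dataset.
grid = list(product(range(N_PRODUCTS), range(N_REGISTERS), range(N_VARIANTS)))

per_product = Counter(p for p, _, _ in grid)
per_register = Counter(r for _, r, _ in grid)

print(len(grid))                    # 800 stimuli in total
print(set(per_product.values()))    # {10} descriptions per product
print(set(per_register.values()))   # {160} descriptions per register
```

Every product therefore appears in every register, which is what lets the analysis separate "same product" from "same register".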

Headline result: hypothesis not supported

Register identity (how a description is written) was the dominant organizing principle of representational geometry at every layer (register RSA peak r = 0.670 vs. product RSA peak r = 0.371). Fine-grained product identity emerged only in late layers (peak at layer 61 of 64), not in the predicted middle layers. The results show no protocol layer for within-language register variation in this model; form dominates content end-to-end.
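
For readers new to RSA, the comparison above can be sketched on synthetic data. This is an illustration under stated assumptions, not the code in analysis.py: the distance metric (correlation distance), the same/different model RDMs, the toy sizes, and the helper names `average_ranks`, `spearman`, and `rsa` are all choices made for this example.

```python
import numpy as np

def average_ranks(x):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    return (np.bincount(inverse, weights=ranks) / counts)[inverse]

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])

def rsa(activations, labels):
    """Correlate the neural RDM with a same/different RDM built from labels."""
    neural_rdm = 1.0 - np.corrcoef(activations)           # correlation distance
    model_rdm = (labels[:, None] != labels[None, :]).astype(float)
    upper = np.triu_indices(len(labels), k=1)              # count each pair once
    return spearman(neural_rdm[upper], model_rdm[upper])

# Synthetic activations in which register is deliberately the stronger signal.
rng = np.random.default_rng(0)
n_products, n_registers, dim = 6, 3, 64                    # toy sizes (assumed)
product_ids = np.repeat(np.arange(n_products), n_registers)
register_ids = np.tile(np.arange(n_registers), n_products)

product_dirs = rng.normal(size=(n_products, dim))
register_dirs = rng.normal(size=(n_registers, dim))
acts = (3.0 * register_dirs[register_ids]
        + 1.0 * product_dirs[product_ids]
        + 0.1 * rng.normal(size=(len(product_ids), dim)))

register_rsa = rsa(acts, register_ids)
product_rsa = rsa(acts, product_ids)
print(f"register RSA = {register_rsa:.3f}, product RSA = {product_rsa:.3f}")
```

Running the same function once per layer gives an RSA curve per label type. The headline result is a statement about where those curves peak and which one is higher.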

Start here

| If you want to… | Go to |
| --- | --- |
| Read the paper | `writeup.pdf` (source: `writeup.md`) |
| Use the dataset / benchmark | `benchmark/` → `benchmark/README.md` |
| Get the dataset on Hugging Face | `adromero/regprod-800` (includes the 496 MB activations bundle) |
| Reproduce results without a GPU | `benchmark/reproduce.py` (CPU-only, ~80 s) |
| Re-run the full pipeline | `run.py` (needs the model + a ~24 GB GPU) |

The benchmark: RegProd-800

The shippable artifact is RegProd-800 — 800 labeled product descriptions built to dissociate semantic content from surface register, plus three classification tasks (register-5, category-8, product-80), reference baselines, and an optional GPU-free reproduction bundle (fp16 residual-stream activations, ~496 MB).

cd benchmark
python load.py          # sanity-check the loader (800 stimuli, 3 tasks)
python reproduce.py     # recompute the RSA curves from fp16 activations, vs baselines

reproduce.py regenerates the writeup's headline numbers on CPU to four decimals — see benchmark/README.md for the full dataset card, schema, task protocol, and known caveats (BoW ceiling, GroupKFold zero-shot quirk, anisotropy).
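
One way a grouped cross-validation split can produce a zero-shot situation is sketched below. This is a guess at what the caveat refers to, and benchmark/README.md is the authority. The hypothetical helper `grouped_folds` is a hand-rolled stand-in for scikit-learn's GroupKFold, and the item IDs are placeholders.

```python
from itertools import product

# Tiny stand-in grid: 6 products x 2 registers (placeholder IDs, not real labels).
items = [{"product": p, "register": r} for p, r in product(range(6), range(2))]

def grouped_folds(rows, group_key, n_folds):
    """Yield (train, test) splits in which no group value spans both sides."""
    groups = sorted({row[group_key] for row in rows})
    for fold in range(n_folds):
        held_out = set(groups[fold::n_folds])
        train = [row for row in rows if row[group_key] not in held_out]
        test = [row for row in rows if row[group_key] in held_out]
        yield train, test

for train, test in grouped_folds(items, "product", n_folds=3):
    train_products = {row["product"] for row in train}
    test_products = {row["product"] for row in test}
    train_registers = {row["register"] for row in train}
    test_registers = {row["register"] for row in test}
    # Product labels in the test fold were never seen in training...
    assert train_products.isdisjoint(test_products)
    # ...while every register label in the test fold was.
    assert test_registers <= train_registers
```

Grouping by product is a sensible guard against leakage for the register task, but for a product-identity task the same split asks the classifier to predict classes it has never seen.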

⚠️ Disclaimer. All 800 descriptions are AI-generated for research. The 400 real-product items name real brands/companies (e.g. Colgate Total, Pampers Swaddlers, Procter & Gamble), but the text and specs are fabricated — not genuine marketing copy, reviews, claims, patents, or filings by those brands, and possibly inaccurate. Trademarks belong to their owners and are used nominatively for non-commercial research; no affiliation or endorsement is implied. Rights holders may request removal via a GitHub issue. Full notice in the dataset card.

Repository layout

writeup.md / writeup.pdf   The paper (methods, results, hypothesis evaluation)
plan.md                    Pre-registered plan (hypotheses + falsification criteria)

run.py                     Pipeline entry point (subcommands below)
stimuli.py                 Product catalogs, register specs, prompt templates, quality gates
generate_stimuli.py        Stimulus generation runner (via the `claude` CLI)
extraction.py              Hidden-state extraction from the transformer
analysis.py                RDMs, RSA, model RDMs, permutation testing
probes.py                  Linear probing classifiers (all-layer / control / zone)
viz.py                     Figure generation
md_to_pdf.py               Render writeup.md -> writeup.pdf

benchmark/                 ► Shippable RegProd-800 dataset, tasks, baselines, GPU-free repro
coherence/                 Separate "brand-coherence-validation" sub-experiment (not the paper)
data/                      Generated outputs — gitignored (see "Data" below)
tests/                     Test suite
plans/, artifacts/         Planning & design records

Reproduce from scratch

Pipeline stages (each reads/writes under data/):

python run.py generate     # generate 800 stimuli + BoW baseline + register checks
python run.py extract      # extract hidden states  (needs model + GPU; --pilot for a 5-stimulus dry run)
python run.py analyze      # RDMs, RSA, permutation tests
python run.py probe        # all-layer, control, and zone probes
python run.py report       # hypothesis tests, go/no-go, figures
python run.py all          # everything, end-to-end
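
The analyze stage includes permutation testing. A minimal sketch of a label-permutation test for an RDM fit follows, on synthetic data. It is not the repository's implementation: it uses Pearson correlation for brevity, and the helper names `rdm_fit` and `permutation_p_value`, the permutation count, and the toy sizes are assumptions made for this example.

```python
import numpy as np

def rdm_fit(activations, labels):
    """Pearson correlation between the neural RDM and a same/different label RDM."""
    neural_rdm = 1.0 - np.corrcoef(activations)
    model_rdm = (labels[:, None] != labels[None, :]).astype(float)
    upper = np.triu_indices(len(labels), k=1)
    return float(np.corrcoef(neural_rdm[upper], model_rdm[upper])[0, 1])

def permutation_p_value(activations, labels, n_permutations=500, seed=0):
    """One-sided p-value: how often shuffled labels fit at least as well."""
    rng = np.random.default_rng(seed)
    observed = rdm_fit(activations, labels)
    null = np.array([rdm_fit(activations, rng.permutation(labels))
                     for _ in range(n_permutations)])
    # Add-one correction keeps the estimate strictly above zero.
    p_value = (1 + np.sum(null >= observed)) / (1 + n_permutations)
    return observed, float(p_value)

rng = np.random.default_rng(1)
labels = np.repeat(np.arange(4), 5)                  # 4 conditions x 5 items
centers = rng.normal(size=(4, 32))
structured = centers[labels] + 0.3 * rng.normal(size=(20, 32))
unstructured = rng.normal(size=(20, 32))             # no label structure at all

obs_s, p_s = permutation_p_value(structured, labels)
obs_u, p_u = permutation_p_value(unstructured, labels)
print(f"structured:   r = {obs_s:.3f}, p = {p_s:.4f}")
print(f"unstructured: r = {obs_u:.3f}, p = {p_u:.4f}")
```

Shuffling the labels breaks any link between geometry and condition while keeping the activations fixed, so the null distribution reflects the fit expected by chance for this particular set of activations.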

Data

data/ is gitignored and not shipped. Its largest file is the raw extraction, data/Qwen_Qwen2.5-32B-Instruct-GPTQ-Int4_hidden_states.h5 (3.5 GB: mean-pooled residual, attention, and MLP activations for all 65 layer positions). You don't need it to use the benchmark: an fp16 residual-stream subset (~496 MB) that reproduces every published result on CPU is published with the Hugging Face dataset (adromero/regprod-800). That bundle is gitignored here too, so a GitHub clone won't include it; download it from Hugging Face into benchmark/activations/ to run reproduce.py. To regenerate the full file, run python run.py extract with the model available.
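
As a rough illustration of what "mean-pooled" and "fp16" mean for a single stimulus at a single layer, the sketch below averages token vectors over non-padding positions and then downcasts the result. How extraction.py actually selects and pools tokens is not described here, so the masking scheme, the helper name `masked_mean_pool`, and the toy sizes are assumptions.

```python
import numpy as np

def masked_mean_pool(hidden, mask):
    """Average token vectors over real tokens only.

    hidden: [tokens, dim] activations for one stimulus at one layer.
    mask:   [tokens] with 1 for real tokens and 0 for padding.
    """
    weights = mask.astype(hidden.dtype)[:, None]
    return (hidden * weights).sum(axis=0) / weights.sum()

rng = np.random.default_rng(0)
hidden = rng.normal(size=(6, 8)).astype(np.float32)   # toy sizes, not the model's
mask = np.array([1, 1, 1, 1, 0, 0])                   # last two positions are padding

pooled = masked_mean_pool(hidden, mask)
assert np.allclose(pooled, hidden[:4].mean(axis=0), atol=1e-6)

# Storing the pooled vector as fp16 halves its size at some cost in precision.
pooled_fp16 = pooled.astype(np.float16)
print(pooled.nbytes, pooled_fp16.nbytes)
print(float(np.abs(pooled - pooled_fp16.astype(np.float32)).max()))
```

Pooling collapses each description to one vector per layer position, which is why the stored files scale with the number of stimuli and layers rather than with text length.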

Environment

python -m venv .venv && source .venv/bin/activate
pip install -e .            # runtime deps; add [dev] for pytest/ruff

Built and run on Ubuntu 24.04 (WSL2), Python 3.12, RTX 5090 (32 GB). Extraction requires GPTQ support (pip install -e '.[gptq-fallback]'); the benchmark's reproduce.py needs only numpy/scipy (+ scikit-learn for the probe) and no GPU.

License

This repository is dual-licensed by artifact type:

Source code (run.py, *.py, the benchmark loaders) — MIT (see LICENSE).
The RegProd-800 dataset (everything under benchmark/) and the writeup (writeup.md / writeup.pdf) — CC BY-NC 4.0 (attribution, non-commercial; see benchmark/LICENSE).

The CC license covers the authors' own contributions only. Real brand/company names are third-party trademarks of their owners, used nominatively for non-commercial research; the synthetic text is AI-generated (see the disclaimer above). Removal requests: open an issue.

Citation

A. Romero (2026). The Protocol Layer Hypothesis: An Experimental Test of Register-Invariant Semantic Representations in Transformer Middle Layers.
