MixViz

Visualizing Cross-Head Mixing in Non-Standard Attention Architectures

Standard attention visualization tools like BertViz assume each head operates independently, displaying per-head attention heatmaps as if they tell the whole story. But architectures that mix information across heads before or during attention — such as Talking Heads and Interleaved Head Attention (IHA) — break this assumption. Their real computational structure is hidden inside cross-head mixing weights that no existing tool surfaces.

MixViz is an interactive visualization tool built to reveal what standard tools miss: how cross-head mixing reshapes attention patterns, and why per-head interpretability breaks down in architectures like IHA.

Live Demo

https://mixviz.up.railway.app

The System

The main interface has two linked views. On the left, the Sentence View shows a BertViz-style bipartite display of query-to-key attention for a single head, with a timeline scrubber that lets you step through each query position. On the right, the Attention Field plots every attention weight across all 8 heads as a circular dot cloud — each head gets its own color, and gold rings highlight weights that align with the ground-truth reasoning path.

Switch to the Mixing Matrix tab to see the learned alpha weights that IHA uses to combine heads before attention. This is the structure that standard tools completely hide — it shows which source heads contribute to each output head's query construction.
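
As a rough illustration of what the Mixing Matrix tab displays, the sketch below mixes per-head queries with a matrix of alpha weights before computing attention. It assumes the mixing is a plain linear combination over source heads. The function names, tensor shapes, row-normalized alpha, and random data are all hypothetical and are not taken from the MixViz or IHA code, whose exact formulation may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mix_queries(q, alpha):
    """Combine per-head queries across heads (assumed linear mixing).

    q:     (heads, seq_len, head_dim) per-head queries
    alpha: (heads, heads) mixing weights; alpha[h, j] is the assumed
           contribution of source head j to output head h
    """
    # mixed[h] = sum_j alpha[h, j] * q[j]
    return np.einsum("hj,jtd->htd", alpha, q)

def attention(q, k, alpha):
    """Scaled dot-product attention weights computed from mixed queries."""
    q_mixed = mix_queries(q, alpha)
    scores = np.einsum("htd,hsd->hts", q_mixed, k) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
heads, seq_len, head_dim = 8, 5, 4
q = rng.normal(size=(heads, seq_len, head_dim))
k = rng.normal(size=(heads, seq_len, head_dim))

# Identity alpha means no mixing, so this reduces to ordinary per-head attention.
identity_alpha = np.eye(heads)
# Hypothetical "learned" alpha: each output head is a weighted blend of all heads.
mixed_alpha = softmax(rng.normal(size=(heads, heads)), axis=-1)

attn_plain = attention(q, k, identity_alpha)
attn_mixed = attention(q, k, mixed_alpha)
```

With the identity matrix the per-head maps are unchanged. With a dense alpha, each head's attention map depends on every source head's queries, which is why a single heatmap no longer describes one independent head.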

The timeline playback lets you scrub or auto-play through query positions. As the scrubber moves, the sentence view updates attention lines in real time and the attention field highlights the current query row — making it easy to see how attention shifts along the reasoning chain.

Why This Exists

We trained three architectures (MHA, Talking Heads, and IHA) on a synthetic relation-composition task where the ground-truth reasoning path is fully known. All three solve the task near-perfectly (F1 > 0.999). But their attention patterns differ sharply:

Architecture	Alignment	Redundancy	Entropy	Best Epoch
MHA	0.507	0.190	1.604	46
Talking Heads	0.560	0.289	1.423	27
IHA	0.055	0.139	0.503	44

IHA's per-head alignment to the reasoning path is roughly 9x lower than MHA's (0.055 vs. 0.507). The model still solves the task, but the computation is distributed across the mixing structure in a way that individual attention maps can't reveal. For this architecture, standard per-head visualizations are actively misleading.

MixViz was built to make this visible — to show both the attention patterns and the mixing structure that shapes them.
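
To make the Alignment and Entropy columns more concrete, here is a minimal sketch of how such per-head metrics could be computed from an attention tensor. The definitions are assumptions chosen for illustration (alignment as the mean attention mass on each query's ground-truth key, entropy as the mean Shannon entropy of attention rows in nats). The metric code in `src/` may define them differently, and the toy data is hypothetical.

```python
import numpy as np

def attention_entropy(attn, eps=1e-12):
    """Mean Shannon entropy (in nats) over all attention rows.

    attn: (heads, queries, keys), each row summing to 1
    """
    row_entropy = -(attn * np.log(attn + eps)).sum(axis=-1)
    return float(row_entropy.mean())

def path_alignment(attn, target_keys):
    """Per-head mean attention mass on each query's ground-truth key.

    target_keys: (queries,) index of the key each query should attend to
                 (assumed encoding of the reasoning path)
    Returns an array of shape (heads,).
    """
    queries = np.arange(attn.shape[1])
    return attn[:, queries, target_keys].mean(axis=-1)

# Toy example: head 0 follows the path exactly, head 1 attends uniformly.
aligned_head = np.eye(3)
uniform_head = np.full((3, 3), 1.0 / 3.0)
attn = np.stack([aligned_head, uniform_head])
target_keys = np.array([0, 1, 2])

per_head_alignment = path_alignment(attn, target_keys)
uniform_entropy = attention_entropy(uniform_head[None])
aligned_entropy = attention_entropy(aligned_head[None])
```

Under these definitions the two metrics are independent: a head can have low entropy (sharp attention) and still low alignment if it attends sharply to positions off the reasoning path.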

Running Locally

Quick start (demo mode, no GPU needed)

Demo mode uses precomputed attention data — no model checkpoints or training required.

# Clone and install frontend
git clone https://github.com/danielle34/mixviz.git
cd mixviz/mixviz_web/frontend
npm install

# Start the backend (serves precomputed data)
cd ..
pip install flask numpy scipy
python server.py &

# Start the frontend
cd frontend
npm run dev

Open http://localhost:5173 in your browser.

Full setup (live mode with models)

Requires trained checkpoints in outputs/checkpoints/.

# Python environment
conda create -n attn_interp python=3.11
conda activate attn_interp
pip install torch flask scipy numpy

# Install frontend
cd mixviz_web/frontend
npm install

# Precompute attention data (optional, speeds up demo mode)
cd ../..
python mixviz_web/backend/precompute.py

# Run backend + frontend
python mixviz_web/server.py &
cd mixviz_web/frontend && npm run dev

Training models from scratch

python src/train.py --model_type mha --task binary --lr 1e-4 \
    --data_dir data --output_dir outputs/checkpoints/mha_binary_1e4

python src/train.py --model_type iha --task binary --lr 1e-4 \
    --data_dir data --output_dir outputs/checkpoints/iha_binary_1e4

python src/train.py --model_type talking_heads --task binary --lr 1e-4 \
    --data_dir data --output_dir outputs/checkpoints/talking_heads_binary_1e4
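
The three commands above follow one pattern, so they can also be generated from a small script. This is a convenience sketch, not part of the repo: the helper names `build_train_command` and `run_all` are hypothetical, and the output directory naming is inferred from the examples above. It only prints the commands unless `dry_run=False` is passed.

```python
import shlex
import subprocess

MODEL_TYPES = ["mha", "iha", "talking_heads"]

def build_train_command(model_type, task="binary", lr="1e-4"):
    """Build the argv list for one training run, mirroring the commands above."""
    lr_tag = lr.replace("-", "")  # "1e-4" -> "1e4", as in the example paths
    output_dir = f"outputs/checkpoints/{model_type}_{task}_{lr_tag}"
    return [
        "python", "src/train.py",
        "--model_type", model_type,
        "--task", task,
        "--lr", lr,
        "--data_dir", "data",
        "--output_dir", output_dir,
    ]

def run_all(dry_run=True):
    """Print each command; run it sequentially only when dry_run is False."""
    commands = [build_train_command(m) for m in MODEL_TYPES]
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # Must be executed from the repository root with data/ present.
            subprocess.run(cmd, check=True)
    return commands

commands = run_all(dry_run=True)
```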

What's in This Repo

Folder	Description
`mixviz_web/`	Interactive web visualization (Flask backend + React frontend)
`models/`	Attention architecture implementations (MHA, Talking Heads, IHA)
`src/`	Training, metric extraction, analysis scripts
`paper/`	Paper source (LaTeX)
`outputs/metrics/`	Experiment results (summary JSONs and per-model arrays)
`scripts/`	SLURM job launchers for training sweeps

Citation

If you use this work, please cite:

@article{adeyemi2026mixviz,
  title   = {MixViz: Visualizing Cross-Head Mixing in Non-Standard Attention Architectures},
  author  = {Adeyemi, Morayo Danielle},
  year    = {2026},
}

Acknowledgments

This work builds on and is inspired by:

Interleaved Head Attention (Duvvuri & Ekbote et al., 2026)
BertViz (Vig, 2019)
AttentionViz (Yeh et al., 2023)
