Standard attention visualization tools like BertViz assume each head operates independently, displaying per-head attention heatmaps as if they tell the whole story. But architectures that mix information across heads before or during attention — such as Talking Heads and Interleaved Head Attention (IHA) — break this assumption. Their real computational structure is hidden inside cross-head mixing weights that no existing tool surfaces.
MixViz is an interactive visualization tool built to reveal what standard tools miss: how cross-head mixing reshapes attention patterns, and why per-head interpretability breaks down in architectures like IHA.
The main interface has two linked views. On the left, the Sentence View shows a BertViz-style bipartite display of query-to-key attention for a single head, with a timeline scrubber that lets you step through each query position. On the right, the Attention Field plots every attention weight across all 8 heads as a circular dot cloud — each head gets its own color, and gold rings highlight weights that align with the ground-truth reasoning path.
Switch to the Mixing Matrix tab to see the learned alpha weights that IHA uses to combine heads before attention. This is the structure that standard tools completely hide — it shows which source heads contribute to each output head's query construction.
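To make the idea of mixing concrete, here is a minimal NumPy sketch of attention computed from queries that are first combined across heads by an alpha matrix. It is an illustration under stated assumptions, not the IHA implementation in this repo: the function name `mixed_query_attention`, the `(heads, tokens, dim)` tensor layout, the convention that `alpha[h, g]` weights source head `g` in output head `h`, and the absence of masking are all hypothetical choices made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mixed_query_attention(Q, K, alpha):
    """Attention weights after mixing queries across heads.

    Q, K:  arrays of shape (H, T, d), one query/key matrix per head.
    alpha: array of shape (H, H); alpha[h, g] is the (assumed) weight of
           source head g in the query of output head h.
    Returns attention weights of shape (H, T, T).
    """
    d = Q.shape[-1]
    # Each output head's query is a weighted sum of all source heads' queries.
    Q_mixed = np.einsum("hg,gtd->htd", alpha, Q)
    scores = Q_mixed @ K.transpose(0, 2, 1) / np.sqrt(d)
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
H, T, d = 8, 5, 16
Q = rng.normal(size=(H, T, d))
K = rng.normal(size=(H, T, d))

# With an identity mixing matrix, every head only sees its own query.
identity_attn = mixed_query_attention(Q, K, np.eye(H))

# With a dense mixing matrix, each head's map depends on all source heads.
alpha = rng.normal(size=(H, H))
mixed_attn = mixed_query_attention(Q, K, alpha)
```

With `alpha` set to the identity, the result matches ordinary per-head attention. With a dense `alpha`, a single head's heatmap no longer reflects a single head's query, which is why the mixing matrix needs to be inspected alongside the attention maps.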
The timeline playback lets you scrub or auto-play through query positions. As the scrubber moves, the sentence view updates attention lines in real time and the attention field highlights the current query row — making it easy to see how attention shifts along the reasoning chain.
We trained three architectures (MHA, Talking Heads, and IHA) on a synthetic relation-composition task where the ground-truth reasoning path is fully known. All three solve the task almost perfectly (F1 > 0.999), but their attention patterns differ sharply:
| Architecture | Alignment | Redundancy | Entropy | Best Epoch |
|---|---|---|---|---|
| MHA | 0.507 | 0.190 | 1.604 | 46 |
| Talking Heads | 0.560 | 0.289 | 1.423 | 27 |
| IHA | 0.055 | 0.139 | 0.503 | 44 |
IHA's per-head alignment to the reasoning path is roughly 9x lower than MHA's (0.055 vs. 0.507). The model still solves the task, but the computation is distributed across the mixing structure in a way that individual attention maps can't reveal. For this architecture, standard per-head visualizations are actively misleading.
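For readers who want a feel for what metrics like these measure, the sketch below computes three quantities from a stack of attention maps. The definitions are assumptions chosen for illustration and may differ from the ones used to produce the table: alignment is taken as the mean attention mass a head places on the ground-truth key for each query, entropy as the mean Shannon entropy of attention rows (in nats), and redundancy as the mean pairwise cosine similarity between flattened head maps. All function names are hypothetical.

```python
import numpy as np

def path_alignment(attn, path):
    """Per-head mean attention mass on the ground-truth (query, key) pairs.

    attn: (H, T, T) attention weights; path: list of (query_pos, key_pos).
    Assumed definition, for illustration only.
    """
    q, k = np.array(path).T
    return attn[:, q, k].mean(axis=-1)

def row_entropy(attn, eps=1e-12):
    """Mean Shannon entropy (nats) over all attention rows and heads."""
    return float(-(attn * np.log(attn + eps)).sum(axis=-1).mean())

def head_redundancy(attn):
    """Mean cosine similarity between distinct heads' flattened maps."""
    H = attn.shape[0]
    flat = attn.reshape(H, -1)
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    sim = flat @ flat.T
    return float(sim[~np.eye(H, dtype=bool)].mean())

# Toy example: head 0 attends exactly along the path, head 1 is uniform.
path = [(0, 0), (1, 0), (2, 1)]
sharp = np.zeros((3, 3))
for q_pos, k_pos in path:
    sharp[q_pos, k_pos] = 1.0
uniform = np.full((3, 3), 1.0 / 3.0)
attn = np.stack([sharp, uniform])

print("alignment per head:", path_alignment(attn, path))
print("mean row entropy:", row_entropy(attn))
print("redundancy:", head_redundancy(attn))
```

In this toy case the path-following head gets full alignment and zero entropy, while the uniform head gets neither. A model whose heads all look like the uniform one could still solve the task if the relevant structure lives in the mixing weights.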
MixViz was built to make this visible — to show both the attention patterns and the mixing structure that shapes them.
Demo mode uses precomputed attention data — no model checkpoints or training required.
```bash
# Clone and install frontend
git clone https://github.com/danielle34/mixviz.git
cd mixviz/mixviz_web/frontend
npm install

# Start the backend (serves precomputed data)
cd ..
pip install flask numpy scipy
python server.py &

# Start the frontend
cd frontend
npm run dev
```

Open http://localhost:5173 in your browser.
Requires trained checkpoints in outputs/checkpoints/.
```bash
# Python environment
conda create -n attn_interp python=3.11
conda activate attn_interp
pip install torch flask scipy numpy

# Install frontend
cd mixviz_web/frontend
npm install

# Precompute attention data (optional, speeds up demo mode)
cd ../..
python mixviz_web/backend/precompute.py

# Run backend + frontend
python mixviz_web/server.py &
cd mixviz_web/frontend && npm run dev
```

To train the three architectures:

```bash
python src/train.py --model_type mha --task binary --lr 1e-4 \
    --data_dir data --output_dir outputs/checkpoints/mha_binary_1e4
python src/train.py --model_type iha --task binary --lr 1e-4 \
    --data_dir data --output_dir outputs/checkpoints/iha_binary_1e4
python src/train.py --model_type talking_heads --task binary --lr 1e-4 \
    --data_dir data --output_dir outputs/checkpoints/talking_heads_binary_1e4
```

| Folder | Description |
|---|---|
| mixviz_web/ | Interactive web visualization (Flask backend + React frontend) |
| models/ | Attention architecture implementations (MHA, Talking Heads, IHA) |
| src/ | Training, metric extraction, analysis scripts |
| paper/ | Paper source (LaTeX) |
| outputs/metrics/ | Experiment results (summary JSONs and per-model arrays) |
| scripts/ | SLURM job launchers for training sweeps |
If you use this work, please cite:
```bibtex
@article{adeyemi2026mixviz,
  title  = {MixViz: Visualizing Cross-Head Mixing in Non-Standard Attention Architectures},
  author = {Adeyemi, Morayo Danielle},
  year   = {2026},
}
```

This work builds on and is inspired by:
- Interleaved Head Attention (Duvvuri & Ekbote et al., 2026)
- BertViz (Vig, 2019)
- AttentionViz (Yeh et al., 2023)



