A comprehensive, production-quality, educational transformer implementation with complete visualization, evaluation, and documentation.
Unlike typical transformer implementations, this provides:
✅ Complete Observability - Track both information highways (residual + K/V streams)
✅ 20+ Visualizations - From attention heatmaps to animated flow diagrams
✅ 74 Tests - Comprehensive coverage of all components
✅ Full Documentation - 20+ pages of concepts, guides, and API reference
✅ Educational Design - Built to understand, not just use
✅ Research-Ready - Easy to modify and extend
✅ Production-Quality - Clean, tested, documented code
The transformer processes information through two distinct pathways:
Information flows UP through layers at each position, carrying accumulated representation through depth.
Information flows RIGHT across positions via attention, enabling dynamic information sharing.
For n layers and m positions: C(n+m, n) distinct computational paths!
- 6 layers × 32 positions: C(38, 6) ≈ 2.8 million paths
- 12 layers × 512 positions: C(524, 12) ≈ 10^24 paths
This massive redundancy enables robust, nuanced information processing.
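The arithmetic is easy to check with Python's `math.comb`; a quick sketch under the same lattice-path assumption:

```python
from math import comb

# C(n + m, n) lattice-path count from the formula above:
# n layer-steps (up) interleaved with m position-steps (right).
def path_count(n_layers: int, n_positions: int) -> int:
    return comb(n_layers + n_positions, n_layers)

print(f"{path_count(6, 32):,}")             # 2,760,681 (~2.8 million)
print(f"{float(path_count(12, 512)):.1e}")  # ~7.8e+23
```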
⚡ See QUICK_START.md for 5-minute setup guide
The fastest way to get started - install, verify, and run everything:
```bash
git clone <repository-url>
cd transformer
python3 run.py
```
This will:
- ✅ Auto-detect and use `uv` for fast installation (falls back to pip; see the sketch below)
- ✅ Install dependencies
- ✅ Verify installation
- ✅ Run all examples
- ✅ Generate animations
- ✅ Create comprehensive outputs
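The uv auto-detection works roughly like the sketch below (illustrative only; run.py contains the actual logic):

```python
import shutil
import subprocess
import sys

# Prefer uv when it is on PATH, otherwise fall back to pip.
# Sketch of the idea only - not run.py's actual code.
if shutil.which("uv"):
    cmd = ["uv", "pip", "install", "-r", "requirements.txt"]
else:
    cmd = [sys.executable, "-m", "pip", "install", "-r", "requirements.txt"]
subprocess.run(cmd, check=True)
```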
Recommended: Install uv for 10-100x faster dependency installation
```bash
# Install uv (optional but recommended)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Then run normally - uv will be auto-detected
python3 run.py
```
Options:
```bash
python3 run.py --skip-install    # Skip dependency check
python3 run.py --examples-only   # Only run examples
python3 run.py --use-venv        # Create and use virtual environment
```
Or install manually:
```bash
git clone <repository-url>
cd transformer
pip install -r requirements.txt
```
Verify the installation:
```bash
python3 scripts/verify_installation.py
```
Basic usage:
```python
import torch

from src.transformer import Transformer, TransformerConfig
from src.visualization import visualize_attention
# Create model
config = TransformerConfig(
    vocab_size=10000,
    d_model=512,
    n_heads=8,
    n_layers=6
)
model = Transformer(config)
# Forward pass with state tracking
input_ids = torch.randint(0, 10000, (2, 10))
logits, state = model(input_ids, return_state=True)
# Visualize attention
attn = state.get_attention_weights(layer_idx=0)
visualize_attention(attn, save_path='attention.png')
```

Quick configuration with presets:
```python
from src.config import get_preset_config
# Quick configuration
config = get_preset_config('small') # or 'base', 'large', etc.
model = Transformer(config)
```

Basic:
- Attention heatmaps (all heads, all layers)
- Residual stream evolution
- K/V stream patterns
- Layer statistics
Advanced:
- Gradient flow analysis
- Activation patterns
- Causal graph representation
- Attention flow Sankey diagrams
- Token journey tracking
- Head specialization analysis
Animations:
- Information flow through layers
- Attention evolution
- Specific path tracing
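For example, the per-layer attention heatmaps can be generated in a loop with the calls shown in the basic usage example above (a sketch; assumes `model`, `config`, and `input_ids` from that example):

```python
from src.visualization import visualize_attention

# One heatmap per layer from a single tracked forward pass.
logits, state = model(input_ids, return_state=True)
for layer_idx in range(config.n_layers):
    attn = state.get_attention_weights(layer_idx=layer_idx)
    visualize_attention(attn, save_path=f'attention_layer_{layer_idx}.png')
```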
- Perplexity and accuracy metrics
- Inference speed benchmarking
- Training speed measurement
- Model profiling
- Configuration comparison
- Metrics tracking system
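For reference, the perplexity reported here is simply the exponential of the mean cross-entropy; a minimal standalone sketch (not the internals of `evaluate_model`):

```python
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    # logits: (batch, seq_len, vocab_size); targets: (batch, seq_len)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    return loss.exp().item()
```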
- Save/load configs (JSON/YAML)
- 7 preset configurations (debug to xlarge)
- Configuration validation
- Easy modification
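As a generic illustration of the JSON round-trip (field names match the earlier example; the repo's own save/load helpers may expose a different interface):

```python
import json

from src.transformer import TransformerConfig

cfg = {"vocab_size": 10000, "d_model": 512, "n_heads": 8, "n_layers": 6}
with open("config.json", "w") as f:
    json.dump(cfg, f, indent=2)

with open("config.json") as f:
    config = TransformerConfig(**json.load(f))
```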
- `TrainingVisualizer` for monitoring
- Weight distribution tracking
- Learning dynamics analysis
- Gradient visualization
74 tests covering:
- All components individually
- Full model integration
- Visualization functions
- Evaluation metrics
- Configuration management
transformer/
├── run.py # ⭐ One-command setup & execution
├── README.md # This file
├── requirements.txt # Dependencies
│
├── src/
│ ├── transformer/ # Core model & components
│ ├── visualization/ # 20+ visualization functions
│ ├── evaluation/ # Metrics & benchmarking
│ ├── config/ # Configuration management
│ └── utils/ # Helper functions
│
├── tests/ # 74 comprehensive tests
├── examples/ # 4 complete examples
├── scripts/ # Utility scripts
│ ├── verify_installation.py
│ ├── generate_all_animations.py
│ ├── generate_complete_documentation.py
│ └── run_comprehensive_validation.py
│
├── doc/ # Complete documentation
│ ├── concepts/ # Core concepts explained
│ ├── guides/ # User guides
│ ├── api/ # API reference
│ ├── ARCHITECTURE.md # Architecture deep dive
│ ├── OVERVIEW.md # Complete overview
│ └── FAQ.md # 40+ questions answered
│
└── outputs/ # Generated outputs (animations, analysis, etc.)
Run the included examples:
```bash
# Basic usage and visualization
python3 examples/basic_usage.py

# Path analysis and complexity
python3 examples/path_analysis.py

# Create animations
python3 examples/animation_demo.py

# Full training workflow
python3 examples/training_example.py
```
📚 Start here: doc/DOCUMENTATION_INDEX.md - Master documentation hub
Quick Links:
- Quick Start Guide - Get started in 5 minutes
- Learning Paths - 6 structured paths (2-6 hours each)
- FAQ - 45 questions answered
- Glossary - 150+ terms defined
- Practical Recipes - 12 copy-paste solutions
Theory & Concepts:
- Transformer Theory - Complete conceptual foundation
- Mathematics - Rigorous mathematical derivations
- Statistics - Empirical analysis and best practices
- Information Flow - Dual highway architecture
- Attention Mechanism - Q, K, V explained
Implementation:
- Implementation Guide - Complete practical guide
- Visualization Guide - All 20+ visualization functions
- Animation Guide - Creating animated visualizations
Total Documentation: ~100,000 words across 28 comprehensive files
```bash
# Run all tests
pytest tests/ -v

# Run specific test suites
pytest tests/test_components.py -v
pytest tests/test_visualization.py -v
pytest tests/test_evaluation.py -v
```
All 74 tests passing ✅
```python
from src.config import get_preset_config, list_presets

# Available presets
presets = list_presets()
# debug, minimal, tiny, small, base, large, xlarge

# Use preset
config = get_preset_config('base')
```

| Preset | d_model | Layers | Params | Use Case |
|---|---|---|---|---|
| debug | 64 | 2 | ~40K | Quick testing |
| tiny | 128 | 2 | ~150K | Small experiments |
| small | 256 | 4 | ~1.2M | Most experiments |
| base | 512 | 6 | ~40M | Serious training |
| large | 1024 | 12 | ~300M | Large-scale |
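Parameter counts vary with vocabulary size; to check a preset on your machine (standard PyTorch, not a repo-specific API):

```python
from src.config import get_preset_config
from src.transformer import Transformer

config = get_preset_config('base')
model = Transformer(config)
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```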
Visualization examples:
```python
from src.visualization import (
    visualize_attention,
    visualize_information_highways,
    animate_information_flow,
    visualize_head_specialization,
    visualize_token_journey
)
# Get state from forward pass
logits, state = model(input_ids, return_state=True)
# Basic attention visualization
visualize_attention(state.get_attention_weights(0))
# Information highways (both streams)
visualize_information_highways(state)
# Animated flow
animate_information_flow(state, save_path='flow.gif')
# Head analysis
visualize_head_specialization(state)
# Track single token
visualize_token_journey(state, token_position=5)
```

Training monitoring:
```python
from src.visualization import TrainingVisualizer
viz = TrainingVisualizer()
for epoch in range(num_epochs):
    # Train
    train_loss = train_epoch(model, train_loader, optimizer)
    val_loss = evaluate(model, val_loader)

    # Log metrics
    viz.log_metrics(
        train_loss=train_loss,
        val_loss=val_loss,
        learning_rate=optimizer.param_groups[0]['lr']
    )

    # Periodic visualization
    if epoch % 10 == 0:
        viz.plot_training_curves(f'training_epoch_{epoch}.png')

# Save history
viz.save_history('training_history.json')
```

Evaluation and benchmarking:
```python
from src.evaluation import evaluate_model, benchmark_inference, profile_model
# Comprehensive evaluation
results = evaluate_model(model, val_loader, device)
print(results) # Perplexity, accuracy, top-5 accuracy
# Benchmark inference
bench_results = benchmark_inference(
    model,
    input_shape=(2, 128),
    vocab_size=config.vocab_size,
    device=device
)
# Profile model
profile = profile_model(model, input_shape=(2, 128), vocab_size=config.vocab_size, device=device)
print(f"Parameters: {profile['total_parameters']:,}")
print(f"Forward time: {profile['forward_time_ms']:.2f} ms")This implementation embodies:
🎯 Transparency - No black boxes, everything observable
📚 Education - Built to learn from, not just use
🔬 Research - Easy to modify and experiment
✨ Quality - Production-grade, tested code
🧩 Modularity - Composable, reusable components
📖 Documentation - Comprehensive, clear explanations
- Education: Learn transformer architecture
- Research: Experiment with modifications
- Prototyping: Quick testing of ideas
- Analysis: Deep model interpretation
- Visualization: Create publication-quality figures
This implementation prioritizes clarity over speed:
- ~5-10x slower than optimized implementations
- Full state tracking adds overhead
- Suitable for small-to-medium experiments
- For production: use PyTorch native or HuggingFace
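To gauge the tracking overhead on your own hardware, a quick manual timing sketch (the `benchmark_inference` helper above is the more thorough option; calling the model without `return_state` for an untracked pass is an assumption here, and `model`/`config` come from the earlier examples):

```python
import time

import torch

model.eval()
input_ids = torch.randint(0, config.vocab_size, (2, 128))

with torch.no_grad():
    start = time.perf_counter()
    logits, state = model(input_ids, return_state=True)  # tracked forward pass
    tracked_ms = (time.perf_counter() - start) * 1000

    start = time.perf_counter()
    logits = model(input_ids)  # assumed: plain forward pass, no state tracking
    plain_ms = (time.perf_counter() - start) * 1000

print(f"tracked: {tracked_ms:.1f} ms | plain: {plain_ms:.1f} ms")
```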
✅ Complete transformer with state tracking
✅ 20+ visualization functions
✅ Training & evaluation tools
✅ Configuration management
✅ 74 comprehensive tests
✅ 20+ documentation pages
✅ 4 complete examples
✅ Path analysis tools
✅ Benchmarking system
All standard, well-maintained libraries:
- PyTorch ≥2.0.0
- NumPy, Matplotlib, Seaborn, Plotly
- PyYAML, NetworkX
- Pytest (for testing)
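A quick manual check that the core dependencies are importable (scripts/verify_installation.py does this more thoroughly):

```python
import torch, numpy, matplotlib, seaborn, plotly, yaml, networkx

print("torch", torch.__version__)  # expect >= 2.0.0
print("numpy", numpy.__version__)
```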
Contributions welcome! See doc/development/contributing.md
Areas of interest:
- New visualizations
- Performance improvements
- Documentation improvements
- Example notebooks
- Educational content
[Your chosen license]
If you use this in your research, please cite:
```bibtex
@software{first_principles_transformer,
  title={First-Principles Transformer: A Comprehensive Educational Implementation},
  author={[Your name]},
  year={2025},
  url={[repository-url]}
}
```
Built on foundational work:
- "Attention Is All You Need" (Vaswani et al., 2017)
- Mechanistic interpretability research
- PyTorch ecosystem
- 📊 PROJECT_STATUS.md - Current status, quick links, and overview
- 📚 Documentation Index - Master documentation hub
- 🔍 Comprehensive Review - Complete quality review
- 📖 Architecture - Information flow architecture
- 🎬 Animation Index - Animation reference guide
Built with clarity, tested thoroughly, visualized completely, documented comprehensively.