Highlights
- Unified attention support end-to-end, including new cache structures and eSurge compatibility.
- Faster GPU/TPU inference via backend-aware attention selection, KV-cache update optimizations, and smarter compilation/batching defaults.
Added
- Unified attention mechanism across attention layers and generation/scheduler paths.
- New cache types for unified attention, including `UnifiedAttentionCache` and related config/view helpers.
- `HybridCache` support and expanded unified-attention cache integration.
Performance & Behavior Changes
- GPU inference now prefers `unified_attention` and TPU inference prefers `ragged_page_attention_v3`, with warnings when a suboptimal mechanism is selected.
- KV-cache updates are optimized for GPU latency (vectorized scatter approach; improved memory donation behavior).
- eSurge compilation is capped to the scheduler’s actual per-step token budget to reduce startup time for long-context models.
- `runner.compile()` now accepts `max_num_batched_tokens` for fine-grained compilation control.
- GPU/TPU-aware auto-defaults for `max_num_batched_tokens` (GPU: >= 2048 tokens/step, TPU: >= 8192 tokens/step) and higher TPU defaults.
- Performance tuning updates (numexpr threading configuration, JAX PGLE enablement, and XLA GPU flag fixes).
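
The backend preferences and token-budget floors listed above can be illustrated with a small sketch. This is not the library's implementation: the helper names (`check_attention_choice`, `auto_max_num_batched_tokens`, `compile_token_cap`) are hypothetical, and treating the auto-default as a floor applied to a candidate value, and the compilation cap as a simple minimum, are assumptions made for illustration. Only the mechanism names and the 2048/8192 figures come from these notes.

```python
import warnings

# Preferred attention mechanism per backend, as stated in these notes.
PREFERRED_ATTENTION = {"gpu": "unified_attention", "tpu": "ragged_page_attention_v3"}
# Lower bounds for the auto-default, in tokens per step, as stated in these notes.
MIN_BATCHED_TOKENS = {"gpu": 2048, "tpu": 8192}


def check_attention_choice(backend: str, mechanism: str) -> bool:
    """Hypothetical helper: warn when `mechanism` is not the preferred one for `backend`."""
    preferred = PREFERRED_ATTENTION.get(backend)
    if preferred is None or mechanism == preferred:
        return True
    warnings.warn(
        f"{mechanism!r} may be suboptimal on {backend}; consider {preferred!r}.",
        stacklevel=2,
    )
    return False


def auto_max_num_batched_tokens(backend: str, candidate: int) -> int:
    """Hypothetical helper: raise a candidate budget to the backend floor (assumed behavior)."""
    return max(candidate, MIN_BATCHED_TOKENS.get(backend, candidate))


def compile_token_cap(max_model_len: int, max_num_batched_tokens: int) -> int:
    """Hypothetical helper: never compile for more tokens than one scheduler step can hold."""
    return min(max_model_len, max_num_batched_tokens)
```

Under these assumptions, a long-context model with a 131072-token context and an 8192-token step budget would only compile shapes up to 8192 tokens, which is the startup-time saving the notes describe.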
Evaluation
- eSurge `lm-eval` adapter improvements: exact teacher-forced log-likelihood scoring, rolling-window perplexity, per-request stop sequences, improved `greedy_until`, and more robust tokenization/chat-template fallbacks.
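
For readers unfamiliar with the scoring terms above, here is a minimal, dependency-free sketch of teacher-forced log-likelihood and the perplexity derived from it. It is a generic illustration, not the adapter's code: the function names and the plain-list logits format are assumptions. Teacher forcing means each target token is scored given the true preceding tokens rather than the model's own samples.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def teacher_forced_loglikelihood(step_logits, target_ids):
    """Sum log p(target_t | true prefix) over all steps.

    step_logits[t] holds the logits the model produced for position t
    when fed the true tokens before t. Also reports whether greedy
    decoding would have reproduced every target token.
    """
    if len(step_logits) != len(target_ids):
        raise ValueError("one logits row is required per target token")
    total = 0.0
    is_greedy = True
    for logits, target in zip(step_logits, target_ids):
        total += log_softmax(logits)[target]
        argmax = max(range(len(logits)), key=logits.__getitem__)
        if argmax != target:
            is_greedy = False
    return total, is_greedy


def perplexity(total_logprob, num_tokens):
    """Perplexity is exp of the negative mean token log-probability."""
    return math.exp(-total_logprob / num_tokens)
```

A rolling-window variant would apply the same scoring to successive windows of a long document, counting each token once, and compute perplexity from the combined total.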
Fixes
- Dtype conversion adjustments in the bridge for more consistent behavior.
- Linting fixes in tests and the Xerxes model.
Dependency Updates
- Upgrade `ejkernel` to v0.0.50.
Merged PRs
- #249: Unified attention mechanism + caching structures.
- #250: Bridge dtype conversion updates.
- #251: eSurge Speedup v1.
What's Changed
- feat: Add unified attention mechanism and related caching structures by @erfanzar in #249
- modify dtype conversation in bridge. by @erfanzar in #250
- eSurge Speedup v1 (`esurge/speedup-v1`) by @erfanzar in #251
Full Changelog: v0.2.0.1...v0.2.0.2