v0.20.0
What's Changed
- [engine] add API for concurrency rate and kv cache token limit by @AlpinDale in #1608
- [diffusion] aphrodite diffusion backend by @AlpinDale in #1607
- [cli][diffusion] only import diffusion backend when it is called by @AlpinDale in #1610
- [logger][metrics] log number of cache hits in the request-level logger by @AlpinDale in #1611
- [cli] add CLI arg for selecting attention backend by @AlpinDale in #1612
- fix: tokenizer server init by @AlpinDale in #1617
- [models] add support for GLM-4.7 Flash by @AlpinDale in #1620
- fix: mark GLM-4 MoE Lite as an MLA model by @AlpinDale in #1621
- fix: compute engine max_concurrency from worker KV cache configs by @lucyknada in #1622
- feat: add support for the Qwen3.5 family of models by @AlpinDale in #1624
- feat: update aphrodite to 0.20.0 by @AlpinDale in #1628
- feat: add tensor parallel support for exllamav3 by @AlpinDale in #1629
- chore: remove unused csrc code by @AlpinDale in #1630
- chore: bump cuda to 13.0 by @AlpinDale in #1631
- chore: sync to upstream vllm f768b4473e1bd55023dcaff63984cfdd08902fc8 by @AlpinDale in #1632
- chore: massively improve DRY performance by @AlpinDale in #1634
- feat: optimize lm_head by fusing more kernels and actually quantizing lm_head by @AlpinDale in #1635
New Contributors
- @lucyknada made their first contribution in #1622
Full Changelog: v0.10.0...v0.20.0