v0.20.0

AlpinDale released this 26 Apr 12:27

c0178f1

What's Changed

[engine] add API for concurrency rate and kv cache token limit by @AlpinDale in #1608
[diffusion] aphrodite diffusion backend by @AlpinDale in #1607
[cli][diffusion] only import diffusion backend when it is called by @AlpinDale in #1610
[logger][metrics] log number of cache hits in the request-level logger by @AlpinDale in #1611
[cli] add CLI arg for selecting attention backend by @AlpinDale in #1612
fix: tokenizer server init by @AlpinDale in #1617
[models] add support for GLM-4.7 Flash by @AlpinDale in #1620
fix: mark GLM-4 MoE Lite as an MLA model by @AlpinDale in #1621
fix: compute engine max_concurrency from worker KV cache configs by @lucyknada in #1622
feat: add support for the Qwen3.5 family of models by @AlpinDale in #1624
feat: update aphrodite to 0.20.0 by @AlpinDale in #1628
feat: add tensor parallel support for exllamav3 by @AlpinDale in #1629
chore: remove unused csrc code by @AlpinDale in #1630
chore: bump cuda to 13.0 by @AlpinDale in #1631
chore: sync to upstream vllm f768b4473e1bd55023dcaff63984cfdd08902fc8 by @AlpinDale in #1632
chore: massively improve DRY performance by @AlpinDale in #1634
feat: optimize lm_head by fusing more kernels and actually quantizing lm_head by @AlpinDale in #1635

New Contributors

@lucyknada made their first contribution in #1622

Full Changelog: v0.10.0...v0.20.0

Contributors

AlpinDale and lucyknada
