feat: add support for Qwen3-Next model and add Flash Linear Kernels#1510

Merged: AlpinDale merged 3 commits into main from qwen3_next on Sep 10, 2025

AlpinDale (Member) commented on Sep 10, 2025

The FLA kernels are based on https://github.com/fla-org/flash-linear-attention; the Qwen3-Next modeling code is based on vllm-project/vllm#24526.

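Speeds are great. On the same 8x 3090 setup with a prompt of about 1,900 tokens, Qwen3-Next-80B-A3B reaches roughly 5x the prefill throughput and 3x the decode throughput of Llama-3.1 70B: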

Llama-3.1 70B, 8x 3090, TP=8:
E2E time: 12.89s, TTFT: 2.38s, Prefill: 1907 tokens (800.5 tokens/s), Decode: 306 tokens (29.1 tokens/s)

Qwen3-Next-80B-A3B, 8x 3090, TP=8:
E2E time: 4.33s, TTFT: 0.47s, Prefill: 1905 tokens (4051.1 tokens/s), Decode: 355 tokens (92.0 tokens/s)
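For a quick sanity check of the two runs above, the reported throughputs can be recomputed from the token counts and timings. This is only a sketch: it assumes prefill throughput is prompt tokens divided by TTFT and decode throughput is generated tokens divided by (E2E time minus TTFT), which is not stated in the PR but agrees with the reported figures to within about 1%. The dictionary keys and function names are made up for illustration.

```python
# Figures copied from the two runs reported above (8x 3090, TP=8).
RUNS = {
    "Llama-3.1 70B": {
        "e2e_s": 12.89, "ttft_s": 2.38,
        "prefill_tokens": 1907, "prefill_tps": 800.5,
        "decode_tokens": 306, "decode_tps": 29.1,
    },
    "Qwen3-Next-80B-A3B": {
        "e2e_s": 4.33, "ttft_s": 0.47,
        "prefill_tokens": 1905, "prefill_tps": 4051.1,
        "decode_tokens": 355, "decode_tps": 92.0,
    },
}


def derived_rates(run):
    """Recompute (prefill, decode) tokens/s under the assumed definitions."""
    prefill_tps = run["prefill_tokens"] / run["ttft_s"]
    decode_tps = run["decode_tokens"] / (run["e2e_s"] - run["ttft_s"])
    return prefill_tps, decode_tps


def ratios(baseline, candidate):
    """Ratios > 1 mean the candidate run was faster than the baseline."""
    return {
        "e2e": baseline["e2e_s"] / candidate["e2e_s"],
        "ttft": baseline["ttft_s"] / candidate["ttft_s"],
        "prefill": candidate["prefill_tps"] / baseline["prefill_tps"],
        "decode": candidate["decode_tps"] / baseline["decode_tps"],
    }


if __name__ == "__main__":
    for name, run in RUNS.items():
        prefill, decode = derived_rates(run)
        print(f"{name}: prefill ~{prefill:.1f} tok/s, decode ~{decode:.1f} tok/s")
    result = ratios(RUNS["Llama-3.1 70B"], RUNS["Qwen3-Next-80B-A3B"])
    for metric, value in result.items():
        print(f"{metric}: {value:.2f}x")
```

Note that the two runs generated different numbers of tokens (306 vs. 355), so the per-token rates are a fairer comparison than the raw E2E times.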

AlpinDale merged commit 5ffc240 into main on Sep 10, 2025
0 of 4 checks passed
AlpinDale deleted the qwen3_next branch on September 10, 2025 17:30