feat: implement DeepSeek-V4 model #1651
Signed-off-by: AlpinDale <alpindale@gmail.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 22277e7f47
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
dim3 block(std::min<int64_t>(num_heads * rot_dim / 2, 512));
const at::cuda::OptionalCUDAGuard device_guard(device_of(query));
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
auto cache_f32 = cos_sin_cache.to(torch::kFloat32);
Avoid per-call fp32 conversion of RoPE cache
This change unconditionally calls cos_sin_cache.to(torch::kFloat32) inside rotary_embedding, so the conversion runs on every kernel invocation. Most callers (e.g. RotaryEmbedding.forward_cuda) pass fp16/bf16 caches, so each decode step now pays for a full cache dtype conversion and allocation, causing major throughput and memory regressions across non-DeepSeek models. The conversion should be done once (or only when needed) rather than in the hot path.
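One way to read "done once" is to keep a single fp32 copy next to the original cache and reuse it on every call. The following is a minimal Python sketch of that idea, not the PR's code: the `Fp32CacheHolder` class and its `fp32()` method are hypothetical names introduced for illustration, and the tensor shape is arbitrary.

```python
import torch


class Fp32CacheHolder:
    """Hypothetical helper: holds a RoPE cos/sin cache and one lazily built fp32 copy."""

    def __init__(self, cos_sin_cache: torch.Tensor) -> None:
        self.cos_sin_cache = cos_sin_cache
        self._fp32 = None  # filled on first request, reused afterwards

    def fp32(self) -> torch.Tensor:
        if self._fp32 is None:
            # Tensor.to returns the tensor itself when the dtype already matches,
            # so an fp32 cache is not copied here.
            self._fp32 = self.cos_sin_cache.to(torch.float32)
        return self._fp32


# An fp16 cache is converted on the first call only.
holder = Fp32CacheHolder(torch.zeros(16, 8, dtype=torch.float16))
first = holder.fp32()
second = holder.fp32()

# An fp32 cache is passed through without a copy.
already_fp32 = torch.zeros(4, 2, dtype=torch.float32)
passthrough = Fp32CacheHolder(already_fp32).fp32()
```

The trade-off in this sketch is that the fp32 copy goes stale if the original cache is later rebuilt or moved to another device, so a real implementation would need to invalidate it in those cases.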
Signed-off-by: AlpinDale <alpindale@gmail.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8274e7401b
if any(
    isinstance(g.kv_cache_spec, MambaSpec) and g.kv_cache_spec.block_size != cache_config.block_size for g in groups
):
    return scheduler_block_size, scheduler_block_size
Keep hash_block_size as a divisor for hybrid Mamba
When a hybrid configuration includes a non-align MambaSpec, this branch returns hash_block_size = scheduler_block_size (the LCM of all group block sizes). For mixed block sizes, that value is larger than at least one group size, so it violates the downstream requirement that each block_size be divisible by hash_block_size (see the assertion in HybridKVCacheCoordinator). In practice this causes coordinator initialization to fail for hybrid Mamba+attention setups with prefix caching/connector enabled, instead of gracefully disabling finer-grained hashing.
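The arithmetic behind this can be checked in isolation. The sketch below uses made-up block sizes (16 and 48) and a hypothetical `violating_groups` helper; it only illustrates why an LCM cannot divide every group's block size when the sizes differ, and it does not reproduce the coordinator's actual assertion. The GCD shown at the end is just one value that satisfies the divisibility requirement, not necessarily the fix the project should adopt.

```python
from functools import reduce
from math import gcd


def lcm(a: int, b: int) -> int:
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def violating_groups(hash_block_size: int, block_sizes: list[int]) -> list[int]:
    """Hypothetical check: block sizes that are NOT divisible by hash_block_size."""
    return [size for size in block_sizes if size % hash_block_size != 0]


# Assumed example: one attention group and one Mamba group with different sizes.
group_block_sizes = [16, 48]

# The branch under review returns the LCM as hash_block_size.
scheduler_block_size = reduce(lcm, group_block_sizes)
bad = violating_groups(scheduler_block_size, group_block_sizes)

# A common divisor of all block sizes passes the same check.
common_divisor = reduce(gcd, group_block_sizes)
ok = violating_groups(common_divisor, group_block_sizes)

print(scheduler_block_size, bad, common_divisor, ok)
```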
    offsets: torch.Tensor | None = None,
    inverse: bool = False,
) -> tuple[torch.Tensor, torch.Tensor | None]:
    return self.forward_native(positions, query, key, offsets)
Propagate inverse flag in HIP DeepseekV4 rotary path
forward_hip accepts an inverse argument but drops it when delegating to forward_native, so ROCm execution always applies the non-inverse rotation. Any call site that requests inverse RoPE on HIP (for example, de-rotation flows) will silently compute the wrong transform and produce incorrect outputs.
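The effect of a dropped flag is easy to see on a single 2D rotation, which is the building block of RoPE. This is a toy sketch in plain Python, not the Aphrodite code: the function names mirror the ones in the diff but the signatures are simplified, and it assumes forward_native itself accepts an inverse argument.

```python
import math


def rotate_pair(x: float, y: float, angle: float, inverse: bool = False) -> tuple[float, float]:
    """Rotate (x, y) by angle; inverse=True rotates by -angle instead."""
    if inverse:
        angle = -angle
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)


def forward_native(pair, angle, inverse=False):
    return rotate_pair(pair[0], pair[1], angle, inverse)


def forward_hip_dropping_flag(pair, angle, inverse=False):
    # Mirrors the reviewed code: `inverse` is accepted but never forwarded.
    return forward_native(pair, angle)


def forward_hip_propagating(pair, angle, inverse=False):
    # Forwards the flag so both backends apply the same transform.
    return forward_native(pair, angle, inverse=inverse)


original = (1.0, 0.0)
rotated = forward_native(original, math.pi / 2)

# Undoing the rotation only works when the flag reaches forward_native.
restored = forward_hip_propagating(rotated, math.pi / 2, inverse=True)
wrong = forward_hip_dropping_flag(rotated, math.pi / 2, inverse=True)
```

With the flag forwarded, `restored` matches `original` up to floating-point error; with the flag dropped, the vector is rotated a second time and ends up pointing the opposite way.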
Signed-off-by: AlpinDale <alpindale@gmail.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 283de047a2
    topk_indices = hash_indices_table[input_tokens]
else:
    use_sorted = envs.APHRODITE_BATCH_INVARIANT
    topk_indices = torch.topk(scores_for_choice, k=topk, dim=-1, sorted=use_sorted)[1]
topk_weights = scores.gather(1, topk_indices)
Cast hashed expert indices to int64 before gather
When hash_indices_table is used, topk_indices comes directly from the table and is then passed to scores.gather(...) without dtype normalization. In DeepSeek V4 hash routing, that table is explicitly created as torch.int32 for the non-mega path (deepseek_v4.py sets hash_indices_dtype=torch.int32 and builds tid2eid with that dtype), so the fallback path here can hit a runtime error because gather expects long/int64 indices. This affects hash-MoE routing on the non-custom-op path (e.g., ROCm/aiter-enabled fallback) and will fail before routing can complete.
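A small standalone sketch of the suggested cast, using toy tensors rather than the model's real routing tables. The table contents, scores, and token ids below are invented for illustration; only the variable names follow the diff.

```python
import torch

# Toy router scores: 2 tokens x 4 experts.
scores = torch.tensor([[0.1, 0.7, 0.2, 0.0],
                       [0.4, 0.1, 0.3, 0.2]])

# Hypothetical token-id -> expert-id table, stored as int32 as described above.
hash_indices_table = torch.tensor([[1, 2], [0, 3], [2, 3]], dtype=torch.int32)
input_tokens = torch.tensor([0, 2])

# Indexing keeps the table's dtype, so the indices are still int32 here.
topk_indices = hash_indices_table[input_tokens]

# Normalize to int64 before gather, which expects long indices.
topk_weights = scores.gather(1, topk_indices.to(torch.int64))
```

Casting at the point of use keeps the table compact in memory while making the gather call independent of how the table was built.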
Based on vllm-project/vllm#40860