Skip to content

feat: implement DeepSeek-V4 model#1651

Merged
AlpinDale merged 3 commits into
mainfrom
feat/deepseek_v4_impl
Apr 27, 2026
Merged

feat: implement DeepSeek-V4 model#1651
AlpinDale merged 3 commits into
mainfrom
feat/deepseek_v4_impl

Conversation

@AlpinDale

Copy link
Copy Markdown
Member

Signed-off-by: AlpinDale <alpindale@gmail.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 22277e7f47

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread aphrodite/model_executor/layers/rotary_embedding/deepseek_scaling_rope.py Outdated
Comment thread aphrodite/v1/attention/ops/deepseek_v4_ops/fused_inv_rope_fp8_quant.py Outdated
dim3 block(std::min<int64_t>(num_heads * rot_dim / 2, 512));
const at::cuda::OptionalCUDAGuard device_guard(device_of(query));
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
auto cache_f32 = cos_sin_cache.to(torch::kFloat32);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid per-call fp32 conversion of RoPE cache

This change unconditionally does cos_sin_cache.to(torch::kFloat32) inside rotary_embedding, which runs every kernel invocation. Most callers (e.g. RotaryEmbedding.forward_cuda) pass fp16/bf16 caches, so this introduces a full cache dtype conversion/allocation on each decode step, causing major throughput and memory regressions across non-DeepSeek models; the conversion should be done once (or only when needed) instead of in the hot path.

Useful? React with 👍 / 👎.

Signed-off-by: AlpinDale <alpindale@gmail.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8274e7401b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

if any(
isinstance(g.kv_cache_spec, MambaSpec) and g.kv_cache_spec.block_size != cache_config.block_size for g in groups
):
return scheduler_block_size, scheduler_block_size

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep hash_block_size as a divisor for hybrid Mamba

When a hybrid configuration includes a non-align MambaSpec, this branch returns hash_block_size = scheduler_block_size (the LCM of all group block sizes). For mixed block sizes, that value is larger than at least one group size, so it violates the downstream requirement that each block_size be divisible by hash_block_size (see the assertion in HybridKVCacheCoordinator). In practice this causes coordinator initialization to fail for hybrid Mamba+attention setups with prefix caching/connector enabled, instead of gracefully disabling finer-grained hashing.

Useful? React with 👍 / 👎.

offsets: torch.Tensor | None = None,
inverse: bool = False,
) -> tuple[torch.Tensor, torch.Tensor | None]:
return self.forward_native(positions, query, key, offsets)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Propagate inverse flag in HIP DeepseekV4 rotary path

forward_hip accepts an inverse argument but drops it when delegating to forward_native, so ROCm execution always applies the non-inverse rotation. Any call site that requests inverse RoPE on HIP (for example, de-rotation flows) will silently compute the wrong transform and produce incorrect outputs.

Useful? React with 👍 / 👎.

Signed-off-by: AlpinDale <alpindale@gmail.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 283de047a2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +206 to 210
topk_indices = hash_indices_table[input_tokens]
else:
use_sorted = envs.APHRODITE_BATCH_INVARIANT
topk_indices = torch.topk(scores_for_choice, k=topk, dim=-1, sorted=use_sorted)[1]
topk_weights = scores.gather(1, topk_indices)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Cast hashed expert indices to int64 before gather

When hash_indices_table is used, topk_indices comes directly from the table and is then passed to scores.gather(...) without dtype normalization. In DeepSeek V4 hash routing, that table is explicitly created as torch.int32 for the non-mega path (deepseek_v4.py sets hash_indices_dtype=torch.int32 and builds tid2eid with that dtype), so the fallback path here can hit a runtime error because gather expects long/int64 indices. This affects hash-MoE routing on the non-custom-op path (e.g., ROCm/aiter-enabled fallback) and will fail before routing can complete.

Useful? React with 👍 / 👎.

@AlpinDale AlpinDale merged commit 61aad7c into main Apr 27, 2026
1 check failed
@AlpinDale AlpinDale deleted the feat/deepseek_v4_impl branch April 27, 2026 20:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant