model: Added support for Maincoder-1B model #18534
Conversation
ngxson left a comment
The model seems like a normal llama arch to me. Probably enough to just add one single line to convert_hf_to_gguf.py
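As a hedged aside for readers unfamiliar with the converter: the "one line" remark refers to mapping the model's Hugging Face architecture string onto an existing converter class. The toy sketch below is not convert_hf_to_gguf.py's real code, and the name MaincoderForCausalLM is an assumption; it only illustrates the registration idea.

```python
# Toy illustration only -- NOT the real convert_hf_to_gguf.py API.
# It shows why a rebranded Llama-style model can be a one-line change:
# the architecture string from config.json is mapped onto an existing converter.
from typing import Callable

_converters: dict[str, Callable[[], str]] = {}

def register(*arch_names: str) -> Callable:
    """Register a converter function under one or more HF architecture names."""
    def wrap(fn: Callable[[], str]) -> Callable[[], str]:
        for name in arch_names:
            _converters[name] = fn
        return fn
    return wrap

# Hypothetical one-line change: add "MaincoderForCausalLM" to the Llama entry.
@register("LlamaForCausalLM", "MaincoderForCausalLM")
def convert_llama() -> str:
    return "converted via the Llama code path"

print(_converters["MaincoderForCausalLM"]())  # reuses the Llama converter
```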
Agreed, this is another arch rebrand, though I'd say Qwen given the tokenizer and chat template.
Thanks for the review! I understand the concern about architecture similarity with LLAMA and Qwen. Let me clarify the key difference:

```python
# From modelling_maincoder.py (lines 200-206)
# RoPE first
query_states, key_states = apply_rotary_emb(query_states, key_states, ...)
# Then QK norm
query_states = self.q_norm(query_states)
key_states = self.k_norm(key_states)
```

This differs from Qwen3, which applies QK norm before RoPE. The order matters mathematically: RoPE modifies the query/key vectors, so normalizing before vs. after produces different results. Additionally, the model has learned QK norm weights (q_norm.weight, k_norm.weight per layer), which are not present in Qwen2 and use a different application order than Qwen3. If there's an existing architecture that matches this exact pattern (RoPE → learned QK norm), I'm happy to use that instead. I checked llama.cpp, qwen2.cpp, qwen3.cpp, and others, but none match this specific order.
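To make the ordering argument concrete, here is a small numeric sketch (illustrative only, not the model's code; rms_norm and rope_rotate are stand-in names) showing that a learned per-dimension RMS-style norm does not commute with a RoPE rotation:

```python
# Illustrative sketch (not modelling_maincoder.py): a learned per-dimension
# RMS-style QK norm does not commute with a RoPE rotation, so
# norm-then-RoPE and RoPE-then-norm give different vectors.
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # RMSNorm with a learned per-dimension gain (stand-in for q_norm/k_norm)
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * weight

def rope_rotate(x: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    # Rotate sequential dimension pairs (x0, x1), (x2, x3), ... by angles theta
    pairs = x.view(*x.shape[:-1], -1, 2)
    x0, x1 = pairs[..., 0], pairs[..., 1]
    cos, sin = torch.cos(theta), torch.sin(theta)
    return torch.stack((x0 * cos - x1 * sin, x0 * sin + x1 * cos), dim=-1).flatten(-2)

torch.manual_seed(0)
head_dim = 8
x = torch.randn(head_dim)
weight = torch.randn(head_dim)         # learned norm gain
theta = torch.randn(head_dim // 2)     # one rotation angle per pair

norm_then_rope = rope_rotate(rms_norm(x, weight), theta)
rope_then_norm = rms_norm(rope_rotate(x, theta), weight)
print(torch.allclose(norm_then_rope, rope_then_norm))  # False in general
```

With random gains this prints False; the two orders only coincide when the learned gains happen to be equal within each rotated pair.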
Removed trailing spaces from maincoder.cpp
A quick search shows that Hunyuan has a similar pattern. In any case, it's always better to leave comments in the code to avoid confusion for future contributors. Just an extra question: what's the reason behind having norm after RoPE (vs. before RoPE)? If I understand correctly, most models apply norm before RoPE to avoid adding any distortions to the embedded positional information.
Thanks @ngxson. Yeah, it is similar. The catch is that Hunyuan uses ROPE_TYPE_NEOX (interleaved dimension pairs) while Maincoder uses ROPE_TYPE_NORM (sequential dimension pairs); the dimension ordering is different, so we can't directly reuse it. Let me know if I'm missing something that would allow us to reuse it, though! On norm-after-RoPE:
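For readers, a small sketch (illustrative, not llama.cpp code) of the two pairing conventions usually associated with these rope types: the "NORM" style is commonly implemented as rotating adjacent pairs (2i, 2i+1), while the "NEOX" style rotates pairs split across the two halves of the head dimension (i, i + d/2).

```python
# Illustrative sketch (not llama.cpp code) of two RoPE dimension pairings.
# "NORM"-style rotates adjacent pairs (2i, 2i+1); "NEOX"-style rotates pairs
# taken from the two halves of the head dimension (i, i + d/2).
import torch

def rope_norm_style(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # Sequential pairs: (x[..., 0], x[..., 1]), (x[..., 2], x[..., 3]), ...
    x0, x1 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x0 * cos - x1 * sin
    out[..., 1::2] = x0 * sin + x1 * cos
    return out

def rope_neox_style(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # Half-split pairs: (x[..., i], x[..., i + d/2])
    d = x.shape[-1]
    x0, x1 = x[..., : d // 2], x[..., d // 2:]
    return torch.cat((x0 * cos - x1 * sin, x0 * sin + x1 * cos), dim=-1)

x = torch.arange(8, dtype=torch.float32)
angle = torch.full((4,), 0.5)
cos, sin = torch.cos(angle), torch.sin(angle)
print(rope_norm_style(x, cos, sin))
print(rope_neox_style(x, cos, sin))  # different output: the pairing differs
```

The two conventions assign the same rotation angles to different dimension pairs, so a graph written for one cannot be reused for the other without permuting dimensions.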
Also, can we please appreciate the new formulation of RoPE that I at least haven't seen in Transformers before?

```python
def apply_rotary_emb(
    xq: torch.Tensor,
    xk: torch.Tensor,
    freqs_cis: torch.Tensor,
) -> tuple[torch.Tensor, torch.Tensor]:
    """Apply rotary embeddings to query and key tensors."""
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    # Broadcast freqs_cis
    freqs_cis = freqs_cis[:, :, None, :]
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
    return xq_out.type_as(xq), xk_out.type_as(xk)
```

It's amazing in how many ways you can express the same thing in Python!
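For anyone puzzling over the complex-number trick: it is the same rotation as the explicit cos/sin form on adjacent pairs. A quick self-contained check (illustrative, simplified to 2-D tensors rather than the 4-D shapes used above):

```python
# Illustrative check (not from the PR): the complex-number RoPE above is the
# same rotation as the explicit cos/sin form on adjacent dimension pairs.
import torch

def rope_complex(x: torch.Tensor, freqs_cis: torch.Tensor) -> torch.Tensor:
    x_ = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    return torch.view_as_real(x_ * freqs_cis).flatten(-2).type_as(x)

def rope_cos_sin(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    x0, x1 = x[..., 0::2], x[..., 1::2]
    cos, sin = torch.cos(angles), torch.sin(angles)
    out = torch.empty_like(x)
    out[..., 0::2] = x0 * cos - x1 * sin
    out[..., 1::2] = x0 * sin + x1 * cos
    return out

torch.manual_seed(0)
x = torch.randn(2, 8)                     # (tokens, head_dim)
angles = torch.randn(2, 4)                # one angle per rotated pair
freqs_cis = torch.polar(torch.ones_like(angles), angles)  # cos(a) + i*sin(a)
print(torch.allclose(rope_complex(x, freqs_cis), rope_cos_sin(x, angles), atol=1e-6))  # True
```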
Thanks @CISC and @ngxson for the review and for helping get this over the line! Really appreciate the time; expect more PRs from our team soon! @pwilkin I wish I could take credit for that RoPE implementation! I had the exact same reaction when I saw it. I actually adapted it from the Llama 4 implementation here.
Added support for Maincoder-1B
fixes #18346