
Devstral 2 support (Mistral 3 architecture, Tekken tokenizer, YaRN …#107

Merged
mikepapadim merged 1 commit into beehive-lab:main from AdamBien:main on Apr 13, 2026
Conversation

@AdamBien (Contributor)
Devstral 2 support (Mistral 3 architecture, Tekken tokenizer, YaRN RoPE)

Devstral 2 uses the mistral3 architecture with independent head dimensions
(head_dim=128 != dim/num_heads=160), requiring non-square Q/K/V projections,
a dedicated GPU kernel for precomputed YaRN RoPE frequencies, and a GPT-2
BPE tokenizer (Tekken) instead of Mistral's SentencePiece.
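For illustration, the projection shapes follow directly from those numbers. The sketch below uses hypothetical field names, but the values are the ones stated above (head_dim=128, dim/num_heads=160, qDim()=4096, kvDim()=1024):

```java
// Sketch of the non-square projection shapes (hypothetical names; the
// numeric values follow from the PR description).
public class DevstralDims {
    static final int DIM = 5120;        // hidden size: dim/num_heads = 160 with 32 heads
    static final int NUM_HEADS = 32;
    static final int NUM_KV_HEADS = 8;
    static final int HEAD_DIM = 128;    // independent of dim/num_heads (160)

    // Query projection output size: not equal to DIM, hence the non-square W_q.
    static int qDim() { return NUM_HEADS * HEAD_DIM; }      // 4096

    // Key/value projection output size (grouped-query attention).
    static int kvDim() { return NUM_KV_HEADS * HEAD_DIM; }  // 1024

    public static void main(String[] args) {
        // W_q is DIM x qDim() = 5120 x 4096, not 5120 x 5120.
        System.out.println("qDim=" + qDim() + " kvDim=" + kvDim());
    }
}
```

This is why the standard square fused-QKV kernels cannot be reused as-is.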

Key changes

  • DevstralConfiguration with explicit headDim, qDim() (4096), kvDim() (1024)
  • DevstralTokenizer: GPT-2 BPE with BYTE_ENCODER and Tekken pre-tokenization regex
  • YaRN RoPE: precomputeFreqsCisYaRN with mscale and ramp interpolation
  • GPU kernels: fusedQKVMatmulQ8NonSquare, ropeRotationWithCacheCopyPrecomputed
  • Devstral-specific FFN layers, layer planners, state, chat format
  • GGUF metadata: mistral3.* prefix, attention.key_length for head_dim
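For context, here is a rough CPU sketch of YaRN frequency precomputation with mscale and ramp interpolation, following the YaRN paper's formulation. Parameter names and the exact layout are illustrative; the actual precomputeFreqsCisYaRN in this PR may differ in details:

```java
// Illustrative YaRN RoPE precompute: blends interpolated and extrapolated
// frequencies per dim pair via a linear ramp, and scales cos/sin by mscale.
public class YarnRope {
    // Linear ramp clamped to [0, 1] between low and high.
    static double ramp(double low, double high, double i) {
        if (low == high) high += 0.001; // avoid division by zero
        double y = (i - low) / (high - low);
        return Math.min(1.0, Math.max(0.0, y));
    }

    // Attention scaling from the YaRN paper: 0.1 * ln(s) + 1 for s > 1.
    static double mscale(double scale) {
        return scale <= 1.0 ? 1.0 : 0.1 * Math.log(scale) + 1.0;
    }

    // Precompute interleaved [cos, sin] per position and dim pair.
    static float[][] precomputeFreqsCis(int maxPos, int headDim, double base,
                                        double scale, double origCtx,
                                        double betaFast, double betaSlow) {
        int half = headDim / 2;
        // Dim indices whose wavelength crosses the original context window.
        double lowDim = half * Math.log(origCtx / (betaFast * 2 * Math.PI)) / Math.log(base);
        double highDim = half * Math.log(origCtx / (betaSlow * 2 * Math.PI)) / Math.log(base);
        double m = mscale(scale);
        float[][] cis = new float[maxPos][2 * half];
        for (int i = 0; i < half; i++) {
            double freqExtra = Math.pow(base, -2.0 * i / headDim); // extrapolated
            double freqInter = freqExtra / scale;                  // interpolated
            double g = 1.0 - ramp(lowDim, highDim, i);             // 1 = keep extrapolated
            double freq = freqInter * (1 - g) + freqExtra * g;
            for (int pos = 0; pos < maxPos; pos++) {
                cis[pos][2 * i]     = (float) (Math.cos(pos * freq) * m);
                cis[pos][2 * i + 1] = (float) (Math.sin(pos * freq) * m);
            }
        }
        return cis;
    }
}
```

Precomputing the full table once at load time is what lets the GPU kernel consume cos/sin values directly instead of evaluating transcendentals per token.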

Tested with

  • Devstral-Small-2-24B-Instruct-2512-Q8_0.gguf (GPU, Metal)


@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@mikepapadim (Member)

Hello @AdamBien, thank you very much for contributing this.

Can you let me know with which backend and hardware you have tested it?

thanks

Copilot AI left a comment


Pull request overview

Adds end-to-end support for Devstral 2 (Mistral 3 architecture) across model loading, configuration/state shapes, tokenizer/chat formatting, and TornadoVM GPU execution paths, including non-square Q/K/V projections and YaRN RoPE with precomputed frequency tables.

Changes:

  • Introduces Devstral model type + loader + configuration/state to handle independent head_dim and derived qDim/kvDim.
  • Adds Tekken (GPT-2 BPE) tokenizer and Devstral chat format wiring.
  • Extends TornadoVM kernels/layer planners to support non-square fused QKV matmuls and precomputed RoPE rotation + KV cache writes.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 3 comments.

Summary per file:

  • src/main/java/org/beehive/gpullama3/tornadovm/layers/type/q8_0/DevstralQ8_0FFNLayers.java — Devstral Q8_0 FFN/attention task graphs using non-square QKV + precomputed RoPE.
  • src/main/java/org/beehive/gpullama3/tornadovm/layers/type/fp16/DevstralFP16FFNLayers.java — Devstral FP16 FFN/attention task graphs using non-square QKV + precomputed RoPE.
  • src/main/java/org/beehive/gpullama3/tornadovm/layerplanner/QuantizationPlannerFactory.java — Routes DEVSTRAL_2 to Devstral-specific planners for FP16/Q8_0.
  • src/main/java/org/beehive/gpullama3/tornadovm/layerplanner/model/q8_0/DevstralQ8_0LayerPlanner.java — Planner wiring for the Devstral Q8_0 execution plan.
  • src/main/java/org/beehive/gpullama3/tornadovm/layerplanner/model/fp16/DevstralFP16LayerPlanner.java — Planner wiring for the Devstral FP16 execution plan.
  • src/main/java/org/beehive/gpullama3/tornadovm/kernels/TransformerComputeKernelsLayered.java — Adds the precomputed RoPE kernel + non-square fused QKV kernels (FP16/Q8_0).
  • src/main/java/org/beehive/gpullama3/tokenizer/Vocabulary.java — Adds a Devstral vocabulary loader helper.
  • src/main/java/org/beehive/gpullama3/tokenizer/DevstralTokenizer.java — Implements the Tekken GPT-2-style byte-level BPE tokenizer.
  • src/main/java/org/beehive/gpullama3/model/ModelType.java — Adds the DEVSTRAL_2 enum + loader dispatch.
  • src/main/java/org/beehive/gpullama3/model/loader/ModelLoader.java — Detects Devstral models by name for loader selection.
  • src/main/java/org/beehive/gpullama3/model/loader/DevstralModelLoader.java — Loads Devstral GGUF metadata (mistral3.*), precomputes YaRN RoPE frequencies, builds weights.
  • src/main/java/org/beehive/gpullama3/model/format/DevstralChatFormat.java — Devstral chat template support using DevstralTokenizer special tokens.
  • src/main/java/org/beehive/gpullama3/model/format/ChatFormat.java — Registers the DevstralTokenizer → DevstralChatFormat factory mapping.
  • src/main/java/org/beehive/gpullama3/model/devstral/package-info.java — Documents Devstral 2 architectural differences and integration points.
  • src/main/java/org/beehive/gpullama3/model/devstral/DevstralConfiguration.java — Defines the Devstral config with explicit headDim, qDim(), kvDim().
  • src/main/java/org/beehive/gpullama3/model/devstral/Devstral.java — New model implementation delegating the forward pass to the Devstral-specific core.
  • src/main/java/org/beehive/gpullama3/inference/state/DevstralState.java — State allocation for q/k/v shapes when qDim != dim.
  • src/main/java/org/beehive/gpullama3/inference/operation/RoPE.java — Adds the YaRN RoPE frequency precompute implementation.
  • src/main/java/org/beehive/gpullama3/inference/InferenceCore.java — Adds a CPU forward path specialized for Devstral's qDim/kvDim layout.
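To illustrate what the precomputed RoPE rotation + KV cache write does, here is a scalar CPU sketch. It is not the TornadoVM kernel (which runs in parallel and handles quantized layouts); all names and the interleaved [cos, sin] table layout are assumptions:

```java
// Scalar sketch: apply RoPE from a precomputed cos/sin table and copy the
// rotated K (and raw V) into the KV cache in the same pass.
public class RopeWithCacheCopy {
    // q has length numHeads*headDim; k and v have length numKvHeads*headDim.
    // cis holds interleaved [cos, sin] per dim pair for the current position.
    static void apply(float[] q, float[] k, float[] v, float[] cis,
                      int headDim, float[] kCache, float[] vCache, int cacheOffset) {
        rotate(q, cis, headDim);
        rotate(k, cis, headDim);
        // Write this position's K/V slot; fusing the copy avoids a second pass.
        System.arraycopy(k, 0, kCache, cacheOffset, k.length);
        System.arraycopy(v, 0, vCache, cacheOffset, v.length);
    }

    static void rotate(float[] x, float[] cis, int headDim) {
        for (int i = 0; i < x.length; i += 2) {
            int pair = (i % headDim) / 2;          // dim pair within the head
            float c = cis[2 * pair], s = cis[2 * pair + 1];
            float x0 = x[i], x1 = x[i + 1];
            x[i]     = x0 * c - x1 * s;            // standard 2D rotation
            x[i + 1] = x0 * s + x1 * c;
        }
    }
}
```

Because q and k have different lengths here (qDim vs. kvDim), the rotation indexes by position within each head rather than by absolute offset.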


Comment on lines +163 to +166
public List<Integer> encode(String text, Set<String> allowedSpecial) {
    if (allowedSpecial.isEmpty()) {
        return encodeOrdinary(text);
    }

Copilot AI Apr 12, 2026


DevstralTokenizer.encode(String, Set) bypasses the byte→unicode pre-encoding used by encode(String)/encodeAsList, and calls encodeOrdinary on the raw input when allowedSpecial is empty. For a GPT-2/Tekken byte-level BPE this can mis-tokenize or throw when the raw text contains code points not present in the byte-encoded vocab. Consider applying the same byte-encoding step for all ordinary spans in encode(text, allowedSpecial) (including the empty-set fast path), so the behavior matches encode(String).
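A minimal sketch of the suggested fix: byte-encode every ordinary span before BPE, including the empty-set fast path. The byteEncode helper is hypothetical; the table is the standard GPT-2 byte→unicode map that a BYTE_ENCODER typically holds:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Sketch of GPT-2 byte->unicode pre-encoding applied to ordinary spans
// (illustrative names; the real DevstralTokenizer's internals may differ).
public class ByteLevelPreEncode {
    static final Map<Integer, Character> BYTE_ENCODER = buildByteEncoder();

    // Standard GPT-2 table: printable bytes map to themselves; the rest are
    // shifted into U+0100.. so every byte becomes a valid, distinct char.
    static Map<Integer, Character> buildByteEncoder() {
        Map<Integer, Character> enc = new HashMap<>();
        int n = 0;
        for (int b = 0; b < 256; b++) {
            boolean printable = (b >= '!' && b <= '~')
                    || (b >= 0xA1 && b <= 0xAC) || (b >= 0xAE && b <= 0xFF);
            if (printable) enc.put(b, (char) b);
            else enc.put(b, (char) (256 + n++));
        }
        return enc;
    }

    // Every ordinary span must pass through this before BPE merging,
    // including the allowedSpecial-is-empty fast path the review flagged.
    static String byteEncode(String text) {
        byte[] raw = text.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder(raw.length);
        for (byte b : raw) sb.append(BYTE_ENCODER.get(b & 0xFF));
        return sb.toString();
    }
}
```

With this step in place, non-ASCII input is first expanded into byte-level characters that are guaranteed to exist in the vocab, so encodeOrdinary never sees an unmapped code point.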

Comment on lines +168 to +179
assert specialTokens.keySet().containsAll(allowedSpecial);
String specialPattern = allowedSpecial.stream().map(Pattern::quote).collect(Collectors.joining("|", "(", ")"));
String[] specialChunks = text.split(specialPattern);

List<Integer> ids = new ArrayList<>();
for (String part : specialChunks) {
    if (allowedSpecial.contains(part)) {
        ids.add(specialTokens.get(part));
    } else {
        ids.addAll(encodeOrdinary(part));
    }
}

Copilot AI Apr 12, 2026


Special-token handling in encode(String, Set) appears ineffective: text.split(specialPattern) drops the matched delimiters, so allowed special tokens will never be emitted via the allowedSpecial.contains(part) branch. This can cause special tokens to be encoded as ordinary text (or fail) instead of as single token IDs. Consider switching to a matcher-based split that preserves matches (iterate over occurrences, encode the preceding span ordinarily, then append the special token ID).
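A sketch of the matcher-based approach the comment suggests. The signature is illustrative, and encodeOrdinary is passed in as a stand-in for the tokenizer's own ordinary-text encoder:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Matcher-based alternative to text.split(...): iterate over special-token
// matches, encode the span before each match ordinarily, then emit the
// special token's ID directly instead of dropping the delimiter.
public class SpecialTokenSplit {
    static List<Integer> encode(String text, Set<String> allowedSpecial,
                                Map<String, Integer> specialTokens,
                                Function<String, List<Integer>> encodeOrdinary) {
        List<Integer> ids = new ArrayList<>();
        if (allowedSpecial.isEmpty()) {
            ids.addAll(encodeOrdinary.apply(text));
            return ids;
        }
        String pattern = allowedSpecial.stream()
                .map(Pattern::quote).collect(Collectors.joining("|"));
        Matcher m = Pattern.compile(pattern).matcher(text);
        int last = 0;
        while (m.find()) {
            if (m.start() > last) {
                ids.addAll(encodeOrdinary.apply(text.substring(last, m.start())));
            }
            ids.add(specialTokens.get(m.group())); // special token kept as one ID
            last = m.end();
        }
        if (last < text.length()) {
            ids.addAll(encodeOrdinary.apply(text.substring(last)));
        }
        return ids;
    }
}
```

Unlike String.split, the matcher loop preserves the matched special tokens, so "[INST]"-style markers come out as single IDs rather than being BPE-encoded as plain text.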

Comment on lines +73 to +80
public List<Integer> encodeFillInTheMiddle(String prefix, String suffix) {
    List<Integer> tokens = new ArrayList<>();
    final Set<String> EMPTY_STRING_SET = Collections.emptySet();
    tokens.add(this.suffix);
    tokens.addAll(tokenizer.encode(suffix, EMPTY_STRING_SET));
    tokens.add(this.prefix);
    tokens.addAll(tokenizer.encode(prefix, EMPTY_STRING_SET));
    return tokens;

Copilot AI Apr 12, 2026


encodeFillInTheMiddle uses tokenizer.encode(..., emptySet). With DevstralTokenizer this currently routes through encode(String, Set) and can skip the byte-level pre-encoding step, producing incorrect BPE IDs for non-ASCII text. Prefer using encodeAsList for ordinary text (or ensure DevstralTokenizer.encode(text, Set) applies byte-encoding for ordinary spans).

@mikepapadim (Member) left a comment


Just tested it and worked out of the box!

@AdamBien just the CLA is pending and I am happy to merge.

Thank you again for your contribution.

Tested locally with:

 ./llamaTornado --gpu --verbose-init --metal --model Devstral-Small-2-24B-Instruct-2512-Q8_0.gguf --prompt "vectoradd java" --gpu-memory 30GB

@mikepapadim mikepapadim merged commit 2e8ed31 into beehive-lab:main Apr 13, 2026
9 of 10 checks passed