Devstral 2 support (Mistral 3 architecture, Tekken tokenizer, YaRN RoPE) #107

mikepapadim merged 1 commit into beehive-lab:main
Conversation
hello @AdamBien, thank you very much for contributing this. Can you let me know which backend and hardware you have tested it with? Thanks
Pull request overview
Adds end-to-end support for Devstral 2 (Mistral 3 architecture) across model loading, configuration/state shapes, tokenizer/chat formatting, and TornadoVM GPU execution paths, including non-square Q/K/V projections and YaRN RoPE with precomputed frequency tables.
Changes:
- Introduces Devstral model type + loader + configuration/state to handle independent head_dim and derived qDim/kvDim.
- Adds Tekken (GPT-2 BPE) tokenizer and Devstral chat format wiring.
- Extends TornadoVM kernels/layer planners to support non-square fused QKV matmuls and precomputed RoPE rotation + KV cache writes.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/main/java/org/beehive/gpullama3/tornadovm/layers/type/q8_0/DevstralQ8_0FFNLayers.java | Devstral Q8_0 FFN/attention task graphs using non-square QKV + precomputed RoPE. |
| src/main/java/org/beehive/gpullama3/tornadovm/layers/type/fp16/DevstralFP16FFNLayers.java | Devstral FP16 FFN/attention task graphs using non-square QKV + precomputed RoPE. |
| src/main/java/org/beehive/gpullama3/tornadovm/layerplanner/QuantizationPlannerFactory.java | Routes DEVSTRAL_2 to Devstral-specific planners for FP16/Q8_0. |
| src/main/java/org/beehive/gpullama3/tornadovm/layerplanner/model/q8_0/DevstralQ8_0LayerPlanner.java | Planner wiring for Devstral Q8_0 execution plan. |
| src/main/java/org/beehive/gpullama3/tornadovm/layerplanner/model/fp16/DevstralFP16LayerPlanner.java | Planner wiring for Devstral FP16 execution plan. |
| src/main/java/org/beehive/gpullama3/tornadovm/kernels/TransformerComputeKernelsLayered.java | Adds precomputed RoPE kernel + non-square fused QKV kernels (FP16/Q8_0). |
| src/main/java/org/beehive/gpullama3/tokenizer/Vocabulary.java | Adds Devstral vocabulary loader helper. |
| src/main/java/org/beehive/gpullama3/tokenizer/DevstralTokenizer.java | Implements Tekken GPT-2-style byte-level BPE tokenizer. |
| src/main/java/org/beehive/gpullama3/model/ModelType.java | Adds DEVSTRAL_2 enum + loader dispatch. |
| src/main/java/org/beehive/gpullama3/model/loader/ModelLoader.java | Detects Devstral models by name for loader selection. |
| src/main/java/org/beehive/gpullama3/model/loader/DevstralModelLoader.java | Loads Devstral GGUF metadata (mistral3.*), precomputes YaRN RoPE freqs, builds weights. |
| src/main/java/org/beehive/gpullama3/model/format/DevstralChatFormat.java | Devstral chat template support using DevstralTokenizer special tokens. |
| src/main/java/org/beehive/gpullama3/model/format/ChatFormat.java | Registers DevstralTokenizer → DevstralChatFormat factory mapping. |
| src/main/java/org/beehive/gpullama3/model/devstral/package-info.java | Documents Devstral 2 architectural differences and integration points. |
| src/main/java/org/beehive/gpullama3/model/devstral/DevstralConfiguration.java | Defines Devstral config with explicit headDim, qDim(), kvDim(). |
| src/main/java/org/beehive/gpullama3/model/devstral/Devstral.java | New model implementation delegating forward pass to Devstral-specific core. |
| src/main/java/org/beehive/gpullama3/inference/state/DevstralState.java | State allocation for q/k/v shapes when qDim != dim. |
| src/main/java/org/beehive/gpullama3/inference/operation/RoPE.java | Adds YaRN RoPE frequency precompute implementation. |
| src/main/java/org/beehive/gpullama3/inference/InferenceCore.java | Adds CPU forward path specialized for Devstral’s qDim/kvDim layout. |
```java
public List<Integer> encode(String text, Set<String> allowedSpecial) {
    if (allowedSpecial.isEmpty()) {
        return encodeOrdinary(text);
    }
```
DevstralTokenizer.encode(String, Set) bypasses the byte→unicode pre-encoding used by encode(String)/encodeAsList, and calls encodeOrdinary on the raw input when allowedSpecial is empty. For a GPT-2/Tekken byte-level BPE this can mis-tokenize or throw when the raw text contains code points not present in the byte-encoded vocab. Consider applying the same byte-encoding step for all ordinary spans in encode(text, allowedSpecial) (including the empty-set fast path), so the behavior matches encode(String).
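The byte→unicode pre-encoding this comment refers to is the standard GPT-2 byte-level mapping: every raw byte is replaced by a printable character before BPE merges are looked up. A minimal standalone sketch (class and method names here are illustrative, not the PR's actual API):

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class ByteEncode {
    // Builds the GPT-2 byte->unicode table: printable bytes map to themselves,
    // the rest are shifted into the 256+ range so every byte has a visible char.
    static Map<Integer, Character> byteToUnicode() {
        Map<Integer, Character> map = new HashMap<>();
        int n = 0;
        for (int b = 0; b < 256; b++) {
            boolean printable = (b >= '!' && b <= '~') || (b >= 0xA1 && b <= 0xAC) || (b >= 0xAE && b <= 0xFF);
            if (printable) {
                map.put(b, (char) b);
            } else {
                map.put(b, (char) (256 + n));
                n++;
            }
        }
        return map;
    }

    // Pre-encodes raw text into the byte-level alphabet the BPE vocab is
    // built over; this must run before any BPE merge lookup.
    static String preEncode(String text) {
        Map<Integer, Character> map = byteToUnicode();
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            sb.append(map.get(b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Non-ASCII input is what breaks when this step is skipped:
        // "é" becomes two mapped bytes rather than a raw code point.
        System.out.println(preEncode("héllo"));
    }
}
```

Applying `preEncode` in the empty-set fast path (before `encodeOrdinary`) would make `encode(text, allowedSpecial)` behave like `encode(String)` on non-ASCII input.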
```java
assert specialTokens.keySet().containsAll(allowedSpecial);
String specialPattern = allowedSpecial.stream().map(Pattern::quote).collect(Collectors.joining("|", "(", ")"));
String[] specialChunks = text.split(specialPattern);

List<Integer> ids = new ArrayList<>();
for (String part : specialChunks) {
    if (allowedSpecial.contains(part)) {
        ids.add(specialTokens.get(part));
    } else {
        ids.addAll(encodeOrdinary(part));
    }
}
```
Special-token handling in encode(String, Set) appears ineffective: text.split(specialPattern) drops the matched delimiters, so allowed special tokens will never be emitted via the allowedSpecial.contains(part) branch. This can cause special tokens to be encoded as ordinary text (or fail) instead of as single token IDs. Consider switching to a matcher-based split that preserves matches (iterate over occurrences, encode the preceding span ordinarily, then append the special token ID).
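A matcher-based split that keeps the matched special tokens could look like the following (an illustrative sketch; the class and method names are assumptions, not the PR's code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpecialSplit {
    // Splits text on a special-token pattern but KEEPS the matched tokens,
    // unlike String.split which drops the delimiters.
    static List<String> splitKeepingSpecial(String text, Pattern specialPattern) {
        List<String> parts = new ArrayList<>();
        Matcher m = specialPattern.matcher(text);
        int last = 0;
        while (m.find()) {
            if (m.start() > last) {
                parts.add(text.substring(last, m.start())); // ordinary span
            }
            parts.add(m.group()); // the special token itself
            last = m.end();
        }
        if (last < text.length()) {
            parts.add(text.substring(last)); // trailing ordinary span
        }
        return parts;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile(Pattern.quote("<s>") + "|" + Pattern.quote("</s>"));
        System.out.println(splitKeepingSpecial("<s>hello</s>world", p));
    }
}
```

With this split, the existing loop body works unchanged: spans that equal an allowed special token hit the `specialTokens.get(part)` branch, and everything else goes through ordinary encoding.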
```java
public List<Integer> encodeFillInTheMiddle(String prefix, String suffix) {
    List<Integer> tokens = new ArrayList<>();
    final Set<String> EMPTY_STRING_SET = Collections.emptySet();
    tokens.add(this.suffix);
    tokens.addAll(tokenizer.encode(suffix, EMPTY_STRING_SET));
    tokens.add(this.prefix);
    tokens.addAll(tokenizer.encode(prefix, EMPTY_STRING_SET));
    return tokens;
```
encodeFillInTheMiddle uses tokenizer.encode(..., emptySet). With DevstralTokenizer this currently routes through encode(String, Set) and can skip the byte-level pre-encoding step, producing incorrect BPE IDs for non-ASCII text. Prefer using encodeAsList for ordinary text (or ensure DevstralTokenizer.encode(text, Set) applies byte-encoding for ordinary spans).
mikepapadim left a comment
Just tested it and it worked out of the box!
@AdamBien just the CLA is pending and I am happy to merge.
Thank you again for your contribution.

Tested locally with:

```shell
./llamaTornado --gpu --verbose-init --metal --model Devstral-Small-2-24B-Instruct-2512-Q8_0.gguf --prompt "vectoradd java" --gpu-memory 30GB
```
Devstral 2 support (Mistral 3 architecture, Tekken tokenizer, YaRN RoPE)
Devstral 2 uses the mistral3 architecture with independent head dimensions
(head_dim=128 != dim/num_heads=160), requiring non-square Q/K/V projections,
a dedicated GPU kernel for precomputed YaRN RoPE frequencies, and a GPT-2
BPE tokenizer (Tekken) instead of Mistral's SentencePiece.
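The precomputed YaRN frequencies mentioned above can be sketched as follows. This is a hedged, simplified reconstruction of the published YaRN ramp formula, not the PR's RoPE.java: per-dimension inverse frequencies blend plain RoPE extrapolation with position interpolation, so fast-rotating dimensions are left untouched while slow-rotating ones are scaled down. The parameter values in `main` (base, scale factor, context length, betas) are illustrative assumptions.

```java
public class YarnRope {
    // Sketch of YaRN-style inverse-frequency precomputation for one head.
    // headDim: per-head dimension; base: RoPE theta; scale: context extension
    // factor; origContext: pre-extension context length; betaFast/betaSlow:
    // rotation-count thresholds bounding the interpolation ramp.
    static float[] precomputeInvFreqs(int headDim, double base, double scale,
                                      int origContext, double betaFast, double betaSlow) {
        int half = headDim / 2;
        float[] invFreq = new float[half];
        // Dimension indices where the ramp starts/ends, derived from how many
        // full rotations each dimension completes over the original context.
        double low = half * Math.log(origContext / (betaFast * 2 * Math.PI)) / Math.log(base);
        double high = half * Math.log(origContext / (betaSlow * 2 * Math.PI)) / Math.log(base);
        for (int i = 0; i < half; i++) {
            double extrap = Math.pow(base, -2.0 * i / headDim); // plain RoPE frequency
            double interp = extrap / scale;                     // position-interpolated
            // t = 0 below the ramp (keep extrapolation), 1 above it (interpolate).
            double t = Math.min(1.0, Math.max(0.0, (i - low) / Math.max(high - low, 1e-3)));
            invFreq[i] = (float) ((1.0 - t) * extrap + t * interp);
        }
        return invFreq;
    }

    public static void main(String[] args) {
        // Illustrative parameters only (head_dim=128 matches the commit message).
        float[] f = precomputeInvFreqs(128, 1_000_000.0, 4.0, 32768, 32.0, 1.0);
        System.out.println(f[0] + " .. " + f[f.length - 1]);
    }
}
```

Precomputing this table once at load time (as the loader change describes) lets the GPU kernel apply the rotation with plain lookups instead of recomputing powers per token.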
Key changes
Tested with
Model