Devstral 2 support (Mistral 3 architecture, Tekken tokenizer, YaRN RoPE) #107

mikepapadim merged 1 commit into beehive-lab:main
Conversation
hello @AdamBien, thank you very much for contributing this. Can you let me know which backend and hardware you have tested it with? Thanks
Pull request overview
Adds end-to-end support for Devstral 2 (Mistral 3 architecture) across model loading, configuration/state shapes, tokenizer/chat formatting, and TornadoVM GPU execution paths, including non-square Q/K/V projections and YaRN RoPE with precomputed frequency tables.
Changes:
- Introduces Devstral model type + loader + configuration/state to handle independent head_dim and derived qDim/kvDim.
- Adds Tekken (GPT-2 BPE) tokenizer and Devstral chat format wiring.
- Extends TornadoVM kernels/layer planners to support non-square fused QKV matmuls and precomputed RoPE rotation + KV cache writes.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/main/java/org/beehive/gpullama3/tornadovm/layers/type/q8_0/DevstralQ8_0FFNLayers.java | Devstral Q8_0 FFN/attention task graphs using non-square QKV + precomputed RoPE. |
| src/main/java/org/beehive/gpullama3/tornadovm/layers/type/fp16/DevstralFP16FFNLayers.java | Devstral FP16 FFN/attention task graphs using non-square QKV + precomputed RoPE. |
| src/main/java/org/beehive/gpullama3/tornadovm/layerplanner/QuantizationPlannerFactory.java | Routes DEVSTRAL_2 to Devstral-specific planners for FP16/Q8_0. |
| src/main/java/org/beehive/gpullama3/tornadovm/layerplanner/model/q8_0/DevstralQ8_0LayerPlanner.java | Planner wiring for Devstral Q8_0 execution plan. |
| src/main/java/org/beehive/gpullama3/tornadovm/layerplanner/model/fp16/DevstralFP16LayerPlanner.java | Planner wiring for Devstral FP16 execution plan. |
| src/main/java/org/beehive/gpullama3/tornadovm/kernels/TransformerComputeKernelsLayered.java | Adds precomputed RoPE kernel + non-square fused QKV kernels (FP16/Q8_0). |
| src/main/java/org/beehive/gpullama3/tokenizer/Vocabulary.java | Adds Devstral vocabulary loader helper. |
| src/main/java/org/beehive/gpullama3/tokenizer/DevstralTokenizer.java | Implements Tekken GPT-2-style byte-level BPE tokenizer. |
| src/main/java/org/beehive/gpullama3/model/ModelType.java | Adds DEVSTRAL_2 enum + loader dispatch. |
| src/main/java/org/beehive/gpullama3/model/loader/ModelLoader.java | Detects Devstral models by name for loader selection. |
| src/main/java/org/beehive/gpullama3/model/loader/DevstralModelLoader.java | Loads Devstral GGUF metadata (mistral3.*), precomputes YaRN RoPE freqs, builds weights. |
| src/main/java/org/beehive/gpullama3/model/format/DevstralChatFormat.java | Devstral chat template support using DevstralTokenizer special tokens. |
| src/main/java/org/beehive/gpullama3/model/format/ChatFormat.java | Registers DevstralTokenizer → DevstralChatFormat factory mapping. |
| src/main/java/org/beehive/gpullama3/model/devstral/package-info.java | Documents Devstral 2 architectural differences and integration points. |
| src/main/java/org/beehive/gpullama3/model/devstral/DevstralConfiguration.java | Defines Devstral config with explicit headDim, qDim(), kvDim(). |
| src/main/java/org/beehive/gpullama3/model/devstral/Devstral.java | New model implementation delegating forward pass to Devstral-specific core. |
| src/main/java/org/beehive/gpullama3/inference/state/DevstralState.java | State allocation for q/k/v shapes when qDim != dim. |
| src/main/java/org/beehive/gpullama3/inference/operation/RoPE.java | Adds YaRN RoPE frequency precompute implementation. |
| src/main/java/org/beehive/gpullama3/inference/InferenceCore.java | Adds CPU forward path specialized for Devstral’s qDim/kvDim layout. |
```java
public List<Integer> encode(String text, Set<String> allowedSpecial) {
    if (allowedSpecial.isEmpty()) {
        return encodeOrdinary(text);
    }
```
DevstralTokenizer.encode(String, Set) bypasses the byte→unicode pre-encoding used by encode(String)/encodeAsList, and calls encodeOrdinary on the raw input when allowedSpecial is empty. For a GPT-2/Tekken byte-level BPE this can mis-tokenize or throw when the raw text contains code points not present in the byte-encoded vocab. Consider applying the same byte-encoding step for all ordinary spans in encode(text, allowedSpecial) (including the empty-set fast path), so the behavior matches encode(String).
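The byte→unicode pre-encoding this comment refers to is the standard GPT-2 byte-level mapping: every raw byte is replaced by a printable character before BPE merges are looked up. A minimal standalone sketch (class and method names here are illustrative, not the PR's actual API):

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class ByteEncode {
    // Builds the GPT-2 byte->unicode table: printable bytes map to themselves,
    // the rest are shifted into the 256+ range so every byte has a visible char.
    static Map<Integer, Character> byteToUnicode() {
        Map<Integer, Character> map = new HashMap<>();
        int n = 0;
        for (int b = 0; b < 256; b++) {
            boolean printable = (b >= '!' && b <= '~') || (b >= 0xA1 && b <= 0xAC) || (b >= 0xAE && b <= 0xFF);
            if (printable) {
                map.put(b, (char) b);
            } else {
                map.put(b, (char) (256 + n));
                n++;
            }
        }
        return map;
    }

    // Pre-encodes raw text into the byte-level alphabet the BPE vocab is
    // built over; this must run before any BPE merge lookup.
    static String preEncode(String text) {
        Map<Integer, Character> map = byteToUnicode();
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            sb.append(map.get(b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Non-ASCII input is what breaks when this step is skipped:
        // "é" becomes two mapped bytes rather than a raw code point.
        System.out.println(preEncode("héllo"));
    }
}
```

Applying `preEncode` in the empty-set fast path (before `encodeOrdinary`) would make `encode(text, allowedSpecial)` behave like `encode(String)` on non-ASCII input.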
```java
assert specialTokens.keySet().containsAll(allowedSpecial);
String specialPattern = allowedSpecial.stream().map(Pattern::quote).collect(Collectors.joining("|", "(", ")"));
String[] specialChunks = text.split(specialPattern);

List<Integer> ids = new ArrayList<>();
for (String part : specialChunks) {
    if (allowedSpecial.contains(part)) {
        ids.add(specialTokens.get(part));
    } else {
        ids.addAll(encodeOrdinary(part));
    }
}
```
Special-token handling in encode(String, Set) appears ineffective: text.split(specialPattern) drops the matched delimiters, so allowed special tokens will never be emitted via the allowedSpecial.contains(part) branch. This can cause special tokens to be encoded as ordinary text (or fail) instead of as single token IDs. Consider switching to a matcher-based split that preserves matches (iterate over occurrences, encode the preceding span ordinarily, then append the special token ID).
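A matcher-based split that keeps the matched special tokens could look like the following (an illustrative sketch; the class and method names are assumptions, not the PR's code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpecialSplit {
    // Splits text on a special-token pattern but KEEPS the matched tokens,
    // unlike String.split which drops the delimiters.
    static List<String> splitKeepingSpecial(String text, Pattern specialPattern) {
        List<String> parts = new ArrayList<>();
        Matcher m = specialPattern.matcher(text);
        int last = 0;
        while (m.find()) {
            if (m.start() > last) {
                parts.add(text.substring(last, m.start())); // ordinary span
            }
            parts.add(m.group()); // the special token itself
            last = m.end();
        }
        if (last < text.length()) {
            parts.add(text.substring(last)); // trailing ordinary span
        }
        return parts;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile(Pattern.quote("<s>") + "|" + Pattern.quote("</s>"));
        System.out.println(splitKeepingSpecial("<s>hello</s>world", p));
    }
}
```

With this split, the existing loop body works unchanged: spans that equal an allowed special token hit the `specialTokens.get(part)` branch, and everything else goes through ordinary encoding.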
```java
public List<Integer> encodeFillInTheMiddle(String prefix, String suffix) {
    List<Integer> tokens = new ArrayList<>();
    final Set<String> EMPTY_STRING_SET = Collections.emptySet();
    tokens.add(this.suffix);
    tokens.addAll(tokenizer.encode(suffix, EMPTY_STRING_SET));
    tokens.add(this.prefix);
    tokens.addAll(tokenizer.encode(prefix, EMPTY_STRING_SET));
    return tokens;
```
encodeFillInTheMiddle uses tokenizer.encode(..., emptySet). With DevstralTokenizer this currently routes through encode(String, Set) and can skip the byte-level pre-encoding step, producing incorrect BPE IDs for non-ASCII text. Prefer using encodeAsList for ordinary text (or ensure DevstralTokenizer.encode(text, Set) applies byte-encoding for ordinary spans).
mikepapadim left a comment
Just tested it and it worked out of the box!
@AdamBien just the CLA is pending and I am happy to merge.
Thank you again for your contribution.

Tested locally with:

```shell
./llamaTornado --gpu --verbose-init --metal --model Devstral-Small-2-24B-Instruct-2512-Q8_0.gguf --prompt "vectoradd java" --gpu-memory 30GB
```
Devstral 2 support (Mistral 3 architecture, Tekken tokenizer, YaRN RoPE)
Devstral 2 uses the mistral3 architecture with independent head dimensions
(head_dim=128 != dim/num_heads=160), requiring non-square Q/K/V projections,
a dedicated GPU kernel for precomputed YaRN RoPE frequencies, and a GPT-2
BPE tokenizer (Tekken) instead of Mistral's SentencePiece.
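The precomputed YaRN frequencies mentioned above can be sketched as follows. This is a hedged, simplified reconstruction of the published YaRN ramp formula, not the PR's RoPE.java: per-dimension inverse frequencies blend plain RoPE extrapolation with position interpolation, so fast-rotating dimensions are left untouched while slow-rotating ones are scaled down. The parameter values in `main` (base, scale factor, context length, betas) are illustrative assumptions.

```java
public class YarnRope {
    // Sketch of YaRN-style inverse-frequency precomputation for one head.
    // headDim: per-head dimension; base: RoPE theta; scale: context extension
    // factor; origContext: pre-extension context length; betaFast/betaSlow:
    // rotation-count thresholds bounding the interpolation ramp.
    static float[] precomputeInvFreqs(int headDim, double base, double scale,
                                      int origContext, double betaFast, double betaSlow) {
        int half = headDim / 2;
        float[] invFreq = new float[half];
        // Dimension indices where the ramp starts/ends, derived from how many
        // full rotations each dimension completes over the original context.
        double low = half * Math.log(origContext / (betaFast * 2 * Math.PI)) / Math.log(base);
        double high = half * Math.log(origContext / (betaSlow * 2 * Math.PI)) / Math.log(base);
        for (int i = 0; i < half; i++) {
            double extrap = Math.pow(base, -2.0 * i / headDim); // plain RoPE frequency
            double interp = extrap / scale;                     // position-interpolated
            // t = 0 below the ramp (keep extrapolation), 1 above it (interpolate).
            double t = Math.min(1.0, Math.max(0.0, (i - low) / Math.max(high - low, 1e-3)));
            invFreq[i] = (float) ((1.0 - t) * extrap + t * interp);
        }
        return invFreq;
    }

    public static void main(String[] args) {
        // Illustrative parameters only (head_dim=128 matches the commit message).
        float[] f = precomputeInvFreqs(128, 1_000_000.0, 4.0, 32768, 32.0, 1.0);
        System.out.println(f[0] + " .. " + f[f.length - 1]);
    }
}
```

Precomputing this table once at load time (as the loader change describes) lets the GPU kernel apply the rotation with plain lookups instead of recomputing powers per token.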
Key changes
Tested with
Model