vocab : keep DNA k-mer ids distinct from colliding BPE tokens by kashif · Pull Request #23466 · ggml-org/llama.cpp

kashif · 2026-05-21T08:18:23Z

Overview

Follow-up to #23410. The HybridDNA tokenizer gives every DNA k-mer its own id, but one 6-mer (CCCCCC) also exists as a Qwen3 BPE token. Because get_vocab() is keyed by text, the DNA id (154402) was dropped in favor of the BPE id (91443) and written out as an unused pad token. As a result, <dna>…CCCCCC…</dna> encoded to the wrong id and 154402 detokenized to [PAD154402], diverging from the Python tokenizer.
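The collision described above can be reproduced with a toy text-keyed map. This is a minimal sketch, not the conversion script: the two ids are taken from the description, while the enumeration order and the pad-filling step are assumptions made for illustration.

```python
# Ids are from the PR description; the build order below is an assumption.
DNA_ID, BPE_ID = 154402, 91443

# (text, id) pairs as a tokenizer might enumerate them; CCCCCC appears twice.
entries = [("CCCCCC", DNA_ID), ("CCCCCC", BPE_ID)]

# A text-keyed vocab holds one id per string, so the later entry overwrites the earlier one.
vocab = {}
for text, tok_id in entries:
    vocab[text] = tok_id

# Inverting the map leaves the dropped id without text; fill the gap with a pad name.
id_to_text = {tok_id: text for text, tok_id in vocab.items()}
for _, tok_id in entries:
    id_to_text.setdefault(tok_id, f"[PAD{tok_id}]")

print(vocab["CCCCCC"])     # 91443
print(id_to_text[DNA_ID])  # [PAD154402]
```

Only one id survives under the key CCCCCC, which is why the other one ends up as a pad entry after conversion.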

A naive conversion fix can't work: llama.cpp's vocab is a 1:1 text↔id map, so two tokens named CCCCCC won't load. transformers avoids this by resolving k-mers through a dedicated DNA map in <dna> context. This PR does the same in src/llama-vocab.cpp only: inside <dna> a k-mer resolves to its own id by product-order index (not the shared text→id map), and at load the colliding k-mer's text is restored from its index so it detokenizes correctly.
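To illustrate what resolving a k-mer "by product-order index" could look like, here is a minimal sketch. The alphabet `ACGT`, its ordering, and the `first_kmer_id` offset are all assumptions chosen for the example; the real tokenizer's alphabet and id layout are not stated in this thread.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: the real alphabet and its order may differ
K = 6


def kmer_index(kmer: str, alphabet: str = ALPHABET) -> int:
    """Position of kmer within itertools.product(alphabet, repeat=len(kmer))."""
    index = 0
    for base in kmer:
        index = index * len(alphabet) + alphabet.index(base)
    return index


def kmer_id(kmer: str, first_kmer_id: int) -> int:
    """Id of a k-mer, given the (hypothetical) id of the first k-mer."""
    return first_kmer_id + kmer_index(kmer)


def kmer_text(tok_id: int, first_kmer_id: int, k: int = K, alphabet: str = ALPHABET) -> str:
    """Inverse of kmer_id: rebuild the k-mer text from its id."""
    index = tok_id - first_kmer_id
    chars = []
    for _ in range(k):
        index, digit = divmod(index, len(alphabet))
        chars.append(alphabet[digit])
    return "".join(reversed(chars))


# Cross-check the arithmetic against the explicit product enumeration.
all_kmers = ["".join(p) for p in product(ALPHABET, repeat=K)]
assert all_kmers[kmer_index("CCCCCC")] == "CCCCCC"
```

Because the id is computed from the k-mer's position rather than looked up by text, a k-mer that shares its text with a BPE token still gets its own id, and the text can be rebuilt from the id when detokenizing.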

Result matches transformers both ways: DNA CCCCCC → 154402, plain CCCCCC → 91443, both detokenize to CCCCCC. Verified with full token-id parity against AutoTokenizer(..., trust_remote_code=True).

Additional information

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES for comments and refactoring

kashif · 2026-05-21T08:20:01Z

@CISC would something like this be a valid solution?

CISC · 2026-05-21T09:35:55Z

To not essentially reimplement the vocab I wonder if it would make more sense to f.ex. postfix the DNA tokens with a special character at conversion, which we can then easily strip from id_to_token without replacing the strings?
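The conversion-side half of this suggestion might look roughly like the sketch below. `MARKER`, `mark_dna_tokens`, and the toy vocabulary are hypothetical; the marker character actually chosen in the PR is not given in this thread.

```python
MARKER = "\x01"  # hypothetical marker character, not necessarily the one used in the PR


def mark_dna_tokens(tokens, dna_ids):
    """Return id-ordered token texts with the DNA entries postfixed by MARKER."""
    return [text + MARKER if i in dna_ids else text for i, text in enumerate(tokens)]


# Toy vocabulary: id 1 is the BPE token, id 3 is the DNA k-mer with the same text.
tokens = ["hello", "CCCCCC", "<dna>", "CCCCCC"]
marked = mark_dna_tokens(tokens, dna_ids={3})

# Every text is now unique, so a 1:1 text-to-id map can hold both entries.
assert len(set(marked)) == len(marked)
```

With the marker in place, the two CCCCCC entries no longer share a key, so the existing vocab structure can be kept as-is.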

kashif · 2026-05-21T11:09:37Z

thanks for the suggestion @CISC let me try that

kashif · 2026-05-21T11:25:18Z

@CISC should be ready for initial review

CISC · 2026-05-21T11:53:06Z

Nice, but don't modify token_to_piece; simply update the text of the DNA entries in id_to_token by erasing the marker in place (only for the hybriddna tokenizer, similar to what you did before).
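A rough sketch of the load-side behaviour described here, written in Python for brevity rather than as the actual C++ in src/llama-vocab.cpp. `MARKER`, `load_vocab`, and `lookup` are hypothetical names; the idea shown is that the text-to-id map keeps the marked text while the id-to-text table has the marker erased.

```python
MARKER = "\x01"  # hypothetical marker character


def load_vocab(marked_tokens):
    """Build both maps; only the id -> text side has the marker erased."""
    token_to_id = {text: i for i, text in enumerate(marked_tokens)}
    id_to_token = list(marked_tokens)
    for i, text in enumerate(id_to_token):
        if text.endswith(MARKER):
            # Python strings are immutable, so the list slot is replaced;
            # in C++ the stored string could be trimmed in place instead.
            id_to_token[i] = text[: -len(MARKER)]
    return token_to_id, id_to_token


def lookup(token_to_id, text, in_dna):
    """Inside <dna> the marked text is looked up; elsewhere the plain text."""
    return token_to_id[text + MARKER if in_dna else text]


token_to_id, id_to_token = load_vocab(["hello", "CCCCCC", "<dna>", "CCCCCC" + MARKER])
print(lookup(token_to_id, "CCCCCC", in_dna=True))   # 3
print(lookup(token_to_id, "CCCCCC", in_dna=False))  # 1
print(id_to_token[1] == id_to_token[3])             # True
```

Since detokenization reads from the id-to-text table, both ids print as CCCCCC without any change to the piece-rendering code.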

kashif · 2026-05-21T12:24:25Z

@CISC thanks for the suggestion. Fixed

CISC · 2026-05-21T12:38:26Z

I'll check it out later today (I added a test for CCCCCC in the vocab files).

Is it perhaps worth only checking the tokens after <oov>?
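One way to read this question is as a loop bound: if the DNA k-mers all come after the `<oov>` token in id order (an assumption taken from the question, not confirmed in this thread), the marker scan can start there instead of at id 0. The names below are hypothetical.

```python
MARKER = "\x01"  # hypothetical marker character


def strip_markers_after_oov(id_to_token, oov="<oov>"):
    """Erase MARKER only from entries that follow the oov token; return the start index."""
    start = id_to_token.index(oov) + 1 if oov in id_to_token else 0
    for i in range(start, len(id_to_token)):
        if id_to_token[i].endswith(MARKER):
            id_to_token[i] = id_to_token[i][: -len(MARKER)]
    return start


table = ["hello", "CCCCCC", "<oov>", "AAAAAA" + MARKER, "CCCCCC" + MARKER]
start = strip_markers_after_oov(table)
print(start)      # 3
print(table[3:])  # ['AAAAAA', 'CCCCCC']
```

Entries before `<oov>` are never inspected, which avoids touching the much larger BPE part of the vocabulary.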

kashif · 2026-05-21T12:46:55Z

thanks @CISC checking! Also, hope you can get some sleep!!

CISC · 2026-05-21T12:49:13Z

> thanks @CISC checking! Also, hope you can get some sleep!!

Sleep is overrated. :)

CISC · 2026-05-21T19:42:06Z

@kashif I guess update the GGUFs ASAP, I'll add the vocab test files.

kashif · 2026-05-21T21:00:24Z

Thanks @CISC, all weights updated and re-uploaded

CISC · 2026-05-21T21:24:36Z

BTW, I think you forgot to delete chat_template.jinja, all the GGUFs have chat templates.

kashif · 2026-05-21T21:27:23Z

oops! fixing

kashif · 2026-05-21T21:33:36Z

good catch @CISC removed and reuploaded weights

CISC · 2026-05-22T07:55:27Z

@ServeurpersoCom Thanks, holding for Release fix first.

kashif · 2026-05-22T09:33:10Z

thanks @CISC

* vocab : mark hybriddna k-mers to avoid BPE token collisions
* improved loop

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

kashif requested a review from CISC as a code owner May 21, 2026 08:18

kashif force-pushed the fix-hybriddna-kmer-collision branch 2 times, most recently from da621b9 to 3f4eab0 Compare May 21, 2026 11:24

kashif force-pushed the fix-hybriddna-kmer-collision branch from 3f4eab0 to 29e2bec Compare May 21, 2026 12:06

kashif force-pushed the fix-hybriddna-kmer-collision branch from 29e2bec to 745bfa2 Compare May 21, 2026 12:27

vocab : mark hybriddna k-mers to avoid BPE token collisions

ed2329c

kashif force-pushed the fix-hybriddna-kmer-collision branch from 745bfa2 to ed2329c Compare May 21, 2026 12:47

improved loop

6cbf9dc

github-actions Bot added the python python script changes label May 21, 2026

CISC approved these changes May 21, 2026

CISC added the merge ready A maintainer can use this label to indicate that they consider the changes final and ready to merge. label May 21, 2026

ServeurpersoCom approved these changes May 22, 2026

CISC merged commit afcda09 into ggml-org:master May 22, 2026
6 of 52 checks passed

kashif deleted the fix-hybriddna-kmer-collision branch May 22, 2026 09:33

THEman6989 mentioned this pull request May 22, 2026

Add install() for impl libraries and fix Apple/Android builds THEman6989/llama.cpp-gfx906-turbo-mtp#1

Merged
