Skip to content

vocab : keep DNA k-mer ids distinct from colliding BPE tokens#23466

Merged
CISC merged 2 commits into
ggml-org:masterfrom
kashif:fix-hybriddna-kmer-collision
May 22, 2026
Merged

vocab : keep DNA k-mer ids distinct from colliding BPE tokens#23466
CISC merged 2 commits into
ggml-org:masterfrom
kashif:fix-hybriddna-kmer-collision

Conversation

@kashif
Copy link
Copy Markdown
Contributor

@kashif kashif commented May 21, 2026

Overview

Follow-up to #23410. The HybridDNA tokenizer gives every DNA k-mer its own id, but one 6-mer (CCCCCC) also exists as a Qwen3 BPE token. Because get_vocab() is keyed by text, the DNA id (154402) was dropped in favor of the BPE id (91443) and written out as an unused pad — so <dna>…CCCCCC…</dna> encoded to the wrong id and 154402 detokenized to [PAD154402], diverging from the Python tokenizer.

A naive conversion fix can't work: llama.cpp's vocab is a 1:1 text↔id map, so two tokens named CCCCCC won't load. transformers avoids this by resolving k-mers through a dedicated DNA map in <dna> context. This PR does the same in src/llama-vocab.cpp only: inside <dna> a k-mer resolves to its own id by product-order index (not the shared text→id map), and at load the colliding k-mer's text is restored from its index so it detokenizes correctly.

Result matches transformers both ways: DNA CCCCCC → 154402, plain CCCCCC → 91443, both detokenize to CCCCCC. Verified with full token-id parity against AutoTokenizer(..., trust_remote_code=True).

Additional information

Requirements

@kashif kashif requested a review from CISC as a code owner May 21, 2026 08:18
@kashif
Copy link
Copy Markdown
Contributor Author

kashif commented May 21, 2026

@CISC would something like this be a valid solution?

@CISC
Copy link
Copy Markdown
Member

CISC commented May 21, 2026

To not essentially reimplement the vocab I wonder if it would make more sense to f.ex. postfix the DNA tokens with a special character at conversion, which we can then easily strip from id_to_token without replacing the strings?

@kashif
Copy link
Copy Markdown
Contributor Author

kashif commented May 21, 2026

thanks for the suggestion @CISC let me try that

@kashif kashif force-pushed the fix-hybriddna-kmer-collision branch 2 times, most recently from da621b9 to 3f4eab0 Compare May 21, 2026 11:24
@kashif
Copy link
Copy Markdown
Contributor Author

kashif commented May 21, 2026

@CISC should be ready for intial review

@CISC
Copy link
Copy Markdown
Member

CISC commented May 21, 2026

Nice, but don't modify token_to_piece, simply update id_to_tokens DNA entries text by erasing the marker in-place (only for hybriddna tokenizer, similar to what you did before).

@kashif kashif force-pushed the fix-hybriddna-kmer-collision branch from 3f4eab0 to 29e2bec Compare May 21, 2026 12:06
@kashif
Copy link
Copy Markdown
Contributor Author

kashif commented May 21, 2026

@CISC thanks for the suggestion. Fixed

@kashif kashif force-pushed the fix-hybriddna-kmer-collision branch from 29e2bec to 745bfa2 Compare May 21, 2026 12:27
@CISC
Copy link
Copy Markdown
Member

CISC commented May 21, 2026

I'll check it out later today (I added a test for CCCCCC in the vocab files).

Is it perhaps worth only checking the tokens after <oov>?

@kashif
Copy link
Copy Markdown
Contributor Author

kashif commented May 21, 2026

thanks @CISC checking! Also, hope you can get some sleep!!

@kashif kashif force-pushed the fix-hybriddna-kmer-collision branch from 745bfa2 to ed2329c Compare May 21, 2026 12:47
@CISC
Copy link
Copy Markdown
Member

CISC commented May 21, 2026

thanks @CISC checking! Also, hope you can get some sleep!!

Sleep is overrated. :)

@github-actions github-actions Bot added the python python script changes label May 21, 2026
@CISC CISC added the merge ready A maintainer can use this label to indicate that they consider the changes final and ready to merge. label May 21, 2026
@CISC
Copy link
Copy Markdown
Member

CISC commented May 21, 2026

@kashif I guess update the GGUFs ASAP, I'll add the vocab test files.

@kashif
Copy link
Copy Markdown
Contributor Author

kashif commented May 21, 2026

Thanks @CISC, all weights updated and re-uploaded

@CISC
Copy link
Copy Markdown
Member

CISC commented May 21, 2026

BTW, I think you forgot to delete chat_template.jinja, all the GGUFs have chat templates.

@kashif
Copy link
Copy Markdown
Contributor Author

kashif commented May 21, 2026

opps! fixing

@kashif
Copy link
Copy Markdown
Contributor Author

kashif commented May 21, 2026

good catch @CISC removed and reuploaded weights

@CISC
Copy link
Copy Markdown
Member

CISC commented May 22, 2026

@ServeurpersoCom Thanks, holding for Release fix first.

@CISC CISC merged commit afcda09 into ggml-org:master May 22, 2026
6 of 52 checks passed
@kashif kashif deleted the fix-hybriddna-kmer-collision branch May 22, 2026 09:33
@kashif
Copy link
Copy Markdown
Contributor Author

kashif commented May 22, 2026

thanks @CISC

Alex7MV pushed a commit to Alex7MV/claude_llama.cpp that referenced this pull request May 22, 2026
* vocab : mark hybriddna k-mers to avoid BPE token collisions

* improved loop

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
ProTekk pushed a commit to ProTekk/buun-llama-cpp that referenced this pull request May 22, 2026
* vocab : mark hybriddna k-mers to avoid BPE token collisions

* improved loop

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

merge ready A maintainer can use this label to indicate that they consider the changes final and ready to merge. python python script changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants