vocab : keep DNA k-mer ids distinct from colliding BPE tokens #23466
Conversation
|
@CISC would something like this be a valid solution? |
|
To not essentially reimplement the vocab I wonder if it would make more sense to f.ex. postfix the DNA tokens with a special character at conversion, which we can then easily strip from |
|
thanks for the suggestion @CISC let me try that |
da621b9 to 3f4eab0
|
@CISC should be ready for initial review |
|
Nice, but don't modify |
3f4eab0 to 29e2bec
|
@CISC thanks for the suggestion. Fixed |
29e2bec to 745bfa2
|
I'll check it out later today (I added a test for Is it perhaps worth only checking the tokens after |
|
thanks @CISC checking! Also, hope you can get some sleep!! |
745bfa2 to ed2329c
Sleep is overrated. :) |
|
@kashif I guess update the GGUFs ASAP, I'll add the vocab test files. |
|
Thanks @CISC, all weights updated and re-uploaded |
|
BTW, I think you forgot to delete |
|
oops! fixing |
|
good catch @CISC removed and reuploaded weights |
|
@ServeurpersoCom Thanks, holding for |
|
thanks @CISC |
* vocab : mark hybriddna k-mers to avoid BPE token collisions
* improved loop
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Overview
Follow-up to #23410. The HybridDNA tokenizer gives every DNA k-mer its own id, but one 6-mer (`CCCCCC`) also exists as a Qwen3 BPE token. Because `get_vocab()` is keyed by text, the DNA id (154402) was dropped in favor of the BPE id (91443) and written out as an unused pad, so `<dna>…CCCCCC…</dna>` encoded to the wrong id and 154402 detokenized to `[PAD154402]`, diverging from the Python tokenizer.

A naive conversion fix can't work: llama.cpp's vocab is a 1:1 text↔id map, so two tokens named `CCCCCC` won't load. transformers avoids this by resolving k-mers through a dedicated DNA map in `<dna>` context. This PR does the same in `src/llama-vocab.cpp` only: inside `<dna>` a k-mer resolves to its own id by product-order index (not the shared text→id map), and at load the colliding k-mer's text is restored from its index so it detokenizes correctly.

Result matches transformers both ways: DNA `CCCCCC` → 154402, plain `CCCCCC` → 91443, both detokenize to `CCCCCC`. Verified with full token-id parity against `AutoTokenizer(..., trust_remote_code=True)`.

Additional information
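The collision and the context-dependent lookup described in the overview can be illustrated with a small, self-contained Python sketch. It is not the code from `src/llama-vocab.cpp`. The two ids are the ones quoted in the description; the `ACGT` alphabet order, the `DNA_BASE` offset (derived here only so the toy reproduces the quoted id), and all function names are assumptions made for illustration.

```python
K = 6
ALPHABET = "ACGT"  # assumed ordering; the real tokenizer may use a different one

BPE_ID = 91443   # plain-text BPE token "CCCCCC" (id quoted in the PR description)
DNA_ID = 154402  # DNA k-mer "CCCCCC" (id quoted in the PR description)


def kmer_index(kmer: str) -> int:
    """Position of `kmer` in product order over ALPHABET (base-4 digits)."""
    idx = 0
    for ch in kmer:
        idx = idx * len(ALPHABET) + ALPHABET.index(ch)
    return idx


# Hypothetical base offset of the k-mer block, derived from the quoted id.
DNA_BASE = DNA_ID - kmer_index("CCCCCC")

# A text-keyed map can hold only one id per string, so one entry is lost.
text_to_id = {}
for tok_id, text in [(BPE_ID, "CCCCCC"), (DNA_ID, "CCCCCC")]:
    text_to_id.setdefault(text, tok_id)  # the DNA id never makes it in
id_to_text = {tok_id: text for text, tok_id in text_to_id.items()}


def naive_detok(tok_id: int) -> str:
    """Detokenize through the shared map only; the dropped id becomes a pad."""
    return id_to_text.get(tok_id, f"[PAD{tok_id}]")


def encode_kmer(kmer: str, in_dna: bool) -> int:
    """Inside <dna>, resolve by product-order index instead of the text map."""
    if in_dna:
        return DNA_BASE + kmer_index(kmer)
    return text_to_id[kmer]


def detok(tok_id: int) -> str:
    """Restore a k-mer's text from its index; fall back to the shared map."""
    if DNA_BASE <= tok_id < DNA_BASE + len(ALPHABET) ** K:
        n = tok_id - DNA_BASE
        chars = []
        for _ in range(K):
            n, r = divmod(n, len(ALPHABET))
            chars.append(ALPHABET[r])
        return "".join(reversed(chars))
    return id_to_text.get(tok_id, f"[PAD{tok_id}]")
```

In this toy, `naive_detok(DNA_ID)` yields the pad string, while `encode_kmer` and `detok` give the two ids separate encodings that both decode back to the same text. Computing the id from the index avoids needing a second text→id entry for the colliding string.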
Requirements