llama : Added support for Tekken pre-tokenizer (#8577) #8579
Conversation
Nice work!!

I've reviewed what I'm able to, but I'm not familiar enough with the process of adding new tokenizers to the system to fully verify this, and I'd like a corroborating review if we can get one. Overall, I think this is probably good enough to merge in and work on the next step of Mistral NeMo? Should this new tokenizer be added to the
I do think it's "good enough", as it covers all the test cases in
In these cases, I considered all letters using
Removed unneeded `vocab.tokenizer_clean_spaces` assignment
@m18coppola I managed to find a few failing cases (though not many, impressively) with
This isn't comprehensive for all of Unicode, and might be what is causing the above problems. For example, this is the full regex to match lower-case characters (as in `[a-z]`):

```
[a-zµß-öø-ÿāăąćĉċčďđēĕėęěĝğġģĥħĩīĭįıijĵķ-ĸĺļľŀłńņň-ʼnŋōŏőœŕŗřśŝşšţťŧũūŭůűųŵŷźżž-ƀƃƅƈƌ-ƍƒƕƙ-ƛƞơƣƥƨƪ-ƫƭưƴƶƹ-ƺƽ-ƿdžljnjǎǐǒǔǖǘǚǜ-ǝǟǡǣǥǧǩǫǭǯ-ǰdzǵǹǻǽǿȁȃȅȇȉȋȍȏȑȓȕȗșțȝȟȡȣȥȧȩȫȭȯȱȳ-ȹȼȿ-ɀɂɇɉɋɍɏ-ʓʕ-ʯͱͳͷͻ-ͽΐά-ώϐ-ϑϕ-ϗϙϛϝϟϡϣϥϧϩϫϭϯ-ϳϵϸϻ-ϼа-џѡѣѥѧѩѫѭѯѱѳѵѷѹѻѽѿҁҋҍҏґғҕҗҙқҝҟҡңҥҧҩҫҭүұҳҵҷҹһҽҿӂӄӆӈӊӌӎ-ӏӑӓӕӗәӛӝӟӡӣӥӧөӫӭӯӱӳӵӷӹӻӽӿԁԃԅԇԉԋԍԏԑԓԕԗԙԛԝԟԡԣԥԧԩԫԭԯՠ-ֈა-ჺჽ-ჿᏸ-ᏽᲀ-ᲈᴀ-ᴫᵫ-ᵷᵹ-ᶚḁḃḅḇḉḋḍḏḑḓḕḗḙḛḝḟḡḣḥḧḩḫḭḯḱḳḵḷḹḻḽḿṁṃṅṇṉṋṍṏṑṓṕṗṙṛṝṟṡṣṥṧṩṫṭṯṱṳṵṷṹṻṽṿẁẃẅẇẉẋẍẏẑẓẕ-ẝẟạảấầẩẫậắằẳẵặẹẻẽếềểễệỉịọỏốồổỗộớờởỡợụủứừửữựỳỵỷỹỻỽỿ-ἇἐ-ἕἠ-ἧἰ-ἷὀ-ὅὐ-ὗὠ-ὧὰ-ώᾀ-ᾇᾐ-ᾗᾠ-ᾧᾰ-ᾴᾶ-ᾷιῂ-ῄῆ-ῇῐ-ΐῖ-ῗῠ-ῧῲ-ῴῶ-ῷℊℎ-ℏℓℯℴℹℼ-ℽⅆ-ⅉⅎↄⰰ-ⱟⱡⱥ-ⱦⱨⱪⱬⱱⱳ-ⱴⱶ-ⱻⲁⲃⲅⲇⲉⲋⲍⲏⲑⲓⲕⲗⲙⲛⲝⲟⲡⲣⲥⲧⲩⲫⲭⲯⲱⲳⲵⲷⲹⲻⲽⲿⳁⳃⳅⳇⳉⳋⳍⳏⳑⳓⳕⳗⳙⳛⳝⳟⳡⳣ-ⳤⳬⳮⳳⴀ-ⴥⴧⴭꙁꙃꙅꙇꙉꙋꙍꙏꙑꙓꙕꙗꙙꙛꙝꙟꙡꙣꙥꙧꙩꙫꙭꚁꚃꚅꚇꚉꚋꚍꚏꚑꚓꚕꚗꚙꚛꜣꜥꜧꜩꜫꜭꜯ-ꜱꜳꜵꜷꜹꜻꜽꜿꝁꝃꝅꝇꝉꝋꝍꝏꝑꝓꝕꝗꝙꝛꝝꝟꝡꝣꝥꝧꝩꝫꝭꝯꝱ-ꝸꝺꝼꝿꞁꞃꞅꞇꞌꞎꞑꞓ-ꞕꞗꞙꞛꞝꞟꞡꞣꞥꞧꞩꞯꞵꞷꞹꞻꞽꞿꟁꟃꟈꟊꟑꟓꟕꟗꟙꟶꟺꬰ-ꭚꭠ-ꭨꭰ-ꮿff-stﬓ-ﬗa-z𐐨-𐑏𐓘-𐓻𐖗-𐖡𐖣-𐖱𐖳-𐖹𐖻-𐖼𐳀-𐳲𑣀-𑣟𖹠-𖹿𝐚-𝐳𝑎-𝑔𝑖-𝑧𝒂-𝒛𝒶-𝒹𝒻𝒽-𝓃𝓅-𝓏𝓪-𝔃𝔞-𝔷𝕒-𝕫𝖆-𝖟𝖺-𝗓𝗮-𝘇𝘢-𝘻𝙖-𝙯𝚊-𝚥𝛂-𝛚𝛜-𝛡𝛼-𝜔𝜖-𝜛𝜶-𝝎𝝐-𝝕𝝰-𝞈𝞊-𝞏𝞪-𝟂𝟄-𝟉𝟋𝼀-𝼉𝼋-𝼞𝼥-𝼪𞤢-𞥃]
```

While for uppercase (as in `[A-Z]`):

```
[A-ZÀ-ÖØ-ÞĀĂĄĆĈĊČĎĐĒĔĖĘĚĜĞĠĢĤĦĨĪĬĮİIJĴĶĹĻĽĿŁŃŅŇŊŌŎŐŒŔŖŘŚŜŞŠŢŤŦŨŪŬŮŰŲŴŶŸ-ŹŻŽƁ-ƂƄƆ-ƇƉ-ƋƎ-ƑƓ-ƔƖ-ƘƜ-ƝƟ-ƠƢƤƦ-ƧƩƬƮ-ƯƱ-ƳƵƷ-ƸƼDŽLJNJǍǏǑǓǕǗǙǛǞǠǢǤǦǨǪǬǮDZǴǶ-ǸǺǼǾȀȂȄȆȈȊȌȎȐȒȔȖȘȚȜȞȠȢȤȦȨȪȬȮȰȲȺ-ȻȽ-ȾɁɃ-ɆɈɊɌɎͰͲͶͿΆΈ-ΊΌΎ-ΏΑ-ΡΣ-ΫϏϒ-ϔϘϚϜϞϠϢϤϦϨϪϬϮϴϷϹ-ϺϽ-ЯѠѢѤѦѨѪѬѮѰѲѴѶѸѺѼѾҀҊҌҎҐҒҔҖҘҚҜҞҠҢҤҦҨҪҬҮҰҲҴҶҸҺҼҾӀ-ӁӃӅӇӉӋӍӐӒӔӖӘӚӜӞӠӢӤӦӨӪӬӮӰӲӴӶӸӺӼӾԀԂԄԆԈԊԌԎԐԒԔԖԘԚԜԞԠԢԤԦԨԪԬԮԱ-ՖႠ-ჅჇჍᎠ-ᏵᲐ-ᲺᲽ-ᲿḀḂḄḆḈḊḌḎḐḒḔḖḘḚḜḞḠḢḤḦḨḪḬḮḰḲḴḶḸḺḼḾṀṂṄṆṈṊṌṎṐṒṔṖṘṚṜṞṠṢṤṦṨṪṬṮṰṲṴṶṸṺṼṾẀẂẄẆẈẊẌẎẐẒẔẞẠẢẤẦẨẪẬẮẰẲẴẶẸẺẼẾỀỂỄỆỈỊỌỎỐỒỔỖỘỚỜỞỠỢỤỦỨỪỬỮỰỲỴỶỸỺỼỾἈ-ἏἘ-ἝἨ-ἯἸ-ἿὈ-ὍὙὛὝὟὨ-ὯᾸ-ΆῈ-ΉῘ-ΊῨ-ῬῸ-Ώℂℇℋ-ℍℐ-ℒℕℙ-ℝℤΩℨK-ℭℰ-ℳℾ-ℿⅅↃⰀ-ⰯⱠⱢ-ⱤⱧⱩⱫⱭ-ⱰⱲⱵⱾ-ⲀⲂⲄⲆⲈⲊⲌⲎⲐⲒⲔⲖⲘⲚⲜⲞⲠⲢⲤⲦⲨⲪⲬⲮⲰⲲⲴⲶⲸⲺⲼⲾⳀⳂⳄⳆⳈⳊⳌⳎⳐⳒⳔⳖⳘⳚⳜⳞⳠⳢⳫⳭⳲꙀꙂꙄꙆꙈꙊꙌꙎꙐꙒꙔꙖꙘꙚꙜꙞꙠꙢꙤꙦꙨꙪꙬꚀꚂꚄꚆꚈꚊꚌꚎꚐꚒꚔꚖꚘꚚꜢꜤꜦꜨꜪꜬꜮꜲꜴꜶꜸꜺꜼꜾꝀꝂꝄꝆꝈꝊꝌꝎꝐꝒꝔꝖꝘꝚꝜꝞꝠꝢꝤꝦꝨꝪꝬꝮꝹꝻꝽ-ꝾꞀꞂꞄꞆꞋꞍꞐꞒꞖꞘꞚꞜꞞꞠꞢꞤꞦꞨꞪ-ꞮꞰ-ꞴꞶꞸꞺꞼꞾꟀꟂꟄ-ꟇꟉꟐꟖꟘꟵA-Z𐐀-𐐧𐒰-𐓓𐕰-𐕺𐕼-𐖊𐖌-𐖒𐖔-𐖕𐲀-𐲲𑢠-𑢿𖹀-𖹟𝐀-𝐙𝐴-𝑍𝑨-𝒁𝒜𝒞-𝒟𝒢𝒥-𝒦𝒩-𝒬𝒮-𝒵𝓐-𝓩𝔄-𝔅𝔇-𝔊𝔍-𝔔𝔖-𝔜𝔸-𝔹𝔻-𝔾𝕀-𝕄𝕆𝕊-𝕐𝕬-𝖅𝖠-𝖹𝗔-𝗭𝘈-𝘡𝘼-𝙕𝙰-𝚉𝚨-𝛀𝛢-𝛺𝜜-𝜴𝝖-𝝮𝞐-𝞨𝟊𞤀-𞤡]
```

And for titlecase:

```
[DžLjNjDzᾈ-ᾏᾘ-ᾟᾨ-ᾯᾼῌῼ]
```

Script used to generate the above:

```python
#!/usr/bin/env python3

start_range = None
end_range = None
ranges: list[tuple[int, int]] = []

# You need this file from https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
with open("UnicodeData.txt", "r") as f:
    for line in f:
        fields = line.split(';')
        # Replace this with the category for which ranges should be found
        if fields[2] == "Lu":
            hex = int(fields[0], 16)
            if len(ranges) == 0:
                ranges.append((hex, hex))
            elif ranges[-1][1] == hex - 1:
                ranges[-1] = (ranges[-1][0], hex)
            else:
                ranges.append((hex, hex))

print("[", end="")
for range in ranges:
    if range[0] == range[1]:
        print(chr(range[0]), end="")
    else:
        print(f"{chr(range[0])}-{chr(range[1])}", end="")
print("]")
```

I think this is fine for now with the shorter ASCII-only exclusions, but this should eventually be fixed by properly supporting at least some sub-categories in

cc @jaime-m-p

(More proof of the hope that all the flags can fit in 32 bits:)

```
$ cut -d';' UnicodeData.txt -f3 | sort | uniq
Cc
Cf
Co
Cs
Ll
Lm
Lo
Lt
Lu
Mc
Me
Mn
Nd
Nl
No
Pc
Pd
Pe
Pf
Pi
Po
Ps
Sc
Sk
Sm
So
Zl
Zp
Zs
$ cut -d';' UnicodeData.txt -f3 | sort | uniq | wc -l
29
```
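As a quick illustration of that point (not part of this PR), the same category count can be reproduced from Python's standard `unicodedata` module; the snippet below is only a sketch:

```python
import unicodedata

# Collect the general categories of all assigned codepoints.
used = {unicodedata.category(chr(cp)) for cp in range(0x110000)}
used.discard("Cn")  # "Cn" is the category of unassigned codepoints

print(sorted(used))
print(len(used))  # 29 -- so a single bit per category fits comfortably in 32 bits
```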
I agree with this too. Keep it simple at first, especially since it seems to work for most cases. What's nice is that fixing the regex will not require reconverting the model, so it can safely be done later.
Thanks @m18coppola for this PR.
Most tokenization tests that I did are passing, though I found a few that fail:
```
src: 'Thai : สพรั ่ ง กั'
res: 'Thai : สพรั ่ ง กั'
tok: 2438 2464 1737 18430 13119 4026 4739 1032 5004 3341 1135 18031 4739

main : failed test:    'Thai : สพรั ่ ง กั'
main : detokenized to: 'Thai : สพรั ่ ง กั' instead of 'Thai : สพรั ่ ง กั'
main : expected tokens: 2438 'Th', 2464 'ai', 1737 ' :', 18430 ' ส', 13119 'พ', 43134 'รั', 1032 ' ', 5004 '่', 3341 ' ', 1135 '', 18031 ' ก', 4739 'ั',
main : got tokens:      2438 'Th', 2464 'ai', 1737 ' :', 18430 ' ส', 13119 'พ', 4026 'ร', 4739 'ั', 1032 ' ', 5004 '่', 3341 ' ', 1135 '', 18031 ' ก', 4739 'ั',
```
This one also fails with the `bert-bge` model.
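For anyone who wants to reproduce this kind of round-trip check against the reference tokenizer, a rough sketch using `transformers` (the model id and sample text here are only illustrative assumptions):

```python
from transformers import AutoTokenizer

# Illustrative model id -- any checkpoint that ships the Tekken tokenizer should behave similarly.
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Instruct-2407")

text = "Thai : สพรั ่ ง กั"  # sample string containing combining marks
ids = tok.encode(text, add_special_tokens=False)
round_trip = tok.decode(ids)

print(ids)
print(round_trip == text)  # a correct tokenizer implementation should round-trip this string
```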
I think it is fine to merge and resolve later
Currently it is set to 16 bits to save memory, but we can use 64 bits (Unicode categories + helper flags: is_whitespace, etc.). I'm experimenting with
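To make the 16/64-bit trade-off concrete, here is a purely hypothetical flag layout sketched in Python (the names and bit positions are assumptions, not the actual llama.cpp definitions):

```python
from enum import IntFlag

class CharFlags(IntFlag):
    # One bit per Unicode general category (only a few shown here).
    LETTER_UPPERCASE = 1 << 0   # Lu
    LETTER_LOWERCASE = 1 << 1   # Ll
    NUMBER_DECIMAL   = 1 << 2   # Nd
    SEPARATOR_SPACE  = 1 << 3   # Zs
    # ... the remaining categories would occupy bits 4..28 ...
    # Helper flags can live in the upper bits of a 64-bit word.
    IS_WHITESPACE    = 1 << 32
    IS_DIGIT         = 1 << 33

flags = CharFlags.SEPARATOR_SPACE | CharFlags.IS_WHITESPACE
print(hex(flags))  # 0x100000008
```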
Just a heads up: the tokenizer in the HF repo is a bit different from the original first upload from Mistral, so it's best to re-download it.
Thanks for the alert -- just re-ran and can confirm that the chkhsh of the latest tokenizer version on HF is different from what it was before. I was getting the same chkhsh as this PR with the version from yesterday, and with the new version I'm getting a new chkhsh. We'll need to get this PR updated with the latest before we merge it in.
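For context, the chkhsh is (roughly) a SHA-256 over the token IDs the reference tokenizer produces for a fixed check string, which is why a tokenizer update on HF changes it. A simplified sketch (the real check text lives in convert_hf_to_gguf_update.py; the model id and text below are assumptions):

```python
import hashlib
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Instruct-2407")  # assumed model id

chktxt = "placeholder check text"  # the real string is much longer and full of edge cases
chktok = tok.encode(chktxt)
chkhsh = hashlib.sha256(str(chktok).encode()).hexdigest()

print(chkhsh)  # changes whenever the tokenizer's output for the check text changes
```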
* Updated chkhsh for Tekken tokenizer
"Add EOS" is disabled by default, changed Let me know if I should squash this into a single commit. |
The default merge strategy for the repo is squash, so no need to squash in your PR -- it's fine (even ideal) to keep it separate here. Y'all think we're good to merge?
* llama : Added support for Tekken pre-tokenizer (ggerganov#8577)

  Removed unneeded `vocab.tokenizer_clean_spaces` assignment

* llama : fix order of pre-tokenizers

* Tekken pre-tokenizer no longer uses clean_up_tokenization_spaces

* Updated chkhsh for Tekken tokenizer

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
```
llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'tekken'

(a) C:\Users\ArabTech\Desktop\2\llama-b3419-bin-win-openblas-x64>

llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache

(a) C:\Users\ArabTech\Desktop\2\llama-b3604-bin-win-openblas-x64 (1)>cd..

(a) C:\Users\ArabTech\Desktop\2>wasmedge --dir .:. llama-api-server.wasm --model-path Mistral-Nemo-Instruct-2407-Q4_K_M.gguf --prompt-template mistral-instruct --ctx-size 128000
```

It does not run with llama.cpp or LlamaEdge.
Adding Tekken pre-tokenizer support for Mistral Nemo models.
Partially addresses issue #8577