vocab: add tokenizer support for jina-embeddings-v2-base-zh #18756
o7si wants to merge 4 commits into ggml-org:master
Conversation
daveth3t3chg33k
left a comment
Extract the function to an IIFE variable
@daveth3t3chg33k Could you please specify which code snippet you're referring to?
As mentioned here #18452 (comment), this is not BPE, so BPE should not be modified to accommodate it. The lowercase normalization should be read from
CISC
left a comment
Now, this revealed something very interesting about BertNormalizer:

- `clean_text` (`bool`, optional, defaults to `True`) — Whether to clean the text, by removing any control characters and replacing all whitespaces by the classic one.
- `handle_chinese_chars` (`bool`, optional, defaults to `True`) — Whether to handle chinese chars by putting spaces around them.
- `strip_accents` (`bool`, optional) — Whether to strip all accents. If this option is not specified (i.e. == `None`), then it will be determined by the value for `lowercase` (as in the original Bert).
- `lowercase` (`bool`, optional, defaults to `True`) — Whether to lowercase.
I wonder if any models ever stray from defaults here, our WPM tokenizer just assumes all this...
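For reference, the four options above can be approximated in plain Python. This is a rough sketch of the documented HF semantics, not the actual WPM code in llama.cpp:

```python
import unicodedata

def bert_normalize(text, clean_text=True, handle_chinese_chars=True,
                   strip_accents=None, lowercase=True):
    """Rough pure-Python approximation of HF's BertNormalizer options."""
    if clean_text:
        # Drop control characters, map all whitespace to a plain space.
        out = []
        for ch in text:
            if ch in ("\t", "\n", "\r"):
                out.append(" ")
            elif unicodedata.category(ch).startswith("C"):
                continue
            elif ch.isspace():
                out.append(" ")
            else:
                out.append(ch)
        text = "".join(out)
    if handle_chinese_chars:
        # Pad CJK ideographs with spaces so they become single-char tokens.
        text = "".join(f" {ch} " if 0x4E00 <= ord(ch) <= 0x9FFF else ch
                       for ch in text)
    if strip_accents is None:  # as in the original Bert: follow `lowercase`
        strip_accents = lowercase
    if strip_accents:
        text = "".join(ch for ch in unicodedata.normalize("NFD", text)
                       if unicodedata.category(ch) != "Mn")
    if lowercase:
        text = text.lower()
    return text
```

With all defaults, `bert_normalize("Héllo\t你好")` lowercases, strips the accent, and pads the two Chinese characters with spaces.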
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Yes, they do exist! Take a look at this model: https://huggingface.co/aari1995/German_Semantic_V3/raw/main/tokenizer.json

```json
"normalizer": {
  "type": "BertNormalizer",
  "clean_text": true,
  "handle_chinese_chars": true,
  "strip_accents": false,
  "lowercase": false
},
```

Here `"lowercase": false` differs from the default (`true`). I haven't found models that deviate from the defaults for the other parameters, but they likely exist since HF exposes these options.
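A converter that wants to honor such per-model settings could read them straight from `tokenizer.json` instead of assuming defaults. A minimal sketch — the helper name is made up for illustration, not an existing `convert_hf_to_gguf.py` function:

```python
import json

def bert_normalizer_flags(tokenizer_json: str) -> dict:
    """Hypothetical helper: extract BertNormalizer flags so a converter
    can store them as vocab metadata instead of hard-coding defaults."""
    cfg = json.loads(tokenizer_json)
    norm = cfg.get("normalizer") or {}
    if norm.get("type") != "BertNormalizer":
        raise ValueError("not a BertNormalizer")
    return {
        "clean_text": norm.get("clean_text", True),
        "handle_chinese_chars": norm.get("handle_chinese_chars", True),
        "strip_accents": norm.get("strip_accents"),  # None -> follow lowercase
        "lowercase": norm.get("lowercase", True),
    }

# The German_Semantic_V3 normalizer block quoted above:
config = '''{
  "normalizer": {
    "type": "BertNormalizer",
    "clean_text": true,
    "handle_chinese_chars": true,
    "strip_accents": false,
    "lowercase": false
  }
}'''
print(bert_normalizer_flags(config)["lowercase"])  # False
```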
I'm still modifying the related code and will re-request review once ready.
Hi @CISC, Looking at:

```json
{
  "normalizer": {"type": "Sequence", "normalizers": [{"type": "NFC"}, {"type": "Lowercase"}]},
  "pre_tokenizer": {"type": "Whitespace"},
  "model": {"type": "BPE", ...}
}
```

The core

My proposed implementation approach:

This way we can keep

What do you think of this approach? On a related note, I noticed that
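For concreteness, the `Sequence` normalizer and `Whitespace` pre-tokenizer in that config behave roughly like this pure-Python sketch (illustrative only, not llama.cpp code):

```python
import re
import unicodedata

def normalize(text: str) -> str:
    text = unicodedata.normalize("NFC", text)  # {"type": "NFC"}
    return text.lower()                        # {"type": "Lowercase"}

def whitespace_pretokenize(text: str) -> list:
    # HF's Whitespace pre-tokenizer splits on the pattern \w+|[^\w\s]+
    return re.findall(r"\w+|[^\w\s]+", text)

print(whitespace_pretokenize(normalize("Hello, World!")))
# ['hello', ',', 'world', '!']
```

The resulting word-level pieces are then what the BPE model merges, with no byte-level remapping step in between.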
Well, that's the problem, the

We have an ongoing model PR that's challenging this already, mixing SPM-style metaspace and
Hi @CISC, I understand your point. Looking at the enum:

```cpp
LLAMA_VOCAB_TYPE_BPE = 2, // GPT-2 tokenizer based on byte-level BPE
```

The whitespace tokenizer differs from GPT-2 in that it doesn't use byte encoding, so it shouldn't be treated as the same type. However, I'm facing an implementation dilemma:

So I'd like to ask: which approach do you prefer? Or do you have any other suggestions?
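To illustrate the distinction being discussed: GPT-2-style byte-level BPE first remaps every input byte to a printable unicode character, so arbitrary text (including Chinese) is always representable, whereas a plain Whitespace + BPE tokenizer works on raw characters and skips this step. A sketch of the standard GPT-2 byte-to-unicode table:

```python
def bytes_to_unicode():
    """Build GPT-2's byte -> printable-unicode-char mapping."""
    # Bytes that are already printable keep their own codepoint...
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    # ...everything else is shifted up past 255 to stay printable.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

mapping = bytes_to_unicode()
# The UTF-8 bytes of a Chinese character become a printable surrogate string:
print("".join(mapping[b] for b in "你".encode("utf-8")))  # ä½ł
```

Because of this remapping, a byte-level BPE vocab stores entries like `ä½ł` rather than `你`, which is why a character-level whitespace tokenizer can't simply reuse the same code path.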
The `jina-embeddings-v2-base-zh` model uses: (你好)

This PR adds tokenizer support for `jina-embeddings-v2-base-zh`.

Note: I'm still learning the llama.cpp codebase, so please point out any issues and I'll fix them promptly! :D
Tested:
Expected output (HuggingFace tokenizer)
Actual output (llama.cpp tokenizer)
Related issue: