
vocab: add tokenizer support for jina-embeddings-v2-base-zh#18756

Draft
o7si wants to merge 4 commits into ggml-org:master from o7si:issue-18452

Conversation

@o7si
Contributor

@o7si o7si commented Jan 11, 2026

The jina-embeddings-v2-base-zh model uses:

  • Whitespace pre-tokenizer
  • Raw Unicode vocabulary (tokens stored as original characters like 你好)
  • Lowercase normalization

This PR adds tokenizer support for jina-embeddings-v2-base-zh.
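For reference, the three properties above can be sketched in plain Python (an illustration of the behavior only, not llama.cpp code; the regex is the pattern documented for the HF `Whitespace` pre-tokenizer):

```python
import re
import unicodedata


def normalize(text: str) -> str:
    # NFC + lowercase, matching the model's normalizer sequence
    return unicodedata.normalize("NFC", text).lower()


def pre_tokenize(text: str) -> list[str]:
    # HF "Whitespace" pre-tokenizer: split on \w+|[^\w\s]+
    return re.findall(r"\w+|[^\w\s]+", text)


print(pre_tokenize(normalize("Hello, 你好 World")))
# ['hello', ',', '你好', 'world']
```

Note that 你好 survives as raw characters — there is no byte-level remapping step, which is the key difference from the gpt2-style path.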

Note: I'm still learning the llama.cpp codebase, so please point out any issues — I'll fix them promptly! :D

Tested:

Expected output (HuggingFace tokenizer)
$ git clone https://huggingface.co/jinaai/jina-embeddings-v2-base-zh 
$ cat << 'EOF' > test_jina_tokenizer.py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jina-embeddings-v2-base-zh")
ids = tokenizer.encode("你好")
for token_id in ids:
    print(token_id, tokenizer.decode(token_id))
EOF
$ python test_jina_tokenizer.py
0 <s>
49226 你好
2 </s>
Actual output (llama.cpp tokenizer)
$ git clone https://huggingface.co/jinaai/jina-embeddings-v2-base-zh 
$ python convert_hf_to_gguf.py jina-embeddings-v2-base-zh --outtype f32
$ ./build/bin/llama-tokenize -m jina-embeddings-v2-base-zh/Jina-Bert-Implementation-160M-F32.gguf -p "你好"
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.029 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   Apple M4
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 11453.25 MB
llama_model_load_from_file_impl: using device Metal (Apple M4) (unknown id) - 10922 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 184 tensors from jina-embeddings-v2-base-zh/Jina-Bert-Implementation-160M-F32.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = jina-bert-v2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Jina Bert Implementation
llama_model_loader: - kv   3:                       general.organization str              = Jinaai
llama_model_loader: - kv   4:                         general.size_label str              = 160M
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["sentence-transformers", "feature-ex...
llama_model_loader: - kv   7:                          general.languages arr[str,2]       = ["en", "zh"]
llama_model_loader: - kv   8:                   jina-bert-v2.block_count u32              = 12
llama_model_loader: - kv   9:                jina-bert-v2.context_length u32              = 8192
llama_model_loader: - kv  10:              jina-bert-v2.embedding_length u32              = 768
llama_model_loader: - kv  11:           jina-bert-v2.feed_forward_length u32              = 3072
llama_model_loader: - kv  12:          jina-bert-v2.attention.head_count u32              = 12
llama_model_loader: - kv  13:  jina-bert-v2.attention.layer_norm_epsilon f32              = 0.000000
llama_model_loader: - kv  14:                          general.file_type u32              = 0
llama_model_loader: - kv  15:              jina-bert-v2.attention.causal bool             = false
llama_model_loader: - kv  16:                  jina-bert-v2.pooling_type u32              = 1
llama_model_loader: - kv  17:               general.quantization_version u32              = 2
llama_model_loader: - kv  18:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  19:                         tokenizer.ggml.pre str              = jina-v2-zh
llama_model_loader: - kv  20:                      tokenizer.ggml.tokens arr[str,61056]   = ["<s>", "<pad>", "</s>", "<unk>", "<m...
llama_model_loader: - kv  21:                  tokenizer.ggml.token_type arr[i32,61056]   = [3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  22:                      tokenizer.ggml.merges arr[str,39382]   = ["t h", "i n", "a n", "e r", "th e", ...
llama_model_loader: - kv  23:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  24:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  25:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  26:          tokenizer.ggml.seperator_token_id u32              = 2
llama_model_loader: - kv  27:            tokenizer.ggml.padding_token_id u32              = 1
llama_model_loader: - kv  28:               tokenizer.ggml.mask_token_id u32              = 4
llama_model_loader: - kv  29:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  30:               tokenizer.ggml.add_eos_token bool             = true
llama_model_loader: - kv  31:               tokenizer.ggml.add_sep_token bool             = true
llama_model_loader: - kv  32:            tokenizer.ggml.token_type_count u32              = 2
llama_model_loader: - type  f32:  184 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = all F32
print_info: file size   = 611.20 MiB (32.00 BPW) 
load: empty token at index 5
init_tokenizer: initializing tokenizer for type 2
load: model vocab missing newline token, using special_pad_id instead
load: 0 unused tokens
load: control token:      0 '<s>' is not marked as EOG
load: control token:      4 '<mask>' is not marked as EOG
load: control token:      1 '<pad>' is not marked as EOG
load: control token:      3 '<unk>' is not marked as EOG
load: control token:      2 '</s>' is not marked as EOG
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 2 ('</s>')
load: special tokens cache size = 5
load: token to piece cache size = 0.3058 MB
print_info: arch             = jina-bert-v2
print_info: vocab_only       = 1
print_info: no_alloc         = 0
print_info: model type       = ?B
print_info: model params     = 160.22 M
print_info: general.name     = Jina Bert Implementation
print_info: vocab type       = BPE
print_info: n_vocab          = 61056
print_info: n_merges         = 39382
print_info: BOS token        = 0 '<s>'
print_info: EOS token        = 2 '</s>'
print_info: UNK token        = 3 '<unk>'
print_info: SEP token        = 2 '</s>'
print_info: PAD token        = 1 '<pad>'
print_info: MASK token       = 4 '<mask>'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 45
llama_model_load: vocab only - skipping tensors
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 512
llama_context: n_ctx_seq     = 512
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 0.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (512) > n_ctx_train (0) -- possible training context overflow
     0 -> '<s>'
 49226 -> '你好'
     2 -> '</s>'

Related issue:


@daveth3t3chg33k daveth3t3chg33k left a comment


Extract the function to an IIFE variable

@o7si
Contributor Author

o7si commented Jan 12, 2026

Extract the function to an IIFE variable

@daveth3t3chg33k Could you please specify which code snippet you're referring to?

@CISC
Member

CISC commented Jan 13, 2026

As mentioned here #18452 (comment), this is not BPE so BPE should not be modified to accommodate this.

The lowercase normalization should be read from tokenizer.json by SpecialVocab and flagged in metadata.

@o7si o7si marked this pull request as draft January 14, 2026 08:50
Member

@CISC CISC left a comment


Now, this revealed something very interesting about BertNormalizer:

  • clean_text (bool, optional, defaults to True) — Whether to clean the text, by removing any control characters and replacing all whitespaces by the classic one.
  • handle_chinese_chars (bool, optional, defaults to True) — Whether to handle chinese chars by putting spaces around them.
  • strip_accents (bool, optional) — Whether to strip all accents. If this option is not specified (ie == None), then it will be determined by the value for lowercase (as in the original Bert).
  • lowercase (bool, optional, defaults to True) — Whether to lowercase.

I wonder if any models ever stray from the defaults here; our WPM tokenizer just assumes all of this...
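For illustration, here is a simplified Python sketch of what those four options do (my own approximation, not the real implementation: the CJK range below is abbreviated to the main ideograph block, and `strip_accents` is omitted):

```python
import unicodedata


def bert_normalize(text: str, clean_text: bool = True,
                   handle_chinese_chars: bool = True,
                   lowercase: bool = True) -> str:
    out = []
    for ch in text:
        if clean_text and ch.isspace():
            out.append(" ")  # replace any whitespace with a plain space
        elif clean_text and unicodedata.category(ch).startswith("C"):
            continue  # drop control characters
        elif handle_chinese_chars and 0x4E00 <= ord(ch) <= 0x9FFF:
            out.append(f" {ch} ")  # pad CJK ideographs with spaces
        else:
            out.append(ch)
    s = "".join(out)
    return s.lower() if lowercase else s


print(bert_normalize("A\tB你"))
# 'a b 你 '
```

With `lowercase=False` (as in the German_Semantic_V3 config below) only the casing step changes; the CJK padding and whitespace cleanup still apply.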

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
@o7si
Contributor Author

o7si commented Jan 14, 2026

I wonder if any models ever stray from the defaults here; our WPM tokenizer just assumes all of this...

Yes, they do exist! Take a look at this model: https://huggingface.co/aari1995/German_Semantic_V3/raw/main/tokenizer.json

  "normalizer": {
    "type": "BertNormalizer",
    "clean_text": true,
    "handle_chinese_chars": true,
    "strip_accents": false,
    "lowercase": false # <-- differs from default (true)
  },

I haven't found models that deviate from defaults for other parameters, but they likely exist since HF exposes these options.
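Reading these flags out of the parsed tokenizer.json could look something like this sketch (`bert_normalizer_flags` is a hypothetical helper, not an existing convert_hf_to_gguf.py function; defaults follow the HF documentation quoted above):

```python
def bert_normalizer_flags(tokenizer_json: dict) -> dict:
    # Extract BertNormalizer options, falling back to HF's documented defaults
    norm = tokenizer_json.get("normalizer") or {}
    if norm.get("type") != "BertNormalizer":
        return {}
    return {
        "clean_text": norm.get("clean_text", True),
        "handle_chinese_chars": norm.get("handle_chinese_chars", True),
        "strip_accents": norm.get("strip_accents"),  # None = follow lowercase
        "lowercase": norm.get("lowercase", True),
    }


# Mirrors the German_Semantic_V3 excerpt above
cfg = {"normalizer": {"type": "BertNormalizer", "clean_text": True,
                      "handle_chinese_chars": True,
                      "strip_accents": False, "lowercase": False}}
print(bert_normalizer_flags(cfg)["lowercase"])
# False
```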

@o7si
Contributor Author

o7si commented Jan 14, 2026

I'm still modifying the related code and will re-request review once ready.

@o7si
Contributor Author

o7si commented Jan 15, 2026

This is technically a very basic, but unsupported tokenizer, perhaps it should just be treated as one (ie, set tokenizer.ggml.model to whitespace)?

Hi @CISC, looking at jina-embeddings-v2-base-zh/tokenizer.json:

{
  "normalizer": {"type": "Sequence", "normalizers": [{"type": "NFC"}, {"type": "Lowercase"}]},
  "pre_tokenizer": {"type": "Whitespace"},
  "model": {"type": "BPE", ...}
}

The core model.type is still BPE. So this is essentially a BPE tokenizer without byte-level encoding.

My proposed implementation approach:

  • Detect whether ByteLevel exists in the pre_tokenizer (similar to how we read normalizer.lowercase)
  • Add a new metadata field: tokenizer.ggml.byte_level
  • In C++, read this metadata to control whether byte-level encoding is applied

This way we can keep tokenizer.ggml.model = "gpt2" for all BPE tokenizers, and use metadata flags (byte_level, normalizer.lowercase) to control the specific behavior.
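The detection step could be sketched like this (`has_byte_level` is a hypothetical helper; it also checks inside a `Sequence` pre-tokenizer, which in tokenizer.json nests its children under `pretokenizers`):

```python
def has_byte_level(tokenizer_cfg: dict) -> bool:
    # Detect a ByteLevel pre-tokenizer, including one nested in a Sequence
    pt = tokenizer_cfg.get("pre_tokenizer") or {}
    if pt.get("type") == "ByteLevel":
        return True
    return any(sub.get("type") == "ByteLevel"
               for sub in pt.get("pretokenizers", []))


print(has_byte_level({"pre_tokenizer": {"type": "ByteLevel"}}))   # True
print(has_byte_level({"pre_tokenizer": {"type": "Whitespace"}}))  # False
```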

What do you think of this approach?

On a related note, I noticed that llama.cpp currently mixes normalizer, pre-tokenizer, and tokenizer logic together. In the long run, separating these into distinct stages (like HuggingFace tokenizers) might make it easier to support different variants via metadata flags. (Of course, this is out of scope for this PR - just a thought for discussion.)

@CISC
Member

CISC commented Jan 15, 2026

The core model.type is still BPE. So this is essentially a BPE tokenizer without byte-level encoding.

Well, that's the problem: the llama.cpp BPE is BPE with Byte-Level Encoding, hence the gpt2 name.
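For context, the byte-level part is GPT-2's reversible byte-to-unicode mapping, reproduced here for illustration:

```python
def bytes_to_unicode() -> dict[int, str]:
    # Map every byte 0-255 to a printable unicode character (GPT-2 scheme)
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)  # shift unprintable bytes into a safe range
            n += 1
    return dict(zip(bs, map(chr, cs)))


m = bytes_to_unicode()
print(m[32])  # the space byte becomes 'Ġ'
```

A plain Whitespace+BPE tokenizer like jina's never applies this mapping, which is why its vocab stores raw characters such as 你好 instead of remapped byte sequences.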

On a related note, I noticed that llama.cpp currently mixes normalizer, pre-tokenizer, and tokenizer logic together. In the long run, separating these into distinct stages (like HuggingFace tokenizers) might make it easier to support different variants via metadata flags. (Of course, this is out of scope for this PR - just a thought for discussion.)

We have an ongoing model PR that's already challenging this, mixing SPM-style metaspace and <0x> byte tokens with BPE without BLE; it's becoming pretty unsustainable...

@o7si
Contributor Author

o7si commented Jan 16, 2026

As mentioned here #18452 (comment), this is not BPE so BPE should not be modified to accommodate this.

Hi @CISC, I understand your point. Looking at the enum:

LLAMA_VOCAB_TYPE_BPE    = 2, // GPT-2 tokenizer based on byte-level BPE

The whitespace tokenizer differs from GPT-2 in that it doesn't use byte encoding. So it shouldn't be treated as the same type.

However, I'm facing an implementation dilemma:

llm_tokenizer_bpe_session contains about 200 lines of tokenization logic. If I create a standalone llm_tokenizer_ws, I would need to duplicate all this code. If I want to reuse it, I would need to modify BPE's code structure (e.g., adding a base class or inheritance).

So I'd like to ask: which approach do you prefer?

  • Fully duplicate the session code (code duplication, but doesn't touch BPE)
  • Add a common base class to extract shared logic (requires refactoring)

Or do you have any other suggestions?

@CISC CISC mentioned this pull request Jan 17, 2026

Labels: python (python script changes)