
vocab: add tokenizer support for jina-embeddings-v2-base-zh#18756

Draft
o7si wants to merge 4 commits into ggml-org:master from o7si:issue-18452

Conversation

@o7si
Contributor

@o7si o7si commented Jan 11, 2026

The jina-embeddings-v2-base-zh model uses:

  • Whitespace pre-tokenizer
  • Raw Unicode vocabulary (tokens stored as original characters like 你好)
  • Lowercase normalization

This PR adds tokenizer support for jina-embeddings-v2-base-zh.
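For reference, the three properties above can be sketched in plain Python (an illustration of the behavior only, not llama.cpp code; the regex is the pattern documented for the HF `Whitespace` pre-tokenizer):

```python
import re
import unicodedata


def normalize(text: str) -> str:
    # NFC + lowercase, matching the model's normalizer sequence
    return unicodedata.normalize("NFC", text).lower()


def pre_tokenize(text: str) -> list[str]:
    # HF "Whitespace" pre-tokenizer: split on \w+|[^\w\s]+
    return re.findall(r"\w+|[^\w\s]+", text)


print(pre_tokenize(normalize("Hello, 你好 World")))
# ['hello', ',', '你好', 'world']
```

Note that 你好 survives as raw characters — there is no byte-level remapping step, which is the key difference from the gpt2-style path.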

Note: I'm still learning the llama.cpp codebase, so please point out any issues — I'll fix them promptly! :D

Tested:

Expected output (HuggingFace tokenizer)
$ git clone https://huggingface.co/jinaai/jina-embeddings-v2-base-zh 
$ cat << 'EOF' > test_jina_tokenizer.py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jina-embeddings-v2-base-zh")
ids = tokenizer.encode("你好")
for token_id in ids:
    print(token_id, tokenizer.decode(token_id))
EOF
$ python test_jina_tokenizer.py
0 <s>
49226 你好
2 </s>
Actual output (llama.cpp tokenizer)
$ git clone https://huggingface.co/jinaai/jina-embeddings-v2-base-zh 
$ python convert_hf_to_gguf.py jina-embeddings-v2-base-zh --outtype f32
$ ./build/bin/llama-tokenize -m jina-embeddings-v2-base-zh/Jina-Bert-Implementation-160M-F32.gguf -p "你好"
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.029 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   Apple M4
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 11453.25 MB
llama_model_load_from_file_impl: using device Metal (Apple M4) (unknown id) - 10922 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 184 tensors from jina-embeddings-v2-base-zh/Jina-Bert-Implementation-160M-F32.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = jina-bert-v2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Jina Bert Implementation
llama_model_loader: - kv   3:                       general.organization str              = Jinaai
llama_model_loader: - kv   4:                         general.size_label str              = 160M
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["sentence-transformers", "feature-ex...
llama_model_loader: - kv   7:                          general.languages arr[str,2]       = ["en", "zh"]
llama_model_loader: - kv   8:                   jina-bert-v2.block_count u32              = 12
llama_model_loader: - kv   9:                jina-bert-v2.context_length u32              = 8192
llama_model_loader: - kv  10:              jina-bert-v2.embedding_length u32              = 768
llama_model_loader: - kv  11:           jina-bert-v2.feed_forward_length u32              = 3072
llama_model_loader: - kv  12:          jina-bert-v2.attention.head_count u32              = 12
llama_model_loader: - kv  13:  jina-bert-v2.attention.layer_norm_epsilon f32              = 0.000000
llama_model_loader: - kv  14:                          general.file_type u32              = 0
llama_model_loader: - kv  15:              jina-bert-v2.attention.causal bool             = false
llama_model_loader: - kv  16:                  jina-bert-v2.pooling_type u32              = 1
llama_model_loader: - kv  17:               general.quantization_version u32              = 2
llama_model_loader: - kv  18:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  19:                         tokenizer.ggml.pre str              = jina-v2-zh
llama_model_loader: - kv  20:                      tokenizer.ggml.tokens arr[str,61056]   = ["<s>", "<pad>", "</s>", "<unk>", "<m...
llama_model_loader: - kv  21:                  tokenizer.ggml.token_type arr[i32,61056]   = [3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  22:                      tokenizer.ggml.merges arr[str,39382]   = ["t h", "i n", "a n", "e r", "th e", ...
llama_model_loader: - kv  23:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  24:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  25:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  26:          tokenizer.ggml.seperator_token_id u32              = 2
llama_model_loader: - kv  27:            tokenizer.ggml.padding_token_id u32              = 1
llama_model_loader: - kv  28:               tokenizer.ggml.mask_token_id u32              = 4
llama_model_loader: - kv  29:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  30:               tokenizer.ggml.add_eos_token bool             = true
llama_model_loader: - kv  31:               tokenizer.ggml.add_sep_token bool             = true
llama_model_loader: - kv  32:            tokenizer.ggml.token_type_count u32              = 2
llama_model_loader: - type  f32:  184 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = all F32
print_info: file size   = 611.20 MiB (32.00 BPW) 
load: empty token at index 5
init_tokenizer: initializing tokenizer for type 2
load: model vocab missing newline token, using special_pad_id instead
load: 0 unused tokens
load: control token:      0 '<s>' is not marked as EOG
load: control token:      4 '<mask>' is not marked as EOG
load: control token:      1 '<pad>' is not marked as EOG
load: control token:      3 '<unk>' is not marked as EOG
load: control token:      2 '</s>' is not marked as EOG
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 2 ('</s>')
load: special tokens cache size = 5
load: token to piece cache size = 0.3058 MB
print_info: arch             = jina-bert-v2
print_info: vocab_only       = 1
print_info: no_alloc         = 0
print_info: model type       = ?B
print_info: model params     = 160.22 M
print_info: general.name     = Jina Bert Implementation
print_info: vocab type       = BPE
print_info: n_vocab          = 61056
print_info: n_merges         = 39382
print_info: BOS token        = 0 '<s>'
print_info: EOS token        = 2 '</s>'
print_info: UNK token        = 3 '<unk>'
print_info: SEP token        = 2 '</s>'
print_info: PAD token        = 1 '<pad>'
print_info: MASK token       = 4 '<mask>'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 45
llama_model_load: vocab only - skipping tensors
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 512
llama_context: n_ctx_seq     = 512
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 0.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (512) > n_ctx_train (0) -- possible training context overflow
     0 -> '<s>'
 49226 -> '你好'
     2 -> '</s>'

Related issue:


@daveth3t3chg33k daveth3t3chg33k left a comment


Extract the function to an IIFE variable

@o7si
Contributor Author

o7si commented Jan 12, 2026

Extract the function to an IIFE variable

@daveth3t3chg33k Could you please specify which code snippet you're referring to?

@CISC
Member

CISC commented Jan 13, 2026

As mentioned here #18452 (comment), this is not BPE so BPE should not be modified to accommodate this.

The lowercase normalization should be read from tokenizer.json by SpecialVocab and flagged in metadata.

@o7si o7si marked this pull request as draft January 14, 2026 08:50
Member

@CISC CISC left a comment


Now, this revealed something very interesting about BertNormalizer:

  • clean_text (bool, optional, defaults to True) — Whether to clean the text, by removing any control characters and replacing all whitespaces by the classic one.
  • handle_chinese_chars (bool, optional, defaults to True) — Whether to handle chinese chars by putting spaces around them.
  • strip_accents (bool, optional) — Whether to strip all accents. If this option is not specified (ie == None), then it will be determined by the value for lowercase (as in the original Bert).
  • lowercase (bool, optional, defaults to True) — Whether to lowercase.

I wonder if any models ever stray from the defaults here; our WPM tokenizer just assumes all of this...
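For illustration, here is a simplified Python sketch of what those four options do (my own approximation, not the real implementation: the CJK range below is abbreviated to the main ideograph block, and `strip_accents` is omitted):

```python
import unicodedata


def bert_normalize(text: str, clean_text: bool = True,
                   handle_chinese_chars: bool = True,
                   lowercase: bool = True) -> str:
    out = []
    for ch in text:
        if clean_text and ch.isspace():
            out.append(" ")  # replace any whitespace with a plain space
        elif clean_text and unicodedata.category(ch).startswith("C"):
            continue  # drop control characters
        elif handle_chinese_chars and 0x4E00 <= ord(ch) <= 0x9FFF:
            out.append(f" {ch} ")  # pad CJK ideographs with spaces
        else:
            out.append(ch)
    s = "".join(out)
    return s.lower() if lowercase else s


print(bert_normalize("A\tB你"))
# 'a b 你 '
```

With `lowercase=False` (as in the German_Semantic_V3 config below) only the casing step changes; the CJK padding and whitespace cleanup still apply.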

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
@o7si
Contributor Author

o7si commented Jan 14, 2026

I wonder if any models ever stray from the defaults here; our WPM tokenizer just assumes all of this...

Yes, they do exist! Take a look at this model: https://huggingface.co/aari1995/German_Semantic_V3/raw/main/tokenizer.json

  "normalizer": {
    "type": "BertNormalizer",
    "clean_text": true,
    "handle_chinese_chars": true,
    "strip_accents": false,
    "lowercase": false # <-- differs from default (true)
  },

I haven't found models that deviate from defaults for other parameters, but they likely exist since HF exposes these options.
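Reading these flags out of the parsed tokenizer.json could look something like this sketch (`bert_normalizer_flags` is a hypothetical helper, not an existing convert_hf_to_gguf.py function; defaults follow the HF documentation quoted above):

```python
def bert_normalizer_flags(tokenizer_json: dict) -> dict:
    # Extract BertNormalizer options, falling back to HF's documented defaults
    norm = tokenizer_json.get("normalizer") or {}
    if norm.get("type") != "BertNormalizer":
        return {}
    return {
        "clean_text": norm.get("clean_text", True),
        "handle_chinese_chars": norm.get("handle_chinese_chars", True),
        "strip_accents": norm.get("strip_accents"),  # None = follow lowercase
        "lowercase": norm.get("lowercase", True),
    }


# Mirrors the German_Semantic_V3 excerpt above
cfg = {"normalizer": {"type": "BertNormalizer", "clean_text": True,
                      "handle_chinese_chars": True,
                      "strip_accents": False, "lowercase": False}}
print(bert_normalizer_flags(cfg)["lowercase"])
# False
```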

@o7si
Contributor Author

o7si commented Jan 14, 2026

I'm still modifying the related code and will re-request review once ready.

@o7si
Contributor Author

o7si commented Jan 15, 2026

This is technically a very basic, but unsupported tokenizer, perhaps it should just be treated as one (ie, set tokenizer.ggml.model to whitespace)?

Hi @CISC, looking at jina-embeddings-v2-base-zh/tokenizer.json:

{
  "normalizer": {"type": "Sequence", "normalizers": [{"type": "NFC"}, {"type": "Lowercase"}]},
  "pre_tokenizer": {"type": "Whitespace"},
  "model": {"type": "BPE", ...}
}

The core model.type is still BPE. So this is essentially a BPE tokenizer without byte-level encoding.

My proposed implementation approach:

  • Detect whether ByteLevel exists in the pre_tokenizer (similar to how we read normalizer.lowercase)
  • Add a new metadata field: tokenizer.ggml.byte_level
  • In C++, read this metadata to control whether byte-level encoding is applied

This way we can keep tokenizer.ggml.model = "gpt2" for all BPE tokenizers, and use metadata flags (byte_level, normalizer.lowercase) to control the specific behavior.
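The detection step could be sketched like this (`has_byte_level` is a hypothetical helper; it also checks inside a `Sequence` pre-tokenizer, which in tokenizer.json nests its children under `pretokenizers`):

```python
def has_byte_level(tokenizer_cfg: dict) -> bool:
    # Detect a ByteLevel pre-tokenizer, including one nested in a Sequence
    pt = tokenizer_cfg.get("pre_tokenizer") or {}
    if pt.get("type") == "ByteLevel":
        return True
    return any(sub.get("type") == "ByteLevel"
               for sub in pt.get("pretokenizers", []))


print(has_byte_level({"pre_tokenizer": {"type": "ByteLevel"}}))   # True
print(has_byte_level({"pre_tokenizer": {"type": "Whitespace"}}))  # False
```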

What do you think of this approach?

On a related note, I noticed that llama.cpp currently mixes normalizer, pre-tokenizer, and tokenizer logic together. In the long run, separating these into distinct stages (like HuggingFace tokenizers) might make it easier to support different variants via metadata flags. (Of course, this is out of scope for this PR - just a thought for discussion.)

@CISC
Member

CISC commented Jan 15, 2026

The core model.type is still BPE. So this is essentially a BPE tokenizer without byte-level encoding.

Well, that's the problem: the llama.cpp BPE is BPE with Byte-Level Encoding, hence the gpt2 name.
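For context, the byte-level part is GPT-2's reversible byte-to-unicode mapping, reproduced here for illustration:

```python
def bytes_to_unicode() -> dict[int, str]:
    # Map every byte 0-255 to a printable unicode character (GPT-2 scheme)
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)  # shift unprintable bytes into a safe range
            n += 1
    return dict(zip(bs, map(chr, cs)))


m = bytes_to_unicode()
print(m[32])  # the space byte becomes 'Ġ'
```

A plain Whitespace+BPE tokenizer like jina's never applies this mapping, which is why its vocab stores raw characters such as 你好 instead of remapped byte sequences.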

On a related note, I noticed that llama.cpp currently mixes normalizer, pre-tokenizer, and tokenizer logic together. In the long run, separating these into distinct stages (like HuggingFace tokenizers) might make it easier to support different variants via metadata flags. (Of course, this is out of scope for this PR - just a thought for discussion.)

We have an ongoing model PR that's already challenging this, mixing SPM-style metaspace and <0x> byte tokens with BPE without BLE; it's becoming pretty unsustainable...

@o7si
Contributor Author

o7si commented Jan 16, 2026

As mentioned here #18452 (comment), this is not BPE so BPE should not be modified to accommodate this.

Hi @CISC, I understand your point. Looking at the enum:

LLAMA_VOCAB_TYPE_BPE    = 2, // GPT-2 tokenizer based on byte-level BPE

The whitespace tokenizer differs from GPT-2 in that it doesn't use byte encoding. So it shouldn't be treated as the same type.

However, I'm facing an implementation dilemma:

llm_tokenizer_bpe_session contains about 200 lines of tokenization logic. If I create a standalone llm_tokenizer_ws, I would need to duplicate all this code. If I want to reuse it, I would need to modify BPE's code structure (e.g., adding a base class or inheritance).

So I'd like to ask: which approach do you prefer?

  • Fully duplicate the session code (code duplication, but doesn't touch BPE)
  • Add a common base class to extract shared logic (requires refactoring)

Or do you have any other suggestions?

@CISC CISC mentioned this pull request Jan 17, 2026

Labels: python (python script changes)