
Add support for tokenize and untokenize of UTF-8 encoding in prompt/output #87

Closed

wizd wants to merge 8 commits into ggml-org:master from wizd:master

Conversation

wizd commented Mar 13, 2023

LLaMA's tokenization process is filled with magic numbers and is not easy to reproduce. However, I have found that using the SentencePiece library works well. It is possible that the original LLaMA model also used SentencePiece for its tokenization.

Test prompt: '我静静的坐在雨中,思考着' ("I sit quietly in the rain, thinking")

With the current tokenizer, this sentence is heavily tokenized into <0x??> byte tokens, making it very difficult to reproduce.

[Screenshot: 2023-03-13 at 5:14:58 PM]
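
A minimal sketch, purely for illustration (not the exact code from this PR), of such a SentencePiece-based tokenizer on the C++ side, mirroring the sp.Load / EncodeAsIds calls visible in the build log quoted below; the sp_tokenize name is a placeholder:

#include <sentencepiece_processor.h>
#include <string>
#include <vector>

// Sketch only: load the LLaMA SentencePiece model and encode a UTF-8
// prompt into token ids. Error handling is intentionally minimal.
std::vector<int> sp_tokenize(const std::string & text) {
    sentencepiece::SentencePieceProcessor sp;
    const auto status = sp.Load("./models/tokenizer.model"); // path taken from the log below
    if (!status.ok()) {
        return {}; // tokenizer model could not be loaded
    }
    std::vector<int> ids;
    sp.Encode(text, &ids); // same result as EncodeAsIds(text)
    return ids;
}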

baifachuan commented Mar 13, 2023

make fails to compile:

/usr/local/include/sentencepiece_processor.h:683:40: error: ‘absl::string_view’ has not been declared
  683 |   util::Status ParseExtraOptions(absl::string_view extra_option,
      |                                        ^~~~~~~~~~~
/usr/local/include/sentencepiece_processor.h:690:13: error: ‘absl::string_view’ has not been declared
  690 |       absl::string_view input, absl::string_view normalized,
      |             ^~~~~~~~~~~
/usr/local/include/sentencepiece_processor.h:690:38: error: ‘absl::string_view’ has not been declared
  690 |       absl::string_view input, absl::string_view normalized,
      |                                      ^~~~~~~~~~~
/usr/local/include/sentencepiece_processor.h:692:41: error: ‘string_view’ is not a member of ‘absl’
  692 |       const std::vector<std::pair<absl::string_view, int>> &result,
      |                                         ^~~~~~~~~~~
/usr/local/include/sentencepiece_processor.h:692:41: error: ‘string_view’ is not a member of ‘absl’
/usr/local/include/sentencepiece_processor.h:692:54: error: template argument 1 is invalid
  692 |       const std::vector<std::pair<absl::string_view, int>> &result,
      |                                                      ^~~
/usr/local/include/sentencepiece_processor.h:692:57: error: template argument 1 is invalid
  692 |       const std::vector<std::pair<absl::string_view, int>> &result,
      |                                                         ^~
/usr/local/include/sentencepiece_processor.h:692:57: error: template argument 2 is invalid
/usr/local/include/sentencepiece_processor.h:721:35: error: ‘string_view’ is not a member of ‘absl’
  721 | util::Status LoadModelProto(absl::string_view, ModelProto *model_proto);
      |                                   ^~~~~~~~~~~
/usr/local/include/sentencepiece_processor.h:721:59: error: expected primary-expression before ‘*’ token
  721 | util::Status LoadModelProto(absl::string_view, ModelProto *model_proto);
      |                                                           ^
/usr/local/include/sentencepiece_processor.h:721:60: error: ‘model_proto’ was not declared in this scope; did you mean ‘ModelProto’?
  721 | util::Status LoadModelProto(absl::string_view, ModelProto *model_proto);
      |                                                            ^~~~~~~~~~~
      |                                                            ModelProto
/usr/local/include/sentencepiece_processor.h:724:35: error: ‘string_view’ is not a member of ‘absl’
  724 | util::Status SaveModelProto(absl::string_view, const ModelProto &model_proto);
      |                                   ^~~~~~~~~~~
/usr/local/include/sentencepiece_processor.h:724:48: error: expected primary-expression before ‘const’
  724 | util::Status SaveModelProto(absl::string_view, const ModelProto &model_proto);
      |                                                ^~~~~
utils.cpp: In function ‘std::vector<int> llama_tokenize(const gpt_vocab&, const string&, bool)’:
utils.cpp:291:13: error: invalid conversion from ‘const char*’ to ‘int’ [-fpermissive]
  291 |     sp.Load("./models/tokenizer.model");
      |             ^~~~~~~~~~~~~~~~~~~~~~~~~~
      |             |
      |             const char*
In file included from utils.h:10,
                 from utils.cpp:1:
/usr/local/include/sentencepiece_processor.h:244:47: note:   initializing argument 1 of ‘virtual sentencepiece::util::Status sentencepiece::SentencePieceProcessor::Load(int)’
  244 |   virtual util::Status Load(absl::string_view filename);
      |                             ~~~~~~~~~~~~~~~~~~^~~~~~~~
utils.cpp:294:27: error: cannot convert ‘const string’ {aka ‘const std::__cxx11::basic_string<char>’} to ‘int’
  294 |     return sp.EncodeAsIds(text);
      |                           ^~~~
      |                           |
      |                           const string {aka const std::__cxx11::basic_string<char>}
In file included from utils.h:10,
                 from utils.cpp:1:
/usr/local/include/sentencepiece_processor.h:457:58: note:   initializing argument 1 of ‘virtual std::vector<int> sentencepiece::SentencePieceProcessor::EncodeAsIds(int) const’
  457 |   virtual std::vector<int> EncodeAsIds(absl::string_view input) const {
      |                                        ~~~~~~~~~~~~~~~~~~^~~~~
make: *** [Makefile:185: utils.o] Error 1

wizd (Author) commented Mar 13, 2023

You need to add the sentencepiece library manually. On macOS:
https://github.com/google/sentencepiece#build-and-install-using-vcpkg

@lucasjinreal

@wizd I'm using your fork, and the interactive mode is not working:

[screenshot]

It was stuck in an infinite loop...

ggerganov (Member) commented Mar 13, 2023

Resolved in #79

ggerganov closed this Mar 13, 2023
@ggerganov (Member)

Oh wait, did I get confused?
#79 does not resolve the tokenizer issues?

kharvd (Contributor) commented Mar 13, 2023

I think it does. Are you still able to reproduce the issues?

@ggerganov (Member)

I reran the convert script and I get the following:

make -j && ./main -m models/13B/ggml-model-q4_0.bin -t 8 -n 64 -s 11 -p "我静静的坐在雨中,思考着"
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX:      Apple clang version 14.0.0 (clang-1400.0.29.202)

make: Nothing to be done for `default'.
main: seed = 11
llama_model_load: loading model from 'models/13B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 5120
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 13824
llama_model_load: n_parts = 2
llama_model_load: ggml ctx size = 8559.49 MB
llama_model_load: memory_size =   800.00 MB, n_mem = 20480
llama_model_load: loading model part 1/2 from 'models/13B/ggml-model-q4_0.bin'
llama_model_load: ............................................. done
llama_model_load: model size =  3880.49 MB / num tensors = 363
llama_model_load: loading model part 2/2 from 'models/13B/ggml-model-q4_0.bin.1'
llama_model_load: ............................................. done
llama_model_load: model size =  3880.49 MB / num tensors = 363

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 

main: prompt: '我静静的坐在雨中,思考着'
main: number of tokens in prompt = 2
     1 -> ''
 30672 -> '我'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


我们已经开始了。 (We've already begun.)
������行������。 (A camel caravan travels in a circle. )
The above-mentioned idioms and phrases are what I found on Chinese websites when googling

main: mem per token = 22439492 bytes
main:     load time =  2962.67 ms
main:   sample time =    59.34 ms
main:  predict time =  5717.17 ms / 87.96 ms per token
main:    total time = 10370.07 ms

There are still 2 problems:

  • The prompt is not converted to tokens
  • The generated text has invalid characters
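
The invalid characters are the usual symptom of printing <0x??> byte tokens one at a time, which splits multi-byte UTF-8 sequences across writes. A minimal sketch, purely for illustration and not code from this PR, of buffering decoded pieces and flushing only complete UTF-8 sequences:

#include <cstdio>
#include <string>

// Expected length of the UTF-8 sequence that starts with lead byte c.
static size_t utf8_seq_len(unsigned char c) {
    if (c < 0x80)           return 1; // ASCII
    if ((c & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((c & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((c & 0xF8) == 0xF0) return 4; // 11110xxx
    return 1;                         // invalid lead byte: pass it through
}

// Append a decoded piece to buf, print only the complete sequences,
// and keep any trailing partial character for the next call.
void print_utf8_buffered(std::string & buf, const std::string & piece) {
    buf += piece;
    size_t i = 0;
    while (i < buf.size()) {
        const size_t len = utf8_seq_len((unsigned char) buf[i]);
        if (i + len > buf.size()) break; // incomplete sequence: wait for more bytes
        fwrite(buf.data() + i, 1, len, stdout);
        i += len;
    }
    buf.erase(0, i); // retain the incomplete tail
}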

j-f1 (Contributor) commented Mar 13, 2023

Can you try running from a shell script encoded as UTF-8 and outputting to a text file? Your terminal might not be handling Unicode correctly.

You’ll also need to re-generate your models from scratch since this PR changes how the ggml files are created.

kharvd (Contributor) commented Mar 13, 2023

@ggerganov just in case: did you re-run the quantization script as well?

@ggerganov (Member)

@ggerganov just in case: did you re-run the quantization script as well?

Oops .. all good now 🦙

wizzard0 (Contributor) commented Mar 13, 2023 via email

@lucasjinreal

Has this been merged into master? How can I test it? wizd's branch doesn't work well in interactive mode.

@zhoujian1028

@ggerganov
[screenshot]
The prompt is not converted to tokens. How did you solve it? Thanks!

Deadsg pushed a commit to Deadsg/llama.cpp that referenced this pull request Dec 19, 2023
Fix TypeError in low_level chat
InfernalDread pushed a commit to InfernalDread/llama.cpp that referenced this pull request Apr 23, 2026
…l-fix

vulkan: fix turbo3 build + coopmat FA after April upstream sync
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request Apr 28, 2026
* iq3_k: fix Metal dot product

I was accessing the scales as 4-byte aligned, but iq3_k is
not 4-byte aligned. Instead of throwing an error (as happens
on CUDA when one makes this mistake), Metal silently accepts it
and we get garbage.

* iq3_k: slightly faster Metal dot product

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
retroheim pushed a commit to retroheim/prism-ml-llama.cpp that referenced this pull request May 3, 2026
Mirror of @apollosenvy's turbo3_0 Vulkan SET_ROWS port (PR ggml-org#33 + ggml-org#87)
to the other two turbo types. Reported by @dpblnt in ggml-org#50 with a clean
matrix on RX 9060 XT showing turbo3 V works on Vulkan but turbo2/turbo4
V abort with:

  pre-allocated tensor (cache_v_l*) in a buffer (Vulkan0)
  that cannot run the operation (SET_ROWS)

at llama_context::sched_reserve() time, before any compute runs.

Mechanical port across 4 files:

- vulkan-shaders/types.glsl: block_turbo2_0 + block_turbo4_0 struct
  declarations matching the C side (ggml-common.h).

- vulkan-shaders/copy_to_quant.comp: SET_ROWS quantize main() blocks
  for turbo2 (4 centroids, 2-bit pack, no signs byte) and turbo4
  (16 centroids, 4-bit nibble pack, no signs byte). WHT setup and
  reduction structure identical to turbo3 (QK = 128 across all three).
  Centroid + midpoint tables ported from CENTROIDS_2BIT and
  CENTROIDS_4BIT in ggml-turbo-quant.c.

- vulkan-shaders/vulkan-shaders-gen.cpp: turbo2_0 and turbo4_0 added
  to the set_rows iteration list at line ~789.

- ggml-vulkan.cpp: SET_ROWS pipeline registrations + supports_op
  switch + dispatch element-count all extended with TURBO2_0 and
  TURBO4_0 cases.

## Verified on llvmpipe Vulkan (CPU software, AMD MI300X cloud droplet)

Patched ggml-vulkan.cpp temporarily during repro to allow llvmpipe
(normally filtered out as eCpu); patch reverted before commit. The
SET_ROWS abort is a backend-capability check at graph build time so
it fires regardless of GPU vs CPU Vulkan backend.

| ctk / ctv         | tg16 (t/s) | status        |
|-------------------|-----------:|---------------|
| q4_0 / q4_0       | 17.68      | baseline      |
| q4_0 / turbo3     | 5.91       | already worked|
| q4_0 / turbo4     | 6.14       | was aborting  |
| q4_0 / turbo2     | 5.65       | was aborting  |

llvmpipe perf numbers are not meaningful (CPU-emulated Vulkan); they
are reported here only to confirm the abort is gone and the kernels
run end-to-end without divergence.

## Needs GPU validation

Cannot validate GPU shader correctness on the droplet (MI300X SR-IOV
VF does not expose itself to RADV/amdvlk on cloud). Specifically:
- Subgroup shuffle / ballot behavior on real GPU subgroup sizes
- Shader compilation under non-llvmpipe Vulkan drivers
- PPL / quality on the actual quantization math

@dpblnt @apollosenvy if either of you has cycles, would appreciate
a quick rebuild on RDNA Vulkan (gfx1100/gfx1200) to confirm:
1. The SET_ROWS abort that triggered ggml-org#50 is gone
2. Output coherence on turbo4 V (not garbage tokens)
3. PPL stays in the expected ballpark vs the CUDA / Metal
   implementations of the same quants

Closes ggml-org#50.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Jcfunk pushed a commit to Jcfunk/llama.cpp that referenced this pull request May 8, 2026
Empirical response to issues ggml-org#87 (Mitzenmacher) and ggml-org#89 (Portnoy) on
turboquant_plus. 8 hypothesis tests across synthetic Gaussian and
Qwen3-0.6B + Llama-3.2-1B real KV cache.

Headline: EDEN's optimal scale is real but second-order. Rotation choice
(WHT vs dense Haar) is first-order and accounts for ~all of the published
gap. Production already uses the first-order fix. On real KV with WHT,
matched-norm beats EDEN-S by 0.5-9% across Qwen and Llama; tightens to
0.67-0.72% on Llama. EDEN-S genuinely wins at b=8 on synthetic Gaussian
(70%+ MSE reduction) — credibility moat showing the lever exists in its
own regime.

Concedes attribution argument (ggml-org#89): DRIVE/EDEN line is real prior art.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Jcfunk pushed a commit to Jcfunk/llama.cpp that referenced this pull request May 8, 2026
… proxy for KV cache quantization

Adds the empirical study covering the broader 35-hypothesis investigation
spawned by the EDEN/Mitzenmacher/Portnoy critique (ggml-org#87, ggml-org#89). Companion
to eden-optimal-s-revisit.md.

Headline result: a K-cache centroid table that improves per-vector
reconstruction MSE by 1–13% across five model families (Qwen3-0.6B,
Llama-3.2-1B, Mistral-7B, Phi-3-mini, Gemma-2-2b) causes 70–90% mean
KL@D regressions and 50–60% catastrophic-rate failures at the model
output. This is a sign inversion, not a small mismatch: methods that
are strictly better under MSE are systematically worse under KL@D in
deployment.

The paper is structured in three acts:
- Act 1: K cache is sub-Gaussian post-WHT (H8); V is Gaussian
- Act 2: per-prompt fitting fails (PDS), ensemble (ens_4way) reaches an
  irreducible ~8% catastrophic floor at scale (N=100)
- Act 3: the universal-K cross-model counterexample (F3) demonstrates
  the MSE→KL inversion at the cleanest possible level

Includes a 2-D toy attention illustration of the softmax bucket-flip
mechanism, an explicit decision boundary for when MSE proxies quality
(linear consumers) vs when it doesn't (attention softmax), an
ablation summary distinguishing first-order from second-order factors,
operational guidance for practitioners, and a "why this wasn't caught
earlier" diagnosis of the methodology blind spots.

Companion paper handles the algorithmic claim head-on. This paper is
the broader evaluation-methodology critique.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
meh pushed a commit to meh/llama.cpp that referenced this pull request May 10, 2026
Mirror of @apollosenvy's turbo3_0 Vulkan SET_ROWS port (PR ggml-org#33 + ggml-org#87) to the other two turbo types.
