
Add support for tokenize and untokenize of UTF-8 encoding in prompt/output #87

Closed

wizd wants to merge 8 commits into ggml-org:master from wizd:master

Conversation

wizd commented Mar 13, 2023

LLaMA's tokenization process is filled with magic numbers and is not easy to reproduce. However, I have found that using the SentencePiece library works well. It is possible that the original LLaMA model also used SentencePiece for its tokenization.

Test prompt: '我静静的坐在雨中,思考着' ("I sit quietly in the rain, thinking")

With the current tokenizer, this sentence is heavily tokenized into <0x??> byte tokens, making it very difficult to reproduce.

[Screenshot: 2023-03-13 at 5:14:58 PM]
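
A minimal sketch, purely for illustration (not the exact code from this PR), of such a SentencePiece-based tokenizer on the C++ side, mirroring the sp.Load / EncodeAsIds calls visible in the build log quoted below; the sp_tokenize name is a placeholder:

#include <sentencepiece_processor.h>
#include <string>
#include <vector>

// Sketch only: load the LLaMA SentencePiece model and encode a UTF-8
// prompt into token ids. Error handling is intentionally minimal.
std::vector<int> sp_tokenize(const std::string & text) {
    sentencepiece::SentencePieceProcessor sp;
    const auto status = sp.Load("./models/tokenizer.model"); // path taken from the log below
    if (!status.ok()) {
        return {}; // tokenizer model could not be loaded
    }
    std::vector<int> ids;
    sp.Encode(text, &ids); // same result as EncodeAsIds(text)
    return ids;
}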

baifachuan commented Mar 13, 2023

make fails to compile:

/usr/local/include/sentencepiece_processor.h:683:40: error: ‘absl::string_view’ has not been declared
  683 |   util::Status ParseExtraOptions(absl::string_view extra_option,
      |                                        ^~~~~~~~~~~
/usr/local/include/sentencepiece_processor.h:690:13: error: ‘absl::string_view’ has not been declared
  690 |       absl::string_view input, absl::string_view normalized,
      |             ^~~~~~~~~~~
/usr/local/include/sentencepiece_processor.h:690:38: error: ‘absl::string_view’ has not been declared
  690 |       absl::string_view input, absl::string_view normalized,
      |                                      ^~~~~~~~~~~
/usr/local/include/sentencepiece_processor.h:692:41: error: ‘string_view’ is not a member of ‘absl’
  692 |       const std::vector<std::pair<absl::string_view, int>> &result,
      |                                         ^~~~~~~~~~~
/usr/local/include/sentencepiece_processor.h:692:41: error: ‘string_view’ is not a member of ‘absl’
/usr/local/include/sentencepiece_processor.h:692:54: error: template argument 1 is invalid
  692 |       const std::vector<std::pair<absl::string_view, int>> &result,
      |                                                      ^~~
/usr/local/include/sentencepiece_processor.h:692:57: error: template argument 1 is invalid
  692 |       const std::vector<std::pair<absl::string_view, int>> &result,
      |                                                         ^~
/usr/local/include/sentencepiece_processor.h:692:57: error: template argument 2 is invalid
/usr/local/include/sentencepiece_processor.h:721:35: error: ‘string_view’ is not a member of ‘absl’
  721 | util::Status LoadModelProto(absl::string_view, ModelProto *model_proto);
      |                                   ^~~~~~~~~~~
/usr/local/include/sentencepiece_processor.h:721:59: error: expected primary-expression before ‘*’ token
  721 | util::Status LoadModelProto(absl::string_view, ModelProto *model_proto);
      |                                                           ^
/usr/local/include/sentencepiece_processor.h:721:60: error: ‘model_proto’ was not declared in this scope; did you mean ‘ModelProto’?
  721 | util::Status LoadModelProto(absl::string_view, ModelProto *model_proto);
      |                                                            ^~~~~~~~~~~
      |                                                            ModelProto
/usr/local/include/sentencepiece_processor.h:724:35: error: ‘string_view’ is not a member of ‘absl’
  724 | util::Status SaveModelProto(absl::string_view, const ModelProto &model_proto);
      |                                   ^~~~~~~~~~~
/usr/local/include/sentencepiece_processor.h:724:48: error: expected primary-expression before ‘const’
  724 | util::Status SaveModelProto(absl::string_view, const ModelProto &model_proto);
      |                                                ^~~~~
utils.cpp: In function ‘std::vector<int> llama_tokenize(const gpt_vocab&, const string&, bool)’:
utils.cpp:291:13: error: invalid conversion from ‘const char*’ to ‘int’ [-fpermissive]
  291 |     sp.Load("./models/tokenizer.model");
      |             ^~~~~~~~~~~~~~~~~~~~~~~~~~
      |             |
      |             const char*
In file included from utils.h:10,
                 from utils.cpp:1:
/usr/local/include/sentencepiece_processor.h:244:47: note:   initializing argument 1 of ‘virtual sentencepiece::util::Status sentencepiece::SentencePieceProcessor::Load(int)’
  244 |   virtual util::Status Load(absl::string_view filename);
      |                             ~~~~~~~~~~~~~~~~~~^~~~~~~~
utils.cpp:294:27: error: cannot convert ‘const string’ {aka ‘const std::__cxx11::basic_string<char>’} to ‘int’
  294 |     return sp.EncodeAsIds(text);
      |                           ^~~~
      |                           |
      |                           const string {aka const std::__cxx11::basic_string<char>}
In file included from utils.h:10,
                 from utils.cpp:1:
/usr/local/include/sentencepiece_processor.h:457:58: note:   initializing argument 1 of ‘virtual std::vector<int> sentencepiece::SentencePieceProcessor::EncodeAsIds(int) const’
  457 |   virtual std::vector<int> EncodeAsIds(absl::string_view input) const {
      |                                        ~~~~~~~~~~~~~~~~~~^~~~~
make: *** [Makefile:185: utils.o] Error 1

wizd (Author) commented Mar 13, 2023

You need to add the sentencepiece library manually. On macOS:
https://github.com/google/sentencepiece#build-and-install-using-vcpkg

@lucasjinreal

@wizd I'm using your fork, and the interactive mode is not working:

[screenshot]

It was stuck in an infinite loop...

ggerganov (Member) commented Mar 13, 2023

Resolved in #79

ggerganov closed this Mar 13, 2023
@ggerganov (Member)

Oh wait, did I get confused?
#79 does not resolve the tokenizer issues?

kharvd (Contributor) commented Mar 13, 2023

I think it does. Are you still able to reproduce the issues?

@ggerganov (Member)

I reran the convert script and I get the following:

make -j && ./main -m models/13B/ggml-model-q4_0.bin -t 8 -n 64 -s 11 -p "我静静的坐在雨中,思考着"
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX:      Apple clang version 14.0.0 (clang-1400.0.29.202)

make: Nothing to be done for `default'.
main: seed = 11
llama_model_load: loading model from 'models/13B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 5120
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 13824
llama_model_load: n_parts = 2
llama_model_load: ggml ctx size = 8559.49 MB
llama_model_load: memory_size =   800.00 MB, n_mem = 20480
llama_model_load: loading model part 1/2 from 'models/13B/ggml-model-q4_0.bin'
llama_model_load: ............................................. done
llama_model_load: model size =  3880.49 MB / num tensors = 363
llama_model_load: loading model part 2/2 from 'models/13B/ggml-model-q4_0.bin.1'
llama_model_load: ............................................. done
llama_model_load: model size =  3880.49 MB / num tensors = 363

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 

main: prompt: '我静静的坐在雨中,思考着'
main: number of tokens in prompt = 2
     1 -> ''
 30672 -> '我'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


我们已经开始了。 (We've already begun.)
������行������。 (A camel caravan travels in a circle. )
The above-mentioned idioms and phrases are what I found on Chinese websites when googling

main: mem per token = 22439492 bytes
main:     load time =  2962.67 ms
main:   sample time =    59.34 ms
main:  predict time =  5717.17 ms / 87.96 ms per token
main:    total time = 10370.07 ms

There are still 2 problems:

  • The prompt is not converted to tokens
  • The generated text has invalid characters
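
The invalid characters are the usual symptom of printing <0x??> byte tokens one at a time, which splits multi-byte UTF-8 sequences across writes. A minimal sketch, purely for illustration and not code from this PR, of buffering decoded pieces and flushing only complete UTF-8 sequences:

#include <cstdio>
#include <string>

// Expected length of the UTF-8 sequence that starts with lead byte c.
static size_t utf8_seq_len(unsigned char c) {
    if (c < 0x80)           return 1; // ASCII
    if ((c & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((c & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((c & 0xF8) == 0xF0) return 4; // 11110xxx
    return 1;                         // invalid lead byte: pass it through
}

// Append a decoded piece to buf, print only the complete sequences,
// and keep any trailing partial character for the next call.
void print_utf8_buffered(std::string & buf, const std::string & piece) {
    buf += piece;
    size_t i = 0;
    while (i < buf.size()) {
        const size_t len = utf8_seq_len((unsigned char) buf[i]);
        if (i + len > buf.size()) break; // incomplete sequence: wait for more bytes
        fwrite(buf.data() + i, 1, len, stdout);
        i += len;
    }
    buf.erase(0, i); // retain the incomplete tail
}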

j-f1 (Contributor) commented Mar 13, 2023

Can you try running from a shell script encoded as UTF-8 and outputting to a text file? Your terminal might not be handling Unicode correctly.

You’ll also need to re-generate your models from scratch since this PR changes how the ggml files are created.

kharvd (Contributor) commented Mar 13, 2023

@ggerganov just in case: did you re-run the quantization script as well?

@ggerganov (Member)

@ggerganov just in case: did you re-run the quantization script as well?

Oops .. all good now 🦙

wizzard0 (Contributor) commented Mar 13, 2023 via email

@lucasjinreal

Has this been merged into master? How can I test it? wizd's branch doesn't work well in interactive mode.

@zhoujian1028

@ggerganov
[screenshot]
The prompt is not converted to tokens. How did you solve it? Thanks!

Deadsg pushed a commit to Deadsg/llama.cpp that referenced this pull request Dec 19, 2023
Fix TypeError in low_level chat
InfernalDread pushed a commit to InfernalDread/llama.cpp that referenced this pull request Apr 23, 2026
…l-fix

vulkan: fix turbo3 build + coopmat FA after April upstream sync
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request Apr 28, 2026
* iq3_k: fix Metal dot product

I was accessing the scales as 4-byte aligned, but iq3_k is
not 4-byte aligned. Instead of throwing an error (as happens
on CUDA when one makes this mistake), Metal silently accepts it
and we get garbage.

* iq3_k: slightly faster Metal dot product

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
retroheim pushed a commit to retroheim/prism-ml-llama.cpp that referenced this pull request May 3, 2026
Mirror of @apollosenvy's turbo3_0 Vulkan SET_ROWS port (PR ggml-org#33 + ggml-org#87)
to the other two turbo types. Reported by @dpblnt in ggml-org#50 with a clean
matrix on RX 9060 XT showing turbo3 V works on Vulkan but turbo2/turbo4
V abort with:

  pre-allocated tensor (cache_v_l*) in a buffer (Vulkan0)
  that cannot run the operation (SET_ROWS)

at llama_context::sched_reserve() time, before any compute runs.

Mechanical port across 4 files:

- vulkan-shaders/types.glsl: block_turbo2_0 + block_turbo4_0 struct
  declarations matching the C side (ggml-common.h).

- vulkan-shaders/copy_to_quant.comp: SET_ROWS quantize main() blocks
  for turbo2 (4 centroids, 2-bit pack, no signs byte) and turbo4
  (16 centroids, 4-bit nibble pack, no signs byte). WHT setup and
  reduction structure identical to turbo3 (QK = 128 across all three).
  Centroid + midpoint tables ported from CENTROIDS_2BIT and
  CENTROIDS_4BIT in ggml-turbo-quant.c.

- vulkan-shaders/vulkan-shaders-gen.cpp: turbo2_0 and turbo4_0 added
  to the set_rows iteration list at line ~789.

- ggml-vulkan.cpp: SET_ROWS pipeline registrations + supports_op
  switch + dispatch element-count all extended with TURBO2_0 and
  TURBO4_0 cases.

## Verified on llvmpipe Vulkan (CPU software, AMD MI300X cloud droplet)

Patched ggml-vulkan.cpp temporarily during repro to allow llvmpipe
(normally filtered out as eCpu); patch reverted before commit. The
SET_ROWS abort is a backend-capability check at graph build time so
it fires regardless of GPU vs CPU Vulkan backend.

| ctk / ctv         | tg16 (t/s) | status        |
|-------------------|-----------:|---------------|
| q4_0 / q4_0       | 17.68      | baseline      |
| q4_0 / turbo3     | 5.91       | already worked|
| q4_0 / turbo4     | 6.14       | was aborting  |
| q4_0 / turbo2     | 5.65       | was aborting  |

llvmpipe perf numbers are not meaningful (CPU-emulated Vulkan); they
are reported here only to confirm the abort is gone and the kernels
run end-to-end without divergence.

## Needs GPU validation

Cannot validate GPU shader correctness on the droplet (MI300X SR-IOV
VF does not expose itself to RADV/amdvlk on cloud). Specifically:
- Subgroup shuffle / ballot behavior on real GPU subgroup sizes
- Shader compilation under non-llvmpipe Vulkan drivers
- PPL / quality on the actual quantization math

@dpblnt @apollosenvy if either of you has cycles, would appreciate
a quick rebuild on RDNA Vulkan (gfx1100/gfx1200) to confirm:
1. The SET_ROWS abort that triggered ggml-org#50 is gone
2. Output coherence on turbo4 V (not garbage tokens)
3. PPL stays in the expected ballpark vs the CUDA / Metal
   implementations of the same quants

Closes ggml-org#50.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Jcfunk pushed a commit to Jcfunk/llama.cpp that referenced this pull request May 8, 2026
Empirical response to issues ggml-org#87 (Mitzenmacher) and ggml-org#89 (Portnoy) on
turboquant_plus. 8 hypothesis tests across synthetic Gaussian and
Qwen3-0.6B + Llama-3.2-1B real KV cache.

Headline: EDEN's optimal scale is real but second-order. Rotation choice
(WHT vs dense Haar) is first-order and accounts for ~all of the published
gap. Production already uses the first-order fix. On real KV with WHT,
matched-norm beats EDEN-S by 0.5-9% across Qwen and Llama; tightens to
0.67-0.72% on Llama. EDEN-S genuinely wins at b=8 on synthetic Gaussian
(70%+ MSE reduction) — credibility moat showing the lever exists in its
own regime.

Concedes attribution argument (ggml-org#89): DRIVE/EDEN line is real prior art.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Jcfunk pushed a commit to Jcfunk/llama.cpp that referenced this pull request May 8, 2026
… proxy for KV cache quantization

Adds the empirical study covering the broader 35-hypothesis investigation
spawned by the EDEN/Mitzenmacher/Portnoy critique (ggml-org#87, ggml-org#89). Companion
to eden-optimal-s-revisit.md.

Headline result: a K-cache centroid table that improves per-vector
reconstruction MSE by 1–13% across five model families (Qwen3-0.6B,
Llama-3.2-1B, Mistral-7B, Phi-3-mini, Gemma-2-2b) causes 70–90% mean
KL@D regressions and 50–60% catastrophic-rate failures at the model
output. This is a sign inversion, not a small mismatch: methods that
are strictly better under MSE are systematically worse under KL@D in
deployment.

The paper is structured in three acts:
- Act 1: K cache is sub-Gaussian post-WHT (H8); V is Gaussian
- Act 2: per-prompt fitting fails (PDS), ensemble (ens_4way) reaches an
  irreducible ~8% catastrophic floor at scale (N=100)
- Act 3: the universal-K cross-model counterexample (F3) demonstrates
  the MSE→KL inversion at the cleanest possible level

Includes a 2-D toy attention illustration of the softmax bucket-flip
mechanism, an explicit decision boundary for when MSE proxies quality
(linear consumers) vs when it doesn't (attention softmax), an
ablation summary distinguishing first-order from second-order factors,
operational guidance for practitioners, and a "why this wasn't caught
earlier" diagnosis of the methodology blind spots.

Companion paper handles the algorithmic claim head-on. This paper is
the broader evaluation-methodology critique.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
meh pushed a commit to meh/llama.cpp that referenced this pull request May 10, 2026
Mirror of @apollosenvy's turbo3_0 Vulkan SET_ROWS port (PR ggml-org#33 + ggml-org#87) to the other two turbo types.
