Add support for tokenize and untokenize of UTF-8 encoding in prompt/output #87

wizd wants to merge 8 commits into
Conversation

make fails to compile.

Need to add the sentencepiece library manually.

@wizd I'm using your fork, and interactive mode doesn't work: it gets stuck in a dead loop.

Resolved in #79

Oh wait, did I get confused?

I think it does. Are you still able to reproduce the issues?

I reran it. There are still two problems:

Can you try running from a shell script encoded as UTF-8 and outputting to a text file? Your terminal might not be handling Unicode correctly. You'll also need to re-generate your models from scratch, since this PR changes how the ggml files are created.
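A minimal sketch of that check, not the thread's exact commands: drive the binary from Python and capture its raw UTF-8 output in a file so the terminal can't mangle it. The `./main` binary name and model path are assumptions based on llama.cpp's layout at the time.

```python
import subprocess

# Open in binary mode so the model's raw UTF-8 bytes land in the file untouched.
with open("out.txt", "wb") as out:
    subprocess.run(
        ["./main", "-m", "./models/7B/ggml-model-q4_0.bin",
         "-p", "我静静的坐在雨中,思考着"],
        stdout=out,
        check=True,
    )
```

If `out.txt` contains valid UTF-8 but the terminal shows mojibake, the problem is the terminal, not the tokenizer.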

@ggerganov just in case: did you re-run the quantization script as well?

Oops .. all good now 🦙

Suggestion: can we add a magic version number? I feel we'll get further updates.
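For illustration, a minimal sketch of what a versioned header could look like. The `0x67676d6c` ("ggml") magic is the one converted files already begin with; the version field and the helper names are hypothetical, not anything this PR implements.

```python
import struct

GGML_MAGIC = 0x67676d6c  # the "ggml" magic converted files already start with
FILE_VERSION = 1         # hypothetical version field proposed here

def write_header(f):
    # little-endian int32 magic followed by int32 version
    f.write(struct.pack("<ii", GGML_MAGIC, FILE_VERSION))

def read_header(f):
    magic, version = struct.unpack("<ii", f.read(8))
    if magic != GGML_MAGIC:
        raise ValueError("not a ggml file (bad magic)")
    if version > FILE_VERSION:
        raise ValueError(f"file version {version} is newer than this loader supports")
    return version
```

A loader can then refuse old files cleanly instead of producing garbage when the format changes again.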

Does this merge into master? How do I test it? wizd's branch doesn't work well with interactive mode.

@ggerganov


The tokenization process of LLaMA is filled with magic numbers and not easily replicable. However, I have found that using the SentencePiece library works well. It's possible that the original LLaMA model also used SentencePiece for its tokenization.
Test prompt: '我静静的坐在雨中,思考着'
("I sit quietly in the rain, thinking")

This sentence is tokenized heavily into byte-fallback pieces (<0x??>), making the tokenization very difficult to replicate by hand.
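A minimal sketch of that round trip with the SentencePiece Python bindings, assuming the `tokenizer.model` shipped with the LLaMA weights is in the working directory:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

text = "我静静的坐在雨中,思考着"
ids = sp.encode(text)

# Much of the sentence comes back as byte-fallback pieces such as '<0xE6>',
# one per UTF-8 byte, rather than as whole-character tokens.
print([sp.id_to_piece(i) for i in ids])

# decode() reassembles those byte pieces into the original UTF-8 string.
assert sp.decode(ids) == text
```

The assert passing is exactly the property this PR is after: encode followed by decode must reproduce the original UTF-8 text even when the tokenizer falls back to raw bytes.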