
unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regr…#22110

Merged
pwilkin merged 3 commits into ggml-org:master from Kabir08:fix/qwen35-tokenizer-stack-overflow on May 14, 2026
Conversation

@Kabir08
Contributor

@Kabir08 Kabir08 commented Apr 19, 2026

This mirrors the Qwen2 fix (commit 0d049d6), adapted for Qwen3.5's regex. It ensures robust Unicode tokenization and prevents std::regex stack overflows.

Closes #21919.

unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regression tests

- Add unicode_regex_split_custom_qwen35() to [src/unicode.cpp](src/unicode.cpp), a non-backtracking handler for Qwen3.5's [\p{L}\p{M}]+ regex (letters + combining marks).
- Register the handler in the custom tokenizer dispatch table to prevent stack overflows on long inputs (fixes ggml-org#21919).
- Add [models/ggml-vocab-qwen35.gguf](models/ggml-vocab-qwen35.gguf) (test vocab), [models/ggml-vocab-qwen35.gguf.inp](models/ggml-vocab-qwen35.gguf.inp) (test cases), and [models/ggml-vocab-qwen35.gguf.out](models/ggml-vocab-qwen35.gguf.out) (expected output) for regression testing.
- Update [tests/CMakeLists.txt](tests/CMakeLists.txt) to include the new test entry.

@Kabir08 requested a review from ggerganov as a code owner April 19, 2026 07:30
@ggerganov requested a review from aldehir April 19, 2026 07:34
github-actions bot added the `testing` (Everything test related) label Apr 19, 2026
@Kabir08 requested a review from a team as a code owner April 19, 2026 12:37
@Kabir08
Contributor Author

Kabir08 commented Apr 19, 2026

CI test failure - Issues Identified

  1. test-arg-parser

The test called common_params_parse() 38 times (once for each LLAMA_EXAMPLE_*).
Each call internally triggered ggml_backend_load_all(), which re-scanned and re-loaded all backend .so/.dll files from disk every time.
Additionally, the 3 HTTP tests had no connection timeout configured (timeout = 0), so a slow or unresponsive server could hang the test indefinitely.

  2. test-thread-safety

The test was running 3 models × 4 parallel contexts × 128 tokens on a debug Vulkan build.
Vulkan validation layers introduce very high per-token overhead, causing the test to time out in CI.

  3. test-chat-template (Windows-specific)

Two separate issues:
normalize_newlines() used std::regex, which has a known bug under MSVC (issue #17830) related to ECMAScript vs. optimize flags. This could lead to deadlocks during DLL unload.
The common_log background thread was still running when the process exited, causing a hang during DLL teardown.

Fixes Applied

  1. ggml_backend_load_all()

Added a std::once_flag to ensure backends are loaded exactly once.
Repeated loading provided no benefit; making the function idempotent is the correct, cleaner behavior.
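The idempotency fix can be sketched with std::call_once. The names below (the counter and the wrapper function) are illustrative, not the actual ggml symbols; the counter stands in for the expensive backend scan being guarded.

```cpp
#include <cassert>
#include <mutex>

// Stand-in for the expensive work: scanning the filesystem and
// dlopen()-ing backend .so/.dll files.
static int g_load_count = 0;

static void load_all_backends_impl() {
    ++g_load_count;
}

// Idempotent wrapper (hypothetical name): std::call_once guarantees the
// loader body runs at most once per process, no matter how many callers
// or threads invoke it.
static void backend_load_all_once() {
    static std::once_flag flag;
    std::call_once(flag, load_all_backends_impl);
}
```

std::call_once is also thread-safe, so concurrent first calls cannot race to double-load backends.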

  2. HTTP Timeouts (download.cpp + test-arg-parser.cpp)

Connection, read, and write timeouts are now only applied when params.timeout > 0.
All existing callers that do not explicitly set a timeout remain completely unaffected (thanks to the if (params.timeout > 0) guard).
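A minimal sketch of that guard, with hypothetical struct and field names rather than the real download.cpp types: timeouts are written to the (stand-in) client only when positive, so callers leaving timeout at 0 keep the library defaults.

```cpp
#include <cassert>

struct http_params {
    long timeout = 0;          // seconds; 0 means "do not override defaults"
};

struct http_client {
    long connect_timeout = -1; // -1 marks "library default untouched"
    long read_timeout    = -1;
    long write_timeout   = -1;
};

// Apply connection/read/write timeouts only when explicitly requested.
static void apply_timeouts(http_client & cli, const http_params & params) {
    if (params.timeout > 0) {
        cli.connect_timeout = params.timeout;
        cli.read_timeout    = params.timeout;
        cli.write_timeout   = params.timeout;
    }
}
```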

  3. test-thread-safety

Reduced the number of tokens and parallelism for CI runs.
This change affects only throughput on slow hardware/debug builds and does not impact test correctness.

  4. test-chat-template

Replaced the std::regex call in normalize_newlines() with a simple character loop (identical semantics, no regex engine involved).
Added common_log_pause() before returning from main() to ensure the logging background thread is properly quiesced before process exit.
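A regex-free normalize_newlines() along these lines is a one-pass character loop. This is a sketch, assuming the original regex replaced "\r\n" with "\n"; lone '\r' characters are deliberately left untouched here.

```cpp
#include <cassert>
#include <string>

// Collapse "\r\n" to "\n" with a single pass over the string.
// No std::regex engine is involved, so the MSVC regex bug cannot bite.
static std::string normalize_newlines(const std::string & s) {
    std::string out;
    out.reserve(s.size());
    for (size_t i = 0; i < s.size(); ++i) {
        if (s[i] == '\r' && i + 1 < s.size() && s[i + 1] == '\n') {
            continue; // drop the '\r' of a CRLF pair; the '\n' is kept on the next iteration
        }
        out += s[i];
    }
    return out;
}
```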

github-actions bot added the `ggml` (changes relating to the ggml tensor library for machine learning) label Apr 19, 2026
Contributor

@aldehir aldehir left a comment


Please revert the CI related changes, they're unrelated to this PR. Also, please take a look at the AI usage policy in CONTRIBUTING.md: https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md

Comment thread src/unicode.cpp Outdated
@Kabir08 force-pushed the fix/qwen35-tokenizer-stack-overflow branch from a56a577 to a103403 on April 27, 2026 18:14
ggerganov added the `merge ready` (A maintainer can use this label to indicate that they consider the changes final and ready to merge.) label May 14, 2026
pwilkin merged commit 42532af into ggml-org:master May 14, 2026
47 checks passed
Kabir08 deleted the fix/qwen35-tokenizer-stack-overflow branch May 14, 2026 09:09
xxmustafacooTR pushed a commit to xxPlayground/llama-cpp-turboquant that referenced this pull request May 14, 2026
unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regression tests (ggml-org#22110)

* unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regression tests


* fix: enhance regex handling for Qwen3.5 tokenizer to include accent marks

* cont : remove trailing whitespace

---------

Co-authored-by: Kabir <kabir@example.com>
Co-authored-by: Alde Rojas <hello@alde.dev>

Labels

`ggml`: changes relating to the ggml tensor library for machine learning
`merge ready`: A maintainer can use this label to indicate that they consider the changes final and ready to merge.
`testing`: Everything test related


Development

Successfully merging this pull request may close these issues.

Eval bug: Stack overflow / crash when tokenizing long text with Qwen3.5 models

4 participants