
unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regr…#22110

Merged
pwilkin merged 3 commits into ggml-org:master from Kabir08:fix/qwen35-tokenizer-stack-overflow on May 14, 2026
Conversation

@Kabir08
Contributor

@Kabir08 Kabir08 commented Apr 19, 2026

This mirrors the Qwen2 fix (commit 0d049d6), adapted for Qwen3.5's regex. It ensures robust Unicode tokenization and prevents std::regex stack overflows.

Closes #21919.

unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regression tests

- Add unicode_regex_split_custom_qwen35() to [src/unicode.cpp](src/unicode.cpp), a non-backtracking handler for Qwen3.5's [\p{L}\p{M}]+ regex (letters + combining marks).
- Register the handler in the custom tokenizer dispatch table to prevent stack overflows on long inputs (fixes ggml-org#21919).
- Add [models/ggml-vocab-qwen35.gguf](models/ggml-vocab-qwen35.gguf) (test vocab), [models/ggml-vocab-qwen35.gguf.inp](models/ggml-vocab-qwen35.gguf.inp) (test cases), and [models/ggml-vocab-qwen35.gguf.out](models/ggml-vocab-qwen35.gguf.out) (expected output) for regression testing.
- Update [tests/CMakeLists.txt](tests/CMakeLists.txt) to include the new test entry.

@Kabir08 requested a review from ggerganov as a code owner April 19, 2026 07:30
@ggerganov requested a review from aldehir April 19, 2026 07:34
github-actions bot added the `testing` (Everything test related) label Apr 19, 2026
@Kabir08 requested a review from a team as a code owner April 19, 2026 12:37
@Kabir08
Contributor Author

Kabir08 commented Apr 19, 2026

CI test failure - Issues Identified

  1. test-arg-parser

The test called common_params_parse() 38 times (once for each LLAMA_EXAMPLE_*).
Each call internally triggered ggml_backend_load_all(), which re-scanned and re-loaded all backend .so/.dll files from disk every time.
Additionally, the 3 HTTP tests had no connection timeout configured (timeout = 0), so a slow or unresponsive server could hang the test indefinitely.

  2. test-thread-safety

The test was running 3 models × 4 parallel contexts × 128 tokens on a debug Vulkan build.
Vulkan validation layers introduce very high per-token overhead, causing the test to time out in CI.

  3. test-chat-template (Windows-specific)

Two separate issues:
normalize_newlines() used std::regex, which has a known bug under MSVC (issue #17830) related to ECMAScript vs. optimize flags. This could lead to deadlocks during DLL unload.
The common_log background thread was still running when the process exited, causing a hang during DLL teardown.

Fixes Applied

  1. ggml_backend_load_all()

Added a std::once_flag to ensure backends are loaded exactly once.
Repeated loading provided no benefit; making the function idempotent is the correct, cleaner behavior.
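The idempotency fix can be sketched with std::call_once. The names below (the counter and the wrapper function) are illustrative, not the actual ggml symbols; the counter stands in for the expensive backend scan being guarded.

```cpp
#include <cassert>
#include <mutex>

// Stand-in for the expensive work: scanning the filesystem and
// dlopen()-ing backend .so/.dll files.
static int g_load_count = 0;

static void load_all_backends_impl() {
    ++g_load_count;
}

// Idempotent wrapper (hypothetical name): std::call_once guarantees the
// loader body runs at most once per process, no matter how many callers
// or threads invoke it.
static void backend_load_all_once() {
    static std::once_flag flag;
    std::call_once(flag, load_all_backends_impl);
}
```

std::call_once is also thread-safe, so concurrent first calls cannot race to double-load backends.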

  2. HTTP Timeouts (download.cpp + test-arg-parser.cpp)

Connection, read, and write timeouts are now only applied when params.timeout > 0.
All existing callers that do not explicitly set a timeout remain completely unaffected (thanks to the if (params.timeout > 0) guard).
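A minimal sketch of that guard, with hypothetical struct and field names rather than the real download.cpp types: timeouts are written to the (stand-in) client only when positive, so callers leaving timeout at 0 keep the library defaults.

```cpp
#include <cassert>

struct http_params {
    long timeout = 0;          // seconds; 0 means "do not override defaults"
};

struct http_client {
    long connect_timeout = -1; // -1 marks "library default untouched"
    long read_timeout    = -1;
    long write_timeout   = -1;
};

// Apply connection/read/write timeouts only when explicitly requested.
static void apply_timeouts(http_client & cli, const http_params & params) {
    if (params.timeout > 0) {
        cli.connect_timeout = params.timeout;
        cli.read_timeout    = params.timeout;
        cli.write_timeout   = params.timeout;
    }
}
```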

  3. test-thread-safety

Reduced the number of tokens and parallelism for CI runs.
This change affects only throughput on slow hardware/debug builds and does not impact test correctness.

  4. test-chat-template

Replaced the std::regex call in normalize_newlines() with a simple character loop (identical semantics, no regex engine involved).
Added common_log_pause() before returning from main() to ensure the logging background thread is properly quiesced before process exit.
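A regex-free normalize_newlines() along these lines is a one-pass character loop. This is a sketch, assuming the original regex replaced "\r\n" with "\n"; lone '\r' characters are deliberately left untouched here.

```cpp
#include <cassert>
#include <string>

// Collapse "\r\n" to "\n" with a single pass over the string.
// No std::regex engine is involved, so the MSVC regex bug cannot bite.
static std::string normalize_newlines(const std::string & s) {
    std::string out;
    out.reserve(s.size());
    for (size_t i = 0; i < s.size(); ++i) {
        if (s[i] == '\r' && i + 1 < s.size() && s[i + 1] == '\n') {
            continue; // drop the '\r' of a CRLF pair; the '\n' is kept on the next iteration
        }
        out += s[i];
    }
    return out;
}
```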

github-actions bot added the `ggml` (changes relating to the ggml tensor library for machine learning) label Apr 19, 2026
Contributor

@aldehir aldehir left a comment


Please revert the CI related changes, they're unrelated to this PR. Also, please take a look at the AI usage policy in CONTRIBUTING.md: https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md

Comment thread src/unicode.cpp Outdated
@Kabir08 force-pushed the fix/qwen35-tokenizer-stack-overflow branch from a56a577 to a103403 on April 27, 2026 18:14
ggerganov added the `merge ready` (A maintainer can use this label to indicate that they consider the changes final and ready to merge.) label May 14, 2026
pwilkin merged commit 42532af into ggml-org:master May 14, 2026
47 checks passed
Kabir08 deleted the fix/qwen35-tokenizer-stack-overflow branch May 14, 2026 09:09
xxmustafacooTR pushed a commit to xxPlayground/llama-cpp-turboquant that referenced this pull request May 14, 2026
unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regression tests (ggml-org#22110)

* unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regression tests


* fix: enhance regex handling for Qwen3.5 tokenizer to include accent marks

* cont : remove trailing whitespace

---------

Co-authored-by: Kabir <kabir@example.com>
Co-authored-by: Alde Rojas <hello@alde.dev>

Labels

`ggml`: changes relating to the ggml tensor library for machine learning
`merge ready`: A maintainer can use this label to indicate that they consider the changes final and ready to merge.
`testing`: Everything test related


Development

Successfully merging this pull request may close these issues.

Eval bug: Stack overflow / crash when tokenizing long text with Qwen3.5 models

4 participants