Unicode codepoint flags for custom regexs #7245
Conversation
Looks like the tokenizer tests are failing on Windows for some reason: https://github.com/ggerganov/llama.cpp/actions/runs/9096294810/job/25001393493?pr=7245#step:12:2583

I cannot debug this locally; is it possible to skip all but the failing test? I have reviewed the previous logs, but that test was not executed, so I think I'm going to start from a clean point and redo all commits until I see the failure. Also I found that compiling tests with …
The problem is the stack size limit on Windows. According to the MSVC /STACK documentation, the default stack reserve size is 1 MB.
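For reference, one common way to work around an MSVC stack overflow is to raise the reserved stack size at link time with the /STACK option. This is an illustrative sketch only (the file name and size are placeholders, not necessarily what this PR did):

```shell
# Illustrative: raise the reserved stack to 8 MiB when linking with MSVC.
# The default reserve is 1 MiB; the source file name here is a placeholder.
cl /EHsc test-tokenizer.cpp /link /STACK:8388608
```

The same value can be passed to an already-built binary via `editbin /STACK:8388608 test-tokenizer.exe`.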
Force-pushed: afcbcb5 to 6ca6c46
I think I'm done here. Now I have the base to fix the tokenizers.
* Replace CODEPOINT_TYPE_* with codepoint_flags
* Update and bugfix brute force random test
* Deterministic brute force random test
* Unicode normalization NFD
* Get rid of BOM




Use flags for each unicode category (`\p{N}`, `\p{L}`, `\p{Z}`, ...) instead of the `CODEPOINT_TYPE_*` definitions, including helper flags for common regex classes like `\s` (only this one for now), `\d`, `\w`... This simplifies writing custom regexes.

All flags are precomputed in `unicode-data.cpp`, generated by `gen-unicode-data.py`.
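The flags approach can be sketched as a per-codepoint bitfield, where each Unicode general category gets one bit and extra helper bits cover regex shorthands like `\s`. This is a minimal illustration, not the actual `codepoint_flags` definition in llama.cpp (field names and layout are assumptions):

```cpp
#include <cstdint>

// Hypothetical sketch of a per-codepoint flags bitfield. One bit per
// Unicode general category, plus helper bits for regex shorthands.
struct codepoint_flags {
    uint16_t is_undefined   : 1;
    uint16_t is_number      : 1;  // \p{N}
    uint16_t is_letter      : 1;  // \p{L}
    uint16_t is_separator   : 1;  // \p{Z}
    uint16_t is_mark        : 1;  // \p{M}
    uint16_t is_punctuation : 1;  // \p{P}
    uint16_t is_symbol      : 1;  // \p{S}
    uint16_t is_control     : 1;  // \p{C}
    uint16_t is_whitespace  : 1;  // helper for \s
};

// With flags, a regex class check is a cheap bit test instead of
// comparing against a list of CODEPOINT_TYPE_* values.
inline bool is_word_char(codepoint_flags f) {  // rough \w
    return f.is_letter || f.is_number;
}
```

A precomputed table mapping each codepoint (or codepoint range) to such a flags value is then enough to evaluate custom regex classes without a full regex engine.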