Attempt to fix pre-tokenizer #5613

Draft · wants to merge 3 commits into master
Conversation

@bobqianic (Contributor) commented Feb 20, 2024

This marks my second effort at resolving the issues with the pre-tokenizer in llama.cpp. I've developed a universal Unicode engine alongside a specialized regex engine. While the regex engine has its limitations, supporting only a narrow set of constructs, it serves our needs well and offers impressive speed.

I have a question regarding tokenizers. Is the Falcon model the only one using the BPE tokenizer at this point? I'm asking because if that's not the case, we might encounter some issues.

My concern stems from the diversity in pre-tokenization among models. The current understanding, reflected in both the master branch and this pull request, is that the bpe_gpt2_preprocess function is exclusively for the Falcon model. However, if other models also use BPE, this assumption could lead to complications.

@hiepxanh commented Feb 20, 2024

https://github.com/xenova/transformers.js/blob/main/src/tokenizers.js

Did you take a look at the JavaScript and Python versions of transformers? I think they might be useful.

Great effort anyway

@ggerganov (Owner) left a comment

This looks good, but needs more work. I think there is some unnecessary complexity in the implementation - indirections, classes, etc. These should be eliminated.

There should be standalone regex tests - see my comments.

Regarding the question about which models we support - I think nobody knows at this point. Every model picks tokenization options at random and doesn't care. It's impossible to understand what options exist and are used - we'll implement features on a case-by-case basis, and 3rd-party projects can always fall back to the Python bloat for tokenization.

}
}

std::vector<std::string> to_category_code(const std::vector<uint32_t> & UNICODE_TYPES) {
@ggerganov (Owner):

This should be to_category_name

@bobqianic (Author):

Yeah, I made a mistake here.

};
}

class UNICODE {
@ggerganov (Owner):

Can we avoid this class and have basic function calls + static containers that are lazy initialized on first use?

Prefix all functions with unicode_. For example: `unicode_to_codepoints()`
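
For reference, a minimal sketch of the pattern being suggested - free functions backed by a lazily-initialized static table - with illustrative names and an empty table, not code from this PR:

```cpp
#include <cstdint>
#include <unordered_map>

// Lazily initialized on first use; initialization of a function-local
// static is thread-safe since C++11.
static const std::unordered_map<uint32_t, int> & unicode_category_map() {
    static const std::unordered_map<uint32_t, int> map = [] {
        std::unordered_map<uint32_t, int> m;
        // ... populate from the generated Unicode tables ...
        return m;
    }();
    return map;
}

int unicode_get_category(uint32_t codepoint) {
    const auto & map = unicode_category_map();
    const auto it = map.find(codepoint);
    return it == map.end() ? 0 : it->second;
}
```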

@bobqianic (Author) commented Feb 21, 2024:

This can work, but I'm not sure it's the best way to do it. See my comment below. The good thing about using a class is that we can make many instances of it, each one having different category definitions, and they won't mess with each other.

@ggerganov (Owner):

The good thing about using a class is that we can make many instances of it, each one having different category definitions, and they won't mess with each other.

Ideally, there should be a single unicode configuration. In what case would we need to have differing category definitions?

#include "unicode.h"
#include "unordered_set"

class llm_regex {
@ggerganov (Owner):

No need for this class - use functions with regex_ prefix

Move everything from unicode_regex.h into unicode.h

}

llm_regex() {
unicode_engine.overload_category(REGEX_RANGES::Whitespace, "WHITESPACE");
@ggerganov (Owner):

Why is this done explicitly instead of being done by default?

@bobqianic (Author):

In developing a universal Unicode engine, I face a notable challenge: the Unicode 15.0 general category data, as defined by the Unicode Consortium, does not single out a whitespace category. Furthermore, the interpretation of \s can vary among regex engines.
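
To make that concrete, here is an illustrative table of codepoints commonly matched by \s - the exact set varies by engine, so these ranges are an assumption, not the PR's actual data:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Illustrative only: the general category data alone does not group these
// together - tab/CR/LF are category Cc while U+0020 is Zs - so a custom
// "WHITESPACE" category has to be registered explicitly.
static const std::vector<std::pair<uint32_t, uint32_t>> k_whitespace_ranges = {
    {0x0009, 0x000D}, // tab, LF, VT, FF, CR (Cc)
    {0x0020, 0x0020}, // space (Zs)
    {0x0085, 0x0085}, // next line (Cc)
    {0x00A0, 0x00A0}, // no-break space (Zs)
    {0x1680, 0x1680}, // ogham space mark (Zs)
    {0x2000, 0x200A}, // en quad .. hair space (Zs)
    {0x2028, 0x2029}, // line separator, paragraph separator (Zl, Zp)
    {0x202F, 0x202F}, // narrow no-break space (Zs)
    {0x205F, 0x205F}, // medium mathematical space (Zs)
    {0x3000, 0x3000}, // ideographic space (Zs)
};
```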

uint32_t get_category(const uint32_t & codepoint) {
return category_implement(codepoint);
}

@ggerganov (Owner):

Are such indirections necessary? Just implement get_category - no need for private methods


class llm_regex {
public:
std::vector<std::string> gpt2_style(const std::string & str) {
@ggerganov (Owner):

These implementations should be put behind a common API that accepts a regex string. Something like:

std::vector<std::vector<uint32_t>> regex_split(const std::string & str, const std::string & regex);

We then check whether the regex is one that is implemented - if not, we throw an error.
We should add tests that compare the outputs of regex_split against reference Python implementations.
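
A minimal sketch of that dispatch - the pattern string below is the well-known GPT-2 pre-tokenizer regex, and regex_split_gpt2 is a hypothetical hand-written splitter, not a function from this PR:

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical hand-written splitter implementing the GPT-2 rules.
static std::vector<std::vector<uint32_t>> regex_split_gpt2(const std::string & str) {
    std::vector<std::vector<uint32_t>> out;
    // ... hand-written GPT-2 splitting rules go here ...
    (void) str;
    return out;
}

std::vector<std::vector<uint32_t>> regex_split(const std::string & str, const std::string & regex) {
    // GPT-2 pre-tokenization pattern as it appears in the original encoder.
    static const std::string k_gpt2 =
        "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+";

    if (regex == k_gpt2) {
        return regex_split_gpt2(str);
    }
    // Unknown patterns fail loudly instead of silently mis-tokenizing.
    throw std::runtime_error("regex_split: unsupported regex: " + regex);
}
```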

@bobqianic bobqianic marked this pull request as draft February 21, 2024 11:54
@bobqianic (Author):

BTW, when testing the tokenizer on large-scale datasets like Wikitext, the method in llama.cpp (tests) fails. This is primarily due to Python-side issues. Specifically, using .encode("utf-8") to convert individual token strings to bytes before writing them to a file is problematic. Not all tokens represent valid UTF-8 text. Consequently, this results in numerous replacement characters (�).

@cebtenzzre (Collaborator) commented Feb 22, 2024

Specifically, using .encode("utf-8") to convert individual token strings to bytes before writing them to a file is problematic. Not all tokens represent valid UTF-8 text. Consequently, this results in numerous replacement characters (�).

UTF-8 can represent any valid Unicode, so surely this is not an issue with the use of encode - the string must already contain Unicode replacement characters because it was incorrectly decoded (str is a Unicode-aware type in Python 3).

@bobqianic (Author):

UTF-8 can represent any valid Unicode, so surely this is not an issue with the use of encode - the string must already contain Unicode replacement characters because it was incorrectly decoded (str is a Unicode-aware type in Python 3).

I have a different perspective on this. If the token string already contained Unicode replacement characters, I'm curious how combining two or more such tokens could still result in a valid UTF-8 sequence. It seems counterintuitive, doesn't it? Perhaps we can clarify this with a straightforward experiment to see what actually happens.
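
As a concrete illustration (my own example, not from the thread): a byte-level BPE stores raw bytes, so one multi-byte character can straddle a token boundary - each piece decodes to replacement characters on its own, yet the concatenated bytes form valid UTF-8:

```cpp
#include <cstdio>
#include <string>

int main() {
    // "の" (U+306E) is the three-byte UTF-8 sequence E3 81 AE. A byte-level
    // BPE may split it across two tokens:
    const std::string tok1 = "\xE3\x81"; // invalid UTF-8 in isolation
    const std::string tok2 = "\xAE";     // invalid UTF-8 in isolation

    // Decoding each token separately yields U+FFFD replacement characters,
    // but concatenating the raw bytes first restores the character:
    const std::string text = tok1 + tok2;
    std::printf("%s\n", text.c_str()); // prints: の
    return 0;
}
```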

@bobqianic (Author):

UTF-8 can represent any valid Unicode, so surely this is not an issue with the use of encode - the string must already contain Unicode replacement characters because it was incorrectly decoded (str is a Unicode-aware type in Python 3).

@cebtenzzre You are right, I still get ����の����ル��������3. This indeed isn't a problem related to the use of .encode("utf-8"), but rather an issue that arises from using tokenizer.decode() to decode a single token.
