Attempt to fix pre-tokenizer #5613

Draft · wants to merge 3 commits into master
Conversation

@bobqianic (Contributor) commented Feb 20, 2024

This marks my second effort at resolving the issues with the pre-tokenizer in llama.cpp. I've developed a universal Unicode engine alongside a specialized regex engine. While the regex engine has its limitations, supporting only a narrow set of constructs, it serves our needs well and offers impressive speed.

I have a question regarding tokenizers. Is the Falcon model the only one using the BPE tokenizer at this point? I'm asking because if that's not the case, we might encounter some issues.

My concern stems from the diversity in pre-tokenization among models. The current understanding, reflected in both the master branch and this pull request, is that the bpe_gpt2_preprocess function is exclusively for the Falcon model. However, if other models also use BPE, this assumption could lead to complications.

@hiepxanh commented Feb 20, 2024

https://github.com/xenova/transformers.js/blob/main/src/tokenizers.js

Did you take a look at the JavaScript and Python versions of transformers? I think they might be useful.

Great effort anyway

@ggerganov (Owner) left a comment

This looks good, but needs more work. I think there is some unnecessary complexity in the implementation - indirections, classes, etc. These should be eliminated.

There should be standalone regex tests - see my comments.

Regarding the question about which models we support - I think nobody knows at this point. Every model picks tokenization options at random and doesn't care. It's impossible to understand what options exist and are used - we'll implement features on a case-by-case basis, and 3rd-party projects can always fall back to the Python bloat for tokenization.

}
}

std::vector<std::string> to_category_code(const std::vector<uint32_t> & UNICODE_TYPES) {
@ggerganov (Owner):

This should be to_category_name

@bobqianic (Author):

Yeah, I made a mistake here.

};
}

class UNICODE {
@ggerganov (Owner):

Can we avoid this class and have basic function calls + static containers that are lazy initialized on first use?

Prefix all functions with unicode_. For example: `unicode_to_codepoints()`
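
For reference, a minimal sketch of the pattern being suggested - free functions backed by a lazily-initialized static table - with illustrative names and an empty table, not code from this PR:

```cpp
#include <cstdint>
#include <unordered_map>

// Lazily initialized on first use; initialization of a function-local
// static is thread-safe since C++11.
static const std::unordered_map<uint32_t, int> & unicode_category_map() {
    static const std::unordered_map<uint32_t, int> map = [] {
        std::unordered_map<uint32_t, int> m;
        // ... populate from the generated Unicode tables ...
        return m;
    }();
    return map;
}

int unicode_get_category(uint32_t codepoint) {
    const auto & map = unicode_category_map();
    const auto it = map.find(codepoint);
    return it == map.end() ? 0 : it->second;
}
```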

@bobqianic (Author) commented Feb 21, 2024:

This can work, but I'm not sure it's the best way to do it. See my comment below. The good thing about using a class is that we can make many instances of it, each one having different category definitions, and they won't mess with each other.

@ggerganov (Owner):

The good thing about using a class is that we can make many instances of it, each one having different category definitions, and they won't mess with each other.

Ideally, there should be a single unicode configuration. In what case would we need to have differing category definitions?

#include "unicode.h"
#include "unordered_set"

class llm_regex {
@ggerganov (Owner):

No need for this class - use functions with regex_ prefix

Move everything from unicode_regex.h into unicode.h

}

llm_regex() {
unicode_engine.overload_category(REGEX_RANGES::Whitespace, "WHITESPACE");
@ggerganov (Owner):

Why is this done explicitly instead of being done by default?

@bobqianic (Author):

In developing a universal Unicode engine, I face a notable challenge: the Unicode 15.0 general category data, as defined by the Unicode Consortium, does not single out a whitespace category. Furthermore, the interpretation of \s can vary among regex engines.
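
To make that concrete, here is an illustrative table of codepoints commonly matched by \s - the exact set varies by engine, so these ranges are an assumption, not the PR's actual data:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Illustrative only: the general category data alone does not group these
// together - tab/CR/LF are category Cc while U+0020 is Zs - so a custom
// "WHITESPACE" category has to be registered explicitly.
static const std::vector<std::pair<uint32_t, uint32_t>> k_whitespace_ranges = {
    {0x0009, 0x000D}, // tab, LF, VT, FF, CR (Cc)
    {0x0020, 0x0020}, // space (Zs)
    {0x0085, 0x0085}, // next line (Cc)
    {0x00A0, 0x00A0}, // no-break space (Zs)
    {0x1680, 0x1680}, // ogham space mark (Zs)
    {0x2000, 0x200A}, // en quad .. hair space (Zs)
    {0x2028, 0x2029}, // line separator, paragraph separator (Zl, Zp)
    {0x202F, 0x202F}, // narrow no-break space (Zs)
    {0x205F, 0x205F}, // medium mathematical space (Zs)
    {0x3000, 0x3000}, // ideographic space (Zs)
};
```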

uint32_t get_category(const uint32_t & codepoint) {
return category_implement(codepoint);
}

@ggerganov (Owner):

Are such indirections necessary? Just implement get_category - no need for private methods


class llm_regex {
public:
std::vector<std::string> gpt2_style(const std::string & str) {
@ggerganov (Owner):

These implementations should be put behind a common API that accepts a regex string. Something like:

std::vector<std::vector<uint32_t>> regex_split(const std::string & str, const std::string & regex);

We then check whether the regex is one that is implemented - if not, we throw an error.
We should add tests that compare the outputs of regex_split against reference Python implementations.
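
A minimal sketch of that dispatch - the pattern string below is the well-known GPT-2 pre-tokenizer regex, and regex_split_gpt2 is a hypothetical hand-written splitter, not a function from this PR:

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical hand-written splitter implementing the GPT-2 rules.
static std::vector<std::vector<uint32_t>> regex_split_gpt2(const std::string & str) {
    std::vector<std::vector<uint32_t>> out;
    // ... hand-written GPT-2 splitting rules go here ...
    (void) str;
    return out;
}

std::vector<std::vector<uint32_t>> regex_split(const std::string & str, const std::string & regex) {
    // GPT-2 pre-tokenization pattern as it appears in the original encoder.
    static const std::string k_gpt2 =
        "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+";

    if (regex == k_gpt2) {
        return regex_split_gpt2(str);
    }
    // Unknown patterns fail loudly instead of silently mis-tokenizing.
    throw std::runtime_error("regex_split: unsupported regex: " + regex);
}
```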

@bobqianic bobqianic marked this pull request as draft February 21, 2024 11:54
@bobqianic (Author):

BTW, when testing the tokenizer on large-scale datasets like Wikitext, the method in llama.cpp (tests) fails. This is primarily due to Python-side issues. Specifically, using .encode("utf-8") to convert individual token strings to bytes before writing them to a file is problematic. Not all tokens represent valid UTF-8 text. Consequently, this results in numerous replacement characters (�).

@cebtenzzre (Collaborator) commented Feb 22, 2024

Specifically, using .encode("utf-8") to convert individual token strings to bytes before writing them to a file is problematic. Not all tokens represent valid UTF-8 text. Consequently, this results in numerous replacement characters (�).

UTF-8 can represent any valid Unicode, so surely this is not an issue with the use of encode - the string must already contain Unicode replacement characters because it was incorrectly decoded (str is a Unicode-aware type in Python 3).

@bobqianic (Author):

UTF-8 can represent any valid Unicode, so surely this is not an issue with the use of encode - the string must already contain Unicode replacement characters because it was incorrectly decoded (str is a Unicode-aware type in Python 3).

I have a different perspective on this. If the token string already contained Unicode replacement characters, I'm curious how combining two or more such tokens could still result in a valid UTF-8 sequence. It seems counterintuitive, doesn't it? Perhaps we can clarify this with a straightforward experiment to see what actually happens.
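
As a concrete illustration (my own example, not from the thread): a byte-level BPE stores raw bytes, so one multi-byte character can straddle a token boundary - each piece decodes to replacement characters on its own, yet the concatenated bytes form valid UTF-8:

```cpp
#include <cstdio>
#include <string>

int main() {
    // "の" (U+306E) is the three-byte UTF-8 sequence E3 81 AE. A byte-level
    // BPE may split it across two tokens:
    const std::string tok1 = "\xE3\x81"; // invalid UTF-8 in isolation
    const std::string tok2 = "\xAE";     // invalid UTF-8 in isolation

    // Decoding each token separately yields U+FFFD replacement characters,
    // but concatenating the raw bytes first restores the character:
    const std::string text = tok1 + tok2;
    std::printf("%s\n", text.c_str()); // prints: の
    return 0;
}
```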

@bobqianic (Author):

UTF-8 can represent any valid Unicode, so surely this is not an issue with the use of encode - the string must already contain Unicode replacement characters because it was incorrectly decoded (str is a Unicode-aware type in Python 3).

@cebtenzzre You are right, I still get ����の����ル��������3. This indeed isn't a problem related to the use of .encode("utf-8"), but rather an issue that arises from using tokenizer.decode() to decode a single token.
