Is your feature request related to a problem? Please describe.
I'm requesting this feature after trying to use the GPT2-style tokenizer I trained using HuggingFace in my .NET code. I had trained a model and converted it to ONNX, but the tokenizer didn't transfer. An exact description of the problem is given below.
Describe the solution you'd like
Add support for a flag indicating that the tokenizer came from the HuggingFace BPE trainer, and handle the minor changes required behind the scenes.
Describe alternatives you've considered
Currently I have a class I wrote which wraps the ML.NET BPE tokenizer and applies the adjustments before every call.
Additional context
In the HuggingFace BPE code they have a function bytes_to_unicode() which builds a mapping from every UTF-8 byte value to a representative Unicode character. They run every byte in the string through this mapping before running the BPE encoder/decoder. Examples of where it's used can be found here and here and in other places. Before encoding, they treat the string as bytes and map each byte to its representative Unicode character, and they apply the inverse mapping after decoding.
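For illustration, here is a rough C# port of what that mapping computes and how it is applied around encode/decode. This is a sketch only; the ByteLevelMap class and its member names are mine, not part of any library:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class ByteLevelMap
{
    // GPT-2 style table: visible/printable bytes map to themselves, and
    // every remaining byte value is shifted up past U+00FF so each byte
    // gets a distinct, printable representative character.
    public static readonly Dictionary<byte, char> BytesToUnicode = Build();

    public static readonly Dictionary<char, byte> UnicodeToBytes =
        BytesToUnicode.ToDictionary(kv => kv.Value, kv => kv.Key);

    private static Dictionary<byte, char> Build()
    {
        var bytes = new List<int>();
        for (int b = '!'; b <= '~'; b++) bytes.Add(b);   // 0x21..0x7E
        for (int b = 0xA1; b <= 0xAC; b++) bytes.Add(b);
        for (int b = 0xAE; b <= 0xFF; b++) bytes.Add(b);

        var chars = new List<int>(bytes);
        int n = 0;
        for (int b = 0; b < 256; b++)
        {
            if (!bytes.Contains(b))
            {
                bytes.Add(b);
                chars.Add(256 + n); // e.g. 0x20 (space) -> 'Ġ'
                n++;
            }
        }

        var map = new Dictionary<byte, char>();
        for (int i = 0; i < bytes.Count; i++)
            map[(byte)bytes[i]] = (char)chars[i];
        return map;
    }

    // Pre-encode: UTF-8 bytes -> representative characters.
    public static string ToByteLevel(string text)
    {
        var sb = new StringBuilder();
        foreach (byte b in Encoding.UTF8.GetBytes(text))
            sb.Append(BytesToUnicode[b]);
        return sb.ToString();
    }

    // Post-decode: representative characters -> UTF-8 bytes -> string.
    public static string FromByteLevel(string mapped)
    {
        byte[] bytes = mapped.Select(c => UnicodeToBytes[c]).ToArray();
        return Encoding.UTF8.GetString(bytes);
    }
}
```

With this table, a space maps to 'Ġ', so "hello world" becomes "helloĠworld" before the merges run, which is the form the entries in merges.txt are written in.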
Real example:
I trained a BPE tokenizer using HuggingFace's tokenizers.ByteLevelBPETokenizer. The merges.txt and vocab.json can be found here: https://gist.github.com/shaltielshmid/58b7c1109639eefcd714eb6bfc3eb602

Sample python code:
Sample C# code:

And then we can create the tokenizer like this (the EmptyPreTokenizer class is a custom PreTokenizer just to make sure that the WhitespaceTokenizer isn't used):
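A minimal sketch of that creation step, assuming the preview-era Microsoft.ML.Tokenizers API (the exact PreTokenizer/Split signatures may differ between versions):

```csharp
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;

// Pass-through pre-tokenizer: emits the whole input as a single split so
// the default whitespace pre-tokenization is never applied.
// Note: the Split/PreTokenize signatures here are approximate.
public class EmptyPreTokenizer : PreTokenizer
{
    public override IReadOnlyList<Split> PreTokenize(string sentence)
        => new[] { new Split(sentence, (0, sentence.Length)) };
}

// Load the HuggingFace-trained vocab/merges into the ML.NET BPE model.
var bpe = new Bpe("vocab.json", "merges.txt");
var tokenizer = new Tokenizer(bpe, new EmptyPreTokenizer());

var encoding = tokenizer.Encode("hello world");
```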
Proposed Solution
Create a static dictionary in the BPE class, which is initialized once:
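Presumably along these lines, reusing the table from the sketch above (the field names here are hypothetical):

```csharp
// In Bpe (BPE.cs): the byte-level table and its inverse, built once
// and shared by all instances.
private static readonly Dictionary<byte, char> s_byteToUnicode =
    ByteLevelMap.BytesToUnicode;

private static readonly Dictionary<char, byte> s_unicodeToByte =
    ByteLevelMap.UnicodeToBytes;
```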
Then, in the BPE.cs class, in the Tokenize function here, add the following check:
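Something along these lines; a sketch only, where _byteLevelEncoding is the proposed (hypothetical) flag and token is the text being tokenized:

```csharp
// If the vocab came from HuggingFace's byte-level BPE trainer, remap the
// input through the byte-to-unicode table before any merges run.
if (_byteLevelEncoding)
{
    var sb = new StringBuilder();
    foreach (byte b in Encoding.UTF8.GetBytes(token))
        sb.Append(s_byteToUnicode[b]);
    token = sb.ToString();
}
```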
And then in the BPEDecoder.cs file, in the Decode function here, apply the inverse mapping (see the sketch below).

Would be happy to compile this into a PR, if relevant.
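Roughly the reverse of the encode-side check (again, the names are hypothetical):

```csharp
// Map each representative character back to its raw byte, then decode
// the byte sequence as UTF-8 to recover the original string.
if (_byteLevelEncoding)
{
    var bytes = new byte[output.Length];
    for (int i = 0; i < output.Length; i++)
        bytes[i] = s_unicodeToByte[output[i]];
    output = Encoding.UTF8.GetString(bytes);
}
```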
@luisquintanilla