Is your feature request related to a problem? Please describe.
I'm requesting this feature after trying to use the GPT2-style tokenizer I trained using HuggingFace in my .NET code. I had trained a model and converted it to ONNX, but the tokenizer didn't transfer. An exact description of the problem is given below.
Describe the solution you'd like
Add support for a flag indicating that the tokenizer came from the HuggingFace BPE trainer, and handle the minor changes required behind the scenes.
Describe alternatives you've considered
Currently I have a class I wrote which wraps the ML.NET BPE tokenizer and applies the adjustments before every call.
Additional context
In the HuggingFace BPE code they have a function bytes_to_unicode() which builds a mapping from every UTF-8 byte value to a representative Unicode character. They run every byte in the string through this mapping before running the BPE encoder/decoder. Examples of where it's used can be found here and here and in other places. Before encoding, they treat the string as bytes and map each byte to its representative Unicode character, and they apply the inverse mapping after decoding.
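For illustration, here is a rough C# port of what that mapping computes and how it is applied around encode/decode. This is a sketch only; the ByteLevelMap class and its member names are mine, not part of any library:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class ByteLevelMap
{
    // GPT-2 style table: visible/printable bytes map to themselves, and
    // every remaining byte value is shifted up past U+00FF so each byte
    // gets a distinct, printable representative character.
    public static readonly Dictionary<byte, char> BytesToUnicode = Build();

    public static readonly Dictionary<char, byte> UnicodeToBytes =
        BytesToUnicode.ToDictionary(kv => kv.Value, kv => kv.Key);

    private static Dictionary<byte, char> Build()
    {
        var bytes = new List<int>();
        for (int b = '!'; b <= '~'; b++) bytes.Add(b);   // 0x21..0x7E
        for (int b = 0xA1; b <= 0xAC; b++) bytes.Add(b);
        for (int b = 0xAE; b <= 0xFF; b++) bytes.Add(b);

        var chars = new List<int>(bytes);
        int n = 0;
        for (int b = 0; b < 256; b++)
        {
            if (!bytes.Contains(b))
            {
                bytes.Add(b);
                chars.Add(256 + n); // e.g. 0x20 (space) -> 'Ġ'
                n++;
            }
        }

        var map = new Dictionary<byte, char>();
        for (int i = 0; i < bytes.Count; i++)
            map[(byte)bytes[i]] = (char)chars[i];
        return map;
    }

    // Pre-encode: UTF-8 bytes -> representative characters.
    public static string ToByteLevel(string text)
    {
        var sb = new StringBuilder();
        foreach (byte b in Encoding.UTF8.GetBytes(text))
            sb.Append(BytesToUnicode[b]);
        return sb.ToString();
    }

    // Post-decode: representative characters -> UTF-8 bytes -> string.
    public static string FromByteLevel(string mapped)
    {
        byte[] bytes = mapped.Select(c => UnicodeToBytes[c]).ToArray();
        return Encoding.UTF8.GetString(bytes);
    }
}
```

With this table, a space maps to 'Ġ', so "hello world" becomes "helloĠworld" before the merges run, which is the form the entries in merges.txt are written in.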
Real example:
I trained a BPE tokenizer using HuggingFace's tokenizers.ByteLevelBPETokenizer. The merges.txt and vocab.json can be found here: https://gist.github.com/shaltielshmid/58b7c1109639eefcd714eb6bfc3eb602

Sample python code:
Sample C# code:

And then we can create the tokenizer like this (the EmptyPreTokenizer class is a custom PreTokenizer just to make sure that the WhitespaceTokenizer isn't used):
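A minimal sketch of that creation step, assuming the preview-era Microsoft.ML.Tokenizers API (the exact PreTokenizer/Split signatures may differ between versions):

```csharp
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;

// Pass-through pre-tokenizer: emits the whole input as a single split so
// the default whitespace pre-tokenization is never applied.
// Note: the Split/PreTokenize signatures here are approximate.
public class EmptyPreTokenizer : PreTokenizer
{
    public override IReadOnlyList<Split> PreTokenize(string sentence)
        => new[] { new Split(sentence, (0, sentence.Length)) };
}

// Load the HuggingFace-trained vocab/merges into the ML.NET BPE model.
var bpe = new Bpe("vocab.json", "merges.txt");
var tokenizer = new Tokenizer(bpe, new EmptyPreTokenizer());

var encoding = tokenizer.Encode("hello world");
```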
Proposed Solution
Create a static dictionary in the BPE class, which is initialized once:
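Presumably along these lines, reusing the table from the sketch above (the field names here are hypothetical):

```csharp
// In Bpe (BPE.cs): the byte-level table and its inverse, built once
// and shared by all instances.
private static readonly Dictionary<byte, char> s_byteToUnicode =
    ByteLevelMap.BytesToUnicode;

private static readonly Dictionary<char, byte> s_unicodeToByte =
    ByteLevelMap.UnicodeToBytes;
```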
Then, in the BPE.cs class, in the Tokenize function here, add the following check:
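Something along these lines; a sketch only, where _byteLevelEncoding is the proposed (hypothetical) flag and token is the text being tokenized:

```csharp
// If the vocab came from HuggingFace's byte-level BPE trainer, remap the
// input through the byte-to-unicode table before any merges run.
if (_byteLevelEncoding)
{
    var sb = new StringBuilder();
    foreach (byte b in Encoding.UTF8.GetBytes(token))
        sb.Append(s_byteToUnicode[b]);
    token = sb.ToString();
}
```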
And then in the BPEDecoder.cs file, in the Decode function here, apply the inverse mapping (see the sketch below).

Would be happy to compile this into a PR, if relevant.
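Roughly the reverse of the encode-side check (again, the names are hypothetical):

```csharp
// Map each representative character back to its raw byte, then decode
// the byte sequence as UTF-8 to recover the original string.
if (_byteLevelEncoding)
{
    var bytes = new byte[output.Length];
    for (int i = 0; i < output.Length; i++)
        bytes[i] = s_unicodeToByte[output[i]];
    output = Encoding.UTF8.GetString(bytes);
}
```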
@luisquintanilla