Still no o200k_base support #7154

Xan-Kun · 2024-05-16T23:30:32Z

System Information (please complete the following information):

OS & Version: all
OS & Version: all
ML.NET Version: all
.NET Version: all

Describe the bug
No way to tokenize gpt-4o strings!

To Reproduce
Tokenize a string for gpt-4o

Expected behavior
The most recent models are supported.

stephentoub · 2024-05-17T01:29:19Z

a) Please be respectful.
b) Even the official openai/tiktoken repo only merged support for gpt-4o 3 days ago: openai/tiktoken@9d01e56. Integrated support will be coming here as well.
c) Tokenizer does support the new model, it's just not as integrated as for the other models because the vocab file and associated regex aren't baked in. I expect @tarekgh would be able to share a sample.

tarekgh · 2024-05-17T02:41:28Z

Here is the sample how to create the gpt-4o.

const string ENDOFTEXT = "<|endoftext|>";
const string ENDOFPROMPT = "<|endofprompt|>";
Dictionary<string, int> specialTokens = new() 
{ 
    { ENDOFTEXT,   199999 },
    { ENDOFPROMPT, 200018 }
};
string regexPattern = @"[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?|[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]+[\p{Ll}\p{Lm}\p{Lo}\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n/]*|\s*[\r\n]+|\s+(?!\S)|\s+";
Regex regex = new Regex(regexPattern, RegexOptions.Compiled);
HttpClient httpClient = new HttpClient();
Tiktoken tiktoken= await Tiktoken.CreateAsync(await httpClient.GetStreamAsync(@"https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken"), specialTokens);
Tokenizer gpt4o = new Tokenizer(tiktoken, new TiktokenPreTokenizer(regex, specialTokens));

gpt4o.EncodeToIds("Hello, World!<|endoftext|>").ToList().ForEach(Console.WriteLine);

Note, this is using the library version:

    <PackageReference Include="Microsoft.ML.Tokenizers" Version="0.22.0-preview.24179.1" />

I didn't do deep testing as https://platform.openai.com/tokenizer didn't enable this new model.

tarekgh · 2024-05-17T02:50:18Z

CC @ericstj @luisquintanilla

tarekgh · 2024-05-22T01:00:51Z

This change is now published to the NuGet https://www.nuget.org/packages/Microsoft.ML.Tokenizers/0.22.0-preview.24271.1

dotnet-policy-service bot added the untriaged New issue has not been triaged label May 16, 2024

tarekgh self-assigned this May 17, 2024

tarekgh added Tokenizers and removed untriaged New issue has not been triaged labels May 17, 2024

stephentoub mentioned this issue May 21, 2024

Support Gpt-4o tokenizer model #7157

Merged

tarekgh closed this as completed in #7157 May 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Still no o200k_base support #7154

Still no o200k_base support #7154

Xan-Kun commented May 16, 2024 •

edited

stephentoub commented May 17, 2024

tarekgh commented May 17, 2024

tarekgh commented May 17, 2024

tarekgh commented May 22, 2024

Still no o200k_base support #7154

Still no o200k_base support #7154

Comments

Xan-Kun commented May 16, 2024 • edited

stephentoub commented May 17, 2024

tarekgh commented May 17, 2024

tarekgh commented May 17, 2024

tarekgh commented May 22, 2024

Xan-Kun commented May 16, 2024 •

edited