Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Still no o200k_base support #7154

Closed
Xan-Kun opened this issue May 16, 2024 · 4 comments · Fixed by #7157
Closed

Still no o200k_base support #7154

Xan-Kun opened this issue May 16, 2024 · 4 comments · Fixed by #7157
Assignees

Comments

@Xan-Kun
Copy link

Xan-Kun commented May 16, 2024

System Information (please complete the following information):

  • OS & Version: all
  • OS & Version: all
  • ML.NET Version: all
  • .NET Version: all

Describe the bug
No way to tokenize gpt-4o strings!

To Reproduce
Tokenize a string for gpt-4o

Expected behavior
The most recent models are supported.

@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged label May 16, 2024
@stephentoub
Copy link
Member

a) Please be respectful.
b) Even the official openai/tiktoken repo only merged support for gpt-4o 3 days ago: openai/tiktoken@9d01e56. Integrated support will be coming here as well.
c) Tokenizer does support the new model, it's just not as integrated as for the other models because the vocab file and associated regex aren't baked in. I expect @tarekgh would be able to share a sample.

@tarekgh
Copy link
Member

tarekgh commented May 17, 2024

Here is the sample how to create the gpt-4o.

const string ENDOFTEXT = "<|endoftext|>";
const string ENDOFPROMPT = "<|endofprompt|>";
Dictionary<string, int> specialTokens = new() 
{ 
    { ENDOFTEXT,   199999 },
    { ENDOFPROMPT, 200018 }
};
string regexPattern = @"[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?|[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]+[\p{Ll}\p{Lm}\p{Lo}\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n/]*|\s*[\r\n]+|\s+(?!\S)|\s+";
Regex regex = new Regex(regexPattern, RegexOptions.Compiled);
HttpClient httpClient = new HttpClient();
Tiktoken tiktoken= await Tiktoken.CreateAsync(await httpClient.GetStreamAsync(@"https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken"), specialTokens);
Tokenizer gpt4o = new Tokenizer(tiktoken, new TiktokenPreTokenizer(regex, specialTokens));

gpt4o.EncodeToIds("Hello, World!<|endoftext|>").ToList().ForEach(Console.WriteLine);

Note, this is using the library version:

    <PackageReference Include="Microsoft.ML.Tokenizers" Version="0.22.0-preview.24179.1" />

I didn't do deep testing as https://platform.openai.com/tokenizer didn't enable this new model.

@tarekgh tarekgh self-assigned this May 17, 2024
@tarekgh tarekgh added Tokenizers and removed untriaged New issue has not been triaged labels May 17, 2024
@tarekgh
Copy link
Member

tarekgh commented May 17, 2024

CC @ericstj @luisquintanilla

@tarekgh
Copy link
Member

tarekgh commented May 22, 2024

This change is now published to the NuGet https://www.nuget.org/packages/Microsoft.ML.Tokenizers/0.22.0-preview.24271.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants