Still no o200k_base support #7154
Here is a sample showing how to create a tokenizer for gpt-4o:
const string ENDOFTEXT = "<|endoftext|>";
const string ENDOFPROMPT = "<|endofprompt|>";
Dictionary<string, int> specialTokens = new()
{
{ ENDOFTEXT, 199999 },
{ ENDOFPROMPT, 200018 }
};
string regexPattern = @"[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?|[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]+[\p{Ll}\p{Lm}\p{Lo}\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n/]*|\s*[\r\n]+|\s+(?!\S)|\s+";
Regex regex = new Regex(regexPattern, RegexOptions.Compiled);
HttpClient httpClient = new HttpClient();
Tiktoken tiktoken = await Tiktoken.CreateAsync(await httpClient.GetStreamAsync(@"https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken"), specialTokens);
Tokenizer gpt4o = new Tokenizer(tiktoken, new TiktokenPreTokenizer(regex, specialTokens));
gpt4o.EncodeToIds("Hello, World!<|endoftext|>").ToList().ForEach(Console.WriteLine);
Note: this uses the library version <PackageReference Include="Microsoft.ML.Tokenizers" Version="0.22.0-preview.24179.1" />.
I have not tested it in depth, because https://platform.openai.com/tokenizer has not enabled this new model yet, so there is nothing to compare the output against.
This change is now published on NuGet: https://www.nuget.org/packages/Microsoft.ML.Tokenizers/0.22.0-preview.24271.1
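For readers upgrading to that package, usage might look like the minimal sketch below. It rests on two assumptions that this thread does not confirm: that the published preview maps the "gpt-4o" model name to the o200k_base encoding, and that the `Tokenizer.CreateTiktokenForModel` factory is available in that version. If either does not hold, the manual setup shown in the earlier comment should still apply.

```csharp
using System;
using Microsoft.ML.Tokenizers;

// Assumption: this preview resolves "gpt-4o" to the o200k_base vocabulary,
// so no manual download, regex, or special-token table is needed.
Tokenizer gpt4o = Tokenizer.CreateTiktokenForModel("gpt-4o");

// Print each token id on its own line, as in the manual sample.
foreach (int id in gpt4o.EncodeToIds("Hello, World!"))
{
    Console.WriteLine(id);
}
```

Comparing these ids with the output of the manually constructed tokenizer on the same input is a quick way to check that both paths agree.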
System Information (please complete the following information):
Describe the bug
No way to tokenize gpt-4o strings!
To Reproduce
Tokenize a string for gpt-4o
Expected behavior
The most recent models are supported.