
Tokenizers Library Design #7144


Description

@tarekgh

Tokenizers are a crucial component of Large Language Models (LLMs) such as GPT-3 or BERT. They are responsible for tokenization: breaking natural language text down into smaller, manageable pieces called tokens. These tokens can be words, characters, sub-words, numbers, or symbols, and they are what allow the LLM to process and understand the text.

This issue presents the APIs proposed for the Microsoft.ML.Tokenizers library, intended for design review. The design introduces an abstract class named Tokenizer, which defines the primary interface for all supported tokenizers. Additionally, the Tokenizer class includes factory methods for creating various types of tokenizers.

The Tokenizer can optionally be configured with a normalizer, which normalizes the text before it is processed. Normalization can take various forms, such as uppercasing, lowercasing, Unicode normalization, and removing or inserting specific characters. Normalization is optional, and whether to apply a normalizer is left to the discretion of the tokenizer or the user.
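
For instance, the built-in LowerCaseNormalizer defined later in this proposal can be used on its own:

    Normalizer normalizer = new LowerCaseNormalizer();
    string normalized = normalizer.Normalize("Hello World");   // "hello world"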

Pre-tokenization is an additional component the tokenizer can be configured with, which splits the input text into smaller units before processing. Pre-tokenization is also optional, but most tokenizers use it, and many pre-tokenizers implement the split with a regex.
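
As an illustration, the WhiteSpace pre-tokenizer from this proposal yields the segments of the input; the (int, int) tuples are assumed here to be (Index, Length) pairs into the text, mirroring Token.Offset:

    PreTokenizer preTokenizer = WhiteSpace.Instance;
    string input = "Hello World";

    foreach ((int index, int length) in preTokenizer.PreTokenize(input))
    {
        Console.WriteLine(input.Substring(index, length));   // "Hello", then "World"
    }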

The typical sequence of operations for the Tokenizer, sketched in code after the following list, involves:

  • Normalizing the input text if a normalizer is configured.
  • Pre-tokenizing the input or normalized text to segment it into smaller units.
  • Encoding each unit of text, potentially dividing it into smaller tokens and generating string tokens, IDs for the tokens, and/or offsets that map each token to a portion of the input or normalized text.
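
A minimal sketch of that pipeline, driven manually through the abstractions proposed below, given any Tokenizer instance named tokenizer. It assumes the considerPreTokenization/considerNormalization flags let callers skip stages that were already applied:

    string text = "Hello World";

    // 1. Normalize the input text if a normalizer is configured.
    string normalized = tokenizer.Normalizer?.Normalize(text) ?? text;

    // 2. Pre-tokenize the (possibly normalized) text into segments.
    var ids = new List<int>();
    foreach ((int index, int length) in tokenizer.PreTokenizer?.PreTokenize(normalized) ?? new[] { (0, normalized.Length) })
    {
        // 3. Encode each segment, opting out of the stages performed above.
        ids.AddRange(tokenizer.EncodeToIds(normalized.AsSpan(index, length),
                                           considerPreTokenization: false,
                                           considerNormalization: false));
    }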

Tokenizers offer the following functionalities:

  • Encoding the input text into IDs, which can be utilized as input for Language Models. This operation is referred to as EncodeToIds in the proposed design.
  • Counting the tokens within the input text, aiding in calculating the quota allowed for processing at any given time. This operation is named CountTokens in the proposed design.
  • Full encoding, providing detailed results such as string tokens, IDs, and offsets mapping the tokens to parts of the input text. This operation is labeled as Encode in the proposed design.
  • Given a maximum token count, the tokenizer can determine how far into the input text tokens can be produced, either from the beginning or the end. These operations are denoted as IndexOfTokenCount and LastIndexOfTokenCount.
  • Decoding the generated IDs back into text. This operation is named Decode in the proposed design.
  • Establishing mappings between string tokens and IDs. These operations are termed MapTokenToId and MapIdToToken in the proposed design.

Tokenizers typically rely on vocabulary files, which are provided to the tokenizer during instantiation. Users commonly pass these vocabularies as either a file or a stream to the tokenizer constructor. Vocabulary files can vary in format, such as JSON, plain text, protobuf, and more. Each tokenizer determines the specific formats of files it can be instantiated with.
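
For example, vocabularies shipped as files can be opened as streams and handed to a tokenizer constructor (the file names here are hypothetical; the expected formats depend on the tokenizer):

    using Stream vocabStream = File.OpenRead("vocab.json");
    using Stream mergesStream = File.OpenRead("merges.txt");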

Usage Examples:

Create a BPE tokenizer using the constructor:

    Tokenizer tokenizer = new Bpe(vocabStream: vocabStream, mergesStream: mergesStream, normalizer: null, preTokenizer: WhiteSpace.Instance);

Create a Tiktoken tokenizer using a factory method:

    const string IMStart = "<|im_start|>";   // assumed values for these chat-markup special tokens
    const string IMEnd = "<|im_end|>";
    Dictionary<string, int> specialTokens = new Dictionary<string, int> { { IMStart, 100264 }, { IMEnd, 100265 } };
    Tokenizer tokenizer = Tokenizer.CreateTiktokenForModel("gpt-4", specialTokens);

Encode to Ids:

    IReadOnlyList<int> encoded = tokenizer.EncodeToIds("Hello World");

Count tokens:

    int idsCount = tokenizer.CountTokens("Hello World");

Full encoding:

    // APIs that return information tied to the input or normalized text usually have an out parameter
    // normalizedString, which is null if no normalization was performed.
    // Each Token contains the string token, the token ID, and the offset mapping the token to the input or normalized text.
    IReadOnlyList<Token> result = tokenizer.Encode(text, out string? normalizedString);

Count tokens up to max token count:

    int length = tokenizer.IndexOfTokenCount(text, maxTokenCount: 10, out string? normalizedString, out int tokenCount);

    int index = tokenizer.LastIndexOfTokenCount(text, maxTokenCount: 3, out normalizedString, out tokenCount);

Decoding Ids back to string:

    string decodedText = tokenizer.Decode(idsArray);

Map string token to Id and vice versa:

    int? id = tokenizer.MapTokenToId("Hello");

    string? token = tokenizer.MapIdToToken(id);

Proposal:

Namespace

    namespace Microsoft.ML.Tokenizers

Tokenizer Abstraction

    public abstract partial class Tokenizer
    {
        protected Tokenizer() { }

        public virtual Normalizer? Normalizer { get { throw null; } }

        public virtual PreTokenizer? PreTokenizer { get { throw null; } }

        public virtual IReadOnlyList<int> EncodeToIds(string text, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public abstract IReadOnlyList<int> EncodeToIds(ReadOnlySpan<char> text, bool considerPreTokenization = true, bool considerNormalization = true);

        public virtual IReadOnlyList<int> EncodeToIds(string text, int maxTokenCount, out string? normalizedString, out int textLength, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public abstract IReadOnlyList<int> EncodeToIds(ReadOnlySpan<char> text, int maxTokenCount, out string? normalizedString, out int textLength, bool considerPreTokenization = true, bool considerNormalization = true);

        public virtual int CountTokens(string text, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public abstract int CountTokens(ReadOnlySpan<char> text, bool considerPreTokenization = true, bool considerNormalization = true);

        public virtual IReadOnlyList<Token> Encode(string text, out string? normalizedString, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public abstract IReadOnlyList<Token> Encode(ReadOnlySpan<char> text, out string? normalizedString, bool considerPreTokenization = true, bool considerNormalization = true);

        public virtual int IndexOfTokenCount(string text, int maxTokenCount, out string? normalizedString, out int tokenCount, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public abstract int IndexOfTokenCount(ReadOnlySpan<char> text, int maxTokenCount, out string? normalizedString, out int tokenCount, bool considerPreTokenization = true, bool considerNormalization = true);

        public virtual int LastIndexOfTokenCount(string text, int maxTokenCount, out string? normalizedString, out int tokenCount, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public abstract int LastIndexOfTokenCount(ReadOnlySpan<char> text, int maxTokenCount, out string? normalizedString, out int tokenCount, bool considerPreTokenization = true, bool considerNormalization = true);

        public virtual string? Decode(IEnumerable<int> ids) { throw null; }

        public virtual int? MapTokenToId(string token) { throw null; }
        public abstract int? MapTokenToId(ReadOnlySpan<char> token);

        public abstract string? MapIdToToken(int? id);

        //
        // Factory methods
        //

        public static Task<Tokenizer> CreateTiktokenAsync(Stream vocabStream, PreTokenizer? preTokenizer, Normalizer? normalizer, IReadOnlyDictionary<string, int>? specialTokens = null,
                                                          int cacheSize = 8192, CancellationToken cancellationToken = default) { throw null; }

        public static Task<Tokenizer> CreateTiktokenAsync(string vocabFilePath, PreTokenizer? preTokenizer, Normalizer? normalizer, IReadOnlyDictionary<string, int>? specialTokensEncoder = null,
                                                          int cacheSize = 8192, CancellationToken cancellationToken = default) { throw null; }

        public static Tokenizer CreateTiktokenForEncoding(string encodingName, IReadOnlyDictionary<string, int>? extraSpecialTokens = null, Normalizer? normalizer = null) { throw null; }

        public static Tokenizer CreateTiktokenForModel(string modelName, IReadOnlyDictionary<string, int>? extraSpecialTokens = null, Normalizer? normalizer = null) { throw null; }

        public static Tokenizer CreateTiktokenForModel(string modelName, Stream vocabStream, IReadOnlyDictionary<string, int>? extraSpecialTokens = null,
                                                       int cacheSize = 8192, Normalizer? normalizer = null) { throw null; }

        public static Task<Tokenizer> CreateTiktokenForModelAsync(string modelName, Stream vocabStream, IReadOnlyDictionary<string, int>? extraSpecialTokens = null,
                                                                  int cacheSize = 8192, Normalizer? normalizer = null, CancellationToken cancellationToken = default) { throw null; }

        public static Tokenizer CreateLlama(Stream modelStream, bool addBeginOfSentence = true, bool addEndOfSentence = false) { throw null; }

        public static Tokenizer CreateCodeGen(Stream vocabStream, Stream mergesStream, bool addPrefixSpace = false, bool addBeginOfSentence = false, bool addEndOfSentence = false) { throw null; }

        public static Tokenizer CreatePhi2(Stream vocabStream, Stream mergesStream, bool addPrefixSpace = false, bool addBeginOfSentence = false, bool addEndOfSentence = false) { throw null; }
    }
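
To make the abstract contract concrete, here is a hypothetical toy implementation (one token per character, with the character's code point as its ID); it assumes the string-based virtual overloads delegate to the span-based abstract ones by default:

    public sealed class CharTokenizer : Tokenizer
    {
        public override IReadOnlyList<int> EncodeToIds(ReadOnlySpan<char> text, bool considerPreTokenization = true, bool considerNormalization = true)
        {
            var ids = new List<int>(text.Length);
            foreach (char c in text) ids.Add(c);                 // char widens implicitly to int
            return ids;
        }

        public override IReadOnlyList<int> EncodeToIds(ReadOnlySpan<char> text, int maxTokenCount, out string? normalizedString, out int textLength, bool considerPreTokenization = true, bool considerNormalization = true)
        {
            normalizedString = null;                             // no normalizer configured
            textLength = Math.Min(text.Length, maxTokenCount);   // one token per character
            return EncodeToIds(text.Slice(0, textLength));
        }

        public override int CountTokens(ReadOnlySpan<char> text, bool considerPreTokenization = true, bool considerNormalization = true)
            => text.Length;

        public override IReadOnlyList<Token> Encode(ReadOnlySpan<char> text, out string? normalizedString, bool considerPreTokenization = true, bool considerNormalization = true)
        {
            normalizedString = null;
            var tokens = new List<Token>(text.Length);
            for (int i = 0; i < text.Length; i++)
                tokens.Add(new Token(text[i], text[i].ToString(), (i, 1)));
            return tokens;
        }

        public override int IndexOfTokenCount(ReadOnlySpan<char> text, int maxTokenCount, out string? normalizedString, out int tokenCount, bool considerPreTokenization = true, bool considerNormalization = true)
        {
            normalizedString = null;
            tokenCount = Math.Min(text.Length, maxTokenCount);
            return tokenCount;                                   // length of text covered from the start
        }

        public override int LastIndexOfTokenCount(ReadOnlySpan<char> text, int maxTokenCount, out string? normalizedString, out int tokenCount, bool considerPreTokenization = true, bool considerNormalization = true)
        {
            normalizedString = null;
            tokenCount = Math.Min(text.Length, maxTokenCount);
            return text.Length - tokenCount;                     // start index of the trailing window
        }

        public override int? MapTokenToId(ReadOnlySpan<char> token)
            => token.Length == 1 ? token[0] : (int?)null;

        public override string? MapIdToToken(int? id)
            => id is int i ? ((char)i).ToString() : null;
    }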

Normalization abstraction

    public abstract partial class Normalizer
    {
        protected Normalizer() { }

        public abstract string Normalize(string original);
        public abstract string Normalize(ReadOnlySpan<char> original);
    }
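
A hypothetical custom normalizer sketch that collapses runs of whitespace (it assumes the usual System.Text.RegularExpressions using):

    public sealed class CollapseWhitespaceNormalizer : Normalizer
    {
        public override string Normalize(string original)
            => Regex.Replace(original, @"\s+", " ");

        public override string Normalize(ReadOnlySpan<char> original)
            => Normalize(original.ToString());   // materialize the span for the regex
    }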

Pre-tokenization abstraction

    public abstract partial class PreTokenizer
    {
        protected PreTokenizer() { }

        public abstract IEnumerable<(int, int)> PreTokenize(string text);
        public abstract IEnumerable<(int, int)> PreTokenize(ReadOnlySpan<char> text);
    }
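
A hypothetical custom pre-tokenizer sketch, again assuming each (int, int) tuple is an (Index, Length) pair into the text, mirroring Token.Offset below:

    public sealed class RegexPreTokenizer : PreTokenizer
    {
        private readonly Regex _regex;

        public RegexPreTokenizer(Regex regex) => _regex = regex;

        public override IEnumerable<(int, int)> PreTokenize(string text)
        {
            foreach (Match match in _regex.Matches(text))
                yield return (match.Index, match.Length);
        }

        public override IEnumerable<(int, int)> PreTokenize(ReadOnlySpan<char> text)
            => PreTokenize(text.ToString());     // iterators cannot capture spans
    }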

Token class

   // returned from Tokenizer.Encode(...)
   
    public readonly struct Token
    {
        public Token(int id, string value, (int, int) offset) { }

        public int Id { get { throw null; } }

        public (int Index, int Length) Offset { get { throw null; } }

        public string Value { get { throw null; } }
    }
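
Because Offset is an (Index, Length) pair into the input (or normalized) text, every token can be mapped back to the exact substring it covers:

    IReadOnlyList<Token> tokens = tokenizer.Encode(text, out string? normalizedString);
    string source = normalizedString ?? text;

    foreach (Token token in tokens)
    {
        string slice = source.Substring(token.Offset.Index, token.Offset.Length);
        Console.WriteLine($"{token.Id}\t'{token.Value}'\t'{slice}'");
    }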

Concrete Normalizers

    public sealed partial class LowerCaseNormalizer : Normalizer
    {
        public override string Normalize(ReadOnlySpan<char> original) { throw null; }
        public override string Normalize(string original) { throw null; }
    }

    public sealed partial class UpperCaseNormalizer : Normalizer
    {
        public override string Normalize(ReadOnlySpan<char> original) { throw null; }

        public override string Normalize(string original) { throw null; }
    }
    
    public sealed partial class SentencePieceNormalizer : Normalizer
    {
        public SentencePieceNormalizer(bool removeExtraWhiteSpaces, bool addDummyPrefix, bool escapeWhiteSpaces, bool treatWhitespaceAsSuffix) { }
        public bool AddDummyPrefix { get { throw null; } }
        public bool EscapeWhiteSpaces { get { throw null; } }
        public bool RemoveExtraWhiteSpaces { get { throw null; } }
        public bool TreatWhitespaceAsSuffix { get { throw null; } }

        public override string Normalize(ReadOnlySpan<char> original) { throw null; }
        public override string Normalize(string original) { throw null; }
    }

Concrete Pre-tokenizers

    public sealed partial class TiktokenPreTokenizer : PreTokenizer
    {
        public TiktokenPreTokenizer(System.Text.RegularExpressions.Regex regex, IReadOnlyDictionary<string, int> specialTokensEncoder) { }

        public override IEnumerable<(int, int)> PreTokenize(string text) { throw null; }
        public override IEnumerable<(int, int)> PreTokenize(ReadOnlySpan<char> text) { throw null; }
    }

    public sealed partial class WhiteSpace : PreTokenizer
    {
        public static WhiteSpace Instance { get { throw null; } }

        public override IEnumerable<(int, int)> PreTokenize(string text) { throw null; }
        public override IEnumerable<(int, int)> PreTokenize(ReadOnlySpan<char> text) { throw null; }
    }

    public sealed partial class RobertaPreTokenizer : PreTokenizer
    {
        public static RobertaPreTokenizer Instance { get { throw null; } }

        public override IEnumerable<(int, int)> PreTokenize(string text) { throw null; }
        public override IEnumerable<(int, int)> PreTokenize(ReadOnlySpan<char> text) { throw null; }
    }

Concrete Tokenizer - Bpe

    public sealed partial class Bpe : Tokenizer
    {
        public Bpe(string vocabFile, string? mergesFile, PreTokenizer? preTokenizer = null, Normalizer? normalizer = null, string? unknownToken = null, 
                           string? continuingSubwordPrefix = null, string? endOfWordSuffix = null, bool? fuseUnknownTokens = false) { }

        public Bpe(Stream vocabStream, Stream? mergesStream, PreTokenizer? preTokenizer = null, Normalizer? normalizer = null, string? unknownToken = null, 
                          string? continuingSubwordPrefix = null, string? endOfWordSuffix = null, bool? fuseUnknownTokens = false) { }

        public string? ContinuingSubwordPrefix { get { throw null; } }

        public string? EndOfWordSuffix { get { throw null; } }

        public bool? FuseUnknownTokens { get { throw null; } }

        public string? UnknownToken { get { throw null; } }

        public IReadOnlyDictionary<string, int> Vocab { get { throw null; } }

        public string? Decode(IEnumerable<int> ids, bool considerSpecialTokens) { throw null; }

        public override Normalizer? Normalizer { get { throw null; } }
        public override PreTokenizer? PreTokenizer { get { throw null; } }
        public override int CountTokens(ReadOnlySpan<char> text, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override int CountTokens(string text, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override string? Decode(IEnumerable<int> ids) { throw null; }
        public override IReadOnlyList<Token> Encode(ReadOnlySpan<char> text, out string? normalizedString, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override IReadOnlyList<Token> Encode(string text, out string? normalizedString, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override IReadOnlyList<int> EncodeToIds(ReadOnlySpan<char> text, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override IReadOnlyList<int> EncodeToIds(ReadOnlySpan<char> text, int maxTokenCount, out string? normalizedString, out int textLength, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override IReadOnlyList<int> EncodeToIds(string text, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override IReadOnlyList<int> EncodeToIds(string text, int maxTokenCount, out string? normalizedString, out int textLength, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override int IndexOfTokenCount(ReadOnlySpan<char> text, int maxTokenCount, out string? normalizedString, out int tokenCount, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override int IndexOfTokenCount(string text, int maxTokenCount, out string? normalizedString, out int tokenCount, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override int LastIndexOfTokenCount(ReadOnlySpan<char> text, int maxTokenCount, out string? normalizedString, out int tokenCount, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override int LastIndexOfTokenCount(string text, int maxTokenCount, out string? normalizedString, out int tokenCount, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override string? MapIdToToken(int? id) { throw null; }
        public override int? MapTokenToId(ReadOnlySpan<char> token) { throw null; }
    }

Concrete Tokenizer - Tiktoken

    public sealed partial class Tiktoken : Tokenizer
    {
        public Tiktoken(Stream vocabStream, PreTokenizer? preTokenizer, IReadOnlyDictionary<string, int>? specialTokens = null, Normalizer? normalizer = null, int? cacheSize = 8192) { }

        public Tiktoken(string vocabFilePath, PreTokenizer? preTokenizer, IReadOnlyDictionary<string, int>? specialTokens = null, Normalizer? normalizer = null, int? cacheSize = 8192) { }

        public IReadOnlyDictionary<int, ReadOnlyMemory<byte>> Decoder { get { throw null; } }

        public IReadOnlyDictionary<ReadOnlyMemory<byte>, int> Encoder { get { throw null; } }

        public IReadOnlyDictionary<string, int> SpecialTokens { get { throw null; } }

        public IReadOnlyDictionary<string, int> Vocab { get { throw null; } }

        public override Normalizer? Normalizer { get { throw null; } }
        public override PreTokenizer? PreTokenizer { get { throw null; } }
        public override int CountTokens(ReadOnlySpan<char> text, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override int CountTokens(string text, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override string? Decode(IEnumerable<int> ids) { throw null; }
        public override IReadOnlyList<Token> Encode(ReadOnlySpan<char> text, out string? normalizedString, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override IReadOnlyList<Token> Encode(string text, out string? normalizedString, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override IReadOnlyList<int> EncodeToIds(ReadOnlySpan<char> text, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override IReadOnlyList<int> EncodeToIds(ReadOnlySpan<char> text, int maxTokenCount, out string? normalizedString, out int textLength, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override IReadOnlyList<int> EncodeToIds(string text, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override IReadOnlyList<int> EncodeToIds(string text, int maxTokenCount, out string? normalizedString, out int textLength, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override int IndexOfTokenCount(ReadOnlySpan<char> text, int maxTokenCount, out string? normalizedString, out int tokenCount, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override int IndexOfTokenCount(string text, int maxTokenCount, out string? normalizedString, out int tokenCount, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override int LastIndexOfTokenCount(ReadOnlySpan<char> text, int maxTokenCount, out string? normalizedString, out int tokenCount, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override int LastIndexOfTokenCount(string text, int maxTokenCount, out string? normalizedString, out int tokenCount, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override string? MapIdToToken(int? id) { throw null; }
        public override int? MapTokenToId(ReadOnlySpan<char> token) { throw null; }
    }
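
A common use of Tiktoken is enforcing a model's prompt budget; a sketch combining CountTokens and IndexOfTokenCount (the budget value and GetPrompt helper are hypothetical):

    Tokenizer tokenizer = Tokenizer.CreateTiktokenForModel("gpt-4");
    const int budget = 1024;

    string prompt = GetPrompt();
    if (tokenizer.CountTokens(prompt) > budget)
    {
        // Find how much of the text fits within the budget, then truncate there.
        int length = tokenizer.IndexOfTokenCount(prompt, maxTokenCount: budget,
                                                 out string? normalized, out int tokenCount);
        prompt = (normalized ?? prompt).Substring(0, length);
    }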

Concrete Tokenizer - EnglishRoberta

    public sealed partial class EnglishRoberta : Tokenizer
    {
        public EnglishRoberta(Stream vocabularyStream, Stream mergeStream, Stream highestOccurrenceMappingStream, PreTokenizer? preTokenizer, Normalizer? normalizer, bool filterUnsupportedChars, bool disposeStream) { }

        public EnglishRoberta(Stream vocabularyStream, Stream mergeStream, Stream highestOccurrenceMappingStream, PreTokenizer? preTokenizer = null, Normalizer? normalizer = null, bool filterUnsupportedChars = true) { }

        public EnglishRoberta(string vocabularyPath, string mergePath, string highestOccurrenceMappingPath, PreTokenizer? preTokenizer = null, Normalizer? normalizer = null, bool filterUnsupportedChars = true) { }

        public bool FilterUnsupportedChars { get { throw null; } }

        public int PadIndex { get { throw null; } }

        public int SymbolsCount { get { throw null; } }

        public IReadOnlyDictionary<string, int> Vocab { get { throw null; } }

        public int AddMaskSymbol(string mask = "<mask>") { throw null; }

        public IReadOnlyList<int> ConvertIdsToOccurrenceRanks(IReadOnlyList<int> ids) { throw null; }

        public IReadOnlyList<int> ConvertIdsToOccurrenceValues(IReadOnlyList<int> ids) { throw null; }

        public IReadOnlyList<int> ConvertOccurrenceRanksToIds(IReadOnlyList<int> ranks) { throw null; }

        public bool IsSupportedChar(char ch) { throw null; }

        public override Normalizer? Normalizer { get { throw null; } }
        public override PreTokenizer? PreTokenizer { get { throw null; } }
        public override int CountTokens(ReadOnlySpan<char> text, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override int CountTokens(string text, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override string? Decode(IEnumerable<int> ids) { throw null; }
        public override IReadOnlyList<Token> Encode(ReadOnlySpan<char> text, out string? normalizedString, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override IReadOnlyList<Token> Encode(string text, out string? normalizedString, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override IReadOnlyList<int> EncodeToIds(ReadOnlySpan<char> text, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override IReadOnlyList<int> EncodeToIds(ReadOnlySpan<char> text, int maxTokenCount, out string? normalizedString, out int textLength, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override IReadOnlyList<int> EncodeToIds(string text, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override IReadOnlyList<int> EncodeToIds(string text, int maxTokenCount, out string? normalizedString, out int textLength, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override int IndexOfTokenCount(ReadOnlySpan<char> text, int maxTokenCount, out string? normalizedString, out int tokenCount, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override int IndexOfTokenCount(string text, int maxTokenCount, out string? normalizedString, out int tokenCount, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override int LastIndexOfTokenCount(ReadOnlySpan<char> text, int maxTokenCount, out string? normalizedString, out int tokenCount, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override int LastIndexOfTokenCount(string text, int maxTokenCount, out string? normalizedString, out int tokenCount, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override string? MapIdToToken(int? id) { throw null; }
        public override int? MapTokenToId(ReadOnlySpan<char> token) { throw null; }
    }
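
Construction requires the three data files; a sketch with hypothetical file names, where the id-to-occurrence-rank conversions are assumed to round-trip:

    EnglishRoberta tokenizer = new EnglishRoberta("vocab.json", "merges.txt", "dict.txt",
                                                  RobertaPreTokenizer.Instance);

    IReadOnlyList<int> ids = tokenizer.EncodeToIds("Hello World");

    IReadOnlyList<int> ranks = tokenizer.ConvertIdsToOccurrenceRanks(ids);
    IReadOnlyList<int> roundTripped = tokenizer.ConvertOccurrenceRanksToIds(ranks);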

Concrete Tokenizer - CodeGen

    public sealed partial class CodeGen : Tokenizer
    {
        public CodeGen(string vocabularyPath, string mergePath, PreTokenizer? preTokenizer = null, Normalizer? normalizer = null, IReadOnlyDictionary<string, int>? addedTokens = null, 
                                     bool? addPrefixSpace = false, bool? addBeginningOfSentence = false, bool? addEndOfSentence = false, string? unknownToken = "<|endoftext|>", 
                                     string? beginningOfSentenceToken = "<|endoftext|>", string? endOfSentenceToken = "<|endoftext|>") { }

        public CodeGen(Stream vocabularyStream, Stream mergeStream, PreTokenizer? preTokenizer = null, Normalizer? normalizer = null, IReadOnlyDictionary<string, int>? addedTokens = null, 
                                    bool? addPrefixSpace = false, bool? addBeginningOfSentence = false, bool? addEndOfSentence = false, string? unknownToken = "<|endoftext|>", 
                                    string? beginningOfSentenceToken = "<|endoftext|>", string? endOfSentenceToken = "<|endoftext|>") { }

        public bool AddBeginningOfSentence { get { throw null; } }

        public IReadOnlyDictionary<string, int> AddedTokens { get { throw null; } }

        public bool AddEndOfSentence { get { throw null; } }

        public bool AddPrefixSpace { get { throw null; } }

        public int? BeginningOfSentenceId { get { throw null; } }

        public string? BeginningOfSentenceToken { get { throw null; } }

        public int? EndOfSentenceId { get { throw null; } }

        public string? EndOfSentenceToken { get { throw null; } }

        public string? UnknownToken { get { throw null; } }

        public int? UnknownTokenId { get { throw null; } }

        public IReadOnlyDictionary<string, int> Vocab { get { throw null; } }

        public IReadOnlyList<int> EncodeToIds(string text, bool addPrefixSpace, bool addBeginningOfSentence, bool addEndOfSentence, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }

        public IReadOnlyList<int> EncodeToIds(ReadOnlySpan<char> text, bool addPrefixSpace, bool addBeginningOfSentence, bool addEndOfSentence, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }

        public IReadOnlyList<int> EncodeToIds(string text, int maxTokenCount, bool addPrefixSpace, bool addBeginningOfSentence, bool addEndOfSentence, out string? normalizedString, 
                                                               out int textLength, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }

        public IReadOnlyList<int> EncodeToIds(ReadOnlySpan<char> text, int maxTokenCount, bool addPrefixSpace, bool addBeginningOfSentence, bool addEndOfSentence, out string? normalizedString, 
                                                               out int textLength, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }

        public int CountTokens(ReadOnlySpan<char> text, bool addPrefixSpace, bool addBeginningOfSentence, bool addEndOfSentence, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }

        public int CountTokens(string text, bool addPrefixSpace, bool addBeginningOfSentence, bool addEndOfSentence, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        
        public int IndexOfTokenCount(string text, int maxTokenCount, bool addPrefixSpace, bool addBeginningOfSentence, bool addEndOfSentence, out string? normalizedString, 
                                                               out int tokenCount, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }

        public int IndexOfTokenCount(ReadOnlySpan<char> text, int maxTokenCount, bool addPrefixSpace, bool addBeginningOfSentence, bool addEndOfSentence, out string? normalizedString, 
                                                                out int tokenCount, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }

        public int LastIndexOfTokenCount(string text, int maxTokenCount, bool addPrefixSpace, bool addBeginningOfSentence, bool addEndOfSentence, out string? normalizedString, 
                                                                out int tokenCount, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }

        public int LastIndexOfTokenCount(ReadOnlySpan<char> text, int maxTokenCount, bool addPrefixSpace, bool addBeginningOfSentence, bool addEndOfSentence, out string? normalizedString, 
                                                                 out int tokenCount, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        
        public IReadOnlyList<Token> Encode(ReadOnlySpan<char> text, bool addPrefixSpace, bool addBeginningOfSentence, bool addEndOfSentence, out string? normalizedString, 
                                                                 bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }

        public IReadOnlyList<Token> Encode(string text, bool addPrefixSpace, bool addBeginningOfSentence, bool addEndOfSentence, out string? normalizedString, 
                                                                  bool considerPreTokenization = true,  bool considerNormalization = true) { throw null; }

        public string? Decode(IEnumerable<int> ids, bool hasPrefixSpace, bool considerSpecialTokens) { throw null; }

        public override Normalizer? Normalizer { get { throw null; } }
        public override PreTokenizer? PreTokenizer { get { throw null; } }
        public override int CountTokens(ReadOnlySpan<char> text, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override int CountTokens(string text, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override string? Decode(IEnumerable<int> ids) { throw null; }
        public override IReadOnlyList<Token> Encode(ReadOnlySpan<char> text, out string? normalizedString, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override IReadOnlyList<Token> Encode(string text, out string? normalizedString, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override IReadOnlyList<int> EncodeToIds(ReadOnlySpan<char> text, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override IReadOnlyList<int> EncodeToIds(ReadOnlySpan<char> text, int maxTokenCount, out string? normalizedString, out int textLength, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override IReadOnlyList<int> EncodeToIds(string text, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override IReadOnlyList<int> EncodeToIds(string text, int maxTokenCount, out string? normalizedString, out int textLength, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override int IndexOfTokenCount(ReadOnlySpan<char> text, int maxTokenCount, out string? normalizedString, out int tokenCount, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override int IndexOfTokenCount(string text, int maxTokenCount, out string? normalizedString, out int tokenCount, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override int LastIndexOfTokenCount(ReadOnlySpan<char> text, int maxTokenCount, out string? normalizedString, out int tokenCount, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override int LastIndexOfTokenCount(string text, int maxTokenCount, out string? normalizedString, out int tokenCount, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override string? MapIdToToken(int? id) { throw null; }
        public override int? MapTokenToId(ReadOnlySpan<char> token) { throw null; }
    }
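
The per-call flags override the construction-time defaults; a sketch (stream variables as in the earlier examples):

    CodeGen tokenizer = new CodeGen(vocabStream, mergesStream);

    // Uses the defaults configured at construction time.
    IReadOnlyList<int> plain = tokenizer.EncodeToIds("Hello World");

    // Per-call override: prepend a space and a beginning-of-sentence token.
    IReadOnlyList<int> adjusted = tokenizer.EncodeToIds("Hello World",
                                                        addPrefixSpace: true,
                                                        addBeginningOfSentence: true,
                                                        addEndOfSentence: false);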

Concrete Tokenizer - SentencePiece

    public sealed partial class SentencePiece : Tokenizer
    {
        internal SentencePiece() { }

        public bool AddBeginningOfSentence { get { throw null; } }

        public bool AddDummyPrefix { get { throw null; } }

        public bool AddEndOfSentence { get { throw null; } }

        public int BeginningOfSentenceId { get { throw null; } }

        public string BeginningOfSentenceToken { get { throw null; } }

        public bool ByteFallback { get { throw null; } }

        public int EndOfSentenceId { get { throw null; } }

        public string EndOfSentenceToken { get { throw null; } }

        public bool EscapeWhiteSpaces { get { throw null; } }

        public bool TreatWhitespaceAsSuffix { get { throw null; } }

        public int UnknownId { get { throw null; } }

        public string UnknownToken { get { throw null; } }

        public IReadOnlyDictionary<string, int> Vocab { get { throw null; } }

        public int CountTokens(ReadOnlySpan<char> text, bool addBeginningOfSentence, bool addEndOfSentence, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }

        public int CountTokens(ReadOnlySpan<char> text, bool addBeginningOfSentence, bool addEndOfSentence, bool considerNormalization, out string? normalizedString, out int textLength, int maxTokenCount = int.MaxValue) { throw null; }

        public int CountTokens(string text, bool addBeginningOfSentence, bool addEndOfSentence, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }

        public IReadOnlyList<int> EncodeToIds(string text, bool addBeginningOfSentence, bool addEndOfSentence, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }

        public IReadOnlyList<int> EncodeToIds(ReadOnlySpan<char> text, bool addBeginningOfSentence, bool addEndOfSentence, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }

        public IReadOnlyList<int> EncodeToIds(ReadOnlySpan<char> text, bool addBeginningOfSentence, bool addEndOfSentence, bool considerNormalization, 
                                                            out string? normalizedString, out int textLength, int maxTokenCount = int.MaxValue) { throw null; }

        public IReadOnlyList<int> EncodeToIds(string text, bool addBeginningOfSentence, bool addEndOfSentence, int maxTokenCount, out string? normalizedString, 
                                                             out int textLength, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }

        public IReadOnlyList<int> EncodeToIds(ReadOnlySpan<char> text, bool addBeginningOfSentence, bool addEndOfSentence, int maxTokenCount, out string? normalizedString, 
                                                             out int textLength, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }

        public int IndexOfTokenCount(string text, bool addBeginningOfSentence, bool addEndOfSentence, int maxTokenCount, out string? normalizedString, 
                                                            out int tokenCount, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }

        public int IndexOfTokenCount(ReadOnlySpan<char> text, bool addBeginningOfSentence, bool addEndOfSentence, int maxTokenCount, out string? normalizedString, 
                                                             out int tokenCount, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }

        public int LastIndexOfTokenCount(ReadOnlySpan<char> text, int maxTokenCount, bool addBeginningOfSentence, bool addEndOfSentence, bool considerNormalization, 
                                                              out string? normalizedString, out int tokenCount) { throw null; }

        public IReadOnlyList<Token> Encode(string text, out string? normalizedString, bool addBeginningOfSentence, bool addEndOfSentence, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }

        public IReadOnlyList<Token> Encode(ReadOnlySpan<char> text, out string? normalizedString, bool addBeginningOfSentence, bool addEndOfSentence, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }

        public override Normalizer? Normalizer { get { throw null; } }
        public override PreTokenizer? PreTokenizer { get { throw null; } }
        public override int CountTokens(ReadOnlySpan<char> text, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override int CountTokens(string text, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override string? Decode(IEnumerable<int> ids) { throw null; }
        public override IReadOnlyList<Token> Encode(ReadOnlySpan<char> text, out string? normalizedString, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override IReadOnlyList<Token> Encode(string text, out string? normalizedString, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override IReadOnlyList<int> EncodeToIds(ReadOnlySpan<char> text, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override IReadOnlyList<int> EncodeToIds(ReadOnlySpan<char> text, int maxTokenCount, out string? normalizedString, out int textLength, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override IReadOnlyList<int> EncodeToIds(string text, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override IReadOnlyList<int> EncodeToIds(string text, int maxTokenCount, out string? normalizedString, out int textLength, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override int IndexOfTokenCount(ReadOnlySpan<char> text, int maxTokenCount, out string? normalizedString, out int tokenCount, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override int IndexOfTokenCount(string text, int maxTokenCount, out string? normalizedString, out int tokenCount, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override int LastIndexOfTokenCount(ReadOnlySpan<char> text, int maxTokenCount, out string? normalizedString, out int tokenCount, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override int LastIndexOfTokenCount(string text, int maxTokenCount, out string? normalizedString, out int tokenCount, bool considerPreTokenization = true, bool considerNormalization = true) { throw null; }
        public override string? MapIdToToken(int? id) { throw null; }
        public override int? MapTokenToId(ReadOnlySpan<char> token) { throw null; }
    }
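
SentencePiece exposes no public constructor; instances come from the CreateLlama factory (the cast below assumes CreateLlama returns a SentencePiece; the model file name is hypothetical):

    using Stream modelStream = File.OpenRead("tokenizer.model");
    SentencePiece tokenizer = (SentencePiece)Tokenizer.CreateLlama(modelStream);

    // CreateLlama defaults to addBeginOfSentence: true.
    IReadOnlyList<int> ids = tokenizer.EncodeToIds("Hello World");

    // Per-call override: encode without the beginning-of-sentence token.
    IReadOnlyList<int> noBos = tokenizer.EncodeToIds("Hello World",
                                                     addBeginningOfSentence: false,
                                                     addEndOfSentence: false);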
