Introducing WordPiece and Bert tokenizers #7275
Conversation
```csharp
/// </summary>
/// <param name="specialTokensEncoder">The dictionary containing the special tokens and their corresponding ids.</param>
/// <returns>The pre-tokenizer that splits the text at the word boundary.</returns>
public static PreTokenizer CreateWordOrNonWordPreTokenizer(IReadOnlyDictionary<string, int>? specialTokensEncoder = null)
```
In this case, NonWord — are we using that because it's a well-known regex name for anything that's not a word?
This is the best name I came up with. The regex pattern used there can break between words (like at whitespace) and can break on non-words too (like punctuation and delimiters). If you have a better name we can use, that would be fantastic.
WordBoundaryPreTokenizer?
Tarek and I chatted offline. In regex lingo, "word boundary" has a specific meaning. Given what this PreTokenizer is intended to do, while not great, it's probably the "best" name.
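(For illustration, a "word or non-word" pattern along the following lines matches runs of word characters and runs of punctuation as separate tokens, which is broader than the regex `\b` word-boundary assertion; the exact pattern used in the PR may differ.)

```csharp
using System;
using System.Text.RegularExpressions;

class WordOrNonWordDemo
{
    static void Main()
    {
        // Hypothetical pattern for illustration only: a run of word characters,
        // or a run of non-word, non-whitespace characters (punctuation etc.).
        Regex wordOrNonWord = new Regex(@"\w+|[^\w\s]+");

        foreach (Match m in wordOrNonWord.Matches("Don't split, please!"))
        {
            Console.WriteLine(m.Value); // Don  '  t  split  ,  please  !
        }
    }
}
```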
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #7275      +/-   ##
==========================================
+ Coverage   68.80%   68.89%   +0.08%
==========================================
  Files        1461     1466       +5
  Lines      272400   273778    +1378
  Branches    28176    28349     +173
==========================================
+ Hits       187436   188606    +1170
- Misses      77729    77887     +158
- Partials     7235     7285      +50
```

Flags with carried forward coverage won't be shown.
/ba-g the failures are regarding
```csharp
/// The implementation of the BertTokenizer is based on the original Bert implementation in the Hugging Face Transformers library.
/// https://huggingface.co/transformers/v3.0.2/model_doc/bert.html?highlight=berttokenizerfast#berttokenizer
/// </remarks>
public sealed partial class BertTokenizer : WordPieceTokenizer
```
How does this compare in behavior / performance to https://www.nuget.org/packages/FastBertTokenizer ?
I didn't compare the perf, but FastBertTokenizer is heavily customized to the Bert tokenizer only, while our implementation conforms to our abstract interfaces, which I expect adds some overhead. In general, though, our implementation should be good enough in common scenarios.
```csharp
/// <summary>
/// Gets a value indicating whether the tokenizer should lowercase the input text.
/// </summary>
public bool DoLowerCase { get; }
```
Have we taken any of these tokenizers through API review? I suspect we'd prefer different names for some of these properties, for example.
I'll try to find out if I can get a slot soon to have it reviewed.
```csharp
}

// Add 2 for [CLS] and [SEP] tokens. Add 1 for [SEP] token if tokenIds1 is not null.
int capacity = tokenIds0.Count() + 2 + (tokenIds1 is null ? 0 : tokenIds1.Count() + 1);
```
These Count() calls could end up fully enumerating the source IEnumerable, which is problematic for two reasons. First, there's the performance overhead of doing that enumeration, which is likely to swamp any benefits from setting the capacity. Second, and more important, there's no guarantee that the second enumeration will produce the same sequence as the first.
```csharp
{
    foreach (int id in tokenIds1)
    {
        buffer[written++] = id;
```
For example, if this second enumeration of tokenIds1 produces more elements than the capacity computation above accounted for, this will likely result in an IndexOutOfRangeException.
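A sketch of one possible fix, with invented names rather than the PR's actual code: snapshot each sequence exactly once, so the capacity computation and the copy loop cannot disagree about the element count.

```csharp
using System.Collections.Generic;
using System.Linq;

static class TokenIdSketch
{
    // Illustrative only: ids0/ids1 are materialized once, then reused for both
    // sizing the buffer and filling it, so the two can never get out of sync.
    public static int[] BuildInputIds(IEnumerable<int> tokenIds0, IEnumerable<int>? tokenIds1, int clsId, int sepId)
    {
        int[] ids0 = tokenIds0 as int[] ?? tokenIds0.ToArray();
        int[]? ids1 = tokenIds1 as int[] ?? tokenIds1?.ToArray();

        // 2 extra slots for [CLS] and [SEP]; 1 more [SEP] after the second sequence.
        int[] buffer = new int[ids0.Length + 2 + (ids1 is null ? 0 : ids1.Length + 1)];
        int written = 0;

        buffer[written++] = clsId;
        foreach (int id in ids0) buffer[written++] = id;
        buffer[written++] = sepId;

        if (ids1 is not null)
        {
            foreach (int id in ids1) buffer[written++] = id;
            buffer[written++] = sepId;
        }

        return buffer;
    }
}
```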
```csharp
    return OperationStatus.DestinationTooSmall;
}

for (int i = 0; i < tokenIds0.Count() + 2; i++) // Add 2 for [CLS] and [SEP] tokens.
```
This could be O(N^2) if tokenIds0 isn't an ICollection.
All of the uses of Count() in this PR need to be revisited.
I'll revisit all Count calls and fix them.
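For reference, a minimal sketch of the kind of fix being discussed (not the PR's code): hoist the Count() call out of the loop condition, so a lazy sequence is enumerated once rather than on every iteration.

```csharp
using System.Collections.Generic;
using System.Linq;

static class LoopSketch
{
    // When the source isn't an ICollection<int>, Count() walks the whole
    // sequence; calling it in the loop condition makes the loop O(N^2).
    public static void Process(IEnumerable<int> tokenIds0)
    {
        int total = tokenIds0.Count() + 2; // add 2 for [CLS] and [SEP]; one enumeration, not N
        for (int i = 0; i < total; i++)
        {
            // ... per-position work ...
        }
    }
}
```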
```csharp
bool doLowerCase = true,
bool doBasicTokenization = true,
bool splitOnSpecialTokens = true,
string unknownToken = "[UNK]",
string sepToken = "[SEP]",
string padToken = "[PAD]",
string clsToken = "[CLS]",
string maskToken = "[MASK]",
bool tokenizeChineseChars = true,
bool stripAccents = false) =>
```
This is a lot of optional parameters. Do we expect there to ever be more? I'm wondering if we should have an options bag type.
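To make the suggestion concrete, an options bag might look something like the sketch below; the type name, property names, and the Create overload are hypothetical, not from API review.

```csharp
// Hypothetical shape only: the optional parameters become properties with the
// same defaults, and new settings can be added later without growing a long
// parameter list or breaking existing callers.
public sealed class BertTokenizerOptions
{
    public bool DoLowerCase { get; set; } = true;
    public bool DoBasicTokenization { get; set; } = true;
    public bool SplitOnSpecialTokens { get; set; } = true;
    public string UnknownToken { get; set; } = "[UNK]";
    public string SepToken { get; set; } = "[SEP]";
    public string PadToken { get; set; } = "[PAD]";
    public string ClsToken { get; set; } = "[CLS]";
    public string MaskToken { get; set; } = "[MASK]";
    public bool TokenizeChineseChars { get; set; } = true;
    public bool StripAccents { get; set; } = false;
}

// Callers would then set only what they need, e.g.:
// var tokenizer = BertTokenizer.Create(vocabStream, new BertTokenizerOptions { StripAccents = true });
```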
```csharp
Dictionary<StringSpanOrdinalKey, int> vocab = new Dictionary<StringSpanOrdinalKey, int>();
Dictionary<int, string> vocabReverse = new Dictionary<int, string>();

StreamReader reader = new StreamReader(vocabStream);
```
This doesn't need to be disposed of? What are the ownership requirements of vocabStream?
No, we cannot dispose the reader because doing so would dispose the stream, which we need to avoid. Ownership and disposal of the stream are handled by the callers of this method (i.e., in the Create method):
```csharp
try
{
    (Dictionary<StringSpanOrdinalKey, int> vocab, Dictionary<int, string> vocabReverse) = await LoadVocabAsync(vocabStream, useAsync: true, cancellationToken);
    return new WordPieceTokenizer(vocab, vocabReverse, preTokenizer, normalizer, specialTokens, unknownToken, continuingSubwordPrefix, maxInputCharsPerWord);
}
finally
{
    if (disposeStream)
    {
        vocabStream.Dispose();
    }
}
```

```csharp
string continuingSubwordPrefix = DefaultContinuingSubwordPrefix,
int maxInputCharsPerWord = DefaultMaxInputCharsPerWord,
CancellationToken cancellationToken = default) =>
await CreateAsync(
```
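As an aside on the ownership question above: StreamReader also offers a constructor overload with a leaveOpen flag, which allows disposing the reader without closing the underlying stream. A minimal sketch, with illustrative surrounding code:

```csharp
using System.IO;
using System.Text;

static class VocabReaderSketch
{
    public static void ReadVocab(Stream vocabStream)
    {
        // leaveOpen: true releases the reader's own resources but leaves
        // vocabStream open, so the caller keeps ownership of the stream.
        using (var reader = new StreamReader(vocabStream, Encoding.UTF8,
                   detectEncodingFromByteOrderMarks: true, bufferSize: 1024, leaveOpen: true))
        {
            string? line;
            while ((line = reader.ReadLine()) != null)
            {
                // ... add line to the vocab dictionaries ...
            }
        }
    }
}
```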
This should either remove the async/await or it should add ConfigureAwait. Same with some of the other one-liner delegations.
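For illustration, the two options look roughly like this; the signatures and bodies below are simplified stand-ins, not the PR's actual API.

```csharp
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class TokenizerFactorySketch
{
    // Option 1: drop async/await and return the inner task directly; no async
    // state machine is created and no synchronization context is captured.
    public static Task<string> CreateFromFileAsync(string vocabFilePath, CancellationToken cancellationToken = default) =>
        CreateAsync(File.OpenRead(vocabFilePath), cancellationToken);

    // Option 2: keep await (e.g., when cleanup must run afterwards) and add
    // ConfigureAwait(false) so library code doesn't resume on the caller's
    // synchronization context.
    public static async Task<string> CreateAsync(Stream vocabStream, CancellationToken cancellationToken = default)
    {
        using var reader = new StreamReader(vocabStream);
        return await reader.ReadToEndAsync().ConfigureAwait(false);
    }
}
```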
```csharp
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static void InsertChar(ref char[] buffer, ref int index, char c)
```
Generally when "Insert" is used, everything after the provided index is shifted down, but that's not happening here. Is this more of an Add or an Append rather than an Insert?
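A sketch of what the suggested rename might look like; the grow-on-demand logic here is an assumption for illustration, not taken from the PR.

```csharp
using System;
using System.Runtime.CompilerServices;

static class BufferSketch
{
    // "Append" matches the behavior: the char is written at the current end
    // index and nothing is shifted, unlike a conventional Insert.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static void AppendChar(ref char[] buffer, ref int index, char c)
    {
        if (index >= buffer.Length)
        {
            Array.Resize(ref buffer, Math.Max(4, buffer.Length * 2)); // grow before writing
        }
        buffer[index++] = c;
    }
}
```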
@stephentoub I'll address the feedback in another PR. Thanks!