
Conversation

tarekgh (Member) commented Oct 22, 2024

No description provided.

@tarekgh tarekgh added this to the ML.NET 4.0 milestone Oct 22, 2024
@tarekgh tarekgh self-assigned this Oct 22, 2024
@tarekgh tarekgh requested a review from michaelgsharp October 22, 2024 03:57
tarekgh (Member, Author) commented Oct 22, 2024

CC @luisquintanilla @ericstj

/// </summary>
/// <param name="specialTokensEncoder">The dictionary containing the special tokens and their corresponding ids.</param>
/// <returns>The pre-tokenizer that splits the text at the word boundary.</returns>
public static PreTokenizer CreateWordOrNonWordPreTokenizer(IReadOnlyDictionary<string, int>? specialTokensEncoder = null)
Contributor:

In this case, NonWord: are we using that because it's a well-known regex name for anything that's not a word?

tarekgh (Member, Author):

This is the best name I came up with. The regex pattern used there can break between words (like at whitespace) and break on non-words too (like punctuation and delimiters). If you have a better name we can use, that would be fantastic.
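
For illustration, a minimal sketch of the kind of pattern described here; the exact regex used in the PR may differ:

```csharp
using System;
using System.Text.RegularExpressions;

// Matches either a run of word characters or a run of non-word,
// non-whitespace characters, so punctuation becomes its own token:
// "Hello, world!" -> ["Hello", ",", "world", "!"].
Regex wordOrNonWord = new Regex(@"\w+|[^\w\s]+", RegexOptions.Compiled);

foreach (Match m in wordOrNonWord.Matches("Hello, world!"))
{
    Console.WriteLine($"{m.Index}: {m.Value}");
}
```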

Contributor:

WordBoundaryPreTokenizer?

Contributor:

Tarek and I chatted offline. In regex lingo, "word boundary" has a specific meaning. Given what this PreTokenizer is intended to do, the name, while not great, is probably the "best" one.
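
For context on that regex meaning: `\b` is a zero-width assertion at a transition between a word character and a non-word character, which is different from extracting word-or-non-word runs. A small sketch:

```csharp
using System;
using System.Text.RegularExpressions;

// Regex.Split on \b cuts at every word/non-word transition and keeps the
// separators, yielding ["", "Hello", ", ", "world", "!"] for this input.
string[] parts = Regex.Split("Hello, world!", @"\b");
Console.WriteLine(string.Join("|", parts));
```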

codecov bot commented Oct 22, 2024

Codecov Report

Attention: Patch coverage is 84.06633% with 221 lines in your changes missing coverage. Please review.

Project coverage is 68.89%. Comparing base (f385b06) to head (e78d834).
Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...icrosoft.ML.Tokenizers/Model/WordPieceTokenizer.cs | 75.05% | 82 Missing and 28 partials ⚠️ |
| src/Microsoft.ML.Tokenizers/Model/BertTokenizer.cs | 74.19% | 58 Missing and 14 partials ⚠️ |
| ...crosoft.ML.Tokenizers/Normalizer/BertNormalizer.cs | 66.66% | 26 Missing and 9 partials ⚠️ |
| ...crosoft.ML.Tokenizers/PreTokenizer/PreTokenizer.cs | 78.57% | 2 Missing and 1 partial ⚠️ |
| src/Microsoft.ML.Tokenizers/EncodedToken.cs | 0.00% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main    #7275      +/-   ##
==========================================
+ Coverage   68.80%   68.89%   +0.08%
==========================================
  Files        1461     1466       +5
  Lines      272400   273778    +1378
  Branches    28176    28349     +173
==========================================
+ Hits       187436   188606    +1170
- Misses      77729    77887     +158
- Partials     7235     7285      +50
```
| Flag | Coverage Δ |
|---|---|
| Debug | 68.89% <84.06%> (+0.08%) ⬆️ |
| production | 63.34% <73.78%> (+0.04%) ⬆️ |
| test | 89.17% <100.00%> (+0.10%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Files with missing lines | Coverage Δ |
|---|---|
| src/Microsoft.ML.Tokenizers/Model/BPETokenizer.cs | 76.75% <100.00%> (ø) |
| ...icrosoft.ML.Tokenizers.Tests/BertTokenizerTests.cs | 100.00% <100.00%> (ø) |
| test/Microsoft.ML.Tokenizers.Tests/BpeTests.cs | 100.00% <100.00%> (ø) |
| ...Microsoft.ML.Tokenizers.Tests/PreTokenizerTests.cs | 92.00% <100.00%> (+0.69%) ⬆️ |
| ...st/Microsoft.ML.Tokenizers.Tests/WordPieceTests.cs | 100.00% <100.00%> (ø) |
| src/Microsoft.ML.Tokenizers/EncodedToken.cs | 88.88% <0.00%> (-11.12%) ⬇️ |
| ...crosoft.ML.Tokenizers/PreTokenizer/PreTokenizer.cs | 92.68% <78.57%> (-7.32%) ⬇️ |
| ...crosoft.ML.Tokenizers/Normalizer/BertNormalizer.cs | 66.66% <66.66%> (ø) |
| src/Microsoft.ML.Tokenizers/Model/BertTokenizer.cs | 74.19% <74.19%> (ø) |
| ...icrosoft.ML.Tokenizers/Model/WordPieceTokenizer.cs | 75.05% <75.05%> (ø) |

... and 6 files with indirect coverage changes

tarekgh (Member, Author) commented Oct 22, 2024

/ba-g the failures are related to libomp, which is a known issue that @michaelgsharp is currently working on fixing.

@tarekgh tarekgh merged commit 81122c4 into dotnet:main Oct 22, 2024
22 of 25 checks passed
/// The implementation of the BertTokenizer is based on the original Bert implementation in the Hugging Face Transformers library.
/// https://huggingface.co/transformers/v3.0.2/model_doc/bert.html?highlight=berttokenizerfast#berttokenizer
/// </remarks>
public sealed partial class BertTokenizer : WordPieceTokenizer
Member:

How does this compare in behavior / performance to https://www.nuget.org/packages/FastBertTokenizer ?

tarekgh (Member, Author):

I didn't compare the perf, but FastBertTokenizer is customized to the Bert tokenizer only, while our implementation conforms to our abstract interfaces, which I expect adds some overhead. In general, though, our implementation should be good enough for common scenarios.

/// <summary>
/// Gets a value indicating whether the tokenizer should lowercase the input text.
/// </summary>
public bool DoLowerCase { get; }
Member:

Have we taken any of these tokenizers through API review? I suspect we'd prefer different names for some of these properties, for example.

tarekgh (Member, Author):

I'll try to find out if I can get a slot soon to have it reviewed.

}

// Add 2 for [CLS] and [SEP] tokens. Add 1 for [SEP] token if tokenIds1 is not null.
int capacity = tokenIds0.Count() + 2 + (tokenIds1 is null ? 0 : tokenIds1.Count() + 1);
Member:

These Count() calls could end up fully enumerating the source IEnumerable, which is problematic for two reasons. First, there's the performance overhead of doing that enumeration, which is likely to swamp any benefits from setting the capacity. But more important, second, there's no guarantee that the second enumeration is going to produce the same sequence as the first enumeration.
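
As a hedged sketch (not the PR's final fix), one way to guarantee a single enumeration is to materialize the sequence up front; the helper name here is illustrative:

```csharp
using System.Collections.Generic;
using System.Linq;

static class EnumerableSnapshot
{
    // Reuse an existing in-memory collection, otherwise snapshot it once,
    // so the capacity computation and the copy loop see the same elements.
    public static IReadOnlyList<int> Materialize(IEnumerable<int> source) =>
        source as IReadOnlyList<int> ?? source.ToArray();
}

// Usage at the top of the method (tokenIds0/tokenIds1 as in the snippet above):
// IReadOnlyList<int> ids0 = EnumerableSnapshot.Materialize(tokenIds0);
// IReadOnlyList<int>? ids1 = tokenIds1 is null ? null : EnumerableSnapshot.Materialize(tokenIds1);
// int capacity = ids0.Count + 2 + (ids1 is null ? 0 : ids1.Count + 1);
```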

{
    foreach (int id in tokenIds1)
    {
        buffer[written++] = id;
Member:

For example, if this second enumeration of tokenIds1 produces fewer elements than in the capacity computation above, this will likely result in an IndexOutOfRangeException.

return OperationStatus.DestinationTooSmall;
}

for (int i = 0; i < tokenIds0.Count() + 2; i++) // Add 2 for [CLS] and [SEP] tokens.
Member:

This could be O(N^2) if tokenIds0 isn't ICollection.

All of the uses of Count() in this PR need to be revisited.
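
A minimal sketch of the loop fix, under the same snapshot assumption as above (`ids0` stands in for the raw enumerable):

```csharp
// Hoist the bound out of the loop condition; Enumerable.Count() is O(N)
// per call when the source isn't an ICollection<T>, making the loop O(N^2).
int total = ids0.Count + 2; // Add 2 for [CLS] and [SEP] tokens.
for (int i = 0; i < total; i++)
{
    // ...
}
```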

tarekgh (Member, Author):

I'll revisit all Count calls and fix them.

Comment on lines +557 to +566
    bool doLowerCase = true,
    bool doBasicTokenization = true,
    bool splitOnSpecialTokens = true,
    string unknownToken = "[UNK]",
    string sepToken = "[SEP]",
    string padToken = "[PAD]",
    string clsToken = "[CLS]",
    string maskToken = "[MASK]",
    bool tokenizeChineseChars = true,
    bool stripAccents = false) =>
Member:

This is a lot of optional parameters. Do we expect there to ever be more? I'm wondering if we should have an options bag type.
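
A hypothetical sketch of what such an options bag could look like; this type is not part of the PR:

```csharp
// Hypothetical options bag: gathers the optional parameters so future
// settings can be added without growing the method signature further.
public sealed class BertTokenizerOptions
{
    public bool DoLowerCase { get; set; } = true;
    public bool DoBasicTokenization { get; set; } = true;
    public bool SplitOnSpecialTokens { get; set; } = true;
    public string UnknownToken { get; set; } = "[UNK]";
    public string SepToken { get; set; } = "[SEP]";
    public string PadToken { get; set; } = "[PAD]";
    public string ClsToken { get; set; } = "[CLS]";
    public string MaskToken { get; set; } = "[MASK]";
    public bool TokenizeChineseChars { get; set; } = true;
    public bool StripAccents { get; set; } = false;
}
```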

Dictionary<StringSpanOrdinalKey, int> vocab = new Dictionary<StringSpanOrdinalKey, int>();
Dictionary<int, string> vocabReverse = new Dictionary<int, string>();

StreamReader reader = new StreamReader(vocabStream);
Member:

This doesn't need to be disposed of? What are the ownership requirements of vocabStream?

tarekgh (Member, Author) commented Oct 23, 2024:

No, we cannot dispose the reader because doing so would dispose the stream, which we need to avoid. Ownership and disposal of the stream are handled by the method's callers (i.e., in the Create method).
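
For reference, the `leaveOpen` overload of `StreamReader` would allow disposing the reader without closing the caller-owned stream; a sketch, assuming it fits the supported target frameworks (`vocabStream` comes from the surrounding method):

```csharp
using System.IO;
using System.Text;

// leaveOpen: true disposes the reader's own resources but leaves
// vocabStream open, keeping stream ownership with the caller.
using (var reader = new StreamReader(vocabStream, Encoding.UTF8,
           detectEncodingFromByteOrderMarks: true, bufferSize: 1024, leaveOpen: true))
{
    // ... read vocabulary lines ...
}
```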

            try
            {
                (Dictionary<StringSpanOrdinalKey, int> vocab, Dictionary<int, string> vocabReverse) = await LoadVocabAsync(vocabStream, useAsync: true, cancellationToken);

                return new WordPieceTokenizer(vocab, vocabReverse, preTokenizer, normalizer, specialTokens, unknownToken, continuingSubwordPrefix, maxInputCharsPerWord);
            }
            finally
            {
                if (disposeStream)
                {
                    vocabStream.Dispose();
                }
            }

string continuingSubwordPrefix = DefaultContinuingSubwordPrefix,
int maxInputCharsPerWord = DefaultMaxInputCharsPerWord,
CancellationToken cancellationToken = default) =>
await CreateAsync(
Member:

This should either remove the async/await or it should add ConfigureAwait. Same with some of the other one-liner delegations.
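
Both suggested options, sketched against a hypothetical delegating overload (`CreateAsyncCore` is not a real method in the PR):

```csharp
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Option 1: elide async/await and return the inner task directly;
// no state machine is generated, and ConfigureAwait is unnecessary.
static Task<WordPieceTokenizer> CreateAsync(Stream vocabStream, CancellationToken cancellationToken) =>
    CreateAsyncCore(vocabStream, cancellationToken);

// Option 2: keep async/await but avoid resuming on a captured
// synchronization context, as library code generally should.
static async Task<WordPieceTokenizer> CreateAsyncAwaiting(Stream vocabStream, CancellationToken cancellationToken) =>
    await CreateAsyncCore(vocabStream, cancellationToken).ConfigureAwait(false);
```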

}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static void InsertChar(ref char[] buffer, ref int index, char c)
Member:

Generally when "Insert" is used, everything after the provided index is shifted down, but that's not happening here. Is this more of an Add or an Append rather than an Insert?
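
A minimal sketch of the helper's apparent behavior under an "append" reading (name and growth policy illustrative):

```csharp
using System;
using System.Runtime.CompilerServices;

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static void AppendChar(ref char[] buffer, ref int index, char c)
{
    // Writes at the current end and advances the index, growing the buffer
    // when full; nothing is shifted, so "Append" fits better than "Insert".
    if (index >= buffer.Length)
    {
        Array.Resize(ref buffer, Math.Max(4, buffer.Length * 2));
    }
    buffer[index++] = c;
}
```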

tarekgh (Member, Author) commented Oct 23, 2024

@stephentoub I'll address the feedback in another PR. Thanks!

@github-actions github-actions bot locked and limited conversation to collaborators Nov 24, 2024