Introducing WordPiece and Bert tokenizers #7275
Conversation
```csharp
/// </summary>
/// <param name="specialTokensEncoder">The dictionary containing the special tokens and their corresponding ids.</param>
/// <returns>The pre-tokenizer that splits the text at the word boundary.</returns>
public static PreTokenizer CreateWordOrNonWordPreTokenizer(IReadOnlyDictionary<string, int>? specialTokensEncoder = null)
```
In this case, NonWord — are we using that because it's a well-known regex name for anything that's not a word?
This is the best name I came up with. The regex pattern used there can break between words (like at whitespace) and can break on non-words too (like punctuation and delimiters). If you have a better name we can use, that would be fantastic.
WordBoundaryPreTokenizer?
Tarek and I chatted offline. In regex lingo, "word boundary" has a specific meaning. Given what this PreTokenizer is intended to do, while not great, it's probably the "best" name.
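(For illustration, a "word or non-word" pattern along the following lines matches runs of word characters and runs of punctuation as separate tokens, which is broader than the regex `\b` word-boundary assertion; the exact pattern used in the PR may differ.)

```csharp
using System;
using System.Text.RegularExpressions;

class WordOrNonWordDemo
{
    static void Main()
    {
        // Hypothetical pattern for illustration only: a run of word characters,
        // or a run of non-word, non-whitespace characters (punctuation etc.).
        Regex wordOrNonWord = new Regex(@"\w+|[^\w\s]+");

        foreach (Match m in wordOrNonWord.Matches("Don't split, please!"))
        {
            Console.WriteLine(m.Value); // Don  '  t  split  ,  please  !
        }
    }
}
```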
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #7275      +/-   ##
==========================================
+ Coverage   68.80%   68.89%   +0.08%
==========================================
  Files        1461     1466       +5
  Lines      272400   273778    +1378
  Branches    28176    28349     +173
==========================================
+ Hits       187436   188606    +1170
- Misses      77729    77887     +158
- Partials     7235     7285      +50
```

Flags with carried forward coverage won't be shown.
/ba-g the failures are regarding
```csharp
/// The implementation of the BertTokenizer is based on the original Bert implementation in the Hugging Face Transformers library.
/// https://huggingface.co/transformers/v3.0.2/model_doc/bert.html?highlight=berttokenizerfast#berttokenizer
/// </remarks>
public sealed partial class BertTokenizer : WordPieceTokenizer
```
How does this compare in behavior / performance to https://www.nuget.org/packages/FastBertTokenizer ?
I didn't compare the perf, but FastBertTokenizer is heavily customized to the Bert tokenizer only, while our implementation conforms to our abstract interfaces, which I expect adds some overhead. In general, though, our implementation should be good enough in common scenarios.
```csharp
/// <summary>
/// Gets a value indicating whether the tokenizer should lowercase the input text.
/// </summary>
public bool DoLowerCase { get; }
```
Have we taken any of these tokenizers through API review? I suspect we'd prefer different names for some of these properties, for example.
I'll try to find out if I can get a slot soon to have it reviewed.
```csharp
}

// Add 2 for [CLS] and [SEP] tokens. Add 1 for [SEP] token if tokenIds1 is not null.
int capacity = tokenIds0.Count() + 2 + (tokenIds1 is null ? 0 : tokenIds1.Count() + 1);
```
These Count() calls could end up fully enumerating the source IEnumerable, which is problematic for two reasons. First, there's the performance overhead of doing that enumeration, which is likely to swamp any benefits from setting the capacity. Second, and more important, there's no guarantee that the second enumeration will produce the same sequence as the first.
```csharp
{
    foreach (int id in tokenIds1)
    {
        buffer[written++] = id;
```
For example, if this second enumeration of tokenIds1 produces more elements than the capacity computation above accounted for, this will likely result in an IndexOutOfRangeException.
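A sketch of one possible fix, with invented names rather than the PR's actual code: snapshot each sequence exactly once, so the capacity computation and the copy loop cannot disagree about the element count.

```csharp
using System.Collections.Generic;
using System.Linq;

static class TokenIdSketch
{
    // Illustrative only: ids0/ids1 are materialized once, then reused for both
    // sizing the buffer and filling it, so the two can never get out of sync.
    public static int[] BuildInputIds(IEnumerable<int> tokenIds0, IEnumerable<int>? tokenIds1, int clsId, int sepId)
    {
        int[] ids0 = tokenIds0 as int[] ?? tokenIds0.ToArray();
        int[]? ids1 = tokenIds1 as int[] ?? tokenIds1?.ToArray();

        // 2 extra slots for [CLS] and [SEP]; 1 more [SEP] after the second sequence.
        int[] buffer = new int[ids0.Length + 2 + (ids1 is null ? 0 : ids1.Length + 1)];
        int written = 0;

        buffer[written++] = clsId;
        foreach (int id in ids0) buffer[written++] = id;
        buffer[written++] = sepId;

        if (ids1 is not null)
        {
            foreach (int id in ids1) buffer[written++] = id;
            buffer[written++] = sepId;
        }

        return buffer;
    }
}
```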
```csharp
    return OperationStatus.DestinationTooSmall;
}

for (int i = 0; i < tokenIds0.Count() + 2; i++) // Add 2 for [CLS] and [SEP] tokens.
```
This could be O(N^2) if tokenIds0 isn't an ICollection.
All of the uses of Count() in this PR need to be revisited.
I'll revisit all Count calls and fix them.
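For reference, a minimal sketch of the kind of fix being discussed (not the PR's code): hoist the Count() call out of the loop condition, so a lazy sequence is enumerated once rather than on every iteration.

```csharp
using System.Collections.Generic;
using System.Linq;

static class LoopSketch
{
    // When the source isn't an ICollection<int>, Count() walks the whole
    // sequence; calling it in the loop condition makes the loop O(N^2).
    public static void Process(IEnumerable<int> tokenIds0)
    {
        int total = tokenIds0.Count() + 2; // add 2 for [CLS] and [SEP]; one enumeration, not N
        for (int i = 0; i < total; i++)
        {
            // ... per-position work ...
        }
    }
}
```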
```csharp
bool doLowerCase = true,
bool doBasicTokenization = true,
bool splitOnSpecialTokens = true,
string unknownToken = "[UNK]",
string sepToken = "[SEP]",
string padToken = "[PAD]",
string clsToken = "[CLS]",
string maskToken = "[MASK]",
bool tokenizeChineseChars = true,
bool stripAccents = false) =>
```
This is a lot of optional parameters. Do we expect there to ever be more? I'm wondering if we should have an options bag type.
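To make the suggestion concrete, an options bag might look something like the sketch below; the type name, property names, and the Create overload are hypothetical, not from API review.

```csharp
// Hypothetical shape only: the optional parameters become properties with the
// same defaults, and new settings can be added later without growing a long
// parameter list or breaking existing callers.
public sealed class BertTokenizerOptions
{
    public bool DoLowerCase { get; set; } = true;
    public bool DoBasicTokenization { get; set; } = true;
    public bool SplitOnSpecialTokens { get; set; } = true;
    public string UnknownToken { get; set; } = "[UNK]";
    public string SepToken { get; set; } = "[SEP]";
    public string PadToken { get; set; } = "[PAD]";
    public string ClsToken { get; set; } = "[CLS]";
    public string MaskToken { get; set; } = "[MASK]";
    public bool TokenizeChineseChars { get; set; } = true;
    public bool StripAccents { get; set; } = false;
}

// Callers would then set only what they need, e.g.:
// var tokenizer = BertTokenizer.Create(vocabStream, new BertTokenizerOptions { StripAccents = true });
```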
```csharp
Dictionary<StringSpanOrdinalKey, int> vocab = new Dictionary<StringSpanOrdinalKey, int>();
Dictionary<int, string> vocabReverse = new Dictionary<int, string>();

StreamReader reader = new StreamReader(vocabStream);
```
This doesn't need to be disposed of? What are the ownership requirements of vocabStream?
No, we cannot dispose the reader because doing so would dispose the stream, which we need to avoid. Ownership and disposal of the stream are handled by the callers of this method (i.e., in the Create method):
```csharp
try
{
    (Dictionary<StringSpanOrdinalKey, int> vocab, Dictionary<int, string> vocabReverse) = await LoadVocabAsync(vocabStream, useAsync: true, cancellationToken);
    return new WordPieceTokenizer(vocab, vocabReverse, preTokenizer, normalizer, specialTokens, unknownToken, continuingSubwordPrefix, maxInputCharsPerWord);
}
finally
{
    if (disposeStream)
    {
        vocabStream.Dispose();
    }
}
```

```csharp
string continuingSubwordPrefix = DefaultContinuingSubwordPrefix,
int maxInputCharsPerWord = DefaultMaxInputCharsPerWord,
CancellationToken cancellationToken = default) =>
await CreateAsync(
```
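As an aside on the ownership question above: StreamReader also offers a constructor overload with a leaveOpen flag, which allows disposing the reader without closing the underlying stream. A minimal sketch, with illustrative surrounding code:

```csharp
using System.IO;
using System.Text;

static class VocabReaderSketch
{
    public static void ReadVocab(Stream vocabStream)
    {
        // leaveOpen: true releases the reader's own resources but leaves
        // vocabStream open, so the caller keeps ownership of the stream.
        using (var reader = new StreamReader(vocabStream, Encoding.UTF8,
                   detectEncodingFromByteOrderMarks: true, bufferSize: 1024, leaveOpen: true))
        {
            string? line;
            while ((line = reader.ReadLine()) != null)
            {
                // ... add line to the vocab dictionaries ...
            }
        }
    }
}
```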
This should either remove the async/await or it should add ConfigureAwait. Same with some of the other one-liner delegations.
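For illustration, the two options look roughly like this; the signatures and bodies below are simplified stand-ins, not the PR's actual API.

```csharp
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class TokenizerFactorySketch
{
    // Option 1: drop async/await and return the inner task directly; no async
    // state machine is created and no synchronization context is captured.
    public static Task<string> CreateFromFileAsync(string vocabFilePath, CancellationToken cancellationToken = default) =>
        CreateAsync(File.OpenRead(vocabFilePath), cancellationToken);

    // Option 2: keep await (e.g., when cleanup must run afterwards) and add
    // ConfigureAwait(false) so library code doesn't resume on the caller's
    // synchronization context.
    public static async Task<string> CreateAsync(Stream vocabStream, CancellationToken cancellationToken = default)
    {
        using var reader = new StreamReader(vocabStream);
        return await reader.ReadToEndAsync().ConfigureAwait(false);
    }
}
```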
```csharp
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static void InsertChar(ref char[] buffer, ref int index, char c)
```
Generally when "Insert" is used, everything after the provided index is shifted down, but that's not happening here. Is this more of an Add or an Append rather than an Insert?
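A sketch of what the suggested rename might look like; the grow-on-demand logic here is an assumption for illustration, not taken from the PR.

```csharp
using System;
using System.Runtime.CompilerServices;

static class BufferSketch
{
    // "Append" matches the behavior: the char is written at the current end
    // index and nothing is shifted, unlike a conventional Insert.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static void AppendChar(ref char[] buffer, ref int index, char c)
    {
        if (index >= buffer.Length)
        {
            Array.Resize(ref buffer, Math.Max(4, buffer.Length * 2)); // grow before writing
        }
        buffer[index++] = c;
    }
}
```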
@stephentoub I'll address the feedback in another PR. Thanks!