Tokenizers Support #6272

Merged
merged 3 commits into from
Aug 10, 2022

tarekgh
Member

@tarekgh tarekgh commented Aug 4, 2022

This PR introduces the first version of Tokenizers support. This version will include the following:

  • Tokenizer APIs (creation and encoding).
  • Abstractions for tokenizer normalization, pre-processing, and models.
  • Bpe model, including training support.
  • EnglishRoberta model, which is used by the text classification machine learning model.
  • Lowercasing and uppercasing normalization.
  • Whitespace and EnglishRoberta pre-processors.
  • The integration of the tokenizer with the text classification model.
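
To illustrate how the pieces listed above fit together, here is a minimal Python sketch of a normalizer, a whitespace pre-tokenizer, and a BPE model chained into one `encode` call. It is only an illustration of the general technique, not the Microsoft.ML.Tokenizers API: every name here (`normalize_lower`, `pre_tokenize_whitespace`, `bpe_encode_word`, `encode`, `MERGE_RANKS`, `VOCAB`) and the toy merge table and vocabulary are assumptions made up for the example.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical toy data: merge rules ranked by priority (lower rank merges first)
# and a tiny vocabulary. A real BPE model would load these from trained files.
MERGE_RANKS: Dict[Tuple[str, str], int] = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}
VOCAB: Dict[str, int] = {"<unk>": 0, "low": 1, "er": 2, "n": 3, "e": 4, "w": 5}


def normalize_lower(text: str) -> str:
    """Normalization step: lowercase the input."""
    return text.lower()


def pre_tokenize_whitespace(text: str) -> List[Tuple[str, int]]:
    """Pre-processing step: split on whitespace, keeping each word's start offset."""
    return [(m.group(), m.start()) for m in re.finditer(r"\S+", text)]


def bpe_encode_word(word: str, merge_ranks: Dict[Tuple[str, str], int]) -> List[str]:
    """Model step: repeatedly merge the adjacent pair with the lowest rank."""
    symbols = list(word)
    while len(symbols) > 1:
        # Rank every adjacent pair; pairs without a merge rule get infinity.
        ranked = [
            (merge_ranks.get(pair, float("inf")), i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
        ]
        rank, i = min(ranked)
        if rank == float("inf"):
            break  # no applicable merge rule is left
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


def encode(text: str, merge_ranks: Dict[Tuple[str, str], int],
           vocab: Dict[str, int]) -> Tuple[List[str], List[int]]:
    """Run normalization, pre-tokenization, and the BPE model in sequence."""
    tokens: List[str] = []
    for word, _offset in pre_tokenize_whitespace(normalize_lower(text)):
        tokens.extend(bpe_encode_word(word, merge_ranks))
    ids = [vocab.get(token, vocab["<unk>"]) for token in tokens]
    return tokens, ids


if __name__ == "__main__":
    print(encode("Low LOWER new", MERGE_RANKS, VOCAB))
```

With the toy tables above, "Low LOWER new" is lowercased, split into three words, and merged into the tokens `low`, `low`, `er`, `n`, `e`, `w`. Keeping the three stages as separate functions mirrors the separation of normalization, pre-processing, and model described in this PR, so each stage can be swapped independently.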

Features to add to the tokenizers later:

  • More tokenization models (e.g. WordPiece, Unigram, etc.).
  • Saving and loading the whole tokenizer. We support saving/loading models but not the whole tokenizer.
  • Post-processing. We support only pre-processing for now; post-processing support would be a good addition.
  • Batch processing, i.e. support for encoding multiple sentences.
  • More normalizers and pre-processors.

@ghost ghost assigned tarekgh Aug 4, 2022
@codecov

codecov bot commented Aug 5, 2022

Codecov Report

Merging #6272 (293d75d) into main (c0d449f) will increase coverage by 0.13%.
The diff coverage is 82.16%.

@@            Coverage Diff             @@
##             main    #6272      +/-   ##
==========================================
+ Coverage   68.42%   68.55%   +0.13%     
==========================================
  Files        1144     1170      +26     
  Lines      244991   246881    +1890     
  Branches    25411    25666     +255     
==========================================
+ Hits       167627   169250    +1623     
- Misses      70697    70902     +205     
- Partials     6667     6729      +62     
Flag Coverage Δ
Debug 68.55% <82.16%> (+0.13%) ⬆️
production 63.01% <78.47%> (+0.13%) ⬆️
test 89.04% <94.10%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
src/Microsoft.ML.SearchSpace/Parameter.cs 69.88% <ø> (ø)
src/Microsoft.ML.Tokenizers/Model/Progress.cs 0.00% <0.00%> (ø)
test/Microsoft.ML.Tests/TextClassificationTests.cs 93.75% <5.88%> (-6.25%) ⬇️
src/Microsoft.ML.Tokenizers/Model/BPEDecoder.cs 9.09% <9.09%> (ø)
src/Microsoft.ML.Tokenizers/Model/Cache.cs 44.61% <44.61%> (ø)
src/Microsoft.ML.Tokenizers/Model/Merge.cs 57.14% <57.14%> (ø)
src/Microsoft.ML.Tokenizers/Utils/PriorityQueue.cs 61.01% <61.01%> (ø)
...osoft.ML.Tokenizers/Normalizer/NormalizedString.cs 71.42% <71.42%> (ø)
src/Microsoft.ML.Tokenizers/Model/BPE.cs 71.54% <71.54%> (ø)
...rc/Microsoft.ML.Tokenizers/PreTokenizer/Roberta.cs 76.92% <76.92%> (ø)
... and 37 more

Member

@michaelgsharp michaelgsharp left a comment

Couple of minor questions but otherwise it looks good to me.

@michaelgsharp
Member

We are also going to release the tokenizers as their own NuGet package, right?

@tarekgh
Member Author

tarekgh commented Aug 10, 2022

We are also going to release the tokenizers as their own NuGet package, right?

Yes, I have included the following item inside the csproj:

<Import Project="$(RepoRoot)eng/pkg/Pack.props" />

Isn't that good enough to produce its own package, or do I need to do anything more?

@tarekgh
Member Author

tarekgh commented Aug 10, 2022

@michaelgsharp this failure is unrelated; I am wondering whether it is a known issue we need to fix.

https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-machinelearning-refs-pull-6272-merge-e0b23e38b82d4f0199/Microsoft.ML.AutoML.Tests/1/console.610df26f.log?helixlogtype=result

Starting test: Microsoft.ML.AutoML.Test.AutoFitTests.AutoFitContextLogTest
    Microsoft.ML.AutoML.Test.AutoFitTests.AutoFitRankingCVTest [FAIL]
      Assert.True() Failure
      Expected: True
      Actual:   False
      Stack Trace:
        D:\a\1\s\test\Microsoft.ML.AutoML.Tests\AutoFitTests.cs(351,0): at Microsoft.ML.AutoML.Test.AutoFitTests.AutoFitRankingCVTest()

@michaelgsharp
Member

If I remember correctly, yes, this is one of the tests that is flaky every once in a while. I'll requeue, but we should be good to merge.

@michaelgsharp michaelgsharp merged commit e8073ad into dotnet:main Aug 10, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Sep 10, 2022