@tarekgh (Member) commented Oct 1, 2025

Fixes #7512

Copilot AI review requested due to automatic review settings October 1, 2025 19:15

@Copilot (Contributor) left a comment

Pull Request Overview

This PR addresses design review feedback for the ML.NET tokenizers library by improving API consistency and usability. The changes focus on standardizing parameter types, adding constructor overloads, and improving property visibility.

  • Replaced tuple-based vocabulary parameters with KeyValuePair<string, int> for better API consistency
  • Added a new BpeOptions constructor that accepts file paths for vocabulary and merges files
  • Renamed the addBeginOfSentence parameter to addBeginningOfSentence (SentencePieceTokenizer.cs) and made the BeginningOfSentenceToken property public (BPETokenizer.cs)
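
A minimal sketch of what the first bullet means for calling code, assuming a vocabulary that was previously held as (token, id) tuples. It uses only base class library types; the helper name `ToPairs` and the sample tokens are hypothetical and are not part of Microsoft.ML.Tokenizers, and the sketch does not reproduce the actual BpeOptions signatures.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class VocabularyShapes
{
    // Hypothetical helper (not a library API): maps tuple entries to KeyValuePair entries.
    public static IEnumerable<KeyValuePair<string, int>> ToPairs(
        IEnumerable<(string Token, int Id)> vocabulary) =>
        vocabulary.Select(entry => new KeyValuePair<string, int>(entry.Token, entry.Id));

    public static void Main()
    {
        // Tuple-based shape, as described for the old parameter type.
        var tupleVocabulary = new (string Token, int Id)[] { ("hello", 0), ("world", 1) };

        // KeyValuePair-based shape, as described for the new parameter type.
        List<KeyValuePair<string, int>> pairs = ToPairs(tupleVocabulary).ToList();

        // A Dictionary<string, int> already enumerates as KeyValuePair<string, int>,
        // so it needs no conversion at all.
        IEnumerable<KeyValuePair<string, int>> fromDictionary =
            new Dictionary<string, int> { ["hello"] = 0, ["world"] = 1 };

        foreach (var pair in pairs)
        {
            Console.WriteLine($"{pair.Key} -> {pair.Value}");
        }
        Console.WriteLine(fromDictionary.Count());
    }
}
```

One practical consequence of a KeyValuePair-based parameter is that an existing dictionary can be passed as-is, while tuple collections need a one-line projection like the one above.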

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| test/Microsoft.ML.Tokenizers.Tests/BpeTests.cs | Updated test code to use new KeyValuePair vocabulary format and added tests for new constructor |
| src/Microsoft.ML.Tokenizers/PreTokenizer/CompositePreTokenizer.cs | Added missing namespace declaration |
| src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs | Renamed parameter from addBeginOfSentence to addBeginningOfSentence for consistency |
| src/Microsoft.ML.Tokenizers/Model/BpeOptions.cs | Added new constructor accepting file paths and changed vocabulary type from tuples to KeyValuePair |
| src/Microsoft.ML.Tokenizers/Model/BPETokenizer.cs | Updated to work with new vocabulary format and made BeginningOfSentenceToken property public |

@tarekgh tarekgh added this to the ML.NET 5.0 milestone Oct 1, 2025
@tarekgh tarekgh self-assigned this Oct 1, 2025
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

@ericstj (Member) left a comment

Just a couple small questions, but code changes look fine.

tarekgh and others added 2 commits October 1, 2025 12:45
codecov bot commented Oct 1, 2025

Codecov Report

❌ Patch coverage is 70.00000% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.00%. Comparing base (fb39755) to head (f80f3bf).
⚠️ Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/Microsoft.ML.Tokenizers/Model/BpeOptions.cs | 56.09% | 12 Missing and 6 partials ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7513      +/-   ##
==========================================
- Coverage   69.01%   69.00%   -0.01%     
==========================================
  Files        1482     1482              
  Lines      273999   274048      +49     
  Branches    28258    28266       +8     
==========================================
+ Hits       189093   189113      +20     
- Misses      77520    77540      +20     
- Partials     7386     7395       +9     

| Flag | Coverage Δ |
| --- | --- |
| Debug | 69.00% <70.00%> (-0.01%) ⬇️ |
| production | 63.29% <60.00%> (-0.01%) ⬇️ |
| test | 89.46% <100.00%> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/Microsoft.ML.Tokenizers/Model/BPETokenizer.cs | 71.98% <100.00%> (ø) |
| ...soft.ML.Tokenizers/Model/SentencePieceTokenizer.cs | 90.00% <100.00%> (ø) |
| ...L.Tokenizers/PreTokenizer/CompositePreTokenizer.cs | 40.15% <ø> (ø) |
| test/Microsoft.ML.Tokenizers.Tests/BpeTests.cs | 97.34% <100.00%> (+0.03%) ⬆️ |
| src/Microsoft.ML.Tokenizers/Model/BpeOptions.cs | 63.15% <56.09%> (-21.06%) ⬇️ |

... and 4 files with indirect coverage changes

@ericstj ericstj merged commit 965b755 into dotnet:main Oct 1, 2025
25 checks passed

Development

Successfully merging this pull request may close these issues.

Added Tokenizer's APIs for v2

2 participants