tarekgh (Member) commented Oct 2, 2025

This change is a simple clean-up for the BPE tokenizer:

  • Adds support for BeginningOfSentenceToken and EndOfSentenceToken when these tokens are not present in the vocabulary but are instead provided as special tokens by the tokenizer.
  • Simplifies the implementation of the Decode method.
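
The first bullet can be pictured with a small, language-neutral sketch. It is written in Python rather than the tokenizer's C# and is not taken from BPETokenizer.cs. The helper name `resolve_token_id`, the `vocab` and `special_tokens` dictionaries, the lookup order, and the error raised when the token is found nowhere are all assumptions made for illustration.

```python
from typing import Mapping, Optional


def resolve_token_id(
    token: Optional[str],
    vocab: Mapping[str, int],
    special_tokens: Optional[Mapping[str, int]] = None,
) -> Optional[int]:
    """Return the id for a BOS/EOS token, or None when no token is configured.

    Hypothetical helper: the name, signature, and error behaviour are
    assumptions for illustration, not the ML.NET API.
    """
    if token is None:
        return None
    # Look in the main vocabulary first.
    if token in vocab:
        return vocab[token]
    # Fall back to the special tokens: the case this change adds support for.
    if special_tokens is not None and token in special_tokens:
        return special_tokens[token]
    raise ValueError(f"{token!r} is in neither the vocabulary nor the special tokens")


# Made-up data: BOS/EOS exist only as special tokens, not in the vocabulary.
vocab = {"hello": 0, "world": 1}
special_tokens = {"<s>": 100, "</s>": 101}

bos_id = resolve_token_id("<s>", vocab, special_tokens)
eos_id = resolve_token_id("</s>", vocab, special_tokens)
```

With this sample data, both ids come from the special tokens because neither `<s>` nor `</s>` appears in `vocab`.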

Copilot AI review requested due to automatic review settings, October 2, 2025 21:57
tarekgh added this to the ML.NET 5.0 milestone Oct 2, 2025
tarekgh self-assigned this Oct 2, 2025

Copilot AI (Contributor) left a comment

Pull Request Overview

This PR cleans up the BPE tokenizer by adding support for special tokens that are not part of the main vocabulary and by simplifying the Decode method implementation.

  • Enhances special token handling for BeginningOfSentenceToken and EndOfSentenceToken when they exist only in special tokens
  • Refactors the Decode method to use a single loop instead of duplicated logic
  • Adds comprehensive test coverage for the new special token functionality
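
As a rough picture of what a single-loop decode can look like, here is a language-neutral sketch in Python. It is not the code from BPETokenizer.cs, and the page does not show what the duplicated logic was. The function name `decode`, its parameters, the `consider_special_tokens` flag, and the choice to skip unknown ids are assumptions made for illustration.

```python
from typing import Iterable, Mapping


def decode(
    ids: Iterable[int],
    id_to_token: Mapping[int, str],
    special_id_to_token: Mapping[int, str],
    consider_special_tokens: bool = True,
) -> str:
    """Decode ids to text in one pass (hypothetical sketch, not the ML.NET code)."""
    pieces = []
    for token_id in ids:
        if token_id in special_id_to_token:
            # Special tokens are emitted or dropped depending on the flag.
            if consider_special_tokens:
                pieces.append(special_id_to_token[token_id])
            continue
        token = id_to_token.get(token_id)
        if token is not None:  # Unknown ids are skipped in this sketch.
            pieces.append(token)
    return "".join(pieces)


# Made-up data for illustration.
id_to_token = {0: "hello", 1: " world"}
special_id_to_token = {100: "<s>", 101: "</s>"}
ids = [100, 0, 1, 101]

with_special = decode(ids, id_to_token, special_id_to_token)
without_special = decode(
    ids, id_to_token, special_id_to_token, consider_special_tokens=False
)
```

The point of the single loop is that the special-token decision is made per id inside one pass, rather than maintaining two near-identical loops for the two flag values.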

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/Microsoft.ML.Tokenizers/Model/BPETokenizer.cs | Enhanced constructor to check special tokens for BOS/EOS tokens and streamlined Decode method implementation |
| test/Microsoft.ML.Tokenizers.Tests/BpeTests.cs | Added comprehensive test case for special token handling with BOS/EOS tokens |

Co-authored-by: Eric StJohn <ericstj@microsoft.com>

codecov bot commented Oct 2, 2025

Codecov Report

❌ Patch coverage is 96.55172% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.02%. Comparing base (6a5ec43) to head (0aea52c).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/Microsoft.ML.Tokenizers/Model/BPETokenizer.cs | 86.66% | 0 Missing and 2 partials ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7514      +/-   ##
==========================================
+ Coverage   69.01%   69.02%   +0.01%     
==========================================
  Files        1482     1482              
  Lines      274050   274092      +42     
  Branches    28266    28266              
==========================================
+ Hits       189128   189195      +67     
+ Misses      77526    77516      -10     
+ Partials     7396     7381      -15     

| Flag | Coverage Δ |
| --- | --- |
| Debug | 69.02% <96.55%> (+0.01%) ⬆️ |
| production | 63.31% <86.66%> (+0.01%) ⬆️ |
| test | 89.47% <100.00%> (+<0.01%) ⬆️ |
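
The figures in the coverage diff above can be cross-checked with a few lines of arithmetic. The sketch below assumes coverage is computed as hits divided by total lines, which is consistent with the reported numbers. The 58-line patch size is an inference that reproduces the reported 96.55172% with 2 uncovered lines; it is not a figure stated in the report.

```python
import math


def coverage_percent(hits: int, lines: int) -> float:
    """Coverage as a percentage: covered lines over total lines."""
    return 100.0 * hits / lines


def truncate2(value: float) -> float:
    """Truncate (not round) to two decimal places."""
    return math.floor(value * 100) / 100


# Totals copied from the coverage diff (base = main, head = #7514).
base_hits, base_misses, base_partials, base_lines = 189128, 77526, 7396, 274050
head_hits, head_misses, head_partials, head_lines = 189195, 77516, 7381, 274092

# Hits, misses, and partials should add up to the line totals.
base_consistent = base_hits + base_misses + base_partials == base_lines
head_consistent = head_hits + head_misses + head_partials == head_lines

base_cov = coverage_percent(base_hits, base_lines)  # about 69.012
head_cov = coverage_percent(head_hits, head_lines)  # about 69.026

# Assumed patch size: 58 changed lines, 2 of them only partially covered.
patch_cov = coverage_percent(58 - 2, 58)  # about 96.5517
```

Truncated to two decimals, `base_cov` and `head_cov` give the 69.01% and 69.02% shown in the diff.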

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
| --- | --- |
| test/Microsoft.ML.Tokenizers.Tests/BpeTests.cs | 97.47% <100.00%> (+0.13%) ⬆️ |
| src/Microsoft.ML.Tokenizers/Model/BPETokenizer.cs | 74.89% <86.66%> (+2.90%) ⬆️ |

... and 9 files with indirect coverage changes

tarekgh merged commit b28b6d4 into dotnet:main Oct 3, 2025
25 checks passed
asp2286 pushed a commit to asp2286/machinelearning that referenced this pull request Oct 11, 2025
* BpeTokenizer Cleanup

* Apply suggestions from code review

Co-authored-by: Eric StJohn <ericstj@microsoft.com>

---------

Co-authored-by: Eric StJohn <ericstj@microsoft.com>