v1.0.28

@georg-jung georg-jung released this 30 Apr 12:32
· 4 commits to master since this release

Highlights

  • The API surface is now considered stable, except for the parts explicitly marked as experimental.
  • Support for netstandard2.0 was added.
  • Native AOT is fully supported.
  • FastBertTokenizer is now almost allocation-free. This makes single-threaded encoding a bit faster and yields larger improvements when encoding on multiple threads.

Breaking Changes

  • Method signature changed: The Encode overload that returned ReadOnlyMemory now returns Memory instead. The old design made sense because the memory points to a buffer internal to FastBertTokenizer. However, onnxruntime requires Memory rather than ReadOnlyMemory. Writing to the buffer from outside doesn't break FastBertTokenizer, so it's okay to expose the buffer as Memory to simplify usage with onnxruntime.
    - public (ReadOnlyMemory<long> InputIds, ReadOnlyMemory<long> AttentionMask, ReadOnlyMemory<long> TokenTypeIds) Encode(string input, int maximumTokens = 512, int? padTo = null)
    + public (Memory<long> InputIds, Memory<long> AttentionMask, Memory<long> TokenTypeIds) Encode(string input, int maximumTokens = 512, int? padTo = null)
  • Some APIs are now marked as experimental. None were before, so if you use them you may need to add <NoWarn>FBERTTOK001</NoWarn> <!-- Experimental FastBertTokenizer features --> to your csproj.
  • The methods named Tokenize, which were previously marked as obsolete after being renamed to Encode, have been removed.
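
For the experimental-API warning mentioned above, a minimal csproj sketch could look like the following. The enclosing PropertyGroup and the `$(NoWarn);` prefix are assumptions added here, not part of the original note. The prefix is the usual MSBuild idiom for keeping warning suppressions that are already defined elsewhere in the project.

```xml
<!-- Sketch only: place inside your existing project file. -->
<PropertyGroup>
  <!-- $(NoWarn); keeps previously configured suppressions (assumption: you may have some). -->
  <!-- FBERTTOK001: Experimental FastBertTokenizer features -->
  <NoWarn>$(NoWarn);FBERTTOK001</NoWarn>
</PropertyGroup>
```

This is only needed if your code calls APIs that are marked as experimental. Projects that use only the stable surface should build without it.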

Other

  • Fixed #39: Add Decode support for input_id sequences that don't start at a word prefix.