Releases: georg-jung/FastBertTokenizer

v1.0.28

30 Apr 12:32

Highlights

  • The API surface is considered stable now (except the parts explicitly marked as experimental).
  • Support for netstandard2.0 was added.
  • Native AOT is fully supported.
  • FastBertTokenizer is now almost allocation-free. That makes single-threaded encoding a bit faster and yields larger improvements when encoding on multiple threads.

Breaking Changes

  • Method signature changed: the Encode overload that returned ReadOnlyMemory now returns Memory. The old design made sense because the returned memory points to a buffer internal to FastBertTokenizer, but onnxruntime requires Memory rather than ReadOnlyMemory. Writing to the buffer from outside doesn't break FastBertTokenizer, so exposing it as Memory simplifies usage with onnxruntime.
    - public (ReadOnlyMemory<long> InputIds, ReadOnlyMemory<long> AttentionMask, ReadOnlyMemory<long> TokenTypeIds) Encode(string input, int maximumTokens = 512, int? padTo = null)
    + public (Memory<long> InputIds, Memory<long> AttentionMask, Memory<long> TokenTypeIds) Encode(string input, int maximumTokens = 512, int? padTo = null)
  • Some APIs are now marked as experimental. None were before, so you may need to add <NoWarn>FBERTTOK001</NoWarn> <!-- Experimental FastBertTokenizer features --> to your csproj if you use them.
  • The Tokenize methods, which were previously marked as obsolete and renamed to Encode, have been removed.
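The new Memory<long> return type can be handed to onnxruntime without copying. A minimal sketch, assuming the Microsoft.ML.OnnxRuntime package; the model id and tensor shape are illustrative:

```csharp
using FastBertTokenizer;
using Microsoft.ML.OnnxRuntime.Tensors;

var tokenizer = new BertTokenizer();
await tokenizer.LoadFromHuggingFaceAsync("bert-base-uncased");

var (inputIds, attentionMask, tokenTypeIds) =
    tokenizer.Encode("Lorem ipsum dolor sit amet.", maximumTokens: 512);

// DenseTensor<T> accepts Memory<T> but not ReadOnlyMemory<T>, which is
// why the Encode overload now returns Memory<long>. No copy is needed.
var dims = new[] { 1, inputIds.Length };
var inputIdsTensor = new DenseTensor<long>(inputIds, dims);
var attentionMaskTensor = new DenseTensor<long>(attentionMask, dims);
```

Note that the returned buffers are internal to the tokenizer and re-used by subsequent Encode calls, so copy their contents if you need them afterwards.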

Other

  • Fixed #39: add Decode support for input_id sequences that don't start at a word prefix

v0.4.67

11 Dec 00:31
  • The Tokenize methods are now called Encode as that better expresses what they do.
    • The old methods still exist for now as redirects, marked as Obsolete.
  • Add new CreateBatchEnumerator and CreateAsyncBatchEnumerator APIs that support encoding of inputs that are longer than what fits in one model input (overlap/stride).
  • Make the Encode overload that returns ReadOnlyMemory values re-use its internal buffers.
  • Use FrozenDictionary on .NET 8.
  • Add support for reading configuration from tokenizer.json files.
  • Add a LoadFromHuggingFaceAsync method to ease getting started.
  • Decode produces cleaner outputs (e.g. "hello, world" instead of "hello , world").
  • Produce results more similar to Hugging Face (token lookup keys are no longer Unicode-normalized).
  • Greatly improve test coverage and thus correctness verification.
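The batch enumerator splits inputs that exceed one model input into overlapping chunks. A hedged sketch of how CreateAsyncBatchEnumerator might be used; the parameter names (tokensPerInput, batchSize, stride) and the shape of the source enumerable are assumptions to be checked against the project's README:

```csharp
using FastBertTokenizer;

var tokenizer = new BertTokenizer();
await tokenizer.LoadFromHuggingFaceAsync("bert-base-uncased");

// Keyed source texts; inputs longer than tokensPerInput are split into
// several model inputs, with `stride` tokens of overlap between chunks.
async IAsyncEnumerable<(int Key, string Content)> Source()
{
    await Task.CompletedTask;
    yield return (1, "Some article that may well be longer than 512 tokens ...");
}

var batches = tokenizer.CreateAsyncBatchEnumerator(
    Source(), tokensPerInput: 512, batchSize: 32, stride: 64);

await foreach (var batch in batches)
{
    // batch.InputIds and batch.AttentionMask hold one batch of model
    // inputs; correlation info maps each row back to its input key.
}
```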

v0.3.29

18 Sep 00:28

Breaking Changes

  • PreTokenizer is now internal instead of public, as it always should have been.
  • The publicly visible API now uses string instead of ReadOnlySpan/Memory<char>
    • This enables better Unicode normalization handling without first having to create a string from every input

New Features

  • Automatically test correctness of tokenization against Hugging Face tokenizers using unit tests
  • Added support for multi-threaded tokenization
    • On an 8-core notebook CPU, multi-threaded tokenization is 3x faster than single-threaded tokenization
    • In a GitHub Actions runner it is about 2x faster
  • Inputs are Unicode-normalized prior to tokenization if (and only if) required

v0.2.7

14 Sep 16:00
  • Deterministic builds
  • Icon and readme included in the .nupkg

v0.1.29

14 Sep 14:41
Fix NuGet release notes