Releases: georg-jung/FastBertTokenizer

v1.0.28

30 Apr 12:32

Highlights

  • The API surface is considered stable now (except the parts explicitly marked as experimental).
  • Support for netstandard2.0 was added.
  • Native AOT is fully supported.
  • FastBertTokenizer is now almost allocation-free. That makes single-threaded encoding a bit faster and yields larger improvements when encoding on multiple threads.

Breaking Changes

  • Method signature changed: the Encode overload that returned ReadOnlyMemory now returns Memory. The old design made sense because the returned memory points to a buffer internal to FastBertTokenizer, but onnxruntime requires Memory rather than ReadOnlyMemory. Writing to the buffer from outside doesn't break FastBertTokenizer, so exposing it as Memory simplifies usage with onnxruntime.
    - public (ReadOnlyMemory<long> InputIds, ReadOnlyMemory<long> AttentionMask, ReadOnlyMemory<long> TokenTypeIds) Encode(string input, int maximumTokens = 512, int? padTo = null)
    + public (Memory<long> InputIds, Memory<long> AttentionMask, Memory<long> TokenTypeIds) Encode(string input, int maximumTokens = 512, int? padTo = null)
  • Some APIs are now marked as experimental. None were before, so you may need to add <NoWarn>FBERTTOK001</NoWarn> <!-- Experimental FastBertTokenizer features --> to your csproj if you use them.
  • The Tokenize methods, which were previously marked as obsolete and renamed to Encode, have been removed.
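The new Memory<long> return type can be handed to onnxruntime without copying. A minimal sketch, assuming the Microsoft.ML.OnnxRuntime package; the model id and tensor shape are illustrative:

```csharp
using FastBertTokenizer;
using Microsoft.ML.OnnxRuntime.Tensors;

var tokenizer = new BertTokenizer();
await tokenizer.LoadFromHuggingFaceAsync("bert-base-uncased");

var (inputIds, attentionMask, tokenTypeIds) =
    tokenizer.Encode("Lorem ipsum dolor sit amet.", maximumTokens: 512);

// DenseTensor<T> accepts Memory<T> but not ReadOnlyMemory<T>, which is
// why the Encode overload now returns Memory<long>. No copy is needed.
var dims = new[] { 1, inputIds.Length };
var inputIdsTensor = new DenseTensor<long>(inputIds, dims);
var attentionMaskTensor = new DenseTensor<long>(attentionMask, dims);
```

Note that the returned buffers are internal to the tokenizer and re-used by subsequent Encode calls, so copy their contents if you need them afterwards.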

Other

  • Fixed #39: add Decode support for input_id sequences that don't start at a word prefix

v0.4.67

11 Dec 00:31
  • The Tokenize methods are now called Encode as that better expresses what they do.
    • The old methods still exist for now as redirects, marked as Obsolete.
  • Add new CreateBatchEnumerator and CreateAsyncBatchEnumerator APIs that support encoding of inputs that are longer than what fits in one model input (overlap/stride).
  • Make the Encode overload that returns ReadOnlyMemory values re-use its internal buffers.
  • Use FrozenDictionary on .NET 8.
  • Add support for reading configuration from tokenizer.json files.
  • Add a LoadFromHuggingFaceAsync method to ease getting started.
  • Decode produces cleaner outputs (e.g. "hello, world" instead of "hello , world").
  • Produce results more similar to Hugging Face (token lookup keys are no longer Unicode-normalized).
  • Greatly improve test coverage and thus correctness verification.
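The batch enumerator splits inputs that exceed one model input into overlapping chunks. A hedged sketch of how CreateAsyncBatchEnumerator might be used; the parameter names (tokensPerInput, batchSize, stride) and the shape of the source enumerable are assumptions to be checked against the project's README:

```csharp
using FastBertTokenizer;

var tokenizer = new BertTokenizer();
await tokenizer.LoadFromHuggingFaceAsync("bert-base-uncased");

// Keyed source texts; inputs longer than tokensPerInput are split into
// several model inputs, with `stride` tokens of overlap between chunks.
async IAsyncEnumerable<(int Key, string Content)> Source()
{
    await Task.CompletedTask;
    yield return (1, "Some article that may well be longer than 512 tokens ...");
}

var batches = tokenizer.CreateAsyncBatchEnumerator(
    Source(), tokensPerInput: 512, batchSize: 32, stride: 64);

await foreach (var batch in batches)
{
    // batch.InputIds and batch.AttentionMask hold one batch of model
    // inputs; correlation info maps each row back to its input key.
}
```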

v0.3.29

18 Sep 00:28

Breaking Changes

  • PreTokenizer is now internal instead of public, as it always should have been.
  • The publicly visible API now uses string instead of ReadOnlySpan/Memory<char>
    • This enables better Unicode normalization handling without first having to create a string from every input

New Features

  • Automatically test correctness of tokenization against Hugging Face tokenizers using unit tests
  • Added support for multi-threaded tokenization
    • On an 8-core notebook CPU, multi-threaded tokenization is 3x faster than single-threaded tokenization
    • In a GitHub Actions runner it is about 2x faster
  • Inputs are Unicode-normalized prior to tokenization if (and only if) required

v0.2.7

14 Sep 16:00
  • Deterministic builds
  • Icon and readme included in the .nupkg

v0.1.29

14 Sep 14:41
Fix NuGet release notes