Releases: danieldk/alpino-tokenizer

Transducer Protobuf

08 Oct 08:49

This release is a major change from previous releases. Previously, we compiled the transition table of the Alpino tokenizer into an object file, as Alpino itself does. Starting with this release, the transitions are instead stored in a file using Protocol Buffers. This reduces compilation time dramatically, shrinks binaries that use this crate, removes the reliance on unsafe C code, and makes it possible to load a different transducer.

Other changes:

  • Add the Tokenizer trait to abstract over tokenizers. There are two implementations: (1) FiniteStateTokenizer, which uses a transducer without any pre- or post-processing; (2) AlpinoTokenizer, which applies the pre- and post-processing needed for the Alpino tokenizer.
  • Use small string optimization for outputs of transitions, drastically reducing memory use.
  • Change the license of the crate to the Apache License version 2.0. Since we do not use any code from Alpino anymore, we can pick our own license (the transducer for the Alpino tokenizer is still licensed under the LGPL).
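
To illustrate the idea behind the Tokenizer trait, here is a minimal, self-contained sketch. It does not use this crate's actual API: the trait signature, `RawTokenizer`, `ProcessingTokenizer`, and `count_tokens` are all hypothetical names made up for this example, and the whitespace splitting merely stands in for a real transducer. It only shows how one trait can cover both a bare tokenizer and one that wraps it with pre- and post-processing.

```rust
/// Hypothetical stand-in for a tokenizer trait. The method name and
/// return type are assumptions for illustration, not the crate's API.
pub trait Tokenizer {
    /// Split `text` into sentences, each a list of tokens.
    /// Returns `None` when the input cannot be tokenized.
    fn tokenize(&self, text: &str) -> Option<Vec<Vec<String>>>;
}

/// Toy tokenizer without pre- or post-processing. It treats every line
/// as a sentence and splits on whitespace (a real implementation would
/// run a finite-state transducer here).
pub struct RawTokenizer;

impl Tokenizer for RawTokenizer {
    fn tokenize(&self, text: &str) -> Option<Vec<Vec<String>>> {
        if text.is_empty() {
            return None;
        }
        Some(
            text.lines()
                .map(|line| line.split_whitespace().map(str::to_owned).collect())
                .collect(),
        )
    }
}

/// Toy wrapper that adds pre- and post-processing around any tokenizer.
pub struct ProcessingTokenizer<T> {
    pub inner: T,
}

impl<T: Tokenizer> Tokenizer for ProcessingTokenizer<T> {
    fn tokenize(&self, text: &str) -> Option<Vec<Vec<String>>> {
        // Pre-processing: trim the input and detach periods from words.
        let prepared = text.trim().replace('.', " .");
        let sentences = self.inner.tokenize(&prepared)?;
        // Post-processing: drop sentences that ended up empty.
        Some(sentences.into_iter().filter(|s| !s.is_empty()).collect())
    }
}

/// Code written against the trait works with either implementation.
pub fn count_tokens(tokenizer: &dyn Tokenizer, text: &str) -> usize {
    tokenizer
        .tokenize(text)
        .map_or(0, |sentences| sentences.iter().map(|s| s.len()).sum())
}
```

Callers such as `count_tokens` depend only on the trait, so swapping the bare tokenizer for the wrapped one requires no changes on their side.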