(Note for versions 0.2.0 and earlier)
A Rustler bridge to jieba-rs, the Rust Jieba implementation.
It lets Elixir code use the jieba-rs segmenter to segment Chinese text.
The API is mostly a direct mapping of the Rust API. The constructors have all
been combined under a single new/2 API, which lets the code feel less imperative.
The KeywordExtract functionality for both TFIDF and TextRank is also provided.
However, the design of jieba-rs makes it hard to project those two Rust structs
into the BEAM while respecting Rust's lifetime rules and ensuring mutual
exclusion across threads, so they are exported as single-use functions that
construct and tear down the TFIDF and TextRank instances on every call. This is
possibly slow, but making it fast would require modifying the jieba-rs API so
that neither TFIDF nor TextRank holds a reference to the underlying jieba
instance on construction and instead takes the wanted instance in the
extract_tags() call.
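The lifetime constraint can be sketched in plain Rust. This is a simplified illustration under stated assumptions, not the jieba-rs or bridge code: `Segmenter`, `BorrowingExtractor`, `CallTimeExtractor`, `top_words`, and `extract_once` are hypothetical stand-ins, and the "keyword extraction" is just a word-frequency count. The first extractor borrows the segmenter at construction, so it has to be built and dropped inside each call while the segmenter is locked. The second takes the segmenter as an argument, so it carries no lifetime parameter and could be kept alive between calls.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical stand-in for the Jieba segmenter.
struct Segmenter;

impl Segmenter {
    /// Toy segmentation: split on whitespace.
    fn cut<'s>(&self, sentence: &'s str) -> Vec<&'s str> {
        sentence.split_whitespace().collect()
    }
}

/// Toy ranking: most frequent words first, ties broken alphabetically.
fn top_words(words: Vec<&str>, top_k: usize) -> Vec<String> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for w in words {
        *counts.entry(w).or_insert(0) += 1;
    }
    let mut ranked: Vec<(&str, usize)> = counts.into_iter().collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(b.0)));
    ranked.into_iter().take(top_k).map(|(w, _)| w.to_string()).collect()
}

/// Shape described in the text: the extractor holds a reference to the
/// segmenter, so its type carries a lifetime and cannot outlive the borrow.
struct BorrowingExtractor<'a> {
    segmenter: &'a Segmenter,
}

impl<'a> BorrowingExtractor<'a> {
    fn new(segmenter: &'a Segmenter) -> Self {
        Self { segmenter }
    }

    fn extract_tags(&self, sentence: &str, top_k: usize) -> Vec<String> {
        top_words(self.segmenter.cut(sentence), top_k)
    }
}

/// Alternative shape: the segmenter is supplied at call time, so the
/// extractor has no lifetime parameter and can be stored long-term.
struct CallTimeExtractor;

impl CallTimeExtractor {
    fn extract_tags(&self, segmenter: &Segmenter, sentence: &str, top_k: usize) -> Vec<String> {
        top_words(segmenter.cut(sentence), top_k)
    }
}

/// Single-use function in the style of the bridge: lock the shared
/// segmenter, build the borrowing extractor, use it once, then drop it.
fn extract_once(shared: &Mutex<Segmenter>, sentence: &str, top_k: usize) -> Vec<String> {
    let guard = shared.lock().unwrap();
    let extractor = BorrowingExtractor::new(&guard);
    extractor.extract_tags(sentence, top_k)
}
```

In this sketch, `extract_once` pays the construction cost on every call because `BorrowingExtractor` cannot outlive the mutex guard it borrows from. A `CallTimeExtractor` could instead be created once and reused, borrowing the segmenter only for the duration of each `extract_tags` call.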
If available in Hex, the package can be installed by adding jieba to your list
of dependencies in mix.exs:
    def deps do
      [
        {:jieba, "~> 0.3.1"}
      ]
    end
Versions prior to 0.2.0 were written by mjason (lmj on Hex) and released from
the mjason/jieba_ex source tree. They exposed a single Jieba.cut(sentence)
function that used a single, unsynchronized, static instance of Jieba on the
Rust side, loaded with the default dictionary. cut(sentence) was hardcoded to
use hmm=false.
In March 2024, this codebase was written to help with the Visual Fonts project,
without realizing that an existing codebase was available. This codebase exposes
more of the Rust API. After talking with mjason, it was decided to switch to
this codebase and to increment the version number to signify the API break.
The 0.3.z versions still include the Jieba.cut/1 interface, but it is marked
deprecated. In 1.0.0, this API will be removed in favor of an API that does not
rely on a global object.