Kotori

A Japanese tokenizer and morphological analysis engine written in Kotlin

Usage

import com.github.wanasit.kotori.Tokenizer

fun main(args: Array<String>) {
    val tokenizer = Tokenizer.createDefaultTokenizer()
    val words = tokenizer.tokenize("お寿司が食べたい。").map { it.text }

    println(words) // [お, 寿司, が, 食べ, たい, 。]
}

Installation

Kotori packages are hosted on Bintray and JCenter. You can install them via Gradle or Maven.

Gradle:

repositories {
    jcenter()
}

dependencies {
    ...
    implementation 'com.github.wanasit.kotori:kotori:0.0.3'
}

Maven:

<dependency>
  <groupId>com.github.wanasit.kotori</groupId>
  <artifactId>kotori</artifactId>
  <version>VERSION_NUMBER</version>
  <type>pom</type>
</dependency>

You can also install Kotori via JitPack.

Dictionary

Kotori has a built-in dictionary based on mecab-ipadic-2.7.0-20070801.

val dictionary = Dictionary.readDefaultFromResource()
val tokenizer = Tokenizer.create(dictionary)

tokenizer.tokenize("お寿司が食べたい。")

However, it also works out of the box with any MeCab dictionary. For example:

IPADIC (2.7.0-20070801)
UniDic (2.1.2)
JUMANDIC (7.0-20130310)

val dictionary = MeCabDictionary.readFromDirectory("~/Download/mecab-ipadic-2.7.0-20070801")
val tokenizer = Tokenizer.create(dictionary)

tokenizer.tokenize("お寿司が食べたい。")

Note: Support for Sudachi dictionaries and plugins is under development.

Performance

Kotori is heavily inspired by Kuromoji and Sudachi, but it tokenizes faster than other JVM-based tokenizers (based on our own, probably unfair, benchmark).

The following statistics come from tokenizing Japanese sentences from Tatoeba (193,898 sentences, 3,561,854 characters in total) on a MacBook Pro 2020 (2.4 GHz 8-core Intel Core i9).

Tokenizer (dictionary)	Token count	Time (ns per document)	Time (ns per token)
Kuromoji (IPADIC)	2,264,560	10,095	864
Kotori (IPADIC)	2,264,705	8,190	701
Sudachi (sudachi-dictionary-20200330-small)	2,308,873	27,352	2,296
Kotori (sudachi-dictionary-20200330-small)	2,157,820	13,079	1,175
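
The two time columns appear to be linked through the corpus size: time per token is roughly time per document, multiplied by the number of documents, divided by the token count. This relationship is inferred from the figures themselves, not taken from the benchmark code, and the helper name `nsPerToken` is made up for illustration. A minimal sketch that recomputes the last column from the other two:

```kotlin
// Corpus size quoted above: 193,898 Tatoeba sentences.
const val DOCUMENTS = 193_898L

// Hypothetical helper: derives ns per token from ns per document and the token count.
fun nsPerToken(nsPerDocument: Long, tokenCount: Long): Double =
    nsPerDocument.toDouble() * DOCUMENTS / tokenCount

fun main() {
    // (name, ns per document, token count), copied from the table above.
    val rows = listOf(
        Triple("Kuromoji (IPADIC)", 10_095L, 2_264_560L),
        Triple("Kotori (IPADIC)", 8_190L, 2_264_705L),
        Triple("Sudachi (sudachi-dictionary-20200330-small)", 27_352L, 2_308_873L),
        Triple("Kotori (sudachi-dictionary-20200330-small)", 13_079L, 2_157_820L)
    )
    for ((name, nsPerDoc, tokens) in rows) {
        println("%-45s %.0f ns per token".format(name, nsPerToken(nsPerDoc, tokens)))
    }
}
```

The recomputed values agree with the table to within a nanosecond or two; the small differences are expected because the per-document times are already rounded.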

(Speculative) What makes Kotori fast

Minimal String.substring() usage. Since JDK 7 (update 6), substring() copies the underlying characters and therefore has O(n) overhead. Some tokenizers that were designed before that change (e.g. Kuromoji) still make a lot of substring calls.
A customized Trie data structure. TransitionArrayTrie can be built quickly, just in time, when a tokenizer is created, and it still performs well on Japanese text in UTF-16.
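
To illustrate the first point, the sketch below looks up dictionary terms by walking a trie over character offsets, so no substring is allocated during the search. It is only an illustration under simplifying assumptions: `SimpleTrie`, `TrieNode`, and `matchEndOffsets` are hypothetical names, and the HashMap-based trie is a stand-in, not Kotori's TransitionArrayTrie.

```kotlin
// Illustrative only: a plain HashMap-based trie, not Kotori's TransitionArrayTrie.
class TrieNode {
    val children = HashMap<Char, TrieNode>()
    var isTerm = false
}

class SimpleTrie(terms: List<String>) {
    private val root = TrieNode()

    init {
        // Build the trie once, when the object is created.
        for (term in terms) {
            var node = root
            for (ch in term) {
                node = node.children.getOrPut(ch) { TrieNode() }
            }
            node.isTerm = true
        }
    }

    // Returns the end offsets (exclusive) of every term that starts at `start`.
    // The text is read by index, so no substring is created while searching.
    fun matchEndOffsets(text: CharSequence, start: Int): List<Int> {
        val ends = ArrayList<Int>()
        var node = root
        var i = start
        while (i < text.length) {
            node = node.children[text[i]] ?: break
            i++
            if (node.isTerm) ends.add(i)
        }
        return ends
    }
}

fun main() {
    val trie = SimpleTrie(listOf("お", "寿司", "が", "食", "食べ", "食べる", "たい", "。"))
    val text = "お寿司が食べたい。"
    println(trie.matchEndOffsets(text, 1)) // [3]    -> "寿司"
    println(trie.matchEndOffsets(text, 4)) // [5, 6] -> "食", "食べ"
}
```

A caller only needs to materialize a string for the tokens it finally emits, using the (start, end) offsets returned here.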

(Speculative) What makes Kotori slow

Kotori doesn't rely on any pre-built data structure (e.g. DoubleArrayTrie). It reads a dictionary in a list-of-terms format and builds the Trie just in time. This is a design decision that keeps Kotori open to multiple dictionary formats in exchange for some startup time.
Kotlin code (as written by the inexperienced library author) is slower than Java, mostly because Kotlin's Array<T?> has some overhead compared to Java's native T[].
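
As a rough picture of what reading a list-of-terms dictionary at startup involves, the sketch below parses MeCab-style CSV lines (surface, left id, right id, cost, then features) and indexes them by surface form. It is not Kotori's loader: `TermEntry`, `parseTermLine`, and `buildIndex` are hypothetical names, the parser naively assumes no quoted commas, and the sample ids and costs are made up.

```kotlin
// One entry of a MeCab-style term list: surface,leftId,rightId,cost,features...
data class TermEntry(
    val surface: String,
    val leftId: Int,
    val rightId: Int,
    val cost: Int,
    val features: String
)

// Naive parser: assumes the surface form contains no quoted commas.
fun parseTermLine(line: String): TermEntry {
    val parts = line.split(",", limit = 5)
    require(parts.size == 5) { "Expected at least 5 comma-separated fields: $line" }
    return TermEntry(parts[0], parts[1].toInt(), parts[2].toInt(), parts[3].toInt(), parts[4])
}

// Builds the lookup index at startup instead of loading a pre-built binary structure.
fun buildIndex(lines: List<String>): Map<String, List<TermEntry>> =
    lines.map(::parseTermLine).groupBy { it.surface }

fun main() {
    // Sample lines with made-up ids and costs, for illustration only.
    val lines = listOf(
        "寿司,1,1,5000,名詞,一般",
        "が,2,2,3000,助詞,格助詞",
        "が,3,3,4000,接続詞"
    )
    val index = buildIndex(lines)
    println(index["が"]?.size)             // 2
    println(index["寿司"]?.first()?.cost)  // 5000
}
```

Every line has to be parsed and indexed each time a tokenizer is created, which is where the extra startup time comes from compared with memory-mapping a pre-built structure.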

Benchmark

The benchmark can be run as a Gradle task.

./gradlew benchmark
./gradlew benchmark --args='--tokenizer=kuromoji'
./gradlew benchmark --args='--tokenizer=kotori --dictionary=sudachi-small'

Check the source code in the kotori-benchmark project for more details.
