Kabosu

Ruby bindings for sudachi.rs, a Rust implementation of the Sudachi Japanese morphological analyzer.

Usage

require "kabosu"

# Explicit dictionary + tokenizer lifecycle
dict = Kabosu::Dictionary.new(system_dict: Kabosu::Dictionary.path)
tokenizer = dict.create(mode: :c)

# Tokenize Japanese text
morphemes = Kabosu.tokenize("東京都に住んでいる", tokenizer: tokenizer)

# Bulk accessors for quick extraction
morphemes.surfaces          # => ["東京都", "に", "住ん", "で", "いる"]
morphemes.readings          # => ["トウキョウト", "ニ", "スン", "デ", "イル"]
morphemes.dictionary_forms  # => ["東京都", "に", "住む", "で", "居る"]

# Each morpheme exposes rich linguistic detail
morpheme = morphemes.first
morpheme.surface             # => "東京都"          - surface form (as it appears in text)
morpheme.part_of_speech      # => ["名詞", "固有名詞", "地名", "一般"] - part-of-speech tags
morpheme.part_of_speech_id   # => 5                - numeric POS id
morpheme.dictionary_form     # => "東京都"          - base/dictionary form
morpheme.normalized_form     # => "東京都"          - normalized form
morpheme.reading_form        # => "トウキョウト"     - phonetic reading
morpheme.oov?                # => false            - out-of-vocabulary?
morpheme.dictionary_id       # => 0                - source dictionary id
morpheme.word_id             # => 544373           - internal word id
morpheme.synonym_group_ids   # => []               - synonym group ids
morpheme.dictionary_form_word_id # => -1           - dictionary-form word id
morpheme.head_word_length    # => 3                - head word length in codepoints
morpheme.a_unit_split        # => [123, 456]       - split-A word ids
morpheme.b_unit_split        # => []               - split-B word ids
morpheme.word_structure      # => [123, 456]       - word-structure ids
morpheme.total_cost          # => 5765             - morphological analysis cost
morpheme.begin               # => 0                - start byte offset
morpheme.end                 # => 9                - end byte offset
morpheme.begin_c             # => 0                - start character offset
morpheme.end_c               # => 3                - end character offset
morpheme.system?             # => true             - from system dictionary?
morpheme.user?               # => false            - from user dictionary?

# Split text into natural Japanese sentence boundaries
Kabosu.split_sentences("東京都に住んでいる。大阪も好きだ。")
# => ["東京都に住んでいる。", "大阪も好きだ。"]
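
The byte offsets (begin, end) and character offsets (begin_c, end_c) shown above differ because each of these Japanese characters occupies three bytes in UTF-8. The following plain-Ruby sketch illustrates the relationship without using Kabosu at all. The helper name offsets_for is hypothetical and is not part of the gem's API; it assumes UTF-8 strings and only finds the first occurrence of the surface.

```ruby
# Hypothetical helper (not part of Kabosu): computes byte and character
# offsets of the first occurrence of `surface` inside `text`.
# Assumes both strings are UTF-8 encoded.
def offsets_for(text, surface)
  char_begin = text.index(surface)
  return nil if char_begin.nil?

  # Bytes occupied by everything before the match.
  byte_begin = text[0, char_begin].bytesize

  {
    begin:   byte_begin,                     # start byte offset
    end:     byte_begin + surface.bytesize,  # end byte offset
    begin_c: char_begin,                     # start character offset
    end_c:   char_begin + surface.length     # end character offset
  }
end

text = "東京都に住んでいる"
p offsets_for(text, "東京都")  # => {begin: 0, end: 9, begin_c: 0, end_c: 3}
p offsets_for(text, "に")      # => {begin: 9, end: 12, begin_c: 3, end_c: 4}
```

Use the byte offsets with String#byteslice and the character offsets with ordinary String#[] indexing.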

Installation

Ruby >= 3.1
Rust toolchain (for compiling the native extension)

Add to your Gemfile:

gem "kabosu"

Then install and download a Sudachi dictionary:

bundle install
bundle exec rake kabosu:install[small]  # or core, full

Dictionary editions (from smallest to largest): small, core, full. See the SudachiDict documentation for details on the differences between editions.

Dictionary management

Rake tasks for managing Sudachi dictionaries:

rake kabosu:install[small]     # Install a dictionary (VERSION=YYYYMMDD for a specific version)
rake kabosu:list               # List installed dictionaries
rake kabosu:versions           # Show available versions from GitHub
rake kabosu:path               # Show path to best available dictionary
rake kabosu:remove[small]      # Remove a dictionary (VERSION=YYYYMMDD for a specific version)

Dictionaries are stored in ~/.kabosu/dict/ by default. Set KABOSU_DICT_DIR to customize.
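
As a rough illustration of the override described above, the sketch below resolves a dictionary directory from an environment hash. The helper name kabosu_dict_dir is hypothetical, and Kabosu's actual lookup logic may differ (for example in how it treats empty values); use rake kabosu:path or Kabosu::Dictionary.path for the real answer.

```ruby
# Hypothetical helper (not part of Kabosu): returns KABOSU_DICT_DIR when it
# is set, otherwise the documented default of ~/.kabosu/dict.
def kabosu_dict_dir(env = ENV)
  env.fetch("KABOSU_DICT_DIR") { File.join(Dir.home, ".kabosu", "dict") }
end

# With an override, the custom directory wins.
puts kabosu_dict_dir({ "KABOSU_DICT_DIR" => "/opt/dicts" })
```

Passing the environment as an argument keeps the helper easy to test without mutating ENV.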

Tokenization modes

Sudachi provides three split modes:

Mode	Description
`A`	Short units (most granular)
`B`	Middle units
`C`	Named entity units (default)

dict = Kabosu::Dictionary.new(system_dict: Kabosu::Dictionary.path)
tok_a = dict.create(mode: :a)
tok_c = dict.create(mode: :c)
tok_a.tokenize("東京都").surfaces  # => ["東京", "都"]
tok_c.tokenize("東京都").surfaces  # => ["東京都"]

Modes are accepted only as symbols: :a, :b, :c, or the constants Kabosu::MODE_A, Kabosu::MODE_B, and Kabosu::MODE_C.

Advanced Use Cases

# Custom system dictionary + optional user dictionaries
dict = Kabosu::Dictionary.new(
  system_dict: "/path/to/custom/system.dic",
  user_dicts: ["/path/to/domain.dic", "/path/to/names.dic"]
)

# Create tokenizer with explicit mode/fields
tokenizer = dict.create(mode: :c, fields: %i[surface pos_id reading_form])

# Tokenize (returns MorphemeList; lazily hydrates morphemes)
list = tokenizer.tokenize("国会議事堂前駅")
list.surfaces
list.first.part_of_speech

# Dictionary prefix lookup
dict.lookup("東京都").surfaces

# Morpheme split
m = tokenizer.tokenize("東京都").first
m.split(mode: :a).surfaces

# Sentence splitting
Kabosu.split_sentences("東京都に住んでいる。大阪も好きだ。", ranges: true)
Kabosu.split_sentences("長い文...", limit: 12, with_checker: true)

Benchmarks

Kabosu ships with a benchmark suite that measures tokenization throughput and compares the Ruby bindings against raw sudachi.rs.

The benchmark input is Wagahai wa Neko de Aru (I Am a Cat) by Natsume Soseki, a public-domain text sourced from Aozora Bunko: about 958 KB of Japanese prose in 2,256 lines.

Results

Measured on an AMD Ryzen 7 5800X, full dictionary edition, Ruby 3.4, Rust 1.84:

Single-thread (10 iterations):

Scenario	Rust	Ruby	Ratio
split_sentences	1.550s	1.615s	1.0x
tokenize (mode C)	3.148s	3.395s	1.1x
tokenize (mode A)	3.227s	3.525s	1.1x
tokenize (mode B)	3.226s	3.582s	1.1x
Throughput	2.94 MB/s	2.69 MB/s	1.1x

Multithread (8 threads x 20,000 requests):

Scenario	Rust	Ruby	Ratio
rails-style shared tokenizer	1.475s	2.101s	1.4x
tokenizer per thread	1.381s	2.154s	1.6x
Throughput (shared tokenizer)	20.44 MB/s	14.35 MB/s	1.4x
Throughput (tokenizer per thread)	21.84 MB/s	14.00 MB/s	1.6x

Notes:

Shared tokenizer: matches Rails-style access, where all request threads call one tokenizer instance.
Tokenizer per thread: creates one tokenizer per worker thread.
Ratios for timing rows are Ruby / Rust; throughput rows use Rust / Ruby, so a larger ratio always means Ruby is slower. Values vary by CPU, Ruby version, and dictionary edition.
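
As a quick check on the Ratio column, the timing ratios can be recomputed from the published numbers. This is only a sketch in plain Ruby; the helper name slowdown_ratio is hypothetical and is not part of the benchmark suite in bench/.

```ruby
# Hypothetical helper: Ruby time divided by Rust time, rounded to one
# decimal place as in the tables above.
def slowdown_ratio(rust_seconds, ruby_seconds)
  (ruby_seconds / rust_seconds).round(1)
end

# [rust, ruby] timings in seconds, copied from the Results tables.
timings = {
  "split_sentences"              => [1.550, 1.615],
  "tokenize (mode C)"            => [3.148, 3.395],
  "tokenize (mode A)"            => [3.227, 3.525],
  "tokenize (mode B)"            => [3.226, 3.582],
  "rails-style shared tokenizer" => [1.475, 2.101],
  "tokenizer per thread"         => [1.381, 2.154]
}

timings.each do |scenario, (rust, ruby)|
  puts format("%-30s %.1fx", scenario, slowdown_ratio(rust, ruby))
end
```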

To reproduce these results, run:

bundle exec ruby bench/start

To generate flamegraph SVGs alongside the benchmark:

bundle exec ruby bench/start --profile

This records both the Rust and Ruby runs with perf and produces interactive SVGs (bench/flamegraph-rust.svg, bench/flamegraph-ruby.svg). Open them in a browser to explore.

Contributing

bundle install

bundle exec rake kabosu:install # Install Sudachi dictionary

bundle exec rake compile        # Build the native extension  
bundle exec rake test           # Run tests

bench/start                     # Run benchmarks
