Ruby bindings for sudachi.rs, a Rust implementation of the Sudachi Japanese morphological analyzer.
```ruby
require "kabosu"

# Explicit dictionary + tokenizer lifecycle
dict = Kabosu::Dictionary.new(system_dict: Kabosu::Dictionary.path)
tokenizer = dict.create(mode: :c)

# Tokenize Japanese text
morphemes = Kabosu.tokenize("東京都に住んでいる", tokenizer: tokenizer)

# Bulk accessors for quick extraction
morphemes.surfaces         # => ["東京都", "に", "住ん", "で", "いる"]
morphemes.readings         # => ["トウキョウト", "ニ", "スン", "デ", "イル"]
morphemes.dictionary_forms # => ["東京都", "に", "住む", "で", "居る"]

# Each morpheme exposes rich linguistic detail
morpheme = morphemes.first
morpheme.surface                 # => "東京都" - surface form (as it appears in text)
morpheme.part_of_speech          # => ["名詞", "固有名詞", "地名", "一般"] - part-of-speech tags
morpheme.part_of_speech_id       # => 5 - numeric POS id
morpheme.dictionary_form         # => "東京都" - base/dictionary form
morpheme.normalized_form         # => "東京都" - normalized form
morpheme.reading_form            # => "トウキョウト" - phonetic reading
morpheme.oov?                    # => false - out-of-vocabulary?
morpheme.dictionary_id           # => 0 - source dictionary id
morpheme.word_id                 # => 544373 - internal word id
morpheme.synonym_group_ids       # => [] - synonym group ids
morpheme.dictionary_form_word_id # => -1 - dictionary-form word id
morpheme.head_word_length        # => 3 - head word length in codepoints
morpheme.a_unit_split            # => [123, 456] - split-A word ids
morpheme.b_unit_split            # => [] - split-B word ids
morpheme.word_structure          # => [123, 456] - word-structure ids
morpheme.total_cost              # => 5765 - morphological analysis cost
morpheme.begin                   # => 0 - start byte offset
morpheme.end                     # => 9 - end byte offset
morpheme.begin_c                 # => 0 - start character offset
morpheme.end_c                   # => 3 - end character offset
morpheme.system?                 # => true - from system dictionary?
morpheme.user?                   # => false - from user dictionary?
```
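The byte vs. character offsets above are plain UTF-8 arithmetic, which you can verify without Kabosu at all (each of these kanji encodes to 3 bytes in UTF-8):

```ruby
# Plain Ruby, no Kabosu required: the offset accessors above count
# bytes (begin/end) vs. codepoints (begin_c/end_c) over the UTF-8 text.
text = "東京都に住んでいる"
surface = "東京都"

surface.bytesize     # => 9 (3 kanji x 3 bytes each) - matches morpheme.end
surface.length       # => 3 (codepoints)             - matches morpheme.end_c

# Byte offsets slice the original string with String#byteslice,
# character offsets with the usual String#[]:
text.byteslice(0, 9) # => "東京都"
text[0, 3]           # => "東京都"
```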
```ruby
# Split text into natural Japanese sentence boundaries
Kabosu.split_sentences("東京都に住んでいる。大阪も好きだ。")
# => ["東京都に住んでいる。", "大阪も好きだ。"]
```

- Ruby >= 3.1
- Rust toolchain (for compiling the native extension)
Add to your Gemfile:

```ruby
gem "kabosu"
```

Then install and download a Sudachi dictionary:

```sh
bundle install
bundle exec rake kabosu:install[small] # or core, full
```

Dictionary editions (from smallest to largest): small, core, full. See the SudachiDict documentation for details on the differences between editions.
Rake tasks for managing Sudachi dictionaries:

```sh
rake kabosu:install[small] # Install a dictionary (VERSION=YYYYMMDD for a specific version)
rake kabosu:list           # List installed dictionaries
rake kabosu:versions       # Show available versions from GitHub
rake kabosu:path           # Show path to best available dictionary
rake kabosu:remove[small]  # Remove a dictionary (VERSION=YYYYMMDD for a specific version)
```

Dictionaries are stored in `~/.kabosu/dict/` by default. Set `KABOSU_DICT_DIR` to customize.
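The lookup order for that directory can be sketched in a few lines; `dict_dir` here is a hypothetical helper for illustration, not part of Kabosu's public API, assuming the env var simply overrides the default:

```ruby
# Hypothetical helper (not Kabosu's actual implementation): resolve the
# dictionary directory from KABOSU_DICT_DIR, falling back to ~/.kabosu/dict.
def dict_dir(env = ENV)
  env.fetch("KABOSU_DICT_DIR") { File.join(Dir.home, ".kabosu", "dict") }
end

dict_dir("KABOSU_DICT_DIR" => "/opt/sudachi") # => "/opt/sudachi"
dict_dir({})                                  # ends with ".kabosu/dict"
```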
Sudachi provides three split modes:
| Mode | Description |
|---|---|
| A | Short units (most granular) |
| B | Middle units |
| C | Named entity units (default) |
```ruby
dict = Kabosu::Dictionary.new(system_dict: Kabosu::Dictionary.path)
tok_a = dict.create(mode: :a)
tok_c = dict.create(mode: :c)

tok_a.tokenize("東京都").surfaces # => ["東京", "都"]
tok_c.tokenize("東京都").surfaces # => ["東京都"]
```

Modes are symbols only (`:a`, `:b`, `:c` or `Kabosu::MODE_A`/`B`/`C`).
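A quick plain-Ruby sanity check of the relationship between the modes: the mode-A short units tile the mode-C named-entity unit, so concatenating them reproduces it (surface arrays copied from the example above; Kabosu itself is not needed to run this):

```ruby
# Surfaces copied from the mode example above: mode A's short units
# concatenate back into mode C's single named-entity unit.
a_units = ["東京", "都"] # tok_a.tokenize("東京都").surfaces
c_units = ["東京都"]     # tok_c.tokenize("東京都").surfaces

a_units.join                  # => "東京都"
a_units.join == c_units.first # => true
```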
```ruby
# Custom system dictionary + optional user dictionaries
dict = Kabosu::Dictionary.new(
  system_dict: "/path/to/custom/system.dic",
  user_dicts: ["/path/to/domain.dic", "/path/to/names.dic"]
)

# Create tokenizer with explicit mode/fields
tokenizer = dict.create(mode: :c, fields: %i[surface pos_id reading_form])

# Tokenize (returns MorphemeList; lazily hydrates morphemes)
list = tokenizer.tokenize("国会議事堂前駅")
list.surfaces
list.first.part_of_speech

# Dictionary prefix lookup
dict.lookup("東京都").surfaces

# Morpheme split
m = tokenizer.tokenize("東京都").first
m.split(mode: :a).surfaces

# Sentence splitting
Kabosu.split_sentences("東京都に住んでいる。大阪も好きだ。", ranges: true)
Kabosu.split_sentences("長い文...", limit: 12, with_checker: true)
```

Kabosu ships with a benchmark suite that measures tokenization throughput and compares the Ruby bindings against raw sudachi.rs.
The benchmark uses Wagahai wa Neko de Aru (I Am a Cat) by Natsume Soseki, sourced from Aozora Bunko (public domain), as input: ~958 KB of Japanese prose across 2,256 lines.
Measured on an AMD Ryzen 7 5800X, full dictionary edition, Ruby 3.4, Rust 1.84:
Single-thread (10 iterations):
| Scenario | Rust | Ruby | Ratio |
|---|---|---|---|
| split_sentences | 1.550s | 1.615s | 1.0x |
| tokenize (mode C) | 3.148s | 3.395s | 1.1x |
| tokenize (mode A) | 3.227s | 3.525s | 1.1x |
| tokenize (mode B) | 3.226s | 3.582s | 1.1x |
| Throughput | 2.94 MB/s | 2.69 MB/s | 1.1x |
Multi-thread (8 threads x 20,000 requests):
| Scenario | Rust | Ruby | Ratio |
|---|---|---|---|
| rails-style shared tokenizer | 1.475s | 2.101s | 1.4x |
| tokenizer per thread | 1.381s | 2.154s | 1.6x |
| Throughput (shared tokenizer) | 20.44 MB/s | 14.35 MB/s | 1.4x |
| Throughput (per thread) | 21.84 MB/s | 14.00 MB/s | 1.6x |
Notes:
- `shared tokenizer` matches Rails-style access where all request threads call one tokenizer instance; `per thread` creates one tokenizer per worker thread.
- Ratios are Ruby / Rust, and values vary by CPU, Ruby version, and dictionary edition.
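For example, the 1.4x ratio on the shared-tokenizer row is just the two wall-clock times divided:

```ruby
# Ratio = Ruby time / Rust time (shared-tokenizer row from the table).
ruby_s = 2.101
rust_s = 1.475

(ruby_s / rust_s).round(1) # => 1.4
```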
To reproduce these results, run:

```sh
bundle exec ruby bench/start
```

To generate flamegraph SVGs alongside the benchmark:

```sh
bundle exec ruby bench/start --profile
```

This records both the Rust and Ruby runs with perf and produces interactive SVGs (`bench/flamegraph-rust.svg`, `bench/flamegraph-ruby.svg`). Open them in a browser to explore.
```sh
bundle install
bundle exec rake kabosu:install # Install Sudachi dictionary
bundle exec rake compile        # Build the native extension
bundle exec rake test           # Run tests
bench/start                     # Run benchmarks
```