davafons/kabosu

Kabosu

Ruby bindings for sudachi.rs, a Rust implementation of the Sudachi Japanese morphological analyzer.

Usage

require "kabosu"

# Explicit dictionary + tokenizer lifecycle
dict = Kabosu::Dictionary.new(system_dict: Kabosu::Dictionary.path)
tokenizer = dict.create(mode: :c)

# Tokenize Japanese text
morphemes = Kabosu.tokenize("東京都に住んでいる", tokenizer: tokenizer)

# Bulk accessors for quick extraction
morphemes.surfaces          # => ["東京都", "に", "住ん", "で", "いる"]
morphemes.readings          # => ["トウキョウト", "ニ", "スン", "デ", "イル"]
morphemes.dictionary_forms  # => ["東京都", "に", "住む", "で", "居る"]

# Each morpheme exposes rich linguistic detail
morpheme = morphemes.first
morpheme.surface             # => "東京都"          - surface form (as it appears in text)
morpheme.part_of_speech      # => ["名詞", "固有名詞", "地名", "一般"] - part-of-speech tags
morpheme.part_of_speech_id   # => 5                - numeric POS id
morpheme.dictionary_form     # => "東京都"          - base/dictionary form
morpheme.normalized_form     # => "東京都"          - normalized form
morpheme.reading_form        # => "トウキョウト"     - phonetic reading
morpheme.oov?                # => false            - out-of-vocabulary?
morpheme.dictionary_id       # => 0                - source dictionary id
morpheme.word_id             # => 544373           - internal word id
morpheme.synonym_group_ids   # => []               - synonym group ids
morpheme.dictionary_form_word_id # => -1           - dictionary-form word id
morpheme.head_word_length    # => 3                - head word length in codepoints
morpheme.a_unit_split        # => [123, 456]       - split-A word ids
morpheme.b_unit_split        # => []               - split-B word ids
morpheme.word_structure      # => [123, 456]       - word-structure ids
morpheme.total_cost          # => 5765             - morphological analysis cost
morpheme.begin               # => 0                - start byte offset
morpheme.end                 # => 9                - end byte offset
morpheme.begin_c             # => 0                - start character offset
morpheme.end_c               # => 3                - end character offset
morpheme.system?             # => true             - from system dictionary?
morpheme.user?               # => false            - from user dictionary?

# Split text into natural Japanese sentence boundaries
Kabosu.split_sentences("東京都に住んでいる。大阪も好きだ。")
# => ["東京都に住んでいる。", "大阪も好きだ。"]
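The listing above reports two kinds of offsets for the first morpheme: begin/end (0 and 9) are byte offsets, while begin_c/end_c (0 and 3) are character offsets. The following is a minimal sketch in plain Ruby, with no Kabosu calls, of how those values relate for UTF-8 text. It assumes, based on the values shown above, that the end offsets are exclusive; the variable names are illustrative only.

```ruby
text = "東京都に住んでいる"
surface = "東京都"

# Character offsets count codepoints (begin_c / end_c in the listing above).
begin_c = text.index(surface)
end_c   = begin_c + surface.length

# Byte offsets count UTF-8 bytes (begin / end in the listing above).
# Each of these kanji occupies 3 bytes in UTF-8.
begin_b = text[0, begin_c].bytesize
end_b   = begin_b + surface.bytesize

# Either pair of offsets recovers the same surface string.
by_chars = text[begin_c...end_c]
by_bytes = text.byteslice(begin_b, end_b - begin_b)

puts "chars: #{begin_c}...#{end_c}, bytes: #{begin_b}...#{end_b}"
puts by_chars == by_bytes
```

Byte offsets suit byte-oriented slicing such as String#byteslice, while character offsets map directly onto ordinary Ruby string indexing.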

Installation

Requirements:

  • Ruby >= 3.1
  • Rust toolchain (for compiling the native extension)

Add to your Gemfile:

gem "kabosu"

Then install and download a Sudachi dictionary:

bundle install
bundle exec rake kabosu:install[small]  # or core, full

Dictionary editions (from smallest to largest): small, core, full. See the SudachiDict documentation for details on the differences between editions.

Dictionary management

Rake tasks for managing Sudachi dictionaries:

rake kabosu:install[small]     # Install a dictionary (VERSION=YYYYMMDD for a specific version)
rake kabosu:list               # List installed dictionaries
rake kabosu:versions           # Show available versions from GitHub
rake kabosu:path               # Show path to best available dictionary
rake kabosu:remove[small]      # Remove a dictionary (VERSION=YYYYMMDD for a specific version)

Dictionaries are stored in ~/.kabosu/dict/ by default. Set KABOSU_DICT_DIR to customize.
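If your own scripts need to know where dictionaries live (for example, to cache the directory in CI), the rule above can be mirrored in plain Ruby. This is a sketch only: kabosu_dict_dir is a hypothetical helper, not part of Kabosu, and it assumes the gem checks the variable first and then falls back to the default location. The rake kabosu:path task remains the authoritative way to find the dictionary Kabosu will use.

```ruby
# Hypothetical helper (not part of the Kabosu API): resolve the dictionary
# directory as described above, preferring KABOSU_DICT_DIR when it is set.
def kabosu_dict_dir(env = ENV)
  env.fetch("KABOSU_DICT_DIR") { File.join(Dir.home, ".kabosu", "dict") }
end

puts kabosu_dict_dir
```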

Tokenization modes

Sudachi provides three split modes:

Mode  Description
A     Short units (most granular)
B     Middle units
C     Named entity units (default)

dict = Kabosu::Dictionary.new(system_dict: Kabosu::Dictionary.path)
tok_a = dict.create(mode: :a)
tok_c = dict.create(mode: :c)
tok_a.tokenize("東京都").surfaces  # => ["東京", "都"]
tok_c.tokenize("東京都").surfaces  # => ["東京都"]

Modes must be given as symbols: :a, :b, or :c, or the equivalent constants Kabosu::MODE_A, Kabosu::MODE_B, and Kabosu::MODE_C.

Advanced Use Cases

# Custom system dictionary + optional user dictionaries
dict = Kabosu::Dictionary.new(
  system_dict: "/path/to/custom/system.dic",
  user_dicts: ["/path/to/domain.dic", "/path/to/names.dic"]
)

# Create tokenizer with explicit mode/fields
tokenizer = dict.create(mode: :c, fields: %i[surface pos_id reading_form])

# Tokenize (returns MorphemeList; lazily hydrates morphemes)
list = tokenizer.tokenize("国会議事堂前駅")
list.surfaces
list.first.part_of_speech

# Dictionary prefix lookup
dict.lookup("東京都").surfaces

# Morpheme split
m = tokenizer.tokenize("東京都").first
m.split(mode: :a).surfaces

# Sentence splitting
Kabosu.split_sentences("東京都に住んでいる。大阪も好きだ。", ranges: true)
Kabosu.split_sentences("長い文...", limit: 12, with_checker: true)

Benchmarks

Kabosu ships with a benchmark suite that measures tokenization throughput and compares the Ruby bindings against raw sudachi.rs.

The benchmark input is Wagahai wa Neko de Aru (I Am a Cat) by Natsume Soseki, a public-domain text sourced from Aozora Bunko: about 958 KB of Japanese prose across 2,256 lines.

Results

Measured on an AMD Ryzen 7 5800X, full dictionary edition, Ruby 3.4, Rust 1.84:

Single-thread (10 iterations):

Scenario            Rust        Ruby        Ratio
split_sentences     1.550s      1.615s      1.0x
tokenize (mode C)   3.148s      3.395s      1.1x
tokenize (mode A)   3.227s      3.525s      1.1x
tokenize (mode B)   3.226s      3.582s      1.1x
Throughput          2.94 MB/s   2.69 MB/s   1.1x
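As a quick check, the Ratio column for the timing rows can be recomputed by dividing the Ruby time by the Rust time and rounding to one decimal place. The sketch below does that with the figures from the single-thread table; the hash layout and variable names are illustrative and not part of the benchmark suite.

```ruby
# Timings in seconds from the single-thread table: [rust, ruby].
timings = {
  "split_sentences"   => [1.550, 1.615],
  "tokenize (mode C)" => [3.148, 3.395],
  "tokenize (mode A)" => [3.227, 3.525],
  "tokenize (mode B)" => [3.226, 3.582]
}

# Ratio = Ruby time / Rust time, rounded to one decimal place.
ratios = timings.transform_values { |rust, ruby| (ruby / rust).round(1) }

ratios.each { |scenario, ratio| puts format("%-18s %.1fx", scenario, ratio) }
```

For the Throughput row the division runs the other way (Rust / Ruby, 2.94 / 2.69), since higher throughput is better.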

Multithread (8 threads x 20,000 requests):

Scenario                          Rust         Ruby         Ratio
rails-style shared tokenizer      1.475s       2.101s       1.4x
tokenizer per thread              1.381s       2.154s       1.6x
Throughput (shared tokenizer)     20.44 MB/s   14.35 MB/s   1.4x
Throughput (tokenizer per thread) 21.84 MB/s   14.00 MB/s   1.6x

Notes:

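  • The shared tokenizer scenario mirrors Rails-style access, where all request threads call a single tokenizer instance.
  • The tokenizer per thread scenario creates one tokenizer for each worker thread.
  • Ratios are Ruby / Rust for timings and Rust / Ruby for throughput, so a higher ratio means the Ruby bindings are slower. Values vary by CPU, Ruby version, and dictionary edition.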
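The two multithread scenarios differ only in where the tokenizer is created. The sketch below shows both shapes in plain Ruby so that it runs without Kabosu or a dictionary installed. StubTokenizer is a hypothetical stand-in, not part of the Kabosu API, and the request count is scaled down; in real code the stub would be replaced by a tokenizer from dict.create(mode: :c). This is an illustration of the access patterns, not the benchmark's actual implementation.

```ruby
# Hypothetical stand-in for a Kabosu tokenizer; it only splits into characters.
class StubTokenizer
  def tokenize(text)
    text.chars
  end
end

THREADS = 8
REQUESTS_PER_THREAD = 100 # scaled down for illustration
TEXT = "東京都に住んでいる"

# Shared tokenizer: every thread calls the same instance.
def run_shared(text)
  shared = StubTokenizer.new
  Array.new(THREADS) do
    Thread.new do
      REQUESTS_PER_THREAD.times.sum { shared.tokenize(text).size }
    end
  end.sum(&:value)
end

# Tokenizer per thread: each worker builds and uses its own instance.
def run_per_thread(text)
  Array.new(THREADS) do
    Thread.new do
      own = StubTokenizer.new
      REQUESTS_PER_THREAD.times.sum { own.tokenize(text).size }
    end
  end.sum(&:value)
end

shared_total = run_shared(TEXT)
per_thread_total = run_per_thread(TEXT)
puts "shared: #{shared_total}, per thread: #{per_thread_total}"
```

Both functions do the same amount of work, so any timing difference between them in a real run comes from how the tokenizer instance is shared.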

To reproduce these results, run:

bundle exec ruby bench/start

To generate flamegraph SVGs alongside the benchmark:

bundle exec ruby bench/start --profile

This records both the Rust and Ruby runs with perf and produces interactive SVGs (bench/flamegraph-rust.svg, bench/flamegraph-ruby.svg). Open them in a browser to explore.

Contributing

bundle install

bundle exec rake kabosu:install # Install Sudachi dictionary

bundle exec rake compile        # Build the native extension  
bundle exec rake test           # Run tests

bench/start                     # Run benchmarks
