Python version of Sudachi, a Japanese morphological analyzer.


SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer.

Sudachi and SudachiPy are developed at WAP Tokushima Laboratory of AI and NLP, an institute under Works Applications that focuses on Natural Language Processing (NLP).

Warning: SudachiPy is still under development, and some of the functions are still not complete. Please use it at your own risk.


SudachiPy requires Python 3.5+.

SudachiPy is not registered on PyPI yet, so it cannot currently be installed by package name with pip. Install it from the Git repository instead:

$ pip install -e git+git://

The dictionary file is not included in the repository. You can get the built dictionary from Releases · WorksApplications/Sudachi. Download one of the dictionary archives offered there, unzip it, rename the dictionary file to system.dic, and place it under SudachiPy/resources/. Eventually, we would like to fetch these resources from code, as NLTK and spaCy do (e.g., $ python -m spacy download en).
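
The rename-and-place step can be scripted. The following is a minimal sketch, not part of SudachiPy: install_dictionary is a hypothetical helper, and both paths are assumptions supplied by the caller (the unzipped dictionary file and your local SudachiPy/resources/ directory).

```python
import shutil
from pathlib import Path


def install_dictionary(downloaded_dic, resources_dir):
    """Copy an unzipped dictionary file into resources_dir as system.dic.

    Hypothetical helper: both arguments are paths chosen by the caller.
    """
    source = Path(downloaded_dic)
    if not source.is_file():
        raise FileNotFoundError("dictionary file not found: {}".format(source))

    target_dir = Path(resources_dir)
    # Create the resources directory if it does not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)

    target = target_dir / "system.dic"
    # str() keeps the call compatible with Python 3.5.
    shutil.copyfile(str(source), str(target))
    return target
```

Calling install_dictionary with the path of the unzipped file and the path of SudachiPy/resources/ leaves a copy named system.dic in the resources directory and returns its path.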


As a command

After installing SudachiPy, you can also use it from the terminal via the sudachipy command.

$ sudachipy -h
usage: sudachipy [-h] [-r file] [-m {A,B,C}] [-o file] [-a] [-d] [-v] ...

Japanese Morphological Analyzer

positional arguments:
  input file(s)

optional arguments:
  -h, --help     show this help message and exit
  -r file        the setting file in JSON format
  -m {A,B,C}     the mode of splitting
  -o file        the output file
  -a             print all of the fields
  -d             print the debug information
  -v, --version  show program's version number and exit
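
If you want to drive the command from a script, the options listed above can be assembled programmatically. This is a rough sketch: build_sudachipy_command is a hypothetical helper (not part of SudachiPy), and it relies only on the flags shown in the help text.

```python
def build_sudachipy_command(inputs, mode=None, output=None, settings=None,
                            all_fields=False, debug=False):
    """Build an argument list for the sudachipy command (hypothetical helper)."""
    command = ["sudachipy"]
    if settings is not None:
        command += ["-r", settings]   # setting file in JSON format
    if mode is not None:
        if mode not in ("A", "B", "C"):
            raise ValueError("mode must be one of A, B, C")
        command += ["-m", mode]       # mode of splitting
    if output is not None:
        command += ["-o", output]     # output file
    if all_fields:
        command.append("-a")          # print all of the fields
    if debug:
        command.append("-d")          # print the debug information
    command += list(inputs)           # positional input file(s)
    return command
```

The resulting list can be passed to subprocess.run(command, check=True) once SudachiPy is installed and the sudachipy command is on your PATH.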

As a Python package

Here is a usage example:

import json

from sudachipy import tokenizer
from sudachipy import dictionary
from sudachipy import config

with open(config.SETTINGFILE, "r", encoding="utf-8") as f:
    settings = json.load(f)
tokenizer_obj = dictionary.Dictionary(settings).create()

# Multi-granular tokenization
# (the following results are with `system_full.dic`;
# you may not be able to replicate this particular example with `system_core.dic`)

mode = tokenizer.Tokenizer.SplitMode.C
[m.surface() for m in tokenizer_obj.tokenize(mode, "医薬品安全管理責任者")]
# => ['医薬品安全管理責任者']

mode = tokenizer.Tokenizer.SplitMode.B
[m.surface() for m in tokenizer_obj.tokenize(mode, "医薬品安全管理責任者")]
# => ['医薬品', '安全', '管理', '責任者']

mode = tokenizer.Tokenizer.SplitMode.A
[m.surface() for m in tokenizer_obj.tokenize(mode, "医薬品安全管理責任者")]
# => ['医薬', '品', '安全', '管理', '責任', '者']

# Morpheme information

m = tokenizer_obj.tokenize(mode, "食べ")[0]

m.surface() # => '食べ'
m.dictionary_form() # => '食べる'
m.reading_form() # => 'タベ'
m.part_of_speech() # => ['動詞', '一般', '*', '*', '下一段-バ行', '連用形-一般']

# Normalization

tokenizer_obj.tokenize(mode, "附属")[0].normalized_form()
# => '付属'
tokenizer_obj.tokenize(mode, "SUMMER")[0].normalized_form()
# => 'サマー'
tokenizer_obj.tokenize(mode, "シュミレーション")[0].normalized_form()
# => 'シミュレーション'
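
The per-morpheme accessors shown above can be gathered into one record per morpheme. A minimal sketch: describe_morphemes is a hypothetical helper, and it assumes tokenizer_obj and mode were created as in the example and that each morpheme exposes the methods used there.

```python
def describe_morphemes(tokenizer_obj, mode, text):
    """Return one dict per morpheme, using the accessors from the example above.

    Hypothetical helper: tokenizer_obj and mode come from the caller.
    """
    return [
        {
            "surface": m.surface(),
            "dictionary_form": m.dictionary_form(),
            "reading_form": m.reading_form(),
            "normalized_form": m.normalized_form(),
            "part_of_speech": m.part_of_speech(),
        }
        for m in tokenizer_obj.tokenize(mode, text)
    ]
```

For instance, describe_morphemes(tokenizer_obj, mode, "食べ") would return a list whose first entry holds the same values as the individual calls in the morpheme information example.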