Python version of Sudachi, a Japanese morphological analyzer.
kazuma-t Merge pull request #18 from mana-ysh/fix-default-inhibited-connection
Fix the bug related to default inhibited connection refs #17
Latest commit 09449c1 Nov 1, 2018

SudachiPy

SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer.

Sudachi and SudachiPy are developed at WAP Tokushima Laboratory of AI and NLP, an institute under Works Applications that focuses on Natural Language Processing (NLP).

Warning: SudachiPy is still under development, and some of the functions are still not complete. Please use it at your own risk.

Setup

SudachiPy requires Python 3.5+.

SudachiPy is not yet registered on PyPI, so it cannot be installed by package name with pip at the moment. Install it directly from the GitHub repository instead:

$ pip install -e git+git://github.com/WorksApplications/SudachiPy@develop#egg=SudachiPy

The dictionary file is not included in the repository. You can get a built dictionary from Releases · WorksApplications/Sudachi. Download either sudachi-x.y.z-dictionary-core.zip or sudachi-x.y.z-dictionary-full.zip, unzip it, rename the extracted dictionary file to system.dic, and place it under SudachiPy/resources/. Eventually, we would like to provide a way to fetch these resources from code, as NLTK (e.g., import nltk; nltk.download()) and spaCy (e.g., $ python -m spacy download en) do.
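
The manual steps above can be sketched in Python using only the standard library. This is a minimal sketch and not part of SudachiPy: the function name install_dictionary is hypothetical, and it assumes the downloaded archive contains exactly one .dic file and that you pass the path to your local SudachiPy/resources/ directory.

```python
import shutil
import zipfile
from pathlib import Path


def install_dictionary(zip_path, resources_dir):
    """Extract the single .dic file from zip_path into resources_dir as system.dic.

    Hypothetical helper; assumes the archive holds exactly one .dic file.
    """
    resources_dir = Path(resources_dir)
    resources_dir.mkdir(parents=True, exist_ok=True)
    target = resources_dir / "system.dic"

    with zipfile.ZipFile(str(zip_path)) as archive:
        dic_names = [n for n in archive.namelist() if n.endswith(".dic")]
        if len(dic_names) != 1:
            raise ValueError(
                "expected exactly one .dic file, found %d" % len(dic_names)
            )
        # Stream the member straight to its final name, ignoring any folder
        # structure inside the archive.
        with archive.open(dic_names[0]) as src, open(str(target), "wb") as dst:
            shutil.copyfileobj(src, dst)
    return target
```

For example, install_dictionary("sudachi-x.y.z-dictionary-core.zip", "SudachiPy/resources") would write the extracted dictionary to SudachiPy/resources/system.dic, where x.y.z stands for the release you downloaded.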

Usage

As a command

After installing SudachiPy, you can also use it from the terminal via the sudachipy command.

$ sudachipy -h
usage: sudachipy [-h] [-r file] [-m {A,B,C}] [-o file] [-a] [-d] [-v] ...

Japanese Morphological Analyzer

positional arguments:
  input file(s)

optional arguments:
  -h, --help     show this help message and exit
  -r file        the setting file in JSON format
  -m {A,B,C}     the mode of splitting
  -o file        the output file
  -a             print all of the fields
  -d             print the debug information
  -v, --version  show program's version number and exit
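
If you want to drive the command from a Python script, one option is to build the argument list from the options shown in the help text and hand it to subprocess. The sketch below is an illustration only: build_sudachipy_command and run_sudachipy are hypothetical helpers, they use only the -m, -o and -a options listed above, and running them assumes the sudachipy command is on your PATH with a dictionary in place.

```python
import subprocess


def build_sudachipy_command(input_files, mode, output_file=None, all_fields=False):
    """Build an argument list for the sudachipy command (hypothetical helper).

    Only options listed in `sudachipy -h` are used: -m, -o and -a.
    """
    if mode not in ("A", "B", "C"):
        raise ValueError("mode must be one of 'A', 'B', 'C'")
    command = ["sudachipy", "-m", mode]
    if output_file is not None:
        command += ["-o", output_file]
    if all_fields:
        command.append("-a")
    # Input files are the positional arguments and go last.
    command += list(input_files)
    return command


def run_sudachipy(input_files, mode, output_file=None, all_fields=False):
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    command = build_sudachipy_command(input_files, mode, output_file, all_fields)
    return subprocess.run(command, check=True)
```

Keeping the command construction separate from the call that runs it makes the argument list easy to inspect or log before anything is executed.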

As a Python package

Here is an example of usage:

import json

from sudachipy import tokenizer
from sudachipy import dictionary
from sudachipy import config

with open(config.SETTINGFILE, "r", encoding="utf-8") as f:
    settings = json.load(f)
tokenizer_obj = dictionary.Dictionary(settings).create()


# Multi-granular tokenization
# (The following results are with `system_full.dic`;
# you may not be able to replicate this particular example with `system_core.dic`.)


mode = tokenizer.Tokenizer.SplitMode.C
[m.surface() for m in tokenizer_obj.tokenize(mode, "医薬品安全管理責任者")]
# => ['医薬品安全管理責任者']

mode = tokenizer.Tokenizer.SplitMode.B
[m.surface() for m in tokenizer_obj.tokenize(mode, "医薬品安全管理責任者")]
# => ['医薬品', '安全', '管理', '責任者']

mode = tokenizer.Tokenizer.SplitMode.A
[m.surface() for m in tokenizer_obj.tokenize(mode, "医薬品安全管理責任者")]
# => ['医薬', '品', '安全', '管理', '責任', '者']


# Morpheme information

m = tokenizer_obj.tokenize(mode, "食べ")[0]

m.surface() # => '食べ'
m.dictionary_form() # => '食べる'
m.reading_form() # => 'タベ'
m.part_of_speech() # => ['動詞', '一般', '*', '*', '下一段-バ行', '連用形-一般']


# Normalization

tokenizer_obj.tokenize(mode, "附属")[0].normalized_form()
# => '付属'
tokenizer_obj.tokenize(mode, "SUMMER")[0].normalized_form()
# => 'サマー'
tokenizer_obj.tokenize(mode, "シュミレーション")[0].normalized_form()
# => 'シミュレーション'
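
To collect the fields shown above for every morpheme at once, a small helper can be convenient. The sketch below relies only on the morpheme methods used in the example (surface, dictionary_form, reading_form, normalized_form, part_of_speech); the helper name describe_morphemes is hypothetical and not part of SudachiPy.

```python
def describe_morphemes(morphemes):
    """Return one dict per morpheme with the fields used in the example above.

    Hypothetical helper; `morphemes` is any iterable of objects exposing the
    surface(), dictionary_form(), reading_form(), normalized_form() and
    part_of_speech() methods, such as the result of tokenizer_obj.tokenize().
    """
    return [
        {
            "surface": m.surface(),
            "dictionary_form": m.dictionary_form(),
            "reading_form": m.reading_form(),
            "normalized_form": m.normalized_form(),
            # Copy the part-of-speech sequence so callers get a plain list.
            "part_of_speech": list(m.part_of_speech()),
        }
        for m in morphemes
    ]
```

With the objects from the example, describe_morphemes(tokenizer_obj.tokenize(mode, "食べ")) would return a list with one such dict per morpheme.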