Let's just try it out!

Here's a guide to getting started with MeCab parsing using natto-py.

This requires:

- Python 2.7.8 or greater
- an existing installation of MeCab with a system dictionary
- either:
  - automatic configuration: make sure that mecab (and mecab-config, if you are on Mac OS or *nix) is on your PATH
  - or explicit configuration: set the MECAB_PATH environment variable to the full file path of the mecab library
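
Before instantiating anything, it can help to check which of the two configuration routes looks available on your machine. The following is a minimal sketch using only the Python standard library; it assumes Python 3.3 or later (for `shutil.which`), and the helper name `mecab_config_hints` is made up for illustration and is not part of natto-py.

```python
import os
import shutil


def mecab_config_hints(environ=None):
    """Report which configuration route looks available (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    search_path = environ.get("PATH")
    return {
        # automatic configuration: the executables must be discoverable on PATH
        "mecab_on_path": shutil.which("mecab", path=search_path) is not None,
        "mecab_config_on_path": shutil.which("mecab-config", path=search_path) is not None,
        # explicit configuration: MECAB_PATH should name the mecab library file
        "mecab_path": environ.get("MECAB_PATH"),
    }
```

Calling `mecab_config_hints()` with no arguments inspects the current process environment. Passing a plain dict instead makes the check easy to try out with made-up values.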

Instantiate a reference to the mecab tagger, and display some details:

 from natto import MeCab

 nm = MeCab()
 print(nm)
 
 # displays details about the MeCab instance
 <natto.mecab.MeCab
  model=<cdata 'mecab_model_t *' 0x801c16300>,
  tagger=<cdata 'mecab_t *' 0x801c17470>,
  lattice=<cdata 'mecab_lattice_t *' 0x801c196c0>,
  libpath="/usr/local/lib/libmecab.so",
  options={},
  dicts=[<natto.dictionary.DictionaryInfo
          dictionary='mecab_dictionary_info_t *' 0x801c19540>,
          filepath="/usr/local/lib/mecab/dic/ipadic/sys.dic",
          charset=utf8,
          type=0],
  version=0.996>

Display details about the mecab system dictionary used:

 sysdic = nm.dicts[0]
 print(sysdic)
 
 # displays the MeCab system dictionary info
 <natto.dictionary.DictionaryInfo
  dictionary='mecab_dictionary_info_t *' 0x801c19540>,
  filepath="/usr/local/lib/mecab/dic/ipadic/sys.dic",
  charset=utf8,
  type=0>
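
The `type=0` shown above is a numeric code. As a small sketch, the mapping below turns it into a readable label; the codes follow the constants in MeCab's C header (0 for a system dictionary, 1 for a user dictionary, 2 for an unknown-word dictionary), while the names `MECAB_DICT_TYPES` and `describe_dict_type` are hypothetical and not part of natto-py.

```python
# Dictionary type codes as defined in MeCab's C header (mecab.h).
MECAB_DICT_TYPES = {
    0: "system dictionary",
    1: "user dictionary",
    2: "unknown-word dictionary",
}


def describe_dict_type(type_code):
    """Return a readable label for a dictionary type code (hypothetical helper)."""
    return MECAB_DICT_TYPES.get(type_code, "unrecognized type {}".format(type_code))


print(describe_dict_type(0))
```

With a live instance, something like `describe_dict_type(nm.dicts[0].type)` should label the system dictionary listed above.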

Parse Japanese text and send the MeCab result as a string to stdout:

 print(nm.parse('この星の一等賞になりたいの卓球で俺は、そんだけ！'))

 この    連体詞,*,*,*,*,*,この,コノ,コノ
 星      名詞,一般,*,*,*,*,星,ホシ,ホシ
 の      助詞,連体化,*,*,*,*,の,ノ,ノ
 一等    名詞,一般,*,*,*,*,一等,イットウ,イットー
 賞      名詞,接尾,一般,*,*,*,賞,ショウ,ショー
 に      助詞,格助詞,一般,*,*,*,に,ニ,ニ
 なり    動詞,自立,*,*,五段・ラ行,連用形,なる,ナリ,ナリ
 たい    助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ
 の      助詞,連体化,*,*,*,*,の,ノ,ノ
 卓球    名詞,サ変接続,*,*,*,*,卓球,タッキュウ,タッキュー
 で      助詞,格助詞,一般,*,*,*,で,デ,デ
 俺      名詞,代名詞,一般,*,*,*,俺,オレ,オレ
 は      助詞,係助詞,*,*,*,*,は,ハ,ワ
 、      記号,読点,*,*,*,*,、,、,、
 そん    名詞,一般,*,*,*,*,そん,ソン,ソン
 だけ    助詞,副助詞,*,*,*,*,だけ,ダケ,ダケ
 ！      記号,一般,*,*,*,*,！,！,！
 EOS
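
Since the result here is one plain string, any further processing has to split it apart again. Below is a minimal sketch of doing so, assuming MeCab's default output layout (surface, a tab, then comma-separated features, with a final `EOS` line). The function name `split_mecab_output` is made up for illustration, and the sample string is a shortened copy of the output above so that the snippet runs without MeCab installed.

```python
def split_mecab_output(text):
    """Turn MeCab's default text output into (surface, features) pairs."""
    rows = []
    for line in text.split("\n"):
        if not line or line == "EOS":
            continue  # skip blank lines and the end-of-sentence marker
        # in the default output, a tab separates the surface from the feature string
        surface, _, feature = line.partition("\t")
        rows.append((surface, feature.split(",")))
    return rows


# Shortened sample in the default layout; with MeCab available, use nm.parse(...) instead.
sample = (
    "星\t名詞,一般,*,*,*,*,星,ホシ,ホシ\n"
    "の\t助詞,連体化,*,*,*,*,の,ノ,ノ\n"
    "EOS\n"
)

for surface, features in split_mecab_output(sample):
    print(surface, features[0])
```

If you need structured access to each morpheme, the node-based parsing shown next is usually the more direct route, since it avoids re-parsing text.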

Parse the given text and use a generator to iterate over the nodes:

 # use a Python with-statement to ensure mecab_destroy is invoked
 # only output normal nodes, ignoring any end-of-sentence and unknown nodes 
 #
 with MeCab() as nm:
     for n in nm.parse('飛べねえ鳥もいるってこった。', as_nodes=True):
         if n.is_nor():
             print('{}\t{}'.format(n.surface, n.cost))

 飛べ    8101
 ねえ    6416
 鳥      12029
 も      12540
 いる    16477
 って    20631
 こっ    30320
 た      29380
 。      25946
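
The same filtering can be wrapped in a small generator of your own, so that callers receive only what they need. This is a sketch under stated assumptions: `surface_costs` is a hypothetical helper, and `StubNode` is a stand-in that mimics only the `surface`, `cost` and `is_nor()` members used above, so the snippet runs without MeCab installed.

```python
class StubNode(object):
    """Hypothetical stand-in for a natto node, so the sketch runs without MeCab."""

    def __init__(self, surface, cost, normal=True):
        self.surface = surface
        self.cost = cost
        self._normal = normal

    def is_nor(self):
        return self._normal


def surface_costs(nodes):
    """Lazily yield (surface, cost) for normal nodes only."""
    for node in nodes:
        if node.is_nor():
            yield node.surface, node.cost


# Two normal nodes followed by a non-normal one (for example, end-of-sentence).
stub_nodes = [StubNode("飛べ", 8101), StubNode("ねえ", 6416), StubNode("", 0, normal=False)]
pairs = list(surface_costs(stub_nodes))
print(pairs)
```

With a real tagger, the stub list would be replaced by `nm.parse(text, as_nodes=True)` inside the with-statement.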

Combine node-parsing with a custom node-format for more interesting processing:

 # use a Python with-statement to ensure mecab_destroy is invoked
 # only output normal nodes, ignoring any end-of-sentence and unknown nodes 
 #
 # -F      ... short-form of --node-format
 # %m      ... morpheme surface
 # %h      ... part-of-speech ID
 # %f[0,1] ... part-of-speech & part-of-speech sub-class 1, tab-delimited
 #
 # NOTE: the \ char itself needs to be escaped
 #       use \\t instead of \t
 #       use \\n instead of \n
 #
 with MeCab('-F%m\\t%h\\t%f[0,1]') as nm:
     for n in nm.parse('あんたはオイラに飛び方を教えてくれた。', as_nodes=True):
         if n.is_nor():
             print(n.feature)

 あんた  59      名詞    代名詞
 は      16      助詞    係助詞
 オイラ  59      名詞    代名詞
 に      13      助詞    格助詞
 飛び    31      動詞    自立
 方      57      名詞    接尾
 を      13      助詞    格助詞
 教え    31      動詞    自立
 て      18      助詞    接続助詞
 くれ    33      動詞    非自立
 た      25      助動詞
 。      7       記号    句点
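
As one example of such processing, the formatted feature strings can be grouped by part-of-speech ID. The sketch below relies only on the `\\t` separators written into the node format, so it reads just the first two fields (surface and `%h`). The names `NODE_FORMAT` and `surfaces_by_pos_id` are made up for illustration, and the sample lines are copied from the output above so that the snippet runs without MeCab installed. It also shows that a Python raw string is an alternative to doubling the backslashes.

```python
from collections import defaultdict

# Raw string: the same characters as '-F%m\\t%h\\t%f[0,1]', without doubling the backslashes.
NODE_FORMAT = r'-F%m\t%h\t%f[0,1]'


def surfaces_by_pos_id(feature_lines):
    """Group morpheme surfaces by part-of-speech ID (the %h field)."""
    groups = defaultdict(list)
    for line in feature_lines:
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # not in the expected surface<TAB>pos-id layout
        surface, pos_id = fields[0], fields[1]
        groups[int(pos_id)].append(surface)
    return dict(groups)


# Sample feature strings as displayed above; with MeCab available, collect n.feature instead.
sample = ["あんた\t59\t名詞\t代名詞", "は\t16\t助詞\t係助詞", "オイラ\t59\t名詞\t代名詞"]
groups = surfaces_by_pos_id(sample)
print(groups)
```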

