Let's Just Try It Out!
Brooke M. Fujita edited this page Apr 17, 2015 · 4 revisions
Here's a guide to getting started with MeCab parsing using natto-py.
This requires:
- Python 2.7.8 or greater
- an existing installation of MeCab with a system dictionary
- either:
  - automatic configuration: just make sure that `mecab` (and `mecab-config` if you are on Mac OS or *nix) are on your PATH
  - or explicit configuration: the `MECAB_PATH` environment variable set to the full filepath to the `mecab` library
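Before going further, it can help to confirm which of the two configuration styles your environment satisfies. The following is only a minimal sketch: `check_mecab_setup` is a hypothetical helper name (not part of natto-py), and it relies on `shutil.which`, which assumes Python 3.3 or later.

```python
import os
import shutil


def check_mecab_setup():
    """Report what each configuration style would find (hypothetical helper)."""
    return {
        # Automatic configuration: the executables must be on PATH.
        'mecab': shutil.which('mecab'),
        'mecab-config': shutil.which('mecab-config'),
        # Explicit configuration: full filepath to the mecab library.
        'MECAB_PATH': os.environ.get('MECAB_PATH'),
    }


# Print one line per item; None means the item was not found or not set.
for name, value in sorted(check_mecab_setup().items()):
    print('{:<13} {}'.format(name, value if value else '(not found)'))
```

If both `mecab` entries are missing and `MECAB_PATH` is unset, neither configuration style described above is in place yet.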
- Instantiate a reference to the `mecab` tagger, and display some details:

      from natto import MeCab
      nm = MeCab()
      print(nm)

      # displays details about the MeCab instance
      <natto.mecab.MeCab model=<cdata 'mecab_model_t *' 0x801c16300>, tagger=<cdata 'mecab_t *' 0x801c17470>, lattice=<cdata 'mecab_lattice_t *' 0x801c196c0>, libpath="/usr/local/lib/libmecab.so", options={}, dicts=[<natto.dictionary.DictionaryInfo dictionary='mecab_dictionary_info_t *' 0x801c19540>, filepath="/usr/local/lib/mecab/dic/ipadic/sys.dic", charset=utf8, type=0], version=0.996>
- Display details about the `mecab` system dictionary used:

      sysdic = nm.dicts[0]
      print(sysdic)

      # displays the MeCab system dictionary info
      <natto.dictionary.DictionaryInfo dictionary='mecab_dictionary_info_t *' 0x801c19540>, filepath="/usr/local/lib/mecab/dic/ipadic/sys.dic", charset=utf8, type=0>
- Parse Japanese text and send the MeCab result as a string to `stdout`:

      print(nm.parse('この星の一等賞になりたいの卓球で俺は、そんだけ!'))

      この 連体詞,*,*,*,*,*,この,コノ,コノ
      星 名詞,一般,*,*,*,*,星,ホシ,ホシ
      の 助詞,連体化,*,*,*,*,の,ノ,ノ
      一等 名詞,一般,*,*,*,*,一等,イットウ,イットー
      賞 名詞,接尾,一般,*,*,*,賞,ショウ,ショー
      に 助詞,格助詞,一般,*,*,*,に,ニ,ニ
      なり 動詞,自立,*,*,五段・ラ行,連用形,なる,ナリ,ナリ
      たい 助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ
      の 助詞,連体化,*,*,*,*,の,ノ,ノ
      卓球 名詞,サ変接続,*,*,*,*,卓球,タッキュウ,タッキュー
      で 助詞,格助詞,一般,*,*,*,で,デ,デ
      俺 名詞,代名詞,一般,*,*,*,俺,オレ,オレ
      は 助詞,係助詞,*,*,*,*,は,ハ,ワ
      、 記号,読点,*,*,*,*,、,、,、
      そん 名詞,一般,*,*,*,*,そん,ソン,ソン
      だけ 助詞,副助詞,*,*,*,*,だけ,ダケ,ダケ
      ! 記号,一般,*,*,*,*,!,!,!
      EOS
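The string printed above is plain text: one morpheme per line, with the surface and the comma-separated feature string separated by a tab character, and a final `EOS` line. If you want structured records instead, a sketch like the following may be enough. `Morpheme` and `parse_mecab_output` are hypothetical names introduced here, the code assumes Python 3 and the IPADIC feature layout shown above, and the sample string is copied from the first two output lines rather than produced by a live MeCab call.

```python
from collections import namedtuple

# Hypothetical record type: surface form plus the list of feature fields.
Morpheme = namedtuple('Morpheme', ['surface', 'features'])


def parse_mecab_output(text):
    """Split MeCab's default text output into Morpheme records."""
    morphemes = []
    for line in text.splitlines():
        # Skip blank lines and the end-of-sentence marker.
        if not line or line == 'EOS':
            continue
        surface, feature = line.split('\t', 1)
        # Assumes no feature field contains an ASCII comma.
        morphemes.append(Morpheme(surface, feature.split(',')))
    return morphemes


# Sample copied from the output above (first two morphemes, then EOS).
sample = (
    'この\t連体詞,*,*,*,*,*,この,コノ,コノ\n'
    '星\t名詞,一般,*,*,*,*,星,ホシ,ホシ\n'
    'EOS'
)

for m in parse_mecab_output(sample):
    # With IPADIC, features[0] is the part of speech and features[6] the base form.
    print('{}\t{}\t{}'.format(m.surface, m.features[0], m.features[6]))
```

In real use you would pass the return value of `nm.parse(...)` in place of `sample`.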
- Parse the given text and use a generator to iterate over the nodes:

      # use a Python with-statement to ensure mecab_destroy is invoked
      # only output normal nodes, ignoring any end-of-sentence and unknown nodes
      with MeCab() as nm:
          for n in nm.parse('飛べねえ鳥もいるってこった。', as_nodes=True):
              if n.is_nor():
                  print('{}\t{}'.format(n.surface, n.cost))

      飛べ 8101
      ねえ 6416
      鳥 12029
      も 12540
      いる 16477
      って 20631
      こっ 30320
      た 29380
      。 25946
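The comment about the with-statement is worth a brief illustration: a with-statement calls the object's `__exit__` method when the block ends, even if the block raises, which is what makes it a reliable place to release a native resource. The sketch below uses `FakeTagger`, a hypothetical stand-in written for this page; it is not natto-py's implementation and does not touch MeCab.

```python
class FakeTagger(object):
    """Hypothetical stand-in for an object that owns a native resource."""

    def __init__(self):
        self.destroyed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Called on normal exit and also when the block raises.
        self.destroy()
        return False  # do not swallow the exception

    def destroy(self):
        # A real wrapper would release the underlying library handle here.
        self.destroyed = True


tagger = FakeTagger()
try:
    with tagger:
        raise ValueError('simulated failure while parsing')
except ValueError:
    pass

print(tagger.destroyed)  # True: cleanup ran despite the exception
```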
- Combine node-parsing with a custom node-format for more interesting processing:

      # use a Python with-statement to ensure mecab_destroy is invoked
      # only output normal nodes, ignoring any end-of-sentence and unknown nodes
      #
      # -F      ... short-form of --node-format
      # %m      ... morpheme surface
      # %h      ... part-of-speech ID
      # %f[0,1] ... part-of-speech & part-of-speech sub-class 1, tab-delimited
      #
      # NOTE: the \ char itself needs to be escaped
      #       use \\t instead of \t
      #       use \\n instead of \n
      with MeCab('-F%m\\t%h\\t%f[0,1]') as nm:
          for n in nm.parse('あんたはオイラに飛び方を教えてくれた。', as_nodes=True):
              if n.is_nor():
                  print(n.feature)

      あんた 59 名詞 代名詞
      は 16 助詞 係助詞
      オイラ 59 名詞 代名詞
      に 13 助詞 格助詞
      飛び 31 動詞 自立
      方 57 名詞 接尾
      を 13 助詞 格助詞
      教え 31 動詞 自立
      て 18 助詞 接続助詞
      くれ 33 動詞 非自立
      た 25 助動詞
      。 7 記号 句点
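The escaping note above is about Python string literals, so it can be checked without MeCab at all. The sketch below compares three spellings of the same option string; the variable names are only illustrative. A raw string is one alternative way to keep the backslash intact, offered here as a Python-level observation rather than as something the guide itself prescribes.

```python
escaped = '-F%m\\t%h\\t%f[0,1]'   # the form used in the example above
raw = r'-F%m\t%h\t%f[0,1]'        # raw string: backslashes are kept as-is
unescaped = '-F%m\t%h\t%f[0,1]'   # Python converts \t to a tab character here

print(escaped == raw)       # True: both contain a backslash followed by 't'
print('\t' in escaped)      # False: no tab character in the string
print('\t' in unescaped)    # True: the backslash sequence is already gone
```

With the escaped or raw form, the two-character sequence `\t` is what reaches the format option, which is what the NOTE asks for.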