わかち書き (Wakati-gaki, Word-Segmented) Parsing
Brooke M. Fujita edited this page Apr 17, 2015 · 2 revisions
Tokenize a sentence using the `wakati` flag for `output-format-type`, breaking the sentence down into its morphemes.
- Create a new `natto.mecab.MeCab` instance using the `-O` (`output-format-type`) option:

      from natto import MeCab

      nm = MeCab('-Owakati')

      # displays details about the MeCab instance
      print(nm)
      <natto.mecab.MeCab model=<cdata 'mecab_model_t *' 0x801c16300>,
                         tagger=<cdata 'mecab_t *' 0x801c17470>,
                         lattice=<cdata 'mecab_lattice_t *' 0x801c196c0>,
                         libpath="/usr/local/lib/libmecab.so",
                         options={'output_format_type': 'wakati'},
                         dicts=[<natto.dictionary.DictionaryInfo dictionary='mecab_dictionary_info_t *' 0x801c19540>, filepath="/usr/local/lib/mecab/dic/ipadic/sys.dic", charset=utf8, type=0],
                         version=0.996>
- Use the default parsing, where the MeCab output is a single large string:

      print(nm.parse('卓球に人生かけるなんて、気味悪いです。'))
      ...
      卓球 に 人生 かける なんて 、 気味悪い です 。
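Because the wakati output is one space-delimited string, a plain `str.split()` is enough to turn it into a list of morphemes. The sketch below does not call MeCab; it assumes the string printed above has been captured, and the names `wakati_output` and `tokens` are illustrative only.

```python
# Assumption: this literal stands in for the string returned by
# nm.parse('卓球に人生かけるなんて、気味悪いです。') with -Owakati.
wakati_output = '卓球 に 人生 かける なんて 、 気味悪い です 。'

# str.split() with no arguments splits on runs of whitespace and
# ignores any leading or trailing whitespace.
tokens = wakati_output.split()

print(tokens)
```

Splitting this way keeps punctuation such as `、` and `。` as separate tokens, since MeCab emits them as morphemes of their own.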
- You can get similar results without `-Owakati` by using node parsing:

      # collect only normal nodes, ignoring any end-of-sentence and unknown nodes
      # MeCabNode.surface is the morpheme itself
      [n.surface for n in nm.parse('卓球に人生かけるなんて、気味悪いです。', as_nodes=True) if n.is_nor()]
      ...
      ['卓球', 'に', '人生', 'かける', 'なんて', '、', '気味悪い', 'です', '。']
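To see how the node-based result relates to the wakati string, the surfaces can be joined with single spaces. This is a minimal sketch that does not call MeCab: the list is copied from the node-parsing output above, and `surfaces` and `joined` are hypothetical names used only for illustration.

```python
# Assumption: this list is the result of the node-parsing
# list comprehension shown above.
surfaces = ['卓球', 'に', '人生', 'かける', 'なんて', '、', '気味悪い', 'です', '。']

# Joining the morphemes with one space gives the space-delimited form.
joined = ' '.join(surfaces)

print(joined)
```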