Wakati-gaki (わかち書き) Parsing

Tokenize a sentence into its morphemes by passing wakati as the output-format-type. With this format, MeCab returns the morphemes separated by spaces.

Create a new natto.mecab.MeCab instance with the -O (output-format-type) option set to wakati:

 from natto import MeCab

 nm = MeCab('-Owakati')
 print(nm)

 # displays details about the MeCab instance
 <natto.mecab.MeCab
  model=<cdata 'mecab_model_t *' 0x801c16300>,
  tagger=<cdata 'mecab_t *' 0x801c17470>,
  lattice=<cdata 'mecab_lattice_t *' 0x801c196c0>,
  libpath="/usr/local/lib/libmecab.so",
  options={'output_format_type': 'wakati'},
  dicts=[<natto.dictionary.DictionaryInfo
          dictionary='mecab_dictionary_info_t *' 0x801c19540>,
          filepath="/usr/local/lib/mecab/dic/ipadic/sys.dic",
          charset=utf8,
          type=0],
  version=0.996>

Using the default parsing (where the MeCab output is a single large string):

 print(nm.parse('卓球に人生かけるなんて、気味悪いです。'))
 ...     
 卓球 に 人生 かける なんて 、 気味悪い です 。
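
Because the wakati output is a single space-delimited string, it can be turned into a list with str.split(). The following is a minimal sketch that uses the output shown above as a literal, so it runs without MeCab installed; the variable names are illustrative only.

```python
# Wakati output copied from the example above (a plain str).
wakati = '卓球 に 人生 かける なんて 、 気味悪い です 。'

# split() with no arguments splits on runs of whitespace and ignores
# any leading or trailing whitespace.
tokens = wakati.split()

print(tokens)
print(len(tokens))
```

In real use, the literal would be replaced by the return value of nm.parse(...).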

You can get similar results without -Owakati by using node parsing, which yields one MeCabNode per morpheme:

 # collect only normal nodes, skipping end-of-sentence and unknown nodes
 # MeCabNode.surface is the morpheme itself
 [n.surface for n in nm.parse('卓球に人生かけるなんて、気味悪いです。', as_nodes=True) if n.is_nor()]
 ...     
 ['卓球', 'に', '人生', 'かける', 'なんて', '、', '気味悪い', 'です', '。']
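
The two results are similar in the sense that joining the node surfaces with single spaces gives back the wakati string. A minimal sketch, using the list above as a literal so that it runs without MeCab installed (the variable names are illustrative, not part of natto):

```python
# Surfaces copied from the node-parsing example above.
surfaces = ['卓球', 'に', '人生', 'かける', 'なんて', '、', '気味悪い', 'です', '。']

# Join the morphemes with a single space, mimicking the wakati layout.
joined = ' '.join(surfaces)

print(joined)
```

Node parsing is the more flexible of the two approaches when anything beyond the surface is needed, since each node also carries its own attributes rather than only the text.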
