Wakati-gaki (わかち書き) Parsing

Brooke M. Fujita edited this page Apr 17, 2015 · 2 revisions

Tokenize a sentence by passing wakati as the output-format-type. Wakati-gaki (わかち書き) means writing with spaces between words, so MeCab returns the sentence broken down into its morphemes, separated by spaces.


  1. Create a new natto.mecab.MeCab instance using the -O output-format-type option:

     from natto import MeCab
    
     nm = MeCab('-Owakati')
     print(nm)
    
     # displays details about the MeCab instance
     <natto.mecab.MeCab
      model=<cdata 'mecab_model_t *' 0x801c16300>,
      tagger=<cdata 'mecab_t *' 0x801c17470>,
      lattice=<cdata 'mecab_lattice_t *' 0x801c196c0>,
      libpath="/usr/local/lib/libmecab.so",
      options={'output_format_type': 'wakati'},
      dicts=[<natto.dictionary.DictionaryInfo
              dictionary='mecab_dictionary_info_t *' 0x801c19540>,
              filepath="/usr/local/lib/mecab/dic/ipadic/sys.dic",
              charset=utf8,
              type=0],
     version=0.996>
    
  2. Use the default parsing, where the MeCab output is returned as a single string:

     print(nm.parse('卓球に人生かけるなんて、気味悪いです。'))
     ...     
     卓球 に 人生 かける なんて 、 気味悪い です 。 
    
  3. You can get the same morphemes as a list, without relying on -Owakati, by using node parsing:

     # keep only normal nodes, skipping end-of-sentence and unknown nodes
     # MeCabNode.surface is the morpheme itself
     [n.surface for n in nm.parse('卓球に人生かけるなんて、気味悪いです。', as_nodes=True) if n.is_nor()]
     ...     
     ['卓球', 'に', '人生', 'かける', 'なんて', '、', '気味悪い', 'です', '。']
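
If you already have the wakati string from step 2 and only need a list, plain Python string splitting may be enough. The following is a minimal sketch that does not call natto at all; the helper name wakati_to_tokens is hypothetical, and the input is the output string copied from step 2.

```python
def wakati_to_tokens(wakati_output):
    """Split a wakati-formatted string into a list of morphemes.

    Hypothetical helper for illustration; it is not part of natto-py.
    """
    # str.split() with no arguments splits on runs of whitespace and
    # ignores leading/trailing whitespace, so the trailing space in the
    # wakati output does not produce an empty token.
    return wakati_output.split()


# Output string copied from step 2
wakati = '卓球 に 人生 かける なんて 、 気味悪い です 。 '
tokens = wakati_to_tokens(wakati)
print(tokens)
```

Note that this approach drops any token that is itself whitespace, so node parsing as in step 3 is the safer choice when you need every node exactly as MeCab produced it.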
    
