Brooke M. Fujita edited this page Jan 4, 2018 · 3 revisions

Appendix F: Output Formatting

MeCab's Default Output Format

MeCab's default ChaSen output format takes the following form:

surface \t feature 
  • surface is the morpheme itself
  • feature is a comma-delimited string of the following elements:
    • part-of-speech
    • sub-class 1
    • sub-class 2
    • sub-class 3
    • inflection
    • conjugation
    • root-form
    • reading
    • pronunciation
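
The layout above can be unpacked with plain string operations. The following is a minimal sketch, not part of natto-py: the function name `parse_default_line`, the field names in `FEATURE_FIELDS`, and the sample line are illustrative assumptions. It splits naively on commas, so it would not handle a feature element that itself contains a quoted comma.

```python
# Field names follow the element list above; the names are illustrative.
FEATURE_FIELDS = (
    "part_of_speech", "sub_class_1", "sub_class_2", "sub_class_3",
    "inflection", "conjugation", "root_form", "reading", "pronunciation",
)


def parse_default_line(line):
    """Split one 'surface<TAB>feature' line into a dict of named fields."""
    surface, feature = line.rstrip("\n").split("\t", 1)
    values = feature.split(",")
    # Some entries may carry fewer than nine elements, so pad with None.
    values += [None] * (len(FEATURE_FIELDS) - len(values))
    record = dict(zip(FEATURE_FIELDS, values))
    record["surface"] = surface
    return record


# Illustrative input line in the default format (assumed, not captured output).
sample = "食べ\t動詞,自立,*,*,一段,連用形,食べる,タベ,タベ"
parsed = parse_default_line(sample)
print(parsed["surface"], parsed["part_of_speech"], parsed["root_form"])
# → 食べ 動詞 食べる
```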

Please refer to MeCab: Yet Another Part-of-Speech and Morphological Analyzer, とりあえず解析してみる ("Trying out an analysis").

Output Format Macros

You can override the default output by customizing the format in which each MeCabNode is written, using the following macros.

Macro Definition
%s node type (stat value): 0 normal, 1 unknown, 2 sentence start, 3 sentence end
%S the input sentence
%L length of input sentence (bytes)
%m morpheme surface
%M morpheme surface including leading whitespace (cf. %pS)
%h part-of-speech ID
%% % char (escaped)
%c word cost
%H comma-delimited list of POS, conjugation, reading, etc.
%t character type id
%P marginal probability (only with -l2 option)
%pi unique node ID
%pS leading whitespace of the morpheme, if any; %pS%m is the same as %M
%ps start position
%pe end position
%pC connection cost from previous node to this one
%pw same as %c
%pc connection cost + word cost (accumulated from sentence start)
%pn connection cost + word cost (this morpheme only, %pw + %pC)
%pb * if the node is on the best path; a space otherwise
%pP marginal probability (only with -l2 option)
%pA alpha, forward log probability (only with -l2 option)
%pB beta, backward log probability (only with -l2 option)
%pl length of morpheme (bytes), same as strlen(%m)
%pL length of morpheme including any whitespace (bytes), same as strlen(%M)
%phl left context id
%phr right context id
%f[N] Nth element of MeCab's default output feature
%f[N1,N2,N3...] N1,N2,N3... elements of MeCab's default output feature, tab-separated
%FC[N1,N2,N3...] N1,N2,N3... elements of MeCab's default output feature, delimited with char C; any whitespace elements are not output
\0 \a \b \t \n \v \f \r \\ the usual escape sequences
\s ' ' (half-width space)
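
To make the expansion rules concrete, here is a toy renderer for a small subset of the macros: `%m`, `%f[...]`, `%FC[...]`, `%%`, and the `\t`, `\n`, `\s`, `\\` escapes. It is only a sketch for illustration and is not MeCab's implementation. The function `render` and the sample feature string are assumptions, and the rule that `%FC[...]` omits certain elements is deliberately not modeled.

```python
import re

# Escape sequences handled by this sketch (a subset of the table above).
ESCAPES = {"t": "\t", "n": "\n", "s": " ", "\\": "\\"}

TOKEN = re.compile(
    r"%F(?P<sep>.)\[(?P<fidx>\d+(?:,\d+)*)\]"  # %FC[N1,N2,...] with separator C
    r"|%f\[(?P<idx>\d+(?:,\d+)*)\]"            # %f[N] or %f[N1,N2,...]
    r"|%(?P<macro>[m%])"                       # %m and %%
    r"|\\(?P<esc>[tns\\])"                     # \t \n \s \\
)


def render(fmt, surface, feature):
    """Expand the supported macros in fmt for one morpheme (toy version)."""
    elements = feature.split(",")

    def expand(match):
        if match.group("fidx") is not None:
            picked = [elements[int(i)] for i in match.group("fidx").split(",")]
            return match.group("sep").join(picked)
        if match.group("idx") is not None:
            picked = [elements[int(i)] for i in match.group("idx").split(",")]
            return "\t".join(picked)  # tab-separated, as stated in the table
        if match.group("macro") is not None:
            return surface if match.group("macro") == "m" else "%"
        return ESCAPES[match.group("esc")]

    return TOKEN.sub(expand, fmt)


# Illustrative feature string (assumed, not captured output).
feature = "動詞,自立,*,*,一段,連用形,食べる,タベ,タベ"
print(repr(render(r"%m\t%f[0]\n", "食べ", feature)))
print(repr(render(r"%m:%F-[0,1]", "食べ", feature)))
```

Raw strings are used for the format arguments so that `\t` and `\n` reach the renderer as two-character sequences, which mirrors how such a format string would be handed to MeCab for interpretation.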

Using the --*-format Options

You can define custom output formats with the above macros by passing the --node-format, --unk-format, --bos-format, --eos-format or --eon-format options when instantiating natto.MeCab.

Example 1: Specifying user-defined formats

# pseudo-code: replace each STR with a format string built from the macros above
from natto import MeCab

# long-format style
nm = MeCab('--node-format=STR --bos-format=STR --eos-format=STR --unk-format=STR')

# short-format style
nm = MeCab('-F STR -B STR -E STR -U STR')

# Python dictionary
nm = MeCab(options={'node_format': 'STR', 'bos_format': 'STR', 'eos_format': 'STR', 'unk_format': 'STR'})
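
As a concrete illustration of what STR might look like, the sketch below defines a few format strings once and derives the long-form option string from the same dictionary. The helper `to_long_options` and the chosen formats (including the `UNKNOWN` marker) are hypothetical and only meant to show the shape of the arguments; no MeCab installation is needed to run it.

```python
# Raw strings keep \t and \n as two-character sequences, so that MeCab
# (not Python) interprets them as output format escapes.
formats = {
    "node_format": r"%m\t%f[0]\n",   # surface, tab, part-of-speech
    "unk_format": r"%m\tUNKNOWN\n",  # hypothetical marker for unknown words
    "eos_format": r"EOS\n",          # literal end-of-sentence line
}


def to_long_options(formats):
    """Turn {'node_format': ...} into '--node-format=... ...' (illustrative)."""
    return " ".join(
        "--{}={}".format(key.replace("_", "-"), value)
        for key, value in formats.items()
    )


long_options = to_long_options(formats)
print(long_options)
```

Following the pattern of Example 1, either `long_options` (as the option string) or `formats` (as the `options` dictionary) would then be passed to `MeCab`. Because options in the string form are separated by spaces, a literal space inside a format is better written as `\s`.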

Using Configuration File mecabrc

It is also possible to define your custom output formats in the $MECAB_HOME/etc/mecabrc configuration file.

Example 2: Adding user-defined output formats to $MECAB_HOME/etc/mecabrc

# pseudo-code, you would have to specify the output format macros as STR
node-format-KEY = STR
unk-format-KEY = STR
eos-format-KEY = STR
bos-format-KEY = STR
eon-format-KEY = STR

Use the --output-format-type option to specify the user-defined output format KEY.
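
To show what such entries could look like with real values, the following sketch generates the configuration lines for a hypothetical KEY named `compact`. The helper `mecabrc_entries`, the key name, and the format strings are all assumptions made for illustration; the printed lines are what one might append to the configuration file by hand.

```python
def mecabrc_entries(key, formats):
    """Build '<kind>-format-<key> = <format>' lines (illustrative helper)."""
    return [
        "{}-format-{} = {}".format(kind, key, fmt)
        for kind, fmt in formats.items()
    ]


# Hypothetical output format named "compact".
entries = mecabrc_entries(
    "compact",
    {
        "node": r"%m\t%f[0]\n",  # surface, tab, part-of-speech
        "unk": r"%m\t?\n",       # unknown words get a placeholder
        "eos": r"EOS\n",         # literal end-of-sentence line
    },
)
print("\n".join(entries))
```

With entries like these in place, `compact` would be the KEY given to --output-format-type.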

Example 3: Specifying user-defined output format KEY

# long-format style
nm = MeCab('--output-format-type=KEY')

# short-format style
nm = MeCab('-O KEY')

# Python dictionary
nm = MeCab(options={'output_format_type': 'KEY'})

Further details may be found at MeCab: Yet Another Part-of-Speech and Morphological Analyzer, 出力フォーマットの指定 ("Specifying the output format").

