Node-formatting ignored when using Unidic unless -O is set to empty string #99

Closed
buruzaemon opened this issue Nov 28, 2017 · 3 comments

@buruzaemon
Owner

As reported by @massongit in pull request #98, node-formatting appears to be ignored by mecab when using Unidic. Please refer to taku910/mecab#41.

A workaround is to have natto-py accept an empty string as the value of the output-format-type option (-O).

Steps to reproduce:

  1. Install Unidic 2.1.2
  2. Execute code snippet A below and observe that natto-py does not respect the specified node-formatting, but instead uses the default node-format for Unidic
  3. Contrast code snippet A (natto-py) with B and C (mecab from the command line)
# Snippet A
# Note that node-formatting is ignored and defaults to node-format-unidic
>>> from natto import MeCab
>>> with MeCab(r'-d /opt/mecab/lib/mecab/dic/unidic -F%m\t%t,%f[12]\n') as nm:
...     for n in nm.parse('日本語だよ、これが。', as_nodes=True):
...         print(n.feature)
...
日本    ニッポン        ニッポン        日本    名詞-固有名詞-地名-国
語      ゴ      ゴ      語      名詞-普通名詞-一般
だ      ダ      ダ      だ      助動詞  助動詞-ダ       終止形-一般
よ      ヨ      ヨ      よ      助詞-終助詞
、                      、      補助記号-読点
これ    コレ    コレ    此れ    代名詞
が      ガ      ガ      が      助詞-格助詞
。                      。      補助記号-句点
EOS
# Snippet B
# Note that node-formatting is ignored and defaults to node-format-unidic
$ echo '日本語だよ、これが。' | mecab -d /opt/mecab/lib/mecab/dic/unidic/ -F%m\\t%t,%f[12]\\n
日本    ニッポン        ニッポン        日本    名詞-固有名詞-地名-国
語      ゴ      ゴ      語      名詞-普通名詞-一般
だ      ダ      ダ      だ      助動詞  助動詞-ダ       終止形-一般
よ      ヨ      ヨ      よ      助詞-終助詞
、                      、      補助記号-読点
これ    コレ    コレ    此れ    代名詞
が      ガ      ガ      が      助詞-格助詞
。                      。      補助記号-句点
EOS
# Snippet C
# node-formatting is honored when -O is passed an empty string!
$ echo '日本語だよ、これが。' | mecab -d /opt/mecab/lib/mecab/dic/unidic/ -F%m\\t%t,%f[12]\\n -O ""
日本    2,固
語      2,漢
だ      6,和
よ      6,和
、      3,記号
これ    6,和
が      6,和
。      3,記号
EOS
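
For anyone who wants to script the comparison between snippets B and C, here is a minimal sketch that builds the two command lines from Python. The helper name `build_mecab_args` is hypothetical; the dictionary path and node format are the ones from the snippets above, and the actual run is skipped when `mecab` is not on the PATH.

```python
import shutil
import subprocess

# Dictionary path and node format taken from snippets B and C above.
UNIDIC_DIR = "/opt/mecab/lib/mecab/dic/unidic/"
NODE_FORMAT = r"%m\t%t,%f[12]\n"


def build_mecab_args(dicdir, node_format, clear_output_format=False):
    """Hypothetical helper: build the argv list for the mecab CLI.

    With clear_output_format=True an empty string is passed to -O,
    which is the workaround shown in snippet C.
    """
    args = ["mecab", "-d", dicdir, "-F" + node_format]
    if clear_output_format:
        args += ["-O", ""]
    return args


if __name__ == "__main__":
    text = "日本語だよ、これが。\n"
    for clear in (False, True):
        args = build_mecab_args(UNIDIC_DIR, NODE_FORMAT, clear)
        print(args)
        if shutil.which("mecab"):  # only run when mecab is installed
            result = subprocess.run(
                args, input=text, capture_output=True, encoding="utf-8"
            )
            print(result.stdout)
```

Passing the arguments as a list avoids the shell-escaping of `\\t` and `\\n` needed in snippets B and C, since no shell is involved.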
@buruzaemon
Owner Author

The output-format-type option is used in a dictionary's dicrc to specify the default output format type for node-formatting. For example, consider the following sample dicrc for Unidic:

output-format-type = unidic2

node-format-unidic = %m\t%f[9]\t%f[6]\t%f[7]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
unk-format-unidic  = %m\t%m\t%m\t%m\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
bos-format-unidic  =
eos-format-unidic  = EOS\n

node-format-chamame = \t%m\t%f[9]\t%f[6]\t%f[7]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
;unk-format-chamame = \t%m\t\t\t%m\tUNK\t\t\n
unk-format-chamame  = \t%m\t\t\t%m\t%F-[0,1,2,3]\t\t\n
bos-format-chamame  = B
eos-format-chamame  = 

node-format-unidic2 = %m\t%f[9]\t%f[6]\t%f[7]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\t%f[12]\n
unk-format-unidic2  = %m\t%m\t%m\t%m\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
bos-format-unidic2  =
eos-format-unidic2  = EOS\n

Here, the default format used when no other is specified is *-format-unidic2, as selected by output-format-type.
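
To make the role of output-format-type concrete, here is a rough sketch of how one could read the default node format out of a dicrc like the sample. This is not MeCab's own parser; `parse_dicrc` and `default_node_format` are hypothetical helpers, and the sketch assumes the simple `key = value` layout with `;` starting a comment line, as in the sample. The `SAMPLE` text is an excerpt of that dicrc.

```python
def parse_dicrc(text):
    """Hypothetical helper: parse simple 'key = value' dicrc lines into a dict."""
    settings = {}
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines and ';' comment lines (as in the sample dicrc).
        if not stripped or stripped.startswith(";"):
            continue
        # Split on the first '=' only; everything after it is the value.
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def default_node_format(settings):
    """Return (key, format) selected by output-format-type, or None if unset."""
    fmt_type = settings.get("output-format-type", "")
    if not fmt_type:
        return None
    key = "node-format-" + fmt_type
    return key, settings.get(key)


# Excerpt of the sample dicrc shown above.
SAMPLE = r"""
output-format-type = unidic2
node-format-unidic = %m\t%f[9]\t%f[6]\t%f[7]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
;unk-format-chamame = \t%m\t\t\t%m\tUNK\t\t\n
node-format-unidic2 = %m\t%f[9]\t%f[6]\t%f[7]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\t%f[12]\n
bos-format-unidic2  =
"""

print(default_node_format(parse_dicrc(SAMPLE)))
```

For the sample, the selected key is `node-format-unidic2`, which is why the extra `%f[12]` column shows up in the default output.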

MeCab gives output-format-type precedence over node-format and the other *-format options, unless output-format-type is explicitly set to an empty string. This behavior is consistent across the ipadic, jumandic and unidic dictionaries.
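
The precedence rule can be modelled in a few lines. This is only a sketch of the behavior described in this comment, not MeCab's actual implementation. `effective_node_format` is a hypothetical name, `dicrc` is a plain dict of dicrc settings, and `None` stands for "option not given on the command line".

```python
def effective_node_format(dicrc, node_format=None, output_format_type=None):
    """Model of the behavior described in this issue (not MeCab's actual code).

    dicrc:              dict of dicrc settings
    node_format:        value passed via -F, or None if not given
    output_format_type: value passed via -O, or None if not given
    """
    # An explicit -O (even an empty one) overrides the dicrc default.
    if output_format_type is None:
        output_format_type = dicrc.get("output-format-type", "")

    if output_format_type:
        # A non-empty output format type wins and -F is ignored.
        # Built-in types that are not defined in dicrc are outside this sketch.
        return dicrc.get("node-format-" + output_format_type)

    # Output format type is empty: -F is honored if it was given.
    # None here means MeCab would fall back to its built-in default.
    return node_format


# The dicrc format is shortened for illustration.
dicrc = {
    "output-format-type": "unidic2",
    "node-format-unidic2": r"%m\t%f[9]\n",
}
user_format = r"%m\t%t,%f[12]\n"

# Without -O the dicrc format is used; with -O "" the -F format is used.
print(effective_node_format(dicrc, node_format=user_format) == user_format)
print(effective_node_format(dicrc, node_format=user_format,
                            output_format_type="") == user_format)
```

The first call corresponds to snippets A and B, where the -F format is ignored, and the second to snippet C, where the empty -O lets it through.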

@massongit
Contributor

massongit commented Dec 1, 2017

MeCab's PR (taku910/mecab#38) may solve this problem.

@buruzaemon added this to the 0.9.0 release milestone Jan 4, 2018
@buruzaemon
Owner Author

I am closing this issue. I have updated the description of MeCab's output-format-type option in the project wiki to explain how to override an existing default output format by specifying an empty string.
