
Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [LUCENE-4056] #5128

Open
asfimport opened this issue May 14, 2012 · 10 comments

asfimport commented May 14, 2012

I tried to build a UniDic dictionary for use with Kuromoji on Solr 3.6. I think UniDic is a better dictionary than the IPA dictionary, so Kuromoji for Lucene/Solr should support the UniDic dictionary just as standalone Kuromoji does.

The following is my procedure:
I modified build.xml under the lucene/contrib/analyzers/kuromoji directory and ran 'ant build-dict', which failed with the error below.

build-dict:
[java] dictionary builder
[java]
[java] dictionary format: UNIDIC
[java] input directory: /home/kazu/Work/src/solr/brunch_3_6/lucene/build/contrib/analyzers/kuromoji/unidic-mecab1312src
[java] output directory: /home/kazu/Work/src/solr/brunch_3_6/lucene/contrib/analyzers/kuromoji/src/resources
[java] input encoding: utf-8
[java] normalize entries: false
[java]
[java] building tokeninfo dict...
[java] parse...
[java] sort...
[java] Exception in thread "main" java.lang.AssertionError
[java] encode...
[java] at org.apache.lucene.analysis.ja.util.BinaryDictionaryWriter.put(BinaryDictionaryWriter.java:113)
[java] at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.buildDictionary(TokenInfoDictionaryBuilder.java:141)
[java] at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.build(TokenInfoDictionaryBuilder.java:76)
[java] at org.apache.lucene.analysis.ja.util.DictionaryBuilder.build(DictionaryBuilder.java:37)
[java] at org.apache.lucene.analysis.ja.util.DictionaryBuilder.main(DictionaryBuilder.java:82)

And the diff of build.xml:

===================================================================
--- build.xml (revision 1338023)
+++ build.xml (working copy)
@@ -28,19 +28,31 @@
   <property name="maven.dist.dir" location="../../../dist/maven" />
 
   <!-- default configuration: uses mecab-ipadic -->
+  <!-- alternative configuration: uses UniDic -->
+  <property name="ipadic.version" value="unidic-mecab1312src" />
+  <property name="dict.src.file" value="unidic-mecab1312src.tar.gz" />
+  <property name="dict.loc.dir" value="/home/kazu/Work/src/nlp/unidic/_archive"/>
+  <property name="dict.src.dir" value="${build.dir}/${ipadic.version}" />
+  <!--
   <property name="dict.encoding" value="euc-jp"/>
   <property name="dict.format" value="ipadic"/>
+  -->
+  <property name="dict.encoding" value="utf-8"/>
+  <property name="dict.format" value="unidic"/>
+  <property name="dict.normalize" value="false"/>
   <property name="dict.target.dir" location="./src/resources"/>

@@ -58,7 +70,8 @@
   <target name="compile-core" depends="jar-analyzers-common, common.compile-core" />
   <target name="download-dict" unless="dict.available">
-    <get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/>
+    <!-- get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/ -->
+    <copy file="${dict.loc.dir}/${dict.src.file}" tofile="${build.dir}/${dict.src.file}"/>
     <gunzip src="${build.dir}/${dict.src.file}"/>
     <untar src="${build.dir}/${ipadic.version}.tar" dest="${build.dir}"/>
   </target>

Migrated from LUCENE-4056 by Kazuaki Hiraga (@hkazuakey), updated Oct 16 2019
Environment:

Solr 3.6
UniDic 1.3.12 for MeCab (unidic-mecab1312src.tar.gz)

Attachments: LUCENE-4056.patch


Christian Moen (@cmoen) (migrated from JIRA)

Hello Kazu,

Only the IPA dictionary is currently supported.

Adding support for UniDic shouldn't be a big technical issue, but if we add support for other dictionaries, I think we should perhaps introduce some notion of a dictionary concept in Kuromoji, since the weights for the search-mode heuristic need to be tuned on a per-dictionary basis.
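
For illustration only, a minimal sketch of what such a per-dictionary notion could look like is below. The enum, its fields, and the penalty numbers are assumptions made up for this sketch, not existing Kuromoji code; real UniDic weights would have to be tuned separately.

// Hypothetical sketch -- not an existing Kuromoji API.
public enum DictionaryProfile {
    IPADIC(2, 3000, 7, 1700),   // placeholder numbers, not the actual tuned IPA weights
    UNIDIC(2, 3000, 7, 1700);   // UniDic would need its own tuned weights

    private final int kanjiLengthThreshold;
    private final int kanjiPenalty;
    private final int otherLengthThreshold;
    private final int otherPenalty;

    DictionaryProfile(int kanjiLengthThreshold, int kanjiPenalty,
                      int otherLengthThreshold, int otherPenalty) {
        this.kanjiLengthThreshold = kanjiLengthThreshold;
        this.kanjiPenalty = kanjiPenalty;
        this.otherLengthThreshold = otherLengthThreshold;
        this.otherPenalty = otherPenalty;
    }

    /** Extra cost added in search mode to discourage overly long candidate tokens. */
    public int searchModePenalty(boolean allKanji, int length) {
        int threshold = allKanji ? kanjiLengthThreshold : otherLengthThreshold;
        int penalty = allKanji ? kanjiPenalty : otherPenalty;
        return length > threshold ? (length - threshold) * penalty : 0;
    }
}

The idea is simply that the tokenizer would consult the profile of the loaded dictionary rather than hard-coded IPA constants.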

Note that the stop tag set used by JapanesePartOfSpeechStopFilter needs to be aligned with the relevant UniDic POS tag set as well, and we might also want to update the stop words list based on UniDic segmentation in order to support it properly end-to-end. (I'd expect it to be the same or nearly the same as with IPA, though.)

Hence, adding UniDic support end-to-end creates a cascade of changes.

Even though UniDic is indeed a good dictionary, its license is quite restrictive: it doesn't allow redistribution, requires a license for commercial use, only permits personal use, etc. The following is from license.txt (translated from the Japanese):

UniDic ver.1.3.12 Terms of Use

1. The copyright of UniDic ver.1.3.12 is held by 伝康晴, 山田篤, 小椋秀樹, 小磯花絵, and 小木曽智信.
2. Copying or modifying UniDic ver.1.3.12 is permitted only for personal use.
3. UniDic ver.1.3.12, and any modified versions of it, must not be redistributed.
4. When publishing the results of research or other work that used UniDic ver.1.3.12, the use of UniDic ver.1.3.12 must be clearly acknowledged.
5. Commercial use of UniDic ver.1.3.12 requires prior consultation with the copyright holders.
6. The copyright holders accept no liability for any damages, direct or indirect, arising from the use of UniDic ver.1.3.12.
7. Matters not covered by this document shall be discussed with the copyright holders.

As a result, we won't be able to redistribute it and support it out-of-the-box with Lucene/Solr, and using it will most likely require a custom dictionary build like the one you have attempted.

Would fixing the dictionary builder for UniDic be a useful starting point in your case?

Do you have any information you can share regarding what sort of improvements you expect to see with UniDic? Does it relate to compound segmentation in general or katakana compounds? If you can also share some information on how you think we should support UniDic and how big the demand for such support is, that would be very useful. Thanks.


Kazuaki Hiraga (@hkazuakey) (migrated from JIRA)

Hi Christian,

Thank you for your comment.

I understand the situation. I didn't expect UniDic to be bundled and shipped with Kuromoji. For the time being, I just want to build and use it with Kuromoji for Lucene/Solr.

We have just started evaluating UniDic and it's at a very early stage, so we haven't concluded that we have to or need to use UniDic instead of the IPA dictionary. Although we haven't finished our evaluation, I like UniDic's concept and policy of strictly defining how tokens are specified, and I am satisfied with the tokenization results. I think it's better than the IPA dictionary regarding katakana segmentation and compound segmentation.

On the other hand, I understand there's a license issue that we have to resolve if we decide to use it in our internal services. Thanks for reminding me.

Thanks.


Robert Muir (@rmuir) (migrated from JIRA)

Would fixing the dictionary builder for UniDic be a useful starting point in your case?

That assert from the stack trace would probably be pretty tricky. It's an optimization that works for
ipadic and naist-jdic, and I knew I was making an assumption doing it, but it saves a lot because
it exploits a redundancy in the model (see LUCENE-3699).

To fix it, this optimization would have to either be conditionalized or pulled into a subclass for
ipadic and naist-jdic, and unidic would have to be encoded with a different strategy.
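
As a rough illustration of the "conditionalized" option, and assuming (this is a guess, not confirmed by the stack trace) that the failing assertion is the one relying on each entry's left and right connection IDs being equal, a dictionary-aware writer could branch on that property. The class and methods below are invented for the sketch and are not the real BinaryDictionaryWriter.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch, not the actual BinaryDictionaryWriter. It assumes the
// failing assertion guards the ipadic/naist-jdic property that leftId == rightId
// for every entry, and shows how the encoding could branch on that instead.
class EntryEncoderSketch {
    private final boolean sharedContextIds; // true for ipadic/naist-jdic, false for unidic
    private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    private final DataOutputStream out = new DataOutputStream(bytes);

    EntryEncoderSketch(boolean sharedContextIds) {
        this.sharedContextIds = sharedContextIds;
    }

    void put(int leftId, int rightId, int wordCost) throws IOException {
        if (sharedContextIds) {
            // ipadic/naist-jdic: left and right context ids are always equal,
            // so a single id is enough (the existing space optimization).
            if (leftId != rightId) {
                throw new IllegalArgumentException("dictionary violates leftId == rightId");
            }
            out.writeShort(leftId);
        } else {
            // unidic: the redundancy does not hold, so write both ids and
            // accept a slightly larger, format-independent encoding.
            out.writeShort(leftId);
            out.writeShort(rightId);
        }
        out.writeShort(wordCost);
    }

    byte[] toBytes() throws IOException {
        out.flush();
        return bytes.toByteArray();
    }
}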

Still, unidic support seems pretty tricky to maintain, because if we want to share any code at all,
there is always the possibility it will break in the future (and, due to the license, it is not possible to
integrate it into automated tests).

Anyway, that's the background for that particular assert; it's my fault, but I don't have an easy fix!


Prashant Pol (migrated from JIRA)

Hello @rmuir @cmoen,

I found support for the UNIDIC dictionary at
https://github.com/apache/lucene-solr/blob/53981795fd73e85aae1892c3c72344af7c57083a/lucene/analysis/kuromoji/src/tools/java/org/apache/lucene/analysis/ja/util/DictionaryBuilder.java

It reads the dictionary format as UNIDIC but does not use it later when creating the UnknownDictionaryWriter.
As a result, the build fails with an ArrayIndexOutOfBoundsException,
because IPADIC's unk.def has 11 columns whereas UNIDIC's unk.def has 10.

Any update on this?
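
To make the column mismatch concrete, here is a small, self-contained sketch of a format-aware unk.def column check. The class is hypothetical and only encodes the column counts stated above; it is not the actual UnknownDictionaryBuilder code.

// Illustrative sketch only -- not the real UnknownDictionaryBuilder. It just
// shows why indexing a UniDic unk.def row with the IPADIC layout overruns:
// per the description above, IPADIC rows have 11 CSV columns, UniDic rows 10.
class UnkDefLineSketch {
    static String[] parseLine(String line, boolean unidic) {
        String[] columns = line.split(",");
        int expected = unidic ? 10 : 11;
        if (columns.length != expected) {
            throw new IllegalArgumentException(
                "expected " + expected + " columns but found " + columns.length + ": " + line);
        }
        return columns;
    }
}

Threading the already-parsed dictionary format down to the unknown-dictionary code path would let it pick the right column layout instead of assuming IPADIC's.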


Tomoko Uchida (@mocobeta) (migrated from JIRA)

Hi,

As far as licensing goes, UniDic is now distributed under the GPL, LGPL, and BSD 3-Clause licenses. To my knowledge, the last one is compatible with ALv2.

Please see: https://unidic.ninjal.ac.jp/download and https://unidic.ninjal.ac.jp/copying/BSD

Personally, I am looking into using UniDic from Kuromoji, because the dictionary is still maintained by researchers and is more suitable for search purposes than the current search mode based on mecab-ipadic.

If there is a possibility of moving this issue forward, I'd like to help.

 


Kazuaki Hiraga (@hkazuakey) (migrated from JIRA)

I agree with @Tomoko Uchida, and I believe that UniDic is more suitable for Japanese full-text information retrieval, since the dictionary is well maintained by researchers at a Japanese government-funded institute and it applies stricter rules than the IPA dictionary, which are intended to produce consistent tokenization results.


Tomoko Uchida (@mocobeta) (migrated from JIRA)

@hkazuakey: thanks, do you have a patch for this? I think we can work together. Even if it isn't merged to the Lucene master, having the patch available here would be valuable for users.

While the mecab-ipadic dictionary is becoming obsolete (this sometimes affects search quality, so search engineers in Japan often suffer from it), UniDic and its extensions are still actively maintained to keep up with changes in the language. Of course, it would be much better if we provided substantial evidence here.

 


Kazuaki Hiraga (@hkazuakey) (migrated from JIRA)

@Tomoko Uchida I am going to prepare a patch. So, let's work together to fix the issue.


Jun Ohtani (@johtani) (migrated from JIRA)

I succeeded in building "unidic-mecab-2.1.2_src.zip" with the attached patch file.

Unfortunately, my patch contains my local directory path, but we can start the discussion with it :)


Jun Ohtani (@johtani) (migrated from JIRA)

I made a pull request on the GitHub repo.

apache/lucene-solr#935
