LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary #935

Open
wants to merge 3 commits into base: master
Conversation

@johtani commented Oct 9, 2019

Description

UniDic has somewhat different columns and a different unk.def from ipadic.
Currently, if we use UniDic, we get errors when building the dictionary.
We can build UniDic after changing build.xml; I left some comments in build.xml.

If we use UniDic, we should also change stoptags.txt and add an example for the part-of-speech filter, since UniDic uses a different set of part-of-speech tags from ipadic.

Solution

Added logic for adjusting the UniDic CSV files and unk.def, and changed build.xml to download and build UniDic.
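The adjustment logic itself is not shown in this thread. As a rough, hedged illustration of the shape such a CSV adjustment could take, here is a minimal self-contained Java sketch; the class name, the expected column count of 13, and the use of "*" as a placeholder for empty trailing features are assumptions for illustration, not the PR's actual code:

```java
import java.util.Arrays;

// Hypothetical sketch: pad or truncate a dictionary CSV row to the column
// count the builder expects. UniDic and ipadic rows differ in both count
// and meaning of columns; this only illustrates the shape of the fix.
public class CsvAdjuster {
  // Assumed expected column count, not the real builder constant.
  static final int EXPECTED_COLUMNS = 13;

  static String[] normalize(String[] row) {
    if (row.length == EXPECTED_COLUMNS) {
      return row;
    }
    // Copy the known columns; fill any missing trailing fields with "*",
    // the conventional MeCab placeholder for an empty feature.
    String[] out = new String[EXPECTED_COLUMNS];
    Arrays.fill(out, "*");
    System.arraycopy(row, 0, out, 0, Math.min(row.length, EXPECTED_COLUMNS));
    return out;
  }
}
```

A real implementation would also remap column positions, since UniDic does not merely add or drop trailing fields.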

Tests

Unfortunately, there are no tests for DictionaryBuilder.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I am authorized to contribute this code to the ASF and have removed any code I do not have a license to distribute.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the master branch.
  • I have run ant precommit and the appropriate test suite.
  • I have added tests for my changes.
  • I have added documentation for the Ref Guide (for Solr changes only).

<artifact name="ipadic" type=".tar.gz" url="https://jaist.dl.sourceforge.net/project/mecab/mecab-ipadic/2.7.0-20070801/mecab-ipadic-2.7.0-20070801.tar.gz"/>
</dependency>
<dependency org="mecab" name="mecab-naist-jdic" rev="${/mecab/mecab-naist-jdic}" conf="naist">
<artifact name="mecab-naist-jdic" type=".tar.gz" url="https://rwthaachen.dl.osdn.jp/naist-jdic/53500/mecab-naist-jdic-0.6.3b-20111013.tar.gz"/>
</dependency>
<dependency org="mecab" name="mecab-unidic" rev="${/mecab/mecab-unidic}" conf="unidic">
<artifact name="unidic" type=".zip" url="http://ja.osdn.net/frs/redir.php?m=iij&amp;f=unidic%2F58338%2Funidic-mecab-2.1.2_src.zip"/>
Contributor:
If we download external files we must somewhere include a license demonstrating that those files are positively licensed for re-use, ideally with some recognized open source license. Do you know what the status is of this source for unidic? What is osdn.net?

@mocobeta commented Oct 14, 2019

Regarding licensing, UniDic is distributed under a GPL/LGPL/BSD (3-clause) triple license: https://unidic.ninjal.ac.jp/back_number#unidic_bccwj (note: in Japanese)
To my knowledge, the last one is compatible with the ASF's policy.

Contributor:

OK, great. We should add a reference to the UniDic license to lucene/NOTICE.txt then, where we already have a statement about ipadic, and possibly copy the license into lucene/licenses/; I'm not sure whether that's needed.

Author:

Thanks @msokolov and @mocobeta.
I'll add the dictionary info to NOTICE.txt.

@msokolov commented:

I understand there are no tests. Could you at least state for the record that you were able to build both the UniDic- and ipadic-based dictionaries after your change using ant build-dict?

@mocobeta commented:

> Unfortunately there is no tests for DictionaryBuilder.

Yes, it's a problem for future maintenance. I think we may need some kind of validator for the binary encoded dictionaries, rather than directly writing unit tests for the builder.
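Such a validator could take many forms; as one hedged sketch of the idea, a minimal sanity check might verify that a binary file starts with an expected magic header and declares a plausible entry count. The magic string and layout below are invented for illustration; a real validator would follow the actual on-disk format of the Kuromoji .dat files.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a binary-dictionary sanity check. The header
// name and file layout are assumptions, not Lucene's real format.
public class DictValidator {
  static final byte[] MAGIC = "kuromoji_dict".getBytes(StandardCharsets.US_ASCII); // assumed magic

  /** Builds the assumed header: magic bytes followed by a big-endian entry count. */
  static byte[] header(int entryCount) {
    ByteBuffer buf = ByteBuffer.allocate(MAGIC.length + Integer.BYTES);
    buf.put(MAGIC).putInt(entryCount);
    return buf.array();
  }

  /** Returns true if the file starts with the magic and a positive entry count. */
  static boolean validate(byte[] file) {
    if (file.length < MAGIC.length + Integer.BYTES) {
      return false; // too short to even hold the header
    }
    ByteBuffer buf = ByteBuffer.wrap(file);
    byte[] magic = new byte[MAGIC.length];
    buf.get(magic);
    if (!java.util.Arrays.equals(magic, MAGIC)) {
      return false; // wrong or corrupted header
    }
    return buf.getInt() > 0; // entry count must be positive
  }
}
```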

@johtani commented Oct 16, 2019

At the least, I should paste the "build successful" messages for UniDic and ipadic.
And I will think about tests for the binary dictionaries.

@johtani commented Oct 16, 2019

Here is the output of ant clean; ant build-dict with ipadic:
https://gist.github.com/johtani/b53e9e241e5b98519fb3ffe12b4164eb

And here is the output with UniDic and the modified build.xml:
https://gist.github.com/johtani/91cfd2753aba2e001c1d39f47666ada7

@azagniotov commented Aug 21, 2023

Hello Team,

May I inquire where we are on this?

TL;DR

In the meantime, I attempted and succeeded in building the following dictionaries, using the tweaks that @johtani added in his PR three years ago plus a few minor tweaks of my own:

See the attached screenshots displaying the generated files 👇🏼. I have also tested the built dictionaries; see the Testing section below 👇🏼

Shall I try to make a new PR under https://github.com/apache/lucene in order to get the conversation re-started on this? cc: @mocobeta 🙇🏼‍♀️

Building

Caveat

RE: building unidic-cwj-3.1.1-full:

  1. I had to increase the length check of the baseForm from 16 to 35.
  2. I had to stop throwing an exception for multiple entries of LeftID. Instead, I printed them out of curiosity:
...
...
Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,連用形-融合], new fullPOSData=[動詞-一般,五段-ガ行,仮定形-融合]
Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-ガ行,連用形-融合]
Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,連用形-融合], new fullPOSData=[動詞-一般,五段-ガ行,仮定形-融合]
Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-ガ行,連用形-融合]
Multiple entries found for leftID=14172: existing=[動詞-一般,五段-バ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-バ行,連用形-融合]
Multiple entries found for leftID=1121: existing=[動詞-一般,五段-ガ行,連用形-融合], new fullPOSData=[動詞-一般,五段-ガ行,仮定形-融合]
Multiple entries found for leftID=1121: existing=[動詞-一般,五段-ガ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-ガ行,連用形-融合]
Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,連用形-融合], new fullPOSData=[動詞-一般,五段-ガ行,仮定形-融合]
Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-ガ行,連用形-融合]
Multiple entries found for leftID=5187: existing=[動詞-一般,五段-ガ行,連用形-融合], new fullPOSData=[動詞-一般,五段-ガ行,仮定形-融合]
Multiple entries found for leftID=5187: existing=[動詞-一般,五段-ガ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-ガ行,連用形-融合]
Multiple entries found for leftID=8269: existing=[動詞-一般,五段-ガ行,連用形-融合], new fullPOSData=[動詞-一般,五段-ガ行,仮定形-融合]
Multiple entries found for leftID=8269: existing=[動詞-一般,五段-ガ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-ガ行,連用形-融合]
Multiple entries found for leftID=9019: existing=[動詞-一般,五段-バ行,連用形-融合], new fullPOSData=[動詞-一般,五段-バ行,仮定形-融合]
Multiple entries found for leftID=9019: existing=[動詞-一般,五段-バ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-バ行,連用形-融合]
Multiple entries found for leftID=11995: existing=[動詞-一般,五段-ガ行,連用形-融合], new fullPOSData=[動詞-一般,五段-ガ行,仮定形-融合]
Multiple entries found for leftID=11995: existing=[動詞-一般,五段-ガ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-ガ行,連用形-融合]
Multiple entries found for leftID=7343: existing=[動詞-一般,五段-ガ行,連用形-融合], new fullPOSData=[動詞-一般,五段-ガ行,仮定形-融合]
Multiple entries found for leftID=7343: existing=[動詞-一般,五段-ガ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-ガ行,連用形-融合]
Multiple entries found for leftID=9019: existing=[動詞-一般,五段-バ行,連用形-融合], new fullPOSData=[動詞-一般,五段-バ行,仮定形-融合]
Multiple entries found for leftID=9019: existing=[動詞-一般,五段-バ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-バ行,連用形-融合]
Multiple entries found for leftID=13457: existing=[動詞-一般,上一段-ガ行,連体形-撥音便], new fullPOSData=[動詞-一般,上一段-ガ行,仮定形-融合]
...
...
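The "print instead of throw" tweak described in caveat 2 above can be sketched roughly as follows. The class and method names here are hypothetical (the real change lives inside the dictionary builder); the sketch only shows the policy of keeping the first registered POS data for a left context ID and logging later conflicts:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: register POS data per left context ID, warning on
// conflicts instead of throwing, as described in the comment above.
public class PosRegistry {
  private final Map<Integer, String> posByLeftId = new HashMap<>();

  /** Returns true if the entry was stored or already present; false on conflict. */
  boolean register(int leftId, String fullPosData) {
    String existing = posByLeftId.putIfAbsent(leftId, fullPosData);
    if (existing != null && !existing.equals(fullPosData)) {
      // Log instead of throwing, so the build can proceed.
      System.out.println("Multiple entries found for leftID=" + leftId
          + ": existing=[" + existing + "], new fullPOSData=[" + fullPosData + "]");
      return false;
    }
    return true;
  }
}
```

Whether silently keeping the first entry is semantically correct for UniDic is exactly the open question the log output above raises.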

Gradle commands

The following was performed on a fresh clone of https://github.com/apache/lucene:

My build command leveraged the new Gradle setup and the DictionaryBuilder Javadoc comment about how to do it.

In lucene/analysis/kuromoji/build.gradle, I added a default run task from Gradle's application plugin (apply plugin: 'application'):

application {
  mainModule = 'org.apache.lucene.analysis.kuromoji' // name defined in module-info.java
  mainClass = 'org.apache.lucene.analysis.ja.dict.DictionaryBuilder'
}

The Gradle commands I used to build the dictionaries are as follows. They were executed from the root directory lucene, where the gradlew file is.

Building unidic-cwj-202302_full command

./gradlew -p lucene/analysis/kuromoji run --args='unidic "/Users/azagniotov/Downloads/unidic-cwj-202302_full" "/Users/azagniotov/Downloads/unidic-cwj-202302_full/lucene-kuromoji-built" "UTF-8" false'

Building unidic-cwj-3.1.1-full command

./gradlew -p lucene/analysis/kuromoji run --args='unidic "/Users/azagniotov/Downloads/unidic-cwj-3.1.1-full" "/Users/azagniotov/Downloads/unidic-cwj-3.1.1-full/lucene-kuromoji-built" "UTF-8" false'

Building unidic-mecab-2.1.2_src command

./gradlew -p lucene/analysis/kuromoji run --args='unidic "/Users/azagniotov/Downloads/unidic-mecab-2.1.2_src" "/Users/azagniotov/Downloads/unidic-mecab-2.1.2_src/lucene-kuromoji-built" "UTF-8" false'

Building mecab-ipadic-2.7.0-20070801 command

./gradlew -p lucene/analysis/kuromoji run --args='ipadic "/Users/azagniotov/Downloads/mecab-ipadic-2.7.0-20070801" "/Users/azagniotov/Downloads/mecab-ipadic-2.7.0-20070801/lucene-kuromoji-built" "EUC-JP" false'

Testing

The built dictionaries were tested using the following Japanese strings:

  • "にじさんじ"
  • "ちいかわ"
  • "桃太郎電鉄"
  • "聖川真斗"

The dictionary metadata (dictionary by dictionary) was placed under lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja/dict, and a few unit test cases were added:

Existing default dictionary

assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"に", "じさ", "ん", "じ"}, new int[] {1, 1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "か", "わ"}, new int[] {1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃太郎", "電鉄"}, new int[] {1, 1});
assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖川", "真", "斗"}, new int[] {1, 1, 1});

Built unidic-cwj-202302_full

I needed to increase memory before running the tests, as ConnectionCosts.dat is ~700 MB.

assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"にじ", "さん", "じ"}, new int[] {1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "かわ"}, new int[] {1, 1});
assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃", "桃太郎", "太郎", "電鉄"}, new int[] {1, 0, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖", "川", "真", "斗"}, new int[] {1, 1, 1, 1});

Built unidic-cwj-3.1.1-full

assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"にじ", "さ", "ん", "じ"}, new int[] {1, 1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "かわ"}, new int[] {1, 1});
assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃太郎", "電鉄"}, new int[] {1, 1});
assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖", "川", "真斗"}, new int[] {1, 1, 1});

Built unidic-mecab-2.1.2_src

assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"にじ", "さん", "じ"}, new int[] {1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "か", "わ"}, new int[] {1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃", "桃太郎", "太郎", "電鉄"}, new int[] {1, 0, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖", "川", "真", "斗"}, new int[] {1, 1, 1, 1});

Built mecab-ipadic-2.7.0-20070801

assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"に", "じさ", "ん", "じ"}, new int[] {1, 1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "か", "わ"}, new int[] {1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃太郎", "電鉄"}, new int[] {1, 1});
assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖川", "真", "斗"}, new int[] {1, 1, 1});

Screenshots

Built unidic-cwj-202302_full

Screen Shot 2023-08-21 at 16 26 06

Built unidic-mecab-2.1.2_src

Screen Shot 2023-08-21 at 21 25 11

Built mecab-ipadic-2.7.0-20070801

Screen Shot 2023-08-21 at 17 36 04

@msokolov commented:

Yes if you want to revive the discussion, please move the PR over to the lucene git repo. I'm a little unclear on the future of this though. So far it's a pretty expert feature. To use it you have to edit the gradle build script and rebuild Lucene, right?

@azagniotov commented:

> Yes if you want to revive the discussion, please move the PR over to the lucene git repo. I'm a little unclear on the future of this though. So far it's a pretty expert feature. To use it you have to edit the gradle build script and rebuild Lucene, right?

@msokolov hello!

Thank you. Yes, I figured the PR would eventually need to be exported under the right repo, so I went ahead and did it:

To use this feature, yes, the Lucene analysis Kuromoji JAR will have to be rebuilt after building a new dictionary. For example, while under the root lucene/ directory where the gradlew file is:

  1. Run ./gradlew compileUnidic (added in my PR). This will download and compile UniDic 3.1.1 and put the generated *.dat files under the resources directory of the kuromoji package.
  2. To create a new Lucene analysis Kuromoji JAR with the new *.dat files, just run ./gradlew assemble.

To extend this example a little further, more specifically: using it together with Solr.

My aforementioned PR under Lucene adds the new changes against the latest state of the main branch.

If you are running Solr in Docker, the PR changes must be adapted to the Lucene v9.7.0 branch, which is what the latest Solr runs on. This is something I did when rebuilding and testing the Lucene library: I have a patch, lucene_9.7.0_kuromoji_unidic_3_compatibility, that adapts the PR changes to the v9.7.0 branch.

Thus, by doing the above steps inside the Dockerfile, you can then replace Solr's default lucene-analysis-kuromoji-9.7.0 JAR after rebuilding it:

...
...
ENV SOLR_JAVA_MEM="-Xms2g -Xmx2g"
ENV SOLR_LIB_HOME=/opt/solr/server/solr-webapp/webapp/WEB-INF/lib

USER root
RUN rm $SOLR_LIB_HOME/lucene-analysis-kuromoji-*.jar
COPY ./lucene/analysis/kuromoji/build/libs/lucene-analysis-kuromoji-9.7.0-SNAPSHOT.jar $SOLR_LIB_HOME/
...
...
USER solr
