LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary #935

Open
wants to merge 3 commits into base: master
Conversation

@johtani commented Oct 9, 2019

Description

UniDic has somewhat different columns and a different unk.def from ipadic.
Currently, if we use UniDic, we get errors when building the dictionary.
We can build UniDic after changing build.xml; I left some comments in build.xml.

If we use UniDic, we should also change stoptags.txt and add an example for the part-of-speech filter, since UniDic uses a different set of part-of-speech tags from ipadic.

Solution

Added logic for adjusting the UniDic CSV files and unk.def, and changed build.xml to download and build UniDic.
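The adjustment logic itself is not shown in this thread. As a rough, hedged illustration of the shape such a CSV adjustment could take, here is a minimal self-contained Java sketch; the class name, the expected column count of 13, and the use of "*" as a placeholder for empty trailing features are assumptions for illustration, not the PR's actual code:

```java
import java.util.Arrays;

// Hypothetical sketch: pad or truncate a dictionary CSV row to the column
// count the builder expects. UniDic and ipadic rows differ in both count
// and meaning of columns; this only illustrates the shape of the fix.
public class CsvAdjuster {
  // Assumed expected column count, not the real builder constant.
  static final int EXPECTED_COLUMNS = 13;

  static String[] normalize(String[] row) {
    if (row.length == EXPECTED_COLUMNS) {
      return row;
    }
    // Copy the known columns; fill any missing trailing fields with "*",
    // the conventional MeCab placeholder for an empty feature.
    String[] out = new String[EXPECTED_COLUMNS];
    Arrays.fill(out, "*");
    System.arraycopy(row, 0, out, 0, Math.min(row.length, EXPECTED_COLUMNS));
    return out;
  }
}
```

A real implementation would also remap column positions, since UniDic does not merely add or drop trailing fields.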

Tests

Unfortunately, there are no tests for DictionaryBuilder.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I am authorized to contribute this code to the ASF and have removed any code I do not have a license to distribute.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the master branch.
  • I have run ant precommit and the appropriate test suite.
  • I have added tests for my changes.
  • I have added documentation for the Ref Guide (for Solr changes only).

<artifact name="ipadic" type=".tar.gz" url="https://jaist.dl.sourceforge.net/project/mecab/mecab-ipadic/2.7.0-20070801/mecab-ipadic-2.7.0-20070801.tar.gz"/>
</dependency>
<dependency org="mecab" name="mecab-naist-jdic" rev="${/mecab/mecab-naist-jdic}" conf="naist">
<artifact name="mecab-naist-jdic" type=".tar.gz" url="https://rwthaachen.dl.osdn.jp/naist-jdic/53500/mecab-naist-jdic-0.6.3b-20111013.tar.gz"/>
</dependency>
<dependency org="mecab" name="mecab-unidic" rev="${/mecab/mecab-unidic}" conf="unidic">
<artifact name="unidic" type=".zip" url="http://ja.osdn.net/frs/redir.php?m=iij&amp;f=unidic%2F58338%2Funidic-mecab-2.1.2_src.zip"/>
Contributor:
If we download external files we must somewhere include a license demonstrating that those files are positively licensed for re-use, ideally with some recognized open source license. Do you know what the status is of this source for unidic? What is osdn.net?

@mocobeta commented Oct 14, 2019

Regarding licensing, UniDic is distributed under a GPL/LGPL/BSD (3-clause) triple license: https://unidic.ninjal.ac.jp/back_number#unidic_bccwj (note: in Japanese)
To my knowledge, the last one is compatible with the ASF's policy.

Contributor:

OK, great. We should add a reference to the UniDic license to lucene/NOTICE.txt then, where we already have a statement about ipadic, and possibly copy the license into lucene/licenses/; I'm not sure whether that's needed.

Author:

Thanks @msokolov and @mocobeta.
I'll add the dictionary info to NOTICE.txt.

@msokolov commented:

I understand there are no tests. Could you at least state for the record that you were able to build both the UniDic- and ipadic-based dictionaries after your change using ant build-dict?

@mocobeta commented:

> Unfortunately there is no tests for DictionaryBuilder.

Yes, it's a problem for future maintenance. I think we may need some kind of validator for the binary encoded dictionaries, rather than directly writing unit tests for the builder.
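Such a validator could take many forms; as one hedged sketch of the idea, a minimal sanity check might verify that a binary file starts with an expected magic header and declares a plausible entry count. The magic string and layout below are invented for illustration; a real validator would follow the actual on-disk format of the Kuromoji .dat files.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a binary-dictionary sanity check. The header
// name and file layout are assumptions, not Lucene's real format.
public class DictValidator {
  static final byte[] MAGIC = "kuromoji_dict".getBytes(StandardCharsets.US_ASCII); // assumed magic

  /** Builds the assumed header: magic bytes followed by a big-endian entry count. */
  static byte[] header(int entryCount) {
    ByteBuffer buf = ByteBuffer.allocate(MAGIC.length + Integer.BYTES);
    buf.put(MAGIC).putInt(entryCount);
    return buf.array();
  }

  /** Returns true if the file starts with the magic and a positive entry count. */
  static boolean validate(byte[] file) {
    if (file.length < MAGIC.length + Integer.BYTES) {
      return false; // too short to even hold the header
    }
    ByteBuffer buf = ByteBuffer.wrap(file);
    byte[] magic = new byte[MAGIC.length];
    buf.get(magic);
    if (!java.util.Arrays.equals(magic, MAGIC)) {
      return false; // wrong or corrupted header
    }
    return buf.getInt() > 0; // entry count must be positive
  }
}
```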

@johtani commented Oct 16, 2019

At the least, I should paste the "build successful" messages for UniDic and ipadic.
And I will think about tests for the binary dictionaries.

@johtani commented Oct 16, 2019

Here is the output of ant clean; ant build-dict with ipadic:
https://gist.github.com/johtani/b53e9e241e5b98519fb3ffe12b4164eb

And here is the output with UniDic and the modified build.xml:
https://gist.github.com/johtani/91cfd2753aba2e001c1d39f47666ada7

@azagniotov commented Aug 21, 2023

Hello Team,

May I inquire where we are on this?

TL;DR

In the meantime, I attempted and succeeded in building the following dictionaries, using the tweaks that @johtani added in his PR three years ago plus a few minor tweaks of my own:

See the attached screenshots displaying the generated files 👇🏼. I have also tested the built dictionaries; see the Testing section below 👇🏼

Shall I try to make a new PR under https://github.com/apache/lucene in order to get the conversation re-started on this? cc: @mocobeta 🙇🏼‍♀️

Building

Caveat

RE: building unidic-cwj-3.1.1-full:

  1. I had to increase the length check of the baseForm from 16 to 35.
  2. I had to stop throwing an exception for multiple entries of LeftID. Instead, I printed them out of curiosity:
...
...
Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,連用形-融合], new fullPOSData=[動詞-一般,五段-ガ行,仮定形-融合]
Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-ガ行,連用形-融合]
Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,連用形-融合], new fullPOSData=[動詞-一般,五段-ガ行,仮定形-融合]
Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-ガ行,連用形-融合]
Multiple entries found for leftID=14172: existing=[動詞-一般,五段-バ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-バ行,連用形-融合]
Multiple entries found for leftID=1121: existing=[動詞-一般,五段-ガ行,連用形-融合], new fullPOSData=[動詞-一般,五段-ガ行,仮定形-融合]
Multiple entries found for leftID=1121: existing=[動詞-一般,五段-ガ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-ガ行,連用形-融合]
Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,連用形-融合], new fullPOSData=[動詞-一般,五段-ガ行,仮定形-融合]
Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-ガ行,連用形-融合]
Multiple entries found for leftID=5187: existing=[動詞-一般,五段-ガ行,連用形-融合], new fullPOSData=[動詞-一般,五段-ガ行,仮定形-融合]
Multiple entries found for leftID=5187: existing=[動詞-一般,五段-ガ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-ガ行,連用形-融合]
Multiple entries found for leftID=8269: existing=[動詞-一般,五段-ガ行,連用形-融合], new fullPOSData=[動詞-一般,五段-ガ行,仮定形-融合]
Multiple entries found for leftID=8269: existing=[動詞-一般,五段-ガ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-ガ行,連用形-融合]
Multiple entries found for leftID=9019: existing=[動詞-一般,五段-バ行,連用形-融合], new fullPOSData=[動詞-一般,五段-バ行,仮定形-融合]
Multiple entries found for leftID=9019: existing=[動詞-一般,五段-バ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-バ行,連用形-融合]
Multiple entries found for leftID=11995: existing=[動詞-一般,五段-ガ行,連用形-融合], new fullPOSData=[動詞-一般,五段-ガ行,仮定形-融合]
Multiple entries found for leftID=11995: existing=[動詞-一般,五段-ガ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-ガ行,連用形-融合]
Multiple entries found for leftID=7343: existing=[動詞-一般,五段-ガ行,連用形-融合], new fullPOSData=[動詞-一般,五段-ガ行,仮定形-融合]
Multiple entries found for leftID=7343: existing=[動詞-一般,五段-ガ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-ガ行,連用形-融合]
Multiple entries found for leftID=9019: existing=[動詞-一般,五段-バ行,連用形-融合], new fullPOSData=[動詞-一般,五段-バ行,仮定形-融合]
Multiple entries found for leftID=9019: existing=[動詞-一般,五段-バ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-バ行,連用形-融合]
Multiple entries found for leftID=13457: existing=[動詞-一般,上一段-ガ行,連体形-撥音便], new fullPOSData=[動詞-一般,上一段-ガ行,仮定形-融合]
...
...
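The "print instead of throw" tweak described in caveat 2 above can be sketched roughly as follows. The class and method names here are hypothetical (the real change lives inside the dictionary builder); the sketch only shows the policy of keeping the first registered POS data for a left context ID and logging later conflicts:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: register POS data per left context ID, warning on
// conflicts instead of throwing, as described in the comment above.
public class PosRegistry {
  private final Map<Integer, String> posByLeftId = new HashMap<>();

  /** Returns true if the entry was stored or already present; false on conflict. */
  boolean register(int leftId, String fullPosData) {
    String existing = posByLeftId.putIfAbsent(leftId, fullPosData);
    if (existing != null && !existing.equals(fullPosData)) {
      // Log instead of throwing, so the build can proceed.
      System.out.println("Multiple entries found for leftID=" + leftId
          + ": existing=[" + existing + "], new fullPOSData=[" + fullPosData + "]");
      return false;
    }
    return true;
  }
}
```

Whether silently keeping the first entry is semantically correct for UniDic is exactly the open question the log output above raises.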

Gradle commands

The following was performed on a fresh clone of https://github.com/apache/lucene:

My build command leveraged the new Gradle setup and the DictionaryBuilder Javadoc comment about how to do it.

In lucene/analysis/kuromoji/build.gradle, I added a default run task from Gradle's application plugin (apply plugin: 'application'):

application {
  mainModule = 'org.apache.lucene.analysis.kuromoji' // name defined in module-info.java
  mainClass = 'org.apache.lucene.analysis.ja.dict.DictionaryBuilder'
}

The Gradle commands I used to build the dictionaries are as follows. They were executed from the root directory lucene, where the gradlew file is.

Building unidic-cwj-202302_full command

./gradlew -p lucene/analysis/kuromoji run --args='unidic "/Users/azagniotov/Downloads/unidic-cwj-202302_full" "/Users/azagniotov/Downloads/unidic-cwj-202302_full/lucene-kuromoji-built" "UTF-8" false'

Building unidic-cwj-3.1.1-full command

./gradlew -p lucene/analysis/kuromoji run --args='unidic "/Users/azagniotov/Downloads/unidic-cwj-3.1.1-full" "/Users/azagniotov/Downloads/unidic-cwj-3.1.1-full/lucene-kuromoji-built" "UTF-8" false'

Building unidic-mecab-2.1.2_src command

./gradlew -p lucene/analysis/kuromoji run --args='unidic "/Users/azagniotov/Downloads/unidic-mecab-2.1.2_src" "/Users/azagniotov/Downloads/unidic-mecab-2.1.2_src/lucene-kuromoji-built" "UTF-8" false'

Building mecab-ipadic-2.7.0-20070801 command

./gradlew -p lucene/analysis/kuromoji run --args='ipadic "/Users/azagniotov/Downloads/mecab-ipadic-2.7.0-20070801" "/Users/azagniotov/Downloads/mecab-ipadic-2.7.0-20070801/lucene-kuromoji-built" "EUC-JP" false'

Testing

The built dictionaries were tested using the following Japanese strings:

  • "にじさんじ"
  • "ちいかわ"
  • "桃太郎電鉄"
  • "聖川真斗"

The dictionary metadata (dictionary by dictionary) was placed under lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja/dict, and a few unit test cases were added:

Existing default dictionary

assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"に", "じさ", "ん", "じ"}, new int[] {1, 1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "か", "わ"}, new int[] {1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃太郎", "電鉄"}, new int[] {1, 1});
assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖川", "真", "斗"}, new int[] {1, 1, 1});

Built unidic-cwj-202302_full

I needed to increase memory before running the tests, as ConnectionCosts.dat is ~700 MB.

assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"にじ", "さん", "じ"}, new int[] {1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "かわ"}, new int[] {1, 1});
assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃", "桃太郎", "太郎", "電鉄"}, new int[] {1, 0, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖", "川", "真", "斗"}, new int[] {1, 1, 1, 1});

Built unidic-cwj-3.1.1-full

assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"にじ", "さ", "ん", "じ"}, new int[] {1, 1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "かわ"}, new int[] {1, 1});
assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃太郎", "電鉄"}, new int[] {1, 1});
assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖", "川", "真斗"}, new int[] {1, 1, 1});

Built unidic-mecab-2.1.2_src

assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"にじ", "さん", "じ"}, new int[] {1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "か", "わ"}, new int[] {1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃", "桃太郎", "太郎", "電鉄"}, new int[] {1, 0, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖", "川", "真", "斗"}, new int[] {1, 1, 1, 1});

Built mecab-ipadic-2.7.0-20070801

assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"に", "じさ", "ん", "じ"}, new int[] {1, 1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "か", "わ"}, new int[] {1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃太郎", "電鉄"}, new int[] {1, 1});
assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖川", "真", "斗"}, new int[] {1, 1, 1});

Screenshots

Built unidic-cwj-202302_full

Screen Shot 2023-08-21 at 16 26 06

Built unidic-mecab-2.1.2_src

Screen Shot 2023-08-21 at 21 25 11

Built mecab-ipadic-2.7.0-20070801

Screen Shot 2023-08-21 at 17 36 04

@msokolov commented:

Yes if you want to revive the discussion, please move the PR over to the lucene git repo. I'm a little unclear on the future of this though. So far it's a pretty expert feature. To use it you have to edit the gradle build script and rebuild Lucene, right?

@azagniotov commented:

> Yes if you want to revive the discussion, please move the PR over to the lucene git repo. I'm a little unclear on the future of this though. So far it's a pretty expert feature. To use it you have to edit the gradle build script and rebuild Lucene, right?

@msokolov hello!

Thank you. Yes, I figured the PR would eventually need to be exported under the right repo, so I went ahead and did it:

To use this feature, yes, the Lucene analysis Kuromoji JAR will have to be rebuilt after building a new dictionary. For example, while under the root lucene/ directory where the gradlew file is:

  1. Run ./gradlew compileUnidic (added in my PR). This will download and compile UniDic 3.1.1 and put the generated *.dat files under the resources directory of the kuromoji package.
  2. To create a new Lucene analysis Kuromoji JAR with the new *.dat files, just run ./gradlew assemble.

To extend this example a little further, more specifically: using it together with Solr.

My aforementioned PR under Lucene adds the new changes against the latest state of the main branch.

If you are running Solr in Docker, the PR changes must be adapted to the Lucene v9.7.0 branch, which is what the latest Solr runs on. This is something I did when rebuilding and testing the Lucene library: I have a patch, lucene_9.7.0_kuromoji_unidic_3_compatibility, that adapts the PR changes to the v9.7.0 branch.

Thus, by doing the above steps inside the Dockerfile, you can then replace Solr's default lucene-analysis-kuromoji-9.7.0 JAR after rebuilding it:

...
...
ENV SOLR_JAVA_MEM="-Xms2g -Xmx2g"
ENV SOLR_LIB_HOME=/opt/solr/server/solr-webapp/webapp/WEB-INF/lib

USER root
RUN rm $SOLR_LIB_HOME/lucene-analysis-kuromoji-*.jar
COPY ./lucene/analysis/kuromoji/build/libs/lucene-analysis-kuromoji-9.7.0-SNAPSHOT.jar $SOLR_LIB_HOME/
...
...
USER solr
