
LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary #12517

Open · wants to merge 14 commits into base: main
Conversation

@azagniotov azagniotov commented Aug 22, 2023

TL;DR

The current PR attempts to remediate https://issues.apache.org/jira/browse/LUCENE-4056 (Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary)

Description

The current PR builds upon @johtani's past PR apache/lucene-solr#935 and attempts to bring the code into a mergeable state, or at least to re-ignite the conversation about building a UniDic dictionary.

Before opening the current PR, I verified @johtani's added behavior from the aforementioned PR (plus some changes of my own) by successfully building a number of dictionaries outlined below, and posted my findings in the comment: apache/lucene-solr#935 (comment)

Philosophy of changes

I tried to pick up @johtani's added behavior as-is, while keeping my own additional changes to a minimum.

To be honest, I do not really like how DictionaryBuilder.DictionaryFormat format is passed everywhere, as it creates small decision trees of the form "if IPADIC do this, else do that". If this PR gets merged, I would be happy to refactor to make the code more modular (where possible), perhaps with a better separation of concerns. But I defer to the maintainers of the Lucene library on this one. 🙇🏼‍♀️
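One possible shape for that refactoring (a sketch only — the interface, class names, and column indices below are hypothetical illustrations, not the actual builder API; the 16 → 35 base-form limit comes from the caveat further down) would be a per-format strategy object instead of format checks scattered through the code:

```java
// Hypothetical sketch: replace "if (format == IPADIC) ... else ..."
// branches with a strategy interface implemented once per format.
interface DictionaryFormatSupport {
  /** Index of the base-form column in a CSV entry (illustrative values). */
  int baseFormColumn();

  /** Maximum accepted base-form length for this format. */
  int maxBaseFormLength();
}

final class IpadicSupport implements DictionaryFormatSupport {
  @Override public int baseFormColumn() { return 10; }
  @Override public int maxBaseFormLength() { return 16; }
}

final class UnidicSupport implements DictionaryFormatSupport {
  // UniDic CSV rows carry one extra column, and unidic-cwj-3.1.1-full
  // needed the base-form length limit raised from 16 to 35.
  @Override public int baseFormColumn() { return 11; }
  @Override public int maxBaseFormLength() { return 35; }
}
```

Call sites would then hold a single DictionaryFormatSupport reference instead of re-testing the format enum at every branch.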

The commits are fairly atomic, and I hope this makes the reviewing experience easier.

Built Dictionaries

Caveat

RE: Building the unidic-cwj-3.1.1-full:

  1. I had to increase the length check of the baseForm from 16 to 35.
  2. I had to stop throwing an exception for multiple entries of LeftID, whose full ramifications, to be honest, I do not understand. While debugging, I printed some of the values for the same left IDs instead of throwing; this is what they look like:
    Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,連用形-融合], new fullPOSData=[動詞-一般,五段-ガ行,仮定形-融合]
    Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-ガ行,連用形-融合]
    Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,連用形-融合], new fullPOSData=[動詞-一般,五段-ガ行,仮定形-融合]
    Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-ガ行,連用形-融合]
    Multiple entries found for leftID=14172: existing=[動詞-一般,五段-バ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-バ行,連用形-融合]
    Multiple entries found for leftID=1121: existing=[動詞-一般,五段-ガ行,連用形-融合], new fullPOSData=[動詞-一般,五段-ガ行,仮定形-融合]
    Multiple entries found for leftID=1121: existing=[動詞-一般,五段-ガ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-ガ行,連用形-融合]
    Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,連用形-融合], new fullPOSData=[動詞-一般,五段-ガ行,仮定形-融合]
    Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-ガ行,連用形-融合]
    

Some inflection-form values are different for the same leftID; also, why do we have duplicated IDs? (^) Thus, I would value the input of subject-matter experts here 🙇🏼‍♀️
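The "stop throwing, just report" behavior described above could be sketched as a first-wins policy (the class and method names here are hypothetical, not the actual builder code):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a first-wins policy for duplicate leftID POS data: keep the
// first fullPOSData seen for a leftID, and log (rather than throw on)
// any later conflicting entry.
public class LeftIdPosData {
  private final Map<Integer, String> posByLeftId = new HashMap<>();

  /** Returns true if stored or identical; false on a logged conflict. */
  public boolean add(int leftId, String fullPOSData) {
    String existing = posByLeftId.putIfAbsent(leftId, fullPOSData);
    if (existing == null || existing.equals(fullPOSData)) {
      return true;
    }
    System.err.println("Multiple entries found for leftID=" + leftId
        + ": existing=" + existing + ", new fullPOSData=" + fullPOSData);
    return false; // first entry wins; the conflict is only reported
  }

  public String get(int leftId) {
    return posByLeftId.get(leftId);
  }
}
```

Whether first-wins is semantically right for UniDic's connection costs is exactly the open question for the subject-matter experts.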

Building a dictionary

Gradle command

My build command leverages the new Gradle setup and follows the DictionaryBuilder JavaDoc comment on how to build a dictionary:

I added to lucene/analysis/kuromoji/build.gradle a default run task from Gradle's application plugin, which allows building a dictionary as follows. The command should be executed from the root lucene directory, where the gradlew file is.

For example, the following is my command when building unidic-cwj-3.1.1-full dictionary without the NFKD normalization:

./gradlew -p lucene/analysis/kuromoji run --args='unidic "/Users/azagniotov/Downloads/unidic-cwj-3.1.1-full" "/Users/azagniotov/Downloads/unidic-cwj-3.1.1-full/build" "UTF-8" false'
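For reference, the five positional values in --args above line up as format, input directory, output directory, encoding, and the normalization flag. A small sketch of that mapping (the parsing class here is hypothetical — only the argument order mirrors the command above):

```java
// Hypothetical sketch: how the five --args values map positionally.
public class BuilderArgs {
  final String format;     // "unidic" or "ipadic"
  final String inputDir;   // directory containing the dictionary CSV files
  final String outputDir;  // where the built dictionary files are written
  final String encoding;   // e.g. "UTF-8" or "EUC-JP"
  final boolean normalize; // whether to normalize entries (NFKD)

  BuilderArgs(String[] args) {
    if (args.length != 5) {
      throw new IllegalArgumentException(
          "expected: format inputDir outputDir encoding normalize");
    }
    format = args[0];
    inputDir = args[1];
    outputDir = args[2];
    encoding = args[3];
    normalize = Boolean.parseBoolean(args[4]);
  }
}
```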

Unit testing

Unfortunately, the current PR does not include unit tests because the built dictionary files are very big: e.g., the built unidic-cwj-202302_full connection costs .DAT file is around 700 MB, while the unidic-cwj-3.1.1-full connection costs .DAT file is around 450 MB. Thus, the following unit tests were added and run locally on my machine to verify that the built dictionaries can be used at runtime.

I did see that there is a main/gradle/generation/kuromoji.gradle that downloads dictionaries and compiles them, but for now, I resorted to not adding changes there (bc9d2e3).

I also think a dictionary should be decoupled (this is related to the existing discussion in:

The built dictionaries were tested using the following Japanese strings (no particular reason, I just picked these four strings):

  • "にじさんじ"
  • "ちいかわ"
  • "桃太郎電鉄"
  • "聖川真斗"

The dictionary metadata were placed (dictionary after dictionary) under lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja/dict, and a few unit test cases were added:

Existing default dictionary already included in Lucene

assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"に", "じさ", "ん", "じ"}, new int[] {1, 1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "か", "わ"}, new int[] {1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃太郎", "電鉄"}, new int[] {1, 1});
assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖川", "真", "斗"}, new int[] {1, 1, 1});

Built unidic-cwj-202302_full

I needed to increase memory before running the tests, as the ConnectionCosts.dat is ~700MB: in main/gradle/testing/defaults-tests.gradle#L50

-           value: { -> propertyOrEnvOrDefault("tests.jvmargs", "TEST_JVM_ARGS", isCIBuild ? "" : "-XX:TieredStopAtLevel=1 -XX:+UseParallelGC -XX:ActiveProcessorCount=1") },
+           value: { -> propertyOrEnvOrDefault("tests.jvmargs", "TEST_JVM_ARGS", isCIBuild ? "" : "-XX:MaxDirectMemorySize=2g -XX:TieredStopAtLevel=1 -XX:+UseParallelGC -XX:ActiveProcessorCount=1") },
assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"にじ", "さん", "じ"}, new int[] {1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "かわ"}, new int[] {1, 1});
assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃", "桃太郎", "太郎", "電鉄"}, new int[] {1, 0, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖", "川", "真", "斗"}, new int[] {1, 1, 1, 1});

Built unidic-cwj-3.1.1-full

assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"にじ", "さ", "ん", "じ"}, new int[] {1, 1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "かわ"}, new int[] {1, 1});
assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃太郎", "電鉄"}, new int[] {1, 1});
assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖", "川", "真斗"}, new int[] {1, 1, 1});

Built unidic-mecab-2.1.2_src

assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"にじ", "さん", "じ"}, new int[] {1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "か", "わ"}, new int[] {1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃", "桃太郎", "太郎", "電鉄"}, new int[] {1, 0, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖", "川", "真", "斗"}, new int[] {1, 1, 1, 1});

Built mecab-ipadic-2.7.0-20070801

 assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"に", "じさ", "ん", "じ"}, new int[] {1, 1, 1, 1});
 assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "か", "わ"}, new int[] {1, 1, 1});
 assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃太郎", "電鉄"}, new int[] {1, 1});
 assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖川", "真", "斗"}, new int[] {1, 1, 1});

@mikemccand
Member

I am far from knowing much about Kuromoji and its dictionaries, but this sounds like a great change (being able to load a new (UniDic) dictionary format), and the PR still cleanly applies. Are there any concerns with this? @rmuir had mentioned some difficult assert on the Jira issue long ago -- is that resolved / worked around?

@uschindler
Contributor

Hi,
Please give me some time to review.

Please don't add the application plugin. Instead just add a plain java runner task. The result of the project is a library jar, so please don't change this as it could have effects on the resulting maven pom.

@azagniotov
Author

@uschindler thank you for also taking a look, in addition to others. Understood. Please allow me a few days to export a commit that would add a plain java runner task instead of the application plugin. Thank you


github-actions bot commented Jan 8, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jan 8, 2024
@azagniotov
Author

azagniotov commented Jan 13, 2024

Please don't add the application plugin. Instead just add a plain java runner task. The result of the project is a library jar, so please don't change this as it could have effects on the resulting maven pom.

Hi @uschindler, I ended up simply reverting the 8d52f66 commit. The current gradle/generation/kuromoji.gradle already contains def recompileDictionary(...), which is used by all the compileXXXX(...) tasks and can be used as an example. Thus, the task and the application plugin that I added were not necessary after all.

@github-actions github-actions bot removed the Stale label Jan 14, 2024
@github-actions github-actions bot added the Stale label Jan 29, 2024
@azagniotov
Author

Hi @uschindler , I wanted to ping you and see if you have any thoughts on this

@github-actions github-actions bot removed the Stale label Feb 4, 2024
@github-actions github-actions bot added the Stale label Feb 18, 2024
@uschindler
Contributor

I can't tell you anything about the internals touched here, so I can't review it from the language specific point of view.

The Gradle changes look fine.

@uschindler
Contributor

One thing! Have you checked that a simple recompile of the original mecab (gradlew regenerate) produces the same files (working copy clean afterwards)?

@github-actions github-actions bot removed the Stale label Feb 19, 2024
Leveraging Gradle to execute the builder makes the command configuration much easier, as the engineer no longer needs to set the Lucene classpath manually before running the command.
UnknownDictionaryBuilder uses the dictionary format to decide how the CSV line being read should be parsed. A UniDic line has one more column than an IPAdic line.
I do not know if this was the right choice, but this is what prompted me to remove the check that was throwing in the two following cases:

- When building any UNIDIC 3 with normalizeEntries == true
- When building unidic-cwj-3.1.1-full (even with normalizeEntries == false)

This allows downloading and building a UniDic dictionary, which can then be assembled into the Lucene Kuromoji JAR by running ./gradlew compileUnidic followed by ./gradlew assemble.
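The UniDic-vs-IPAdic column difference mentioned in the commit notes above could be handled along these lines (a sketch only — the class is hypothetical, and it assumes for illustration that the extra column is trailing; the real parser may differ):

```java
import java.util.Arrays;

// Sketch: format-aware handling of an unknown-dictionary CSV line.
// A UniDic line has one more column than an IPAdic line; here we assume
// (for illustration only) the extra column is trailing and drop it so
// downstream handling can stay shared between formats.
public class UnknownEntryColumns {
  enum Format { IPADIC, UNIDIC }

  static String[] normalizeColumns(String line, Format format) {
    String[] cols = line.split(",");
    if (format == Format.UNIDIC) {
      return Arrays.copyOf(cols, cols.length - 1); // drop the extra column
    }
    return cols;
  }
}
```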
@azagniotov
Author

@uschindler yes: I have rebased from the latest main branch and ran the ./gradlew clean regenerate. The following is the (partial) output:

...
...
> Task :lucene:analysis:nori:compileMecabKo
Download https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.1.1-20180720.tar.gz

> Task :lucene:analysis:kuromoji:compileMecab
Download https://jaist.dl.sourceforge.net/project/mecab/mecab-ipadic/2.7.0-20070801/mecab-ipadic-2.7.0-20070801.tar.gz
Automaton regenerated from dictionary: mecab-ipadic-2.7.0-20070801

> Task :lucene:analysis:nori:compileMecabKo
Automaton regenerated from dictionary: mecab-ko-dic-2.1.1-20180720

BUILD SUCCESSFUL in 39s
127 actionable tasks: 121 executed, 6 up-to-date

I also confirmed that the dictionary files *.dat were regenerated under the lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja based on the modified date of the files.

@mocobeta What do you think with regards to next steps with this PR? Is there anything I can add?

@azagniotov azagniotov force-pushed the lucene_kuromoji_unidic_3_compatibility branch from d24e7bb to 15e2ccc Compare February 22, 2024 18:22
@mocobeta
Contributor

@azagniotov Sorry, I've not been available for a while. Let me take a look; I will try to find time next week...

@azagniotov
Author

Thank you, @mocobeta 🙇🏼


github-actions bot commented Mar 8, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Mar 8, 2024
@mocobeta
Contributor

Hi, sorry for my late reply.
I quickly checked the built dictionary size. The latest Unidic is fairly (to me, insanely) large - its total size is 1.6G.
https://clrd.ninjal.ac.jp/unidic/back_number.html#unidic_cwj

The built kuromoji jar with unidic-cwj-3.1.1-full eventually becomes 442M. Besides the size, I think we should consider performance. I'm worried that there can be a significant impact on analysis/indexing speed. Do you have any benchmark result on that?

@github-actions github-actions bot removed the Stale label Mar 25, 2024
@mikemccand
Member

The built kuromoji jar with unidic-cwj-3.1.1-full eventually becomes 442M. Besides the size, I think we should consider performance. I'm worried that there can be a significant impact on analysis/indexing speed. Do you have any benchmark result on that?

luceneutil has a runAnalyzerPerf.py benchmark, and the nightly script runs it and posts the results here, including Kuromoji. We could maybe test tokens/sec throughput using that benchy for this PR running on UniDic? I'm not sure how we would conclude whether it's fast enough since current Kuromoji (main) can't build UniDic. Maybe we could defer performance testing / optimizing on UniDic to a later issue...

@azagniotov
Author

@mocobeta thank you. I have not done any benchmarks, thus I cannot comment on potential performance implications. One thing that is probably certain is that a larger dictionary will require more allocated memory. Btw, have you had a chance to evaluate the correctness of the tokenization?

@mikemccand this sounds interesting. It seems like this is a well-trodden path. How easy/hard would it be to point runAnalyzerPerf.py at the current PR branch?

@github-actions github-actions bot added the Stale label Apr 12, 2024