
LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary #12517

Open · wants to merge 14 commits into base: main
Conversation

@azagniotov azagniotov commented Aug 22, 2023

TL;DR

The current PR attempts to remediate https://issues.apache.org/jira/browse/LUCENE-4056 (Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary)

Description

The current PR builds upon @johtani's past PR apache/lucene-solr#935 and attempts to bring the code into a mergeable state, or at least to re-ignite the conversation about building a UniDic dictionary.

Before opening the current PR, I verified @johtani's added behavior from the aforementioned PR (plus some changes of my own) by successfully building a number of dictionaries outlined below, and posted my findings in the comment: apache/lucene-solr#935 (comment)

Philosophy of changes

I tried to pick up @johtani's added behavior as-is, while keeping my own additional changes to a minimum.

To be honest, I do not really like how DictionaryBuilder.DictionaryFormat format is passed everywhere, as it creates small decision trees of the form "if IPADIC do this, else do that". If this PR gets merged, I would be happy to refactor to make the code more modular (where possible), perhaps with a better separation of concerns. But I defer to the maintainers of the Lucene library on this one. 🙇🏼‍♀️
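One possible shape for that refactoring (a sketch only — the interface, class names, and column indices below are hypothetical illustrations, not the actual builder API; the 16 → 35 base-form limit comes from the caveat further down) would be a per-format strategy object instead of format checks scattered through the code:

```java
// Hypothetical sketch: replace "if (format == IPADIC) ... else ..."
// branches with a strategy interface implemented once per format.
interface DictionaryFormatSupport {
  /** Index of the base-form column in a CSV entry (illustrative values). */
  int baseFormColumn();

  /** Maximum accepted base-form length for this format. */
  int maxBaseFormLength();
}

final class IpadicSupport implements DictionaryFormatSupport {
  @Override public int baseFormColumn() { return 10; }
  @Override public int maxBaseFormLength() { return 16; }
}

final class UnidicSupport implements DictionaryFormatSupport {
  // UniDic CSV rows carry one extra column, and unidic-cwj-3.1.1-full
  // needed the base-form length limit raised from 16 to 35.
  @Override public int baseFormColumn() { return 11; }
  @Override public int maxBaseFormLength() { return 35; }
}
```

Call sites would then hold a single DictionaryFormatSupport reference instead of re-testing the format enum at every branch.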

The commits are fairly atomic, and I hope this makes the reviewing experience easier.

Built Dictionaries

Caveat

RE: Building the unidic-cwj-3.1.1-full:

  1. I had to increase the length check of the baseForm from 16 to 35.
  2. I had to stop throwing an exception for multiple entries of LeftID, whose full ramifications, to be honest, I do not understand. While debugging, I printed some of the values for the same left IDs instead of throwing; this is what they look like:
    Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,連用形-融合], new fullPOSData=[動詞-一般,五段-ガ行,仮定形-融合]
    Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-ガ行,連用形-融合]
    Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,連用形-融合], new fullPOSData=[動詞-一般,五段-ガ行,仮定形-融合]
    Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-ガ行,連用形-融合]
    Multiple entries found for leftID=14172: existing=[動詞-一般,五段-バ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-バ行,連用形-融合]
    Multiple entries found for leftID=1121: existing=[動詞-一般,五段-ガ行,連用形-融合], new fullPOSData=[動詞-一般,五段-ガ行,仮定形-融合]
    Multiple entries found for leftID=1121: existing=[動詞-一般,五段-ガ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-ガ行,連用形-融合]
    Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,連用形-融合], new fullPOSData=[動詞-一般,五段-ガ行,仮定形-融合]
    Multiple entries found for leftID=4644: existing=[動詞-一般,五段-ガ行,仮定形-融合], new fullPOSData=[動詞-一般,五段-ガ行,連用形-融合]
    

Some inflection-form values are different for the same leftID; also, why do we have duplicated IDs? (^) Thus, I would value the input of subject-matter experts here 🙇🏼‍♀️
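The "stop throwing, just report" behavior described above could be sketched as a first-wins policy (the class and method names here are hypothetical, not the actual builder code):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a first-wins policy for duplicate leftID POS data: keep the
// first fullPOSData seen for a leftID, and log (rather than throw on)
// any later conflicting entry.
public class LeftIdPosData {
  private final Map<Integer, String> posByLeftId = new HashMap<>();

  /** Returns true if stored or identical; false on a logged conflict. */
  public boolean add(int leftId, String fullPOSData) {
    String existing = posByLeftId.putIfAbsent(leftId, fullPOSData);
    if (existing == null || existing.equals(fullPOSData)) {
      return true;
    }
    System.err.println("Multiple entries found for leftID=" + leftId
        + ": existing=" + existing + ", new fullPOSData=" + fullPOSData);
    return false; // first entry wins; the conflict is only reported
  }

  public String get(int leftId) {
    return posByLeftId.get(leftId);
  }
}
```

Whether first-wins is semantically right for UniDic's connection costs is exactly the open question for the subject-matter experts.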

Building a dictionary

Gradle command

My build command leverages the new Gradle setup and follows the DictionaryBuilder JavaDoc comment on how to build a dictionary:

I added to lucene/analysis/kuromoji/build.gradle a default run task from Gradle's application plugin, which allows building a dictionary as follows. The command should be executed from the root lucene directory, where the gradlew file is.

For example, the following is my command when building unidic-cwj-3.1.1-full dictionary without the NFKD normalization:

./gradlew -p lucene/analysis/kuromoji run --args='unidic "/Users/azagniotov/Downloads/unidic-cwj-3.1.1-full" "/Users/azagniotov/Downloads/unidic-cwj-3.1.1-full/build" "UTF-8" false'
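For reference, the five positional values in --args above line up as format, input directory, output directory, encoding, and the normalization flag. A small sketch of that mapping (the parsing class here is hypothetical — only the argument order mirrors the command above):

```java
// Hypothetical sketch: how the five --args values map positionally.
public class BuilderArgs {
  final String format;     // "unidic" or "ipadic"
  final String inputDir;   // directory containing the dictionary CSV files
  final String outputDir;  // where the built dictionary files are written
  final String encoding;   // e.g. "UTF-8" or "EUC-JP"
  final boolean normalize; // whether to normalize entries (NFKD)

  BuilderArgs(String[] args) {
    if (args.length != 5) {
      throw new IllegalArgumentException(
          "expected: format inputDir outputDir encoding normalize");
    }
    format = args[0];
    inputDir = args[1];
    outputDir = args[2];
    encoding = args[3];
    normalize = Boolean.parseBoolean(args[4]);
  }
}
```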

Unit testing

Unfortunately, the current PR does not include unit tests because the built dictionary files are very big: e.g., the built unidic-cwj-202302_full connection costs .DAT file is around 700 MB, while the unidic-cwj-3.1.1-full connection costs .DAT file is around 450 MB. Thus, the following unit tests were added and run locally on my machine to verify that the built dictionaries can be used at runtime.

I did see that there is a main/gradle/generation/kuromoji.gradle that downloads dictionaries and compiles them, but for now, I resorted to not adding changes there (bc9d2e3).

I also think a dictionary should be decoupled (this is related to the existing discussion in:

The built dictionaries were tested using the following Japanese strings (no particular reason, I just picked these four strings):

  • "にじさんじ"
  • "ちいかわ"
  • "桃太郎電鉄"
  • "聖川真斗"

The dictionary metadata were placed (dictionary after dictionary) under lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja/dict, and a few unit test cases were added:

Existing default dictionary already included in Lucene

assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"に", "じさ", "ん", "じ"}, new int[] {1, 1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "か", "わ"}, new int[] {1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃太郎", "電鉄"}, new int[] {1, 1});
assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖川", "真", "斗"}, new int[] {1, 1, 1});

Built unidic-cwj-202302_full

I needed to increase memory before running the tests, as the ConnectionCosts.dat is ~700MB: in main/gradle/testing/defaults-tests.gradle#L50

-           value: { -> propertyOrEnvOrDefault("tests.jvmargs", "TEST_JVM_ARGS", isCIBuild ? "" : "-XX:TieredStopAtLevel=1 -XX:+UseParallelGC -XX:ActiveProcessorCount=1") },
+           value: { -> propertyOrEnvOrDefault("tests.jvmargs", "TEST_JVM_ARGS", isCIBuild ? "" : "-XX:MaxDirectMemorySize=2g -XX:TieredStopAtLevel=1 -XX:+UseParallelGC -XX:ActiveProcessorCount=1") },
assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"にじ", "さん", "じ"}, new int[] {1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "かわ"}, new int[] {1, 1});
assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃", "桃太郎", "太郎", "電鉄"}, new int[] {1, 0, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖", "川", "真", "斗"}, new int[] {1, 1, 1, 1});

Built unidic-cwj-3.1.1-full

assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"にじ", "さ", "ん", "じ"}, new int[] {1, 1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "かわ"}, new int[] {1, 1});
assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃太郎", "電鉄"}, new int[] {1, 1});
assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖", "川", "真斗"}, new int[] {1, 1, 1});

Built unidic-mecab-2.1.2_src

assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"にじ", "さん", "じ"}, new int[] {1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "か", "わ"}, new int[] {1, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃", "桃太郎", "太郎", "電鉄"}, new int[] {1, 0, 1, 1});
assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖", "川", "真", "斗"}, new int[] {1, 1, 1, 1});

Built mecab-ipadic-2.7.0-20070801

 assertAnalyzesTo(analyzerNoPunct, "にじさんじ", new String[] {"に", "じさ", "ん", "じ"}, new int[] {1, 1, 1, 1});
 assertAnalyzesTo(analyzerNoPunct, "ちいかわ", new String[] {"ちい", "か", "わ"}, new int[] {1, 1, 1});
 assertAnalyzesTo(analyzerNoPunct, "桃太郎電鉄", new String[] {"桃太郎", "電鉄"}, new int[] {1, 1});
 assertAnalyzesTo(analyzerNoPunct, "聖川真斗", new String[] {"聖川", "真", "斗"}, new int[] {1, 1, 1});

@mikemccand
Member

I am far from knowing much about Kuromoji and its dictionaries, but this sounds like a great change (being able to load a new (UniDic) dictionary format), and the PR still cleanly applies. Are there any concerns with this? @rmuir had mentioned some difficult assert on the Jira issue long ago -- is that resolved / worked around?

@uschindler
Contributor

Hi,
Please give me some time to review.

Please don't add the application plugin. Instead just add a plain java runner task. The result of the project is a library jar, so please don't change this as it could have effects on the resulting maven pom.

@azagniotov
Author

@uschindler thank you for also taking a look, in addition to others. Understood. Please allow me a few days to export a commit that would add a plain java runner task instead of the application plugin. Thank you


github-actions bot commented Jan 8, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jan 8, 2024
@azagniotov
Author

azagniotov commented Jan 13, 2024

Please don't add the application plugin. Instead just add a plain java runner task. The result of the project is a library jar, so please don't change this as it could have effects on the resulting maven pom.

Hi @uschindler, I ended up simply reverting the 8d52f66 commit. The current gradle/generation/kuromoji.gradle already contains def recompileDictionary(...), which is used by all the compileXXXX(...) tasks and can be used as an example. Thus, the task and the application plugin that I added were not necessary after all.

@github-actions github-actions bot removed the Stale label Jan 14, 2024
@github-actions github-actions bot added the Stale label Jan 29, 2024
@azagniotov
Author

Hi @uschindler , I wanted to ping you and see if you have any thoughts on this

@github-actions github-actions bot removed the Stale label Feb 4, 2024
@github-actions github-actions bot added the Stale label Feb 18, 2024
@uschindler
Contributor

I can't tell you anything about the internals touched here, so I can't review it from the language specific point of view.

The Gradle changes look fine.

@uschindler
Contributor

One thing! Have you checked that a simple recompile of the original mecab (gradlew regenerate) produces the same files (working copy clean afterwards)?

@github-actions github-actions bot removed the Stale label Feb 19, 2024
Leveraging Gradle to execute the builder makes the command configuration much easier, as the engineer no longer needs to set the Lucene classpath manually before running the command.
UnknownDictionaryBuilder uses the dictionary format to decide how the CSV line being read should be parsed. A UniDic line has one more column than an IPAdic line.
I do not know if this was the right choice, but this is what prompted me to remove the check that was throwing in the two following cases:

- When building any UNIDIC 3 with normalizeEntries == true
- When building unidic-cwj-3.1.1-full (even with normalizeEntries == false)

This allows downloading and building a UniDic dictionary, which can then be assembled into the Lucene Kuromoji JAR by running ./gradlew compileUnidic followed by ./gradlew assemble.
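The UniDic-vs-IPAdic column difference mentioned in the commit notes above could be handled along these lines (a sketch only — the class is hypothetical, and it assumes for illustration that the extra column is trailing; the real parser may differ):

```java
import java.util.Arrays;

// Sketch: format-aware handling of an unknown-dictionary CSV line.
// A UniDic line has one more column than an IPAdic line; here we assume
// (for illustration only) the extra column is trailing and drop it so
// downstream handling can stay shared between formats.
public class UnknownEntryColumns {
  enum Format { IPADIC, UNIDIC }

  static String[] normalizeColumns(String line, Format format) {
    String[] cols = line.split(",");
    if (format == Format.UNIDIC) {
      return Arrays.copyOf(cols, cols.length - 1); // drop the extra column
    }
    return cols;
  }
}
```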
@azagniotov
Author

@uschindler yes: I have rebased from the latest main branch and ran the ./gradlew clean regenerate. The following is the (partial) output:

...
...
> Task :lucene:analysis:nori:compileMecabKo
Download https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.1.1-20180720.tar.gz

> Task :lucene:analysis:kuromoji:compileMecab
Download https://jaist.dl.sourceforge.net/project/mecab/mecab-ipadic/2.7.0-20070801/mecab-ipadic-2.7.0-20070801.tar.gz
Automaton regenerated from dictionary: mecab-ipadic-2.7.0-20070801

> Task :lucene:analysis:nori:compileMecabKo
Automaton regenerated from dictionary: mecab-ko-dic-2.1.1-20180720

BUILD SUCCESSFUL in 39s
127 actionable tasks: 121 executed, 6 up-to-date

I also confirmed that the dictionary files *.dat were regenerated under the lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja based on the modified date of the files.

@mocobeta What do you think with regards to next steps with this PR? Is there anything I can add?

@azagniotov azagniotov force-pushed the lucene_kuromoji_unidic_3_compatibility branch from d24e7bb to 15e2ccc Compare February 22, 2024 18:22
@mocobeta
Contributor

@azagniotov Sorry, I've not been available for a while. Let me take a look; I will try to find time next week...

@azagniotov
Author

Thank you, @mocobeta 🙇🏼


github-actions bot commented Mar 8, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Mar 8, 2024
@mocobeta
Contributor

Hi, sorry for my late reply.
I quickly checked the built dictionary size. The latest Unidic is fairly (to me, insanely) large - its total size is 1.6G.
https://clrd.ninjal.ac.jp/unidic/back_number.html#unidic_cwj

The built kuromoji jar with unidic-cwj-3.1.1-full eventually becomes 442M. Besides the size, I think we should consider performance. I'm worried that there can be a significant impact on analysis/indexing speed. Do you have any benchmark result on that?

@github-actions github-actions bot removed the Stale label Mar 25, 2024
@mikemccand
Member

The built kuromoji jar with unidic-cwj-3.1.1-full eventually becomes 442M. Besides the size, I think we should consider performance. I'm worried that there can be a significant impact on analysis/indexing speed. Do you have any benchmark result on that?

luceneutil has a runAnalyzerPerf.py benchmark, and the nightly script runs it and posts the results here, including Kuromoji. We could maybe test tokens/sec throughput using that benchy for this PR running on UniDic? I'm not sure how we would conclude whether it's fast enough since current Kuromoji (main) can't build UniDic. Maybe we could defer performance testing / optimizing on UniDic to a later issue...

@azagniotov
Author

@mocobeta thank you. I have not done any benchmarks, thus I cannot comment on potential performance implications. One thing that is probably certain is that a larger dictionary will require more allocated memory. Btw, have you had a chance to evaluate the correctness of the tokenization?

@mikemccand this sounds interesting. It seems like this is a well-trodden path. How easy/hard would it be to point runAnalyzerPerf.py at the current PR branch?

@github-actions github-actions bot added the Stale label Apr 12, 2024