LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary #12517
I am far from knowing much about Kuromoji and its dictionaries, but this sounds like a great change (being able to load a new (UniDic) dictionary format), and the PR still applies cleanly. Are there any concerns with this? @rmuir had mentioned some difficult |
Hi, please don't add the application plugin. Instead, just add a plain Java runner task. The result of the project is a library JAR, so please don't change this, as it could affect the resulting Maven POM. |
@uschindler thank you for also taking a look, in addition to others. Understood. Please allow me a few days to export a commit that would add a plain java runner task instead of the application plugin. Thank you |
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution! |
Hi @uschindler, I ended up simply reverting the 8d52f66 commit. The current |
Hi @uschindler, I wanted to ping you and see if you have any thoughts on this |
I can't tell you anything about the internals touched here, so I can't review it from the language specific point of view. The Gradle changes look fine. |
One thing! Have you checked that a simple recompile of the original mecab (gradlew regenerate) produces the same files (working copy clean afterwards)? |
Leveraging Gradle to execute the builder makes the command configuration much easier, as the engineer no longer needs to set the Lucene classpath manually before running the command.
UnknownDictionaryBuilder uses the dictionary format to decide how each CSV line it reads should be parsed. A UniDic line has one more column than an IpaDic line.
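A minimal sketch of the format-dependent parsing described above. This is not the code from the PR: the class name UnknownLineSketch, the Format enum, and the base column count of 11 are assumptions made for illustration only. The one fact taken from the commit message is that a UniDic line has one more column than an IpaDic line.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; not the actual UnknownDictionaryBuilder.
final class UnknownLineSketch {

  // Hypothetical stand-in for DictionaryBuilder.DictionaryFormat.
  enum Format { IPADIC, UNIDIC }

  // Assumed base column count, used here as a placeholder value.
  static final int IPADIC_COLUMNS = 11;

  static int expectedColumns(Format format) {
    // Per the commit message: a UniDic line has one more column than an IpaDic line.
    return format == Format.UNIDIC ? IPADIC_COLUMNS + 1 : IPADIC_COLUMNS;
  }

  static List<String> parse(String line, Format format) {
    // Naive split without quoted-field handling; limit -1 keeps trailing empty fields.
    String[] columns = line.split(",", -1);
    int expected = expectedColumns(format);
    if (columns.length != expected) {
      throw new IllegalArgumentException(
          "Expected " + expected + " columns for " + format + " but got " + columns.length);
    }
    return Arrays.asList(columns);
  }
}
```

With this shape, a line of the wrong width for the selected format fails fast instead of being silently misread.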
I do not know if this was the right choice, but I removed the check because it was throwing in the following two cases: - When building any UNIDIC 3 with normalizeEntries == true - When building a unidic-cwj-3.1.1-full (even with normalizeEntries == false)
This allows downloading and building a UniDic dictionary, which can then be assembled into the Lucene Kuromoji JAR by running ./gradlew compileUnidic followed by ./gradlew assemble
This reverts commit 8d52f66.
@uschindler yes: I have rebased from the latest
I also confirmed that the dictionary files @mocobeta What do you think with regards to next steps with this PR? Is there anything I can add? |
@azagniotov Sorry, I've not been available for a while. Let me take a look; I will try to find time next week... |
Thank you, @mocobeta 🙇🏼 |
Hi, sorry for my late reply. The built kuromoji jar with unidic-cwj-3.1.1-full eventually becomes 442M. Besides the size, I think we should consider performance. I'm worried that there could be a significant impact on analysis/indexing speed. Do you have any benchmark results on that? |
luceneutil has a |
@mocobeta thank you. I have not done any benchmarks, so I cannot comment on potential performance implications. One thing that is probably certain is that a larger dictionary will require more memory. Btw, have you had a chance to evaluate the correctness of tokenization? @mikemccand this sounds interesting. It seems like this is a well-trodden path. How easy/hard will it be to point |
TL;DR
The current PR attempts to remediate https://issues.apache.org/jira/browse/LUCENE-4056 (Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary).
Description
The current PR builds upon @johtani's past PR: apache/lucene-solr#935 and attempts to bring the code into a mergeable state, or at least to re-ignite the conversation about building a UniDic dictionary.
Before exporting the current PR, I verified the behavior @johtani added in the aforementioned PR (plus some changes of my own) by successfully building a number of dictionaries outlined below, and posted my findings in the comment: apache/lucene-solr#935 (comment)
Philosophy of changes
I tried to pick up @johtani's added behavior as-is, while minimizing the number of additional changes of my own.
To be honest, I do not really like how DictionaryBuilder.DictionaryFormat format is being passed everywhere, as it creates these small decision trees: if IPADIC do this else do that. If this PR gets merged, I would be happy to refactor in order to make the code more modular (where possible) and perhaps with a better separation of concerns. But, I defer to the maintainers of the Lucene library on this one. 🙇🏼‍♀️
The commits are fairly atomic and I hope this makes the reviewing experience easier.
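As one possible direction for the refactoring mentioned above, the format could carry its own layout so that callers no longer branch on it. The sketch below is hypothetical and is not code from this PR or from Lucene: the class FormatDispatchSketch, the extraColumns field, and the baseColumns parameter are placeholders chosen for illustration.

```java
// Hypothetical refactoring sketch; names and values are placeholders, not Lucene code.
final class FormatDispatchSketch {

  enum DictionaryFormat {
    IPADIC(0),
    UNIDIC(1);

    // Extra columns relative to the IPADIC layout (placeholder values).
    private final int extraColumns;

    DictionaryFormat(int extraColumns) {
      this.extraColumns = extraColumns;
    }

    // Callers ask the format for its layout instead of branching on it.
    int columnCount(int baseColumns) {
      return baseColumns + extraColumns;
    }
  }

  // Example caller: no "if IPADIC ... else ..." decision tree is needed here.
  static boolean hasExpectedWidth(String[] columns, int baseColumns, DictionaryFormat format) {
    return columns.length == format.columnCount(baseColumns);
  }
}
```

The trade-off is that format-specific knowledge moves into the enum, which keeps the builders simpler but makes the enum the single place that has to change when a new dictionary format is added.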
Built Dictionaries
Caveat
RE: Building the unidic-cwj-3.1.1-full:
16 to 35
Some Inflection form values are different for the same leftID. Also, why do we have duplicated IDs? Thus, I would value the input of the subject matter experts here 🙇🏼‍♀️
Building a dictionary
Gradle command
My build command leverages the new Gradle setup and the DictionaryBuilder JavaDoc comment about how to do it:
In lucene/analysis/kuromoji/build.gradle, I added a default run task from Gradle's application plugin, which allows building a dictionary as follows. The command should be executed from the root directory lucene, where the gradlew file is.
For example, the following is my command when building the unidic-cwj-3.1.1-full dictionary without NFKD normalization:
Unit testing
Unfortunately, the current PR does not include unit tests because the built dictionary files are very big, e.g. the built unidic-cwj-202302_full connection costs .DAT file is around 700 MB, while the unidic-cwj-3.1.1-full connection costs .DAT file is around 450 MB. Thus, the following unit tests were added and run locally on my machine to verify that the built dictionaries can be used at runtime.
I did see that there is a main/gradle/generation/kuromoji.gradle that downloads dictionaries and compiles them, but for now, I chose not to add changes there (bc9d2e3).
I also think a dictionary should be decoupled (this is related to the existing discussion in:
The built dictionaries were tested using the following Japanese strings (no particular reason, I just picked these four strings):
"にじさんじ"
"ちいかわ"
"桃太郎電鉄"
"聖川真斗"
The dictionary metadata files were placed (one dictionary at a time) under lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja/dict, and a few unit test cases were added:
Existing default dictionary already included in Lucene
Built unidic-cwj-202302_full
I needed to increase memory before running the tests, as the ConnectionCosts.dat is ~700MB: in main/gradle/testing/defaults-tests.gradle#L50
Built unidic-cwj-3.1.1-full
Built unidic-mecab-2.1.2_src
Built mecab-ipadic-2.7.0-20070801