-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary #935
base: master
Are you sure you want to change the base?
Conversation
<artifact name="ipadic" type=".tar.gz" url="https://jaist.dl.sourceforge.net/project/mecab/mecab-ipadic/2.7.0-20070801/mecab-ipadic-2.7.0-20070801.tar.gz"/> | ||
</dependency> | ||
<dependency org="mecab" name="mecab-naist-jdic" rev="${/mecab/mecab-naist-jdic}" conf="naist"> | ||
<artifact name="mecab-naist-jdic" type=".tar.gz" url=" https://rwthaachen.dl.osdn.jp/naist-jdic/53500/mecab-naist-jdic-0.6.3b-20111013.tar.gz"/> | ||
</dependency> | ||
<dependency org="mecab" name="mecab-unidic" rev="${/mecab/mecab-unidic}" conf="unidic"> | ||
<artifact name="unidic" type=".zip" url=" http://ja.osdn.net/frs/redir.php?m=iij&f=unidic%2F58338%2Funidic-mecab-2.1.2_src.zip"/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we download external files we must somewhere include a license demonstrating that those files are positively licensed for re-use, ideally with some recognized open source license. Do you know what the status is of this source for unidic? What is osdn.net?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About licensing, UniDic is distributed under GPL/LGPL/BSD(3-clause) triple licenses: https://unidic.ninjal.ac.jp/back_number#unidic_bccwj (Note: in Japanese)
According to my knowledge, the last one is compatible with ASF's policy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, great. We should add a reference to the unidic license to lucene/NOTICE.txt then, where we have some statement about ipadic already, and possibly we should copy the license into lucene/licenses/ - I'm not sure if that's needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand there are no tests. Could you at least state for the record that you were able to build both UNIDIC and IPADIC-based dictionaries after your change using |
Yes it's a problem for future maintenance, I think we may need some kind of validator for binary encoded dictionaries rather than directly writing unit tests for the builder. |
At least, I should paste the messages build successful for Unidic and Ipadic. |
Here is the message for And also the message with unidic and |
…nary Fix some errors during building UniDic 2.1.2 dictionary
…nary Fix precommit error
…nary Add unidic license to NOTICE.txt
Hello Team, May I inquire where are we on this? TL;DRIn the meanwhile, I attempted and succeeded to build the following dictionaries. I am using the tweaks that @johtani added in his PR three years ago, plus a few minor tweaks of my own:
See the attached screenshots displaying the generated files👇🏼 I have also tested the built dictionaries, see Testing section below 👇🏼 Shall I try make a new PR under https://github.com/apache/lucene in order to get a conversation re-started on this? cc: @mocobeta 🙇🏼♀️ BuildingCaveatRE: Building the unidic-cwj-3.1.1-full:
Gradle commandsThe following has been performed on the fresh clone of https://github.com/apache/lucene: My build command leveraged the new Gradle setup and the DictionaryBuilder JavaDoc comment about how to do it. I added in
My shell Gradle commands to build the dictionaries are as follows. The commands were executed under the root directory Building unidic-cwj-202302_full command
Building unidic-cwj-3.1.1-full command
Building unidic-mecab-2.1.2_src command
Building mecab-ipadic-2.7.0-20070801 command
TestingThe built dictionaries were tested using the following Japanese strings:
The dictionaries metadata (dictionary by dictionary) were places under the lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja/dict and a few unit test cases were added: Existing default dictionary
Built unidic-cwj-202302_fullI need to increase memory before running the tests, as
Built unidic-cwj-3.1.1-full
Built unidic-mecab-2.1.2_src
Built mecab-ipadic-2.7.0-20070801
ScreenshotsBuilt unidic-cwj-202302_fullBuilt unidic-mecab-2.1.2_srcBuilt mecab-ipadic-2.7.0-20070801 |
Yes if you want to revive the discussion, please move the PR over to the lucene git repo. I'm a little unclear on the future of this though. So far it's a pretty expert feature. To use it you have to edit the gradle build script and rebuild Lucene, right? |
@msokolov hello! Thank you. Yes, I figured that PR will need to be exported eventually under the right repo, thus, I went ahead and did it: To use this feature, yes, Lucene analysis Kuromoji JAR will have to be rebuilt after rebuilding a new dictionary. For example, while under the root lucene/ directory where the
To extend this example a little further, more specifically - using this together with SolrMy aforementioned PR under Lucene adds the new changes using the latest state of code from the If you are running Solr in Docker, then the PR changes must be adapted to the Lucene v9.7.0 branch, which is what the latest Solr is running on. This is something that I have done when I was rebuilding and testing the Lucene library. I have a patch lucene_9.7.0_kuromoji_unidic_3_compatibility that adapts the PR changes to the v9.7.0 branch. Thus, doing the above steps inside the Dockerfile, you could then replace the default Solr's
|
Description
UniDic has a bit different column and unk.def from ipadic.
Currently if we use UniDic, then we get some error for building dictionary.
We can build UniDic after changing build.xml. I left some comments in build.xml.
If we use UniDic, we should change stoptags.txt and also add an example for part of speech filter.
UniDic has different part of speech tags from ipadic.
Solution
Added some logic for adjusting UniDic csv and unk.def.
And also changing build.xml for downloading and building UniDic.
Tests
Unfortunately there is no tests for DictionaryBuilder.
Checklist
Please review the following and check all that apply:
master
branch.ant precommit
and the appropriate test suite.