-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LUCENE-9253: Support custom dictionaries in KoreanTokenizer #1296
Conversation
…nizer Signed-off-by: Namgyu Kim <namgyu@apache.org>
Signed-off-by: Namgyu Kim <kng0828@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
look's fine to me - just one javadoc nit
* @param connectionCosts custom token transition costs | ||
* @param userDictionary Optional: if non-null, user dictionary. | ||
* @param mode Decompound mode. | ||
* @param outputUnknownUnigrams If true outputs unigrams for unknown words. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't capitalize "If"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I did that because it was capitalized before.
I'll change the other constructors as well.
@@ -150,7 +150,18 @@ protected final InputStream getResource(String suffix) throws IOException { | |||
throw new IllegalStateException("unknown resource scheme " + resourceScheme); | |||
} | |||
} | |||
|
|||
|
|||
public static InputStream getResource(ResourceScheme scheme, String path) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so .. this basically follows the pattern from JapaneseTokenizer, I think. .. but somehow I don't see where we defined ResourceScheme? We're not referencing the one in kuromoji, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the same as Kuromoji's pattern.
But there is no relation with Kuromoji.
It's already written as an enum in this class.
lucene-solr/lucene/analysis/nori/src/java/org/apache/lucene/analysis/ko/dict/BinaryDictionary.java
Lines 45 to 47 in d5e44e9
public enum ResourceScheme { | |
CLASSPATH, FILE | |
} |
Signed-off-by: Namgyu Kim <namgyu@apache.org>
Signed-off-by: Namgyu Kim <namgyu@apache.org>
Thanks for your review! @msokolov I changed some upper cases in Javadoc. |
@jimczi |
KoreanTokenizer does not support custom dictionaries(system, unknown) now, even though Nori provides DictionaryBuilder that creates custom dictionary.
In the current state, it is very difficult for Nori users to use a custom dictionary.
Therefore, we need to open a new constructor that uses it.
JIRA : https://issues.apache.org/jira/browse/LUCENE-9253