Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-9253: Support custom dictionaries in KoreanTokenizer #1296

Merged
merged 5 commits into from
Mar 2, 2020

Conversation

danmuzi
Copy link
Member

@danmuzi danmuzi commented Feb 27, 2020

KoreanTokenizer does not support custom dictionaries(system, unknown) now, even though Nori provides DictionaryBuilder that creates custom dictionary.
In the current state, it is very difficult for Nori users to use a custom dictionary.
Therefore, we need to open a new constructor that uses it.

JIRA : https://issues.apache.org/jira/browse/LUCENE-9253

…nizer

Signed-off-by: Namgyu Kim <namgyu@apache.org>
Signed-off-by: Namgyu Kim <kng0828@gmail.com>
Copy link
Contributor

@msokolov msokolov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

look's fine to me - just one javadoc nit

* @param connectionCosts custom token transition costs
* @param userDictionary Optional: if non-null, user dictionary.
* @param mode Decompound mode.
* @param outputUnknownUnigrams If true outputs unigrams for unknown words.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't capitalize "If"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I did that because it was capitalized before.
I'll change the other constructors as well.

@@ -150,7 +150,18 @@ protected final InputStream getResource(String suffix) throws IOException {
throw new IllegalStateException("unknown resource scheme " + resourceScheme);
}
}


public static InputStream getResource(ResourceScheme scheme, String path) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so .. this basically follows the pattern from JapaneseTokenizer, I think. .. but somehow I don't see where we defined ResourceScheme? We're not referencing the one in kuromoji, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the same as Kuromoji's pattern.

But there is no relation with Kuromoji.
It's already written as an enum in this class.

Signed-off-by: Namgyu Kim <namgyu@apache.org>
Signed-off-by: Namgyu Kim <namgyu@apache.org>
@danmuzi
Copy link
Member Author

danmuzi commented Mar 1, 2020

Thanks for your review! @msokolov

I changed some upper cases in Javadoc.
And I added a change log for this patch.

@danmuzi
Copy link
Member Author

danmuzi commented Mar 2, 2020

@jimczi
I'll push it to master and 8.x branch because the branch 8.5 cut date is almost up.
Please let me know if there is a problem :)

@danmuzi danmuzi merged commit b2dbd18 into apache:master Mar 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants