Add support for inlined user dictionary in Nori #36123

Merged
merged 12 commits into elastic:master from nori_inlined_user_dict on Dec 7, 2018

Conversation

jimczi
Contributor

@jimczi jimczi commented Nov 30, 2018

This change adds a new option called user_dictionary_rules to
Nori's tokenizer. It can be used to set additional tokenization rules
to the Korean tokenizer directly in the settings (instead of using a file).
We should do the same for the kuromoji tokenizer in a follow up.

Closes #35842

This change adds a new option called `user_dictionary_rules` to the
Nori tokenizer. It can be used to set additional tokenization rules
to the Korean tokenizer directly in the settings (instead of using a file).

Closes elastic#35842
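
As a rough illustration (the tokenizer name, the extra `decompound_mode` setting, and the sample rules below are made up for this sketch, not taken from this PR), the inline rules can be passed as a list when building the tokenizer settings programmatically:

import org.elasticsearch.common.settings.Settings;

// Sketch only: index/tokenizer names and rules are illustrative.
Settings tokenizerSettings = Settings.builder()
    .put("index.analysis.tokenizer.nori_user_dict.type", "nori_tokenizer")
    .put("index.analysis.tokenizer.nori_user_dict.decompound_mode", "mixed")
    .putList("index.analysis.tokenizer.nori_user_dict.user_dictionary_rules",
        "c++", "C샤프", "세종", "세종시 세종 시")
    .build();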
@jimczi jimczi added the >feature, :Search/Analysis (How text is split into tokens), v7.0.0, and v6.6.0 labels on Nov 30, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-search

if (settings.get(USER_DICT_PATH_OPTION) != null && settings.get(USER_DICT_RULES_OPTION) != null) {
throw new ElasticsearchException("It is not allowed to use [" + USER_DICT_PATH_OPTION + "] in conjunction" +
" with [" + USER_DICT_RULES_OPTION + "]");

Member

@cbuescher cbuescher Dec 2, 2018

nit: remove empty line. But I don't think it's worth changing this if there are no other changes and CI is green; only if anything else needs changing anyway.

Member

@cbuescher cbuescher left a comment

The change itself looks good to me, but I couldn't get the additional part of the docs to display locally. Maybe you can recheck how this is supposed to work. Left another nit which probably only makes sense to address if the PR needs additions anyway.

@@ -154,6 +154,40 @@ The above `analyze` request returns the following:

<1> This is a compound token that spans two positions (`mixed` mode).

`user_dictionary_rules`::
Member

For some reason this whole section doesn't render when I build the docs locally. I played around with it a bit but couldn't get it to work, but it's probably worth taking another look.

Contributor Author

Good catch, thanks. I forgot to add the end-of-section marker (`--`), so the whole section was not displayed. I pushed adcee29 to fix this.

Member

@cbuescher cbuescher left a comment

LGTM

Member

@cbuescher cbuescher left a comment

Hi @jimczi, I took a look at the recent changes and left a few questions for you; would you mind taking another look?

@@ -48,32 +50,24 @@ public NoriTokenizerFactory(IndexSettings indexSettings, Environment env, String
}

public static UserDictionary getUserDictionary(Environment env, Settings settings) {
if (settings.get(USER_DICT_PATH_OPTION) != null && settings.get(USER_DICT_RULES_OPTION) != null) {
throw new ElasticsearchException("It is not allowed to use [" + USER_DICT_PATH_OPTION + "] in conjunction" +
" with [" + USER_DICT_RULES_OPTION + "]");
Member

What is the reason you decided not to use this check anymore? I cannot find it in the refactored method.

Contributor Author

Good catch, thanks. I restored this check in 5fcfad4.

return UserDictionary.open(rulesReader);
} catch (IOException e) {
throw new ElasticsearchException("failed to load nori user dictionary", e);
// check for duplicate terms
Member

Where does the requirement for checking for duplicates come from? Would it make sense to simply ignore them if they happen, or to only log them instead of throwing an error? I can imagine people compiling larger dictionaries where duplicates might sneak in over time; maybe this shouldn't block loading the whole analyzer?

Contributor Author

I wanted to align Kuromoji and Nori regarding duplicates in the user dictionary. The Japanese tokenizer doesn't accept duplicates (#36100) while Nori just ignores them. However, I agree that this is outside the scope of this PR, so I removed the duplicate detection and will open a new PR as a follow-up.

for (String line : ruleList) {
String[] split = line.split("\\s+");
if (terms.add(split[0]) == false) {
throw new IllegalArgumentException("Found duplicate term: [" + split[0] + "] in user dictionary. ");
Member

If this stays an error or maybe gets converted to a log message, the line number would be helpful debugging information for the user.

Contributor Author

Agreed, I'll make sure we report the line number in the follow-up PR.
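
For illustration only, a minimal sketch of what a duplicate check that reports the offending line number could look like (this is an assumption about the follow-up, not its actual code; the helper name is made up):

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch only: fails on the first duplicated surface form and includes its line number.
static void checkDuplicateTerms(List<String> ruleList) {
    Set<String> terms = new HashSet<>();
    int lineNum = 0;
    for (String line : ruleList) {
        lineNum++;
        String[] split = line.split("\\s+");
        if (terms.add(split[0]) == false) {
            throw new IllegalArgumentException("Found duplicate term [" + split[0] +
                "] in user dictionary at line [" + lineNum + "]");
        }
    }
}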

Contributor Author

@jimczi jimczi left a comment

Thanks for the reviews @cbuescher. I pushed a commit to address your latest comments.

Member

@cbuescher cbuescher left a comment

Thanks, LGTM

@jimczi jimczi merged commit a53e865 into elastic:master Dec 7, 2018
@jimczi jimczi deleted the nori_inlined_user_dict branch December 7, 2018 14:26
jimczi added a commit that referenced this pull request Dec 7, 2018
Add support for inlined user dictionary in Nori

This change adds a new option called `user_dictionary_rules` to the
Nori tokenizer. It can be used to set additional tokenization rules
to the Korean tokenizer directly in the settings (instead of using a file).

Closes #35842
@kimjmin

kimjmin commented Jan 7, 2019

I can't thank @jimczi enough for this work. This will solve many problems for our customers, especially with Elastic Cloud and ECE. 👍
