Add support for inlined user dictionary in Nori #36123

Merged
merged 12 commits into elastic:master from nori_inlined_user_dict on Dec 7, 2018

Conversation

jimczi
Contributor

@jimczi jimczi commented Nov 30, 2018

This change adds a new option called user_dictionary_rules to
Nori's tokenizer. It can be used to set additional tokenization rules
to the Korean tokenizer directly in the settings (instead of using a file).
We should do the same for the kuromoji tokenizer in a follow up.

Closes #35842

This change adds a new option called `user_dictionary_rules` to the
Nori tokenizer. It can be used to set additional tokenization rules
to the Korean tokenizer directly in the settings (instead of using a file).

Closes elastic#35842
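
As a rough illustration (the tokenizer name, the extra `decompound_mode` setting, and the sample rules below are made up for this sketch, not taken from this PR), the inline rules can be passed as a list when building the tokenizer settings programmatically:

import org.elasticsearch.common.settings.Settings;

// Sketch only: index/tokenizer names and rules are illustrative.
Settings tokenizerSettings = Settings.builder()
    .put("index.analysis.tokenizer.nori_user_dict.type", "nori_tokenizer")
    .put("index.analysis.tokenizer.nori_user_dict.decompound_mode", "mixed")
    .putList("index.analysis.tokenizer.nori_user_dict.user_dictionary_rules",
        "c++", "C샤프", "세종", "세종시 세종 시")
    .build();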
@jimczi jimczi added the >feature, :Search/Analysis (How text is split into tokens), v7.0.0, and v6.6.0 labels on Nov 30, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-search

if (settings.get(USER_DICT_PATH_OPTION) != null && settings.get(USER_DICT_RULES_OPTION) != null) {
throw new ElasticsearchException("It is not allowed to use [" + USER_DICT_PATH_OPTION + "] in conjunction" +
" with [" + USER_DICT_RULES_OPTION + "]");

Member

@cbuescher cbuescher Dec 2, 2018

nit: remove empty line. But I don't think it's worth changing this if there are no other changes and CI is green; only if anything else needs changing anyway.

Member

@cbuescher cbuescher left a comment

The change itself looks good to me, but I couldn't get the additional part of the docs to display locally. Maybe you can recheck how this is supposed to work. Left another nit which probably only makes sense to address if the PR needs additions anyway.

@@ -154,6 +154,40 @@ The above `analyze` request returns the following:

<1> This is a compound token that spans two positions (`mixed` mode).

`user_dictionary_rules`::
Member

For some reason this whole section doesn't render when I build the docs locally. I played around with it a bit but couldn't get it to work, but it's probably worth taking another look.

Contributor Author

Good catch, thanks. I forgot to add the end-of-section marker (`--`), so the whole section was not displayed. I pushed adcee29 to fix this.

Member

@cbuescher cbuescher left a comment

LGTM

Member

@cbuescher cbuescher left a comment

Hi @jimczi, I took a look at the recent changes and left a few questions for you; would you mind taking another look?

@@ -48,32 +50,24 @@ public NoriTokenizerFactory(IndexSettings indexSettings, Environment env, String
}

public static UserDictionary getUserDictionary(Environment env, Settings settings) {
if (settings.get(USER_DICT_PATH_OPTION) != null && settings.get(USER_DICT_RULES_OPTION) != null) {
throw new ElasticsearchException("It is not allowed to use [" + USER_DICT_PATH_OPTION + "] in conjunction" +
" with [" + USER_DICT_RULES_OPTION + "]");
Member

What is the reason you decided not to use this check anymore? I cannot find it in the refactored method.

Contributor Author

Good catch, thanks. I restored this check in 5fcfad4.

return UserDictionary.open(rulesReader);
} catch (IOException e) {
throw new ElasticsearchException("failed to load nori user dictionary", e);
// check for duplicate terms
Member

Where does the requirement for checking for duplicates come from? Would it make sense to simply ignore them if they happen, or to only log them instead of throwing an error? I can imagine people compiling larger dictionaries where duplicates might sneak in over time; maybe this shouldn't block loading the whole analyzer?

Contributor Author

I wanted to align Kuromoji and Nori regarding duplicates in the user dictionary. The Japanese tokenizer doesn't accept duplicates (#36100) while Nori just ignores them. However, I agree that this is outside the scope of this PR, so I removed the duplicate detection and will open a new PR as a follow-up.

for (String line : ruleList) {
String[] split = line.split("\\s+");
if (terms.add(split[0]) == false) {
throw new IllegalArgumentException("Found duplicate term: [" + split[0] + "] in user dictionary. ");
Member

If this stays an error or maybe gets converted to a log message, the line number would be helpful debugging information for the user.

Contributor Author

Agreed, I'll make sure we report the line number in the follow-up PR.
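
For illustration only, a minimal sketch of what a duplicate check that reports the offending line number could look like (this is an assumption about the follow-up, not its actual code; the helper name is made up):

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch only: fails on the first duplicated surface form and includes its line number.
static void checkDuplicateTerms(List<String> ruleList) {
    Set<String> terms = new HashSet<>();
    int lineNum = 0;
    for (String line : ruleList) {
        lineNum++;
        String[] split = line.split("\\s+");
        if (terms.add(split[0]) == false) {
            throw new IllegalArgumentException("Found duplicate term [" + split[0] +
                "] in user dictionary at line [" + lineNum + "]");
        }
    }
}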

Contributor Author

@jimczi jimczi left a comment

Thanks for the reviews @cbuescher. I pushed a commit to address your latest comments.

Member

@cbuescher cbuescher left a comment

Thanks, LGTM

@jimczi jimczi merged commit a53e865 into elastic:master Dec 7, 2018
@jimczi jimczi deleted the nori_inlined_user_dict branch December 7, 2018 14:26
jimczi added a commit that referenced this pull request Dec 7, 2018
Add support for inlined user dictionary in Nori

This change adds a new option called `user_dictionary_rules` to the
Nori tokenizer. It can be used to set additional tokenization rules
to the Korean tokenizer directly in the settings (instead of using a file).

Closes #35842
@kimjmin

kimjmin commented Jan 7, 2019

I can't thank @jimczi enough for this work. This will solve many problems for our customers, especially with Elastic Cloud and ECE. 👍
