
Allow custom characters in token_chars of ngram tokenizers #49250

Merged
2 commits merged into elastic:master on Nov 20, 2019

Conversation

cbuescher
Member

Currently the token_chars setting in both edgeNGram and ngram tokenizers
only allows a list of predefined character classes, which might not fit
every use case. For example, including the underscore "_" in a token currently
requires the punctuation class, which comes with a lot of other characters.
This change adds a "custom" option to the token_chars setting, which requires
an additional custom_token_chars setting to be present and which is
interpreted as the set of characters to include in a token.

Closes #25894
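
As a rough usage sketch (the Settings-builder framing and the gram sizes are placeholder assumptions; only the token_chars "custom" option and the custom_token_chars setting come from this change), an ngram tokenizer that also keeps underscore and hyphen inside tokens might be configured like this:

```java
import org.elasticsearch.common.settings.Settings;

public class CustomTokenCharsExample {
    public static void main(String[] args) {
        // Sketch only: tokenizer settings that keep letters, digits, and the
        // explicitly listed characters "_" and "-" inside ngram tokens.
        Settings tokenizerSettings = Settings.builder()
            .put("type", "ngram")
            .put("min_gram", 3)
            .put("max_gram", 3)
            .putList("token_chars", "letter", "digit", "custom")
            .put("custom_token_chars", "_-")
            .build();
        System.out.println(tokenizerSettings.get("custom_token_chars")); // prints "_-"
    }
}
```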

@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Analysis)

@cbuescher
Member Author

@elasticmachine run elasticsearch-ci/packaging-sample-matrix

@romseygeek (Contributor) left a comment

LGTM, one nit about the exception message but no need for another go-round.

```diff
@@ -76,7 +79,22 @@ static CharMatcher parseTokenChars(List<String> characterClasses) {
             characterClass = characterClass.toLowerCase(Locale.ROOT).trim();
             CharMatcher matcher = MATCHERS.get(characterClass);
             if (matcher == null) {
-                throw new IllegalArgumentException("Unknown token type: '" + characterClass + "', must be one of " + MATCHERS.keySet());
+                if (characterClass.equals("custom") == false) {
+                    throw new IllegalArgumentException("Unknown token type: '" + characterClass + "', must be one of " + MATCHERS.keySet());
```
romseygeek
Contributor

I think we need to include custom in the list here as well?

cbuescher
Member Author

Sure, although the message here is already a bit unwieldy as it is. It includes a ton of Unicode categories from the java.lang.Character class via the MATCHERS, e.g. currently:

Unknown token type: 'letters', must be one of [symbol, private_use, paragraph_separator, start_punctuation, unassigned, enclosing_mark, connector_punctuation, letter_number, other_number, math_symbol, lowercase_letter, space_separator, surrogate, initial_quote_punctuation, decimal_digit_number, digit, other_punctuation, dash_punctuation, currency_symbol, non_spacing_mark, format, modifier_letter, control, uppercase_letter, other_symbol, end_punctuation, modifier_symbol, other_letter, line_separator, titlecase_letter, letter, punctuation, combining_spacing_mark, final_quote_punctuation, whitespace]

So the "custom" will be burried somewhere in there. If you have any ideas to improve this let me know, otherwise I'll merge this after I pushed the changes and tests passed.

romseygeek
Contributor

LGTM, thanks
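
For readers outside the codebase, here is a rough, self-contained sketch of the kind of logic discussed in the review above (the predefined classes, the IntPredicate stand-in for CharMatcher, and the way custom_token_chars is consumed are illustrative assumptions, not the actual Elasticsearch implementation):

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.IntPredicate;
import java.util.stream.Collectors;

class TokenCharsSketch {
    // Stand-ins for a few of the predefined character classes in MATCHERS.
    static final Map<String, IntPredicate> MATCHERS = Map.of(
        "letter", Character::isLetter,
        "digit", Character::isDigit,
        "whitespace", Character::isWhitespace
    );

    // Sketch only: resolves each requested class, treating "custom" as
    // "every character listed in custom_token_chars belongs in a token".
    static IntPredicate parseTokenChars(List<String> characterClasses, String customTokenChars) {
        IntPredicate result = c -> false;
        for (String raw : characterClasses) {
            String characterClass = raw.toLowerCase(Locale.ROOT).trim();
            IntPredicate matcher = MATCHERS.get(characterClass);
            if (matcher == null) {
                if (characterClass.equals("custom") == false) {
                    throw new IllegalArgumentException("Unknown token type: '" + characterClass
                        + "', must be one of " + MATCHERS.keySet());
                }
                if (customTokenChars == null) {
                    throw new IllegalArgumentException(
                        "Token type 'custom' requires the custom_token_chars setting");
                }
                Set<Integer> custom = customTokenChars.chars().boxed().collect(Collectors.toSet());
                matcher = custom::contains;
            }
            result = result.or(matcher);
        }
        return result;
    }

    public static void main(String[] args) {
        IntPredicate tokenChars = parseTokenChars(List.of("letter", "digit", "custom"), "_-");
        System.out.println(tokenChars.test('_')); // true: underscore now stays inside tokens
        System.out.println(tokenChars.test(' ')); // false: whitespace still splits tokens
    }
}
```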

cbuescher merged commit ed86750 into elastic:master on Nov 20, 2019
cbuescher pushed a commit that referenced this pull request Nov 20, 2019
Successfully merging this pull request may close these issues.

Allow specific characters in token_chars of edge ngram tokenizer in addition to classes
4 participants