Ensure TokenFilters only produce single tokens when parsing synonyms #34331

Merged
romseygeek merged 21 commits into elastic:master from romseygeek:synonymfilters on Nov 29, 2018

Conversation

romseygeek
Contributor

A number of tokenfilters can produce multiple tokens at the same position. This
is a problem when using token chains to parse synonym files, as the SynonymMap
requires that there are no stacked tokens in its input.

This commit ensures that these tokenfilters either produce a single version of
their input token, or that they are ignored entirely for synonym processing.

  • asciifolding and cjk_bigram produce only the folded or bigrammed token
  • decompounders, grams, synonyms, wdf, fingerprint and phonetic are ignored

Fixes #34298
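
For background, here is a minimal standalone Lucene sketch (not code from this PR) showing how a filter such as asciifolding with preserve_original stacks two tokens at one position, which is exactly what SynonymMap cannot consume:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class StackedTokensDemo {
    public static void main(String[] args) throws Exception {
        Tokenizer tok = new WhitespaceTokenizer();
        tok.setReader(new StringReader("café"));
        // preserveOriginal=true makes the filter emit the folded token
        // and then the original token with a position increment of 0
        TokenStream ts = new ASCIIFoldingFilter(tok, true);
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        PositionIncrementAttribute posInc = ts.addAttribute(PositionIncrementAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term + " (posInc=" + posInc.getPositionIncrement() + ")");
        }
        ts.end();
        ts.close();
        // prints "cafe (posInc=1)" then "café (posInc=0)", i.e. two stacked tokens
    }
}
```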

@romseygeek romseygeek self-assigned this Oct 5, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

@romseygeek
Contributor Author

Another option, of course, would be to extend SynonymMap in Lucene to allow multiple paths through a token stream to be added to the map.

Contributor

@jimczi jimczi left a comment


Wouldn't this be similar to what the lenient option does? This PR eliminates filters that can produce an invalid stream for the synonym map, but won't it also prevent the rules from matching a regular stream? We could throw an error for some of them, because we know they don't work when set before synonyms.
Silently ignoring the incompatible filter by default is a source of problems IMO, so we need to be clear about what's acceptable and what's not. What comes after a synonym filter is also problematic, but that's another discussion ;)

@romseygeek
Contributor Author

I think it's better than lenient in that it won't throw out entries in the synonym map. I do agree that some of these would work better by throwing exceptions rather than returning an identity filter - fingerprint and phonetic in particular. Decompounder, grams and wdf all emit the original token as well, so synonym matching should still work in those cases. What to do with chained synonym filters is a more complicated question.

Contributor

@johtani johtani left a comment


What about the multiplexer token filter? I think this PR shouldn't apply to it. If possible, we could add some explanation to the synonym token filter's docs.

And what about tokenizers? e.g. the Kuromoji tokenizer produces multiple tokens when mode=search, which is the default.

@romseygeek
Contributor Author

Multiplexer is ignored, and there's a note in the multiplexer docs explaining why and how to work around it. Maybe this should be moved to the synonym docs though? Tokenizers aren't dealt with at all currently.

Having thought some more about this, I think we should do the following:

  • multiplexer, ngram, edgengram, wdf, fingerprint and phonetic should throw exceptions. To make upgrading easier, we should only emit a warning if the index version is below 7.0
  • decompounders should be ignored for synonyms (synonyms in general are going to apply to whole words, and applying them to only parts of words will likely produce nonsense)
  • we should also allow users to specify an analyzer for parsing synonyms, so if they really do need to use a synonym filter after a fingerprint, they can do so. This will allow workarounds for things like the kuromoji tokenizer

@johtani is there a good way of dealing with Japanese synonyms currently, or does the kuromoji tokenizer just break things?

@johtani
Contributor

johtani commented Oct 16, 2018

@romseygeek Yes, the 3rd option will work. If users want to use mode=search with synonyms, they can configure mode=normal or mode=extended for analyzing the synonym file.

The only exception is the nbest_cost option in Kuromoji, but that is fine for now. We would need another approach if we want to use nbest_cost and synonyms... e.g. kuromoji_tokenizer could expand synonyms during tokenization...

@romseygeek
Contributor Author

e.g. kuromoji_tokenizer could expand synonyms during tokenization...

This might end up being a better way of doing things entirely. The current infrastructure allows tokenfilters to refer to other arbitrary tokenfilters, but not to charfilters, tokenizers or analyzers, and it would be tricky to implement this fully.

@romseygeek
Contributor Author

I've updated based on my comment above:

  • you can specify a set of filters to be used for parsing (this automatically sets lenient to true)
  • filters that we know won't work in general throw exceptions if used after 7.0, and emit warnings and bypass themselves in 6.x (see the sketch below)
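
To illustrate the second point, here is a hedged sketch of what the version-gated check could look like inside a token filter factory. The names getSynonymFilter, IDENTITY_FILTER and DEPRECATION_LOGGER follow this conversation but are assumptions, not necessarily the merged code:

```java
// Sketch of a factory for a filter that cannot be used to parse synonyms.
@Override
public TokenFilterFactory getSynonymFilter() {
    if (indexCreatedVersion.onOrAfter(Version.V_7_0_0)) {
        // 7.0+ indexes: fail hard so broken synonym rules are not silently dropped
        throw new IllegalArgumentException(
            "Token filter [" + name() + "] cannot be used to parse synonyms");
    }
    // 6.x indexes: warn, then step aside so existing configurations keep working
    DEPRECATION_LOGGER.deprecated(
        "Token filter [" + name() + "] will not be usable to parse synonyms after v7.0");
    return IDENTITY_FILTER;
}
```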

@jimczi
Contributor

jimczi commented Oct 18, 2018

multiplexer, ngram, edgengram, wdf, fingerprint and phonetic should throw exceptions. To make upgrading easier, we should only emit a warning if the index version is below 7.0

+1

decompounders should be ignored for synonyms (synonyms in general are going to apply to whole words, and applying them to only parts of words will likely produce nonsense)

Should we throw an exception in 7 if a decompounder is used?

we should also allow users to specify an analyzer for parsing synonyms, so if they really do need to use a synonym filter after a fingerprint, they can do so. This will allow workarounds for things like the kuromoji tokenizer

This seems trappy: how is it possible to ensure that the list of filters (or the analyzer) is compatible with the filters that are defined in the main chain? If you define a filter that alters the form of the token, it needs to be applied to all terms, otherwise synonyms will never match anything?

@romseygeek
Contributor Author

Should we throw an exception in 7 if a decompounder is used?

Decompounders always emit their original token as well, so I think bypassing is the correct thing to do here? We don't want to expand part of a synonym, we only want to match the entire thing.
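
As an illustration (standalone Lucene, not code from this PR), a dictionary decompounder emits the original token first and stacks the sub-words beneath it:

```java
import java.io.StringReader;
import java.util.Arrays;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class DecompounderDemo {
    public static void main(String[] args) throws Exception {
        CharArraySet dict = new CharArraySet(Arrays.asList("dampf", "schiff"), true);
        Tokenizer tok = new WhitespaceTokenizer();
        tok.setReader(new StringReader("dampfschiff"));
        TokenStream ts = new DictionaryCompoundWordTokenFilter(tok, dict);
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        PositionIncrementAttribute posInc = ts.addAttribute(PositionIncrementAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term + " (posInc=" + posInc.getPositionIncrement() + ")");
        }
        ts.end();
        ts.close();
        // prints "dampfschiff (posInc=1)", then "dampf (posInc=0)" and
        // "schiff (posInc=0)": the whole word is still there to match synonyms
    }
}
```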

This seems trappy

It's definitely an expert feature - it basically says "I know what I'm doing, override the various safety checks you have". I can see it being useful for people who want to combine WDF and synonyms, for example. Although thinking about it more, maybe lenient should do this work. If we enable lenient, then filters that would otherwise throw an exception are applied?

@romseygeek
Contributor Author

I've removed the ability to set an arbitrary set of filters; instead, lenient is now passed down to getSynonymFilter and filters can choose how to deal with things themselves. multiplexer and ngram return an identity filter, and wdf will try to apply itself.

Contributor

@jimczi jimczi left a comment


I left more comments. I am not sure about the handling of the lenient option: if it is not set, some token filters will throw an exception, but they will be discarded if it is set to true, so there is no way to know which rules failed?


@Override
public Object getMultiTermComponent() {
    return getSynonymFilter(true);
}
Contributor


I wonder if this works, since the synonym filter checks only the first token at each position; if the first token is the original one, there is no chance that the modified one can match.

Contributor Author


The folded token is always emitted first, so that's the one we need to match against the synonym map. Hence, we return a filter that only emits folded tokens.
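
For reference, a sketch of how the asciifolding factory can guarantee that (assumed shape, not the verbatim merged code):

```java
// When parsing synonyms, drop preserve_original so that only the folded
// token is emitted and nothing is stacked at the same position.
@Override
public TokenFilterFactory getSynonymFilter() {
    if (preserveOriginal == false) {
        return this; // already emits exactly one token per position
    }
    return new TokenFilterFactory() {
        @Override
        public String name() {
            return ASCIIFoldingTokenFilterFactory.this.name();
        }
        @Override
        public TokenStream create(TokenStream tokenStream) {
            return new ASCIIFoldingFilter(tokenStream, false);
        }
    };
}
```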

@colings86 colings86 added v6.6.0 and removed v6.5.0 labels Oct 25, 2018
@romseygeek
Contributor Author

Are you happy with the latest changes @jimczi?

Contributor

@jimczi jimczi left a comment


I left more comments. I like the fact that we can control the type of filters that we allow before a synonym filter, but I wonder if we should keep the new lenient option. I don't see why we should continue to support ngram, shingle and all these crazy filters before a synonym filter; adding lenient to support these cases just hides the fact that all the rules are broken. Maybe we can start throwing exceptions in 7 and deprecate in 6.x? This way we don't need the lenient option and we can focus on good support for the filters that we allow.

@romseygeek
Contributor Author

I removed the lenient boolean.

Something to consider in future is making it easier to apply synonyms without having to build your own tokenizer chain - for example, we could add a synonyms setting to the various language analyzers, and that way users wouldn't need to worry about the ordering of filters or anything like that.

@romseygeek
Contributor Author

@elasticmachine test this please

Contributor

@jimczi jimczi left a comment


I left two questions, LGTM otherwise

@romseygeek
Contributor Author

@elasticmachine please run the gradle build tests 1

@romseygeek romseygeek merged commit a646f85 into elastic:master Nov 29, 2018
@romseygeek romseygeek deleted the synonymfilters branch November 29, 2018 10:35
romseygeek added a commit that referenced this pull request Nov 29, 2018
…34331)

A number of tokenfilters can produce multiple tokens at the same position.  This
is a problem when using token chains to parse synonym files, as the SynonymMap
requires that there are no stacked tokens in its input.

This commit ensures that, when used to parse synonyms, these tokenfilters either produce
a single version of their input token or throw an error when mappings are generated.
In indexes created in Elasticsearch 6.x, deprecation warnings are emitted in place of
the error.

* asciifolding and cjk_bigram produce only the folded or bigrammed token
* decompounders, synonyms and keyword_repeat are skipped
* n-grams, word-delimiter-filter, multiplexer, fingerprint and phonetic throw errors

Fixes #34298
@jimczi jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019