Ensure TokenFilters only produce single tokens when parsing synonyms #34331
Conversation
A number of tokenfilters can produce multiple tokens at the same position. This is a problem when using token chains to parse synonym files, as the SynonymMap requires that there are no stacked tokens in its input. This commit ensures that these tokenfilters either produce a single version of their input token, or that they are ignored entirely for synonym processing.

* asciifolding and cjk_bigram produce only the folded or bigrammed token
* decompounders, grams, synonyms, wdf, fingerprint and phonetic are ignored

Fixes elastic#34298
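To illustrate the constraint the commit works around, here is a minimal conceptual sketch (not the Lucene API): a token emitted with position increment 0 is "stacked" at the same position as the previous token, and a SynonymMap-style parser, which wants exactly one term per position, cannot accept such a stream.

```python
from dataclasses import dataclass

# Hypothetical model of a Lucene-style token stream; names are illustrative.
@dataclass
class Token:
    term: str
    pos_inc: int  # position increment; 0 means stacked on the previous token

def parse_synonym_side(tokens):
    # A SynonymMap-style parser concatenates one term per position, so any
    # stacked token makes the rule ambiguous and must be rejected.
    for tok in tokens:
        if tok.pos_inc == 0:
            raise ValueError("stacked token not allowed: " + tok.term)
    return " ".join(t.term for t in tokens)

# A word_delimiter-like filter turns "wi-fi" into a stacked stream:
stacked = [Token("wi-fi", 1), Token("wi", 0), Token("fi", 1)]
```

A flat stream such as `[Token("wi", 1), Token("fi", 1)]` parses cleanly, while the stacked stream above is rejected, which is exactly the situation the listed tokenfilters can create.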
Pinging @elastic/es-search-aggs
Another option, of course, would be to extend
Wouldn't this be similar to what the lenient option does? This PR eliminates filters that can produce an invalid stream for the synonym map, but won't it also prevent the rules from matching a regular stream? We could throw an error for some of them, because we know they don't work when set before synonyms.
Silently ignoring the incompatible filter by default is a source of problems IMO, so we need to be clear about what's acceptable and what's not. What comes after a synonym filter is also problematic, but that's another discussion ;)
I think it's better than
What about the multiplexer token filter? I think this PR shouldn't apply to it. If possible, we could add some explanation to the synonym tokenfilter's docs.
And how about tokenizers? e.g. the Kuromoji Tokenizer produces multiple tokens at the same position if mode=search, and that's its default.
Multiplexer is ignored, and there's a note in the multiplexer docs explaining why and how to work around it. Maybe this should be moved to the synonym docs though? Tokenizers aren't dealt with at all currently. Having thought some more about this, I think we should do the following:
@johtani is there a good way of dealing with Japanese synonyms currently, or does the kuromoji tokenizer just break things?
@romseygeek Yes, the 3rd option will work. If users want to use … The only exception is
This might end up being a better way of doing things entirely. The current infrastructure allows tokenfilters to refer to other arbitrary tokenfilters, but not to charfilters, tokenizers or analyzers, and it would be tricky to implement this fully.
…xceptions or warnings from bad filters
I've updated based on my comment above:
+1
Should we throw an exception in 7 if a decompounder is used?
This seems trappy: how is it possible to ensure that the list of filters (or the analyzer) is compatible with the filters that are defined in the main chain? If you define a filter that alters the form of the token, it needs to be applied to all terms, otherwise synonyms will never match anything.
Decompounders always emit their original token as well, so I think bypassing is the correct thing to do here? We don't want to expand part of a synonym, we only want to match the entire thing.
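The point about decompounders can be sketched as follows (illustrative model, not the Elasticsearch implementation): a decompounder emits the original compound token plus its parts stacked at the same position, shown here as `(term, position_increment)` pairs where increment 0 means "stacked". Because the original token is always present, bypassing the decompounder for synonym parsing keeps the whole term, which is what whole-synonym matching needs.

```python
# Hypothetical decompounder behaviour; function names are illustrative.
def decompound(term, parts):
    # original compound first, then its parts stacked at the same position
    return [(term, 1)] + [(p, 0) for p in parts]

def synonym_parsing_view(term):
    # bypass: the synonym rule sees only the full, undivided term
    return [(term, 1)]

print(decompound("notebook", ["note", "book"]))
print(synonym_parsing_view("notebook"))
```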
It's definitely an expert feature: it basically says "I know what I'm doing, override the various safety checks you have". I can see it being useful for people who want to combine WDF and synonyms, for example. Although thinking about it more, maybe
I've removed the ability to set an arbitrary set of filters, instead
I left more comments. I am not sure about the handling of the lenient option. If it is not set, some token filters will throw an exception, but they will be discarded if it is set to true, so there is no way to know which rules failed?
@Override
public Object getMultiTermComponent() {
    return getSynonymFilter(true);
}
I wonder if this works, since the synonym filter checks only the first token of each position; if the first token is the original one, there is no chance that the modified one can match.
The folded token is always emitted first, so that's the one we need to match against the synonym map. Hence, we return a filter that only emits folded tokens.
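A sketch of that ordering (not the Elasticsearch code): an asciifolding-style filter emits the folded token first and, when `preserve_original` is set, stacks the original behind it with position increment 0. A synonym parser that reads one token per position therefore needs the variant that emits only the folded form.

```python
import unicodedata

def ascii_fold(term):
    # strip diacritics down to plain ASCII, e.g. "café" -> "cafe"
    return unicodedata.normalize("NFKD", term).encode("ascii", "ignore").decode()

# Illustrative model; names and signatures are assumptions, not the real API.
def folding_filter(terms, preserve_original=False):
    out = []
    for t in terms:
        folded = ascii_fold(t)
        out.append((folded, 1))        # folded token is emitted first
        if preserve_original and folded != t:
            out.append((t, 0))         # original stacked at the same position
    return out
```

With `preserve_original=False` the stream contains a single token per position, which is the form returned for synonym parsing.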
Resolved review threads (outdated):
* .../analysis-common/src/main/java/org/elasticsearch/analysis/common/CJKBigramFilterFactory.java
* ...is-common/src/main/java/org/elasticsearch/analysis/common/CommonGramsTokenFilterFactory.java
* ...alysis-common/src/main/java/org/elasticsearch/analysis/common/SynonymTokenFilterFactory.java
* server/src/main/java/org/elasticsearch/index/analysis/ShingleTokenFilterFactory.java
Are you happy with the latest changes @jimczi?
I left more comments. I like the fact that we can control the type of filters that we allow to put before a synonym filter, but I wonder if we should keep the new lenient option. I don't see why we should continue to support ngram, shingle and all these crazy filters before a synonym filter, and adding lenient to support these cases just hides the fact that all rules are broken. Maybe we can start throwing exceptions in 7 and deprecate in 6.x? This way we don't need the lenient option and we can focus on good support for the filters that we allow.
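The version-gated behaviour suggested here (and adopted in the merged commit message) can be sketched like this; function and parameter names are illustrative, not the Elasticsearch API:

```python
import warnings

# Sketch: an incompatible filter before a synonym filter raises an error,
# except on indexes created in 6.x, where only a deprecation warning is
# emitted, matching the "throw in 7, deprecate in 6.x" proposal.
def check_synonym_chain(filter_name, index_created_major_version):
    msg = "Token filter [%s] cannot be used to parse synonyms" % filter_name
    if index_created_major_version >= 7:
        raise ValueError(msg)
    warnings.warn(msg + "; this will be an error in 7.0", DeprecationWarning)
```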
Resolved review threads (outdated):
* ...ysis-common/src/main/java/org/elasticsearch/analysis/common/EdgeNGramTokenFilterFactory.java (two threads)
* ...is-common/src/main/java/org/elasticsearch/analysis/common/FingerprintTokenFilterFactory.java
* ...is-common/src/main/java/org/elasticsearch/analysis/common/MultiplexerTokenFilterFactory.java
* ...analysis-common/src/main/java/org/elasticsearch/analysis/common/NGramTokenFilterFactory.java
* ...on/src/main/java/org/elasticsearch/analysis/common/WordDelimiterGraphTokenFilterFactory.java
* ...-common/src/main/java/org/elasticsearch/analysis/common/WordDelimiterTokenFilterFactory.java
I removed the lenient option. Something to consider in future is maybe making it easier to apply synonyms without having to build your own tokenizer chain; for example, we could add a
@elasticmachine test this please
I left two questions, LGTM otherwise
Resolved review threads (outdated):
* ...is-common/src/main/java/org/elasticsearch/analysis/common/MultiplexerTokenFilterFactory.java (two threads)
@elasticmachine please run the gradle build tests
Ensure TokenFilters only produce single tokens when parsing synonyms (#34331)

A number of tokenfilters can produce multiple tokens at the same position. This is a problem when using token chains to parse synonym files, as the SynonymMap requires that there are no stacked tokens in its input. This commit ensures that when used to parse synonyms, these tokenfilters either produce a single version of their input token, or that they throw an error when mappings are generated. In indexes created in elasticsearch 6.x, deprecation warnings are emitted in place of the error.

* asciifolding and cjk_bigram produce only the folded or bigrammed token
* decompounders, synonyms and keyword_repeat are skipped
* n-grams, word-delimiter-filter, multiplexer, fingerprint and phonetic throw errors

Fixes #34298
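The per-filter policy in the merged change can be summarised in a small sketch; the bucket names are illustrative, and the filter lists follow the commit message above.

```python
# Illustrative summary of how each filter is treated when parsing synonyms.
FOLD = {"asciifolding", "cjk_bigram"}                 # emit one folded/bigrammed token
SKIP = {"decompounder", "synonym", "keyword_repeat"}  # bypassed for synonym parsing
ERROR = {"ngram", "edge_ngram", "word_delimiter",
         "word_delimiter_graph", "multiplexer", "fingerprint", "phonetic"}

def synonym_chain_policy(name):
    if name in FOLD:
        return "fold"
    if name in SKIP:
        return "skip"
    if name in ERROR:
        return "error"
    return "pass-through"
```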