Ensure TokenFilters only produce single tokens when parsing synonyms #34331

Merged
romseygeek merged 21 commits into elastic:master from romseygeek:synonymfilters on Nov 29, 2018

Conversation

romseygeek
Contributor

A number of tokenfilters can produce multiple tokens at the same position. This
is a problem when using token chains to parse synonym files, as the SynonymMap
requires that there are no stacked tokens in its input.

This commit ensures that these tokenfilters either produce a single version of
their input token, or that they are ignored entirely for synonym processing.

  • asciifolding and cjk_bigram produce only the folded or bigrammed token
  • decompounders, grams, synonyms, wdf, fingerprint and phonetic are ignored

Fixes #34298
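
For background, here is a minimal standalone Lucene sketch (not code from this PR) showing how a filter such as asciifolding with preserve_original stacks two tokens at one position, which is exactly what SynonymMap cannot consume:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class StackedTokensDemo {
    public static void main(String[] args) throws Exception {
        Tokenizer tok = new WhitespaceTokenizer();
        tok.setReader(new StringReader("café"));
        // preserveOriginal=true makes the filter emit the folded token
        // and then the original token with a position increment of 0
        TokenStream ts = new ASCIIFoldingFilter(tok, true);
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        PositionIncrementAttribute posInc = ts.addAttribute(PositionIncrementAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term + " (posInc=" + posInc.getPositionIncrement() + ")");
        }
        ts.end();
        ts.close();
        // prints "cafe (posInc=1)" then "café (posInc=0)", i.e. two stacked tokens
    }
}
```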

@romseygeek romseygeek self-assigned this Oct 5, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

@romseygeek
Contributor Author

Another option, of course, would be to extend SynonymMap in Lucene to allow multiple paths through a token stream to be added to the map.

Contributor

@jimczi jimczi left a comment


Wouldn't this be similar to what the lenient option does? This PR eliminates filters that can produce an invalid stream for the synonym map, but won't it also prevent the rules from matching a regular stream? We could throw an error for some of them, because we know they don't work when set before synonyms.
Silently ignoring the incompatible filter by default is a source of problems IMO, so we need to be clear about what's acceptable and what's not. What comes after a synonym filter is also problematic, but that's another discussion ;)

@romseygeek
Contributor Author

I think it's better than lenient in that it won't throw out entries in the synonym map. I do agree that some of these would work better by throwing exceptions rather than returning an identity filter - fingerprint and phonetic in particular. Decompounder, grams and wdf all emit the original token as well, so synonym matching should still work in those cases. What to do with chained synonym filters is a more complicated question.

Contributor

@johtani johtani left a comment


What about the multiplexer token filter? I think this PR shouldn't apply to it. If possible, we could add some explanation to the synonym token filter's docs.

And what about tokenizers? e.g. the Kuromoji tokenizer produces multiple tokens when mode=search, which is the default.

@romseygeek
Contributor Author

Multiplexer is ignored, and there's a note in the multiplexer docs explaining why and how to work around it. Maybe this should be moved to the synonym docs though? Tokenizers aren't dealt with at all currently.

Having thought some more about this, I think we should do the following:

  • multiplexer, ngram, edgengram, wdf, fingerprint and phonetic should throw exceptions. To make upgrading easier, we should only emit a warning if the index version is below 7.0
  • decompounders should be ignored for synonyms (synonyms in general are going to apply to whole words, and applying them to only parts of words will likely produce nonsense)
  • we should also allow users to specify an analyzer for parsing synonyms, so if they really do need to use a synonym filter after a fingerprint, they can do so. This will allow workarounds for things like the kuromoji tokenizer

@johtani is there a good way of dealing with Japanese synonyms currently, or does the kuromoji tokenizer just break things?

@johtani
Contributor

johtani commented Oct 16, 2018

@romseygeek Yes, the 3rd option will work. If users want to use mode=search with synonyms, they can configure mode=normal or mode=extended for analyzing the synonym file.

The only exception is the nbest_cost option in Kuromoji, but that is fine for now. We would need another approach if we want to use nbest_cost and synonyms... e.g. kuromoji_tokenizer could expand synonyms during tokenization...

@romseygeek
Contributor Author

e.g. kuromoji_tokenizer could expand synonyms during tokenization...

This might end up being a better way of doing things entirely. The current infrastructure allows tokenfilters to refer to other arbitrary tokenfilters, but not to charfilters, tokenizers or analyzers, and it would be tricky to implement this fully.

@romseygeek
Contributor Author

I've updated based on my comment above:

  • you can specify a set of filters to be used for parsing (this automatically sets lenient to true)
  • filters that we know won't work in general throw exceptions if used after 7.0, and emit warnings and bypass themselves in 6.x (see the sketch below)
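
To illustrate the second point, here is a hedged sketch of what the version-gated check could look like inside a token filter factory. The names getSynonymFilter, IDENTITY_FILTER and DEPRECATION_LOGGER follow this conversation but are assumptions, not necessarily the merged code:

```java
// Sketch of a factory for a filter that cannot be used to parse synonyms.
@Override
public TokenFilterFactory getSynonymFilter() {
    if (indexCreatedVersion.onOrAfter(Version.V_7_0_0)) {
        // 7.0+ indexes: fail hard so broken synonym rules are not silently dropped
        throw new IllegalArgumentException(
            "Token filter [" + name() + "] cannot be used to parse synonyms");
    }
    // 6.x indexes: warn, then step aside so existing configurations keep working
    DEPRECATION_LOGGER.deprecated(
        "Token filter [" + name() + "] will not be usable to parse synonyms after v7.0");
    return IDENTITY_FILTER;
}
```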

@jimczi
Contributor

jimczi commented Oct 18, 2018

multiplexer, ngram, edgengram, wdf, fingerprint and phonetic should throw exceptions. To make upgrading easier, we should only emit a warning if the index version is below 7.0

+1

decompounders should be ignored for synonyms (synonyms in general are going to apply to whole words, and applying them to only parts of words will likely produce nonsense)

Should we throw an exception in 7 if a decompounder is used?

we should also allow users to specify an analyzer for parsing synonyms, so if they really do need to use a synonym filter after a fingerprint, they can do so. This will allow workarounds for things like the kuromoji tokenizer

This seems trappy: how is it possible to ensure that the list of filters (or the analyzer) is compatible with the filters that are defined in the main chain? If you define a filter that alters the form of the token, it needs to be applied to all terms, otherwise synonyms will never match anything?

@romseygeek
Contributor Author

Should we throw an exception in 7 if a decompounder is used?

Decompounders always emit their original token as well, so I think bypassing is the correct thing to do here? We don't want to expand part of a synonym, we only want to match the entire thing.
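
As an illustration (standalone Lucene, not code from this PR), a dictionary decompounder emits the original token first and stacks the sub-words beneath it:

```java
import java.io.StringReader;
import java.util.Arrays;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class DecompounderDemo {
    public static void main(String[] args) throws Exception {
        CharArraySet dict = new CharArraySet(Arrays.asList("dampf", "schiff"), true);
        Tokenizer tok = new WhitespaceTokenizer();
        tok.setReader(new StringReader("dampfschiff"));
        TokenStream ts = new DictionaryCompoundWordTokenFilter(tok, dict);
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        PositionIncrementAttribute posInc = ts.addAttribute(PositionIncrementAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term + " (posInc=" + posInc.getPositionIncrement() + ")");
        }
        ts.end();
        ts.close();
        // prints "dampfschiff (posInc=1)", then "dampf (posInc=0)" and
        // "schiff (posInc=0)": the whole word is still there to match synonyms
    }
}
```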

This seems trappy

It's definitely an expert feature - it basically says "I know what I'm doing, override the various safety checks you have". I can see it being useful for people who want to combine WDF and synonyms, for example. Although thinking about it more, maybe lenient should do this work. If we enable lenient, then filters that would otherwise throw an exception are applied?

@romseygeek
Contributor Author

I've removed the ability to set an arbitrary set of filters; instead, lenient is now passed down to getSynonymFilter and filters can choose how to deal with things themselves. multiplexer and ngram return an identity filter, and wdf will try to apply itself.

Contributor

@jimczi jimczi left a comment


I left more comments. I am not sure about the handling of the lenient option: if it is not set, some token filters will throw an exception, but they will be discarded if it is set to true, so there is no way to know which rules failed?


@Override
public Object getMultiTermComponent() {
    return getSynonymFilter(true);
}
Contributor


I wonder if this works, since the synonym filter checks only the first token at each position; if the first token is the original one, there is no chance that the modified one can match.

Contributor Author


The folded token is always emitted first, so that's the one we need to match against the synonym map. Hence, we return a filter that only emits folded tokens.
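
For reference, a sketch of how the asciifolding factory can guarantee that (assumed shape, not the verbatim merged code):

```java
// When parsing synonyms, drop preserve_original so that only the folded
// token is emitted and nothing is stacked at the same position.
@Override
public TokenFilterFactory getSynonymFilter() {
    if (preserveOriginal == false) {
        return this; // already emits exactly one token per position
    }
    return new TokenFilterFactory() {
        @Override
        public String name() {
            return ASCIIFoldingTokenFilterFactory.this.name();
        }
        @Override
        public TokenStream create(TokenStream tokenStream) {
            return new ASCIIFoldingFilter(tokenStream, false);
        }
    };
}
```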

@colings86 colings86 added v6.6.0 and removed v6.5.0 labels Oct 25, 2018
@romseygeek
Contributor Author

Are you happy with the latest changes @jimczi?

Contributor

@jimczi jimczi left a comment


I left more comments. I like the fact that we can control the type of filters that we allow before a synonym filter, but I wonder if we should keep the new lenient option. I don't see why we should continue to support ngram, shingle and all these crazy filters before a synonym filter; adding lenient to support these cases just hides the fact that all the rules are broken. Maybe we can start throwing exceptions in 7 and deprecate in 6.x? This way we don't need the lenient option and we can focus on good support for the filters that we allow.

@romseygeek
Contributor Author

I removed the lenient boolean.

Something to consider in future is making it easier to apply synonyms without having to build your own tokenizer chain - for example, we could add a synonyms setting to the various language analyzers, and that way users wouldn't need to worry about the ordering of filters or anything like that.

@romseygeek
Contributor Author

@elasticmachine test this please

Contributor

@jimczi jimczi left a comment


I left two questions, LGTM otherwise

@romseygeek
Contributor Author

@elasticmachine please run the gradle build tests 1

@romseygeek romseygeek merged commit a646f85 into elastic:master Nov 29, 2018
@romseygeek romseygeek deleted the synonymfilters branch November 29, 2018 10:35
romseygeek added a commit that referenced this pull request Nov 29, 2018
…34331)

A number of tokenfilters can produce multiple tokens at the same position.  This
is a problem when using token chains to parse synonym files, as the SynonymMap
requires that there are no stacked tokens in its input.

This commit ensures that, when used to parse synonyms, these tokenfilters either produce
a single version of their input token or throw an error when mappings are generated.
In indexes created in Elasticsearch 6.x, deprecation warnings are emitted in place of
the error.

* asciifolding and cjk_bigram produce only the folded or bigrammed token
* decompounders, synonyms and keyword_repeat are skipped
* n-grams, word-delimiter-filter, multiplexer, fingerprint and phonetic throw errors

Fixes #34298
@jimczi jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019