
Support for a token filter/char filter that cleans up word boundaries (remove or replace with space) according to the Unicode Text Segmentation algorithm #34402

Open
Tiuser4567 opened this Issue Oct 11, 2018 · 3 comments


Tiuser4567 commented Oct 11, 2018

Describe the feature:
Version used: Elasticsearch 6.0

Hello,
I would like to ask if it is possible to provide support for a token filter/char filter that would clean up (remove or replace with a space) word boundaries according to the Unicode Text Segmentation algorithm (as specified in Unicode Standard Annex #29, the one used by the standard tokenizer),
which could then be used in a normalizer for the keyword data type.

Scenario:
We have two types of fields:

  1. full text search (type: text + standard analyzer)
  2. exact match (type: keyword + keyword analyzer)

e.g. given
'can't see you' (the enclosing single quotes are part of the text content)
this is currently tokenized as follows:
type: text (standard analyzer) -> can't, see, you
type: keyword (keyword analyzer) -> 'can't see you'
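
For reference, this behaviour can be reproduced with the _analyze API (the text below is the example value, including the enclosing single quotes):

  POST _analyze
  {
    "analyzer": "standard",
    "text": "'can't see you'"
  }
  -> tokens: can't, see, you

  POST _analyze
  {
    "analyzer": "keyword",
    "text": "'can't see you'"
  }
  -> single token: 'can't see you'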

What we would like is for the exact match field to also drop the apostrophes that are treated as word boundaries in the full text search output (while still keeping the apostrophe that is not a word boundary, e.g. in "can't"),
so that searching for "can't see you" (without the enclosing single quotes) returns the same result in both the full text search and the exact match field.

e.g.
type: keyword (keyword analyzer + normalizer with a "standard" char filter) -> can't see you
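
To be clear, no such char filter exists today; the sketch below only illustrates how we imagine the requested piece would be wired into a normalizer. The index and field names are placeholders, and "standard" stands in for whatever the new char filter would be called:

  PUT my_index
  {
    "settings": {
      "analysis": {
        "normalizer": {
          "boundary_cleanup": {
            "type": "custom",
            "char_filter": [ "standard" ]
          }
        }
      }
    },
    "mappings": {
      "_doc": {
        "properties": {
          "message_exact": {
            "type": "keyword",
            "normalizer": "boundary_cleanup"
          }
        }
      }
    }
  }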

I know there is a pattern_replace character filter, but I don't think it would be straightforward to implement the algorithm with it: the conditions for word boundaries are diverse, which would lead to a long regular expression, and the possible performance impact of using regular expressions would also need to be considered.
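
For example, even a rough pattern_replace based workaround that only handles the ASCII apostrophe case (stripping apostrophes that are not surrounded by letters or digits) already needs lookarounds, and it covers only a tiny part of the Annex #29 boundary rules. The index, char filter and normalizer names below are just placeholders:

  PUT my_index
  {
    "settings": {
      "analysis": {
        "char_filter": {
          "strip_boundary_quotes": {
            "type": "pattern_replace",
            "pattern": "(?<![\\p{L}\\p{N}])'|'(?![\\p{L}\\p{N}])",
            "replacement": ""
          }
        },
        "normalizer": {
          "exact_match_normalizer": {
            "type": "custom",
            "char_filter": [ "strip_boundary_quotes" ]
          }
        }
      }
    }
  }

With this normalizer, 'can't see you' would be indexed as can't see you, but every additional punctuation character and script would need its own rule, which is exactly why a dedicated filter would help.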

Thanks


elasticmachine commented Oct 12, 2018

Member

cbuescher commented Oct 17, 2018

Hi @Tiuser4567

Thanks for opening this issue and for requesting this new token filter.

I know there is a pattern_replace character filter, but I don't think it would be straightforward to implement the algorithm with it: the conditions for word boundaries are diverse, which would lead to a long regular expression, and the possible performance impact of using regular expressions would also need to be considered.

I agree with you that the current way of doing what you are asking for (via pattern_replace) is a bit cumbersome, but I am wondering if a dedicated "word boundary" filter would be easier to use. After looking a bit at the Unicode Text Segmentation specs you referred to, I doubt that an implementation would be straightforward for general use cases. Instead I fear that any implementation will again be language and/or use-case specific, will require quite some additional configuration, and in the end will also suffer from irritating edge cases that are difficult for the user to understand without knowing the details of the algorithm. I might be wrong on this; it's just an initial gut feeling of mine, having worked with text segmentation a bit in the past.
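
As a quick illustration of the kind of edge case I mean (assuming the standard analyzer): the apostrophe is only kept when it sits between two letters, so seemingly similar inputs segment differently:

  POST _analyze
  {
    "analyzer": "standard",
    "text": "rock 'n' roll can't"
  }
  -> tokens: rock, n, roll, can't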

I will label this issue for internal team discussion to get some further input on this.

@cbuescher cbuescher closed this Oct 17, 2018

Member

cbuescher commented Oct 17, 2018

Sorry, didn't mean to close, reopening for team discussion.

@cbuescher cbuescher reopened this Oct 17, 2018
