Support for a token filter/char filter that cleans up word boundaries (remove or replace with space) according to the Unicode Text Segmentation algorithm #34402
Describe the feature:
What we would like is for the exact-match field to also drop the apostrophes that the Unicode Text Segmentation algorithm treats as word boundaries, so it matches the tokens produced for full-text search. An apostrophe that is not a word boundary, e.g. the one in "can't", would still be kept.
I know there is a pattern_replace character filter, but implementing the algorithm with it is neither straightforward nor easy: the conditions for word boundaries are diverse, which leads to a long regular expression, and the possible performance impact of evaluating such an expression also needs to be considered.
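To illustrate the distinction being requested, here is a minimal sketch in Python (not Elasticsearch configuration) of the rule that UAX #29 applies to apostrophes: an apostrophe between two letters is word-internal (MidLetter), while any other apostrophe acts as a word boundary. The function name and the space-replacement behavior are assumptions for illustration only; a real implementation would need to handle the full set of boundary rules, not just apostrophes.

```python
import re

def strip_boundary_apostrophes(text):
    # Hypothetical helper, not an Elasticsearch API.
    # [^\W\d_] matches a letter. An apostrophe NOT preceded by a letter,
    # or NOT followed by one, is a word boundary per UAX #29 and is
    # replaced with a space; an apostrophe between letters is kept.
    return re.sub(r"(?<![^\W\d_])'|'(?![^\W\d_])", " ", text)

# "can't" keeps its apostrophe (letter on both sides);
# a trailing or leading apostrophe becomes a space.
```

Even this simplified rule already needs lookaround assertions; covering the other MidLetter characters and language-specific cases is what makes a pure pattern_replace approach grow into a long and hard-to-maintain expression.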
Thanks for opening this issue and for requesting this new token filter.
I agree with you that the current way of doing what you are asking for (via pattern_replace) is a bit cumbersome, but I am wondering if a dedicated "word boundary" filter would actually be easier to use. After looking a bit at the Unicode Text Segmentation specs you referred to, I doubt that an implementation would be straightforward for general use cases. Instead I fear that any implementation will again be language- and/or use-case-specific, will require quite some additional configuration, and in the end will also suffer from irritating edge cases that are difficult for the user to understand without knowing the details of the algorithm. I might be wrong on this; it's just an initial gut feeling of mine, having worked with text segmentation a bit in the past.
I will label this issue for internal team discussion to get some further input on this.