Trim char filter #16489
Comments
Love it. Makes sense when you have common things like DNS names that technically trail with |
@nik9000 What use cases do you have in mind? We were discussing this issue in Fixit Friday, and @jimferenczi noted that we already have a trim token filter, so we were wondering which use cases a trim char filter would cover that a trim token filter would not. The trim token filter only supports removing whitespace, but if that is the only issue, maybe it would be easier to just add the ability to trim arbitrary chars to the trim token filter?
That'd probably be enough.
That'd be simple to fix, yeah. I believe the thing mostly comes up in the path_hierarchy tokenizer. Say you want to use it for a directory like in the docs but you want users to be a bit more fluid. You want searching for "foo/bar" or "/foo/bar/" to find
I haven't tried it myself, but I was talking with @pickypg, who was trying it, and the leading / was getting in the way. I reached for a char_filter because I figured it'd be simpler to think about. In our case we were talking about DNS, which is the same but with the reverse flag. DNS is worse because some systems like to spit out the trailing dot, and very few people are used to it.
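To make the leading-delimiter problem concrete, here is a rough Python sketch. `path_prefixes` is a hypothetical toy stand-in for a path-hierarchy style tokenizer, not Elasticsearch's implementation, and it makes no claim about how the real tokenizer treats a trailing delimiter. It only illustrates why "/foo/bar" and "foo/bar" share no tokens unless the edges are trimmed first.

```python
def path_prefixes(path, delimiter="/"):
    """Toy stand-in for a path-hierarchy tokenizer: emit each cumulative prefix.

    Hypothetical helper for illustration only; not Elasticsearch code.
    """
    parts = path.split(delimiter)
    tokens = []
    for i in range(1, len(parts) + 1):
        token = delimiter.join(parts[:i])
        if token:  # skip the empty prefix produced by a leading delimiter
            tokens.append(token)
    return tokens


# Without trimming, the leading slash ends up inside every token.
with_slash = path_prefixes("/foo/bar")
without_slash = path_prefixes("foo/bar")
print(with_slash)
print(without_slash)
print(set(with_slash) & set(without_slash))  # no shared tokens

# Trimming the delimiter from both ends first makes the inputs agree.
print(path_prefixes("/foo/bar/".strip("/")))
```

Whether the trimming happens in a char filter (before tokenizing) or a token filter (after) is exactly the design question in this thread; the sketch trims before tokenizing.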
As @nik9000 said, it was due to DNS. In DNS, you really want the reversed path hierarchy and actual DNS has a goofy
So it would be ideal to trim that off because otherwise the first token is:
I certainly wouldn't expect to search for that with a term filter that wants to only find the |
I'm not convinced that we need to expand the The flexibility required is easily supported by regexes - the |
Just to add a data point, a similar use case came up in the forums last week: https://discuss.elastic.co/t/solved-question-about-custom-analyzer/48680
Is this something we still want to do or can we close this issue? Adding the ability to trim arbitrary chars to the trim token filter seems to solve most of the above use cases and should be easy to do. Is there anything that speaks against doing this?
cc @elastic/es-search-aggs
Like Clint I'm on the fence about expanding the set of characters that can be trimmed. I feel like |
It'd be nice to have a trim char filter. You can totally do it with pattern_replace right now, but reaching for regexes seems like overkill when you just want to strip trailing or leading characters.
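For reference, the regex workaround might look roughly like the sketch below. The pattern, the `trim_chars` helper, and the names `trim_edges` and `trimmed_path` are illustrative assumptions, not taken from this issue. The settings dict assumes the `pattern_replace` char filter's `pattern` and `replacement` parameters and the built-in `path_hierarchy` tokenizer. The Python function only mirrors what that pattern would do to the input text.

```python
import re

# Assumed pattern: strip runs of "/" or "." at the very start or very end only.
TRIM_PATTERN = r"^[/.]+|[/.]+$"


def trim_chars(text):
    """Remove leading and trailing '/' and '.' characters, leaving inner ones."""
    return re.sub(TRIM_PATTERN, "", text)


# Hypothetical index settings expressing the same idea as a char filter.
analysis_settings = {
    "analysis": {
        "char_filter": {
            "trim_edges": {  # hypothetical name
                "type": "pattern_replace",
                "pattern": TRIM_PATTERN,
                "replacement": "",
            }
        },
        "analyzer": {
            "trimmed_path": {  # hypothetical name
                "type": "custom",
                "char_filter": ["trim_edges"],
                "tokenizer": "path_hierarchy",
            }
        },
    }
}

print(trim_chars("/foo/bar/"))
print(trim_chars("example.com."))
```

This works, but it asks the user to reason about anchors and alternation for what is conceptually a plain strip of edge characters, which is the overkill being described.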