Trim char filter #16489

Closed
nik9000 opened this issue Feb 6, 2016 · 9 comments
Labels
discuss · :Search Relevance/Analysis (How text is split into tokens) · Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch)

Comments

@nik9000
Member

nik9000 commented Feb 6, 2016

It'd be nice to have a trim char filter - you can totally do it with pattern_replace right now, but reaching for regexes seems like overkill when you just want to strip trailing or leading characters.
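
For illustration (an editorial sketch, not part of the original comment): the pattern_replace workaround mentioned above, configured as a char filter that strips leading and trailing dots. The index name, filter name, and pattern are assumptions.

```json
PUT /trim-demo
{
  "settings": {
    "analysis": {
      "char_filter": {
        "trim_dots": {
          "type": "pattern_replace",
          "pattern": "^\\.+|\\.+$",
          "replacement": ""
        }
      },
      "analyzer": {
        "dot_trimmed": {
          "type": "custom",
          "char_filter": ["trim_dots"],
          "tokenizer": "keyword"
        }
      }
    }
  }
}
```

The regex approach does work, which is the point of the issue: it feels heavyweight for plain character trimming.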

@nik9000 nik9000 added the discuss label Feb 6, 2016
@pickypg
Member

pickypg commented Feb 6, 2016

Love it. Makes sense when you have common things like DNS names that technically trail with a . even though no one expects them to.

@jpountz
Contributor

jpountz commented Feb 12, 2016

@nik9000 What use cases do you have in mind? We were discussing this issue in FixitFriday, and @jimferenczi noted that we already have a trim token filter, so we were wondering what use cases would be covered by a trim char filter that would not be covered by the trim token filter. The trim token filter only supports removing whitespace, but if that is the only issue, maybe it would be easier to just add the ability to trim arbitrary chars to the trim token filter?
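
For context (added here for illustration, not part of the original comment): the existing trim token filter only strips leading and trailing whitespace, as this _analyze sketch shows; the sample text is made up.

```json
GET _analyze
{
  "tokenizer": "keyword",
  "filter": ["trim"],
  "text": " elastic.co. "
}
```

The surrounding spaces are removed, but the trailing dot survives, which is the limitation being discussed.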

@nik9000
Member Author

nik9000 commented Feb 12, 2016

> already have a trim token filter

That'd probably be enough.

> only supports removing whitespace

That'd be simple to fix, yeah.

I believe this mostly comes up with the path_hierarchy tokenizer. Say you want to use it for a directory like in the docs, but you want to let users be a bit more flexible: you want searching for "foo/bar" or "/foo/bar/" to find

  • "/foo/bar"
  • "/foo/bar/baz"

I haven't tried it myself, but I was talking with @pickypg, who was trying it - the leading / was getting in the way. I reached for a char_filter because I figured it'd be simpler to think about.

In our case we were talking about DNS, which is the same thing but with the reverse flag. DNS is worse because some systems like to spit out the trailing dot, and very few people are used to it.
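
As a concrete sketch of the directory case (added for illustration, not from the thread): a pattern_replace char filter can normalize leading and trailing slashes before path_hierarchy runs, so "foo/bar" and "/foo/bar/" analyze to the same tokens. The index, filter, tokenizer, and analyzer names are assumptions.

```json
PUT /paths-demo
{
  "settings": {
    "analysis": {
      "char_filter": {
        "trim_slashes": {
          "type": "pattern_replace",
          "pattern": "^/+|/+$",
          "replacement": ""
        }
      },
      "tokenizer": {
        "dir_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "/"
        }
      },
      "analyzer": {
        "path_analyzer": {
          "type": "custom",
          "char_filter": ["trim_slashes"],
          "tokenizer": "dir_tokenizer"
        }
      }
    }
  }
}
```

With this analyzer, both "foo/bar" and "/foo/bar/" produce the tokens foo and foo/bar, so either form of the query matches documents indexed as "/foo/bar" or "/foo/bar/baz".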

@pickypg
Member

pickypg commented Feb 12, 2016

As @nik9000 said, it was due to DNS. In DNS you really want the reversed path hierarchy, and actual DNS names have a goofy . at the end:

elastic.co.
google.com.
apple.com.

So it would be ideal to trim that off because otherwise the first token is:

co.
com.
com.

I certainly wouldn't expect to search for that with a term filter that wants to only find the .com TLD. I'd either try .com or com, but never com..
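
To make the DNS case concrete (an editorial sketch, not part of the comment): a reversed path_hierarchy tokenizer with . as the delimiter, combined with a pattern_replace char filter that drops the trailing dot. The pattern and sample text are assumptions.

```json
GET _analyze
{
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "\\.$",
      "replacement": ""
    }
  ],
  "tokenizer": {
    "type": "path_hierarchy",
    "delimiter": ".",
    "reverse": true
  },
  "text": "elastic.co."
}
```

With the trailing dot stripped first, the suffix tokens come out as elastic.co and co rather than elastic.co. and co..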

@clintongormley clintongormley added the :Search Relevance/Analysis How text is split into tokens label Feb 13, 2016
@clintongormley

I'm not convinced that we need to expand the trim character filter. The examples provided are actually quite complex... perhaps you want to trim front and end, or just front or just end... The char filter works on the whole field value, not on individual terms, which makes it less flexible.

The flexibility required is easily supported by regexes - the pattern_replace token filter looks like the perfect solution here.
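
For reference (an editorial illustration, not part of the comment): the pattern_replace token filter runs per token, after tokenization, so it can trim a trailing dot from each term individually. The pattern and sample text are assumptions.

```json
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "pattern_replace",
      "pattern": "\\.+$",
      "replacement": ""
    }
  ],
  "text": "elastic.co. google.com."
}
```

This yields the tokens elastic.co and google.com, which is the kind of per-term trimming being asked for.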

@cbuescher
Member

Just to add a data point: a similar use case came up in the forums last week: https://discuss.elastic.co/t/solved-question-about-custom-analyzer/48680
This user was trying to search on IP address prefixes (like 192.168.1.1), so I suggested using the path_hierarchy tokenizer, but for reasons similar to those @nik9000 stated above it was necessary to remove trailing dots from the search input. pattern_replace was doing the trick, but I was also surprised that trim is limited to whitespace.
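
To illustrate the forum case (an editorial sketch, not from the thread): a non-reversed path_hierarchy tokenizer with . as the delimiter emits the prefix tokens of an IP address; the sample text is an assumption.

```json
GET _analyze
{
  "tokenizer": {
    "type": "path_hierarchy",
    "delimiter": "."
  },
  "text": "192.168.1.1"
}
```

This produces 192, 192.168, 192.168.1 and 192.168.1.1. If the search input carries a trailing dot, the last emitted token keeps that dot and does not line up with the indexed prefixes, which is presumably why it had to be stripped.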

@cbuescher
Member

Is this something we still want to do or can we close this issue?

Adding the ability to trim arbitrary chars to the trim token filter seems to solve most of the above use cases and should be easy to do. Is there anything that speaks against doing this?

@romseygeek
Contributor

cc @elastic/es-search-aggs

@jpountz
Contributor

jpountz commented Mar 15, 2018

Like Clint, I'm on the fence about expanding the set of characters that can be trimmed. I feel like pattern_replace is a fine solution. I'm closing this issue since we haven't reached consensus over the 2+ years that this issue has been open. Please reopen if you feel strongly that we should do something here.

@jpountz jpountz closed this as completed Mar 15, 2018
@javanna javanna added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Jul 16, 2024