Trim char filter #16489

Closed
nik9000 opened this issue Feb 6, 2016 · 9 comments
Labels
discuss · :Search Relevance/Analysis (How text is split into tokens) · Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch)

Comments

@nik9000
Member

nik9000 commented Feb 6, 2016

It'd be nice to have a trim char filter - you can totally do it with pattern_replace right now, but reaching for regexes seems like overkill when you just want to strip trailing or leading characters.
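
For illustration (an editorial sketch, not part of the original comment): the pattern_replace workaround mentioned above, configured as a char filter that strips leading and trailing dots. The index name, filter name, and pattern are assumptions.

```json
PUT /trim-demo
{
  "settings": {
    "analysis": {
      "char_filter": {
        "trim_dots": {
          "type": "pattern_replace",
          "pattern": "^\\.+|\\.+$",
          "replacement": ""
        }
      },
      "analyzer": {
        "dot_trimmed": {
          "type": "custom",
          "char_filter": ["trim_dots"],
          "tokenizer": "keyword"
        }
      }
    }
  }
}
```

The regex approach does work, which is the point of the issue: it feels heavyweight for plain character trimming.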

@nik9000 nik9000 added the discuss label Feb 6, 2016
@pickypg
Member

pickypg commented Feb 6, 2016

Love it. Makes sense when you have common things like DNS names that technically trail with a . even though no one expects them to.

@jpountz
Contributor

jpountz commented Feb 12, 2016

@nik9000 What use cases do you have in mind? We were discussing this issue in FixitFriday, and @jimferenczi noted that we already have a trim token filter, so we were wondering what use cases would be covered by a trim char filter that would not be covered by the trim token filter. The trim token filter only supports removing whitespace, but if that is the only issue, maybe it would be easier to just add the ability to trim arbitrary chars to the trim token filter?
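
For context (added here for illustration, not part of the original comment): the existing trim token filter only strips leading and trailing whitespace, as this _analyze sketch shows; the sample text is made up.

```json
GET _analyze
{
  "tokenizer": "keyword",
  "filter": ["trim"],
  "text": " elastic.co. "
}
```

The surrounding spaces are removed, but the trailing dot survives, which is the limitation being discussed.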

@nik9000
Member Author

nik9000 commented Feb 12, 2016

> already have a trim token filter

That'd probably be enough.

> only supports removing whitespace

That'd be simple to fix, yeah.

I believe this mostly comes up with the path_hierarchy tokenizer. Say you want to use it for a directory like in the docs, but you want to let users be a bit more flexible: you want searching for "foo/bar" or "/foo/bar/" to find

  • "/foo/bar"
  • "/foo/bar/baz"

I haven't tried it myself, but I was talking with @pickypg, who was trying it - the leading / was getting in the way. I reached for a char_filter because I figured it'd be simpler to think about.

In our case we were talking about DNS, which is the same thing but with the reverse flag. DNS is worse because some systems like to spit out the trailing dot, and very few people are used to it.
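
As a concrete sketch of the directory case (added for illustration, not from the thread): a pattern_replace char filter can normalize leading and trailing slashes before path_hierarchy runs, so "foo/bar" and "/foo/bar/" analyze to the same tokens. The index, filter, tokenizer, and analyzer names are assumptions.

```json
PUT /paths-demo
{
  "settings": {
    "analysis": {
      "char_filter": {
        "trim_slashes": {
          "type": "pattern_replace",
          "pattern": "^/+|/+$",
          "replacement": ""
        }
      },
      "tokenizer": {
        "dir_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "/"
        }
      },
      "analyzer": {
        "path_analyzer": {
          "type": "custom",
          "char_filter": ["trim_slashes"],
          "tokenizer": "dir_tokenizer"
        }
      }
    }
  }
}
```

With this analyzer, both "foo/bar" and "/foo/bar/" produce the tokens foo and foo/bar, so either form of the query matches documents indexed as "/foo/bar" or "/foo/bar/baz".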

@pickypg
Member

pickypg commented Feb 12, 2016

As @nik9000 said, it was due to DNS. In DNS you really want the reversed path hierarchy, and actual DNS names have a goofy . at the end:

elastic.co.
google.com.
apple.com.

So it would be ideal to trim that off because otherwise the first token is:

co.
com.
com.

I certainly wouldn't expect to search for that with a term filter that wants to only find the .com TLD. I'd either try .com or com, but never com..
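
To make the DNS case concrete (an editorial sketch, not part of the comment): a reversed path_hierarchy tokenizer with . as the delimiter, combined with a pattern_replace char filter that drops the trailing dot. The pattern and sample text are assumptions.

```json
GET _analyze
{
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "\\.$",
      "replacement": ""
    }
  ],
  "tokenizer": {
    "type": "path_hierarchy",
    "delimiter": ".",
    "reverse": true
  },
  "text": "elastic.co."
}
```

With the trailing dot stripped first, the suffix tokens come out as elastic.co and co rather than elastic.co. and co..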

@clintongormley clintongormley added the :Search Relevance/Analysis How text is split into tokens label Feb 13, 2016
@clintongormley

I'm not convinced that we need to expand the trim character filter. The examples provided are actually quite complex... perhaps you want to trim front and end, or just front or just end... The char filter works on the whole field value, not on individual terms, which makes it less flexible.

The flexibility required is easily supported by regexes - the pattern_replace token filter looks like the perfect solution here.
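
For reference (an editorial illustration, not part of the comment): the pattern_replace token filter runs per token, after tokenization, so it can trim a trailing dot from each term individually. The pattern and sample text are assumptions.

```json
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "pattern_replace",
      "pattern": "\\.+$",
      "replacement": ""
    }
  ],
  "text": "elastic.co. google.com."
}
```

This yields the tokens elastic.co and google.com, which is the kind of per-term trimming being asked for.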

@cbuescher
Member

Just to add a data point: a similar use case came up in the forums last week: https://discuss.elastic.co/t/solved-question-about-custom-analyzer/48680
This user was trying to search on IP address prefixes (like 192.168.1.1), so I suggested using the path_hierarchy tokenizer, but for reasons similar to those @nik9000 stated above it was necessary to remove trailing dots from the search input. pattern_replace was doing the trick, but I was also surprised that trim is limited to whitespace.
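
To illustrate the forum case (an editorial sketch, not from the thread): a non-reversed path_hierarchy tokenizer with . as the delimiter emits the prefix tokens of an IP address; the sample text is an assumption.

```json
GET _analyze
{
  "tokenizer": {
    "type": "path_hierarchy",
    "delimiter": "."
  },
  "text": "192.168.1.1"
}
```

This produces 192, 192.168, 192.168.1 and 192.168.1.1. If the search input carries a trailing dot, the last emitted token keeps that dot and does not line up with the indexed prefixes, which is presumably why it had to be stripped.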

@cbuescher
Member

Is this something we still want to do or can we close this issue?

Adding the ability to trim arbitrary chars to the trim token filter seems to solve most of the above use cases and should be easy to do. Is there anything that speaks against doing this?

@romseygeek
Contributor

cc @elastic/es-search-aggs

@jpountz
Contributor

jpountz commented Mar 15, 2018

Like Clint, I'm on the fence about expanding the set of characters that can be trimmed. I feel like pattern_replace is a fine solution. I'm closing this issue since we haven't reached consensus over the 2+ years that this issue has been open. Please reopen if you feel strongly that we should do something here.

@jpountz jpountz closed this as completed Mar 15, 2018
@javanna javanna added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Jul 16, 2024