Analysis: protecting tokens based on their length #4877

loren · 2014-01-23T22:00:25Z

I would like to be able to tell the keyword marker to protect tokens 1-4 characters in length, or tell the minimal english stemmer to ignore tokens shorter than 5 characters.

Perhaps the more generic thing to have would be a Minimum Length Keyword Marker that could go in front of the other filters.

Based on discussion at https://groups.google.com/forum/#!msg/elasticsearch/uFlKWq2HvQk/mM8KjaItPH0J

ilanrivers · 2014-12-03T12:18:43Z

Is there an expected date when this will be implemented?

clintongormley · 2016-12-23T10:41:33Z

This should be done in Lucene - nice easy one to adopt

mikemccand · 2016-12-23T19:22:35Z

> This should be done in Lucene - nice easy one to adopt

+1

This commit adds support for the pattern keyword marker filter in Lucene. Previously, the keyword marker filter in Elasticsearch supported specifying a keywords set or a path to a set of keywords. This commit exposes the regular expression pattern based keyword marker filter also available in Lucene, so that any token matching the pattern specified by the `keywords_pattern` setting is excluded from being stemmed by any stemming filters. Closes elastic#4877

abeyad · 2017-03-28T15:26:09Z

@loren we have exposed the pattern keyword marker token filter from Lucene in ES: #23600

With this, you could specify your minimum token length as a regular expression pattern
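As a rough illustration of that suggestion, the sketch below builds a length-based pattern and places it in a `keyword_marker` filter ahead of a stemmer. It rests on assumptions not stated in this thread: that the pattern must match the whole token (mirrored here with `re.fullmatch`), and that the filter and analyzer names (`protect_short_tokens`, `short_safe_english`) and the helper functions are hypothetical. Only the `keywords_pattern` setting name comes from the commit description above.

```python
import re


def short_token_pattern(max_len: int) -> str:
    """Hypothetical helper: regex matching tokens of 1..max_len characters."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    return ".{1,%d}" % max_len


def is_protected(token: str, pattern: str) -> bool:
    """Approximate the filter's decision, assuming whole-token matching."""
    return re.fullmatch(pattern, token) is not None


# Sketch of index analysis settings; all names under "filter" and
# "analyzer" are illustrative, not taken from the issue.
analysis_settings = {
    "analysis": {
        "filter": {
            "protect_short_tokens": {
                "type": "keyword_marker",
                # Protect tokens of 1-4 characters from stemming.
                "keywords_pattern": short_token_pattern(4),
            },
            "minimal_english_stemmer": {
                "type": "stemmer",
                "language": "minimal_english",
            },
        },
        "analyzer": {
            "short_safe_english": {
                "type": "custom",
                "tokenizer": "standard",
                # The marker must come before the stemmer to have any effect.
                "filter": [
                    "lowercase",
                    "protect_short_tokens",
                    "minimal_english_stemmer",
                ],
            }
        },
    }
}
```

Under these assumptions, a four-character token such as "bass" would be marked as a keyword and skipped by the stemmer, while "boxes" would still be stemmed.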

loren · 2017-03-28T15:28:53Z

Excellent! And a much more widely useful solution than I had expected. Thanks!

ghost assigned jpountz Jan 23, 2014

clintongormley added the help wanted and adoptme labels Oct 17, 2014

clintongormley unassigned jpountz Oct 17, 2014

clintongormley added the :Search/Analysis and >feature labels Dec 24, 2014

abeyad self-assigned this Feb 27, 2017

abeyad mentioned this issue Mar 15, 2017

Adds pattern keyword marker filter support #23600

Merged

abeyad closed this as completed in #23600 Mar 28, 2017

jimczi removed this from Search & Aggs in Background tasks Apr 4, 2017
