Analysis: protecting tokens based on their length #4877

Closed
loren opened this issue Jan 23, 2014 · 5 comments
Labels
>feature help wanted adoptme :Search/Analysis How text is split into tokens

Comments

@loren

loren commented Jan 23, 2014

I would like to be able to tell the keyword marker to protect tokens 1-4 characters in length, or tell the minimal english stemmer to ignore tokens shorter than 5 characters.

Perhaps the more generic thing to have would be a Minimum Length Keyword Marker that could go in front of the other filters.

Based on discussion at https://groups.google.com/forum/#!msg/elasticsearch/uFlKWq2HvQk/mM8KjaItPH0J
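
To make the request concrete, here is a toy sketch of how a length-based keyword marker placed in front of a stemmer might behave. It is plain Python, not Lucene or Elasticsearch code: `Token`, `mark_short_tokens`, and `toy_plural_stemmer` are hypothetical names, and the "stemmer" only strips a trailing `s` as a stand-in for a real one such as the minimal English stemmer.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Token:
    text: str
    keyword: bool = False  # when True, downstream stemmers leave the token alone


def mark_short_tokens(tokens: Iterable[Token], max_protected_length: int = 4) -> Iterator[Token]:
    """Flag every token of at most `max_protected_length` characters as a keyword."""
    for token in tokens:
        if len(token.text) <= max_protected_length:
            token.keyword = True
        yield token


def toy_plural_stemmer(tokens: Iterable[Token]) -> Iterator[Token]:
    """Hypothetical stand-in for a stemmer: strips one trailing 's' unless the token is flagged."""
    for token in tokens:
        if not token.keyword and token.text.endswith("s"):
            token.text = token.text[:-1]
        yield token


def analyze(words: Iterable[str]) -> List[str]:
    """Run the marker in front of the stemmer, as the issue proposes."""
    tokens = (Token(word) for word in words)
    return [token.text for token in toy_plural_stemmer(mark_short_tokens(tokens))]


if __name__ == "__main__":
    print(analyze(["bus", "gas", "engines"]))
```

In this sketch the short tokens `bus` and `gas` pass through untouched, while the longer `engines` is still stemmed.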

@ilanrivers

Is there an expected date when this will be implemented?

@clintongormley clintongormley added :Search/Analysis How text is split into tokens >feature labels Dec 24, 2014
@clintongormley

This should be done in Lucene - nice easy one to adopt

@mikemccand
Contributor

> This should be done in Lucene - nice easy one to adopt

+1

@abeyad abeyad self-assigned this Feb 27, 2017
abeyad pushed a commit to abeyad/elasticsearch that referenced this issue Mar 15, 2017
This commit adds support for the pattern keyword marker filter in
Lucene.  Previously, the keyword marker filter in Elasticsearch
supported specifying a keywords set or a path to a set of keywords.
This commit exposes the regular expression pattern based keyword marker
filter also available in Lucene, so that any token matching the pattern
specified by the `keywords_pattern` setting is excluded from being
stemmed by any stemming filters.

Closes elastic#4877
abeyad pushed a commit that referenced this issue Mar 28, 2017
This commit adds support for the pattern keyword marker filter in
Lucene.  Previously, the keyword marker filter in Elasticsearch
supported specifying a keywords set or a path to a set of keywords.
This commit exposes the regular expression pattern based keyword marker
filter also available in Lucene, so that any token matching the pattern
specified by the `keywords_pattern` setting is excluded from being
stemmed by any stemming filters.

Closes #4877
@abeyad

abeyad commented Mar 28, 2017

@loren we have exposed the pattern keyword marker token filter from Lucene in ES: #23600

With this, you could specify your minimum token length as a regular expression pattern.
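
As a rough illustration of that suggestion, the sketch below builds index settings that protect tokens of 1 to 4 characters from a minimal English stemmer. The `keywords_pattern` setting comes from the commit message above. The filter and analyzer names (`protect_short_tokens`, `minimal_english_stemmer`, `short_token_safe_english`) are made up for this example, and the pattern `.{1,4}` assumes the filter requires the whole token to match. The `is_protected` helper only approximates that behaviour with Python's `re` module, which is not identical to Java's regex engine.

```python
import json
import re

# Assumed pattern: any token of 1 to 4 characters, matched against the whole token.
KEYWORDS_PATTERN = r".{1,4}"

# Hypothetical index settings; the filter and analyzer names are illustrative only.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "protect_short_tokens": {
                    "type": "keyword_marker",
                    "keywords_pattern": KEYWORDS_PATTERN,
                },
                "minimal_english_stemmer": {
                    "type": "stemmer",
                    "language": "minimal_english",
                },
            },
            "analyzer": {
                "short_token_safe_english": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # The keyword marker must come before the stemmer it guards.
                    "filter": [
                        "lowercase",
                        "protect_short_tokens",
                        "minimal_english_stemmer",
                    ],
                }
            },
        }
    }
}


def is_protected(token: str) -> bool:
    """Local approximation: True when the whole token matches the pattern."""
    return re.fullmatch(KEYWORDS_PATTERN, token) is not None


if __name__ == "__main__":
    print(json.dumps(index_settings, indent=2))
    for word in ["bus", "gas", "buses", "engines"]:
        print(word, "protected" if is_protected(word) else "stemmed")
```

The order of the filter list matters here: the marker has to run before any stemming filter, otherwise the short tokens would already have been stemmed by the time they are flagged.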

@loren
Author

loren commented Mar 28, 2017

Excellent! And a much more widely useful solution than I had expected. Thanks!

@jimczi jimczi removed this from Search & Aggs in Background tasks Apr 4, 2017