-
Notifications
You must be signed in to change notification settings - Fork 24.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds pattern keyword marker filter support #23600
Conversation
This commit adds support for the pattern keyword marker filter in Lucene. Previously, the keyword marker filter in Elasticsearch supported specifying a keywords set or a path to a set of keywords. This commit exposes the regular expression pattern based keyword marker filter also available in Lucene, so that any token matching the pattern specified by the `keywords_pattern` setting is excluded from being stemmed by any stemming filters. Closes elastic#4877
* | ||
* The {@link SetKeywordMarkerFilter} uses a set of keywords to denote which tokens | ||
* should be excluded from stemming. This filter is created if the settings include | ||
* `keywords`, which contains the list of keywords, or `keywords_path`, which |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{@code keywords}
is The Javadoc Way To Do This (TM).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
throw new IllegalArgumentException( | ||
"cannot specify both `keywords_pattern` and `keywords` or `keywords_path`"); | ||
} | ||
keywordPattern = Pattern.compile(patternString); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to be exposing more java.util.regex stuff? We've been trying to be careful with that package. Is there a version of this that take Lucene's regexes? You can limit the time and space complexity of those....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. However, there is no version of this filter that takes Lucene's regexes. We would have to create a new filter in Lucene or change the PatternKeywordMarkerFilter
to take Lucene regexes instead of the Java regexes.
Another option is a special filter to just prevent stemming on words that are under a certain length (as requested in #4877), but I imagine a regex filter here would be more generally useful as a solution, plus Lucene already has this. The downside is that we could end up killing nodes processing strangely formatted documents with ill-conceived regexes. Yet another option is to use the Lucene PatternKeywordMarkerFilter
under the covers as is, but only expose the option to pass in a minimum keyword length in ES, and we convert that to a regular expression and use the PatternKeywordMarkerFilter
to achieve the desired functionality.
@jpountz what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have other analysis components that also use Java regexps, I don't think we need to block this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
throw new IllegalArgumentException( | ||
"cannot specify both `keywords_pattern` and `keywords` or `keywords_path`"); | ||
} | ||
keywordPattern = Pattern.compile(patternString); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have other analysis components that also use Java regexps, I don't think we need to block this one.
if (settings.get("keywords") != null || settings.get("keywords_path") != null) { | ||
throw new IllegalArgumentException( | ||
"cannot specify both `keywords_pattern` and `keywords` or `keywords_path`"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for adding this
This commit adds support for the pattern keyword marker filter in Lucene. Previously, the keyword marker filter in Elasticsearch supported specifying a keywords set or a path to a set of keywords. This commit exposes the regular expression pattern based keyword marker filter also available in Lucene, so that any token matching the pattern specified by the `keywords_pattern` setting is excluded from being stemmed by any stemming filters. Closes #4877
5.x commit: 4ecb1ba |
This commit adds support for the pattern keyword marker filter in
Lucene. Previously, the keyword marker filter in Elasticsearch
supported specifying a keywords set or a path to a set of keywords.
This commit exposes the regular expression pattern based keyword marker
filter also available in Lucene, so that any token matching the pattern
specified by the
keywords_pattern
setting is excluded from beingstemmed by any stemming filters.
Closes #4877