Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds pattern keyword marker filter support #23600

Merged
merged 3 commits into from
Mar 28, 2017

Conversation

abeyad
Copy link

@abeyad abeyad commented Mar 15, 2017

This commit adds support for the pattern keyword marker filter in
Lucene. Previously, the keyword marker filter in Elasticsearch
supported specifying a keywords set or a path to a set of keywords.
This commit exposes the regular expression pattern based keyword marker
filter also available in Lucene, so that any token matching the pattern
specified by the keywords_pattern setting is excluded from being
stemmed by any stemming filters.

Closes #4877

This commit adds support for the pattern keyword marker filter in
Lucene.  Previously, the keyword marker filter in Elasticsearch
supported specifying a keywords set or a path to a set of keywords.
This commit exposes the regular expression pattern based keyword marker
filter also available in Lucene, so that any token matching the pattern
specified by the `keywords_pattern` setting is excluded from being
stemmed by any stemming filters.

Closes elastic#4877
*
* The {@link SetKeywordMarkerFilter} uses a set of keywords to denote which tokens
* should be excluded from stemming. This filter is created if the settings include
* `keywords`, which contains the list of keywords, or `keywords_path`, which
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{@code keywords} is The Javadoc Way To Do This (TM).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

throw new IllegalArgumentException(
"cannot specify both `keywords_pattern` and `keywords` or `keywords_path`");
}
keywordPattern = Pattern.compile(patternString);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to be exposing more java.util.regex stuff? We've been trying to be careful with that package. Is there a version of this that take Lucene's regexes? You can limit the time and space complexity of those....

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. However, there is no version of this filter that takes Lucene's regexes. We would have to create a new filter in Lucene or change the PatternKeywordMarkerFilter to take Lucene regexes instead of the Java regexes.

Another option is a special filter to just prevent stemming on words that are under a certain length (as requested in #4877), but I imagine a regex filter here would be more generally useful as a solution, plus Lucene already has this. The downside is that we could end up killing nodes processing strangely formatted documents with ill-conceived regexes. Yet another option is to use the Lucene PatternKeywordMarkerFilter under the covers as is, but only expose the option to pass in a minimum keyword length in ES, and we convert that to a regular expression and use the PatternKeywordMarkerFilter to achieve the desired functionality.

@jpountz what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have other analysis components that also use Java regexps, I don't think we need to block this one.

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

throw new IllegalArgumentException(
"cannot specify both `keywords_pattern` and `keywords` or `keywords_path`");
}
keywordPattern = Pattern.compile(patternString);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have other analysis components that also use Java regexps, I don't think we need to block this one.

if (settings.get("keywords") != null || settings.get("keywords_path") != null) {
throw new IllegalArgumentException(
"cannot specify both `keywords_pattern` and `keywords` or `keywords_path`");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for adding this

@abeyad abeyad merged commit 2120086 into elastic:master Mar 28, 2017
@abeyad abeyad deleted the keyword_pattern_filter branch March 28, 2017 15:13
abeyad pushed a commit that referenced this pull request Mar 28, 2017
This commit adds support for the pattern keyword marker filter in
Lucene.  Previously, the keyword marker filter in Elasticsearch
supported specifying a keywords set or a path to a set of keywords.
This commit exposes the regular expression pattern based keyword marker
filter also available in Lucene, so that any token matching the pattern
specified by the `keywords_pattern` setting is excluded from being
stemmed by any stemming filters.

Closes #4877
@abeyad
Copy link
Author

abeyad commented Mar 28, 2017

5.x commit: 4ecb1ba

@abeyad
Copy link
Author

abeyad commented Mar 28, 2017

thanks for the reviews @jpountz @nik9000

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants