
Allow all token/char filters in normalizers #43803

Closed
wants to merge 3 commits

Conversation

romseygeek
Contributor

Normalizers are analyzers defined on keyword fields, using either keyword
or whitespace tokenizers. There is currently an additional restriction, in that
only character-by-character char_filters and token filters are permitted as
part of a normalizer tokenization chain; however, this restriction only really
makes sense for wildcard normalization, not for keyword fields. This commit
allows all char_filter and token filter types to be used in a normalizer.

Closes #43758
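As a sketch of the user-facing configuration (the index, field, and filter names here are illustrative, not taken from this PR), a normalizer is defined in index settings much like a custom analyzer, but without a tokenizer, and is attached to a keyword field:

```json
{
  "settings": {
    "analysis": {
      "char_filter": {
        "quote_strip": {
          "type": "mapping",
          "mappings": ["« => \"", "» => \""]
        }
      },
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": ["quote_strip"],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": { "type": "keyword", "normalizer": "my_normalizer" }
    }
  }
}
```

Before this change, only filters that operate character-by-character (such as `lowercase` and `asciifolding` above) are accepted in the `char_filter` and `filter` arrays of a normalizer.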

@elasticmachine
Collaborator

Pinging @elastic/es-search

Member

@cbuescher left a comment


I took a first look and left a few comments. One additional question: this PR looks like it mostly changes the internal implementation of normalizers, so what does it change from the user's side? Sorry if I missed that part when reading through #43758; I'm not that familiar with the details mentioned there yet. I also looked at NormalizingCharFilterFactory and NormalizingTokenFilterFactory mentioned in that issue. To me they look like pure marker interfaces at this point and might be candidates for removal; if so, we could probably add that to this PR. Otherwise let me know what I'm missing.


import static java.util.Collections.singletonList;
import static java.util.Collections.singletonMap;

public class CustomNormalizerTests extends ESTokenStreamTestCase {
Member

Do we still need this test when CustomNormalizerProvider is gone? I saw there is also no CustomAnalyzerTests, maybe this could be renamed and a few tests added that test the CustomAnalyzer parts that are not covered here?

Contributor Author

There's still a distinction in settings between analyzers and normalizers, so I think this test is still useful?

Map<String, TokenFilterFactory> tokenFilters,
Map<String, CharFilterFactory> charFilters) {
if (tokenizerFactory == null) {
throw new IllegalStateException("keyword tokenizer factory is null, normalizers require analysis-common module");
Member

Should we keep this check and move it inside the function further down that calls tokenizerSupplier.get()? I'm not really sure it's useful, but it might provide a helpful hint about a misconfigured cluster (that's the only way I can see analysis-common going missing). If you re-add this, it should get a test too.

Contributor Author

If we're missing analysis-common then I don't think anything is going to work? The error message here is out of date anyway, as we build normalizers with both keyword and whitespace tokenizers, but I'll change the caller to pass KeywordTokenizer::new rather than calling into the tokenizer map to ensure that this would never be a problem.

final Map<String, CharFilterFactory> charFilters, final Map<String, TokenFilterFactory> tokenFilters) {
String tokenizerName = analyzerSettings.get("tokenizer");
if (tokenizerName == null) {
throw new IllegalArgumentException("Custom Analyzer [" + name + "] must be configured with a tokenizer");
Member

Do we make sure the name is set elsewhere, or is this check not necessary anymore? Otherwise it might be good to keep; I haven't checked what the follow-up exception would be if the name is null when the function is called. It might be more cryptic.

Contributor Author

The check has been moved to AnalysisRegistry#buildMapping, line 414

@romseygeek
Contributor Author

what does it change from the users side?

You can now use any char filter or token filter when defining normalizers. We have a number of issues asking to make such-and-such a filter available; see e.g. #37344, #27310, #23142.
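For illustration of what this would enable (the choice of `trim` here is an assumption about the kind of filter the linked issues request; it is one of the built-in filters not on the character-by-character whitelist that normalizers enforce before this change), a normalizer could then combine arbitrary built-in filters:

```json
{
  "settings": {
    "analysis": {
      "normalizer": {
        "trimmed_lowercase": {
          "type": "custom",
          "filter": ["trim", "lowercase"]
        }
      }
    }
  }
}
```

A keyword field using `trimmed_lowercase` would then index and search `"  Foo  "` and `"foo"` as the same term.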

@jpountz jpountz added v7.4.0 and removed v7.3.0 labels Jul 3, 2019
@romseygeek
Contributor Author

I'm closing this, as @jimczi points out that we need to prevent stacked tokens appearing in search-time normalizers. We still need to distinguish between partial-term normalization and whole-term normalization, but just allowing everything isn't going to help here.

Successfully merging this pull request may close these issues.

Allow all token and character filters to be used in normalizers
5 participants