Ensure TokenFilters only produce single tokens when parsing synonyms #34331

Merged · 21 commits · Nov 29, 2018
Changes from 8 commits
a10b310
Ensure TokenFilters only produce single tokens when used for parsing …
romseygeek Oct 5, 2018
b5ca5fe
Merge remote-tracking branch 'origin/master' into synonymfilters
romseygeek Oct 15, 2018
075f2e3
Allow a specific set of filters to be used to parse synonyms; throw e…
romseygeek Oct 17, 2018
3bf6a1f
checkstyle
romseygeek Oct 17, 2018
dd90dbd
Merge remote-tracking branch 'origin/master' into synonymfilters
romseygeek Oct 24, 2018
7a444ce
Depend on leniency to determine whether or not to include filter for …
romseygeek Oct 24, 2018
9225478
Use to choose whether or not to apply filters
romseygeek Oct 24, 2018
61391a4
checkstyle
romseygeek Oct 24, 2018
087419c
feedback
romseygeek Oct 29, 2018
83b63d5
feedback
romseygeek Oct 29, 2018
906f747
Merge remote-tracking branch 'origin/master' into synonymfilters
romseygeek Nov 6, 2018
b135bd7
Merge remote-tracking branch 'origin/master' into synonymfilters
romseygeek Nov 16, 2018
3f99aca
Remove lenient option; add phonetic filter suppression
romseygeek Nov 16, 2018
bf0273a
Merge remote-tracking branch 'origin/master' into synonymfilters
romseygeek Nov 16, 2018
ac17329
Allow multiple synonym filters - WIP needs tests
romseygeek Nov 16, 2018
b482d7e
Merge remote-tracking branch 'origin/master' into synonymfilters
romseygeek Nov 19, 2018
97e96d9
checkstyle
romseygeek Nov 19, 2018
6e0ecb8
Allow chained synonym filters
romseygeek Nov 19, 2018
bf78b6a
Merge remote-tracking branch 'origin/master' into synonymfilters
romseygeek Nov 20, 2018
20acfcd
Multiplexer only returns IDENTITY if preserve_original=true
romseygeek Nov 28, 2018
4cb6a30
Merge remote-tracking branch 'origin/master' into synonymfilters
romseygeek Nov 28, 2018
@@ -175,3 +175,16 @@ PUT /test_index

Using `synonyms_path` to define WordNet synonyms in a file is supported
as well.

=== Parsing synonym files

Elasticsearch will use the token filters preceding the synonym filter
in a tokenizer chain to parse the entries in a synonym file. So, for example, if a
synonym filter is placed after a stemmer, then the stemmer will also be applied
to the synonym entries. Because entries in the synonym map cannot have stacked
positions, some token filters may cause issues here. Token filters that produce
multiple versions of a token may choose which version to emit when parsing
synonyms; for example, `asciifolding` will only produce the folded version of the
token. Others, such as `multiplexer`, `word_delimiter_graph` or `ngram`, will throw
an error unless `lenient` has been set to `true`, in which case a best-effort
attempt to apply the filter will be made.
docs/reference/analysis/tokenfilters/synonym-tokenfilter.asciidoc (14 additions, 0 deletions)
@@ -163,3 +163,17 @@ PUT /test_index

Using `synonyms_path` to define WordNet synonyms in a file is supported
as well.


=== Parsing synonym files

Elasticsearch will use the token filters preceding the synonym filter
in a tokenizer chain to parse the entries in a synonym file. So, for example, if a
synonym filter is placed after a stemmer, then the stemmer will also be applied
to the synonym entries. Because entries in the synonym map cannot have stacked
positions, some token filters may cause issues here. Token filters that produce
multiple versions of a token may choose which version to emit when parsing
synonyms; for example, `asciifolding` will only produce the folded version of the
token. Others, such as `multiplexer`, `word_delimiter_graph` or `ngram`, will throw
an error unless `lenient` has been set to `true`, in which case a best-effort
attempt to apply the filter will be made.
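The per-filter decision the docs describe can be sketched with a simplified contract. The types below are illustrative stand-ins, not the actual Elasticsearch or Lucene classes:

```java
// Simplified sketch of the behaviour described above; these types are
// hypothetical stand-ins, not the real TokenFilterFactory API.
interface SynonymAwareFilter {
    /** The filter to apply while parsing synonym rules. */
    SynonymAwareFilter getSynonymFilter(boolean lenient);
}

final class IdentityFilter implements SynonymAwareFilter {
    static final IdentityFilter INSTANCE = new IdentityFilter();

    @Override
    public SynonymAwareFilter getSynonymFilter(boolean lenient) {
        return this;
    }
}

// A filter that stacks multiple tokens at one position (an ngram-style
// filter, say) cannot be used to parse synonym rules: under lenient it
// steps aside, otherwise it rejects the chain.
final class MultiTokenFilter implements SynonymAwareFilter {
    @Override
    public SynonymAwareFilter getSynonymFilter(boolean lenient) {
        if (lenient) {
            return IdentityFilter.INSTANCE; // best effort: skip this filter
        }
        throw new IllegalArgumentException(
            "Token filter cannot be used to parse synonyms unless [lenient] is set to true");
    }
}
```

In this sketch, single-token filters return themselves, while multi-token producers either drop out of the synonym-parsing chain (lenient) or fail fast, mirroring the error message used throughout this PR.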
@@ -51,7 +51,7 @@ public TokenStream create(TokenStream tokenStream) {
}

@Override
public Object getMultiTermComponent() {
public TokenFilterFactory getSynonymFilter(boolean lenient) {
if (preserveOriginal == false) {
return this;
} else {
@@ -68,4 +68,9 @@ public TokenStream create(TokenStream tokenStream) {
};
}
}

@Override
public Object getMultiTermComponent() {
return getSynonymFilter(true);
Contributor:
I wonder if this works, since the synonym filter checks only the first token at each position; if the first token is the original one, there is no chance that the modified one can match.

Contributor Author:
The folded token is always emitted first, so that's the one we need to match against the synonym map. Hence, we return a filter that only emits folded tokens.

}
}
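The exchange above hinges on emit order: because the synonym parser considers one token per position, the filter returned for synonym parsing emits only the folded form. A minimal sketch of that idea, using `java.text.Normalizer` as a rough stand-in for ASCII folding (not the actual Lucene `ASCIIFoldingFilter`):

```java
import java.text.Normalizer;

// Sketch: when parsing synonym rules, emit only the folded form of each
// token. fold() approximates ASCII folding by decomposing characters and
// stripping combining marks; the real filter handles many more cases.
class FoldedOnly {
    static String fold(String token) {
        return Normalizer.normalize(token, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "");
    }
}
```

Since the folded token is what ends up in the synonym map, input text folded at search or index time can match it without needing stacked positions.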
@@ -26,6 +26,7 @@
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractTokenFilterFactory;
import org.elasticsearch.index.analysis.Analysis;
import org.elasticsearch.index.analysis.TokenFilterFactory;

/**
* Contains the common configuration settings between subclasses of this class.
@@ -50,4 +51,9 @@ protected AbstractCompoundWordTokenFilterFactory(IndexSettings indexSettings, En
throw new IllegalArgumentException("word_list must be provided for [" + name + "], either as a path to a file, or directly");
}
}

@Override
public TokenFilterFactory getSynonymFilter(boolean lenient) {
return IDENTITY_FILTER; // don't decompound synonym file
}
}
@@ -26,6 +26,7 @@
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractTokenFilterFactory;
import org.elasticsearch.index.analysis.TokenFilterFactory;

import java.util.Arrays;
import java.util.HashSet;
@@ -89,4 +90,11 @@ public TokenStream create(TokenStream tokenStream) {
return filter;
}

@Override
public TokenFilterFactory getSynonymFilter(boolean lenient) {
if (outputUnigrams) {
return IDENTITY_FILTER; // don't combine for synonyms
}
return this;
}
}
@@ -427,7 +427,7 @@ public List<PreConfiguredTokenFilter> getPreConfiguredTokenFilters() {
filters.add(PreConfiguredTokenFilter.singleton("german_stem", false, GermanStemFilter::new));
filters.add(PreConfiguredTokenFilter.singleton("hindi_normalization", true, HindiNormalizationFilter::new));
filters.add(PreConfiguredTokenFilter.singleton("indic_normalization", true, IndicNormalizationFilter::new));
filters.add(PreConfiguredTokenFilter.singleton("keyword_repeat", false, KeywordRepeatFilter::new));
filters.add(PreConfiguredTokenFilter.singleton("keyword_repeat", false, false, KeywordRepeatFilter::new));
filters.add(PreConfiguredTokenFilter.singleton("kstem", false, KStemFilter::new));
filters.add(PreConfiguredTokenFilter.singleton("length", false, input ->
new LengthFilter(input, 0, Integer.MAX_VALUE))); // TODO this one seems useless
@@ -28,6 +28,7 @@
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractTokenFilterFactory;
import org.elasticsearch.index.analysis.Analysis;
import org.elasticsearch.index.analysis.TokenFilterFactory;

public class CommonGramsTokenFilterFactory extends AbstractTokenFilterFactory {

@@ -58,5 +59,10 @@ public TokenStream create(TokenStream tokenStream) {
return filter;
}
}

@Override
public TokenFilterFactory getSynonymFilter(boolean lenient) {
return IDENTITY_FILTER;
}
}

@@ -19,17 +19,24 @@

package org.elasticsearch.analysis.common;

import org.apache.logging.log4j.LogManager;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;
import org.apache.lucene.analysis.reverse.ReverseStringFilter;
import org.elasticsearch.Version;
import org.elasticsearch.common.logging.DeprecationLogger;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractTokenFilterFactory;
import org.elasticsearch.index.analysis.TokenFilterFactory;


public class EdgeNGramTokenFilterFactory extends AbstractTokenFilterFactory {

private static final DeprecationLogger DEPRECATION_LOGGER
= new DeprecationLogger(LogManager.getLogger(EdgeNGramTokenFilterFactory.class));

private final int minGram;

private final int maxGram;
@@ -77,4 +84,20 @@ public TokenStream create(TokenStream tokenStream) {
public boolean breaksFastVectorHighlighter() {
return true;
}

@Override
public TokenFilterFactory getSynonymFilter(boolean lenient) {
if (lenient) {
return IDENTITY_FILTER;
}
if (indexSettings.getIndexVersionCreated().onOrAfter(Version.V_7_0_0_alpha1)) {
throw new IllegalArgumentException("Token filter [" + name() +
"] cannot be used to parse synonyms unless [lenient] is set to true");
}
else {
DEPRECATION_LOGGER.deprecatedAndMaybeLog("synonym_tokenfilters", "Token filter [" + name()
+ "] will not be usable to parse synonyms after v7.0 unless [lenient] is set to true");
return IDENTITY_FILTER;
}
}
}
@@ -19,18 +19,25 @@

package org.elasticsearch.analysis.common;

import org.apache.logging.log4j.LogManager;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.miscellaneous.FingerprintFilter;
import org.elasticsearch.Version;
import org.elasticsearch.common.logging.DeprecationLogger;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractTokenFilterFactory;
import org.elasticsearch.index.analysis.TokenFilterFactory;

import static org.elasticsearch.analysis.common.FingerprintAnalyzerProvider.DEFAULT_MAX_OUTPUT_SIZE;
import static org.elasticsearch.analysis.common.FingerprintAnalyzerProvider.MAX_OUTPUT_SIZE;

public class FingerprintTokenFilterFactory extends AbstractTokenFilterFactory {

private static final DeprecationLogger DEPRECATION_LOGGER
= new DeprecationLogger(LogManager.getLogger(FingerprintTokenFilterFactory.class));

private final char separator;
private final int maxOutputSize;

@@ -47,4 +54,20 @@ public TokenStream create(TokenStream tokenStream) {
return result;
}

@Override
public TokenFilterFactory getSynonymFilter(boolean lenient) {
if (lenient) {
return this;
}
if (indexSettings.getIndexVersionCreated().onOrAfter(Version.V_7_0_0_alpha1)) {
throw new IllegalArgumentException("Token filter [" + name() +
"] cannot be used to parse synonyms unless [lenient] is set to true");
}
else {
DEPRECATION_LOGGER.deprecatedAndMaybeLog("synonym_tokenfilters", "Token filter [" + name()
+ "] will not be usable to parse synonyms after v7.0 unless [lenient] is set to true");
return IDENTITY_FILTER;
}
}

}
@@ -19,12 +19,15 @@

package org.elasticsearch.analysis.common;

import org.apache.logging.log4j.LogManager;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.miscellaneous.ConditionalTokenFilter;
import org.apache.lucene.analysis.miscellaneous.RemoveDuplicatesTokenFilter;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.elasticsearch.Version;
import org.elasticsearch.common.Strings;
import org.elasticsearch.common.logging.DeprecationLogger;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
@@ -40,6 +43,9 @@

public class MultiplexerTokenFilterFactory extends AbstractTokenFilterFactory {

private static final DeprecationLogger DEPRECATION_LOGGER
= new DeprecationLogger(LogManager.getLogger(MultiplexerTokenFilterFactory.class));

private List<String> filterNames;
private final boolean preserveOriginal;

@@ -54,6 +60,22 @@ public TokenStream create(TokenStream tokenStream) {
throw new UnsupportedOperationException("TokenFilterFactory.getChainAwareTokenFilterFactory() must be called first");
}

@Override
public TokenFilterFactory getSynonymFilter(boolean lenient) {
if (lenient) {
return IDENTITY_FILTER;
}
if (indexSettings.getIndexVersionCreated().onOrAfter(Version.V_7_0_0_alpha1)) {
throw new IllegalArgumentException("Token filter [" + name() +
"] cannot be used to parse synonyms unless [lenient] is set to true");
}
else {
DEPRECATION_LOGGER.deprecatedAndMaybeLog("synonym_tokenfilters", "Token filter [" + name()
+ "] will not be usable to parse synonyms after v7.0 unless [lenient] is set to true");
return IDENTITY_FILTER;
}
}

@Override
public TokenFilterFactory getChainAwareTokenFilterFactory(TokenizerFactory tokenizer, List<CharFilterFactory> charFilters,
List<TokenFilterFactory> previousTokenFilters,
@@ -97,8 +119,19 @@ public TokenStream create(TokenStream tokenStream) {
}

@Override
public TokenFilterFactory getSynonymFilter() {
return IDENTITY_FILTER;
public TokenFilterFactory getSynonymFilter(boolean lenient) {
if (lenient) {
return IDENTITY_FILTER;
}
if (indexSettings.getIndexVersionCreated().onOrAfter(Version.V_7_0_0_alpha1)) {
throw new IllegalArgumentException("Token filter [" + name() +
"] cannot be used to parse synonyms unless [lenient] is set to true");
}
else {
DEPRECATION_LOGGER.deprecatedAndMaybeLog("synonym_tokenfilters", "Token filter [" + name()
+ "] will not be usable to parse synonyms after v7.0 unless [lenient] is set to true");
return IDENTITY_FILTER;
}
}
};
}
@@ -19,23 +19,27 @@

package org.elasticsearch.analysis.common;

import org.apache.logging.log4j.LogManager;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.NGramTokenFilter;
import org.elasticsearch.common.logging.DeprecationLogger;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractTokenFilterFactory;
import org.elasticsearch.Version;

import org.elasticsearch.index.analysis.TokenFilterFactory;


public class NGramTokenFilterFactory extends AbstractTokenFilterFactory {

private static final DeprecationLogger DEPRECATION_LOGGER
= new DeprecationLogger(LogManager.getLogger(NGramTokenFilterFactory.class));

private final int minGram;

private final int maxGram;


NGramTokenFilterFactory(IndexSettings indexSettings, Environment environment, String name, Settings settings) {
super(indexSettings, name, settings);
int maxAllowedNgramDiff = indexSettings.getMaxNgramDiff();
@@ -60,4 +64,20 @@ public TokenStream create(TokenStream tokenStream) {
// TODO: Expose preserveOriginal
return new NGramTokenFilter(tokenStream, minGram, maxGram, false);
}

@Override
public TokenFilterFactory getSynonymFilter(boolean lenient) {
if (lenient) {
return IDENTITY_FILTER;
}
if (indexSettings.getIndexVersionCreated().onOrAfter(Version.V_7_0_0_alpha1)) {
throw new IllegalArgumentException("Token filter [" + name() +
"] cannot be used to parse synonyms unless [lenient] is set to true");
}
else {
DEPRECATION_LOGGER.deprecatedAndMaybeLog("synonym_tokenfilters", "Token filter [" + name()
+ "] will not be usable to parse synonyms after v7.0 unless [lenient] is set to true");
return IDENTITY_FILTER;
}
}
}
@@ -49,7 +49,7 @@ public TokenStream create(TokenStream tokenStream) {
public TokenFilterFactory getChainAwareTokenFilterFactory(TokenizerFactory tokenizer, List<CharFilterFactory> charFilters,
List<TokenFilterFactory> previousTokenFilters,
Function<String, TokenFilterFactory> allFilters) {
final Analyzer analyzer = buildSynonymAnalyzer(tokenizer, charFilters, previousTokenFilters);
final Analyzer analyzer = buildSynonymAnalyzer(tokenizer, charFilters, previousTokenFilters, allFilters);
final SynonymMap synonyms = buildSynonyms(analyzer, getRulesFromSettings(environment));
final String name = name();
return new TokenFilterFactory() {
@@ -72,7 +72,7 @@ public TokenStream create(TokenStream tokenStream) {
public TokenFilterFactory getChainAwareTokenFilterFactory(TokenizerFactory tokenizer, List<CharFilterFactory> charFilters,
List<TokenFilterFactory> previousTokenFilters,
Function<String, TokenFilterFactory> allFilters) {
final Analyzer analyzer = buildSynonymAnalyzer(tokenizer, charFilters, previousTokenFilters);
final Analyzer analyzer = buildSynonymAnalyzer(tokenizer, charFilters, previousTokenFilters, allFilters);
final SynonymMap synonyms = buildSynonyms(analyzer, getRulesFromSettings(environment));
final String name = name();
return new TokenFilterFactory() {
@@ -85,14 +85,19 @@ public String name() {
public TokenStream create(TokenStream tokenStream) {
return synonyms.fst == null ? tokenStream : new SynonymFilter(tokenStream, synonyms, false);
}

@Override
public TokenFilterFactory getSynonymFilter(boolean lenient) {
return IDENTITY_FILTER; // Don't apply synonyms to a synonym file, this will just confuse things
}
};
}

Analyzer buildSynonymAnalyzer(TokenizerFactory tokenizer, List<CharFilterFactory> charFilters,
List<TokenFilterFactory> tokenFilters) {
List<TokenFilterFactory> tokenFilters, Function<String, TokenFilterFactory> allFilters) {
return new CustomAnalyzer("synonyms", tokenizer, charFilters.toArray(new CharFilterFactory[0]),
tokenFilters.stream()
.map(TokenFilterFactory::getSynonymFilter)
.map(ts -> ts.getSynonymFilter(lenient))
.toArray(TokenFilterFactory[]::new));
}
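The change to buildSynonymAnalyzer above maps every filter in the chain through its synonym-parsing variant, now passing the filter's leniency setting. A self-contained sketch of that mapping step, with `Factory` as a hypothetical stand-in for `TokenFilterFactory`:

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the mapping in buildSynonymAnalyzer: each filter factory in
// the analysis chain is replaced by the factory it wants used while
// parsing synonym files. Factory is a made-up stand-in type.
class SynonymChain {
    interface Factory {
        Factory getSynonymFilter(boolean lenient);
        String label();
    }

    // A single-token filter simply returns itself for synonym parsing.
    static Factory passthrough(String label) {
        return new Factory() {
            @Override public Factory getSynonymFilter(boolean lenient) { return this; }
            @Override public String label() { return label; }
        };
    }

    // Mirrors: tokenFilters.stream().map(ts -> ts.getSynonymFilter(lenient))
    static List<Factory> forSynonymParsing(List<Factory> chain, boolean lenient) {
        return chain.stream()
                .map(f -> f.getSynonymFilter(lenient))
                .collect(Collectors.toList());
    }
}
```

The key design point is that the decision is delegated: the synonym filter does not need to know which filters are safe, because each factory answers for itself, throwing or substituting an identity filter as appropriate.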
