
HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled [LUCENE-8183] #9231

Closed
asfimport opened this issue Feb 23, 2018 · 24 comments · Fixed by #12437

Comments


asfimport commented Feb 23, 2018

The HyphenationCompoundWordTokenFilter creates overlapping tokens even if onlyLongestMatch is enabled. 

Example:

Dictionary: gesellschaft, schaft
Hyphenator: de_DR.xml (from Apache OFFO)
onlyLongestMatch: true

 

text            gesellschaft                            gesellschaft                            schaft
raw_bytes       [67 65 73 65 6c 6c 73 63 68 61 66 74]   [67 65 73 65 6c 6c 73 63 68 61 66 74]   [73 63 68 61 66 74]
start           0                                       0                                       0
end             12                                      12                                      12
positionLength  1                                       1                                       1
type            word                                    word                                    word
position        1                                       1                                       1

IMHO this includes two unexpected tokens:

  1. the second 'gesellschaft', as it duplicates the original token
  2. the 'schaft', as it is a sub-token of 'gesellschaft', which is itself present in the dictionary
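
For reference, a minimal stand-alone sketch that reproduces the report (the dictionary is inlined; the hyphenation file path is a placeholder matching the configuration below):

import java.io.StringReader;
import java.util.Arrays;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter;
import org.apache.lucene.analysis.compound.hyphenation.HyphenationTree;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class Lucene8183Repro {
  public static void main(String[] args) throws Exception {
    HyphenationTree hyphenator =
        HyphenationCompoundWordTokenFilter.getHyphenationTree("lang/hyph_de_DR.xml");
    CharArraySet dictionary = new CharArraySet(Arrays.asList("gesellschaft", "schaft"), true);

    Tokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader("gesellschaft"));

    // last argument: onlyLongestMatch = true
    HyphenationCompoundWordTokenFilter filter = new HyphenationCompoundWordTokenFilter(
        tokenizer, hyphenator, dictionary,
        HyphenationCompoundWordTokenFilter.DEFAULT_MIN_WORD_SIZE,
        HyphenationCompoundWordTokenFilter.DEFAULT_MIN_SUBWORD_SIZE,
        HyphenationCompoundWordTokenFilter.DEFAULT_MAX_SUBWORD_SIZE,
        true);

    CharTermAttribute term = filter.addAttribute(CharTermAttribute.class);
    filter.reset();
    while (filter.incrementToken()) {
      System.out.println(term); // prints gesellschaft, gesellschaft, schaft
    }
    filter.end();
    filter.close();
  }
}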

Migrated from LUCENE-8183 by Rupert Westenthaler, 1 vote, updated Jan 14 2021
Environment:

Configuration of the analyzer:

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.HyphenationCompoundWordTokenFilterFactory"
        hyphenator="lang/hyph_de_DR.xml" encoding="iso-8859-1"
        dictionary="lang/wordlist_de.txt"
        onlyLongestMatch="true"/>

 

Attachments: LUCENE-8183_20180223_rwesten.diff, LUCENE-8183_20180227_rwesten.diff, lucene-8183.zip


Rupert Westenthaler (migrated from JIRA)

Added a patch that shows how the unexpected behaviour could be fixed.


Uwe Schindler (@uschindler) (migrated from JIRA)

Hi,
I have not noticed this with my dictionaries; I have to dig into that. Did you also check the German dictionary which is provided here: https://github.com/uschindler/german-decompounder


Matthias Krueger (@mkr) (migrated from JIRA)

Rupert Westenthaler: Quick question regarding your patch: what's the reasoning behind not decomposing terms that are part of the dictionary at all?

The onlyLongestMatch flag currently affects whether all matches or only the longest match should be returned per start character (in DictionaryCompoundWordTokenFilter) or per hyphenation start point (in HyphenationCompoundWordTokenFilter).

Example:
With the dictionary "Schaft", "Wirt", "Wirtschaft", "Wissen", "Wissenschaft", the input "Wirtschaftswissenschaft" will return the original input plus the tokens "Wirtschaft", "schaft", "wissenschaft", "schaft", but not "Wirt" or "Wissen". "schaft" is still returned (even twice) because it's the longest token starting at the respective position.

I like the idea of restricting this further to only the longest terms that touch a certain hyphenation point. This would exclude "schaft" in the example above (as "Wirtschaft" and "wissenschaft" are two longer terms encompassing the respective hyphenation point). On the other hand, there might be examples where you still want to include the "overlapping" tokens: for "Fußballpumpe" and the dictionary "Ball", "Ballpumpe", "Pumpe", "Fuß", "Fußball" you would get the tokens "Fußball" and "pumpe", but not "Ballpumpe", as "Ball" has already been considered part of "Fußball". Also, I'm not sure whether your change also improves the situation for languages other than German.

Regarding point 1: The current algorithm always returns the term itself again if it's part of the dictionary. I guess this could be changed if we check not against this.maxSubwordSize but against Math.min(this.maxSubwordSize, termAtt.length() - 1).
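
As a sketch, with a hypothetical local variable name, that bound would look like:

// cap the subword length one below the term length, so the term itself
// can never be re-emitted as its own subword
int effectiveMaxSubwordSize = Math.min(this.maxSubwordSize, termAtt.length() - 1);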

Perhaps these kinds of adjustments should rather be done in a TokenFilter similar to RemoveDuplicatesTokenFilter instead of complicating the decompounding algorithm?


Rupert Westenthaler (migrated from JIRA)

AFAIK I use exactly this dictionary and hyphenator config. I will provide a Solr core config that can be used to reproduce the described issue.


Rupert Westenthaler (migrated from JIRA)

I was not aware that this was the intended behaviour.

For "Fußballpumpe" and dictionary "Ball", "Ballpumpe", "Pumpe", "Fuß", "Fußball" you would get the tokens "Fußball" and "pumpe" but not "Ballpumpe" as "Ball" has already been considered part of Fußball. Also, not sure if your change also improves the situation for languages other than German.

That's a good point. Maybe one should still consider parts that are not enclosed by a token that was already decomposed. So for Fußballpumpe: ball would be ignored as Fußball is already present, but ballpumpe would still be added as a token. Finally, pumpe is ignored as ballpumpe is present.

This reminds me of the ALL, NO_SUB and LONGEST_DOMINANT_RIGHT modes supported by the Solr Text Tagger.

> Perhaps these kinds of adjustments should rather be done in a TokenFilter similar to RemoveDuplicatesTokenFilter instead of complicating the decompounding algorithm?

I am aware of this possibility. In fact I do use the RemoveDuplicatesTokenFilter to remove those tokens. My point was just to ask why they are added in the first place.
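
For reference, the workaround chain looks roughly like this (a configuration sketch mirroring the analyzer from the report, with the duplicate-removal filter appended):

<filter class="solr.HyphenationCompoundWordTokenFilterFactory"
        hyphenator="lang/hyph_de_DR.xml" encoding="iso-8859-1"
        dictionary="lang/wordlist_de.txt"
        onlyLongestMatch="true"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>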


Uwe Schindler (@uschindler) (migrated from JIRA)

Rupert Westenthaler: I was not aware that this was my dictionary file! The names in your example (under "Environment" in your report) did not look like the example listed here: https://github.com/uschindler/german-decompounder


Uwe Schindler (@uschindler) (migrated from JIRA)

> I am aware of this possibility. In fact I do use the RemoveDuplicatesTokenFilter to remove those tokens. My point was just to ask why they are added in the first place.

I think it's good to not add them in the first place. The change is quite simple, so it can be done here. And it does not really complicate the algorithm, as it's done in one separate place.


Uwe Schindler (@uschindler) (migrated from JIRA)

I have to check the algorithm, but to get this patch into Lucene, the test cases need to be adapted to check the new behaviour.


Rupert Westenthaler (migrated from JIRA)

FYI: I plan to spend some time implementing a version of the DictionaryCompoundWordTokenFilter that adds options for

  • noSub: no tokens are added that are completely enclosed by a longer one (fußballpumpe: fußball, ballpumpe)
  • noOverlap: no overlapping tokens (fußballpumpe: fußball, pumpe)

IMO the simplest way is to first emit all tokens and later filter those based on the active options (onlyLongestMatch, noSub, noOverlap).
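
A self-contained sketch of that emit-then-filter idea (Part is an illustrative stand-in for Lucene's CompoundToken; none of the names here are actual Lucene API):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SubwordFilterSketch {

  /** Illustrative stand-in for a decompounded part with its character span. */
  record Part(String txt, int start, int end) {}

  /** Filters the fully emitted candidate list according to the proposed options. */
  static List<Part> filter(List<Part> all, boolean noSub, boolean noOverlap) {
    List<Part> result = new ArrayList<>();
    if (noOverlap) {
      // greedy left-to-right, longest-first: a kept part "consumes" its span
      List<Part> sorted = new ArrayList<>(all);
      sorted.sort(Comparator.comparingInt(Part::start)
          .thenComparing(Comparator.comparingInt((Part p) -> p.end() - p.start()).reversed()));
      int lastEnd = 0;
      for (Part p : sorted) {
        if (p.start() >= lastEnd) {
          result.add(p);
          lastEnd = p.end();
        }
      }
    } else if (noSub) {
      // drop every part that is completely enclosed by a longer part
      for (Part p : all) {
        boolean enclosed = all.stream().anyMatch(q ->
            q != p && q.start() <= p.start() && q.end() >= p.end());
        if (!enclosed) result.add(p);
      }
    } else {
      result.addAll(all); // ALL; onlyLongestMatch is handled during emission
    }
    return result;
  }

  public static void main(String[] args) {
    List<Part> parts = List.of(
        new Part("fußball", 0, 7), new Part("ballpumpe", 3, 12),
        new Part("fuß", 0, 3), new Part("ball", 3, 7), new Part("pumpe", 7, 12));
    System.out.println(filter(parts, true, false)); // keeps fußball, ballpumpe
    System.out.println(filter(parts, false, true)); // keeps fußball, pumpe
  }
}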

Regarding the tests:

  • Providing good test examples is hard, as the current test cases are based on Danish and I do not speak the language.
  • Providing examples in German would be easy, but this would require a German hyphenator, and that file is licensed under the LaTeX Project Public License and can therefore not be included in the source.
  • Given suitable examples, the implementation of the actual tests seems rather easy, as they can be implemented similarly to the existing test cases.


Matthias Krueger (@mkr) (migrated from JIRA)

You might want to have a look at "mocking" the HyphenationTree (see my patch for LUCENE-8185) which simplifies writing a decompounding test.


Rupert Westenthaler (migrated from JIRA)

That's helpful indeed! Thanks.

Author

asfimport commented Feb 27, 2018

Rupert Westenthaler (migrated from JIRA)

Patch: LUCENE-8183_20180227_rwesten.diff

New Parameters:

  • noSubMatches: true/false
  • noOverlappingMatches: true/false

Together with the existing onlyLongestMatch, these can be used to define which subwords should be added as tokens. The functionality is as described above.

Typically users will only want to enable one of the three attributes: noOverlappingMatches is the most restrictive, and noSubMatches is more restrictive than onlyLongestMatch. When a more restrictive option is enabled, the state of the less restrictive ones has no effect.

Because of that, it would be an option to refactor this into a single attribute with different settings, but this would require thinking about backward compatibility for configurations that currently use onlyLongestMatch=true.
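
A configuration sketch using the new parameters (names as defined in the factory change below; the file paths are placeholders):

<filter class="solr.HyphenationCompoundWordTokenFilterFactory"
        hyphenator="lang/hyph_de_DR.xml" encoding="iso-8859-1"
        dictionary="lang/wordlist_de.txt"
        noSubMatches="true"/>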

Algorithm

If processing of subwords is deactivated (i.e. any of onlyLongestMatch, noSubMatches, noOverlappingMatches is active), the algorithm first checks whether the token itself is part of the dictionary. If so, it returns immediately. This avoids adding tokens for subwords when the token itself is in the dictionary (see #testNoSubAndTokenInDictionary for more info).

I changed the iteration direction of the inner for loop to start with the longest possible subword, as this simplified the code.

NOTE: this also changes the order of the tokens in the token stream, but as all tokens are at the same position this should not make any difference. I did, however, have to modify some existing tests, as they were sensitive to the ordering.

Tests

I added two test methods in TestCompoundWordTokenFilter:

  1. #testNoSubAndNoOverlap() tests the expected behaviour of the noSubMatches and noOverlappingMatches options
  2. #testNoSubAndTokenInDictionary() tests that no tokens for subwords are added in the case that the token is part of the dictionary

In addition TestHyphenationCompoundWordTokenFilterFactory#testLucene8183() asserts that the new configuration options are parsed.

Environment

This patch is based on master from git@github.com:apache/lucene-solr.git commit: d512cd7604741a2f55deb0c840188f0342f1ba57

LUCENE-8183_20180227_rwesten.diff
diff --git a/lucene/analysis/common/src/java/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilter.java b/lucene/analysis/common/src/java/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilter.java
index 41f92c9..ccb8714 100644
--- a/lucene/analysis/common/src/java/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilter.java
+++ b/lucene/analysis/common/src/java/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilter.java
@@ -34,6 +34,9 @@ import org.xml.sax.InputSource;
 public class HyphenationCompoundWordTokenFilter extends
     CompoundWordTokenFilterBase {
   private HyphenationTree hyphenator;
+  private boolean noSubMatches;
+  private boolean noOverlappingMatches;
+  private boolean calcSubMatches;
 
   /**
    * Creates a new {@link HyphenationCompoundWordTokenFilter} instance.
@@ -48,7 +51,7 @@ public class HyphenationCompoundWordTokenFilter extends
   public HyphenationCompoundWordTokenFilter(TokenStream input,
                                             HyphenationTree hyphenator, CharArraySet dictionary) {
     this(input, hyphenator, dictionary, DEFAULT_MIN_WORD_SIZE,
-        DEFAULT_MIN_SUBWORD_SIZE, DEFAULT_MAX_SUBWORD_SIZE, false);
+        DEFAULT_MIN_SUBWORD_SIZE, DEFAULT_MAX_SUBWORD_SIZE, false, false, false);
   }
 
   /**
@@ -67,21 +70,54 @@ public class HyphenationCompoundWordTokenFilter extends
    * @param maxSubwordSize
    *          only subwords shorter than this get to the output stream
    * @param onlyLongestMatch
-   *          Add only the longest matching subword to the stream
+   *          Add only the longest matching subword for each hyphenation to the stream
+   */
+  public HyphenationCompoundWordTokenFilter(TokenStream input,
+      HyphenationTree hyphenator, CharArraySet dictionary, int minWordSize,
+      int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch) {
+    this(input, hyphenator, dictionary, minWordSize, minSubwordSize,
+        maxSubwordSize, onlyLongestMatch, false, false);
+  }
+  
+  /**
+   * Creates a new {@link HyphenationCompoundWordTokenFilter} instance.
+   *
+   * @param input
+   *          the {@link org.apache.lucene.analysis.TokenStream} to process
+   * @param hyphenator
+   *          the hyphenation pattern tree to use for hyphenation
+   * @param dictionary
+   *          the word dictionary to match against.
+   * @param minWordSize
+   *          only words longer than this get processed
+   * @param minSubwordSize
+   *          only subwords longer than this get to the output stream
+   * @param maxSubwordSize
+   *          only subwords shorter than this get to the output stream
+   * @param onlyLongestMatch
+   *          Add only the longest matching subword for each hyphenation to the stream
+   * @param noSubMatches
+   *          Excludes subwords that are enclosed by an other token
+   * @param noOverlappingMatches
+   *          Excludes subwords that overlap with an other subword
    */
   public HyphenationCompoundWordTokenFilter(TokenStream input,
                                             HyphenationTree hyphenator, CharArraySet dictionary, int minWordSize,
-                                            int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch) {
+                                            int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch,
+                                            boolean noSubMatches, boolean noOverlappingMatches) {
     super(input, dictionary, minWordSize, minSubwordSize, maxSubwordSize,
         onlyLongestMatch);
 
     this.hyphenator = hyphenator;
+    this.noSubMatches = noSubMatches;
+    this.noOverlappingMatches = noOverlappingMatches;
+    this.calcSubMatches = !onlyLongestMatch && !noSubMatches && !noOverlappingMatches;
   }
 
   /**
    * Create a HyphenationCompoundWordTokenFilter with no dictionary.
    * <p>
-   * Calls {@link #HyphenationCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream, org.apache.lucene.analysis.compound.hyphenation.HyphenationTree, org.apache.lucene.analysis.CharArraySet, int, int, int, boolean)
+   * Calls {@link #HyphenationCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream, org.apache.lucene.analysis.compound.hyphenation.HyphenationTree, org.apache.lucene.analysis.CharArraySet, int, int, int, boolean, boolean, boolean)
    * HyphenationCompoundWordTokenFilter(matchVersion, input, hyphenator,
    * null, minWordSize, minSubwordSize, maxSubwordSize }
    */
@@ -89,7 +125,7 @@ public class HyphenationCompoundWordTokenFilter extends
                                             HyphenationTree hyphenator, int minWordSize, int minSubwordSize,
                                             int maxSubwordSize) {
     this(input, hyphenator, null, minWordSize, minSubwordSize,
-        maxSubwordSize, false);
+        maxSubwordSize, false, false, false);
   }
 
   /**
@@ -133,26 +169,42 @@ public class HyphenationCompoundWordTokenFilter extends
 
   @Override
   protected void decompose() {
+    //if the token is in the dictionary and we are not interested in subMatches
+    //we can skip decomposing this token (see testNoSubAndTokenInDictionary unit test) 
+    //NOTE: 
+    //we check against token and the token that is one character
+    //shorter to avoid problems with genitive 's characters and other binding characters
+    if(dictionary != null && !this.calcSubMatches && 
+      (dictionary.contains(termAtt.buffer(), 0, termAtt.length()) ||
+          termAtt.length() > 1 && dictionary.contains(termAtt.buffer(), 0, termAtt.length() - 1))){
+    return; //the whole token is in the dictionary - do not decompose
+    }
+    
     // get the hyphenation points
     Hyphenation hyphens = hyphenator.hyphenate(termAtt.buffer(), 0, termAtt.length(), 1, 1);
     // No hyphen points found -> exit
     if (hyphens == null) {
       return;
     }
+    int maxSubwordSize = Math.min(this.maxSubwordSize, termAtt.length()-1);
 
     final int[] hyp = hyphens.getHyphenationPoints();
 
+    int consumed = -1; //hyp of the longest token added (for noSub)
+    
     for (int i = 0; i < hyp.length; ++i) {
-      int remaining = hyp.length - i;
+      if(noOverlappingMatches){ //if we do not want overlapping subwords
+        i = Math.max(i, consumed); //skip over consumed hyp
+      }
       int start = hyp[i];
-      CompoundToken longestMatchToken = null;
-      for (int j = 1; j < remaining; j++) {
-        int partLength = hyp[i + j] - start;
+      int until = noSubMatches ? Math.max(consumed, i) : i;
+      for (int j = hyp.length - 1; j > until; j--) {
+        int partLength = hyp[j] - start;
 
         // if the part is longer than maxSubwordSize we
         // are done with this round
-        if (partLength > this.maxSubwordSize) {
-          break;
+        if (partLength > maxSubwordSize) {
+          continue;
         }
 
         // we only put subwords to the token stream
@@ -160,42 +212,26 @@ public class HyphenationCompoundWordTokenFilter extends
         if (partLength < this.minSubwordSize) {
           // BOGUS/BROKEN/FUNKY/WACKO: somehow we have negative 'parts' according to the 
           // calculation above, and we rely upon minSubwordSize being >=0 to filter them out...
-          continue;
+          break; 
         }
 
         // check the dictionary
         if (dictionary == null || dictionary.contains(termAtt.buffer(), start, partLength)) {
-          if (this.onlyLongestMatch) {
-            if (longestMatchToken != null) {
-              if (longestMatchToken.txt.length() < partLength) {
-                longestMatchToken = new CompoundToken(start, partLength);
-              }
-            } else {
-              longestMatchToken = new CompoundToken(start, partLength);
-            }
-          } else {
             tokens.add(new CompoundToken(start, partLength));
-          }
+            consumed = j; //mark the current hyp as consumed
+            if(!calcSubMatches){
+              break; //do not search for shorter matches
+            }
         } else if (dictionary.contains(termAtt.buffer(), start, partLength - 1)) {
           // check the dictionary again with a word that is one character
-          // shorter
-          // to avoid problems with genitive 's characters and other binding
-          // characters
-          if (this.onlyLongestMatch) {
-            if (longestMatchToken != null) {
-              if (longestMatchToken.txt.length() < partLength - 1) {
-                longestMatchToken = new CompoundToken(start, partLength - 1);
-              }
-            } else {
-              longestMatchToken = new CompoundToken(start, partLength - 1);
-            }
-          } else {
-            tokens.add(new CompoundToken(start, partLength - 1));
+          // shorter to avoid problems with genitive 's characters and
+          // other binding characters
+          tokens.add(new CompoundToken(start, partLength - 1));
+          consumed = j; //mark the current hyp as consumed
+          if(!calcSubMatches){
+            break; //do not search for shorter matches
           }
-        }
-      }
-      if (this.onlyLongestMatch && longestMatchToken!=null) {
-        tokens.add(longestMatchToken);
+        } //else dictionary is present but does not contain the part
       }
     }
   }
diff --git a/lucene/analysis/common/src/java/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.java b/lucene/analysis/common/src/java/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.java
index 6b018d6..9f6e65c 100644
--- a/lucene/analysis/common/src/java/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.java
+++ b/lucene/analysis/common/src/java/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.java
@@ -69,6 +69,8 @@ public class HyphenationCompoundWordTokenFilterFactory extends TokenFilterFactor
   private final int minSubwordSize;
   private final int maxSubwordSize;
   private final boolean onlyLongestMatch;
+  private final boolean noSubMatches;
+  private final boolean noOverlappingMatches;
   
   /** Creates a new HyphenationCompoundWordTokenFilterFactory */
   public HyphenationCompoundWordTokenFilterFactory(Map<String, String> args) {
@@ -80,6 +82,8 @@ public class HyphenationCompoundWordTokenFilterFactory extends TokenFilterFactor
     minSubwordSize = getInt(args, "minSubwordSize", CompoundWordTokenFilterBase.DEFAULT_MIN_SUBWORD_SIZE);
     maxSubwordSize = getInt(args, "maxSubwordSize", CompoundWordTokenFilterBase.DEFAULT_MAX_SUBWORD_SIZE);
     onlyLongestMatch = getBoolean(args, "onlyLongestMatch", false);
+    noSubMatches = getBoolean(args, "noSubMatches", false);
+    noOverlappingMatches = getBoolean(args, "noOverlappingMatches", false);
     if (!args.isEmpty()) {
       throw new IllegalArgumentException("Unknown parameters: " + args);
     }
@@ -105,6 +109,6 @@ public class HyphenationCompoundWordTokenFilterFactory extends TokenFilterFactor
   
   @Override
   public TokenFilter create(TokenStream input) {
-    return new HyphenationCompoundWordTokenFilter(input, hyphenator, dictionary, minWordSize, minSubwordSize, maxSubwordSize, onlyLongestMatch);
+    return new HyphenationCompoundWordTokenFilter(input, hyphenator, dictionary, minWordSize, minSubwordSize, maxSubwordSize, onlyLongestMatch, noSubMatches, noOverlappingMatches);
   }
 }
diff --git a/lucene/analysis/common/src/java/org/apache/lucene/analysis/compound/hyphenation/Hyphenation.java b/lucene/analysis/common/src/java/org/apache/lucene/analysis/compound/hyphenation/Hyphenation.java
index 3fb1e04..c435eaa 100644
--- a/lucene/analysis/common/src/java/org/apache/lucene/analysis/compound/hyphenation/Hyphenation.java
+++ b/lucene/analysis/common/src/java/org/apache/lucene/analysis/compound/hyphenation/Hyphenation.java
@@ -23,12 +23,12 @@ package org.apache.lucene.analysis.compound.hyphenation;
  */
 public class Hyphenation {
 
-  private int[] hyphenPoints;
+  private final int[] hyphenPoints;
 
   /**
    * rawWord as made of alternating strings and {@link Hyphen Hyphen} instances
    */
-  Hyphenation(int[] points) {
+  public Hyphenation(int[] points) {
     hyphenPoints = points;
   }
 
diff --git a/lucene/analysis/common/src/test/org/apache/lucene/analysis/compound/TestCompoundWordTokenFilter.java b/lucene/analysis/common/src/test/org/apache/lucene/analysis/compound/TestCompoundWordTokenFilter.java
index 67a1bb4..88ffe8b 100644
--- a/lucene/analysis/common/src/test/org/apache/lucene/analysis/compound/TestCompoundWordTokenFilter.java
+++ b/lucene/analysis/common/src/test/org/apache/lucene/analysis/compound/TestCompoundWordTokenFilter.java
@@ -21,6 +21,8 @@ import java.io.IOException;
 import java.io.Reader;
 import java.io.StringReader;
 import java.util.Arrays;
+import java.util.Collections;
+import java.util.Map;
 
 import org.apache.lucene.analysis.Analyzer;
 import org.apache.lucene.analysis.BaseTokenStreamTestCase;
@@ -31,6 +33,7 @@ import org.apache.lucene.analysis.TokenStream;
 import org.apache.lucene.analysis.Tokenizer;
 import org.apache.lucene.analysis.charfilter.MappingCharFilter;
 import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
+import org.apache.lucene.analysis.compound.hyphenation.Hyphenation;
 import org.apache.lucene.analysis.compound.hyphenation.HyphenationTree;
 import org.apache.lucene.analysis.core.KeywordTokenizer;
 import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
@@ -102,7 +105,7 @@ public class TestCompoundWordTokenFilter extends BaseTokenStreamTestCase {
     
     // min=2, max=4
     assertTokenStreamContents(tf,
-        new String[] { "basketballkurv", "ba", "sket", "bal", "ball", "kurv" }
+        new String[] { "basketballkurv", "ba", "sket", "ball", "bal", "kurv" }
     );
     
     tf = new HyphenationCompoundWordTokenFilter(
@@ -126,8 +129,8 @@ public class TestCompoundWordTokenFilter extends BaseTokenStreamTestCase {
     
     // min=4, max=10
     assertTokenStreamContents(tf,
-        new String[] { "basketballkurv", "basket", "basketbal", "basketball", "sket", 
-                       "sketbal", "sketball", "ball", "ballkurv", "lkurv", "kurv" }
+        new String[] { "basketballkurv", "basketball", "basketbal", "basket", 
+                       "sketball", "sketbal", "sket", "ballkurv", "ball", "lkurv", "kurv" }
     );
     
   }
@@ -273,10 +276,95 @@ public class TestCompoundWordTokenFilter extends BaseTokenStreamTestCase {
                 "Rindfleisch"),
         hyphenator);
 
-    // TODO Rindfleisch returned twice is another issue of the HyphenationCompoundTokenFilter 
-    assertTokenStreamContents(tf, new String[] { "Rindfleisch", "Rind", "Rindfleisch", "fleisch"});
+    assertTokenStreamContents(tf, new String[] { "Rindfleisch", "Rind", "fleisch"});
   }
-
+  
+  public void testNoSubAndNoOverlap() throws Exception { //LUCENE-8183
+    String input = "fußballpumpe";
+    Hyphenation hyphenation = new Hyphenation(new int[]{ 0, 3, 7, 10, 12});//fuß ball pum pe
+    HyphenationTree hyphenator = new MockHyphenator(Collections.singletonMap(input, hyphenation));
+    CharArraySet dictionary = makeDictionary("fußball", "ballpumpe", "fuß", "ball", "pumpe");
+  
+    //test the default configuration
+    HyphenationCompoundWordTokenFilter tf1 = new HyphenationCompoundWordTokenFilter(whitespaceMockTokenizer(input),
+      hyphenator, dictionary);
+    assertTokenStreamContents(tf1, new String[] { "fußballpumpe", "fußball", "fuß", "ballpumpe", "ball", "pumpe"});
+      
+    //test with onlyLongestMatch
+    HyphenationCompoundWordTokenFilter tf2 = new HyphenationCompoundWordTokenFilter(whitespaceMockTokenizer(input),
+      hyphenator, dictionary, CompoundWordTokenFilterBase.DEFAULT_MIN_WORD_SIZE,
+      CompoundWordTokenFilterBase.DEFAULT_MIN_SUBWORD_SIZE, 
+      CompoundWordTokenFilterBase.DEFAULT_MAX_SUBWORD_SIZE, true);
+    assertTokenStreamContents(tf2, new String[] { "fußballpumpe", "fußball", "ballpumpe", "pumpe"});
+      
+    //test with noSub enabled and noOverlap disabled
+    HyphenationCompoundWordTokenFilter tf3 = new HyphenationCompoundWordTokenFilter(whitespaceMockTokenizer(input),
+      hyphenator, dictionary, CompoundWordTokenFilterBase.DEFAULT_MIN_WORD_SIZE,
+      CompoundWordTokenFilterBase.DEFAULT_MIN_SUBWORD_SIZE, 
+      CompoundWordTokenFilterBase.DEFAULT_MAX_SUBWORD_SIZE, true, true, false);
+    assertTokenStreamContents(tf3, new String[] { "fußballpumpe", "fußball", "ballpumpe"});
+    //assert that the onlyLongestMatch state does not matter if noSub is active
+    HyphenationCompoundWordTokenFilter tf3b = new HyphenationCompoundWordTokenFilter(whitespaceMockTokenizer(input),
+      hyphenator, dictionary, CompoundWordTokenFilterBase.DEFAULT_MIN_WORD_SIZE,
+      CompoundWordTokenFilterBase.DEFAULT_MIN_SUBWORD_SIZE, 
+      CompoundWordTokenFilterBase.DEFAULT_MAX_SUBWORD_SIZE, false, true, false);
+    assertTokenStreamContents(tf3b, new String[] { "fußballpumpe", "fußball", "ballpumpe"});
+
+    //test with noOverlap enabled
+    HyphenationCompoundWordTokenFilter tf4 = new HyphenationCompoundWordTokenFilter(whitespaceMockTokenizer(input),
+      hyphenator, dictionary, CompoundWordTokenFilterBase.DEFAULT_MIN_WORD_SIZE,
+      CompoundWordTokenFilterBase.DEFAULT_MIN_SUBWORD_SIZE, 
+      CompoundWordTokenFilterBase.DEFAULT_MAX_SUBWORD_SIZE, true, true, true);
+    //NOTE: 'fußball' consumes 'ball' as possible start so 'ballpumpe' is not considered and 'pumpe' is added
+    assertTokenStreamContents(tf4, new String[] { "fußballpumpe", "fußball", "pumpe"});
+    //assert that the noSub and onlyLongestMatch states do not matter
+    HyphenationCompoundWordTokenFilter tf4b = new HyphenationCompoundWordTokenFilter(whitespaceMockTokenizer(input),
+      hyphenator, dictionary, CompoundWordTokenFilterBase.DEFAULT_MIN_WORD_SIZE,
+      CompoundWordTokenFilterBase.DEFAULT_MIN_SUBWORD_SIZE, 
+      CompoundWordTokenFilterBase.DEFAULT_MAX_SUBWORD_SIZE, false, false, true);
+    assertTokenStreamContents(tf4b, new String[] { "fußballpumpe", "fußball", "pumpe"});
+    HyphenationCompoundWordTokenFilter tf4c = new HyphenationCompoundWordTokenFilter(whitespaceMockTokenizer(input),
+      hyphenator, dictionary, CompoundWordTokenFilterBase.DEFAULT_MIN_WORD_SIZE,
+      CompoundWordTokenFilterBase.DEFAULT_MIN_SUBWORD_SIZE, 
+      CompoundWordTokenFilterBase.DEFAULT_MAX_SUBWORD_SIZE, true, false, true);
+    assertTokenStreamContents(tf4c, new String[] { "fußballpumpe", "fußball", "pumpe"});
+  }      
+
+  public void testNoSubAndTokenInDictionary() throws Exception { //LUCENE-8183
+    //test that no subwords are added if the token is part of the dictionary and 
+    //onlyLongestMatch or noSub is present
+    String input = "fußball";
+    Hyphenation hyphenation = new Hyphenation(new int[]{ 0, 3, 7});//fuß ball
+    HyphenationTree hyphenator = new MockHyphenator(Collections.singletonMap(input, hyphenation));
+    CharArraySet dictionary = makeDictionary("fußball", "fuß", "ball");
+      
+    //test the default configuration as baseline
+    HyphenationCompoundWordTokenFilter tf5 = new HyphenationCompoundWordTokenFilter(whitespaceMockTokenizer(input),
+      hyphenator, dictionary);
+    assertTokenStreamContents(tf5, new String[] { "fußball", "fuß", "ball"});
+
+    //when onlyLongestMatch is enabled fußball matches dictionary. So even so
+    //fußball is not added as token it MUST prevent shorter matches to be added
+    HyphenationCompoundWordTokenFilter tf6 = new HyphenationCompoundWordTokenFilter(whitespaceMockTokenizer(input),
+      hyphenator, dictionary, CompoundWordTokenFilterBase.DEFAULT_MIN_WORD_SIZE,
+      CompoundWordTokenFilterBase.DEFAULT_MIN_SUBWORD_SIZE, 
+      CompoundWordTokenFilterBase.DEFAULT_MAX_SUBWORD_SIZE, true, false, false);
+    assertTokenStreamContents(tf6, new String[] { "fußball"});
+      
+    //when noSub is enabled fuß and ball MUST NOT be added as subwords as fußball is in the dictionary
+    HyphenationCompoundWordTokenFilter tf7 = new HyphenationCompoundWordTokenFilter(whitespaceMockTokenizer(input),
+      hyphenator, dictionary, CompoundWordTokenFilterBase.DEFAULT_MIN_WORD_SIZE,
+      CompoundWordTokenFilterBase.DEFAULT_MIN_SUBWORD_SIZE, 
+      CompoundWordTokenFilterBase.DEFAULT_MAX_SUBWORD_SIZE, false, true, false);
+    assertTokenStreamContents(tf7, new String[] { "fußball"});
+
+    //when noOverlap is enabled fuß and ball MUST NOT be added as subwords as fußball is in the dictionary
+    HyphenationCompoundWordTokenFilter tf8 = new HyphenationCompoundWordTokenFilter(whitespaceMockTokenizer(input),
+      hyphenator, dictionary, CompoundWordTokenFilterBase.DEFAULT_MIN_WORD_SIZE,
+      CompoundWordTokenFilterBase.DEFAULT_MIN_SUBWORD_SIZE, 
+      CompoundWordTokenFilterBase.DEFAULT_MAX_SUBWORD_SIZE, false, false, true);
+    assertTokenStreamContents(tf8, new String[] { "fußball"});
+}
 
   public static interface MockRetainAttribute extends Attribute {
     void setRetain(boolean attr);
@@ -331,7 +419,22 @@ public class TestCompoundWordTokenFilter extends BaseTokenStreamTestCase {
       }
     }
   }
-  
+
+  // Hyphenator that has prior knowledge of hyphenation points for terms
+  private static class MockHyphenator extends HyphenationTree {
+
+    private final Map<String, Hyphenation> hyphenations;
+
+    MockHyphenator(Map<String, Hyphenation> hyphenations) {
+      this.hyphenations = hyphenations;
+    }
+
+    @Override
+    public Hyphenation hyphenate(char[] w, int offset, int len, int remainCharCount, int pushCharCount) {
+      return hyphenations.get(new String(w, offset, len));
+    }
+  }
+
   // [SOLR-2891](https://issues.apache.org/jira/browse/SOLR-2891)
   // *CompoundWordTokenFilter blindly adds term length to offset, but this can take things out of bounds
   // wrt original text if a previous filter increases the length of the word (in this case ü -> ue)
diff --git a/lucene/analysis/common/src/test/org/apache/lucene/analysis/compound/TestHyphenationCompoundWordTokenFilterFactory.java b/lucene/analysis/common/src/test/org/apache/lucene/analysis/compound/TestHyphenationCompoundWordTokenFilterFactory.java
index 0039e20..2dfc63c 100644
--- a/lucene/analysis/common/src/test/org/apache/lucene/analysis/compound/TestHyphenationCompoundWordTokenFilterFactory.java
+++ b/lucene/analysis/common/src/test/org/apache/lucene/analysis/compound/TestHyphenationCompoundWordTokenFilterFactory.java
@@ -47,6 +47,27 @@ public class TestHyphenationCompoundWordTokenFilterFactory extends BaseTokenStre
   }
 
   /**
+   * just tests that the two no configuration options are correctly processed
+   * tests for the functionality are part of {@link TestCompoundWordTokenFilter}
+   */
+  public void testLucene8183() throws Exception {
+    Reader reader = new StringReader("basketballkurv");
+    TokenStream stream = new MockTokenizer(MockTokenizer.WHITESPACE, false);
+    ((Tokenizer)stream).setReader(reader);
+    stream = tokenFilterFactory("HyphenationCompoundWord", 
+        "hyphenator", "da_UTF8.xml",
+        "dictionary", "compoundDictionary_lucene8183.txt",
+        "onlyLongestMatch", "false",
+        "noSubMatches", "true",
+        "noOverlappingMatches", "false").create(stream);
+    
+    assertTokenStreamContents(stream, 
+        new String[] { "basketballkurv", "basketball", "kurv"},
+        new int[] { 1, 0, 0}
+    );
+  }
+
+  /**
    * Ensure the factory works with no dictionary: using hyphenation grammar only.
    * Also change the min/max subword sizes from the default. When using no dictionary,
    * it's generally necessary to tweak these, or you get lots of expansions.
@@ -61,7 +82,7 @@ public class TestHyphenationCompoundWordTokenFilterFactory extends BaseTokenStre
         "maxSubwordSize", "4").create(stream);
     
     assertTokenStreamContents(stream,
-        new String[] { "basketballkurv", "ba", "sket", "bal", "ball", "kurv" }
+        new String[] { "basketballkurv", "ba", "sket", "ball", "bal", "kurv" }
     );
   }
   
diff --git a/lucene/analysis/common/src/test/org/apache/lucene/analysis/compound/compoundDictionary_lucene8183.txt b/lucene/analysis/common/src/test/org/apache/lucene/analysis/compound/compoundDictionary_lucene8183.txt
new file mode 100644
index 0000000..1e62f89
--- /dev/null
+++ b/lucene/analysis/common/src/test/org/apache/lucene/analysis/compound/compoundDictionary_lucene8183.txt
@@ -0,0 +1,20 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+# A set of words for testing LUCENE-8183
+basketball
+basket
+ball
+kurv


Martin Demberger (migrated from JIRA)

Hi @uschindler,

I have created a pull request apache/lucene-solr#2014 with the changes from Rupert Westenthaler.


Uwe Schindler (@uschindler) (migrated from JIRA)

Hi,
thanks Martin! Let me get up to date with the patch and let's see. To me the latest changes look fine, but I haven't yet been able to think it through.


Martin Demberger (migrated from JIRA)

Hi Uwe,

is there any way I can speed up the merge? We have a few customers who are waiting for this fix, because our search very often falls into this pitfall.


Martin Demberger (migrated from JIRA)

Is there any update on this? I have updated the PR so that it passes the new checks.


pstrsr commented Dec 14, 2022

Are there still plans to merge this?

I would also be very interested in this feature.

@MartinDemberger

Hi @uschindler,
I have adapted my PR for this repository.

@uschindler

Hi, thanks @MartinDemberger,
the PR looks good: you also added tests and a hyphenation XML file, although I have not looked closely into the internals of what you are actually doing.

I think it should be fine to merge this into head, but I'd like to get another look from @rmuir, who was one of the committers working on that TokenFilter. If this also fixes the problems with my dictionary and the configuration presented in its repository (https://github.com/uschindler/german-decompounder), I am more than happy.

To be clear: except for reordering tokens, there aren't any backwards-compatibility issues with the new features? From what I understood, they only remove useless tokens, and the order of tokens at the same position does not matter. So somebody with an index created by the older version of the filter won't see any serious issues; just some imprecise matches may no longer be returned (because either the token is no longer in new documents of the index, or the generated query no longer contains the token). So it would only return fewer matches, but no wrong matches.

To me this looks fine.

@uschindler uschindler added this to the 9.8.0 milestone Jul 13, 2023
@uschindler

Does this also fix #4096 ?

@MartinDemberger

> Does this also fix #4096 ?

I'm sorry, but no. #4096 concerns DictionaryCompoundWordTokenFilter, but this one is about HyphenationCompoundWordTokenFilter.
Maybe the change can be adapted, but currently I don't have the time and knowledge to do this. Maybe I'll have a look at it in a few weeks.

@uschindler

> Does this also fix #4096 ?

> I'm sorry, but no. #4096 concerns DictionaryCompoundWordTokenFilter, but this one is about HyphenationCompoundWordTokenFilter. Maybe the change can be adapted, but currently I don't have the time and knowledge to do this. Maybe I'll have a look at it in a few weeks.

You're right. I did not notice that that one is about the dictionary filter. I had just seen that this issue was linked to the other one and did not read the whole issue.

@uschindler

Hi,
I backported (cherry-picked) the PR to Lucene 9.8 as well.

Would it be possible for you to open a PR for the german-decompounder with an updated configuration in its README.txt?
We should also open an issue at Elasticsearch/OpenSearch so that the new constructor parameters / factory options are recognized in their code. Otherwise the new features cannot be used in ES/OS.

Solr should not be an issue, maybe some documentation fixes there, too.

@MartinDemberger

Hi,
thank you very much. My company has been waiting for this fix for several years.

I will create the ES/OS issue and try to update the README in the next few days.
