add NGramSynonymTokenizer [LUCENE-5253] #6317

@asfimport

Description

I'd like to propose another n-gram tokenizer that can process synonyms: NGramSynonymTokenizer. Note that in this ticket the gram size is fixed, i.e. minGramSize = maxGramSize.

Today, I think we have the following problems when using SynonymFilter with NGramTokenizer.
For the purpose of illustration, assume the synonym setting "ABC, DEFG" with expand=true and N = 2 (2-gram).

  1. There is no consensus (I think :-) on how to assign offsets to the generated synonym tokens DE, EF and FG when expanding the source tokens AB and BC.
  2. If the query pattern looks like XABC or ABCY, it cannot be matched even if there is a document "…XABCY…" in the index when autoGeneratePhraseQueries is set to true, because there are no "XA" or "CY" tokens in the index.
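
The mismatch behind problem 1 can be illustrated with a small stand-alone sketch. It is plain Python, not Lucene code, and the helper name `ngrams_with_offsets` is made up for this illustration. It only shows that the source word and its synonym produce a different number of 2-grams, so there is no obvious one-to-one way to map the synonym grams onto the source offsets.

```python
def ngrams_with_offsets(text, n=2):
    """Return (gram, start, end) triples for every n-gram of text."""
    return [(text[i:i + n], i, i + n) for i in range(len(text) - n + 1)]


source_grams = ngrams_with_offsets("ABC")    # grams of the source word
synonym_grams = ngrams_with_offsets("DEFG")  # grams of its synonym

print(source_grams)   # [('AB', 0, 2), ('BC', 1, 3)]
print(synonym_grams)  # [('DE', 0, 2), ('EF', 1, 3), ('FG', 2, 4)]
```

The source text "ABC" only covers offsets 0 to 3, yet the synonym contributes three grams, one of which would end at offset 4 if its own offsets were kept.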

NGramSynonymTokenizer can solve these problems by behaving as follows.

  • NGramSynonymTokenizer reads the synonym settings (synonyms.txt) and does not tokenize registered words, e.g.

| source text | NGramTokenizer+SynonymFilter | NGramSynonymTokenizer |
| --- | --- | --- |
| ABC | AB/DE/BC/EF/FG | ABC/DEFG |

  • Immediately before and after a registered word, NGramSynonymTokenizer generates extra tokens with posInc=0, e.g.

| source text | NGramTokenizer+SynonymFilter | NGramSynonymTokenizer |
| --- | --- | --- |
| XYZABC123 | XY/YZ/ZA/AB/DE/BC/EF/C1/FG/12/23 | XY/YZ/Z/ABC/DEFG/1/12/23 |

In the above sample, "Z" and "1" are the extra tokens.
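
The token sequences in the two examples above can be reproduced with a rough sketch. This is plain Python rather than a Lucene `Tokenizer`, and everything in it (the function name, the `kind` labels, longest-match lookup, the handling of runs shorter than N) is an assumption made for illustration, not the proposed implementation. Tokens labelled `extra` are the ones the description says are emitted with posInc=0.

```python
N = 2  # fixed gram size, as in the example above


def ngram_synonym_tokens(text, synonym_groups):
    """Return (term, kind) pairs; kind is 'gram', 'word', 'synonym' or 'extra'."""
    # Map each registered word to the other words in its group (expand=true).
    expansions = {}
    for group in synonym_groups:
        for word in group:
            expansions[word] = [w for w in group if w != word]
    # Assumption: prefer the longest registered word at a given position.
    words = sorted(expansions, key=len, reverse=True)

    tokens = []

    def flush(segment, preceded, followed):
        # Tokenize a run of unregistered text into N-grams plus edge tokens.
        if not segment:
            return
        if len(segment) < N:
            # Assumption: the description does not cover runs shorter than N.
            tokens.append((segment, "extra"))
            return
        if preceded:
            tokens.append((segment[0], "extra"))   # e.g. "1" in "123"
        for i in range(len(segment) - N + 1):
            tokens.append((segment[i:i + N], "gram"))
        if followed:
            tokens.append((segment[-1], "extra"))  # e.g. "Z" in "XYZ"

    i = 0
    start = 0          # start of the pending unregistered run
    preceded = False   # whether that run directly follows a registered word
    while i < len(text):
        match = next((w for w in words if text.startswith(w, i)), None)
        if match is None:
            i += 1
            continue
        flush(text[start:i], preceded, followed=True)
        tokens.append((match, "word"))
        tokens.extend((syn, "synonym") for syn in expansions[match])
        i += len(match)
        start, preceded = i, True
    flush(text[start:], preceded, followed=False)
    return tokens


groups = [["ABC", "DEFG"]]
print("/".join(t for t, _ in ngram_synonym_tokens("ABC", groups)))
# ABC/DEFG
print("/".join(t for t, _ in ngram_synonym_tokens("XYZABC123", groups)))
# XY/YZ/Z/ABC/DEFG/1/12/23
```

Offsets and position increments are deliberately left out of the sketch, since the description only states posInc=0 for the extra tokens.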


Migrated from LUCENE-5253 by Koji Sekiguchi (@kojisekig), resolved Oct 02 2013